The SPECIALIST Lexicon

Spelling Variant Patterns - MES (Metaphone, Edit Distance, Sorted Distance)

I. Introduction

After using the normalization, most terms (87.92% in Lexicon.2014) are identified and grouped as spelling variants. However, spelling variants such as:

  • anemia|anaemia
  • anemic|anaemic
  • yuppie|yuppy
  • yuppie flu|yuppy flu
  • zoril|zorilla|zorille|zorillo
  • lamellose|lamellous
  • ...
are not identified. The following algorithm is used to further spelling variants identifications.

II. Algorithm Details

For those terms are not identified in any spelling variant group, MES algorithm is used to checks three properties:

  • Same metaphone form
  • Limited Edit distance (1: 80.26%, 2:95.41%, 3:99.07%, 4:99.75%)
  • The smallest sorted distance:
    • All terms are sorted in the alphabetical order. The difference in indexes of two terms is the sorted distance.
    • The closer the sorted distance of two terms is, the more chance that these terms are spelling variants
    • A term should be only spVar to only 1 targetSpVar (with min. sorted distance)
  • Inputs: norm-form|spVar-1|spVar-2|...
  • Outputs: base-form|spVar-1|spVar-2|...
  • Steps:
    • Save input to HashMap<String, HashSet<String>>
      key-value pairs of (norm-form, HashSet of spVars)
    • group more spVars
      • Pre-process:
        Go through all input norm-form to get:
        • singleSpVarList: the list of single terms (without spVar)
          => It is used as source to add to target spVar group
          => sorted alphabetically
        • sortedSpVarList: the list of sorted spVars
          => It is used for sorted distance
          => sorted alphabetically
        • baseSpVarsMap: the map of (base, list of (sorted spVars))
          => It is the (unsorted) result
          => SpVars is sorted by lexCheck.CheckCont.BaseCompartor
          => base is the first element in the sorted SpVars
          Same comparator used in Lexicon
          1. pure ASCII first
          2. no punctuation first
          3. shortest first
          4. case sensitive alphabetic sort
        • spVarBaseMap: the map of (spVar, base)
          => This map serves as index for adding spVar to the right base
          => All spVars are supposed to be unique
        • mpSpVarsMap: map of (metaphone, set of (spVars))
          =>Used to find all spVars with same metaphone

      • Process:
        Go through all single terms in singleSpVarList to add to spVar class if:
        • Find sameMpSpVarList for spVars with same metaphone
          • Find metaphone of each singleSpVar
          • Find spVars with same methpahone, sameMpSpVarList, by using mpSpVarsMap
        • Find withinEdSpVarList for spVars with same metaphone and within specified edit distance
          • Go through each spVar in sameMpSpVarList from above
          • Check if within specified edit distance
            if so, add all spVar to withinEdSpVarList
        • Find the tragetSpVar with the min. sorted distance to add
          • Go through each spVar in withinEdSpVarList from above
          • Find the targetSpVar with min. sorted distance in withinEdSpVarList, using sortedSpVarList
            targetSpvar is the where the single spvar to add (spVars of single spVar)
        • Add singleSpVar to baseSpVarsMap at tragetSpVar
          • Get targetKey from spVarBase using targetSpVar
          • Get targetValues from spVarBase using singleSpVar
          • Get sourceKey from spVarBase using singleSpVar
          • Get sourceValues from spVarBase using singleSpVar
          • Add sourceValues to the values of baseSpVarsMap with key is targetKey
          • remove sourceKey from baseSpVarsMap
          • update spVarBaseMap by replacing new targetKey with srcValue
        • Final formating for result baseSpVarsMapMES by sorting baseSpVarsMap
          • Go through each baseSpVarsMap
          • Sort values (using LexCheck base comparator)
          • Assign base to the sorted[0]
          • Put to the result baseSpVarsMapMES
    • Print out baseSpVarsMapMES: base-form|spVar-1|spVar-2|...
      • sorted by key (base)

III. Algorithm Examples

Example termNormMetaphoneEdit Distance
Edit distance = 1
anemia|anaemiaanemia|anaemiaANM1
anemic|anaemicanemic|anaemicANMK1
abortigenic|abortogenicabortigenic|abortogenicABRTJNK1
lamictal|lamiktallamictal|lamiktalLMKTL1
Example termNormMetaphoneEdit Distance
Edit distance = 2
yuppie|yuppyyuppie|yuppyYP2
yuppie flu|yuppy fluyuppieflu|yuppyfluYPFL2
lamellose|lamellouslamellose|lamellousLMLS2
zoril|zorilla|zorille|zorillozoril|zorilla|zorille|zorilloSRL2 (1)

Example termNormMetaphoneEdit Distance
Edit distance = 3
Adson's maneuver|Adson's manoeuvreadsonmaneuver|adsonmanoeuvreATSNSMNFR3
amylcinnamal|amyl cinnamoylamylcinnamal|amylcinnamoylAMLSNML3
directress|directricedirectress|directriceTRKTRS3
tizoprolic|tizoproliquetizoprolic|tizoproliqueTSPRLK3
type 3 deiodinase|type III deiodinasetypethreedeiodinase|typeiiideiodinaseTPTTNS3

Example termNormMetaphoneEdit Distance
Edit distance = 4
Telugu|Teloogootelugu|teloogooTLK4
bromofenofos|bromophenophosbromofenofos|bromophenophosBRMFNFS4
comradery|camaraderiecomradery|camaraderieKMRTR4
litchi nut|lychee nutlitchinut|lycheenutLXNT4
fosfomycin|phosphomycinfosfomycin|phosphomycinFSFMSN4