The SPECIALIST Lexicon

Spelling Variant Patterns - MCES (Metaphone, Caverphone, Edit Distance, Sorted Distance)

I. Introduction

This enhanced MCES model is based on Double Metaphone (with maxCodeLength = 60), Caverphone 2.0, edit distance, and minimum sorted distance. They are described below:

II. Algorithm

This algorithm must be applied to a spVar-grouped file (such as the output of SpVarNorm). For terms that are not identified in any spelling variant group, the MCES algorithm is used to add them as spVars to an existing group by checking the following properties:

III. Algorithm Details

  • Inputs: norm-form|spVar-1|spVar-2|...
  • Outputs: base-form|spVar-1|spVar-2|...
  • Steps:
    • Save input to HashMap<String, HashSet<String>>
      key-value pairs of (norm-form, HashSet of spVars)
    • group more spVars
      • Pre-process:
        Go through all input norm-form to get:
        • singleSpVarList: the list of single terms (without spVar)
          => It is used as source to add to target spVar group
          => sorted alphabetically
        • sortedSpVarList: the list of sorted spVars
          => It is used for sorted distance
          => sorted alphabetically
        • baseSpVarsMap: the map of (base, list of (sorted spVars))
          => It is the (unsorted) result
          => SpVars are sorted by lexCheck.CheckCont.BaseCompartor
          => base is the first element in the sorted SpVars
          This is the same comparator used in the Lexicon:
          1. pure ASCII first
          2. no punctuation first
          3. shortest first
          4. case sensitive alphabetic sort
        • spVarBaseMap: the map of (spVar, base)
          => This map serves as index for adding spVar to the right base
          => All spVars are supposed to be unique
        • mpSpVarsMap: map of (metaphone, set of (spVars))
          => Used to find all spVars with the same metaphone

      • Process:
        Go through each single term (singleSpVar) in singleSpVarList and add it to a spVar class as follows:
        • Find sameMpSpVarList for spVars with same Metaphone
          • Find metaphone of each singleSpVar
          • Find spVars with the same metaphone (sameMpSpVarList) by using mpSpVarsMap
        • Find sameCpSpVarList for spVars with same Caverphone 2.0
        • Find sameCpSpVarList for spVars that are not GrecoLatin plural forms (with respect to singleSpVar)
        • Find withinEdSpVarList for spVars with same metaphone and within specified edit distance
          • Go through each spVar in sameMpSpVarList from above
          • Check if within specified edit distance
            if so, add the spVar to withinEdSpVarList
        • Find the targetSpVar with the min. sorted distance to add
          • Go through each spVar in withinEdSpVarList from above
          • Find the targetSpVar with min. sorted distance in withinEdSpVarList, using sortedSpVarList
            targetSpVar is the spVar whose group the single spVar is added to
        • Add singleSpVar to baseSpVarsMap at targetSpVar
          • Get targetKey from spVarBaseMap using targetSpVar
          • Get targetValues from baseSpVarsMap using targetKey
          • Get sourceKey from spVarBaseMap using singleSpVar
          • Get sourceValues from baseSpVarsMap using sourceKey
          • Add sourceValues to the values of baseSpVarsMap whose key is targetKey
          • Remove sourceKey from baseSpVarsMap
          • Update spVarBaseMap so that each spVar in sourceValues maps to targetKey
        • Final formatting for result baseSpVarsMapMES by sorting baseSpVarsMap
          • Go through each baseSpVarsMap
          • Sort values (using LexCheck base comparator)
          • Assign base to the sorted[0]
          • Put to the result baseSpVarsMapMES
    • Print out baseSpVarsMapMES: base-form|spVar-1|spVar-2|...
      • sorted by key (base)
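
The pre-process indexes listed above can be sketched roughly as follows. This is an illustrative sketch only, not the Lexicon's own code: the class name McesPreProcess, the helper compareBase (an approximation of the four ordering rules, not the actual lexCheck comparator), and the phoneticCoder parameter (a stand-in for Double Metaphone) are all assumptions. It also assumes every input line carries at least one spVar after the norm-form, and that the phonetic code is computed on the spVar itself.

```java
import java.util.*;
import java.util.function.Function;

// Hypothetical sketch of the pre-process indexes; not the Lexicon's own code.
class McesPreProcess {
    final Map<String, List<String>> baseSpVarsMap = new HashMap<>();
    final Map<String, String> spVarBaseMap = new HashMap<>();
    final Map<String, Set<String>> mpSpVarsMap = new HashMap<>();
    final List<String> singleSpVarList = new ArrayList<>();
    final List<String> sortedSpVarList = new ArrayList<>();

    static boolean isPureAscii(String s) {
        return s.chars().allMatch(c -> c < 128);
    }

    static boolean hasPunctuation(String s) {
        return s.chars().anyMatch(
            c -> !Character.isLetterOrDigit(c) && !Character.isWhitespace(c));
    }

    // Approximates the four ordering rules listed for the base comparator.
    static int compareBase(String a, String b) {
        int c = Boolean.compare(!isPureAscii(a), !isPureAscii(b));              // 1. pure ASCII first
        if (c == 0) c = Boolean.compare(hasPunctuation(a), hasPunctuation(b)); // 2. no punctuation first
        if (c == 0) c = Integer.compare(a.length(), b.length());               // 3. shortest first
        return c != 0 ? c : a.compareTo(b);                                    // 4. case-sensitive alphabetic
    }

    // lines: "norm-form|spVar-1|spVar-2|..."; phoneticCoder stands in for Double Metaphone.
    McesPreProcess(List<String> lines, Function<String, String> phoneticCoder) {
        for (String line : lines) {
            String[] fields = line.split("\\|");
            // fields[0] is the norm-form; the remaining fields are the spVars of the group.
            List<String> spVars =
                new ArrayList<>(Arrays.asList(fields).subList(1, fields.length));
            spVars.sort(McesPreProcess::compareBase);
            String base = spVars.get(0);
            baseSpVarsMap.put(base, spVars);
            if (spVars.size() == 1) singleSpVarList.add(base); // term without other spVars
            for (String spVar : spVars) {
                spVarBaseMap.put(spVar, base);
                sortedSpVarList.add(spVar);
                mpSpVarsMap.computeIfAbsent(phoneticCoder.apply(spVar), k -> new TreeSet<>())
                           .add(spVar);
            }
        }
        Collections.sort(singleSpVarList);
        Collections.sort(sortedSpVarList);
    }
}
```

A caller would pass the grouped lines together with a real phonetic encoder; any function from String to String can be plugged in for experimentation.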

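The process step (candidate filtering, target selection, merge, and final formatting) might look like the sketch below. Everything here is hypothetical: the class McesProcess, its method names, and the injected caverphone, isGlPlural, and editDistance functions are placeholders for the real encoders and checks. The sketch also assumes that "sorted distance" means the gap between two terms' positions in the alphabetically sorted sortedSpVarList, which the description above suggests but does not state outright.

```java
import java.util.*;
import java.util.function.BiPredicate;
import java.util.function.Function;
import java.util.function.ToIntBiFunction;

// Hypothetical sketch of the process step; names and signatures are illustrative.
class McesProcess {
    // Position of s in the alphabetically sorted list (insertion point if absent).
    static int position(List<String> sortedSpVarList, String s) {
        int i = Collections.binarySearch(sortedSpVarList, s);
        return i >= 0 ? i : -i - 1;
    }

    // Returns the targetSpVar for singleSpVar, or null when no candidate passes every check.
    static String findTarget(String singleSpVar, List<String> sameMpSpVarList,
            Function<String, String> caverphone, BiPredicate<String, String> isGlPlural,
            ToIntBiFunction<String, String> editDistance, int maxEditDistance,
            List<String> sortedSpVarList) {
        String target = null;
        int minSortedDistance = Integer.MAX_VALUE;
        int singlePos = position(sortedSpVarList, singleSpVar);
        for (String spVar : sameMpSpVarList) {
            if (spVar.equals(singleSpVar)) continue;
            if (!caverphone.apply(spVar).equals(caverphone.apply(singleSpVar))) continue;
            if (isGlPlural.test(spVar, singleSpVar)) continue;
            if (editDistance.applyAsInt(spVar, singleSpVar) > maxEditDistance) continue;
            // Assumption: sorted distance = gap between positions in sortedSpVarList.
            int sortedDistance = Math.abs(position(sortedSpVarList, spVar) - singlePos);
            if (sortedDistance < minSortedDistance) {
                minSortedDistance = sortedDistance;
                target = spVar;
            }
        }
        return target;
    }

    // Moves the group of singleSpVar into the group that contains targetSpVar.
    static void addToTarget(String singleSpVar, String targetSpVar,
            Map<String, List<String>> baseSpVarsMap, Map<String, String> spVarBaseMap) {
        String targetKey = spVarBaseMap.get(targetSpVar);
        String sourceKey = spVarBaseMap.get(singleSpVar);
        if (targetKey == null || sourceKey == null || targetKey.equals(sourceKey)) return;
        List<String> sourceValues = baseSpVarsMap.remove(sourceKey);
        baseSpVarsMap.get(targetKey).addAll(sourceValues);
        for (String spVar : sourceValues) spVarBaseMap.put(spVar, targetKey);
    }

    // Final formatting: re-sort each group and key it by its first element.
    static SortedMap<String, List<String>> finalizeGroups(
            Map<String, List<String>> baseSpVarsMap, Comparator<String> baseOrder) {
        SortedMap<String, List<String>> result = new TreeMap<>();
        for (List<String> group : baseSpVarsMap.values()) {
            List<String> sorted = new ArrayList<>(group);
            sorted.sort(baseOrder);
            result.put(sorted.get(0), sorted);
        }
        return result;
    }
}
```

Passing the checks in as functions keeps the sketch independent of any particular phonetic library. The lists stored in baseSpVarsMap must be mutable, since addToTarget appends to them in place.
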
IV. Algorithm Examples

Test Case | Terms | Norm | Metaphone | Caverphone | GrecoLatin | Edit Distance | Sorted Distance | SpVar EUI

Case 1: Same Metaphone and Caverphone, not GL plural, Edit distance = 1
1.1 | anemia, anaemia | anemia, anaemia | ANM, ANM | ANMA111111, ANMA111111 | false | 1 | | E0008920
1.2 | anemic, anaemic | anemic, anaemic | ANMK, ANMK | ANMK111111, ANMK111111 | false | 1 | | E0528325
1.3 | abortigenic, abortogenic | abortigenic, abortogenic | APRTJNK, APRTJNK | APTKNK1111, APTKNK1111 | false | 1 | | E0583447
1.4 | lamictal, lamiktal | lamictal, lamiktal | LMKTL, LMKTL | LMKTA11111, LMKTA11111 | false | 1 | | E0413046
1.5 | aestheticise, aestheticize | aestheticise, aestheticize | AS0TSS, AS0TSS | ASTTSS1111, ASTTSS1111 | false | 1 | | E0547192

Case 2: Same Metaphone and Caverphone, not GL plural, Edit distance = 2
2.1 | yuppie, yuppy | yuppie, yuppy | AP, AP | YPA1111111, YPA1111111 | false | 2 | | E0520693
2.2 | yuppie flu, yuppy flu | yuppieflu, yuppyflu | APFL, APFL | YPFLA11111, YPFLA11111 | false | 2 | | E0520692
2.3 | lamellose, lamellous | lamellose, lamellous | LMLS, LMLS | LMLS111111, LMLS111111 | false | 2 | | E0587907
2.4 | zoril, zorilla | zoril, zorilla | SRL, SRL | SRA1111111, SRLA111111 | false | 2 | | E0341649
2.5 | zorilla, zorille | zorilla, zorille | SRL, SRL | SRLA111111, SRA1111111 | false | 1 | | E0341649
2.6 | zorille, zorillo | zorille, zorillo | SRL, SRL | SRLA111111, SRA1111111 | false | 1 | | E0341649
2.7 | zorillo, zoril | zorillo, zoril | SRL, SRL | YPA1111111, SRA1111111 | false | 2 | | E0341649

Case 3: Same Metaphone and Caverphone, Not GL plural, Edit distance = 3
3.1 | Adson's maneuver, Adson's manoeuvre | adsonmaneuver, adsonmanoeuvre | ATSNSMNFR, ATSNSMNFR | ATSNSMNFA1, ATSNSMNFA1 | false | 3 | | E0213214
3.2 | amylcinnamal, amyl cinnamoyl | amylcinnamal, amylcinnamoyl | AMLSNML, AMLSNML | AMSNMA1111, AMSNMA1111 | false | 3 | | E0557025
3.3 | directress, directrice | directress, directrice | TRKTRS, TRKTRS | TRKTRS1111, TRKTRK1111 | false | 3 | | E0207379
3.4 | tizoprolic, tizoprolique | tizoprolic, tizoprolique | TSPRLK, TSPRLK | TSPRLK1111, TSPRLKA111 | false | 3 | | E0566262
3.5 | type 3 deiodinase, type III deiodinase | typethreedeiodinase, typeiiideiodinase | TPTTNS, TPTTNS | TPTTNS1111, TPTTNS1111 | false | 3 | | E0681935

Case 4: Same Metaphone and Caverphone, Not GL plural, Edit distance = 4
4.1 | Telugu, Teloogoo | telugu, teloogoo | TLK, TLK | TLKA111111, TLKA111111 | false | 4 | | E0205161
4.2 | bromofenofos, bromophenophos | bromofenofos, bromophenophos | PRMFNFS, PRMFNFS | PRMFNFS111, PRMFNFS111 | false | 4 | | E0303924
4.3 | comradery, camaraderie | comradery, camaraderie | KMRTR, KMRTR | KMRTRA1111, KMRTRA1111 | false | 4 | | E0333034
4.4 | litchi nut, lychee nut | litchinut, lycheenut | LXNT, LXNT | LKNT111111, LKNT111111 | false | 4 | | E0456918
4.5 | fosfomycin, phosphomycin | fosfomycin, phosphomycin | FSFMSN, FSFMSN | FSFMSN1111, FSFMSN1111 | false | 4 | | E0028649

Case 5: Same Metaphone, different Caverphone, not GL plural, Edit distance = 2
5.1 | aesthetical, aesthetically | aesthetical, aesthetically | AS0TKL, AS0TKL | ASTTKA1111, ASTTKLA111 | false | 2 | | false
5.2 | zymographical, zymographically | zymographical, zymographically | SMKRFKL, SMKRFKL | SMKRFKA111, SMKRFKLA11 | false | 2 | | false

Case 6: Same Metaphone, Caverphone, Edit distance = 2, GrecoLatin Plural
6.1 | acroscleroses, acrosclerosis | acroscleroses, acrosclerosis | AKRSKRSS, AKRSKRSS | AKRSKLRSS1, AKRSKLRSS1 | true | 1 | | false
6.2 | zygomycoses, zygomycosis | zygomycoses, zygomycosis | SKMKSS, SKMKSS | SKMKSS1111, SKMKSS1111 | true | 1 | | false
6.3 | ammon's horn scleroses, ammon's horn sclerosis | ammonhornscleroses, ammonhornsclerosis | AMNSRNSKRSS, AMNSRNSKRSS | AMNSNSKLRS, AMNSNSKLRS | true | 1 | | false
6.4 | fimbria, fimbriae | fimbria, fimbriae | FMPR, FMPR | FMPRA11111, FMPRA11111 | true | 1 | | false
6.5 | infraorbital foramen, infraorbital foramina | infraorbitalforamen, infraorbitalforamina | ANFRRPTLFRMN, ANFRRPTLFRMN | ANFRPTFRMN, ANFRPTFRMN | true | 2 | | false
6.6 | bacterial culture media, bacterial culture medium | bacterialculturemedia, bacterialculturemedium | PKTRLKLTRMT, PKTRLKLTRMTM | PKTRKTRMTA, PKTRKTRMTM | true | 2 | | false

Case 7: Same Metaphone, Caverphone, Edit distance < 2, not GrecoLatin Plural (false positive)
7.1 | zixoryn, zixorin | zixoryn, zixorin | SKSRN, SKSRN | SKRN111111, SKRN111111 | false | 1 | | false
7.2 | zygomycetes, zygomycetous | zygomycetes, zygomycetous | SKMSTS, SKMSTS | SKMSTS1111, SKMSTS1111 | false | 2 | | false
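
The Edit Distance values in the examples above can be reproduced with a plain Levenshtein distance (insertions, deletions, and substitutions, each costing 1) computed on the original terms rather than the normalized forms. The document does not name the exact edit-distance variant used, so treating it as Levenshtein is an assumption; the class name EditDistanceSketch is likewise hypothetical.

```java
// Hypothetical sketch: classic Levenshtein distance with two rolling rows.
class EditDistanceSketch {
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                            prev[j] + 1),      // deletion
                                   prev[j - 1] + cost);        // substitution or match
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // Term pairs taken from test cases 1.1, 2.1, 3.1, and 4.1.
        String[][] pairs = {
            {"anemia", "anaemia"},
            {"yuppie", "yuppy"},
            {"Adson's maneuver", "Adson's manoeuvre"},
            {"Telugu", "Teloogoo"}
        };
        for (String[] p : pairs) {
            System.out.println(p[0] + " | " + p[1] + " | " + editDistance(p[0], p[1]));
        }
    }
}
```

For these four pairs the sketch yields distances of 1, 2, 3, and 4, matching the table. A variant that counts transpositions as a single edit would give 2 for test case 3.1, so plain Levenshtein is the reading consistent with the listed values.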