LexBuild - Close Match - CSpell
Requirements:
When new records are added to the Lexicon, linguists check each new word against the existing lexRecords to locate similar words by:
- Exact match: is there an identical entry that already exists?
- Close match: is there an existing spVar record with which the word being searched for can be grouped?
This feature provides all close match terms using CSpell.
This type of close match is meant to find similar words rather than to fix spelling errors. Accordingly, the ranking system differs somewhat from the one in CSpell.
- LexiconLb: includes all approved base/inflVars of LEXICON
- LexiconLbTemp: includes all submitted (unapproved) base/inflVars of LEXICON
Applying CSpell algorithm:
Process
- Find the close match
- Use Reverse Church's Edit Distance to find all possible spellings within an edit distance of 2.
- Filter these possible spellings through inflVars to retrieve the candidates that exist in the Lexicon.
- Use the orthographic score in CSpell to rank the candidates and retrieve the top 15 candidates as close-matched words. The orthographic score is the score without context information. It is defined as:
Orthographic score = wf1 * Token similarity score + wf2 * Phonetic similarity score + wf3 * Overlap similarity score
where the weighting factors are set (through our analysis and tests) as:
wf1 = 0.70
wf2 = 0.80
wf3 = 1.00
The phonetic score is given more weight because it is used to find spVars that have the same pronunciation but different spellings.
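The find, filter, and rank steps above can be sketched as follows. This is only a minimal illustration under stated assumptions: candidate spellings are produced by a generic generate-all-edits routine (a stand-in, not necessarily how Reverse Church's Edit Distance is implemented in CSpell), the lexicon is a plain set of strings, and the three similarity scores are supplied by a caller-provided `similarity` function because their definitions are not given here. All function names are hypothetical.

```python
import string

# Weighting factors quoted from the text above.
WF_TOKEN, WF_PHONETIC, WF_OVERLAP = 0.70, 0.80, 1.00

ALPHABET = string.ascii_lowercase  # assumption: lower-case ASCII edits only


def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def candidates_within_2(word, lexicon):
    """Spellings within 2 edits of word that exist in the lexicon."""
    one = edits1(word)
    two = {e2 for e1 in one for e2 in edits1(e1)}
    # The word itself is dropped: exact matches are handled separately.
    return ((one | two) & lexicon) - {word}


def orthographic_score(token_sim, phonetic_sim, overlap_sim):
    """Weighted sum of the three similarity scores (no context score)."""
    return (WF_TOKEN * token_sim
            + WF_PHONETIC * phonetic_sim
            + WF_OVERLAP * overlap_sim)


def rank(word, cands, similarity, top_n=15):
    """Rank candidates; similarity(word, cand) returns the three scores."""
    scored = [(orthographic_score(*similarity(word, c)), c) for c in cands]
    # Highest score first; ties are broken alphabetically for stability.
    scored.sort(key=lambda sc: (-sc[0], sc[1]))
    return [c for _, c in scored[:top_n]]
```

With a lexicon of {"tumor", "tumour", "tumors", "humor", "table"}, the candidates for "tumor" would be "tumour", "tumors", and "humor"; "table" is more than 2 edits away and is dropped before ranking.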
- Enhanced performance with spVar analysis
SpVars differ only in certain patterns. These patterns are gathered and implemented in the algorithm to increase the performance (recall) of close match. The distance is set to 1 for all of these patterns. The patterns are described as follows:
AddSpVarOfAbbAcr()
: spVars of upper-case ABB/ACR often differ in periods.
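A rough sketch of how period variants of an upper-case abbreviation or acronym might be generated is shown below. The function name and the sample term "USA" are illustrative assumptions; the actual behavior of AddSpVarOfAbbAcr() is not described beyond the sentence above.

```python
def abb_acr_period_variants(term):
    """Return period variants of an upper-case abbreviation/acronym.

    Hypothetical helper: only the fully dotted and the undotted forms
    are generated, which is a simplifying assumption.
    """
    letters = term.replace(".", "")
    # Only handle terms made purely of upper-case letters (plus periods).
    if not (letters.isalpha() and letters.isupper()):
        return set()
    variants = {
        letters,                   # undotted form, e.g. USA
        ".".join(letters) + ".",   # dotted form, e.g. U.S.A.
    }
    return variants - {term}
```

Each generated variant would then be looked up in the Lexicon and, if found, added to the close-match list with a distance of 1.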
AddSpVarOfSpaceHyphen()
: spVars often differ in spaces and hyphens:
- AF B1|AF-B1|AFB1
- Ara CTP|Ara-CTP|AraCTP
- Asian-American|Asian American
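The space and hyphen pattern can be illustrated with a small sketch. It assumes that every separator in a term is swapped for the same alternative (space, hyphen, or nothing), which is a simplification, and both function names are hypothetical rather than taken from the actual implementation.

```python
import re


def space_hyphen_variants(term):
    """Variants of term with all separators as space, hyphen, or removed."""
    parts = [p for p in re.split(r"[ -]+", term) if p]
    if len(parts) < 2:
        # A joined form such as "AFB1" has no visible boundary to split on.
        return set()
    variants = {sep.join(parts) for sep in (" ", "-", "")}
    return variants - {term}


def separator_key(term):
    """Key that is identical for terms differing only in spaces/hyphens."""
    return re.sub(r"[ -]", "", term)
```

For "AF B1" the first function yields "AF-B1" and "AFB1". Because a joined form cannot be split without knowing the word boundary, comparing `separator_key` values is one way to match "AFB1" back to "AF B1" and "AF-B1".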
AddByReplaceMapToCmStrList()
: spVars often differ in sub-words that are spVars, ranks, or numbers:
- spVar: antibody-labelled|antibody-labeled, Klatskin's tumour|Klatskin's tumor
- rank: 1st cervical vertebra|first cervical vertebra, 11th thoracic vertebra|eleventh thoracic vertebra
- number: two year survival rate|2 year survival rate, seven transmembrane domain receptor|7 transmembrane domain receptor
- spVars
- Rank & number: spVars often have different ways to represent ranks and numbers:
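A minimal sketch of the replace-map idea follows. The map entries are taken only from the examples listed above; the real replace map used by AddByReplaceMapToCmStrList() is not shown here, so the table, the function name, and the one-sub-word-at-a-time strategy are all assumptions.

```python
import re

# Illustrative entries only, drawn from the examples in the text.
REPLACE_MAP = {
    "labelled": "labeled",  # spVar
    "tumour": "tumor",      # spVar
    "1st": "first",         # rank
    "11th": "eleventh",     # rank
    "2": "two",             # number
    "7": "seven",           # number
}
# Make the map usable in both directions.
PAIRS = {**REPLACE_MAP, **{v: k for k, v in REPLACE_MAP.items()}}


def replace_map_variants(term):
    """Variants of term with exactly one sub-word swapped via the map."""
    # The capturing group keeps the separators so the term can be rebuilt.
    tokens = re.split(r"([ -])", term)
    variants = set()
    for i, tok in enumerate(tokens):
        if tok in PAIRS:
            variants.add("".join(tokens[:i] + [PAIRS[tok]] + tokens[i + 1:]))
    return variants
```

For example, "1st cervical vertebra" yields "first cervical vertebra", and "antibody-labelled" yields "antibody-labeled". Each variant found in the Lexicon would be treated as a close match with a distance of 1.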
- Implementation
- Find the exact match (by term)
- Find close match
- Display matches
Post-Process
- Reload InflVars every day after midnight? (TBD)