CSpell

Candidate Generators

I. Introduction

Tokens that are identified as spelling errors from detectors will be corrected. The first step is to find correct spelling candidates, then use ranking system to find the best match (highest rank) from the candidate list. The Church's reverse minimum edit distance technique was used to generate non-word spelling and split candidates to avoid expensive edit distance computations between a misspelled word and all words (~0.6 million) in the dictionary. In general, two steps are implemented to find candidates for 1-to-1 and split correction:

  • Find the possible candidate from edit distance, phonetic similarity, etc.
    • Edit Distance: over 91% of correction are within 2 edit distance in our training set (baseline - 472 files, 468 instances of dictionary based corrections):

      Edit DistanceInstancePercentageAccu. Percentage
      131767.74%67.74%
      211023.50%91.24%
      3245.13%96.37%
      481.71%98.08%
      561.28%99.36%
      620.43%99.79%
      710.21%100.00%
      Total468100.00%100.00%
    • Use edit distance 2 as initial specification to cover over 91% of errors.
    • Phonetic and adjacent keys are planned to be investigated and implemented (if needed) for instances with edit distance between 2 and 3 to reach higher recall.
    • Confusion matrix with frequency of letter substitutions could increase the recall (future work).
  • Filter by suggestion dictionary
    • A domain specific (customer's) dictionary should be derive for best performance (precision and recall). In general, this dictionary can be the same (or smaller than) checking dictionary. Specific filters (such as multiwords, abbreviations, acronyms, spelling variants, Semantic types, etc.) might be used to derive this dictionary.

II. Algorithm