Because of a lapse in government funding, the information on this website may not be up to date, transactions submitted via the website may not be processed, and the agency may not be able to respond to inquiries until appropriations are enacted. The NIH Clinical Center (the research hospital of NIH) is open. For more details about its operating status, please visit cc.nih.gov. Updates regarding government operating status and resumption of normal operations can be found at OPM.gov

CSpell

Candidate Generators

I. Introduction

Tokens that are identified as spelling errors from detectors will be corrected. The first step is to find correct spelling candidates, then use ranking system to find the best match (highest rank) from the candidate list. The Church's reverse minimum edit distance technique was used to generate non-word spelling and split candidates to avoid expensive edit distance computations between a misspelled word and all words (~0.6 million) in the dictionary. In general, two steps are implemented to find candidates for 1-to-1 and split correction:

  • Find the possible candidate from edit distance, phonetic similarity, etc.
    • Edit Distance: over 91% of correction are within 2 edit distance in our training set (baseline - 472 files, 468 instances of dictionary based corrections):

      Edit DistanceInstancePercentageAccu. Percentage
      131767.74%67.74%
      211023.50%91.24%
      3245.13%96.37%
      481.71%98.08%
      561.28%99.36%
      620.43%99.79%
      710.21%100.00%
      Total468100.00%100.00%
    • Use edit distance 2 as initial specification to cover over 91% of errors.
    • Phonetic and adjacent keys are planned to be investigated and implemented (if needed) for instances with edit distance between 2 and 3 to reach higher recall.
    • Confusion matrix with frequency of letter substitutions could increase the recall (future work).
  • Filter by suggestion dictionary
    • A domain specific (customer's) dictionary should be derive for best performance (precision and recall). In general, this dictionary can be the same (or smaller than) checking dictionary. Specific filters (such as multiwords, abbreviations, acronyms, spelling variants, Semantic types, etc.) might be used to derive this dictionary.

II. Algorithm