Because of a lapse in government funding, the information on this website may not be up to date, transactions submitted via the website may not be processed, and the agency may not be able to respond to inquiries until appropriations are enacted. The NIH Clinical Center (the research hospital of NIH) is open. For more details about its operating status, please visit cc.nih.gov. Updates regarding government operating status and resumption of normal operations can be found at OPM.gov

CSpell

Spelling (1-to-1) Candidates

I. Introduction

A token is a single word (term without spaces). If a token is corrected to a single word for spelling is called 1-to-1 spelling correction. If the token is a spelling error, it is a non-word spelling error correction. Otherwise, it is a real-word correction. This page focus on the non-word correction.

II. Algorithm

  • Get all possible edit combinations from the spelling error words within 1 edit distance by:
    • Deletion: abc -> {bc, ac, ab}
    • Insertion: abc -> {aabc, babc, ..., zabc, aabc, abbc, ..., abcz}
      => Do not insert space (split has different suggestion dictionary filter)
    • Substitution: bc -> {abc, bbc, ..., zbc, aac, abc, ..., abz}
    • Transposition (Swap): abc -> {bac, acb}

    • Sum all above combinations
  • Get all possible combinations from the spelling error words within 2 edit distance by:
    • Apply the same algorithm on the result of above (edit distance 1)
    • Sum all combinations
  • Check all combinations within 2 edit-distance if there are dic.IsDicWord( )
    • If yes, they are suggested candidates

III. Test Examples

Error tokenCorrected wordEdit DistanceNotes
non-word 1-to-1 correction
neefneed1
thiertheir1
shudshould2Metaphone2: [XT]|[XLT]
cortcaught4Metaphone2: [KRT]|[KFT]|1
real-word 1-to-1 correction
theretheir2Metaphone2: [0R]|[OR]

IV. TBD

  • Algorithm:
    • Substitution might need to add phonetic in, such as "f" to "ph", "k" -> "ch",
    • Substitution might need to consider the adjacent keys for edit distance of three. For example, "f" -> {t, r, e, d, c, v, b, g}.