CSpell

Spelling (1-to-1) Candidates

I. Introduction

A token is a single word (term without spaces). If a token is corrected to a single word for spelling is called 1-to-1 spelling correction. If the token is a spelling error, it is a non-word spelling error correction. Otherwise, it is a real-word correction. This page focus on the non-word correction.

II. Algorithm

  • Get all possible edit combinations from the spelling error words within 1 edit distance by:
    • Deletion: abc -> {bc, ac, ab}
    • Insertion: abc -> {aabc, babc, ..., zabc, aabc, abbc, ..., abcz}
      => Do not insert space (split has different suggestion dictionary filter)
    • Substitution: bc -> {abc, bbc, ..., zbc, aac, abc, ..., abz}
    • Transposition (Swap): abc -> {bac, acb}

    • Sum all above combinations
  • Get all possible combinations from the spelling error words within 2 edit distance by:
    • Apply the same algorithm on the result of above (edit distance 1)
    • Sum all combinations
  • Check all combinations within 2 edit-distance if there are dic.IsDicWord( )
    • If yes, they are suggested candidates

III. Test Examples

Error tokenCorrected wordEdit DistanceNotes
non-word 1-to-1 correction
neefneed1
thiertheir1
shudshould2Metaphone2: [XT]|[XLT]
cortcaught4Metaphone2: [KRT]|[KFT]|1
real-word 1-to-1 correction
theretheir2Metaphone2: [0R]|[OR]

IV. TBD

  • Algorithm:
    • Substitution might need to add phonetic in, such as "f" to "ph", "k" -> "ch",
    • Substitution might need to consider the adjacent keys for edit distance of three. For example, "f" -> {t, r, e, d, c, v, b, g}.