Spelling (1-to-1) Candidates
I. Introduction
A token is a single word (term without spaces). If a token is corrected to a single word for spelling is called 1-to-1 spelling correction. If the token is a spelling error, it is a non-word spelling error correction. Otherwise, it is a real-word correction. This page focus on the non-word correction.
II. Algorithm
- Get all possible edit combinations from the spelling error words within 1 edit distance by:
- Deletion: abc -> {bc, ac, ab}
- Insertion: abc -> {aabc, babc, ..., zabc, aabc, abbc, ..., abcz}
=> Do not insert space (split has different suggestion dictionary filter)
- Substitution: bc -> {abc, bbc, ..., zbc, aac, abc, ..., abz}
- Transposition (Swap): abc -> {bac, acb}
- Sum all above combinations
- Get all possible combinations from the spelling error words within 2 edit distance by:
- Apply the same algorithm on the result of above (edit distance 1)
- Sum all combinations
- Check all combinations within 2 edit-distance if there are dic.IsDicWord( )
- If yes, they are suggested candidates
III. Test Examples
Error token | Corrected word | Edit Distance | Notes
|
---|
non-word 1-to-1 correction
|
neef | need | 1 |
|
thier | their | 1 |
|
shud | should | 2 | Metaphone2: [XT]|[XLT]
|
cort | caught | 4 | Metaphone2: [KRT]|[KFT]|1
|
real-word 1-to-1 correction
|
there | their | 2 | Metaphone2: [0R]|[OR]
|
IV. TBD
- Algorithm:
- Substitution might need to add phonetic in, such as "f" to "ph", "k" -> "ch",
- Substitution might need to consider the adjacent keys for edit distance of three. For example, "f" -> {t, r, e, d, c, v, b, g}.