Dictionary-based Correction - Design
This page describes the high-level design for dictionary-based correction of spelling errors.
- Detect a legit token:
  - if legit: go to the detection process described below.
  - if not legit: go to the next token (no change to the current token).
  - space token: no need to correct a space.
  - token is too long: when a token is too long (has too many characters) and is not in the dictionary (a non-word), it can generate too many correction candidates. This can slow performance severely and even exhaust memory, and there is little chance of correctly correcting such a token. Thus, the maximum length of a legit token (CS_MAX_LEGIT_TOKEN_LENGTH) is specified in the configuration file; the default value is 30. Accordingly, any token longer than the specified value is not a legit token and will not be corrected.
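The legit-token check above can be sketched as follows; this is a minimal illustration, and the class and method names (`LegitTokenCheck`, `isLegitToken`) are assumptions for the example, not CSpell's actual API:

```java
// Sketch of the legit-token check: skip space tokens and tokens that
// exceed the configured maximum length.
public class LegitTokenCheck {
    // Default maximum length of a legit token, per the configuration above.
    static final int CS_MAX_LEGIT_TOKEN_LENGTH = 30;

    static boolean isLegitToken(String token) {
        if (token.isBlank()) return false;                // space token: skip
        if (token.length() > CS_MAX_LEGIT_TOKEN_LENGTH)   // too long: skip
            return false;
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isLegitToken("speling"));       // true
        System.out.println(isLegitToken(" "));             // false
        System.out.println(isLegitToken("a".repeat(31)));  // false
    }
}
```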
- Tokenize: get the coreTerm from the input token (inToken).
- Detection: detect whether the focused token (coreTerm) matches the correction criteria.
  - if yes: go to the correction process
  - if no: no correction
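The tokenize and detection steps might look like the following sketch; the helper names (`getCoreTerm`, `needsCorrection`) are illustrative assumptions, not CSpell's actual API:

```java
import java.util.Set;

public class Detection {
    // Strip leading/trailing punctuation from inToken to obtain the coreTerm.
    static String getCoreTerm(String inToken) {
        return inToken.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
    }

    // For non-word correction, a coreTerm matches the correction criteria
    // when it is absent from the dictionary.
    static boolean needsCorrection(String coreTerm, Set<String> dictionary) {
        return !coreTerm.isEmpty() && !dictionary.contains(coreTerm.toLowerCase());
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("spelling", "error");
        String coreTerm = getCoreTerm("speling,");
        System.out.println(coreTerm);                        // speling
        System.out.println(needsCorrection(coreTerm, dict)); // true
    }
}
```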
- Correction:
  - Candidate Generator:
    - Retrieve all possible candidates (Church's reverse minimum edit distance technique: enumerate all strings within the edit distance and keep those in the dictionary)
    - Generate validated candidates:
      - one-to-one candidates
      - split candidates
      - merge candidates
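The reverse-minimum-edit-distance idea for one-to-one candidates can be sketched as below: enumerate every string at edit distance 1 from the non-word (insertions, deletions, substitutions, transpositions) and keep only those found in the dictionary. This is a simplified assumption-laden illustration (edit distance 1 only, lowercase ASCII letters only), not CSpell's implementation:

```java
import java.util.*;

public class CandidateGenerator {
    static final String LETTERS = "abcdefghijklmnopqrstuvwxyz";

    // All strings at edit distance 1 from w.
    static Set<String> editDistance1(String w) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i <= w.length(); i++) {
            // insertions
            for (char c : LETTERS.toCharArray())
                out.add(w.substring(0, i) + c + w.substring(i));
            if (i == w.length()) continue;
            // deletions
            out.add(w.substring(0, i) + w.substring(i + 1));
            // substitutions
            for (char c : LETTERS.toCharArray())
                out.add(w.substring(0, i) + c + w.substring(i + 1));
            // transpositions
            if (i + 1 < w.length())
                out.add(w.substring(0, i) + w.charAt(i + 1) + w.charAt(i)
                        + w.substring(i + 2));
        }
        return out;
    }

    // Keep only candidates that appear in the dictionary.
    static List<String> candidates(String nonWord, Set<String> dictionary) {
        List<String> cands = new ArrayList<>();
        for (String s : editDistance1(nonWord))
            if (dictionary.contains(s)) cands.add(s);
        return cands;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("spelling", "spilling", "selling");
        System.out.println(candidates("speling", dict)); // [spelling]
    }
}
```

Enumerating from the error outward ("reverse") and filtering by the dictionary avoids scanning the whole dictionary for every non-word; it also makes clear why very long tokens are excluded, since the candidate space grows with token length.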
  - Ranker:
    - Find the highest-ranked candidate by score, using one of the following scoring systems:
      - Orthographic
      - Frequency
      - Context
      - Noisy Channel
      - Ensemble method
      - CSpell 2-stage
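As one concrete example of a scoring system, a frequency ranker simply returns the candidate with the highest corpus frequency. The method name and frequency values below are assumptions for illustration only:

```java
import java.util.*;

public class FrequencyRanker {
    // Return the candidate with the highest frequency (empty if no candidates).
    static Optional<String> topCandidate(List<String> candidates,
                                         Map<String, Long> freq) {
        return candidates.stream()
                .max(Comparator.comparingLong(
                        (String c) -> freq.getOrDefault(c, 0L)));
    }

    public static void main(String[] args) {
        Map<String, Long> freq = Map.of("spelling", 12000L, "spilling", 800L);
        List<String> cands = List.of("spilling", "spelling");
        System.out.println(topCandidate(cands, freq).orElse("(none)")); // spelling
    }
}
```

The other scorers plug into the same interface: each assigns a number to every candidate, and the ranker (or an ensemble of rankers) picks the maximum.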
  - Corrector:
    - Update coreTerm
    - If coreTerm changes (i.e., there is a correction):
      - Update outToken
      - Update operation information
      - Update context (the whole inToken list) for the merge case
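The corrector step for the simple one-to-one case could be sketched as follows; the `Token` record, operation labels, and method names are hypothetical, and the merge case (which rewrites the whole inToken list) is omitted for brevity:

```java
public class Corrector {
    // Minimal token record: input text, output text, and the operation applied.
    record Token(String in, String out, String operation) {}

    static Token correct(String inToken, String coreTerm, String topCandidate) {
        if (topCandidate.equals(coreTerm)) {
            // No change: output mirrors the input.
            return new Token(inToken, inToken, "NONE");
        }
        // Splice the corrected coreTerm back into the original token,
        // preserving surrounding punctuation.
        String outToken = inToken.replace(coreTerm, topCandidate);
        return new Token(inToken, outToken, "ONE_TO_ONE");
    }

    public static void main(String[] args) {
        Token t = correct("speling,", "speling", "spelling");
        System.out.println(t.out() + " [" + t.operation() + "]");
        // spelling, [ONE_TO_ONE]
    }
}
```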
- Flow Chart: