CSpell

Dictionary-based Correction - Design

This page describes the high-level design for dictionary-based correction of spelling errors.

Detect a legit token:
- if legit: go to the detection process described below.
- if not legit: go to the next token (the current token is left unchanged). A token is not legit in two cases:
  - space token: there is no need to correct a space.
  - token is too long: when a token has too many characters and is not in the dictionary (a non-word), it can generate too many correction candidates, which slows processing and can even exhaust memory. There is also little chance of correcting such a token correctly. The maximum length of a legit token (CS_MAX_LEGIT_TOKEN_LENGTH) is therefore specified in the configuration file, with a default value of 30. Any token longer than the specified value is not a legit token and is not corrected.
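
The legitimacy check above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not CSpell's actual implementation: the function name `is_legit_token` is hypothetical, and only the `CS_MAX_LEGIT_TOKEN_LENGTH` name and its default of 30 come from the description above. Treating an empty token as not legit is also an assumption.

```python
# Default taken from the text; in practice it would be read from the configuration file.
CS_MAX_LEGIT_TOKEN_LENGTH = 30


def is_legit_token(token: str, max_length: int = CS_MAX_LEGIT_TOKEN_LENGTH) -> bool:
    """Return True if the token should be passed on to tokenization and detection."""
    # Space tokens never need correction (empty tokens are skipped too: an assumption).
    if not token or token.isspace():
        return False
    # Tokens longer than the limit would generate too many candidates.
    if len(token) > max_length:
        return False
    return True
```

A token of exactly `max_length` characters is still legit here, because the text only excludes tokens longer than the specified value.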
Tokenize: get the coreTerm from the input word (inToken)
Detection: detect whether the focused token (coreTerm) matches the correction criteria:
- if yes: go to the correction process
- if no: no correction
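
The tokenize and detection steps might look like the sketch below. Both parts rest on assumptions that this page does not state: the coreTerm is taken to be the token with leading and trailing punctuation stripped, and the correction criterion is taken to be "contains a letter and is not in the dictionary" (a non-word). The names `get_core_term`, `needs_correction`, and `dictionary` are hypothetical.

```python
import string
from typing import Set, Tuple


def get_core_term(in_token: str) -> Tuple[str, str, str]:
    """Split a token into (prefix, core_term, suffix) by stripping edge punctuation."""
    left = len(in_token) - len(in_token.lstrip(string.punctuation))
    right = len(in_token.rstrip(string.punctuation))
    if left >= right:
        # The token is empty or consists only of punctuation: there is no core term.
        return in_token, "", ""
    return in_token[:left], in_token[left:right], in_token[right:]


def needs_correction(core_term: str, dictionary: Set[str]) -> bool:
    """Assumed criterion: the core term contains a letter and is not a dictionary word."""
    if not core_term or not any(ch.isalpha() for ch in core_term):
        return False
    return core_term.lower() not in dictionary
```

Keeping the prefix and suffix separate lets a later step rebuild the output token around a corrected core term without disturbing the original punctuation.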
Correction:
- Candidate Generator:
  - Retrieve all possible candidates (Church's reverse minimum edit distance technique: all strings within the allowed edit distance that are also in the dictionary)
  - Generate validated candidates
    - One-to-one candidates
    - Split candidates
    - Merge candidates
- Ranker
  - Find the highest-ranked candidate using different scoring systems
    - Orthographic
    - Frequency
    - Context
    - Noisy Channel
    - Ensemble method
    - CSpell 2-stage
- Corrector
  - Update coreTerm
  - If coreTerm changes (there is a correction)
    - Update outToken
    - Update operation information
  - Update context (the whole inToken list) for the merge case
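
The correction process above can be illustrated with a deliberately simplified sketch. It generates candidates at edit distance 1 only, keeps those found in a dictionary, and ranks them by frequency alone; the other scoring systems listed above are not shown. All names (`edits1`, `one_to_one_candidates`, `split_candidates`, `merge_candidate`, `frequency_score`, `correct`) and the `word_freq` mapping are hypothetical, and scoring a split candidate by its least frequent part is an assumption made for illustration.

```python
from typing import Dict, Optional, Set

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word: str) -> Set[str]:
    """All strings one edit away: delete, transpose, replace, or insert."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def one_to_one_candidates(word: str, word_freq: Dict[str, int]) -> Set[str]:
    """Keep only the edit-distance-1 strings that are dictionary words."""
    return {c for c in edits1(word) if c in word_freq and c != word}


def split_candidates(word: str, word_freq: Dict[str, int]) -> Set[str]:
    """Split the word in two wherever both halves are dictionary words."""
    return {
        f"{word[:i]} {word[i:]}"
        for i in range(1, len(word))
        if word[:i] in word_freq and word[i:] in word_freq
    }


def merge_candidate(word: str, next_word: str, word_freq: Dict[str, int]) -> Optional[str]:
    """Merge with the following token if the result is a dictionary word."""
    merged = word + next_word
    return merged if merged in word_freq else None


def frequency_score(candidate: str, word_freq: Dict[str, int]) -> int:
    """Score by frequency; a split candidate gets its least frequent part's score."""
    return min(word_freq.get(part, 0) for part in candidate.split(" "))


def correct(core_term: str, word_freq: Dict[str, int]) -> str:
    """Return the top-ranked candidate, or the core term unchanged if there is none."""
    candidates = one_to_one_candidates(core_term, word_freq) | split_candidates(core_term, word_freq)
    if not candidates:
        return core_term
    # Sorting first makes tie-breaking deterministic.
    return max(sorted(candidates), key=lambda c: frequency_score(c, word_freq))
```

Merging is kept out of `correct` because it consumes a neighbouring token, which is why the design updates the whole token list in that case rather than only the current token.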
Flow Chart: