CSpell

Dictionary-based Correction - Design

This page describes the high-level design for dictionary-based correction of spelling errors.

  • Detect a legit token:
    • if legit: go to the detection process described below.
    • if not legit: go to the next token (no change to the current token)
      • space token: no need to correct a space.
      • token is too long: when a token is too long (has too many characters) and is not in the dictionary (a non-word), it can generate too many correction candidates. This slows performance and can even exhaust memory, so the maximum length of a legit token (CS_MAX_LEGIT_TOKEN_LENGTH) should be specified in the configuration file; the default value is 30. There is also little chance of correctly correcting such tokens. Accordingly, any token longer than the specified value is not a legit token and will not be corrected.
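The legit-token screen above can be sketched as follows. CS_MAX_LEGIT_TOKEN_LENGTH and its default of 30 come from the text; the function name and the exact checks are illustrative assumptions, not CSpell's actual implementation:

```python
# Sketch of the legit-token screen described above.
CS_MAX_LEGIT_TOKEN_LENGTH = 30  # configurable in the configuration file; default 30

def is_legit_token(token: str, max_len: int = CS_MAX_LEGIT_TOKEN_LENGTH) -> bool:
    """Return True if the token should proceed to the detection process."""
    if token == "" or token.isspace():
        return False  # space token: no need to correct a space
    if len(token) > max_len:
        return False  # too long: too many candidates, little chance of a correct fix
    return True
```

Tokens that fail this screen are passed through unchanged, and the pipeline moves on to the next token.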

  • Tokenize: get the coreTerm from the input token (inToken)
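One common way to obtain a core term is to strip leading and trailing punctuation from the input token; the sketch below assumes that behavior. The function name, return shape, and stripping rules are hypothetical, and CSpell's actual CoreTerm logic may differ:

```python
# Hypothetical coreTerm extraction: strip leading/trailing punctuation
# from the input token (inToken) and remember what was removed so the
# corrected coreTerm can later be re-assembled into outToken.
import string

def get_core_term(in_token: str) -> tuple:
    """Split inToken into (leading punctuation, coreTerm, trailing punctuation)."""
    core = in_token.strip(string.punctuation)
    if not core:
        return (in_token, "", "")  # token is all punctuation: no coreTerm
    start = in_token.index(core)
    return (in_token[:start], core, in_token[start + len(core):])
```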

  • Detection: detect whether the focused token (coreTerm) matches the correction criteria:
    • if yes: go to the correction process
    • if no: no correction
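A minimal detection sketch, assuming simple non-word detection (an alphabetic coreTerm that is absent from the dictionary). The function name and the case-insensitive lookup are assumptions; CSpell's real criteria are richer:

```python
# Flag a coreTerm for correction when it is alphabetic but not in the
# dictionary (a non-word). Numbers, punctuation, and mixed tokens are skipped.
def matches_correction_criteria(core_term: str, dictionary: set) -> bool:
    if not core_term.isalpha():
        return False  # skip numbers, punctuation, and mixed tokens
    return core_term.lower() not in dictionary
```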

  • Correction:
    • Candidate Generator:
      • Retrieve all possible candidates (Church's reverse minimum edit distance technique: generate every string within a given edit distance and keep those found in the dictionary)
      • Generate validated candidates:
        • one-to-one candidates
        • split candidates
        • merge candidates
    • Ranker
      • Find the highest-ranked candidate using one of the scoring systems:
        • Orthographic
        • Frequency
        • Context
        • Noisy Channel
        • Ensemble method
        • CSpell 2-stage
    • Corrector
      • Update coreTerm
      • If coreTerm changes (i.e., there is a correction):
        • Update outToken
        • Update operation information
      • Update context (the whole inToken list) for the merge case
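The candidate-generation step above can be sketched with a brute-force edit-distance-1 generator filtered by the dictionary (one common way to realize Church's reverse-minimum-edit-distance idea at distance 1), plus split and merge generators. All names below are illustrative, and a real generator would support larger edit distances:

```python
# Sketch of one-to-one, split, and merge candidate generation.
import string

LETTERS = string.ascii_lowercase

def edits1(word: str) -> set:
    """All strings within edit distance 1 (delete, transpose, replace, insert)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {L + R[1:] for L, R in splits if R}
    transposes = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
    replaces = {L + c + R[1:] for L, R in splits if R for c in LETTERS}
    inserts = {L + c + R for L, R in splits for c in LETTERS}
    return deletes | transposes | replaces | inserts

def one_to_one_candidates(word: str, dictionary: set) -> set:
    """Single-word replacements: edit-distance-1 strings found in the dictionary."""
    return {c for c in edits1(word) if c in dictionary}

def split_candidates(word: str, dictionary: set) -> set:
    """Split one token into two dictionary words ('thereis' -> 'there is')."""
    return {f"{word[:i]} {word[i:]}"
            for i in range(1, len(word))
            if word[:i] in dictionary and word[i:] in dictionary}

def merge_candidates(word: str, next_word: str, dictionary: set) -> set:
    """Merge two adjacent tokens into one dictionary word ('dis ease' -> 'disease')."""
    merged = word + next_word
    return {merged} if merged in dictionary else set()
```

Note that merge candidates are why the Corrector must update the whole inToken list: accepting a merge consumes the following token as well.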
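The ranking step can be illustrated with a toy noisy-channel scorer: score(c) combines a language-model term P(c), estimated from corpus frequency, with a channel-model term P(w|c), crudely approximated here by an edit-distance penalty. The function names, smoothing, and penalty constant are all assumptions; CSpell's actual orthographic, frequency, context, ensemble, and 2-stage scorers are more elaborate:

```python
# Toy noisy-channel ranking: pick the candidate maximizing
# log P(c) + log P(w|c), with P(w|c) approximated by penalty**edit_distance.
import math

def noisy_channel_score(candidate: str, word_freq: dict, total: int,
                        edit_distance: int, penalty: float = 0.01) -> float:
    p_c = word_freq.get(candidate, 0.5) / total  # smoothed language model P(c)
    p_w_given_c = penalty ** edit_distance       # crude channel model P(w|c)
    return math.log(p_c) + math.log(p_w_given_c)

def best_candidate(candidates: dict, word_freq: dict, total: int) -> str:
    """candidates maps each candidate to its edit distance from the typo."""
    return max(candidates,
               key=lambda c: noisy_channel_score(c, word_freq, total, candidates[c]))
```

For example, given the typo "thre" with equal-distance candidates "the" and "there", the much more frequent "the" wins under this scorer.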

  • Flow Chart:

    [flowChart: figure illustrating the correction flow described above]