
CSpell

Dictionary-based Correction - Design

This page describes the high-level design for dictionary-based correction of spelling errors.

  • Detect a legit token:
    • if legit: go to the detection process described below.
    • if not legit: go to the next token (the current token is left unchanged). A token is not legit in the following cases:
      • space token: a space needs no correction.
      • token is too long: a token that is too long (has too many characters) and is not in the dictionary (a non-word) can generate too many correction candidates, which slows processing and can even use up all memory. Such tokens are also unlikely to be corrected properly. The maximum length of a legit token (CS_MAX_LEGIT_TOKEN_LENGTH) is therefore specified in the configuration file, with a default value of 30. Any token longer than the specified value is not a legit token and will not be corrected.
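
The legit-token check above can be sketched as follows. This is a minimal illustration, not CSpell's actual code: the function name `is_legit_token` is hypothetical, and only the constant name and its default of 30 come from the description above.

```python
# Hypothetical sketch of the legit-token check; only the constant name and
# its default value (30) are taken from the design description.
CS_MAX_LEGIT_TOKEN_LENGTH = 30


def is_legit_token(token: str, max_length: int = CS_MAX_LEGIT_TOKEN_LENGTH) -> bool:
    """Return True if the token should proceed to the detection step."""
    if token == "" or token.isspace():
        return False  # space token: nothing to correct
    if len(token) > max_length:
        return False  # too long: candidate generation would be too costly
    return True
```

A token of exactly 30 characters still passes with the default setting, because only tokens longer than the configured value are rejected.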

  • Tokenize: get the coreTerm from the input word (inToken)
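
The page does not define how the coreTerm is extracted. As a rough sketch, assume the coreTerm is the input token with leading and trailing punctuation stripped off. That assumption, the function name `split_core_term`, and the use of ASCII punctuation are all illustrative choices, not taken from the design.

```python
import string
from typing import Tuple


def split_core_term(in_token: str) -> Tuple[str, str, str]:
    """Split a token into (prefix, core_term, suffix).

    Assumption: prefix and suffix are runs of leading and trailing ASCII
    punctuation, and the core term is whatever remains in between.
    """
    without_prefix = in_token.lstrip(string.punctuation)
    prefix = in_token[: len(in_token) - len(without_prefix)]
    core_term = without_prefix.rstrip(string.punctuation)
    suffix = without_prefix[len(core_term):]
    return prefix, core_term, suffix
```

Keeping the prefix and suffix lets a corrected coreTerm be reassembled into the output token without losing the surrounding punctuation.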

  • Detection: detect whether the focused token (coreTerm) matches the correction criteria:
    • if yes: go to the correction process
    • if no: no correction

  • Correction:
    • Candidate Generator:
      • Retrieve all possible candidates (Church's reverse minimum edit distance technique: all possibilities within the edit distance that are also in the dictionary)
      • Generate validated candidates
        • one-to-one candidates
        • split candidates
        • merge candidates
    • Ranker
      • Find the candidate with the highest score, using different scoring systems
        • Orthographic
        • Frequency
        • Context

        • Noisy Channel
        • Ensemble method
        • CSpell 2-stage
    • Corrector
      • Update coreTerm
      • If coreTerm changes (there is a correction)
        • Update outToken
        • Update operation information
      • Update context (the whole inToken list) for the merge case
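
The correction steps above (candidate generator, ranker, corrector) can be sketched as a toy pipeline. This is a simplified illustration under several assumptions, not CSpell's implementation: candidates are limited to edit distance 1, detection is assumed to mean "not in the dictionary", only a frequency score is used for ranking, merge candidates are omitted because they need the surrounding tokens, and the dictionary and its frequencies are made-up values.

```python
import string
from typing import Dict, Set

# Toy dictionary with made-up frequencies, for illustration only.
WORD_FREQ: Dict[str, int] = {"diabetes": 120, "diabetic": 80, "side": 70, "effect": 95}


def edits1(word: str) -> Set[str]:
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    transposes = {left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1}
    replaces = {left + c + right[1:] for left, right in splits if right for c in letters}
    inserts = {left + c + right for left, right in splits for c in letters}
    return deletes | transposes | replaces | inserts


def one_to_one_candidates(word: str, word_freq: Dict[str, int]) -> Set[str]:
    """Keep only the edits that are dictionary words (validated candidates)."""
    return {c for c in edits1(word) if c in word_freq and c != word}


def split_candidates(word: str, word_freq: Dict[str, int]) -> Set[str]:
    """Insert one space wherever both halves are dictionary words."""
    return {word[:i] + " " + word[i:] for i in range(1, len(word))
            if word[:i] in word_freq and word[i:] in word_freq}


def frequency_score(candidate: str, word_freq: Dict[str, int]) -> int:
    """Score a candidate by its least frequent word (assumed scoring rule)."""
    return min(word_freq.get(part, 0) for part in candidate.split(" "))


def correct(core_term: str, word_freq: Dict[str, int]) -> str:
    """Return the top-ranked candidate, or core_term unchanged if none applies."""
    if core_term in word_freq:
        return core_term  # assumed detection rule: dictionary words are not corrected
    candidates = one_to_one_candidates(core_term, word_freq) | split_candidates(core_term, word_freq)
    if not candidates:
        return core_term  # no valid candidate: no correction
    # Rank by score; break ties alphabetically so the result is deterministic.
    return max(candidates, key=lambda c: (frequency_score(c, word_freq), c))
```

With this toy dictionary, `correct("diabetis", WORD_FREQ)` picks "diabetes" over "diabetic" because of its higher frequency, and `correct("sideeffect", WORD_FREQ)` returns the split candidate "side effect". The caller would then update outToken and the operation information only when the returned value differs from the original coreTerm.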

  • Flow Chart:

    [Figure: flow chart of the dictionary-based correction process]