CSpell

Analysis - Impact Factors

The approach is to decouple the impact factors from one another and then optimize each factor individually to obtain the best overall implementation of CSpell. Impact factors include (but are not limited to):

  • Dictionary (spelling error detection and spelling correction suggestion)
  • Maximum edit distance allowed for candidates
  • Ranking score system (method and weight)

These factors have complicated relationships with each module. The impact factors associated with each module are summarized as follows:

Module | Algorithm - Factors
Pre-Correction (not dictionary based):
  • Rule-based algorithm
    • Patterns observed from Lexicon
    • Developed algorithm for all patterns
Dictionary based correction:
Spelling Checker
  • IsValidWord (Check Dictionary)
    • check word
    • check core-term
    • check possessive
    • check slash-separated alternatives ("or", e.g. case/test)
    • check parenthetic plural forms (s), (es), (ies)

      Dictionary should include words:

    • IsSpVar
    • IsProperNoun
    • IsAbbAcr
  • IsException
    • IsDigit
    • IsPunc
    • IsDigitPunc
    • IsUrl
    • IsEmail
    • IsEmptyString
    • IsMeasurements (Unit)
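The detection checks listed above can be sketched as follows. This is a minimal illustration, not CSpell's actual implementation: the toy `DICTIONARY`, the function names, and the regular expressions are all assumptions made for the example, and the measurement (unit) check is left out.

```python
import re

# Hypothetical toy dictionary. It is assumed to already include spelling
# variants, proper nouns, and abbreviations/acronyms.
DICTIONARY = {"case", "test", "patient", "doctor", "finger", "diabetes"}


def in_dictionary(word):
    return word.lower() in DICTIONARY


def is_exception(token):
    """Tokens that are skipped rather than spell-checked (assumed patterns)."""
    if token == "":
        return True                                        # empty string
    if re.fullmatch(r"[\d\W_]+", token):
        return True                                        # digits, punctuation, or both
    if re.fullmatch(r"(https?://|www\.)\S+", token):
        return True                                        # URL
    if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", token):
        return True                                        # e-mail address
    return False


def is_valid_word(token):
    """Dictionary lookup on the word, core-term, possessive, slash, and plural forms."""
    core = token.strip(".,;:!?")                           # drop surrounding punctuation
    plural = re.fullmatch(r"(.+)\((?:s|es|ies)\)", core)   # finger(s), box(es), ...
    if plural:
        return in_dictionary(plural.group(1))
    core = core.strip("()[]")                              # core-term
    if in_dictionary(core):
        return True
    if core.lower().endswith("'s"):                        # possessive
        return in_dictionary(core[:-2])
    if "/" in core:                                        # case/test
        return all(part and in_dictionary(part) for part in core.split("/"))
    return False


def needs_correction(token):
    return not is_exception(token) and not is_valid_word(token)
```

With this sketch, `needs_correction("diabetis")` is true, while `"case/test"`, `"finger(s)"`, and `"12.5"` are all left alone.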
Candidates: 1-to-1
  • Possibility: Edit Distance (<= 2)
  • IsDicWord (Suggestion Dictionary)
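As an illustration of 1-to-1 candidate generation, the sketch below keeps every suggestion-dictionary word within edit distance 2 of the misspelling. It assumes plain Levenshtein distance and a brute-force scan of a toy `SUGGESTION_DICTIONARY`; both are choices made for the example, and a production system would more likely generate or index candidates than scan the whole dictionary.

```python
# Hypothetical toy suggestion dictionary.
SUGGESTION_DICTIONARY = {"diabetes", "diabetic", "dialysis", "doctor"}


def edit_distance(a, b):
    """Levenshtein distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def one_to_one_candidates(word, max_distance=2):
    """Dictionary words within max_distance edits of word, excluding word itself."""
    return sorted(w for w in SUGGESTION_DICTIONARY
                  if 0 < edit_distance(word.lower(), w) <= max_distance)
```

For the input `"diabetis"`, this returns `diabetes` and `diabetic` (distance 1 each) and drops `dialysis` (distance 3). Raising `max_distance` enlarges the candidate set, which is why the edit-distance limit is treated as an impact factor.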
Candidates: Split
  • Possibility: Number of Split

  • IsMultiword: (Multiword Dictionary)
  • IsDicWord: (Suggestion Dictionary)
  • IsAbb/Acr + length: (Abb/Acr Dictionary, excluding short abbreviations/acronyms)
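Split candidate generation can be sketched in the same spirit. The example below inserts up to `max_splits` spaces into the token and keeps a split when the resulting phrase is a known multiword or when every piece is a dictionary word or a sufficiently long abbreviation/acronym. How CSpell actually combines these three checks is not stated here, so that acceptance rule, the toy dictionaries, and the `min_abb_acr_len` parameter are assumptions.

```python
from itertools import combinations

# Hypothetical toy dictionaries.
SUGGESTION_DICTIONARY = {"a", "lot", "scan", "any", "more"}
MULTIWORD_DICTIONARY = {"a lot"}
ABB_ACR_DICTIONARY = {"mri", "ct"}


def is_valid_piece(piece, min_abb_acr_len):
    piece = piece.lower()
    if piece in SUGGESTION_DICTIONARY:
        return True
    # Short abbreviations/acronyms are excluded because they match too easily.
    return piece in ABB_ACR_DICTIONARY and len(piece) >= min_abb_acr_len


def split_candidates(word, max_splits=2, min_abb_acr_len=3):
    """Try every way of inserting 1..max_splits spaces into word."""
    candidates = []
    for n in range(1, max_splits + 1):
        for cut_points in combinations(range(1, len(word)), n):
            bounds = (0, *cut_points, len(word))
            pieces = [word[i:j] for i, j in zip(bounds, bounds[1:])]
            phrase = " ".join(pieces)
            if (phrase.lower() in MULTIWORD_DICTIONARY
                    or all(is_valid_piece(p, min_abb_acr_len) for p in pieces)):
                candidates.append(phrase)
    return candidates
```

Here `split_candidates("alot")` yields `a lot` and `split_candidates("mriscan")` yields `mri scan`, while `"ctscan"` produces nothing because `ct` falls under the assumed minimum acronym length.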
Candidates: Merge
  • TBD: most merge cases are not typos and involve real-word correction
Ranking: Orthographic
  • EditDistance
  • phonetic (Metaphone 2)
  • Overlap

  • Weights
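One way to picture the orthographic ranking is as a weighted sum of three similarity scores, each scaled to the range 0 to 1. The sketch below is only an illustration under several assumptions: the weights are placeholders rather than tuned values, `phonetic_key` is a crude consonant-skeleton stand-in and not a Metaphone implementation, and overlap is taken to be the shared leading plus trailing characters divided by the longer length.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def phonetic_key(word):
    """Crude stand-in for a phonetic code: first letter plus later consonants."""
    word = word.lower()
    key = word[:1]
    for ch in word[1:]:
        if ch not in "aeiouy" and ch != key[-1]:
            key += ch
    return key


def overlap_similarity(a, b):
    """Shared leading plus trailing characters over the longer length (assumed)."""
    longest, shortest = max(len(a), len(b)), min(len(a), len(b))
    if longest == 0:
        return 1.0
    lead = 0
    while lead < shortest and a[lead] == b[lead]:
        lead += 1
    trail = 0
    while trail < shortest - lead and a[-1 - trail] == b[-1 - trail]:
        trail += 1
    return (lead + trail) / longest


def orthographic_score(source, candidate, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of edit, phonetic, and overlap similarity (placeholder weights)."""
    w_edit, w_phonetic, w_overlap = weights
    return (w_edit * edit_similarity(source, candidate)
            + w_phonetic * edit_similarity(phonetic_key(source), phonetic_key(candidate))
            + w_overlap * overlap_similarity(source, candidate))


def rank_candidates(source, candidates, weights=(0.5, 0.3, 0.2)):
    return sorted(candidates, key=lambda c: orthographic_score(source, c, weights),
                  reverse=True)
```

With these placeholder weights, `rank_candidates("diabetis", ["dialysis", "diabetes"])` puts `diabetes` first. Changing the `weights` tuple is the knob that the "Weights" factor refers to.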
Ranking: Frequency
  • TBD
Ranking: Context
  • TBD

Tokenized words are split into two groups (with and without annotations) in this analysis. The analysis reports are described in the results section.