CSpell

Real-word Correction

This page describes the algorithm for real-word correction. In CSpell, real-word errors are detected and corrected on the fly, based on the context score, the word frequency score, and other heuristic rules. Neither a confusion set nor an assumption about the number of real-word errors is used.

I. Functions

II. Results on the Training Set

Different methods were tested on the gold standard from the training set, which includes real-word errors.

Methods                                       Raw data (TP|returned|gold)   Performance (precision|recall|F1)
Ensemble (Use Non-Word on Real-Word)          556|825|964                   0.6739|0.5768|0.6216
Ensemble (Real-Word)                          517|718|964                   0.7201|0.5363|0.6147
CSpell: NW                                    609|731|964                   0.8331|0.6317|0.7186
CSpell: NW + RW_Merge                         619|742|964                   0.8342|0.6421|0.7257
CSpell: NW + RW_Split                         611|737|964                   0.8290|0.6338|0.7184
CSpell: NW + RW_1To1                          614|740|964                   0.8297|0.6369|0.7207
CSpell: NW + RW_Merge + RW_Split              621|747|964                   0.8313|0.6442|0.7259
CSpell: NW + RW_Merge + RW_Split + RW_1To1    626|756|964                   0.8280|0.6494|0.7279
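
The three Performance values in each row can be reproduced from the three Raw data values if the raw numbers are read as true positives, returned corrections, and gold-standard corrections. That reading is an assumption, but it matches the arithmetic of every row. A minimal sketch (the function name `prf` is made up for illustration):

```python
def prf(true_positives, returned, gold):
    """Return (precision, recall, F1), each rounded to 4 decimals."""
    precision = true_positives / returned
    recall = true_positives / gold
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 4), round(recall, 4), round(f1, 4)

# Raw data of the "CSpell: NW" row: 609|731|964
print(prf(609, 731, 964))
# (0.8331, 0.6317, 0.7186)
```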

  • RW_Merge and RW_Split: ~1 min.
  • RW_1To1: ~4 min.
  • RW_Merge + RW_Split: ~1 min.
  • RW_Merge + RW_Split + RW_1To1: ~4.5 min.

III. Examples

  • Merge:

    ID    Input                             Output                            Notes
    M-1   on set                            on set                            No merge
    M-2   based on set criteria             based on set criteria             No merge
    M-3   early on set                      early onset                       Merged
    M-4   on set dementia                   onset dementia                    Merged
    M-5   dianosed early on set deminita    diagnosed early onset dementia    Merged with other NW corrections
    • Whether "on set" is merged to "onset" depends on the context. In Example M-5, "dianosed" and "deminita" are also corrected to "diagnosed" and "dementia", respectively, by the non-word functions before the real-word merge.
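
The context dependence in examples M-1 to M-4 can be illustrated with a toy sketch. It is not CSpell's implementation: the dictionary, the list of known word pairs, and the scoring function below are all hypothetical stand-ins for a real dictionary and a real context score. The idea is to generate every variant in which one adjacent pair of tokens is merged into a dictionary word, then keep the variant that the context score prefers.

```python
from typing import List, Set, Tuple

# Hypothetical toy data, for illustration only (not CSpell's dictionary or corpus).
DICTIONARY: Set[str] = {"on", "set", "onset", "early", "based", "criteria", "dementia"}
KNOWN_PAIRS: Set[Tuple[str, str]] = {
    ("based", "on"), ("on", "set"), ("set", "criteria"),
    ("early", "onset"), ("onset", "dementia"),
}

def merge_candidates(tokens: List[str]) -> List[List[str]]:
    """Return the original tokens plus each variant with one adjacent pair merged."""
    candidates = [list(tokens)]
    for i in range(len(tokens) - 1):
        merged = tokens[i] + tokens[i + 1]
        if merged in DICTIONARY:
            candidates.append(tokens[:i] + [merged] + tokens[i + 2:])
    return candidates

def toy_context_score(tokens: List[str]) -> float:
    """Toy stand-in for a context score: fraction of adjacent pairs that are known."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return 0.0
    return sum(pair in KNOWN_PAIRS for pair in pairs) / len(pairs)

def correct_merge(text: str) -> str:
    """Pick the best-scoring candidate; max() keeps the original on ties."""
    candidates = merge_candidates(text.split())
    return " ".join(max(candidates, key=toy_context_score))

for example in ["on set", "based on set criteria", "early on set", "on set dementia"]:
    print(example, "->", correct_merge(example))
```

With this toy data the four inputs come out as in rows M-1 to M-4. Listing the unmerged text first means a merge is applied only when its score is strictly higher, which is a conservative choice made for this sketch.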

  • Split:

    ID    Input                   Output                 Notes
    S-1   along                   along                  No split
    S-2   for along time          for a long time        Split
    S-3   He is along             He is along            No split
    S-4   He is a long with me    He is along with me    No split - Merge
    • Google does not correct S-2 and S-4!!
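
Candidate generation for a split such as S-2 can be sketched as follows. This is only an illustration under stated assumptions: the toy dictionary and the function names are hypothetical, and the sketch stops at listing candidates. Choosing between "for along time" and "for a long time" would still require a context score, which is not modeled here.

```python
from typing import List, Set

# Hypothetical toy dictionary, for illustration only.
DICTIONARY: Set[str] = {"a", "long", "along", "for", "time", "he", "is"}

def split_word(word: str) -> List[str]:
    """Return every two-word split of `word` whose halves are both in DICTIONARY."""
    return [
        f"{word[:i]} {word[i:]}"
        for i in range(1, len(word))
        if word[:i] in DICTIONARY and word[i:] in DICTIONARY
    ]

def split_candidates(text: str) -> List[str]:
    """Return the original text plus each variant with exactly one token split."""
    tokens = text.split()
    candidates = [text]
    for i, token in enumerate(tokens):
        for pieces in split_word(token):
            candidates.append(" ".join(tokens[:i] + [pieces] + tokens[i + 1:]))
    return candidates

print(split_candidates("for along time"))
# ['for along time', 'for a long time']
```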

  • Spelling (1-to-1):

    ID     Input                   Output
    1-1    foul small              foul smell
    1-2    bad small               bad smell
    1-3    small an odor           smell an odor
    1-4    sense of small          sense of smell
    1-5    taste and small         taste and smell
    1-6    smell size              small size
    1-7    smell amount            small amount
    1-8    a smell sip of water    a small sip of water
    1-9    smell intestine         small intestine
    1-10   very smell              very small
    1-11   relatively smell        relatively small
    • Google does not correct 1-3, 1-5, 1-10 and 1-11!!
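
In every row of the 1-to-1 table, the input and output differ by a single substituted letter ("small" vs. "smell"). This page does not state how CSpell generates 1-to-1 candidates, so the following is only a sketch under the assumption that candidates are dictionary words within a small edit distance of the input word. The toy dictionary and function names are hypothetical, and ranking the candidates by context is left out.

```python
from typing import List, Set

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = curr
    return prev[-1]

def one_to_one_candidates(word: str, dictionary: Set[str], max_distance: int = 1) -> List[str]:
    """Return other dictionary words within `max_distance` edits of `word`, sorted."""
    return sorted(
        w for w in dictionary
        if w != word and edit_distance(word, w) <= max_distance
    )

# Hypothetical toy dictionary, for illustration only.
TOY_DICTIONARY = {"small", "smell", "smile", "stall", "mall"}
print(one_to_one_candidates("small", TOY_DICTIONARY))
# ['mall', 'smell', 'stall']
```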