Real-word Correction
This page describes the algorithm for real-word correction. In CSpell, real-word errors are detected and corrected on the fly, based on a context score, a word frequency score, and other heuristic rules. No confusion sets are used, and no assumption is made about the number of real-word errors.
I. Functions
II. Results on the Training Set
Different methods were tested on the training-set gold standard that includes real-word errors.
Methods | TP | Retrieved | Relevant | Precision | Recall | F1 |
---|---|---|---|---|---|---|
Ensemble (Use Non-Word on Real-Word) | 556 | 825 | 964 | 0.6739 | 0.5768 | 0.6216 |
Ensemble (Real-Word) | 517 | 718 | 964 | 0.7201 | 0.5363 | 0.6147 |
CSpell: NW | 609 | 731 | 964 | 0.8331 | 0.6317 | 0.7186 |
CSpell: NW + RW_Merge | 619 | 742 | 964 | 0.8342 | 0.6421 | 0.7257 |
CSpell: NW + RW_Split | 611 | 737 | 964 | 0.8290 | 0.6338 | 0.7184 |
CSpell: NW + RW_1To1 | 614 | 740 | 964 | 0.8297 | 0.6369 | 0.7207 |
CSpell: NW + RW_Merge + RW_Split | 621 | 747 | 964 | 0.8313 | 0.6442 | 0.7259 |
CSpell: NW + RW_Merge + RW_Split + RW_1To1 | 626 | 756 | 964 | 0.8280 | 0.6494 | 0.7279 |
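
The performance figures in the table can be reproduced from the three raw counts in each row, if those counts are read as correct fixes, fixes attempted, and errors in the gold standard. That reading is an assumption, although it matches the listed values. The function name below is illustrative and not part of CSpell.

```python
def precision_recall_f1(tp: int, retrieved: int, relevant: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from raw counts.

    tp:        corrections that match the gold standard (assumed meaning)
    retrieved: corrections the method made (assumed meaning)
    relevant:  errors annotated in the gold standard (assumed meaning)
    """
    precision = tp / retrieved if retrieved else 0.0
    recall = tp / relevant if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Last row of the table: CSpell: NW + RW_Merge + RW_Split + RW_1To1
p, r, f1 = precision_recall_f1(626, 756, 964)
print(f"{p:.4f} {r:.4f} {f1:.4f}")  # 0.8280 0.6494 0.7279
```

On this reading, each real-word module added to the NW baseline raises recall, while precision stays within about half a percentage point of the baseline.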
III. Examples
Real-word merge:
ID | Input | Output | Notes |
---|---|---|---|
M-1 | on set | on set | No merge |
M-2 | based on set criteria | based on set criteria | No merge |
M-3 | early on set | early onset | Merged |
M-4 | on set dementia | onset dementia | Merged |
M-5 | dianosed early on set deminita | diagnosed early onset dementia | Merged with other NW corrections |
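
The merge examples (M-1 to M-5) suggest that a merge is only considered when two adjacent words concatenate into a known word. A minimal sketch of that candidate step follows. It uses a toy lexicon and a hypothetical `merge_candidates` helper; neither is taken from CSpell, and the context and frequency scoring that decides whether to accept a candidate is not reproduced here.

```python
def merge_candidates(tokens: list[str], lexicon: set[str]) -> list[tuple[int, str]]:
    """Return (index, merged_word) for each adjacent pair whose
    concatenation is a word in the lexicon."""
    candidates = []
    for i in range(len(tokens) - 1):
        merged = tokens[i] + tokens[i + 1]
        if merged.lower() in lexicon:
            candidates.append((i, merged))
    return candidates


# Toy lexicon for illustration only.
TOY_LEXICON = {"on", "set", "onset", "early", "dementia", "based", "criteria"}

print(merge_candidates("early on set".split(), TOY_LEXICON))
# [(1, 'onset')]
print(merge_candidates("based on set criteria".split(), TOY_LEXICON))
# [(1, 'onset')]
```

Both inputs yield the same candidate, yet only M-3 is merged while M-2 is left alone. This is why candidate generation alone is not enough and a context-based decision is needed afterwards.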
Real-word split:
ID | Input | Output | Notes |
---|---|---|---|
S-1 | along | along | No split |
S-2 | for along time | for a long time | Split |
S-3 | He is along | He is along | No split |
S-4 | He is a long with me | He is along with me | No split - Merge |
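
The split examples (S-1 to S-4) can be read the same way in reverse: a split is only a candidate when both parts are known words. The sketch below enumerates such candidates for a single token. The lexicon and the `split_candidates` name are hypothetical and not taken from CSpell, and the context-based decision between keeping and splitting is left out.

```python
def split_candidates(word: str, lexicon: set[str]) -> list[tuple[str, str]]:
    """Return every (left, right) split of `word` where both parts
    are words in the lexicon."""
    candidates = []
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in lexicon and right in lexicon:
            candidates.append((left, right))
    return candidates


# Toy lexicon for illustration only.
TOY_LEXICON = {"a", "long", "along", "for", "time", "he", "is"}

print(split_candidates("along", TOY_LEXICON))
# [('a', 'long')]
```

The candidate `a long` exists for every occurrence of `along`, so S-2 (split) and S-3 (no split) differ only in how well each alternative fits the surrounding words.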
Real-word 1-to-1:
ID | Input | Output |
---|---|---|
1-1 | foul small | foul smell |
1-2 | bad small | bad smell |
1-3 | small an odor | smell an odor |
1-4 | sense of small | sense of smell |
1-5 | taste and small | taste and smell |
1-6 | smell size | small size |
1-7 | smell amount | small amount |
1-8 | a smell sip of water | a small sip of water |
1-9 | smell intestine | small intestine |
1-10 | very smell | very small |
1-11 | relatively smell | relatively small |
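
In the 1-to-1 examples, `small` and `smell` differ by a single letter, and the surrounding words determine which one is intended. As a rough illustration, the sketch below lists lexicon words within one edit of a given word. This is an assumed candidate rule, not CSpell's documented one. The helper names and lexicon are hypothetical, and the context scoring that picks between the original word and a candidate is omitted.

```python
def within_one_edit(a: str, b: str) -> bool:
    """True if a and b differ by exactly one insertion, deletion,
    or substitution."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make `a` the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one substitution at position i
    return a[i:] == b[i + 1:]          # one extra character in b at position i


def one_to_one_candidates(word: str, lexicon: set[str]) -> list[str]:
    """Return lexicon words within one edit of `word`, sorted."""
    return sorted(w for w in lexicon if within_one_edit(word, w))


# Toy lexicon for illustration only.
TOY_LEXICON = {"small", "smell", "smile", "mall", "size", "amount"}

print(one_to_one_candidates("smell", TOY_LEXICON))  # ['small']
print(one_to_one_candidates("small", TOY_LEXICON))  # ['mall', 'smell']
```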