Frequency Score
I. Introduction
This page describes how candidates are ranked by frequency score to choose the correct word from the suggestions generated for a misspelled word.
The following corpus from Ensemble is used as the baseline for testing different frequency score formulas to find the one that performs best.
Corpus | Sources | Statistics |
---|---|---|
Ensemble | health-related articles in: | |
The unigram word count is used as the frequency. The format of the word frequency file is:
word | word count |
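
As a rough illustration of how such a file could be read, the sketch below parses `word|count` lines into a dictionary and derives the two totals used by the scoring formulas (TotalWc and MaxWc). The function name `load_word_counts`, the `|` separator, and the sample words and counts are assumptions made for this example; they are not taken from the actual corpus or from the CSpell source.

```python
def load_word_counts(lines, sep="|"):
    """Parse 'word<sep>count' lines into a dict mapping word -> count.

    The separator and the one-entry-per-line layout are assumptions.
    """
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Split on the last separator so the count is always the final field.
        word, found, count = line.rpartition(sep)
        if not found:
            raise ValueError(f"missing separator {sep!r} in line: {line!r}")
        counts[word.strip()] = int(count.strip())
    return counts


# Made-up sample entries, for illustration only.
sample = ["diabetes|1200", "diabetic|300", "the|50000"]
counts = load_word_counts(sample)

total_wc = sum(counts.values())  # TotalWc: sum of all word counts
max_wc = max(counts.values())    # MaxWc: count of the most frequent word
```

With a real file, the same function could be called on an open file handle, for example `load_word_counts(open(path, encoding="utf-8"))`, where `path` is wherever the frequency file is stored.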
III. Algorithm, Tests, and Observations
The following formulas are used to calculate the frequency score for comparison:
Method | Formula | Corpus-Baseline (TP/Retrieved/Relevant; Precision/Recall/F1) | Min | Max | Multiwords |
---|---|---|---|---|---|
Kenneth Church, 1991 | score = (1 + Wc) / TotalWc | 530/773/774; 0.6856/0.6848/0.6852 | 1/TotalWc | (1 + MaxWc)/TotalWc | N/A |
Jonathan Crowell, 2004 | score = 1 + log(Wc); score = 0.5 if Wc = 0 | 530/773/774; 0.6856/0.6848/0.6852 | 0.5 | 1 + log(MaxWc) | N/A |
Peter Norvig, 2007 | score = Wc / TotalWc | 521/730/774; 0.7137/0.6731/0.6928 | 0 | MaxWc/TotalWc | N/A |
WC | score = Wc (word count) | 521/730/774; 0.7137/0.6731/0.6928 | 0 | MaxWc | N/A |
Halil Kilicoglu, 2015 | score = log(Wc/TotalWc) / log(MaxWc/TotalWc) | 414/758/774; 0.5462/0.5349/0.5405 | 0 | 1 | min. WC of all words in the corpus |
CSpell Development - just frequency | | | | | |
CSpell-Dev-1 (no score for multiword, 0.0) | score = Wc / MaxWc | 521/730/774; 0.7137/0.6731/0.6928 | 0 | 1 | N/A |
CSpell-Dev-2 (multiword score is the avg. of single word) | | 536/769/774; 0.6970/0.6925/0.6948 | 0 | 1 | Avg. WC of all words |
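
To make the formulas in the table above concrete, here is a minimal Python sketch of several of them, together with a simple ranking helper. It is an illustration under stated assumptions, not the CSpell implementation: the function names, the use of the natural logarithm for the Crowell score, and the sample counts are all hypothetical. The source does not show a formula for CSpell-Dev-2, so `cspell_dev2_score` is only one reading of "multiword score is the avg. of single word" (average the word counts of the parts, then divide by MaxWc).

```python
import math


def church_score(wc, total_wc):
    """Kenneth Church, 1991: add-one smoothed relative frequency."""
    return (1 + wc) / total_wc


def crowell_score(wc):
    """Jonathan Crowell, 2004: 1 + log(Wc), or 0.5 when Wc is 0.

    The log base is not stated in the source; natural log is assumed here.
    """
    return 0.5 if wc == 0 else 1 + math.log(wc)


def norvig_score(wc, total_wc):
    """Peter Norvig, 2007: plain relative frequency."""
    return wc / total_wc


def cspell_dev1_score(wc, max_wc):
    """CSpell-Dev-1: word count normalized to the range [0, 1]."""
    return wc / max_wc


def cspell_dev2_score(candidate, counts, max_wc):
    """Assumed reading of CSpell-Dev-2: average Wc of the parts, over MaxWc."""
    words = candidate.split()
    if not words:
        return 0.0
    avg_wc = sum(counts.get(w, 0) for w in words) / len(words)
    return avg_wc / max_wc


def rank_candidates(candidates, counts, score_fn):
    """Sort candidates by descending score; unseen words get Wc = 0."""
    return sorted(candidates, key=lambda w: score_fn(counts.get(w, 0)), reverse=True)


# Made-up counts, for illustration only.
counts = {"diabetes": 1200, "diabetic": 300, "diabetics": 40}
total_wc = sum(counts.values())
max_wc = max(counts.values())

ranked = rank_candidates(
    ["diabetics", "diabetes", "dibetes"],
    counts,
    lambda wc: cspell_dev1_score(wc, max_wc),
)
```

Note that the Norvig, WC, and CSpell-Dev-1 scores differ only by a positive constant factor (1/TotalWc, 1, and 1/MaxWc respectively), so they order any set of candidates identically, which is consistent with their identical results in the table.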
IV. Source Code: