Frequency Score
I. Introduction
This page describes how the candidates suggested for a misspelled word are ranked by frequency score so that the correct word can be chosen.
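As a minimal sketch of this kind of ranking, the snippet below sorts suggested candidates by a frequency score and picks the top one. The raw word count stands in for the score, and the names `rank_candidates`, `word_counts`, and `candidates`, along with the sample counts, are hypothetical illustrations rather than CSpell code.

```python
def rank_candidates(candidates, word_counts):
    """Return candidates sorted from highest to lowest frequency score."""
    # Raw word count stands in for the frequency score (an assumption);
    # words missing from the frequency table score 0.
    return sorted(candidates, key=lambda w: word_counts.get(w, 0), reverse=True)


# Hypothetical counts and candidates for the misspelling "diabetis".
word_counts = {"diabetes": 120, "diabetic": 30, "dialysis": 50}
candidates = ["diabetic", "diabetes", "diabetics"]

ranked = rank_candidates(candidates, word_counts)
best = ranked[0]
print(ranked, best)
```

Because Python's `sorted` is stable, candidates with equal scores keep their original order, so a tie-breaking rule can be applied upstream by ordering the input list.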
The following corpus from Ensemble is used as the baseline to test different frequency score formulas and find the one with the best performance.
Corpus | Sources | Statistics |
---|---|---|
Ensemble | health related articles in: | |
The unigram word count is used as the frequency. The format of the word frequency file is:
word | word count |
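A file in this format could be loaded roughly as follows. This is a sketch under assumptions: one entry per line, fields separated by `|`, and any line whose second field is not an integer (such as a header) is skipped. The function name `parse_word_freq` and the sample lines are hypothetical.

```python
def parse_word_freq(lines):
    """Parse 'word|count' lines into a dict of word -> count."""
    counts = {}
    for line in lines:
        fields = [f.strip() for f in line.strip().split("|")]
        # Skip blank, malformed, or header lines (assumed layout: word|count).
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        counts[fields[0]] = int(fields[1])
    return counts


# Hypothetical sample content, including a header and a blank line.
sample = ["word | word count |", "diabetes|120", "dialysis|50", ""]
freq = parse_word_freq(sample)

total_wc = sum(freq.values())  # TotalWc: sum of all word counts
max_wc = max(freq.values())    # MaxWc: largest single word count
print(freq, total_wc, max_wc)
```

For a file on disk, the same function can be given an open file handle, since iterating over a file yields its lines.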
III. Algorithm, Tests, and Observations
The following formulas are used to calculate the frequency score for comparison:
Method | Formula | Corpus-Baseline | Min | Max | Multiwords |
---|---|---|---|---|---|
Kenneth Church, 1991 | score = (1 + Wc) / TotalWc | 530 / 773 / 774; 0.6856 / 0.6848 / 0.6852 | 1/TotalWc | (1 + MaxWc)/TotalWc | N/A |
Jonathan Crowell, 2004 | score = 1 + log(Wc), assign to 0.5 if Wc = 0 | 530 / 773 / 774; 0.6856 / 0.6848 / 0.6852 | 0.5 | 1 + log(MaxWc) | N/A |
Peter Norvig, 2007 | score = Wc / TotalWc | 521 / 730 / 774; 0.7137 / 0.6731 / 0.6928 | 0 | MaxWc/TotalWc | N/A |
WC | score = Wc (word count) | 521 / 730 / 774; 0.7137 / 0.6731 / 0.6928 | 0 | MaxWc | N/A |
Halil Kilicoglu, 2015 | score = log(Wc/TotalWc) / log(MaxWc/TotalWc) | 414 / 758 / 774; 0.5462 / 0.5349 / 0.5405 | 0 | 1 | min. WC of all words in the corpus |
CSpell Development - just frequency | | | | | |
CSpell-Dev-1 (no score for multiword, 0.0) | score = Wc / MaxWc | 521 / 730 / 774; 0.7137 / 0.6731 / 0.6928 | 0 | 1 | N/A |
CSpell-Dev-2 (multiword score is the avg. of single word) | | 536 / 769 / 774; 0.6970 / 0.6925 / 0.6948 | 0 | 1 | Avg. WC of all words |
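The single-word formulas that the table specifies in full can be sketched in Python as below. The function names are hypothetical. The logarithm base for the Crowell formula is not stated, so the natural logarithm is assumed. The Kilicoglu formula is left out because the table does not say how Wc = 0 is handled. The helper `precision_recall_f1` reflects an inference rather than something the table states: reading the first three baseline numbers as (correct, returned, expected) reproduces the second three as precision, recall, and F1.

```python
import math


def church_1991(wc, total_wc):
    return (1 + wc) / total_wc


def crowell_2004(wc):
    # Natural log is an assumption; the table does not give the base.
    return 0.5 if wc == 0 else 1 + math.log(wc)


def norvig_2007(wc, total_wc):
    return wc / total_wc


def raw_wc(wc):
    return wc


def cspell_dev_1(wc, max_wc):
    return wc / max_wc


def precision_recall_f1(correct, returned, expected):
    # Assumed reading of the baseline figures (not stated in the table).
    p = correct / returned
    r = correct / expected
    return round(p, 4), round(r, 4), round(2 * p * r / (p + r), 4)


# Hypothetical counts for illustration.
total_wc, max_wc = 200, 120
for wc in (0, 30, 120):
    print(wc, church_1991(wc, total_wc), crowell_2004(wc),
          norvig_2007(wc, total_wc), raw_wc(wc), cspell_dev_1(wc, max_wc))

church_metrics = precision_recall_f1(530, 773, 774)
norvig_metrics = precision_recall_f1(521, 730, 774)
print(church_metrics, norvig_metrics)
```

Norvig, WC, and CSpell-Dev-1 differ only by a constant positive factor, so they order any set of candidates identically, which is consistent with the identical baseline figures reported for those three rows. Church and Crowell both assign a nonzero score when Wc = 0, which may explain why those two rows share a different set of figures.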
IV. Source Code: