CSpell

Frequency Score

I. Introduction

This page describes how candidates are ranked by frequency score to choose the correct word from the suggested candidates for a misspelled word.

II. Data - Ensemble Corpus

The following corpus, Ensemble, is used as the baseline to test different frequency score formulas and find the one with the best performance.

Corpus: Ensemble

Sources: health-related articles in:
  • MedlinePlus Medical Encyclopedia
  • MedlinePlus Drug
  • Genetics Home Reference
  • Genetic and Rare Disease frequently asked questions
  • NHLBI Health Topics
  • NINDS Disorders
  • NIH Senior Health

Statistics:
  • Articles: 8,590
  • Tokens: 5,771,363
  • Unique Words: 51,190
  • Dic Words in Corpus: 35,781|6.3115%
  • Dic Words WC: 5,637,436|98.4708%

The unigram word count is used as the frequency. Each entry in the word frequency file has two fields:

  • word
  • word count
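As a rough illustration of reading such a file, the sketch below parses entries into a word-to-count map and derives the total and maximum word counts used by the formulas in the next section. The `|` separator, the class name `WordFrequencyLoader`, and all method names are assumptions made for this example; this page does not state the actual delimiter.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class WordFrequencyLoader {
    // Parses "word<separator>count" lines. The separator is an assumption, not from the source.
    static Map<String, Long> parse(List<String> lines, String separator) {
        Map<String, Long> wordCounts = new HashMap<>();
        for (String line : lines) {
            int idx = line.lastIndexOf(separator);
            if (idx <= 0) {
                continue; // no separator or empty word: skip the line
            }
            try {
                long count = Long.parseLong(line.substring(idx + separator.length()).trim());
                wordCounts.put(line.substring(0, idx).trim(), count);
            } catch (NumberFormatException e) {
                // count field is not a number: skip the line
            }
        }
        return wordCounts;
    }

    // Sum of all word counts (TotalWc in the formulas).
    static long totalWc(Map<String, Long> wordCounts) {
        return wordCounts.values().stream().mapToLong(Long::longValue).sum();
    }

    // Largest single word count (MaxWc in the formulas); 0 for an empty map.
    static long maxWc(Map<String, Long> wordCounts) {
        return wordCounts.values().stream().mapToLong(Long::longValue).max().orElse(0L);
    }

    public static void main(String[] args) {
        Map<String, Long> wc = parse(List.of("diabetes|120", "insulin|80", "malformed"), "|");
        System.out.println(wc.size() + " words, TotalWc=" + totalWc(wc) + ", MaxWc=" + maxWc(wc));
    }
}
```

Malformed lines are skipped rather than rejected here; a real loader might prefer to report them.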

III. Algorithm, Tests, and Observations

The following formulas are used to calculate the frequency score for comparison:

  • non-word 1To1 and split error correction.
    Columns: Method, Formula, Corpus-Baseline, Min, Max, Multiwords

    Method: Kenneth Church, 1991
      Formula: score = (1 + Wc) / TotalWc
      Corpus-Baseline: 530|773|774
                       0.6856|0.6848|0.6852
      Min: 1/TotalWc
      Max: (1 + MaxWc)/TotalWc
      Multiwords: N/A

    Method: Jonathan Crowell, 2004
      Formula: score = 1 + log(Wc), assigned 0.5 if Wc = 0
      Corpus-Baseline: 530|773|774
                       0.6856|0.6848|0.6852
      Min: 0.5
      Max: 1 + log(MaxWc)
      Multiwords: N/A

    Method: Peter Norvig, 2007
      Formula: score = Wc / TotalWc
      Corpus-Baseline: 521|730|774
                       0.7137|0.6731|0.6928
      Min: 0
      Max: MaxWc/TotalWc
      Multiwords: N/A

    Method: WC
      Formula: score = Wc (word count)
      Corpus-Baseline: 521|730|774
                       0.7137|0.6731|0.6928
      Min: 0
      Max: MaxWc
      Multiwords: N/A

    Method: Halil Kilicoglu, 2015
      Formula: score = log(Wc/TotalWc) / log(MaxWc/TotalWc)
      Corpus-Baseline: 414|758|774
                       0.5462|0.5349|0.5405
      Min: 0
      Max: 1
      Multiwords: min. WC of all words in the corpus

    CSpell Development - just frequency

    Method: CSpell-Dev-1 (no score for multiword, 0.0)
      Formula: score = Wc / MaxWc
      Corpus-Baseline: 521|730|774
                       0.7137|0.6731|0.6928
      Min: 0
      Max: 1
      Multiwords: N/A

    Method: CSpell-Dev-2 (multiword score is the avg. of single word)
      Formula:
        • score = Wc / MaxWc
        • Ranking: number of words, score, alphabetic
      Corpus-Baseline: 536|769|774
                       0.6970|0.6925|0.6948
      Min: 0
      Max: 1
      Multiwords: Avg. WC of all words
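
The formulas listed above can be sketched in Java as follows. This is a minimal illustration, not the contents of FrequencyScore.java: the class name `FrequencyScoreSketch` and the method names are hypothetical, and the natural logarithm is assumed because the page does not state the log base.

```java
final class FrequencyScoreSketch {
    // Kenneth Church, 1991: add-one count over the total word count.
    static double church(long wc, long totalWc) {
        return (1.0 + wc) / totalWc;
    }

    // Jonathan Crowell, 2004: 1 + log(Wc), with 0.5 for unseen words.
    // Assumption: natural log (the base is not given on this page).
    static double crowell(long wc) {
        return wc == 0 ? 0.5 : 1.0 + Math.log(wc);
    }

    // Peter Norvig, 2007: relative frequency.
    static double norvig(long wc, long totalWc) {
        return (double) wc / totalWc;
    }

    // Halil Kilicoglu, 2015, exactly as written in the table.
    // Requires wc > 0 and maxWc < totalWc; evaluates to 1.0 when wc == maxWc.
    static double kilicoglu(long wc, long maxWc, long totalWc) {
        return Math.log((double) wc / totalWc) / Math.log((double) maxWc / totalWc);
    }

    // CSpell-Dev: word count normalized by the largest word count, range 0.0 ~ 1.0.
    static double cspellDev(long wc, long maxWc) {
        return (double) wc / maxWc;
    }

    public static void main(String[] args) {
        long wc = 50, maxWc = 200, totalWc = 1000; // made-up counts for illustration
        System.out.println("Church:     " + church(wc, totalWc));
        System.out.println("Crowell:    " + crowell(wc));
        System.out.println("Norvig:     " + norvig(wc, totalWc));
        System.out.println("CSpell-Dev: " + cspellDev(wc, maxWc));
    }
}
```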

  • Issues in the Ensemble corpus:
    • Too small: only 6.31% of the words in the dictionary have a frequency
    • The unigram formula does not work well
    • Good score range: 0.0 ~ 1.0
    • The frequency score can be improved with other knowledge sources (it should not be used alone)
    • Using the min. for the split case: not sure if it is a good model
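
The two multiword models mentioned on this page, the minimum and the average over the words of a split candidate, can be compared with a small sketch. It assumes each word is scored as Wc/MaxWc and that a split candidate is a space-separated string; the class `MultiwordScoreSketch` and its methods are hypothetical names for illustration only.

```java
import java.util.Arrays;
import java.util.Map;

final class MultiwordScoreSketch {
    // Single-word score Wc / MaxWc; words missing from the map count as 0.
    static double wordScore(String word, Map<String, Long> wordCounts, long maxWc) {
        return wordCounts.getOrDefault(word, 0L).doubleValue() / maxWc;
    }

    // "Min" model: the split candidate is only as good as its rarest word.
    static double minScore(String candidate, Map<String, Long> wordCounts, long maxWc) {
        return Arrays.stream(candidate.trim().split("\\s+"))
                .mapToDouble(w -> wordScore(w, wordCounts, maxWc))
                .min()
                .orElse(0.0);
    }

    // "Avg" model: the mean of the single-word scores.
    static double avgScore(String candidate, Map<String, Long> wordCounts, long maxWc) {
        return Arrays.stream(candidate.trim().split("\\s+"))
                .mapToDouble(w -> wordScore(w, wordCounts, maxWc))
                .average()
                .orElse(0.0);
    }

    public static void main(String[] args) {
        Map<String, Long> wc = Map.of("heart", 80L, "attack", 20L); // made-up counts
        System.out.println("min: " + minScore("heart attack", wc, 100));
        System.out.println("avg: " + avgScore("heart attack", wc, 100));
    }
}
```

With the minimum, one rare word pulls the whole candidate down, whereas the average lets a frequent word offset it.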

  • Smaller corpus: the method matters
  • Huge corpus: the method does not matter for single-word ranking
  • Split cases need a better evaluation method
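
For reference when comparing methods, the Corpus-Baseline figures in the table above are numerically consistent with precision, recall, and F1 computed from the three counts, if the counts are read as correct | returned | gold. That reading is an inference from the numbers, not something this page states. A minimal sketch (class and method names are hypothetical):

```java
import java.util.Locale;

final class BaselineMetrics {
    // Fraction of returned corrections that are correct.
    static double precision(int correct, int returned) {
        return returned == 0 ? 0.0 : (double) correct / returned;
    }

    // Fraction of gold-standard corrections that were found.
    static double recall(int correct, int gold) {
        return gold == 0 ? 0.0 : (double) correct / gold;
    }

    // Harmonic mean of precision and recall.
    static double f1(double p, double r) {
        return (p + r) == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Counts taken from the Kenneth Church row: 530|773|774
        double p = precision(530, 773);
        double r = recall(530, 774);
        System.out.println(String.format(Locale.ROOT, "%.4f|%.4f|%.4f", p, r, f1(p, r)));
    }
}
```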

IV. Source Code:

  • FrequencyScore.java
  • RankByFrequency.java
    => Gets the candidate with the top frequency score
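
Picking the top candidate with the CSpell-Dev-2 ranking keys (number of words, score, alphabetic) might look like the sketch below. It is not the code in RankByFrequency.java. The `Candidate` record and the sort directions (fewer words first, higher score first, alphabetical ascending) are assumptions, since the page lists the keys but not their order.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

final class RankByFrequencySketch {
    // Hypothetical candidate holder: the suggested text and its frequency score.
    record Candidate(String text, double score) {
        int wordCount() {
            return text.trim().split("\\s+").length;
        }
    }

    // Assumed directions: fewer words, then higher score, then alphabetical order.
    static final Comparator<Candidate> ORDER =
            Comparator.comparingInt(Candidate::wordCount)
                    .thenComparing(Comparator.comparingDouble(Candidate::score).reversed())
                    .thenComparing(Candidate::text);

    // Returns the best-ranked candidate, or empty if there are none.
    static Optional<Candidate> top(List<Candidate> candidates) {
        return candidates.stream().min(ORDER);
    }

    public static void main(String[] args) {
        List<Candidate> candidates = List.of(
                new Candidate("heart attack", 0.9),
                new Candidate("hearth", 0.3),
                new Candidate("heartache", 0.3));
        top(candidates).ifPresent(c -> System.out.println(c.text()));
    }
}
```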