CSpell

Dictionary Functions - Check Valid Word

I. Introduction

In cSpell, all tokens that are used for spelling error detection are single words. Thus, only single words need to be in the dictionary. This page describes which dictionary should be used for spelling error detection.

II. Algorithm

Both the whole token and its core-term are checked for valid spelling (Is-Valid-Word):

  • Check the whole token and core-term against the dictionary (case-insensitive)
  • Remove the possessive ('s) and check the original word against the dictionary (case-insensitive)
  • Check that all words in an or-slash-term are valid words
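
The three steps above can be sketched in Python as follows. This is a minimal illustration, not cSpell's actual implementation: the function names, the definition of core-term as the token with leading and trailing punctuation stripped, and the lowercase toy dictionary are all assumptions made for the example.

```python
import string
from typing import Set


def core_term(token: str) -> str:
    """Strip leading and trailing punctuation (assumed definition of core-term)."""
    return token.strip(string.punctuation)


def strip_possessive(word: str) -> str:
    """Remove a trailing possessive 's (straight or curly apostrophe), if present."""
    for suffix in ("'s", "\u2019s"):
        if word.lower().endswith(suffix):
            return word[: -len(suffix)]
    return word


def in_dictionary(word: str, dictionary: Set[str]) -> bool:
    """Case-insensitive lookup; the dictionary is assumed to hold lowercase words."""
    return word.lower() in dictionary


def is_valid_word(token: str, dictionary: Set[str]) -> bool:
    core = core_term(token)
    for cand in (token, core):
        if not cand:
            continue
        # Step 1: the whole token or its core-term is a dictionary word.
        if in_dictionary(cand, dictionary):
            return True
        # Step 2: remove the possessive and check the original word.
        if in_dictionary(strip_possessive(cand), dictionary):
            return True
    # Step 3: every word of an or-slash-term must itself be valid.
    if "/" in core:
        return all(part and is_valid_word(part, dictionary) for part in core.split("/"))
    return False


toy_dictionary = {"patient", "male", "female", "diabetes"}  # hypothetical entries
print(is_valid_word("patient's,", toy_dictionary))   # True
print(is_valid_word("male/female", toy_dictionary))  # True
print(is_valid_word("diabetis", toy_dictionary))     # False
```

Storing the dictionary in lowercase and lowercasing each lookup is one simple way to get the case-insensitive behavior described above.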

III. Results

Test cSpell with different dictionaries:

  • Is Valid (Spell Checker):
    • Dictionary.IsValidWord
      • possessives
      • or-slash terms
      • hyphen words (TBD)
    • Exceptions:
      • IsDigit
      • IsPunc
      • IsDigitPunc
      • IsMeasurements (units, digit + units)
      • IsUrl
      • IsEmail
      • IsEmptyString
      • IsProperNoun (dictionary based)
      • IsAbbAcr (dictionary based)
  • Candidates:
    • 1-to-1: mainDic
    • Split:
      • if whole split is a multiword
      • if all split words are digit, unit, or noAaDic*

* noAaDic: En + Pn
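
As a rough illustration of the exception checks and the split-candidate rule listed above, the sketch below uses simplified patterns. The unit list, the regular expressions, and all function names are assumptions for this example; cSpell's own rules are likely broader.

```python
import re
import string
from typing import AbstractSet, Sequence

# Hypothetical unit list and simplified patterns, for illustration only.
UNITS = {"mg", "ml", "kg", "cm", "mm"}
MEASURE_RE = re.compile(r"^\d+(?:\.\d+)?([A-Za-z]+)$")
URL_RE = re.compile(r"^(?:https?|ftp)://\S+$", re.IGNORECASE)
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(?:\.[\w-]+)+$")


def is_empty_string(tok: str) -> bool:
    return tok.strip() == ""


def is_digit(tok: str) -> bool:
    return tok.isdigit()


def is_punc(tok: str) -> bool:
    return bool(tok) and all(ch in string.punctuation for ch in tok)


def is_digit_punc(tok: str) -> bool:
    # Only digits and punctuation, e.g. "1,000" or "12/25".
    return bool(tok) and all(ch.isdigit() or ch in string.punctuation for ch in tok)


def is_measurement(tok: str) -> bool:
    # A bare unit ("mg") or digits followed by a unit ("10mg").
    if tok.lower() in UNITS:
        return True
    m = MEASURE_RE.match(tok)
    return m is not None and m.group(1).lower() in UNITS


def is_url(tok: str) -> bool:
    return URL_RE.match(tok) is not None


def is_email(tok: str) -> bool:
    return EMAIL_RE.match(tok) is not None


def is_exception(tok: str,
                 proper_nouns: AbstractSet[str] = frozenset(),
                 abb_acr: AbstractSet[str] = frozenset()) -> bool:
    """True if the token should be skipped by spelling error detection."""
    checks = (is_empty_string, is_digit, is_punc, is_digit_punc,
              is_measurement, is_url, is_email)
    if any(check(tok) for check in checks):
        return True
    # Dictionary-based checks (IsProperNoun, IsAbbAcr).
    return tok in proper_nouns or tok in abb_acr


def is_valid_split(words: Sequence[str],
                   multiwords: AbstractSet[str],
                   no_aa_dic: AbstractSet[str]) -> bool:
    """Accept a split if the whole split is a multiword, or if every
    piece is a digit, a unit, or a word in noAaDic."""
    if not words:
        return False
    if " ".join(words).lower() in multiwords:
        return True
    return all(is_digit(w) or w.lower() in UNITS or w.lower() in no_aa_dic
               for w in words)
```

For example, under these assumptions `is_exception("10mg")` and `is_exception("1,000")` are both true, while `is_valid_split(["hea", "rt"], set(), {"heart"})` is false because neither piece is a digit, a unit, or a dictionary word.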

eng_medical.dic:

  • element words from UMLS Strings
  • does not include proper nouns
  • does not include abbreviations or acronyms
  • does not include spelling variants

Lexicon.dic:

  • all: all words
  • sw: single words
  • mw: multiwords
  • ew: element words (unigram)
  • aa: abbreviations and acronyms
  • pn: proper nouns
  • sv: spelling variants
  • noAa: English words (all, excluding aa)
  • en: English words (all, excluding aa and pn)
  • swEn: single-word English words (single words, excluding aa and pn)
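
The derived subsets above can be expressed as set operations. The sketch below is illustrative only: the function name and the toy entries are hypothetical, and it assumes that "all" is the union of single words and multiwords.

```python
from typing import Dict, Set


def derive_subsets(sw: Set[str], mw: Set[str],
                   aa: Set[str], pn: Set[str]) -> Dict[str, Set[str]]:
    """Derive noAa, en, and swEn from the base word lists."""
    all_words = sw | mw              # assumption: all = single words + multiwords
    return {
        "all": all_words,
        "noAa": all_words - aa,      # all, excluding aa
        "en": all_words - aa - pn,   # all, excluding aa and pn
        "swEn": sw - aa - pn,        # single words, excluding aa and pn
    }


# Toy data, not taken from the Lexicon.
subsets = derive_subsets(
    sw={"heart", "disease", "NIH", "Parkinson"},
    mw={"heart disease"},
    aa={"NIH"},
    pn={"Parkinson"},
)
print(sorted(subsets["swEn"]))  # ['disease', 'heart']
```

With this toy data, noAa equals en plus the proper nouns, which matches the "noAaDic: En + Pn" note.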

IV. Tests

Test-1: Tests on Baseline + Lexicon (not used, results are included from above)

  • These tests use only one dictionary for both check and suggestion
  • The pn, aa, and sv checks are implemented in the algorithm
  • The results show that element words (ew: for better spelling error detection) combined with English words (en: for better candidate suggestion) perform best.
  • This result leads us to use two dictionaries: check and suggest
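
Results for each dictionary configuration (TP|Ret|Rel, Precision, Recall, F1):

Lexicon (single-word + multiwords)

  • NRVAR
  • lexicon.all
  TP|Ret|Rel: 535|858|814, Precision: 0.6235, Recall: 0.6572, F1: 0.6400

  • NRVAR
  • lexicon.en
  TP|Ret|Rel: 530|877|814, Precision: 0.6043, Recall: 0.6511, F1: 0.6268

Lexicon (single-words)

  • NRVAR
  • lexicon.ew
  TP|Ret|Rel: 531|808|814, Precision: 0.6572, Recall: 0.6523, F1: 0.6547

  • NRVAR
  • lexicon.sw
  TP|Ret|Rel: 535|858|814, Precision: 0.6235, Recall: 0.6572, F1: 0.6400

  • NRVAR
  • lexicon.swEn
  TP|Ret|Rel: 530|877|814, Precision: 0.6043, Recall: 0.6511, F1: 0.6268

Combined (10 spVar are included in Lexicon)

  • eng_medical.dic
  • NRVAR
  • lexicon.ew
  TP|Ret|Rel: 529|740|814, Precision: 0.7149, Recall: 0.6499, F1: 0.6808

  • eng_medical.dic
  • NRVAR
  • lexicon.sw
  TP|Ret|Rel: 533|745|814, Precision: 0.7154, Recall: 0.6548, F1: 0.6838

  • eng_medical.dic
  • NRVAR
  • SwEn
  TP|Ret|Rel: 537|745|814, Precision: 0.7208, Recall: 0.6597, F1: 0.6889

TBD

  • eng_medical.dic
  • 10 spVar dic
  • Customer
  TP|Ret|Rel: 549|745|814, Precision: 0.7369, Recall: 0.6744, F1: 0.7043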
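
The Precision, Recall, and F1 values reported above appear consistent with the standard definitions computed from the TP|Ret|Rel counts. The sketch below assumes that "Ret" is the number of retrieved (corrected) items and "Rel" the number of relevant (gold-standard) items. The page does not spell this out, but the arithmetic matches the reported values. The function name is hypothetical.

```python
from typing import Tuple


def precision_recall_f1(tp: int, ret: int, rel: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from TP, retrieved, and relevant counts."""
    precision = tp / ret if ret else 0.0
    recall = tp / rel if rel else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return round(precision, 4), round(recall, 4), round(f1, 4)


# First row of the results: TP|Ret|Rel = 535|858|814
print(precision_recall_f1(535, 858, 814))  # (0.6235, 0.6572, 0.64)
```

Because Rel is fixed at 814 for every run, recall depends only on TP, while precision improves when a dictionary change reduces Ret without losing true positives.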

Test-2: Tests on split dictionaries

  • Implemented two dictionaries: check and suggestion
  • Results show:
    • check dictionary should be validated element words (Lexicon.ew)
    • suggestion dictionary should be the one in the testing domain, excluding pn, aa, sv? (more tests on UMLS/MedLine, TBD)
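
The idea of separate check and suggestion dictionaries can be sketched as below. This is not cSpell's candidate generator or ranking: candidates are produced with Python's standard `difflib.get_close_matches` as a stand-in, and the function name, cutoff, and toy dictionaries are assumptions for illustration.

```python
import difflib
from typing import List, Set, Tuple


def check_and_suggest(token: str,
                      check_dic: Set[str],
                      suggest_dic: Set[str],
                      max_candidates: int = 3) -> Tuple[bool, List[str]]:
    """Detect errors with check_dic; draw candidates only from suggest_dic."""
    word = token.lower()
    if word in check_dic:
        # A valid word is never corrected, so no candidates are needed.
        return True, []
    # Stand-in candidate generation: closest matches by similarity ratio.
    candidates = difflib.get_close_matches(
        word, sorted(suggest_dic), n=max_candidates, cutoff=0.8
    )
    return False, candidates


# Toy dictionaries: the check dictionary is broader than the suggestion one.
check_dic = {"diabetes", "insulin", "nih"}
suggest_dic = {"diabetes", "insulin"}

print(check_and_suggest("NIH", check_dic, suggest_dic))       # (True, [])
print(check_and_suggest("diabetis", check_dic, suggest_dic))  # (False, ['diabetes'])
```

Keeping the two sets separate lets the check dictionary be broad, so fewer valid words are flagged, while the suggestion dictionary stays focused on the domain, so fewer unlikely candidates are proposed.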

Results for each dictionary configuration (TP|Ret|Rel, Precision, Recall, F1, Notes):

Use Baseline Dictionary for check and suggest

  • eng_medical.dic
  TP|Ret|Rel: 546|820|814, Precision: 0.6659, Recall: 0.6708, F1: 0.6683

  • eng_medical.dic
  • 10 spVar dic
  • No Pn or Aa Check
  TP|Ret|Rel: 547|810|814, Precision: 0.6753, Recall: 0.6720, F1: 0.6736
  Notes: Add 10 files for spVars

  • eng_medical.dic
  • 10 spVar dic
  • With Pn Check
  TP|Ret|Rel: 548|765|814, Precision: 0.7163, Recall: 0.6732, F1: 0.6941
  Notes: Check proper nouns from Lexicon

  • eng_medical.dic
  • 10 spVar dic
  • With Aa Check
  TP|Ret|Rel: 547|804|814, Precision: 0.6803, Recall: 0.6720, F1: 0.6761
  Notes: Check Abb/Acr from Lexicon

  • eng_medical.dic
  • 10 spVar dic
  • With Pn and Aa Check
  TP|Ret|Rel: 548|759|814, Precision: 0.7220, Recall: 0.6732, F1: 0.6968
  Notes: Check proper nouns/Abb/Acr from Lexicon

  • eng_medical.dic
  • 10 spVar dic
  • With Pn, Aa, Sv Check
  TP|Ret|Rel: 544|747|814, Precision: 0.7282, Recall: 0.6683, F1: 0.6970
  Notes: Add SpVar from Lexicon

  • eng_medical.dic
  • With Pn, Aa, Sv Check
  TP|Ret|Rel: 543|746|814, Precision: 0.7279, Recall: 0.6671, F1: 0.6962
  Notes: Replace 10 files by Lexicon.spVar

  • eng_medical.dic
  • Lexicon.sv
  • With Pn, Aa, Sv Check
  TP|Ret|Rel: 543|749|814, Precision: 0.7250, Recall: 0.6671, F1: 0.6948
  Notes: Adding SpVar to the dic decreases F1 because it is also used for suggestion (need a better ranking system)

  • eng_medical.dic
  • NRVAR
  • With Pn, Aa, Sv Check
  TP|Ret|Rel: 543|746|814, Precision: 0.7279, Recall: 0.6671, F1: 0.6962
  Notes: Add number, no change because of the data

Implement 2 Dictionaries: Check + Suggest

Find the Check dictionary

  • CheckDic:
    • eng_medical.dic
    • NRVAR
    • Lexicon.all
    • With Pn, Aa, Sv Check
  • SuggestDic:
    • eng_medical.dic
    • NRVAR
  TP|Ret|Rel: 543|739|814, Precision: 0.7348, Recall: 0.6671, F1: 0.6993
  Notes: Increase check dic by adding Lexicon, but only use baseline dic for suggestion

  • CheckDic:
    • NRVAR
    • Lexicon.all
    • With Pn, Aa, Sv Check
  • SuggestDic:
    • eng_medical.dic
    • NRVAR
  TP|Ret|Rel: 543|739|814, Precision: 0.7348, Recall: 0.6671, F1: 0.6993
  Notes: Remove baseline dic from check, same result, meaning the baseline dictionary is a subset of Lexicon for the test data.

  • CheckDic:
    • NRVAR
    • Lexicon.all
  • SuggestDic:
    • eng_medical.dic
    • NRVAR
  TP|Ret|Rel: 543|739|814, Precision: 0.7348, Recall: 0.6671, F1: 0.6993
  Notes: Remove algorithm check on pn, aa, sv because Lexicon includes all these words.

  • CheckDic:
    • NRVAR
    • Lexicon.sw
    • Lexicon.pn
    • Lexicon.Aa
  • SuggestDic:
    • eng_medical.dic
    • NRVAR
  TP|Ret|Rel: 543|739|814, Precision: 0.7348, Recall: 0.6671, F1: 0.6993
  Notes: Use Lexicon sw, pn, aa to replace (all); results are the same, as expected.

  • CheckDic:
    • NRVAR
    • Lexicon.ew
    • Lexicon.pn
    • Lexicon.Aa
  • SuggestDic:
    • eng_medical.dic
    • NRVAR
  TP|Ret|Rel: 540|734|814, Precision: 0.7357, Recall: 0.6634, F1: 0.6977
  Notes: Use element words to replace single words, worse result (because of a mistake in Lexicon)

  • CheckDic:
    • NRVAR
    • Lexicon.ew.fixed
    • Lexicon.pn
    • Lexicon.Aa
  • SuggestDic:
    • eng_medical.dic
    • NRVAR
  TP|Ret|Rel: 543|737|814, Precision: 0.7368, Recall: 0.6671, F1: 0.7002
  Notes: Fix Lexicon mistake - remove "diaphram" from ew to get better result

  • CheckDic:
    • NRVAR
    • Lexicon.ew.fixed
  • SuggestDic:
    • eng_medical.dic
    • NRVAR
  TP|Ret|Rel: 543|737|814, Precision: 0.7368, Recall: 0.6671, F1: 0.7002
  Notes: Remove pn and aa because they are included in ew

  • CheckDic:
    • NRVAR
    • Lexicon.ew.fixed
    • medlineDic.data
  • SuggestDic:
    • eng_medical.dic
    • NRVAR
  TP|Ret|Rel: 529|695|814, Precision: 0.7612, Recall: 0.6499, F1: 0.7011
  Notes: Add medlineDic. TBD: use frequency as a filter to refine medlineDic.

Test-3: Tests on the suggestion dictionary

  • Need a dictionary that focuses on the domain
  • The performance is coupled with the candidate and ranking algorithms; an analysis tool is needed for better understanding.

Results for each dictionary configuration (TP|Ret|Rel, Precision, Recall, F1, Notes):

Find the Suggest dictionary

  • CheckDic:
    • NRVAR
    • Lexicon.ew.fixed
  • SuggestDic:
    • Lexicon.swEn
    • NRVAR
  TP|Ret|Rel: 528|812|814, Precision: 0.6502, Recall: 0.6486, F1: 0.6494
  Notes: Use swEn (excludes pn and aa), too many FPs

  • CheckDic:
    • NRVAR
    • Lexicon.ew.fixed
  • SuggestDic:
    • Lexicon.swEn
    • Lexicon.pn
    • NRVAR
  TP|Ret|Rel: 529|813|814, Precision: 0.6507, Recall: 0.6499, F1: 0.6503
  Notes: Use swEn + pn (excludes aa), too many FPs

  • CheckDic:
    • NRVAR
    • Lexicon.ew.fixed
  • SuggestDic:
    • Lexicon.swEn
    • Lexicon.aa
    • NRVAR
  TP|Ret|Rel: 532|808|814, Precision: 0.6584, Recall: 0.6536, F1: 0.6560
  Notes: Use swEn + aa (excludes pn), too many FPs

  • CheckDic:
    • NRVAR
    • Lexicon.ew.fixed
  • SuggestDic:
    • Lexicon.swEn
    • Lexicon.pn
    • Lexicon.aa
    • NRVAR
  TP|Ret|Rel: 534|808|814, Precision: 0.6609, Recall: 0.6560, F1: 0.6584
  Notes: Use swEn + aa + pn, too many FPs

  • CheckDic:
    • NRVAR
    • Lexicon.ew.fixed
  • SuggestDic:
    • Lexicon.sw
    • NRVAR
  TP|Ret|Rel: 534|808|814, Precision: 0.6609, Recall: 0.6560, F1: 0.6584
  Notes: Use sw (= swEn + aa + pn), too many FPs

  • CheckDic:
    • NRVAR
    • Lexicon.ew.fixed
  • SuggestDic:
    • Lexicon.all
    • NRVAR
  TP|Ret|Rel: 534|808|814, Precision: 0.6609, Recall: 0.6560, F1: 0.6584
  Notes: Use all, the single words are the same as in sw, same results, too many FPs

  • CheckDic:
    • NRVAR
    • Lexicon.ew.fixed
  • SuggestDic:
    • Lexicon.ew.fixed
    • NRVAR
  TP|Ret|Rel: 534|814|814, Precision: 0.6560, Recall: 0.6560, F1: 0.6560
  Notes: Use ew, too many FPs

Current best

  • CheckDic:
    • NRVAR
    • Lexicon.ew (validated)
    • Customers
  • SuggestDic:
    • eng_medical.dic
    • NRVAR
  TP|Ret|Rel: 544|725|814, Precision: 0.7503, Recall: 0.6683, F1: 0.7070
  Notes: Add customers dic for better results