CSpell

CSpell Dictionary Summary

This page summarizes the dictionary files used in CSpell.

I. CSpell Dictionary files

Type | Source files | Notes
CS_CHECK_DIC_FILES | check.dic
  • Generic dictionary for checking valid unigrams
  • Used in NW/RW, Split/1To1 detectors
CS_SUGGEST_DIC_FILES | sugg.dic (same as check.dic)
  • Generic dictionary for checking valid unigrams
  • Used in NW/RW, Split/1To1 detectors
CS_SPLIT_WORD_DIC_FILE | split.dic
  • Used in NW/RW Merge detectors
  • Used in NW/RW Split candidates
CS_MW_DIC_FILE | lexicon.mw.dic
  • Used in NW/RW Split candidates (checks whether the split candidate is a multiword)
  • Used in NW/RW Merge candidates (if the focus token and context tokens form a multiword, they are not merged)
CS_UNIT_DIC_FILE | unit.data
  • Used in NW/RW Merge/Split/1To1 detectors to check exceptions
  • Used in RW Split candidates (a split word can't be a unit, such as mg)
CS_SV_DIC_FILE | sv.dic
  • Spelling variants
  • Not used for now
  • To be used for RW-1To1 Detector
CS_AA_DIC_FILE | lexicon.aa.dic
  • Abbreviations and acronyms in Lexicon
  • Used in NW/RW Merge candidates (don't merge the context if it is an AA)
  • Used in RW 1-to-1 detector
CS_PN_DIC_FILE | lexicon.pn.dic
  • Proper nouns in Lexicon
  • Used in RW Split candidates (a split word can't be a proper noun)
  • Used in RW 1-to-1 detector
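
As a rough illustration of how the dictionaries above might be consulted, the sketch below loads word lists into sets and applies two of the checks named in the table: a unigram validity test against the check dictionary, and the rule that an RW split candidate is rejected if any split word is a unit or a proper noun. This is not CSpell's implementation. The one-entry-per-line file format, the case-insensitive lookup, and the names load_dictionary, is_valid_unigram and is_allowed_rw_split are assumptions made for illustration.

```python
from typing import Iterable, Set


def load_dictionary(lines: Iterable[str]) -> Set[str]:
    """Build a lookup set from dictionary lines (assumed: one entry per line)."""
    return {line.strip().lower() for line in lines if line.strip()}


def is_valid_unigram(token: str, check_dic: Set[str]) -> bool:
    """Return True if the token is a known unigram in the check dictionary."""
    return token.lower() in check_dic


def is_allowed_rw_split(parts: Iterable[str], unit_dic: Set[str], pn_dic: Set[str]) -> bool:
    """Reject an RW split candidate if any split word is a unit or a proper noun."""
    return all(p.lower() not in unit_dic and p.lower() not in pn_dic for p in parts)


# Tiny in-memory stand-ins for check.dic, unit.data and lexicon.pn.dic
# (hypothetical contents, not taken from the real files).
check_dic = load_dictionary(["diabetes\n", "type\n", "some\n", "thing\n"])
unit_dic = load_dictionary(["mg\n", "ml\n"])
pn_dic = load_dictionary(["Parkinson\n"])

print(is_valid_unigram("Diabetes", check_dic))
print(is_allowed_rw_split(["some", "thing"], unit_dic, pn_dic))
print(is_allowed_rw_split(["10", "mg"], unit_dic, pn_dic))
```

Because load_dictionary accepts any iterable of lines, an open file handle can be passed in directly in place of the in-memory lists.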

II. Source Dictionary Files

File | Source | Notes
Lexicon Release
NRVAR.1.uSort.data | Lexicon | Lexicon number variants
  • NRVAR
  • field 1
  • uniquely sorted
lexicon.ew.dic | Lexicon | Lexicon Element Words
  • Unigrams of all lexical entries
lexicon.enEwLc.dic.addRm | Lexicon | Lexicon English Element Words, lowercase
English = Lexicon - AA - PN (element words that are neither abbreviations/acronyms nor proper nouns)

remove:

  • amita
  • anil
  • anser
  • catacholamine
  • diaphram
  • flavanoid
  • flavanoids
  • glucoma
  • losangeles
  • palmita

add:

  • i'm
  • i've
  • medline
  • medlineplus
  • y'all
lexicon.swNoAaLc.dic | Lexicon | Lexicon Single Words, not AA, lowercase
  • Single word
  • Not pure AA ("aids" is included because it is also the third-person singular of "aid")
  • Lowercase
lexicon.mw.dic | Lexicon | Lexicon Multiwords
lexicon.aa.dic | Lexicon | Lexicon Abbreviations and Acronyms
lexicon.pn.dic | Lexicon | Lexicon Proper Nouns
lexicon.sv.dic | Lexicon | Lexicon Spelling Variants
Consumer Health Related Data
Med.l.dic | Med from UMLS-ST
  • Using Lexicon as English Dictionary
Med.cm-l.dic | Med from UMLS-ST
  • cm: Consumer and Medline data
  • -l: exclude Lexicon
Others
unit.data | Lexicon PreProcess | Unit collection from the Lexicon multiword generation process
cistomerDic.data | CSpell PreProcess | Empirical data collection (entries that are not from Lexicon or Consumer Health Data)
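
The derivation notes for the Lexicon release files amount to simple set operations. The sketch below shows two of them under stated assumptions: building the English element-word list as Lexicon element words minus abbreviations/acronyms and proper nouns, lowercased, with the remove and add lists applied afterwards; and producing a uniquely sorted list from field 1 of a delimited file, as described for NRVAR.1.uSort.data. The "|" field separator, the 1-based field numbering, the order of operations, and all function names are assumptions for illustration, not taken from the CSpell preprocessing scripts.

```python
from typing import Iterable, List, Set


def english_element_words(ew: Set[str], aa: Set[str], pn: Set[str],
                          remove: Set[str], add: Set[str]) -> Set[str]:
    """English = Lexicon - AA - PN, lowercased, then remove/add lists applied."""
    base = {w.lower() for w in ew - aa - pn}
    return (base - remove) | add


def unique_sorted_field(lines: Iterable[str], field: int = 1, sep: str = "|") -> List[str]:
    """Collect one field (1-based, assumed '|'-separated) and sort it uniquely."""
    values = set()
    for line in lines:
        cols = line.rstrip("\n").split(sep)
        if len(cols) >= field and cols[field - 1]:
            values.add(cols[field - 1])
    return sorted(values)


# Hypothetical toy inputs; "glucoma", "i'm" and "medline" appear in the
# remove/add lists of lexicon.enEwLc.dic.addRm, the rest is made up.
ew = {"aid", "AIDS", "Boston", "glucoma", "glaucoma"}
aa = {"AIDS"}
pn = {"Boston"}
english = english_element_words(ew, aa, pn, remove={"glucoma"}, add={"i'm", "medline"})
print(sorted(english))

# Hypothetical rows in an assumed "field1|field2|field3" layout.
rows = ["two|2|num\n", "one|1|num\n", "two|ii|num\n"]
print(unique_sorted_field(rows))
```

Keeping each source as a set makes the "-l: exclude Lexicon" step for Med.cm-l.dic expressible the same way, as a set difference against the Lexicon word set.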