CSpell

Dictionary

I. Introduction

This page describes the sources and functions of the dictionaries used in CSpell. A dictionary is used for:

  • spelling error detection (check if a term is a spelling error).
  • candidate suggestions (suggest correct words).
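
The two roles listed above can be sketched with a single word set. This is a minimal illustration, not CSpell's implementation: the class name `SimpleDictionary`, its methods, and the use of edit distance to generate candidates are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch (not a CSpell class): one word set serving both roles.
class SimpleDictionary {
    private final Set<String> words = new TreeSet<>();

    SimpleDictionary(List<String> entries) {
        for (String entry : entries) {
            words.add(entry.toLowerCase(Locale.ROOT));
        }
    }

    // Detection: a term that is not in the dictionary is flagged as a non-word error.
    boolean isSpellingError(String term) {
        return !words.contains(term.toLowerCase(Locale.ROOT));
    }

    // Suggestion (assumed strategy): dictionary words within maxDistance edits of the term.
    List<String> suggest(String term, int maxDistance) {
        String lower = term.toLowerCase(Locale.ROOT);
        List<String> candidates = new ArrayList<>();
        for (String word : words) {
            if (editDistance(lower, word) <= maxDistance) {
                candidates.add(word);
            }
        }
        return candidates;
    }

    // Levenshtein distance computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        SimpleDictionary dict = new SimpleDictionary(List.of("during", "human", "diabetes"));
        System.out.println(dict.isSpellingError("diabetis")); // true
        System.out.println(dict.suggest("diabetis", 1));      // [diabetes]
    }
}
```

Both methods consult the same word set here, so any change in coverage affects detection and suggestion at the same time.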

The performance of CSpell depends on the coverage and quality of the dictionary. The effects fall into two categories, detection and correction (suggestion), which are briefly described below. In general:

  • Detection:
    • The smaller the coverage of the dictionary, the higher the recall of spelling error detection (because most words will be flagged as spelling errors). A small dictionary therefore has low precision for detection.
    • The goal is to increase the coverage of the dictionary so that precision increases while recall is preserved.
    • If the coverage of a dictionary is too large, many spelling errors cannot be detected, which results in lower recall with high precision (because the few detected errors are real spelling errors).
    • For example, if a dictionary includes abbreviations/acronyms, a spelling error that matches an abbreviation/acronym will not be identified as a non-word spelling error.
  • Correction (Suggestion):
    • The larger the coverage of the dictionary, the more suggestions can be found for spelling errors, which results in higher recall of suggested candidates.
    • The precision of spelling correction depends on two factors:
      • whether the suggested candidates include the correct word
      • whether the ranking algorithm is capable of choosing the correct word from the candidates.

The table below summarizes the statements above.

Parameters           Values  Spelling Error Detection  Spelling Error Correction
Dictionary Coverage  small   • recall: large           • recall (candidates): small
                             • precision: small        • precision (ranking): large + ranking algorithm
                     large   • recall: small           • recall (candidates): large
                             • precision: large        • precision (ranking): large + ranking algorithm
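
The detection trade-off summarized above can be checked on toy data. The sketch below assumes that a token is flagged as an error exactly when it is absent from the dictionary. The class name `DetectionTradeoff`, the token list, the set of true errors, and both dictionaries are made up for illustration and are not CSpell data.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: precision and recall of non-word detection for a given dictionary.
class DetectionTradeoff {
    // Returns {precision, recall}; a token is flagged when it is not in the dictionary.
    static double[] precisionRecall(List<String> tokens, Set<String> trueErrors, Set<String> dictionary) {
        int truePositives = 0;
        int falsePositives = 0;
        int falseNegatives = 0;
        for (String token : tokens) {
            boolean flagged = !dictionary.contains(token);
            boolean isError = trueErrors.contains(token);
            if (flagged && isError) {
                truePositives++;
            } else if (flagged) {
                falsePositives++;
            } else if (isError) {
                falseNegatives++;
            }
        }
        int flaggedTotal = truePositives + falsePositives;
        int errorTotal = truePositives + falseNegatives;
        double precision = flaggedTotal == 0 ? 0.0 : (double) truePositives / flaggedTotal;
        double recall = errorTotal == 0 ? 0.0 : (double) truePositives / errorTotal;
        return new double[] {precision, recall};
    }

    public static void main(String[] args) {
        // Made-up sample: "dur" and "diabetis" are the only real spelling errors.
        List<String> tokens = List.of("the", "patient", "had", "dur", "diabetis", "during", "surgery");
        Set<String> trueErrors = Set.of("dur", "diabetis");

        Set<String> small = Set.of("the", "had");
        // The large dictionary also contains the acronym "dur", which hides that error.
        Set<String> large = Set.of("the", "patient", "had", "diabetes", "during", "surgery", "dur");

        double[] s = precisionRecall(tokens, trueErrors, small);
        double[] l = precisionRecall(tokens, trueErrors, large);
        System.out.println(String.format(Locale.ROOT, "small: precision=%.2f recall=%.2f", s[0], s[1]));
        // small: precision=0.40 recall=1.00
        System.out.println(String.format(Locale.ROOT, "large: precision=%.2f recall=%.2f", l[0], l[1]));
        // large: precision=1.00 recall=0.50
    }
}
```

With the small dictionary every real error is caught but three valid words are flagged as well. With the large dictionary nothing valid is flagged, but the error that matches an acronym goes undetected.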

It is important to develop multiple dictionaries that serve different purposes in order to reach the best performance in spelling error detection or suggestion. The following examples show a valid and an invalid correction:

Input Text  Corrected Text  Correction               Notes
dur ing     during          valid non-word merge     dictionary excludes acronyms and abbreviations for split:
                                                     • dur is an acronym of drug use review
                                                     • ing is an acronym of isotope nephrogram
human       hu man          invalid real-word split  dictionary excludes proper noun for merge
                                                     • Hu is a proper noun
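
The two examples above can be reproduced with a small sketch that shows why the contents of the dictionary consulted for each operation matter. The rules used here are assumptions made for illustration, not CSpell's documented logic: a merge is attempted only when at least one token is a non-word and the joined form is a known word, and a split is accepted only when both parts are known words. The class name `PurposeDictionaries` and the word lists are hypothetical, and all entries are lowercased for simplicity.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: the same rule gives different results with different dictionaries.
class PurposeDictionaries {
    // Assumed merge rule: at least one token is a non-word and the joined form is a known word.
    static boolean canMerge(String left, String right, Set<String> dictionary) {
        boolean hasNonWord = !dictionary.contains(left) || !dictionary.contains(right);
        return hasNonWord && dictionary.contains(left + right);
    }

    // Assumed split rule: both parts must be known words.
    static boolean canSplit(String left, String right, Set<String> dictionary) {
        return dictionary.contains(left) && dictionary.contains(right);
    }

    static Set<String> union(Set<String> a, Set<String> b) {
        Set<String> result = new HashSet<>(a);
        result.addAll(b);
        return result;
    }

    public static void main(String[] args) {
        Set<String> general = Set.of("during", "human", "man");
        Set<String> acronyms = Set.of("dur", "ing");  // drug use review, isotope nephrogram
        Set<String> properNouns = Set.of("hu");       // "Hu", lowercased

        // "dur ing" is merged into "during" only when acronyms are excluded.
        System.out.println(canMerge("dur", "ing", general));                       // true
        System.out.println(canMerge("dur", "ing", union(general, acronyms)));      // false

        // "human" is wrongly split into "hu man" only when proper nouns are included.
        System.out.println(canSplit("hu", "man", general));                        // false
        System.out.println(canSplit("hu", "man", union(general, properNouns)));    // true
    }
}
```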

II. Sources

III. Components (Java Classes)

IV. TBD

  • Multiple functions (dictionaries) are needed to handle abbreviations/acronyms, proper nouns, etc.
    The current approach uses a very simple model (case) to identify them; real data should improve the performance.
  • Implement in a database.
  • Try inverted files.
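
As an illustration of the case-based model mentioned in the first item, a classifier of that kind might look like the sketch below. The specific rules (all uppercase means abbreviation/acronym, initial capital followed by lowercase means proper noun) are assumptions, since the page does not spell them out, and `CaseClassifier` is a hypothetical name.

```java
// Hypothetical sketch of a case-only model; the rules are assumed, not taken from CSpell.
class CaseClassifier {
    enum Kind { ABBREVIATION_OR_ACRONYM, PROPER_NOUN, OTHER }

    static Kind classify(String token) {
        if (token.length() < 2) {
            return Kind.OTHER;
        }
        // Every character is an uppercase letter, e.g. "DUR".
        if (token.chars().allMatch(Character::isUpperCase)) {
            return Kind.ABBREVIATION_OR_ACRONYM;
        }
        // Initial capital followed only by lowercase letters, e.g. "Hu".
        if (Character.isUpperCase(token.charAt(0))
                && token.substring(1).chars().allMatch(Character::isLowerCase)) {
            return Kind.PROPER_NOUN;
        }
        return Kind.OTHER;
    }

    public static void main(String[] args) {
        System.out.println(classify("DUR"));   // ABBREVIATION_OR_ACRONYM
        System.out.println(classify("Hu"));    // PROPER_NOUN
        System.out.println(classify("human")); // OTHER
    }
}
```

A model like this also labels any capitalized sentence-initial word as a proper noun, which is one reason lists built from real data would be expected to perform better.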