CSpell

Non-dictionary-based Corrections

This is the first step of spelling correction. It corrects errors that do not require a dictionary. The non-dictionary-based correction model consists of handlers and splitters, arranged as a chain of intermediate operators. They handle HTML/XML tags introduced by the software that consumers use to ask questions, as well as informal expressions. They also handle missing spaces next to punctuation or digits. Pattern matching (regular expressions) and table lookup are used for this type of correction. The software components developed to resolve these issues are as follows:

  • XML/HTML Handler
  • Informal Expression Handler
  • Splitter:
    If errors in the input text are caused by missing space(s), a splitter can correct them by adding a space at the right position. This type of error is often seen in free text from consumer data. Split functions fall into two categories:

    • Non-dictionary-based Splitters:
      This type of error can be detected and corrected without a dictionary. Such errors involve leading digits, ending digits, leading punctuation, or ending punctuation. Please refer to the following components for details:

      Type of Splitter              Error        Correction    File Name
      Leading Digit Splitter        20years      20 years      10349
      Ending Digit Splitter         disease3     disease 3     26
      Leading Punctuation Splitter  volunteers(  volunteers (  12353
      Ending Punctuation Splitter   cancer?if    cancer? if    10004

    • Dictionary-based Splitters:
      This type of error can be identified and corrected with knowledge of a dictionary. These splitters are discussed in more detail in the dictionary-based section. Some examples are shown below:
      File Name  Error       Correction
      14         knowabout   know about
      26         diseaseany  disease any
      11841      Iam         I am
      11186      tbinthe     tb in the
      14849      shuntfrom   shunt from
      10349      along       a long
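
The two kinds of splitter described above can be illustrated with a minimal Python sketch. This is not CSpell's implementation: the regular expressions, the function names (split_non_dictionary, split_with_dictionary) and the tiny DEMO_DICTIONARY are assumptions made up for this example, and a real system would need exception lists (for instance for tokens such as "B12") that are omitted here. The dictionary-based part only handles tokens that are not themselves dictionary words, so a real-word case such as "along" to "a long" is out of its scope.

```python
import re

# Hypothetical rules, one per non-dictionary splitter type in the table above.
# Each pair is (pattern, replacement); the replacement inserts a single space.
NON_DICTIONARY_RULES = [
    (re.compile(r"^(\d+)([A-Za-z]+)$"), r"\1 \2"),         # leading digit: 20years
    (re.compile(r"^([A-Za-z]+)(\d+)$"), r"\1 \2"),         # ending digit: disease3
    (re.compile(r"([A-Za-z0-9])([(\[{])"), r"\1 \2"),      # leading punctuation: volunteers(
    (re.compile(r"([?!,;:)\]}])([A-Za-z])"), r"\1 \2"),    # ending punctuation: cancer?if
]


def split_non_dictionary(token: str) -> str:
    """Apply every pattern-based rule in turn; no dictionary is consulted."""
    for pattern, replacement in NON_DICTIONARY_RULES:
        token = pattern.sub(replacement, token)
    return token


# Toy word list, assumed for illustration only.
DEMO_DICTIONARY = {
    "know", "about", "disease", "any", "i", "am",
    "tb", "in", "the", "shunt", "from",
}


def split_with_dictionary(token, dictionary, max_parts=3):
    """Return the split of token into the fewest dictionary words.

    Returns a list of words, or None when no split into at most
    max_parts dictionary words exists.
    """
    if token.lower() in dictionary:
        return [token]
    if max_parts <= 1:
        return None
    best = None
    for i in range(1, len(token)):
        head, tail = token[:i], token[i:]
        if head.lower() not in dictionary:
            continue
        rest = split_with_dictionary(tail, dictionary, max_parts - 1)
        if rest is not None and (best is None or len(rest) + 1 < len(best)):
            best = [head] + rest
    return best


if __name__ == "__main__":
    for raw in ["20years", "disease3", "volunteers(", "cancer?if"]:
        print(raw, "->", split_non_dictionary(raw))
    for raw in ["knowabout", "Iam", "tbinthe"]:
        print(raw, "->", " ".join(split_with_dictionary(raw, DEMO_DICTIONARY)))
```

The pattern-based rules need no word list at all, which is why they can run as an early step, while the dictionary-based function depends entirely on the quality of the word list it is given.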