CSpell

Non-dictionary-based Corrections

This is the first step of spelling correction. It corrects errors that do not require a dictionary. The non-dictionary-based correction model consists of handlers and splitters, arranged as a chain of intermediate operators. They handle HTML/XML tags introduced by the software that consumers use to ask questions, as well as informal expressions. They also handle missing spaces next to punctuation or digits. Pattern matching (regular expressions) and table lookup are used in this type of correction. The software components developed to resolve these issues are detailed as follows:

XML/HTML Handler
Informal Expression Handler
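
The chain of handlers described above can be sketched as follows. This is only an illustrative sketch, not the CSpell implementation: the function names, the tag pattern, and every entry in the lookup table are hypothetical, and the exact behavior of each handler (removing tags, decoding character entities, expanding informal expressions) is an assumption made for the example.

```python
import html
import re

# Assumed pattern for an opening or closing tag such as <p> or </b>.
TAG_PATTERN = re.compile(r"</?[A-Za-z][^<>]*>")

# Hypothetical informal-expression table; the entries are made up for illustration.
INFORMAL_TABLE = {"pls": "please", "u": "you", "thx": "thanks"}


def handle_markup(text):
    """Pattern match: replace tags with a space, then decode entities like &amp;."""
    return html.unescape(TAG_PATTERN.sub(" ", text))


def handle_informal(text):
    """Table lookup: replace each token found in the table, keep the others."""
    return " ".join(INFORMAL_TABLE.get(tok.lower(), tok) for tok in text.split())


def correct(text, handlers=(handle_markup, handle_informal)):
    """Run the handlers in order, each one consuming the previous output."""
    for handler in handlers:
        text = handler(text)
    # Normalize the whitespace left behind by removed tags.
    return " ".join(text.split())
```

Each handler takes and returns plain text, so further operators (such as the splitters below) could be appended to the `handlers` tuple without changing `correct`.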

Splitter:
If errors in the input text are caused by missing space(s), the splitter can correct them by adding a space at the right position. This type of error is often seen in free text from consumer data. Split functions fall into two categories:

Non-dictionary-based Splitters:
This type of error can be detected and corrected without a dictionary. Such errors are associated with leading digits, ending digits, leading punctuation, or ending punctuation. Please refer to the following components for details:

Types of Splitter	Error	Correction	File Name
Leading Digit Splitter	20years	20 years	10349
Ending Digit Splitter	disease3	disease 3	26
Leading Punctuation Splitter	volunteers(	volunteers (	12353
Ending Punctuation Splitter	cancer?if	cancer? if	10004
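
The four cases in the table can be sketched with regular expressions, one pattern per splitter. This is a minimal sketch under stated assumptions: the pattern definitions and the names `SPLIT_PATTERNS` and `split_token` are hypothetical, not taken from CSpell, and exceptions that a real system would need (ordinals such as "2nd", units, abbreviations, URLs) are deliberately not handled.

```python
import re

# Hypothetical patterns, one per splitter type in the table above.
SPLIT_PATTERNS = [
    re.compile(r"(\d+)([A-Za-z]+)"),             # leading digit: 20years
    re.compile(r"([A-Za-z]+)(\d+)"),             # ending digit: disease3
    re.compile(r"([A-Za-z]+)([(\[{])"),          # leading punctuation: volunteers(
    re.compile(r"([A-Za-z]+[?!,;:])([A-Za-z]+)"),  # ending punctuation: cancer?if
]


def split_token(token):
    """Insert one space where the first matching pattern says it is missing."""
    for pattern in SPLIT_PATTERNS:
        match = pattern.fullmatch(token)
        if match:
            return match.group(1) + " " + match.group(2)
    return token  # no pattern applies, leave the token unchanged


def split_text(text):
    """Apply split_token to every whitespace-delimited token."""
    return " ".join(split_token(tok) for tok in text.split())
```

The period is left out of the ending-punctuation pattern on purpose, since splitting after a period would also break tokens such as abbreviations and host names.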

Dictionary-based Splitters:
This type of error can be identified and corrected with the knowledge of a dictionary. These errors are discussed in more detail in the dictionary-based section. Some examples are shown below:

File Name	Error	Correction
14	knowabout	know about
26	diseaseany	disease any
11841	Iam	I am
11186	tbinthe	tb in the
14849	shuntfrom	shunt from
10349	along	a long
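
A dictionary-based split can be sketched as a search for split positions at which every piece is a dictionary word. The sketch below is illustrative only: the word list is a toy stand-in for a real dictionary, and `split_by_dictionary` and `max_parts` are hypothetical names. It prefers the split with the fewest pieces, which is a simplification and not necessarily how CSpell ranks candidates.

```python
# Toy word list for illustration; a real dictionary is far larger.
DICTIONARY = {
    "know", "about", "disease", "any", "i", "am",
    "tb", "in", "the", "shunt", "from", "along",
}


def split_by_dictionary(token, dictionary=DICTIONARY, max_parts=3):
    """Split a non-word into at most max_parts dictionary words, if possible."""
    if token.lower() in dictionary:
        return token  # already a known word, nothing to split

    def segment(rest, parts_left):
        # A remainder that is itself a word ends the search on this branch.
        if rest.lower() in dictionary:
            return [rest]
        if parts_left <= 1:
            return None
        best = None
        for i in range(1, len(rest)):
            head = rest[:i]
            if head.lower() not in dictionary:
                continue
            tail = segment(rest[i:], parts_left - 1)
            # Keep the candidate with the fewest pieces.
            if tail is not None and (best is None or len(tail) + 1 < len(best)):
                best = [head] + tail
        return best

    parts = segment(token, max_parts)
    return " ".join(parts) if parts else token
```

Note that "along" is itself a valid word, so a dictionary lookup alone leaves it unchanged; correcting it to "a long" needs context in addition to the dictionary, which this sketch does not attempt.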