CSpell

Non-dictionary-based Splitters

Description:
A splitter corrects agglutination (missing spaces between two or more words) by inserting one or more spaces, which splits a single token into two or more tokens. Two types of splitters were developed:
- Non-dictionary-based splitter: [20years] -> [20 years]
- Dictionary-based splitter: [knowabout] -> [know about]
This section describes the non-dictionary-based splitters, which correct missing space(s) around punctuation and digits. These splitters rely only on the shape of the token; no dictionary knowledge is required. The non-dictionary-based splitters include:
Design:
Non-dictionary-based Splitter:
- splitNo <= 5 (configurable: CS_CAN_ND_MAX_SPLIT_NO)
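
The source does not spell out what splitNo counts. As a minimal sketch, assuming it caps the number of spaces inserted into a single token, the limit could be enforced as below. The names MAX_SPLIT_NO and split_with_limit and the digit/letter boundary rule are illustrative assumptions, not part of CSpell:

```python
import re

# Assumed default, mirroring the documented limit "splitNo <= 5".
MAX_SPLIT_NO = 5

# Hypothetical split rule: a zero-width boundary between a digit and a letter.
_BOUNDARY = re.compile(r"(?<=\d)(?=[A-Za-z])|(?<=[A-Za-z])(?=\d)")


def split_with_limit(token: str, max_split_no: int = MAX_SPLIT_NO) -> str:
    """Insert a space at digit/letter boundaries, at most max_split_no times."""
    # The count argument of re.sub caps how many boundaries are rewritten.
    return _BOUNDARY.sub(" ", token, count=max_split_no)


print(split_with_limit("30years"))   # 30 years
print(split_with_limit("1a2b3c4d"))  # 1 a 2 b 3 c4d (stops after 5 splits)
```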

Example Walk-through (Leading-digit):

Steps	Example-1	Example-2	Notes
Input	30years	30th.
CoreTerm	30years	30th	Strip leading and ending punctuation: 30th. = 30th (coreTerm) + . (suffix)
Matchers	yes	yes	Detect whether the token matches the pattern for splitting
Filters (Exceptions)	no	yes (ordinal number)	Detect whether the token is an exception (a legitimate word)
Split	yes	no
Un-Core	30 years	30th.	Output = prefix + coreTerm + suffix
Output	30 years	30th.
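
The steps in the table can be sketched as a small pipeline. This is an illustration under stated assumptions, not CSpell's actual code: the function names and both regular expressions (digits followed by letters as the matcher, English ordinal suffixes as the filter) are hypothetical choices made to reproduce the two examples.

```python
import re
import string

# Hypothetical matcher: one or more digits followed by one or more letters.
LEADING_DIGIT = re.compile(r"(\d+)([A-Za-z]+)")
# Hypothetical filter (exception): ordinal numbers such as 1st, 2nd, 30th.
ORDINAL = re.compile(r"\d+(st|nd|rd|th)", re.IGNORECASE)


def get_core_term(token: str):
    """Return (prefix, coreTerm, suffix) with leading/ending punctuation stripped."""
    left = token.lstrip(string.punctuation)
    prefix = token[: len(token) - len(left)]
    core = left.rstrip(string.punctuation)
    suffix = left[len(core):]
    return prefix, core, suffix


def split_leading_digit(token: str) -> str:
    prefix, core, suffix = get_core_term(token)      # CoreTerm step
    match = LEADING_DIGIT.fullmatch(core)            # Matchers step
    is_exception = ORDINAL.fullmatch(core) is not None  # Filters step
    if match and not is_exception:                   # Split step
        core = f"{match.group(1)} {match.group(2)}"
    return prefix + core + suffix                    # Un-Core step


print(split_leading_digit("30years"))  # 30 years
print(split_leading_digit("30th."))    # 30th.
```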

Notes:
- For generic splitters, both matchers and filters should be generic (not project-specific).
- Matchers and filters can be implemented with regular expressions or other algorithms.
- Matchers are designed to be aggressive to increase recall.
- Filters (exceptions) are designed to preserve precision. In general, they are derived from:
  - Valid words in the Lexicon that match the matcher patterns.
  - Consumer test data.
- This model is designed to make splitters easier to maintain and improve: changes are made by adding or modifying matchers and filters.
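
One way to picture the maintenance model in these notes is a splitter that holds plain lists of matchers and filters, so that an improvement is a matter of appending to a list. The sketch below is an assumption about structure only. The Splitter class, the function names, and the KNOWN_WORDS set (a tiny stand-in for valid Lexicon words) are hypothetical and do not come from CSpell:

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional

Matcher = Callable[[str], Optional[str]]  # returns the split text, or None
Filter = Callable[[str], bool]            # True means "legit word, do not split"


@dataclass
class Splitter:
    matchers: List[Matcher] = field(default_factory=list)
    filters: List[Filter] = field(default_factory=list)

    def split(self, core_term: str) -> str:
        # Matchers are aggressive (recall): the first one that fires proposes a split.
        for matcher in self.matchers:
            proposal = matcher(core_term)
            if proposal is not None:
                # Filters veto the proposal for known exceptions (precision).
                if any(f(core_term) for f in self.filters):
                    return core_term
                return proposal
        return core_term


def leading_digit_matcher(term: str) -> Optional[str]:
    m = re.fullmatch(r"(\d+)([A-Za-z]+)", term)
    return f"{m.group(1)} {m.group(2)}" if m else None


def ordinal_filter(term: str) -> bool:
    return re.fullmatch(r"\d+(st|nd|rd|th)", term, re.IGNORECASE) is not None


# Hypothetical stand-in for valid Lexicon words that match the matcher pattern.
KNOWN_WORDS = {"2d", "3d"}


def lexicon_filter(term: str) -> bool:
    return term.lower() in KNOWN_WORDS


splitter = Splitter(matchers=[leading_digit_matcher], filters=[ordinal_filter])
splitter.filters.append(lexicon_filter)  # adding an exception is one more filter

print(splitter.split("30years"))  # 30 years
print(splitter.split("30th"))     # 30th
print(splitter.split("3D"))       # 3D
```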