Non-dictionary-based Splitters
A splitter corrects agglutination (missing spaces between two or more words) by inserting space(s) to split a token into two or more tokens. Two types of splitters were developed: dictionary-based and non-dictionary-based.
This section describes the non-dictionary-based splitters, which correct missing space(s) around punctuation and digits. This type of splitter relies only on the shape of the token; no dictionary knowledge is required. The non-dictionary-based splitters include:
Non-dictionary-based Splitter:
CS_CAN_ND_MAX_SPLIT_NO
Steps | Example-1 | Example-2 | Notes |
---|---|---|---|
Input | 30years | 30th. | |
CoreTerm | 30years | 30th | |
Matchers | yes | yes | detect whether the token matches a pattern for splitting |
Filters (Exceptions) | no | yes (ordinal number) | detect whether the token is an exception (a legitimate word) |
Split | yes | no | |
Un-Core | 30 years | 30th. | output = prefix + coreTerm + suffix |
Output | 30 years | 30th. | |
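
The steps in the table can be sketched in Python as follows. This is only an illustrative sketch, not the tool's actual implementation: the function names (`get_core_term`, `matches_split_pattern`, `is_exception`, `split_token`) and the regular expressions are assumptions made for this example, covering just the digit-followed-by-letter case and the ordinal-number exception shown above.

```python
import re
import string

# Hypothetical patterns for illustration only; the real matchers and
# filters are not specified in this section.
DIGIT_THEN_LETTER = re.compile(r"(?<=\d)(?=[A-Za-z])")  # boundary such as "30|years"
ORDINAL_NUMBER = re.compile(r"^\d+(st|nd|rd|th)$")      # e.g. "1st", "30th"


def get_core_term(token):
    """Strip leading/trailing punctuation; return (prefix, core_term, suffix)."""
    start, end = 0, len(token)
    while start < end and token[start] in string.punctuation:
        start += 1
    while end > start and token[end - 1] in string.punctuation:
        end -= 1
    return token[:start], token[start:end], token[end:]


def matches_split_pattern(core_term):
    """Matcher: does the core term contain a digit directly followed by a letter?"""
    return DIGIT_THEN_LETTER.search(core_term) is not None


def is_exception(core_term):
    """Filter: is the core term a legitimate word (here, an ordinal number)?"""
    return ORDINAL_NUMBER.match(core_term) is not None


def split_token(token):
    """Run the steps: CoreTerm -> Matchers -> Filters -> Split -> Un-Core."""
    prefix, core_term, suffix = get_core_term(token)
    if matches_split_pattern(core_term) and not is_exception(core_term):
        # Split: insert a space at each digit/letter boundary.
        core_term = DIGIT_THEN_LETTER.sub(" ", core_term)
    # Un-Core: output = prefix + coreTerm + suffix
    return prefix + core_term + suffix


print(split_token("30years"))  # 30 years
print(split_token("30th."))    # 30th.
```

Both examples pass the matcher, but only `30years` is split; `30th` is caught by the ordinal-number filter, so the token is returned unchanged with its trailing period restored.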