CSpell

Non-dictionary-based Splitters

  • Description:
    A splitter is used to correct agglutination (missing spaces between two or more words) by splitting a token into 2 or more tokens by adding space(s). Two types of splitters were developed:
    • Non-dictionary-based splitter: [20years] -> [20 years]
    • Dictionary-based splitter: [knowabout] -> [know about]

    This section describes the non-dictionary-based splitters, which correct missing space(s) around punctuation and digits. This type of splitter relies on the shape of the token, so no dictionary knowledge is required. The non-dictionary-based splitters include:

  • Design:

    Non-dictionary-based Splitter:

    • splitNo <= 5 (configurable: CS_CAN_ND_MAX_SPLIT_NO)
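
    The split limit above could be enforced with a check like the following. This is only a sketch under one possible reading of splitNo, namely the number of spaces a splitter inserts into a single token. The function name within_split_limit is hypothetical, and only the default of 5 and the configuration key come from this page.

```python
# Default taken from the design note above (CS_CAN_ND_MAX_SPLIT_NO).
MAX_SPLIT_NO = 5


def within_split_limit(original, candidate, max_split_no=MAX_SPLIT_NO):
    """Return True if the candidate adds at most max_split_no spaces.

    Assumption: splitNo counts the spaces inserted into one token.
    """
    split_no = candidate.count(" ") - original.count(" ")
    return split_no <= max_split_no


print(within_split_limit("20years", "20 years"))
print(within_split_limit("a1b2c3d4", "a 1 b 2 c 3 d 4"))
```

    Under this reading, a candidate that needs more splits than the limit is rejected and the token is left unchanged.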

  • Example Walk-through (Leading-digit):

    Steps                 Example-1  Example-2             Notes
    Input                 30years    30th.
    CoreTerm              30years    30th                  Strip leading and ending punctuation: 30th. = 30th (coreTerm) + . (suffix)
    Matchers              yes        yes                   Detect whether the token matches the pattern for splitting
    Filters (Exceptions)  no         yes (ordinal number)  Detect whether it is an exception (legit word)
    Split                 yes        no
    Un-Core               30 years   30th.                 output = prefix + coreTerm + suffix
    Output                30 years   30th.
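
    The walk-through can be sketched in a few lines of Python. This is a minimal illustration, not the CSpell implementation: the regular expressions, the punctuation set, and the function names (split_core, split_leading_digit) are assumptions chosen only to reproduce the two examples above.

```python
import re

# Hypothetical matcher: one or more digits followed directly by letters.
LEADING_DIGIT_MATCHER = re.compile(r"(\d+)([A-Za-z]+)")
# Hypothetical filter (exception): ordinal numbers such as 30th, 1st, 2nd.
ORDINAL_FILTER = re.compile(r"\d+(st|nd|rd|th)")
# Assumed set of punctuation stripped from both ends of a token.
PUNCTUATION = ".,;:!?()[]{}\"'"


def split_core(token):
    """Return (prefix, core_term, suffix) with edge punctuation stripped."""
    start = 0
    while start < len(token) and token[start] in PUNCTUATION:
        start += 1
    end = len(token)
    while end > start and token[end - 1] in PUNCTUATION:
        end -= 1
    return token[:start], token[start:end], token[end:]


def split_leading_digit(token):
    """Apply the CoreTerm, Matchers, Filters, Split, Un-Core steps."""
    prefix, core, suffix = split_core(token)
    match = LEADING_DIGIT_MATCHER.fullmatch(core)
    if match is None:
        return token  # Matchers: no, nothing to split
    if ORDINAL_FILTER.fullmatch(core):
        return token  # Filters: exception, keep the token as is
    split_core_term = match.group(1) + " " + match.group(2)
    return prefix + split_core_term + suffix  # Un-Core


print(split_leading_digit("30years"))
print(split_leading_digit("30th."))
```

    Because the matcher and the filter are kept separate, the matcher can stay broad while the filter alone decides which matched tokens are left intact.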

  • Notes:
    • Both matchers and filters should be generic (not project-specific) for generic splitters.
    • Matchers and filters can be implemented with regular expressions or other algorithms.
    • Matchers are designed to be aggressive to increase recall.
    • Filters (exceptions) are designed to preserve precision. In general, they are retrieved from:
      • Valid words (that match the matcher patterns) from the Lexicon.
      • Consumer test data.
    • This model is designed to make splitters easier to maintain and improve by adding or modifying matchers and filters.
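
    One way to picture this model is a splitter that holds a list of matchers and a list of filters, so that either list can be extended without touching the splitting logic. The sketch below is an assumption-laden illustration: GenericSplitter, regex_matcher, regex_filter, and word_list_filter are hypothetical names, and the word list stands in for valid words taken from the Lexicon.

```python
import re


def regex_matcher(pattern, replacement):
    """Build a matcher that returns the split form, or None if no match."""
    compiled = re.compile(pattern)

    def matcher(core):
        if compiled.fullmatch(core):
            return compiled.sub(replacement, core)
        return None

    return matcher


def regex_filter(pattern):
    """Build a filter that flags core terms matching an exception pattern."""
    compiled = re.compile(pattern)
    return lambda core: compiled.fullmatch(core) is not None


def word_list_filter(words):
    """Build a filter from a list of valid words (case-insensitive)."""
    known = {word.lower() for word in words}
    return lambda core: core.lower() in known


class GenericSplitter:
    """Hypothetical splitter driven by pluggable matchers and filters."""

    def __init__(self, matchers, filters):
        self.matchers = list(matchers)
        self.filters = list(filters)

    def split(self, core):
        for matcher in self.matchers:
            candidate = matcher(core)
            if candidate is None:
                continue
            if any(is_exception(core) for is_exception in self.filters):
                return core  # exception: preserve precision
            return candidate  # aggressive matcher: increase recall
        return core


splitter = GenericSplitter(
    matchers=[regex_matcher(r"(\d+)([A-Za-z]+)", r"\1 \2")],
    filters=[regex_filter(r"\d+(st|nd|rd|th)"), word_list_filter(["3D"])],
)
print(splitter.split("30years"))
print(splitter.split("30th"))
print(splitter.split("3D"))
```

    Adding a new exception here means appending one more filter, which is the kind of maintenance step the note above describes.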