CSpell

Lazy Tokenizer

I. Introduction

This page describes the lazy implementation of the tokenizer. "Lazy" means that processing is deferred until it is actually needed, which improves speed. In CSpell, the input text is tokenized into words that are processed sequentially. Tokenization on punctuation is lazy (delayed until the last moment) and is implemented with the coreTerm class, which avoids unnecessary tokenization and reassembly on punctuation. This implementation saves time, is easier to maintain, and fits well with Java 8 stream operations.

II. Source code

  • TokenObj.java
  • TokenUtil.java

  • TextObj.java
  • TermUtil.java

III. Design and Algorithm

  • Don't tokenize punctuation until it needs to be tokenized
    • Saves time and is easier to maintain.
    • Avoids repeated tokenization.
    • Fits well with Java 8 stream operations.
  • Three steps:
    • Tokenize by space:
      • Tokens that contain punctuation are not split further at this step
      • For example, emails, digits, punctuation, URLs, Dr., and Mr. are each kept as a single token
    • Remove leading and ending punctuation (core-term)
      • The core-term is the token with unnecessary leading and ending punctuation removed
      • Such as end-of-sentence punctuation, parenthetical terms, etc.
        Examples:
        • (people) => people
        • End of sentence. => sentence
      • Valid terms such as AT&T, R&D, and 12.34 are already core-terms and are not tokenized further
    • Tokenize middle punctuation
      • Such as possessives, optional plural forms, and slashes (A/B)
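
The three steps above can be sketched as a small Java 8 stream pipeline. This is only an illustration, not the CSpell source: the class LazyTokenizerSketch and its methods are hypothetical names, and it makes simplifying assumptions (punctuation means any character that is not a letter or digit, abbreviations such as Dr. are not special-cased, and the middle-punctuation step handles slashes only).

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the three-step lazy tokenization (not the CSpell code).
class LazyTokenizerSketch {

    // Step 1: split on whitespace only; punctuation stays attached to its token.
    static List<String> splitBySpace(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    // Step 2: strip leading and ending punctuation to get the core-term.
    // Assumption: "punctuation" is any character that is not a letter or digit.
    static String coreTerm(String token) {
        int start = 0;
        int end = token.length();
        while (start < end && !Character.isLetterOrDigit(token.charAt(start))) {
            start++;
        }
        while (end > start && !Character.isLetterOrDigit(token.charAt(end - 1))) {
            end--;
        }
        return token.substring(start, end);
    }

    // Step 3: split on middle punctuation, called only when a caller needs it.
    // Assumption: only the slash case (A/B) is handled here.
    static List<String> splitMiddle(String coreTerm) {
        return Arrays.stream(coreTerm.split("/"))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    // Chains the three steps; each step runs only on the output of the previous one.
    static List<String> tokenize(String text) {
        return splitBySpace(text).stream()
                .map(LazyTokenizerSketch::coreTerm)
                .filter(t -> !t.isEmpty())
                .flatMap(t -> splitMiddle(t).stream())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokenize("End of sentence. (people) and/or AT&T"));
    }
}
```

In this sketch, terms such as AT&T and 12.34 pass through all three steps unchanged because their punctuation is neither leading, ending, nor a slash, while and/or is split only at the final step.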