Lexical Tools

Tokenize

  • Short Description: Tokenize the input into "words"

  • Full Description:

    Breaks a string into a unique list of "words". What counts as a word depends on how the string is tokenized: a word is any token that consists only of a run of alphanumeric characters. It also depends on the minimum number of characters in the run, which defaults to two and is configurable with the -ws:INT global option. Some applications find it convenient to discard single-character words, while others need to keep them.

    This flow has no effect on the -m option; "none" is added at the end of the output.

  • Difference: None

  • Features:
    1. Breaks up the input term into tokens separated by delimiters.
    2. Delimiters include space, tab, and all punctuation characters.


  • Symbol: c

  • Examples:
    
    shell> lvg -f:c
    the club-foot
    the club-foot|the|2047|16777215|c|1|
    the club-foot|club|2047|16777215|c|1|
    the club-foot|foot|2047|16777215|c|1|
    

  • Implementation Logic:
    1. Uses the Java StringTokenizer class.
    2. Delimiters include space, tab, and all punctuation characters.
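
    The logic above can be sketched roughly as follows. This is a minimal illustration, not the contents of ToTokenize.java: the class name TokenizeSketch, the method tokenize, the minWordSize parameter (standing in for the -ws:INT value), and the exact delimiter string (space, tab, and ASCII punctuation) are all assumptions made for the example. It also assumes that the unique list keeps first-occurrence order and that case is left unchanged.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;

// Hypothetical sketch; the names here are not taken from the lvg source.
class TokenizeSketch {
    // Assumed delimiter set: space, tab, and ASCII punctuation characters.
    private static final String DELIMITERS =
        " \t!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";

    // Returns the unique words of the input, in order of first occurrence.
    // minWordSize plays the role of the -ws:INT option (default two).
    static List<String> tokenize(String input, int minWordSize) {
        Set<String> words = new LinkedHashSet<>();
        StringTokenizer tokenizer = new StringTokenizer(input, DELIMITERS);
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            // Keep only alphanumeric runs that are long enough.
            if (token.length() >= minWordSize && isAlphanumeric(token)) {
                words.add(token);
            }
        }
        return new ArrayList<>(words);
    }

    // True if every character of the token is a letter or a digit.
    private static boolean isAlphanumeric(String token) {
        for (int i = 0; i < token.length(); i++) {
            if (!Character.isLetterOrDigit(token.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("the club-foot", 2)); // [the, club, foot]
        System.out.println(tokenize("a b-c the the", 1)); // [a, b, c, the]
    }
}
```

    With a minimum word size of two, the second input would yield only "the", since the single-character tokens are dropped and the repeated word is listed once.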

  • Source Code: ToTokenize.java

  • Hierarchy: Object -> Transformation -> ToTokenize