Lexical Tools

Tokenize, Keep Everything

  • Short Description: Tokenize, keep everything.

  • Full Description:

    Breaks up a string into a unique list of words and delimiters. Delimiters include punctuation, spaces, and tabs. Consecutive spaces are combined into a single token.

    The -m option has no effect on this flow component: "none" is added at the end of the output.

  • Difference: None

  • Features:
    1. Breaks up the input term into tokens and delimiters.
    2. Delimiters include space, tab, and all punctuation.
    3. Combines consecutive spaces into a single token.


  • Symbol: ca

  • Examples:

    shell> lvg -f:ca

    
    the club-foot
    the club-foot|the|2047|16777215|ca|1|
    the club-foot| |2047|16777215|ca|1|
    the club-foot|club|2047|16777215|ca|1|
    the club-foot|-|2047|16777215|ca|1|
    the club-foot|foot|2047|16777215|ca|1|
    
    More examples

  • Implementation Logic:
    1. Uses the Java StringTokenizer class.
    2. Delimiters include space, tab, and all punctuation.
    3. Combines consecutive spaces into a single token.
    4. Puts everything (words and delimiters) back into the output list.
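
    The steps above can be sketched roughly as follows. This is an illustration only, not the contents of ToTokenizeKeepAll.java: the class name TokenizeKeepAllSketch, the method tokenizeKeepAll, and the exact delimiter set (space, tab, ASCII punctuation) are assumptions, as is the choice to keep a run of spaces as one token without shortening it. The sketch does not remove duplicate tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizeKeepAllSketch {
    // Assumed delimiter set: space, tab, and ASCII punctuation.
    private static final String DELIMITERS =
        " \t!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";

    // Hypothetical helper: splits the input into words and delimiters,
    // keeping both, and merges a run of consecutive spaces into one token.
    public static List<String> tokenizeKeepAll(String input) {
        List<String> out = new ArrayList<>();
        // returnDelims = true makes StringTokenizer return each delimiter
        // character as its own one-character token.
        StringTokenizer st = new StringTokenizer(input, DELIMITERS, true);
        StringBuilder spaces = new StringBuilder();
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.equals(" ")) {
                spaces.append(token);       // collect consecutive spaces
                continue;
            }
            if (spaces.length() > 0) {
                out.add(spaces.toString()); // flush the collected run
                spaces.setLength(0);
            }
            out.add(token);
        }
        if (spaces.length() > 0) {
            out.add(spaces.toString());     // trailing spaces, if any
        }
        return out;
    }

    public static void main(String[] args) {
        // Yields the five tokens: "the", " ", "club", "-", "foot"
        for (String token : tokenizeKeepAll("the club-foot")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

    For the input "the club-foot", the sketch returns the same five tokens shown in the example above, in the same order.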

  • Source Code: ToTokenizeKeepAll.java

  • Hierarchy: Object -> Transformation -> ToTokenizeKeepAll