Lexical Tools

Tokenize, Keep Everything

  • Short Description: Tokenize, keep everything.

  • Full Description:

    Breaks a string into a unique list of words and delimiters. Delimiters include punctuation, spaces, and tabs. Consecutive spaces are combined into a single delimiter.

    No effect on the -m option. "none" is added at the end of the output.

  • Difference: None

  • Features:
    1. Breaks up the input term into tokens and delimiters.
    2. Delimiters include space, tab, and all punctuation.
    3. Combines consecutive spaces into one delimiter.


  • Symbol: ca

  • Examples:

    shell> lvg -f:ca

    
    the club-foot
    the club-foot|the|2047|16777215|ca|1|
    the club-foot| |2047|16777215|ca|1|
    the club-foot|club|2047|16777215|ca|1|
    the club-foot|-|2047|16777215|ca|1|
    the club-foot|foot|2047|16777215|ca|1|
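
    Each output line in the example is a pipe-delimited record. The sketch below splits one such record into named fields. The field names (input, output, categories, inflections, flow, flowNumber) and the class LvgOutputLineSketch are assumptions made for illustration; they are inferred from the example, not taken from the lvg source.

```java
// Hypothetical helper, not part of lvg: splits one pipe-delimited output line.
class LvgOutputLineSketch {
    final String input;       // assumed: the original input term
    final String output;      // assumed: the token or delimiter produced
    final long categories;    // assumed: numeric category value (2047 in the example)
    final long inflections;   // assumed: numeric inflection value (16777215 in the example)
    final String flow;        // assumed: flow symbol, "ca" for this flow
    final int flowNumber;     // assumed: index of the flow on the command line

    private LvgOutputLineSketch(String[] f) {
        input = f[0];
        output = f[1];
        categories = Long.parseLong(f[2]);
        inflections = Long.parseLong(f[3]);
        flow = f[4];
        flowNumber = Integer.parseInt(f[5]);
    }

    static LvgOutputLineSketch parse(String line) {
        // A limit of -1 keeps the trailing empty field after the final '|'.
        String[] fields = line.split("\\|", -1);
        if (fields.length < 6) {
            throw new IllegalArgumentException("expected at least 6 fields: " + line);
        }
        return new LvgOutputLineSketch(fields);
    }
}
```

    This simple split assumes the token itself is not the "|" character; a line whose output token is a pipe would need special handling.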
    

  • Implementation Logic:
    1. Utilize the Java StringTokenizer class.
    2. Delimiters include space, tab, and all punctuation.
    3. Combine consecutive spaces into one delimiter.
    4. Put every token and delimiter back into the output list.
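
    The steps above can be sketched as follows. This is a minimal illustration, not the code in ToTokenizeKeepAll.java. The class name TokenizeKeepAllSketch, the ASCII-only delimiter set, the choice to keep a run of spaces as one multi-character token, and the reading of "unique" as "duplicates dropped, first-appearance order kept" are all assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;

// Hypothetical sketch of the "tokenize, keep everything" logic.
class TokenizeKeepAllSketch {
    // Assumed delimiter set: space, tab, and ASCII punctuation.
    private static final String DELIMITERS =
        " \t!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";

    static List<String> tokenize(String input) {
        // LinkedHashSet drops duplicates while preserving insertion order.
        Set<String> out = new LinkedHashSet<>();
        // returnDelims = true makes StringTokenizer emit each delimiter
        // character as its own one-character token.
        StringTokenizer st = new StringTokenizer(input, DELIMITERS, true);
        StringBuilder spaces = new StringBuilder();
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.equals(" ")) {
                // Accumulate consecutive spaces so they form one token.
                spaces.append(token);
                continue;
            }
            flush(spaces, out);
            out.add(token);
        }
        flush(spaces, out);
        return new ArrayList<>(out);
    }

    // Emits any pending run of spaces as a single token.
    private static void flush(StringBuilder spaces, Set<String> out) {
        if (spaces.length() > 0) {
            out.add(spaces.toString());
            spaces.setLength(0);
        }
    }
}
```

    Under these assumptions, tokenize("the club-foot") yields the five entries shown in the example: "the", " ", "club", "-", and "foot".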

  • Source Code: ToTokenizeKeepAll.java

  • Hierarchy: Object -> Transformation -> ToTokenizeKeepAll