Lexical Tools

Tokenize

  • Short Description: Tokenize the input into "words"

  • Full Description:

    Breaks a string up into a unique list of "words". What counts as a word depends on how the string is tokenized: here, a word is any token that consists solely of a run of alphanumeric characters. A word must also meet a minimum length, measured as the number of characters in the run. The default minimum is two characters and can be changed with the -ws:INT global option. Some applications benefit from discarding single-character words, while others need to keep them.

    This flow has no effect on the -m option; "none" is added at the end of the output.

  • Difference: None

  • Features:
    1. Breaks up the input term into tokens separated by delimiters.
    2. Delimiters include space, tab, and all punctuation characters.


  • Symbol: c

  • Examples:
    
    shell> lvg -f:c
    the club-foot
    the club-foot|the|2047|16777215|c|1|
    the club-foot|club|2047|16777215|c|1|
    the club-foot|foot|2047|16777215|c|1|
    
    More examples

  • Implementation Logic:
    1. Uses the Java StringTokenizer class.
    2. Delimiters include space, tab, and all punctuation characters.
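
    The logic above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of ToTokenize.java: the class name TokenizeSketch, the method tokenize, the parameter minWordSize (standing in for the -ws:INT value), and the exact delimiter string (space, tab, and ASCII punctuation) are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;

// Hypothetical sketch; the real ToTokenize.java may differ.
class TokenizeSketch {
    // Assumed delimiter set: space, tab, and ASCII punctuation.
    private static final String DELIMITERS =
        " \t!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";

    // Returns the unique tokens of input, in order of first appearance,
    // keeping only those with at least minWordSize characters.
    static List<String> tokenize(String input, int minWordSize) {
        Set<String> unique = new LinkedHashSet<>();
        StringTokenizer st = new StringTokenizer(input, DELIMITERS);
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.length() >= minWordSize) {
                unique.add(token);
            }
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        // Mirrors the documented example with the default minimum of two.
        System.out.println(tokenize("the club-foot", 2)); // prints [the, club, foot]
    }
}
```

    A LinkedHashSet is used here so that duplicate tokens are dropped while the order of first appearance is kept. Raising minWordSize discards shorter tokens, and setting it to 1 keeps single-character words.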

  • Source Code: ToTokenize.java

  • Hierarchy: Object -> Transformation -> ToTokenize