
Lexical Tools

Strip Stop Words

  • Short Description: Strip stop words

  • Full Description:

    Strips stop words from the input term. By definition, a stop word must be:

    1. a high frequency word, such as a preposition.
    2. a grammar word, which contributes little to the meaning of the sentence.

    The default stop words are listed in the file "data/misc/stopWords.data". They are "of", "and", "with", "for", "nos", "to", "in", "by", "on", "the", "(non mesh)". These stop words are configurable. Case is ignored when matching.

    This flow has no effect on the -m option; "none" is added at the end of the output.
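
    The configurable, case-insensitive stop word list described above can be sketched as follows. This is a minimal illustration, not the Lvg implementation: the class StopWordListSketch and its methods are hypothetical, and the file layout it assumes (one stop word per line, blank lines and lines starting with "#" ignored) is an assumption, since this page does not describe the format of stopWords.data.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lvg.
final class StopWordListSketch {

    // Builds a lower-cased stop word set from the lines of a list file.
    // Assumed layout: one entry per line; blank lines and "#" lines are skipped.
    static Set<String> parse(List<String> lines) {
        Set<String> stopWords = new LinkedHashSet<>();
        for (String line : lines) {
            String entry = line.trim().toLowerCase(Locale.ROOT);
            if (entry.isEmpty() || entry.startsWith("#")) {
                continue;
            }
            stopWords.add(entry);
        }
        return stopWords;
    }

    // Case is ignored: the token is lower-cased before the lookup.
    static boolean isStopWord(Set<String> stopWords, String token) {
        return stopWords.contains(token.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Set<String> stopWords = parse(Arrays.asList(
                "# hypothetical custom list", "of", "And", "", "(non mesh)"));
        System.out.println(isStopWord(stopWords, "OF"));
        System.out.println(isStopWord(stopWords, "site"));
    }
}
```

    To try this against a real file, the lines could be read with java.nio.file.Files.readAllLines and passed to parse, provided the file actually follows the assumed layout.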

  • Difference:
    1. The Java version trims output terms (removes spaces at the beginning and end of the term).
    2. The Java version applies the Lvg.Util.StripToken( ) class to handle tokens joined to punctuation.


  • Features:
    1. Remove stop words from the input terms.
    2. Stop words are defined in a file named stopWords.data.
    3. Lvg allows users to modify the stop words list by modifying this file.


  • Symbol: t

  • Examples:
    
    shell> lvg -f:t
    Bacterial infection in conditions classified elsewhere and of unspecified site
    Bacterial infection in conditions classified elsewhere and of unspecified site|
    Bacterial infection conditions classified elsewhere unspecified site|2047|16777215|t|1|
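
    The output record in the example is pipe-delimited, so it can be split into fields for further processing. The sketch below is a hypothetical helper, not part of Lvg. The field labels it prints (input term, output term, category, inflection, flow history, flow number) are assumptions about the record layout and are not stated on this page. It also assumes the record is a single line, with the input term followed by the stripped term.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lvg.
final class LvgOutputSketch {

    // Splits one pipe-delimited record into its fields.
    // String.split discards trailing empty strings, so the final "|"
    // does not produce an extra empty field.
    static List<String> fields(String record) {
        return Arrays.asList(record.split("\\|"));
    }

    public static void main(String[] args) {
        String record =
                "Bacterial infection in conditions classified elsewhere and of unspecified site|"
                + "Bacterial infection conditions classified elsewhere unspecified site|"
                + "2047|16777215|t|1|";
        // Assumed field order; the labels are illustrative only.
        String[] labels = {
            "input term", "output term", "category",
            "inflection", "flow history", "flow number"
        };
        List<String> f = fields(record);
        for (int i = 0; i < f.size() && i < labels.length; i++) {
            System.out.println(labels[i] + ": " + f.get(i));
        }
    }
}
```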
    

  • Implementation Logic:
    1. Tokenize the input term into words (using Lvg.Util.StripToken).
    2. Load the stop words from the flat file.
    3. Strip all single-word stop words.
    4. Clean up and recompose the term after stripping.
    5. Strip multi-word stop words (such as "(non mesh)").
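
    The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in ToStripStopWords.java: the class StripStopWordsSketch is hypothetical, plain whitespace splitting stands in for Lvg.Util.StripToken (so punctuation attached to a token is not handled), and the stop words are passed in as a list instead of being loaded from the flat file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch, not the Lvg implementation.
final class StripStopWordsSketch {

    // Default stop words as listed on this page.
    static final List<String> DEFAULT_STOP_WORDS = Arrays.asList(
            "of", "and", "with", "for", "nos", "to", "in", "by", "on", "the",
            "(non mesh)");

    static String strip(String term, List<String> stopWords) {
        // Stand-in for step 2: separate single-word and multi-word entries.
        Set<String> single = new HashSet<>();
        List<String> multi = new ArrayList<>();
        for (String word : stopWords) {
            String lower = word.trim().toLowerCase(Locale.ROOT);
            if (lower.isEmpty()) {
                continue;
            }
            if (lower.contains(" ")) {
                multi.add(lower);
            } else {
                single.add(lower);
            }
        }

        // Steps 1, 3 and 4: split on whitespace (assumed tokenizer), drop
        // single-word stop words, and rejoin the rest with single spaces.
        StringBuilder kept = new StringBuilder();
        for (String token : term.trim().split("\\s+")) {
            if (token.isEmpty() || single.contains(token.toLowerCase(Locale.ROOT))) {
                continue;
            }
            if (kept.length() > 0) {
                kept.append(' ');
            }
            kept.append(token);
        }
        String result = kept.toString();

        // Step 5: remove multi-word stop words, ignoring case.
        for (String phrase : multi) {
            result = Pattern.compile(Pattern.quote(phrase), Pattern.CASE_INSENSITIVE)
                    .matcher(result)
                    .replaceAll("");
        }

        // Final cleanup: collapse repeated spaces and trim the output term.
        return result.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(strip(
                "Bacterial infection in conditions classified elsewhere and of unspecified site",
                DEFAULT_STOP_WORDS));
    }
}
```

    With the default list, the input term from the example above yields "Bacterial infection conditions classified elsewhere unspecified site". Single-word stripping runs first, so the tokens of "(non mesh)" survive it and the phrase is removed only in the last pass.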

  • Source Code: ToStripStopWords.java

  • Hierarchy: Object -> Transformation -> ToStripStopWords