Strip Stop Words
- Short Description:
Strip stop words
- Full Description:
Strips stop words from the input term. By definition, a stop word must be:
- a high frequency word, such as a preposition.
- a grammar word, which contributes little to the meaning of the sentence.
The default stop words are listed in the file "data/misc/stopWords.data". They are "of", "and", "with", "for", "nos", "to", "in", "by", "on", "the", "(non mesh)". These stop words are configurable. Matching is case-insensitive.
This flow has no effect on the -m option: "none" is added at the end of the output.
- Difference:
- The Java version trims output terms (removes spaces at the beginning and end of the term).
- The Java version uses the Lvg.Util.StripToken class to handle tokens joined to punctuation.
- Features:
- Remove stop words from the input terms.
- Stop words are defined in a file named stopWords.data.
- Lvg allows users to modify the stop words list by modifying this file.
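As an illustration of a configurable stop word list, the following is a minimal Java sketch of loading stop words from a flat file. It is not the Lvg loader: the class name StopWordLoaderSketch, the one-entry-per-line layout, and the treatment of lines starting with "#" as comments are all assumptions made for this example, and the actual format of stopWords.data may differ.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lvg.
public class StopWordLoaderSketch {

    // Reads stop words from a character source.
    // Assumed format: one stop word (or multi-word stop phrase) per line;
    // blank lines and lines starting with '#' are skipped.
    static Set<String> load(Reader source) {
        Set<String> words = new LinkedHashSet<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Lowercase so later lookups can ignore case.
                String word = line.trim().toLowerCase(Locale.ROOT);
                if (word.isEmpty() || word.startsWith("#")) {
                    continue;
                }
                words.add(word);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }
}
```

To read from disk, a reader from java.nio.file.Files.newBufferedReader could be passed in; taking a Reader rather than a path keeps the sketch easy to try with in-memory text.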
- Symbol:
t
- Examples:
shell> lvg -f:t
Bacterial infection in conditions classified elsewhere and of unspecified site
Bacterial infection in conditions classified elsewhere and of unspecified site|
Bacterial infection conditions classified elsewhere unspecified site|2047|16777215|t|1|
More examples
Implementation Logic:
- Tokenize the input term into words (using Lvg.Util.StripToken).
- Load the stop words from the flat file.
- Strip all single-word stop words.
- Clean up and recompose the term after stripping.
- Strip multi-word stop words (such as "(non mesh)").
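The steps above can be sketched in Java roughly as follows. This is a simplified illustration, not the code in ToStripStopWords.java: the class StopWordSketch and its methods are hypothetical, tokenization is plain whitespace splitting rather than Lvg.Util.StripToken (so tokens with attached punctuation are not handled), and the stop words are passed in as a list instead of being read from the file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch, not part of Lvg.
public class StopWordSketch {

    // Default stop words as listed in this document.
    static final List<String> DEFAULT_STOP_WORDS = Arrays.asList(
        "of", "and", "with", "for", "nos", "to", "in", "by", "on", "the",
        "(non mesh)");

    static String strip(String term, List<String> stopWords) {
        // Separate single-word stop words from multi-word ones.
        Set<String> single = new HashSet<>();
        List<String> multi = new ArrayList<>();
        for (String w : stopWords) {
            String lower = w.trim().toLowerCase(Locale.ROOT);
            if (lower.isEmpty()) {
                continue;
            }
            if (lower.contains(" ")) {
                multi.add(lower);
            } else {
                single.add(lower);
            }
        }

        // Tokenize on whitespace, drop single-word stop words
        // (ignoring case), and recompose with single spaces.
        StringBuilder out = new StringBuilder();
        for (String token : term.trim().split("\\s+")) {
            if (token.isEmpty()
                    || single.contains(token.toLowerCase(Locale.ROOT))) {
                continue;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(token);
        }
        String result = out.toString();

        // Remove multi-word stop words where they stand as whole tokens.
        for (String phrase : multi) {
            Pattern p = Pattern.compile(
                "(?<!\\S)" + Pattern.quote(phrase) + "(?!\\S)",
                Pattern.CASE_INSENSITIVE);
            result = p.matcher(result).replaceAll(" ");
        }

        // Final clean-up: collapse runs of spaces and trim.
        return result.replaceAll("\\s+", " ").trim();
    }
}
```

Calling StopWordSketch.strip with the example term from this page and DEFAULT_STOP_WORDS drops "in", "and", and "of", matching the output term shown in the example.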
Source Code: ToStripStopWords.java
Hierarchy: Object -> Transformation -> ToStripStopWords