Lexical Tools

Strip Punctuation

  • Short Description: Strip punctuation.

  • Full Description:

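    This flow strips punctuation from the input term. Stripped characters are not replaced by spaces. Punctuation here means any character in the following general categories of the Java Character class (constant values in parentheses), which also cover several symbol categories: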

    • DASH_PUNCTUATION (20): -
    • START_PUNCTUATION (21): ( { [
    • END_PUNCTUATION (22): ) } ]
    • CONNECTOR_PUNCTUATION (23): _
    • OTHER_PUNCTUATION (24): ! @ # % & * \ : ; " ' , . ? /
    • MATH_SYMBOL (25): ~ + = | < >
    • CURRENCY_SYMBOL (26): $
    • MODIFIER_SYMBOL (27): ` ^
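
    As an illustration, the category codes listed above can be checked with Character.getType. This is a minimal sketch, not the code in ToStripPunctuation.java; the class name PunctuationCategorySketch and the method isStrippable are hypothetical names chosen for this example.

```java
class PunctuationCategorySketch {
    /** Returns true if c belongs to one of the eight categories listed above. */
    static boolean isStrippable(char c) {
        switch (Character.getType(c)) {
            case Character.DASH_PUNCTUATION:      // 20
            case Character.START_PUNCTUATION:     // 21
            case Character.END_PUNCTUATION:       // 22
            case Character.CONNECTOR_PUNCTUATION: // 23
            case Character.OTHER_PUNCTUATION:     // 24
            case Character.MATH_SYMBOL:           // 25
            case Character.CURRENCY_SYMBOL:       // 26
            case Character.MODIFIER_SYMBOL:       // 27
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        // Print the category code and the strip decision for a few sample characters.
        char[] samples = {'-', '(', '_', '.', '+', '$', '^', 'a'};
        for (char c : samples) {
            System.out.println(c + " -> type " + Character.getType(c)
                    + ", strippable: " + isStrippable(c));
        }
    }
}
```

    Running main shows, for example, that '.' reports type 24 and '$' reports type 26, while the letter 'a' falls outside the listed categories.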

    This flow has no effect on the -m option; "none" is added at the end of the output.

  • Difference:
    1. The Java version trims output terms (removes spaces at the beginning and end of the term).
    2. Results differ when testing diacritics, such as \345\346... in the unit test.


  • Features:
    1. Strip a character from the input term if the character belongs to one of the categories listed above.


  • Symbol: p

  • Examples:
    
    shell> lvg -f:p
    St. John's
    St. John's|St Johns|2047|16777215|p|1|
    
    More examples

  • Implementation Logic:
    1. Go through every character in the input term; strip it if it is a punctuation character.
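
    The logic above can be sketched as follows. This is an illustrative approximation, not the actual ToStripPunctuation.java source; the class name StripPunctuationSketch and the method strip are hypothetical. It assumes the eight categories described in the Full Description, whose Java constants have the consecutive values 20 through 27, so a range check is used. The final trim reflects the note that the Java version trims output terms.

```java
class StripPunctuationSketch {
    /** Assumption: "punctuation" means Character types 20 (DASH_PUNCTUATION) to 27 (MODIFIER_SYMBOL). */
    static boolean isPunctuation(char c) {
        int type = Character.getType(c);
        return type >= Character.DASH_PUNCTUATION
                && type <= Character.MODIFIER_SYMBOL;
    }

    /** Copies every non-punctuation character; stripped characters are not replaced by spaces. */
    static String strip(String term) {
        StringBuilder out = new StringBuilder(term.length());
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (!isPunctuation(c)) {
                out.append(c);
            }
        }
        // Remove spaces left at the beginning and end of the term.
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("St. John's")); // prints "St Johns"
    }
}
```

    Because stripped characters are dropped rather than replaced, the space in "St. John's" is the only separator that survives, which matches the output term in the example above.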

  • Source Code: ToStripPunctuation.java

  • Hierarchy: Object -> Transformation -> ToStripPunctuation