Lexical Tools

Strip Punctuation, Enhanced

  • Short Description: Strip punctuation, enhanced

  • Full Description:

    This flow strips punctuation from the input term, except where the punctuation is between or just before numbers. This enhancement is an attempt to avoid breaking up tokens such as floating-point numbers ("1.25"), negative numbers ("-1"), dates ("10/12/97"), telephone numbers ("301-435-3170"), or catalog numbers ("0000-12-345").

    Punctuation is defined by the Java Character class and includes the following categories:

    • DASH_PUNCTUATION (20): -
    • START_PUNCTUATION (21): ( { [
    • END_PUNCTUATION (22): ) } ]
    • CONNECTOR_PUNCTUATION (23): _
    • OTHER_PUNCTUATION (24): ! @ # % & * \ : ; " ' , . ? /
    • MATH_SYMBOL (25): ~ + = | < >
    • CURRENCY_SYMBOL (26): $
    • MODIFIER_SYMBOL (27): ` ^
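
The category codes above are the values returned by java.lang.Character.getType. As a minimal sketch (the class and method names below are illustrative and are not taken from the lvg source), a character can be tested against the list like this:

```java
// Illustrative sketch only: PunctuationTypes and isPunctuation are hypothetical names.
final class PunctuationTypes {

    /** Returns true if the character falls in one of the eight categories listed above. */
    static boolean isPunctuation(char c) {
        switch (Character.getType(c)) {
            case Character.DASH_PUNCTUATION:      // 20
            case Character.START_PUNCTUATION:     // 21
            case Character.END_PUNCTUATION:       // 22
            case Character.CONNECTOR_PUNCTUATION: // 23
            case Character.OTHER_PUNCTUATION:     // 24
            case Character.MATH_SYMBOL:           // 25
            case Character.CURRENCY_SYMBOL:       // 26
            case Character.MODIFIER_SYMBOL:       // 27
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        // Print the category code and the classification for a few sample characters.
        for (char c : "-([)]_!.~$^ a1".toCharArray()) {
            System.out.println("'" + c + "' type=" + Character.getType(c)
                    + " punctuation=" + isPunctuation(c));
        }
    }
}
```

Letters, digits, and spaces belong to other categories, so the check leaves them alone.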

    No effect on the -m option. "none" is added at the end of the output.

  • Differences:
    1. The Java version trims output terms (removes spaces at the beginning and end of the term).
    2. The C version replaces "-" with a space even when the "-" is adjacent to a space, while the Java version just strips it.
    3. Results differ when testing diacritics, such as \345\346... in the unit test.
    4. The "'" in a genitive is stripped. This feature may need further discussion.


  • Features:

    Strip a character from the input term if the character belongs to the punctuation list above, except in the following cases:

    1. Floating number: such as "1.25" and "-23.38".
    2. Negative integer: such as "-23".
    3. Date: such as "10/12/97" or "10-12-00".
    4. Telephone: such as "301-435-3170" or "301.435.3170".
    5. Catalog: such as "007.12.1234.07" or "007-12-1234-07".
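
One rough way to approximate these exceptions at the character level is the rule from the full description: keep a punctuation character only when it sits between digits or just before a digit at the start of a token. The sketch below makes that assumption. It is not the lvg implementation, and the class and method names are hypothetical.

```java
// Illustrative sketch only: a simplified character-level rule, not the lvg source.
final class KeepNumericPunctuation {

    // Assumed punctuation test: Character categories 20 through 27, as listed in the description.
    static boolean isPunctuation(char c) {
        int type = Character.getType(c);
        return type >= Character.DASH_PUNCTUATION && type <= Character.MODIFIER_SYMBOL;
    }

    static String strip(String term) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (!isPunctuation(c)) {
                out.append(c);
                continue;
            }
            boolean digitNext = i + 1 < term.length() && Character.isDigit(term.charAt(i + 1));
            boolean digitPrev = i > 0 && Character.isDigit(term.charAt(i - 1));
            boolean tokenStart = i == 0 || Character.isWhitespace(term.charAt(i - 1));
            // Keep punctuation between digits ("1.25") or leading a number ("-23").
            if (digitNext && (digitPrev || tokenStart)) {
                out.append(c);
            }
        }
        // The Java version is described as trimming the output term.
        return out.toString().trim();
    }

    public static void main(String[] args) {
        for (String term : new String[] {"-12.3", "10/12/97", "301-435-2134", "St. John's"}) {
            System.out.println(term + " -> " + strip(term));
        }
    }
}
```

This rule is broader than the five listed cases. For example, it would also keep the "$" in "$5", so it should be read as an approximation only.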


  • Symbol: P

  • Examples:
    
    shell> lvg -f:P
    -12.3
    -12.3|-12.3|2047|16777215|P|1|
    
    10/12/97
    10/12/97|10/12/97|2047|16777215|P|1|
    
    301-435-2134
    301-435-2134|301-435-2134|2047|16777215|P|1|
    
    St. John's
    St. John's|St Johns|2047|16777215|P|1|
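
Each output line above is a set of fields separated by "|". The sketch below splits one such line. The field labels in the comments (input term, output term, category, inflection, flow history, flow number) are an assumption about lvg's default output layout, and the class name is hypothetical.

```java
// Illustrative sketch only: field meanings are assumed, not confirmed by this page.
final class LvgOutputLine {

    /** Splits one lvg output line on '|'; the trailing empty field is dropped by split(). */
    static String[] fields(String line) {
        return line.split("\\|");
    }

    public static void main(String[] args) {
        String[] f = fields("10/12/97|10/12/97|2047|16777215|P|1|");
        System.out.println("input term   : " + f[0]);
        System.out.println("output term  : " + f[1]);
        System.out.println("category     : " + f[2]);
        System.out.println("inflection   : " + f[3]);
        System.out.println("flow history : " + f[4]);
        System.out.println("flow number  : " + f[5]);
    }
}
```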
    

  • Implementation Logic:
    1. Tokenize each word from the input term.
    2. Check if words are float numbers.
    3. Check if words are dates.
    4. Check if words are telephone numbers.
    5. Check if words are catalog numbers.
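
The steps above can be sketched as a word-level pass: split the term on whitespace, leave a word untouched if it matches one of the protected patterns, and otherwise drop its punctuation characters. The regular expressions and all names below are assumptions made for illustration. The checks in ToStripPunctuationEnhanced.java may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch only: hypothetical patterns and names, not the lvg source.
final class StripPunctuationSketch {

    private static final Pattern[] PROTECTED = {
        Pattern.compile("-?\\d+(\\.\\d+)?"),                  // numbers: "1.25", "-23.38", "-23"
        Pattern.compile("\\d{1,2}([/-])\\d{1,2}\\1\\d{2,4}"), // dates: "10/12/97", "10-12-00"
        Pattern.compile("\\d{3}([-.])\\d{3}\\1\\d{4}"),       // telephone: "301-435-3170"
        Pattern.compile("\\d+([-.]\\d+){2,}")                 // catalog: "007.12.1234.07"
    };

    static boolean isProtected(String word) {
        for (Pattern p : PROTECTED) {
            if (p.matcher(word).matches()) {
                return true;
            }
        }
        return false;
    }

    // Removes characters in Character categories 20 through 27 (the punctuation list).
    static String stripWord(String word) {
        StringBuilder sb = new StringBuilder();
        for (char c : word.toCharArray()) {
            int type = Character.getType(c);
            boolean punctuation =
                type >= Character.DASH_PUNCTUATION && type <= Character.MODIFIER_SYMBOL;
            if (!punctuation) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    static String transform(String term) {
        List<String> out = new ArrayList<>();
        for (String word : term.trim().split("\\s+")) {
            String result = isProtected(word) ? word : stripWord(word);
            if (!result.isEmpty()) { // a choice made here: drop words that were all punctuation
                out.add(result);
            }
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        for (String term : new String[] {"-12.3", "10/12/97", "301-435-2134", "St. John's"}) {
            System.out.println(term + " -> " + transform(term));
        }
    }
}
```

Because a whole word must match a pattern, a number with trailing punctuation such as "1.25," is not protected in this sketch and would lose its decimal point.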

  • Source Code: ToStripPunctuationEnhanced.java

  • Hierarchy: Object -> Transformation -> ToStripPunctuationEnhanced