Lexical Tools

Strip or Map Unicode to ASCII

  • Short Description: Convert input Unicode characters to ASCII characters by stripping or mapping non-ASCII Unicode characters.

  • Full Description:

    This flow converts Unicode characters to ASCII characters. Some Unicode characters cannot be converted to ASCII by the other Unicode normalization algorithms, such as strip diacritics, split ligatures, symbol mapping, or Unicode mapping. During normalization, these characters are either:

    • stripped, because they are symbols or typos (meaningless in NLP), or
    • mapped to ASCII characters, because they are known Unicode characters in users' NLP projects.

    This flow component is not designed to be used by itself. Instead, it is usually applied at the very end of normalization as a final tune-up (along with other flow components). The mapping table is defined in the file $LVG/data/Unicode/nonStripMap.data. Users may add to or modify the default set in this file for their applications. Please refer to the design documents of Strip or Map Unicode to ASCII for details.

    When the -m flag is specified, the detailed mutate operations for each character of the input string are added after the standard set of lvg output fields. This flow has three basic mutate operations, as shown in the following table:

    Operation   Description            Example
    NO          No operation           A -> A
    MP          Table lookup mapping   ɑ -> alpha
    SP          Stripped               ™ ->
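
    The three operation codes can be illustrated with a minimal Java sketch that labels each character of an input string as NO, MP, or SP. This is not the code in ToStripMapUnicode.java, and it does not reproduce the actual format of the -m output. The class name MutateTraceSketch, the Op enum, and the trace method are hypothetical names chosen for illustration, and the mapping table is passed in as a plain Map instead of being read from nonStripMap.data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the lvg API.
final class MutateTraceSketch {
    // Operation codes taken from the table above.
    enum Op { NO, MP, SP }

    // Returns one operation code per Unicode code point of the input.
    // 'table' is an assumed stand-in for the non-strip mapping table.
    static List<Op> trace(String input, Map<Integer, String> table) {
        List<Op> ops = new ArrayList<>();
        input.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                ops.add(Op.NO);          // ASCII: kept unchanged
            } else if (table.containsKey(cp)) {
                ops.add(Op.MP);          // found in the table: mapped
            } else {
                ops.add(Op.SP);          // anything else: stripped
            }
        });
        return ops;
    }
}
```

    With a table that contains only the entry U+0251 (ɑ) -> alpha, the input "Aɑ™" would be labeled NO, MP, SP in that order.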

  • Difference:


  • Features:
    1. Convert Unicode characters in the input term to ASCII by stripping or mapping.

  • Symbol: q8

  • Examples:
    shell> lvg -f:q8 -m
    More examples

  • Implementation Logic:
    1. Check if the character is ASCII
      • if yes,
        => return the original input character
      • if no,
        => Check if the character is in the non-strip mapping table:
        • if yes, return the mapped ASCII character
        • if no, strip the non-ASCII Unicode character
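
    The logic above can be sketched in Java as follows. This is a minimal sketch under stated assumptions, not the actual ToStripMapUnicode.java source: the class name StripMapSketch, the method stripOrMap, and the constant NON_STRIP_MAP are hypothetical, and the single table entry is hard-coded for illustration instead of being loaded from $LVG/data/Unicode/nonStripMap.data, whose file format is not described here.

```java
import java.util.Map;

// Hypothetical illustration only; not part of the lvg API.
final class StripMapSketch {
    // Assumed stand-in for the non-strip mapping table.
    // The one entry mirrors the "ɑ -> alpha" example (U+0251).
    static final Map<Integer, String> NON_STRIP_MAP = Map.of(0x0251, "alpha");

    // Applies the per-character check to every code point of the input.
    static String stripOrMap(String input, Map<Integer, String> table) {
        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                // ASCII: return the original character.
                out.appendCodePoint(cp);
            } else if (table.containsKey(cp)) {
                // Non-ASCII and in the table: emit the mapped ASCII string.
                out.append(table.get(cp));
            }
            // Non-ASCII and not in the table: stripped (nothing appended).
        });
        return out.toString();
    }
}
```

    For example, StripMapSketch.stripOrMap("Aɑ™", StripMapSketch.NON_STRIP_MAP) keeps "A", maps "ɑ" to "alpha", and drops "™". Iterating over code points instead of char values keeps supplementary characters from being treated as two separate units.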

  • Source Code: ToStripMapUnicode.java

  • Hierarchy: Object -> Transformation -> ToStripMapUnicode