Lexical Tools

Map Unicode to ASCII

  • Short Description: Converts input Unicode characters to ASCII characters.

  • Full Description:

    This flow converts Unicode characters to ASCII characters. Some Unicode characters cannot be converted to ASCII by Unicode normalization operations such as stripping diacritics or splitting ligatures. These characters are normalized to ASCII by table lookup mapping. The mapping table is defined in the file $LVG/data/Unicode/unicodeMap.data. Users may add to or modify the default set in this file for their applications. Please refer to the design documents of Map Unicode to ASCII for details.

    When the -m flag is specified, the detailed mutate operations for each character of the input string are added after the standard set of lvg output fields. This flow uses two basic mutate operations to normalize Unicode to ASCII, as shown in the following table:

    Operation | Description          | Example
    NO        | No operation         | ø -> ø
    MP        | Table lookup mapping | Ƽ -> 5


  • Difference:

    None.

  • Features:
    1. Get the ASCII representation of the Unicode characters from the input term.


  • Symbol: q1

  • Examples:
    
    shell> lvg -f:q1 -m
    ⅝
    ⅝|5/8|2047|16777215|q1|1|MP|
    

  • Implementation Logic:
    1. Check whether the character is in the Unicode mapping table:
      • if yes, return the mapped ASCII string
      • if no, return the original input Unicode character
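
    The lookup logic above can be sketched in a few lines of Java. This is a minimal illustration, not the code in ToMapUnicodeToAscii.java: the class name UnicodeToAsciiSketch, the Result type, and the in-memory table are hypothetical, and the two table entries are taken from the examples on this page rather than loaded from unicodeMap.data. Each input character is tagged MP when the table supplies a mapping and NO when it is passed through unchanged.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the q1 table lookup; not the lvg implementation.
public class UnicodeToAsciiSketch {

    /** Mapped text plus one operation tag (MP or NO) per input character. */
    public static final class Result {
        public final String text;
        public final List<String> operations;

        Result(String text, List<String> operations) {
            this.text = text;
            this.operations = operations;
        }
    }

    /**
     * Maps each code point of the input through the table.
     * Characters found in the table are replaced (MP);
     * all others are copied unchanged (NO).
     */
    public static Result map(String input, Map<Integer, String> table) {
        StringBuilder out = new StringBuilder();
        List<String> ops = new ArrayList<>();
        input.codePoints().forEach(cp -> {
            String mapped = table.get(cp);
            if (mapped != null) {
                out.append(mapped);
                ops.add("MP");
            } else {
                out.appendCodePoint(cp);
                ops.add("NO");
            }
        });
        return new Result(out.toString(), ops);
    }

    public static void main(String[] args) {
        // Assumed entries, mirroring the examples in this document.
        Map<Integer, String> table = new HashMap<>();
        table.put(0x215D, "5/8"); // ⅝ VULGAR FRACTION FIVE EIGHTHS
        table.put(0x01BC, "5");   // Ƽ LATIN CAPITAL LETTER TONE FIVE

        Result r = map("\u215D", table);
        System.out.println(r.text + " " + r.operations); // prints "5/8 [MP]"
    }
}
```

    Iterating by code point rather than by char keeps characters outside the Basic Multilingual Plane intact when they are copied through unchanged.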

  • Source Code: ToMapUnicodeToAscii.java

  • Hierarchy: Object -> Transformation -> ToMapUnicodeToAscii