Lexical Tools

Map Unicode to ASCII

  • Introduction:

    Some non-ASCII Unicode characters are:

    • not normalized by the Unicode normalization (decomposition) algorithm
    • not normalized by stripping diacritics
    • not normalized by splitting ligatures
    • not included in Unicode symbols and punctuation mapping tables

    For example, 'ɡ' (U+0261) and '⅞' (U+215E) can be mapped to 'g' and "7/8", respectively. Such non-ASCII Unicode characters are converted to ASCII by table mapping.

  • Algorithm:
    A table mapping method is applied to convert non-ASCII Unicode characters (alphabet) into ASCII for the cases listed above. The mapping is straightforward: each non-ASCII Unicode character is replaced with its assigned ASCII string. A configurable mapping table is used for this purpose. The default Unicode mapping table provided by Lexical Tools is located at ${LVG}/data/Unicode/unicodeMap.data. Its format is shown below:

    Unicode    Mapped ASCII    Char    Unicode Name
    U+01BD     5               ƽ       LATIN SMALL LETTER TONE FIVE

    Please note:

    • Field 1 must be a non-ASCII Unicode character, given as its hexadecimal code point (for example, U+01BD)
    • Field 2 must be an ASCII string
    • Fields 3 and 4 are the Unicode character and the Unicode name for field 1. They are for annotation only and are not used by the program.
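
    The field rules above can be illustrated with a small table loader. This is a minimal sketch, not the Lexical Tools implementation: the class name UnicodeMapTableSketch, the '|' field separator, and the skipping of blank lines are assumptions made for illustration, since this page shows only the four columns and not the raw file layout.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lexical Tools.
final class UnicodeMapTableSketch {
    // Assumption: fields are separated by '|'. The page only lists the columns.
    private static final String FIELD_SEPARATOR_REGEX = "\\|";

    /** Parses table lines into a map from code point to ASCII replacement. */
    static Map<Integer, String> parse(List<String> lines) {
        Map<Integer, String> table = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // assumption: blank lines are ignored
            }
            String[] fields = line.split(FIELD_SEPARATOR_REGEX, -1);
            if (fields.length < 2 || !fields[0].startsWith("U+")) {
                throw new IllegalArgumentException("Malformed line: " + line);
            }
            // Field 1: hexadecimal code point of a non-ASCII character.
            int codePoint = Integer.parseInt(fields[0].substring(2), 16);
            if (codePoint < 0x80) {
                throw new IllegalArgumentException("Field 1 must be non-ASCII: " + line);
            }
            // Field 2: the ASCII replacement string.
            String ascii = fields[1];
            if (!ascii.chars().allMatch(c -> c < 0x80)) {
                throw new IllegalArgumentException("Field 2 must be ASCII: " + line);
            }
            // Fields 3 and 4 (character and name) are annotation only and are ignored.
            table.put(codePoint, ascii);
        }
        return table;
    }
}
```

    Keying the map by code point rather than by char lets the same loader handle characters outside the Basic Multilingual Plane, should the table contain any.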

  • Java Code Implementation:
    • Perform mapping if the character is in the Unicode mapping table
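
    The mapping step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the actual Lexical Tools code: the class UnicodeToAsciiSketch and its map method are hypothetical names, the table is built in memory rather than loaded from unicodeMap.data, and characters missing from the table are assumed to pass through unchanged.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class for illustration, not part of Lexical Tools.
final class UnicodeToAsciiSketch {
    /** Replaces every code point found in the table with its ASCII string. */
    static String map(String input, Map<Integer, String> table) {
        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(cp -> {
            String mapped = table.get(cp);
            if (mapped != null) {
                out.append(mapped);       // character is in the mapping table
            } else {
                out.appendCodePoint(cp);  // assumption: unmapped characters are kept as-is
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> table = new HashMap<>();
        table.put(0x0261, "g");    // LATIN SMALL LETTER SCRIPT G
        table.put(0x215E, "7/8");  // VULGAR FRACTION SEVEN EIGHTHS
        table.put(0x01BD, "5");    // LATIN SMALL LETTER TONE FIVE
        System.out.println(map("\u0261ram \u215E", table)); // prints "gram 7/8"
    }
}
```

    Because one character may map to a multi-character string such as "7/8", the output is built with a StringBuilder instead of replacing characters in place.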

  • References: