Lexical Tools

Map Unicode to ASCII

  • Introduction:

    Some non-ASCII Unicode characters are:

    • not normalized by the Unicode normalization decomposition algorithm
    • not normalized by stripping diacritics
    • not normalized by splitting ligatures
    • not included in Unicode symbols and punctuation mapping tables

    For example, 'ɡ' (U+0261) and '⅞' (U+215E) can be mapped to 'g' and "7/8", respectively. Such non-ASCII Unicode characters are converted to ASCII by table mapping.
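
    To illustrate the first case in the list above, the following minimal sketch applies compatibility decomposition (NFKD) to both example characters and checks whether the result is pure ASCII. It uses the JDK's java.text.Normalizer as a stand-in, which is an assumption: the normalization steps inside Lexical Tools may differ. The class and method names are hypothetical.

```java
import java.text.Normalizer;

class NormalizationGapDemo {
    // True if every char in s is in the 7-bit ASCII range.
    static boolean isAscii(String s) {
        return s.chars().allMatch(c -> c < 128);
    }

    // Compatibility decomposition (NFKD) using the standard JDK normalizer.
    // Assumption: this only approximates what Lexical Tools does internally.
    static String nfkd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) {
        String scriptG = "\u0261";       // LATIN SMALL LETTER SCRIPT G
        String sevenEighths = "\u215E";  // VULGAR FRACTION SEVEN EIGHTHS

        // U+0261 has no decomposition, so it is returned unchanged.
        System.out.println(isAscii(nfkd(scriptG)));
        // U+215E decomposes to '7', FRACTION SLASH (U+2044), '8';
        // the fraction slash is still non-ASCII.
        System.out.println(isAscii(nfkd(sevenEighths)));
    }
}
```

    Both checks report that the result is not ASCII, which is why an explicit table entry is needed for characters like these.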

  • Algorithm:
    The table mapping method converts non-ASCII Unicode characters (alphabet) into ASCII for the cases listed above. The mapping is straightforward: each non-ASCII Unicode character is replaced with its assigned ASCII string. A configurable mapping table is used for this purpose. The default table, provided by Lexical Tools, is located at ${LVG}/data/Unicode/unicodeMap.data. Its format is shown below:

    Unicode | Mapped ASCII | Char | Unicode Name
    U+01BD  | 5            | ƽ    | LATIN SMALL LETTER TONE FIVE

    Please note:

    • Field 1 must be a non-ASCII Unicode character (given as its Unicode hex value)
    • Field 2 must be an ASCII string
    • Fields 3 and 4 are the Unicode character and the Unicode name of field 1. They are for annotation only (not used by the program).
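
    The field rules above can be sketched as a small parser for one table line. This is not the Lexical Tools implementation. The '|' field separator is an assumption (the text above does not state the delimiter), and the class name UnicodeMapEntry and its members are hypothetical.

```java
class UnicodeMapEntry {
    final int codePoint;   // field 1, parsed from "U+XXXX"
    final String ascii;    // field 2, the mapped ASCII string

    UnicodeMapEntry(int codePoint, String ascii) {
        this.codePoint = codePoint;
        this.ascii = ascii;
    }

    // Parses one table line. Assumes fields are separated by '|'.
    // Fields 3 and 4 (character and name) are annotation and are ignored.
    static UnicodeMapEntry parse(String line) {
        String[] fields = line.split("\\|", -1);
        if (fields.length < 2) {
            throw new IllegalArgumentException("Expected at least 2 fields: " + line);
        }
        String hex = fields[0].trim();
        if (!hex.startsWith("U+")) {
            throw new IllegalArgumentException("Field 1 must look like U+XXXX: " + hex);
        }
        int cp = Integer.parseInt(hex.substring(2), 16);
        if (cp < 128) {
            throw new IllegalArgumentException("Field 1 must be non-ASCII: " + hex);
        }
        String ascii = fields[1];
        if (!ascii.chars().allMatch(c -> c < 128)) {
            throw new IllegalArgumentException("Field 2 must be ASCII: " + ascii);
        }
        return new UnicodeMapEntry(cp, ascii);
    }

    public static void main(String[] args) {
        UnicodeMapEntry e = parse("U+01BD|5|\u01BD|LATIN SMALL LETTER TONE FIVE");
        System.out.println(Integer.toHexString(e.codePoint) + " -> " + e.ascii);
    }
}
```

    Rejecting lines that break the rules for fields 1 and 2 at load time keeps a bad table entry from silently producing non-ASCII output later.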

  • Java Code Implementation:
    • Perform mapping if the character is in the Unicode mapping table
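
    The step above might look like the following minimal sketch, which is not the actual Lexical Tools code. The class UnicodeToAsciiMapper and its methods are hypothetical, the table is filled by hand rather than loaded from unicodeMap.data, and passing unmapped characters through unchanged is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

class UnicodeToAsciiMapper {
    // Code point -> mapped ASCII string (hypothetical in-memory table).
    private final Map<Integer, String> table = new HashMap<>();

    void put(int codePoint, String ascii) {
        table.put(codePoint, ascii);
    }

    // Replaces each code point found in the table with its ASCII string.
    // Assumption: code points not in the table are copied unchanged.
    String map(String input) {
        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(cp -> {
            String mapped = table.get(cp);
            if (mapped != null) {
                out.append(mapped);
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        UnicodeToAsciiMapper mapper = new UnicodeToAsciiMapper();
        mapper.put(0x0261, "g");    // LATIN SMALL LETTER SCRIPT G
        mapper.put(0x215E, "7/8");  // VULGAR FRACTION SEVEN EIGHTHS
        mapper.put(0x01BD, "5");    // LATIN SMALL LETTER TONE FIVE
        System.out.println(mapper.map("\u0261ram \u215E"));  // prints: gram 7/8
    }
}
```

    Iterating over code points rather than char values keeps the lookup correct for characters outside the Basic Multilingual Plane, should the table ever contain them.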
