Lexical Tools

Map Unicode to ASCII

Introduction:
Some non-ASCII Unicode characters are:
- not normalized by the Unicode normalization decomposition algorithm
- not normalized by stripping diacritics
- not normalized by splitting ligatures
- not included in Unicode symbols and punctuation mapping tables
For example, 'ɡ' (U+0261) and '⅞' (U+215E) can be mapped to 'g' and "7/8", respectively. Such non-ASCII Unicode characters are converted to ASCII by table mapping.
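The gap described above can be checked with the standard java.text.Normalizer class. The following is a minimal sketch, not code from Lexical Tools; the class name NormalizationGapDemo and the helper unchangedByNfkd are hypothetical names chosen for illustration.

```java
import java.text.Normalizer;

// Hypothetical demo class; not part of Lexical Tools.
class NormalizationGapDemo {
    // Returns true when NFKD decomposition leaves the text unchanged,
    // i.e. Unicode normalization alone cannot bring it closer to ASCII.
    static boolean unchangedByNfkd(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKD).equals(text);
    }

    public static void main(String[] args) {
        String scriptG = "\u0261"; // 'ɡ' LATIN SMALL LETTER SCRIPT G
        String eAcute = "\u00E9";  // 'é' decomposes into 'e' + combining acute accent

        System.out.println("U+0261 unchanged by NFKD: " + unchangedByNfkd(scriptG));
        System.out.println("U+00E9 unchanged by NFKD: " + unchangedByNfkd(eAcute));
    }
}
```

A character such as 'ɡ' has no decomposition mapping, so it needs an explicit table entry to reach ASCII.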
Algorithm:
Table mapping is applied to convert non-ASCII Unicode characters (alphabet) into ASCII for the cases listed above. The method is straightforward: it replaces a non-ASCII Unicode character with its assigned ASCII string. A configurable mapping table is used for this purpose. The default Unicode mapping table provided by Lexical Tools is located at ${LVG}/data/Unicode/unicodeMap.data. Its format is shown below:

Unicode    Mapped ASCII    Char    Unicode Name
U+01BD     5               ƽ       LATIN SMALL LETTER TONE FIVE

Please note:
- Field 1 must be a non-ASCII Unicode character (given as a Unicode hex value, e.g. U+01BD)
- Field 2 must be an ASCII string
- Fields 3 and 4 are the Unicode character and the Unicode name of field 1. They serve as annotation only and are not used by the program.
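The field rules above could be enforced when reading a table entry, roughly as sketched below. This is an illustration under stated assumptions, not the Lexical Tools loader: the '|' field delimiter is an assumption (the delimiter is not stated here), and the names UnicodeMapLineParser, Entry, and parse are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical parser for one line of a Unicode mapping table.
class UnicodeMapLineParser {
    // ASSUMPTION: fields are separated by '|'; adjust to the actual file format.
    private static final String DELIMITER = "|";

    /** Holds fields 1 and 2; fields 3 and 4 are annotation and are ignored. */
    static final class Entry {
        final int codePoint;
        final String ascii;

        Entry(int codePoint, String ascii) {
            this.codePoint = codePoint;
            this.ascii = ascii;
        }
    }

    static Entry parse(String line) {
        String[] fields = line.split(Pattern.quote(DELIMITER), -1);
        if (fields.length < 2) {
            throw new IllegalArgumentException("Expected at least 2 fields: " + line);
        }

        // Field 1: non-ASCII Unicode character written as a hex value, e.g. U+01BD.
        String hex = fields[0].trim();
        if (!hex.startsWith("U+")) {
            throw new IllegalArgumentException("Field 1 must start with U+: " + hex);
        }
        int codePoint = Integer.parseInt(hex.substring(2), 16);
        if (codePoint <= 0x7F) {
            throw new IllegalArgumentException("Field 1 must be non-ASCII: " + hex);
        }

        // Field 2: the replacement, which must consist of ASCII characters only.
        String ascii = fields[1];
        if (!ascii.chars().allMatch(c -> c <= 0x7F)) {
            throw new IllegalArgumentException("Field 2 must be ASCII: " + ascii);
        }
        return new Entry(codePoint, ascii);
    }

    public static void main(String[] args) {
        Entry e = parse("U+01BD|5|\u01BD|LATIN SMALL LETTER TONE FIVE");
        System.out.println(String.format("U+%04X -> %s", e.codePoint, e.ascii));
    }
}
```

Rejecting malformed lines at load time keeps the lookup step simple, since every stored entry is then known to map a non-ASCII code point to an ASCII string.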
Java Code Implementation:
- Perform mapping if the character is in the Unicode mapping table
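The lookup step can be sketched as follows. This is a minimal illustration, not the actual Lexical Tools implementation: the class UnicodeTableMapper and its toAscii method are hypothetical names, the table is filled in by hand rather than loaded from unicodeMap.data, and leaving unmapped characters unchanged is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical table-based mapper; not the Lexical Tools class.
class UnicodeTableMapper {
    private final Map<Integer, String> table;

    UnicodeTableMapper(Map<Integer, String> table) {
        this.table = new HashMap<>(table);
    }

    // Replaces every code point found in the table with its ASCII string.
    // ASSUMPTION: code points without a table entry are copied through as-is.
    String toAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            String mapped = table.get(cp);
            if (mapped != null) {
                out.append(mapped);
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> table = new HashMap<>();
        table.put(0x0261, "g");   // 'ɡ' LATIN SMALL LETTER SCRIPT G
        table.put(0x215E, "7/8"); // '⅞' VULGAR FRACTION SEVEN EIGHTHS
        table.put(0x01BD, "5");   // 'ƽ' LATIN SMALL LETTER TONE FIVE

        UnicodeTableMapper mapper = new UnicodeTableMapper(table);
        System.out.println(mapper.toAscii("\u0261ram \u215E")); // prints "gram 7/8"
    }
}
```

Iterating by code point rather than by char keeps the lookup correct for characters outside the Basic Multilingual Plane, should the table ever contain them.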
References:
- Unicode