Normalize Unicode to ASCII
I. Introduction:
Unicode is designed to be a universal character set that includes all of the major scripts of the world. It allows data to be transported through many different systems without corruption, which makes it very useful for multilingual NLP.
Non-ASCII Unicode characters, such as diacritics, ligatures, punctuation, and symbols, are commonly seen even in English documents. For example, © is used for the copyright sign and ® for the registered sign.
Accordingly, UTF-8 has been the default standard input and output format for the Lexical tools since 2005.
ASCII (American Standard Code for Information Interchange) is the most commonly used standard code for information interchange and communication between data processing systems. The ASCII character set contains 128 7-bit coded characters, including alphabetic, numeric, control, and graphic characters. Even though Unicode is widely used these days, many NLP projects still deal only with ASCII. Thus, there is a need for the Lexical tools to provide users with a way to convert characters from Unicode (UTF-8) to ASCII (7-bit).
II. Definition:
The goal is to provide a tool that converts non-ASCII characters from Unicode (UTF-8) to ASCII (code point value < 128, that is, U+0000 through U+007F). The normalized result should not change the meaning of the original Unicode character. The normalization algorithm and results are described as follows.
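The "value < 128" criterion can be checked directly on code points. The following is a minimal sketch in Python, not part of the Lexical tools; the function names `is_ascii` and `non_ascii_chars` are hypothetical and chosen only for illustration.

```python
def is_ascii(text: str) -> bool:
    """Return True if every character has a code point below 128 (U+0000..U+007F)."""
    return all(ord(ch) < 128 for ch in text)


def non_ascii_chars(text: str) -> list[tuple[str, str]]:
    """List each non-ASCII character together with its U+XXXX notation."""
    return [(ch, f"U+{ord(ch):04X}") for ch in text if ord(ch) >= 128]


print(is_ascii("Lexical tools"))         # True
print(non_ascii_chars("\u00a9 2005"))    # [('©', 'U+00A9')]
```

A check like this is useful for deciding whether a term needs normalization at all: pure ASCII input can be passed through unchanged.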
III. Norm Guidelines:
Two fundamental principles, semantic similarity and graphic similarity, are used as the guidelines for normalizing non-ASCII Unicode to ASCII.
The table below illustrates examples for the combinations of the above two guidelines. Please note that different applications might apply different normalization guidelines.
Semantic | Graphic | Norm? | Example | Notes |
---|---|---|---|---|
Similar | Similar | Yes | U+0100: [Ā] to [A] | Strip diacritics |
Similar | Similar | Yes | U+00BD: [½] to [1/2] | Split ligature |
Similar | Similar | Yes | U+201C: [“] to ["] | Punctuation mapping |
Similar | Similar | Yes | U+0406: [І] to [I] | Alphabet mapping |
Similar | Similar | Yes | U+FF2B: [Ｋ] to [K] | Fullwidth letters |
Similar | Different | Yes | U+00AB: [«] to ["] | Punctuation mapping |
Similar | Different | Yes | U+00A9: [©] to [(c)] | Symbol mapping |
Similar | Different | Yes | U+00B0: [°] to [(degree)] | Symbol mapping |
Similar | Different | Yes | U+03B1: [α] to [(alpha)] | Alphabet mapping |
Different | Similar | Yes | U+00D7: [×] to [*] | Symbol mapping |
Similar | Different | Yes | U+03BC: [μ] to [(mu)] | Alphabet mapping |
Different | Similar | Yes | U+2190: [←] to [<-] | Commonly used |
Different | Similar | Yes/No | U+00B5: [µ] to [u] | "ul" is used for microliter |
Different | Similar | Yes/No | U+2022: [•] to [*] | Use * for bullet? |
Different | Similar | Yes/No | U+03BC: [μ] to U+00B5: [µ] | Commonly used synonym or typo? |
Different | Similar | Yes/No | U+00DF: [ß] to U+03B2: [β] | Commonly used synonym or typo? |
Different | Similar | Yes/No | U+00B6: [¶] to U+03C0: [π] | Commonly used synonym or typo? |
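
The guideline examples above can be read as a lookup table in which the "Yes" rows always apply and the "Yes/No" rows are left to the application. The sketch below illustrates that idea in Python under stated assumptions: the names `ALWAYS_MAP`, `OPTIONAL_MAP`, and `apply_guidelines` are hypothetical, the tables contain only a subset of the rows shown above, and this is not how the Lexical tools store their mappings.

```python
# Rows marked "Yes" in the table above (subset, for illustration only).
ALWAYS_MAP = {
    "\u0100": "A",          # Ā: strip diacritics
    "\u00BD": "1/2",        # ½: split ligature
    "\u201C": '"',          # “: punctuation mapping
    "\u00AB": '"',          # «: punctuation mapping
    "\u00A9": "(c)",        # ©: symbol mapping
    "\u00B0": "(degree)",   # °: symbol mapping
    "\u03B1": "(alpha)",    # α: alphabet mapping
    "\u2190": "<-",         # ←: commonly used
}

# Rows marked "Yes/No": whether to apply them depends on the application.
OPTIONAL_MAP = {
    "\u00B5": "u",          # µ: "ul" is used for microliter
    "\u2022": "*",          # •: use * for bullet?
}


def apply_guidelines(text: str, include_optional: bool = False) -> str:
    """Replace each character found in the chosen tables; keep all others."""
    table = dict(ALWAYS_MAP)
    if include_optional:
        table.update(OPTIONAL_MAP)
    return "".join(table.get(ch, ch) for ch in text)


print(apply_guidelines("25\u00B0"))                          # 25(degree)
print(apply_guidelines("5 \u00B5l", include_optional=True))  # 5 ul
```

Keeping the debatable mappings in a separate table makes the "different applications, different guidelines" point explicit: a caller opts in to them rather than receiving them by default.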
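Normalization based on the semantic or graphic similarity principle is the core operation for Unicode to ASCII normalization. This is called core normalization. It is performed by combining the operations of mapping symbols and punctuation, mapping Unicode to ASCII, splitting ligatures, and stripping diacritics (see the combined flow -f:q7 in Section IV).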
IV. Norm Operations:
Lvg Flow | Descriptions | Abbreviation |
---|---|---|
N/A | No Operation (for ASCII) | NO |
-f:q | Strip Diacritics | SD |
-f:q0 | Map Symbols & Punctuation to ASCII | MS |
-f:q1 | Map Unicode to ASCII | MU |
-f:q2 | Split Ligatures | SL |
-f:q3 | Get Unicode Name | UN |
-f:q4 | Get Unicode Synonym | US |
-f:q8 | Strip or Map Unicode | SM |
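
Two of the operations above, Strip Diacritics and Get Unicode Name, have close analogues in Python's standard `unicodedata` module. The sketch below is only an approximation of what those operations do, under the assumption that stripping diacritics means removing combining marks after canonical decomposition; the actual Lvg flows may use their own tables and may format Unicode names differently. The function names are hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def unicode_name(ch: str) -> str:
    """Return the Unicode character name, or U+XXXX if the character is unnamed."""
    return unicodedata.name(ch, f"U+{ord(ch):04X}")


print(strip_diacritics("\u0100"))   # A
print(unicode_name("\u00A9"))       # COPYRIGHT SIGN
```

Decomposition alone does not reach characters such as ½ or ©, which is why the separate mapping and ligature-splitting operations exist alongside diacritic stripping.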
Lvg Flow | Descriptions | Combined Flows |
---|---|---|
-f:q7 | Unicode Core Norm (based on semantic & graphic similarity) | -f:q0:q1:q2:q, recursively |
-f:q5 | Norm Unicode to ASCII | -f:q7:q3 |
-f:q6 | Norm Unicode to ASCII with synonym option | -f:q4:q7:q3 |
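
The combined flows can be pictured as a pipeline: the core steps are applied repeatedly until the result stops changing, and any character that is still non-ASCII is then replaced by its Unicode name. The Python sketch below illustrates this under several assumptions: the three small mapping tables are stand-ins for -f:q0, -f:q1, and -f:q2, the `æ` entry is a hypothetical addition not taken from the tables above, the parenthesized name format is invented for illustration, and the real Lvg recursion may work differently (for example, per character).

```python
import unicodedata

SYMBOL_MAP = {"\u00A9": "(c)", "\u201C": '"', "\u00AB": '"'}   # stand-in for -f:q0
UNICODE_MAP = {"\u03B1": "(alpha)", "\u0406": "I"}             # stand-in for -f:q1
LIGATURE_MAP = {"\u00BD": "1/2", "\u00E6": "ae"}               # stand-in for -f:q2


def map_with(table):
    """Build a step that replaces each character found in the table."""
    return lambda text: "".join(table.get(ch, ch) for ch in text)


def strip_diacritics(text):
    """Stand-in for -f:q: decompose (NFD) and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


CORE_STEPS = [map_with(SYMBOL_MAP), map_with(UNICODE_MAP),
              map_with(LIGATURE_MAP), strip_diacritics]


def core_norm(text, max_rounds=10):
    """Apply the core steps in order, repeating until nothing changes."""
    for _ in range(max_rounds):
        result = text
        for step in CORE_STEPS:
            result = step(result)
        if result == text:
            break
        text = result
    return text


def norm_to_ascii(text):
    """Core norm first, then replace leftover non-ASCII characters by name."""
    out = []
    for ch in core_norm(text):
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.append("(" + unicodedata.name(ch, f"U+{ord(ch):04X}") + ")")
    return "".join(out)


print(core_norm("\u01FD"))       # ae
print(norm_to_ascii("\u2022"))   # (BULLET)
```

The first example shows why repetition matters: U+01FD (ǽ) only becomes `æ` after the diacritic is stripped, so the ligature split cannot succeed until a second pass.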