Lexical Tools

Normalize Unicode to ASCII

I. Introduction:
Unicode is designed to be a universal character set that includes all of the major scripts of the world. It allows data to be transported through many different systems without corruption, which makes it very useful for multilingual NLP. Non-ASCII Unicode characters, such as diacritics, ligatures, punctuation, and symbols, are common even in English documents. For example, © is used for copyright and ® for the registered sign. Accordingly, UTF-8 has been the default standard input and output format for the Lexical Tools since 2005.

ASCII (American Standard Code for Information Interchange) is the most commonly used standard code for information interchange and communication between data processing systems. The ASCII character set contains 128 7-bit coded characters, including alphabetic, numeric, control, and graphic characters. Even though Unicode is widely used these days, many NLP projects still deal only with ASCII. Thus, the Lexical Tools need to provide users a way to convert characters from Unicode (UTF-8) to ASCII (7-bit).

II. Definition:
The goal is to provide a tool that converts non-ASCII characters from Unicode (UTF-8) to ASCII (code point values below 128, that is, U+0000 through U+007F). The normalized result should not change the meaning of the original Unicode character. The normalization algorithm and results are described as follows.
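
The ASCII criterion in this definition (code point value below 128) can be checked directly. The following is a minimal Python sketch, not part of the Lexical Tools; the function names `is_ascii` and `non_ascii_code_points` are hypothetical and chosen only for illustration.

```python
def is_ascii(text):
    """Return True when every code point is below 128 (U+0000..U+007F)."""
    return all(ord(ch) < 128 for ch in text)


def non_ascii_code_points(text):
    """List distinct non-ASCII characters as U+XXXX labels, in order of first appearance."""
    labels = []
    for ch in text:
        label = f"U+{ord(ch):04X}"
        if ord(ch) >= 128 and label not in labels:
            labels.append(label)
    return labels
```

A check like this could be used to decide whether a term needs normalization at all, since pure ASCII input requires no operation.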

III. Norm Guidelines:
Two fundamental principles guide the normalization of non-ASCII Unicode to ASCII:

Similar semantic representation: the result represents the same meaning
Similar graphic representation: the result has a similar graphic appearance

The tables below illustrate examples for the combinations of the above two guidelines. Please note that different applications might apply different normalization guidelines.

Examples: similar in both semantic and graphic representations

Semantic	Graphic	Norm?	Example	Notes
Similar	Similar	Yes	U+0100: [Ā] to [A]	Strip diacritics
Similar	Similar	Yes	U+00BD: [½] to [1/2]	Split ligature
Similar	Similar	Yes	U+201C: [“] to ["]	Punctuation mapping
Similar	Similar	Yes	U+0406: [І] to [I]	Alphabet mapping
Similar	Similar	Yes	U+FF2B: [Ｋ] to [K]	Fullwidth letters

Examples: similar in semantic; not in graphic representations

Semantic	Graphic	Norm?	Example	Notes
Similar	Different	Yes	U+00AB: [«] to ["]	Punctuation mapping
Similar	Different	Yes	U+00A9: [©] to [(c)]	Symbol mapping
Similar	Different	Yes	U+00B0: [°] to [(degree)]	Symbol mapping
Similar	Different	Yes	U+03B1: [α] to [(alpha)]	Alphabet mapping
Similar	Different	Yes	U+00D7: [×] to [*]	Symbol mapping
Similar	Different	Yes	U+03BC: [μ] to [(mu)]	Alphabet mapping

Examples: similar in graphic; not in semantic representations

Semantic	Graphic	Norm?	Example	Notes
Different	Similar	Yes	U+2190: [←] to [<-]	Commonly used
Different	Similar	Yes/No	U+00B5: [µ] to [u]	"ul" is used for microliter
Different	Similar	Yes/No	U+2022: [•] to [*]	Use * for bullet?
Different	Similar	Yes/No	U+03BC: [μ] to U+00B5: [µ]	Commonly used synonym or typo?
Different	Similar	Yes/No	U+00DF: [ß] to U+03B2: [β]	Commonly used synonym or typo?
Different	Similar	Yes/No	U+00B6: [¶] to U+03C0: [π]	Commonly used synonym or typo?

Normalization based on the semantic or graphic similarity principle is the core operation for Unicode to ASCII normalization. This is called core normalization and is performed by applying the following operations:

Symbols and punctuation mapping
Unicode mapping
Split ligatures
Strip diacritics

recursively until no further normalized results can be found. Core normalization is the most robust operation and is provided as the lvg flow component Unicode Core Norm (-f:q7) in the Lexical Tools.
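
The recursive idea behind core normalization can be sketched in Python. This is only an illustration under stated assumptions, not the Lexical Tools implementation: the three mapping tables are tiny hypothetical stand-ins (most entries are taken from the example tables above; the æ entry is added here to show recursion), and diacritic stripping is approximated with Unicode canonical decomposition (NFD) from the standard `unicodedata` module.

```python
import unicodedata

# Hypothetical, deliberately tiny mapping tables (not the real lvg tables).
SYMBOL_MAP = {"\u201C": '"', "\u00AB": '"', "\u00A9": "(c)",
              "\u00B0": "(degree)", "\u00D7": "*", "\u2190": "<-"}
UNICODE_MAP = {"\u0406": "I", "\uFF2B": "K",
               "\u03B1": "(alpha)", "\u03BC": "(mu)"}
LIGATURE_MAP = {"\u00BD": "1/2", "\u00E6": "ae"}  # the ae entry is an added illustration


def strip_diacritics(ch):
    """Decompose the character (NFD) and drop its combining marks."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def core_norm_char(ch):
    """Apply one pass: symbol map, Unicode map, ligature split, then diacritic strip."""
    if ord(ch) < 128:
        return ch  # ASCII needs no operation
    for table in (SYMBOL_MAP, UNICODE_MAP, LIGATURE_MAP):
        if ch in table:
            return table[ch]
    return strip_diacritics(ch)


def core_norm(text, max_passes=10):
    """Repeat single passes until the text stops changing (or a pass limit is hit)."""
    for _ in range(max_passes):
        result = "".join(core_norm_char(ch) for ch in text)
        if result == text:
            break
        text = result
    return text
```

In this sketch, U+01FD (ǽ) needs two passes: stripping the acute accent yields æ, which the ligature table then splits into "ae". Characters with no mapping and no diacritics are left unchanged, which is why a later step such as Get Unicode Name is still needed to reach pure ASCII.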

IV. Norm Operations:

Basic Norm Operations
There are 7 basic Norm operations used in the Lexical Tools to normalize non-ASCII Unicode characters to ASCII. Most Unicode to ASCII normalization can be achieved by combining different basic operations in different orders. The following table shows the lvg flows and other information for these seven basic Norm operations.

Lvg Flow	Descriptions	Abbreviation
	No Operation (for ASCII)	NO
-f:q	Strip Diacritics	SD
-f:q0	Map Symbols & Punctuation to ASCII	MS
-f:q1	Map Unicode to ASCII	MU
-f:q2	Split Ligatures	SL
-f:q3	Get Unicode Name	UN
-f:q4	Get Unicode Synonym	US
-f:q8	Strip or Map Unicode	SM
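
As a rough illustration of what a "Get Unicode Name" style operation works with, the official character names can be looked up with Python's standard `unicodedata` module. This is a sketch only: the helper names are hypothetical, and the output format of the actual lvg flow (-f:q3) is not specified here and may differ.

```python
import unicodedata


def unicode_name(ch):
    """Return the official Unicode character name, or a U+XXXX label if the character is unnamed."""
    return unicodedata.name(ch, f"U+{ord(ch):04X}")


def describe_non_ascii(text):
    """Pair each non-ASCII character in the text with its code point label and name."""
    return [(f"U+{ord(ch):04X}", unicode_name(ch)) for ch in text if ord(ch) >= 128]
```

For instance, `describe_non_ascii("5 \u00B5l")` reports the micro sign as `("U+00B5", "MICRO SIGN")`, which makes clear that it is a different character from the Greek letter mu (U+03BC).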

Combined Norm Operations
Different combinations of the above basic Norm operations can be used for different NLP applications. The Lexical Tools provide the 3 most commonly used combined operations, as shown in the following table:

Lvg Flow	Descriptions	Combined Flows
-f:q7	Unicode Core Norm (based on semantic & graphic similarity)	-f:q0:q1:q2:q, recursively
-f:q5	Norm Unicode to ASCII	-f:q7:q3
-f:q6	Norm Unicode to ASCII with synonym option	-f:q4:q7:q3
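
The way combined flows chain basic operations (for example, -f:q5 as core norm followed by Unicode name) can be modeled as function composition. The sketch below is an assumption-laden illustration, not the lvg implementation: `compose` and `until_stable` are hypothetical helpers, the core step uses only two simplified stand-in operations instead of the four listed for -f:q7, and the bracketed name format is invented for readability.

```python
import unicodedata
from functools import reduce


def compose(*steps):
    """Chain steps left to right, the way a flow such as a:b:c is read."""
    def run(text):
        return reduce(lambda acc, step: step(acc), steps, text)
    return run


def until_stable(step, max_passes=10):
    """Re-apply a step until its output stops changing (the 'recursively' part)."""
    def run(text):
        for _ in range(max_passes):
            new_text = step(text)
            if new_text == text:
                break
            text = new_text
        return text
    return run


def map_symbols(text):
    """Stand-in for symbol mapping (MS): only the copyright sign is handled."""
    return text.replace("\u00A9", "(c)")


def strip_diacritics(text):
    """Stand-in for strip diacritics (SD), via NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def unicode_name(text):
    """Stand-in for Unicode name (UN): replace leftovers with a bracketed name (assumed format)."""
    return "".join(c if ord(c) < 128 else "[" + unicodedata.name(c, "UNKNOWN") + "]"
                   for c in text)


# Simplified analogue of core norm, then of core norm followed by Unicode name.
core_norm = until_stable(compose(map_symbols, strip_diacritics))
norm_to_ascii = compose(core_norm, unicode_name)
```

Placing the name step last mirrors the table above: the core operations get the first chance to produce a natural ASCII form, and only characters they cannot handle fall back to a name.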