Lexical Tools

Unicode Core Norm

Introduction:
Non-ASCII Unicode characters are normalized to ASCII on the basis of semantic or graphic similarity. Four basic operations are applied recursively:

Symbols and punctuation mapping
Unicode mapping
Split ligatures
Strip diacritics

until:

the characters are normalized to ASCII, or
no further normalization can be done (ASCII or non-ASCII)

Some Unicode characters require multiple normalization steps to convert to ASCII. For example:

Input	Intermediate Unicode/ASCII Results	Final ASCII Results	Operations
¾ [U+00BE]	3 [U+0033] + ⁄ [U+2044] + 4 [U+0034]	3 [U+0033] + / [U+002F] + 4 [U+0034]	SL+AS+SM+AS
ǆ [U+01C6]	d [U+0064] + ž [U+017E]	d [U+0064] + z [U+007A]	SL+AS+SD+AS
Ǣ [U+01E2]	Æ [U+00C6]	A [U+0041] + E [U+0045]	SD+SL+AS+AS
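
The intermediate results in the table can be loosely reproduced from the standard Unicode character database. The sketch below uses Python's unicodedata module, which is not necessarily how Lexical Tools builds its tables; the helper name decompose_once is hypothetical. It applies one level of Unicode decomposition and drops combining marks. Note that Æ [U+00C6] and ⁄ [U+2044] have no decomposition in the Unicode data, so separate ligature and symbol mappings are still needed to reach ASCII.

```python
import unicodedata


def decompose_once(ch: str) -> str:
    """Apply one level of Unicode decomposition and drop combining marks.

    Hypothetical helper for illustration; not part of Lexical Tools.
    """
    fields = unicodedata.decomposition(ch).split()
    # Skip a leading compatibility tag such as <compat> or <fraction>.
    code_points = [f for f in fields if not f.startswith("<")]
    if not code_points:
        return ch  # the Unicode database has no decomposition for ch
    chars = (chr(int(cp, 16)) for cp in code_points)
    return "".join(c for c in chars if not unicodedata.combining(c))


for ch in ("\u00be", "\u01c6", "\u01e2", "\u00c6"):
    result = decompose_once(ch)
    print("U+%04X ->" % ord(ch), " + ".join("U+%04X" % ord(c) for c in result))
```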

The output of this core norm may still contain non-ASCII Unicode characters. In principle, these characters have no mapping in the ASCII domain. If pure ASCII output is required, a further conversion step follows, such as converting each remaining non-ASCII character to ![Unicode name]! or stripping it. In our implementation, this core normalization converts the most commonly used (~1500) Unicode characters in the range U+0000 to U+FFFF to ASCII, as shown in the table of Unicode Core Norm results. This operation is the core of non-ASCII Unicode to ASCII normalization.
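
The follow-up conversion to pure ASCII might look like the sketch below. It is a minimal illustration, not the Lexical Tools implementation: the function name to_pure_ascii, the strip flag, and the fallback for unnamed code points are assumptions, and the ![...]! wrapper is taken literally from the text above.

```python
import unicodedata


def to_pure_ascii(text: str, strip: bool = False) -> str:
    """Replace or drop every remaining non-ASCII character.

    Hypothetical helper; the output format is an assumption based on
    the ![Unicode name]! notation used in the text.
    """
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)  # already ASCII, keep as-is
        elif not strip:
            # Fall back to the code point when the character has no name.
            name = unicodedata.name(ch, "U+%04X" % ord(ch))
            out.append("![" + name + "]!")
        # when strip is True, the non-ASCII character is dropped
    return "".join(out)


print(to_pure_ascii("a\u03b1b"))              # name substitution
print(to_pure_ascii("a\u03b1b", strip=True))  # stripping
```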

Algorithm:

If the recursion count is less than the maximum recursion count
=> Go through each character (curChar) in the string:
- if ASCII (AS)
  => increase curPos
- if in the punctuation mapping table (PM)
  => perform punctuation & symbol mapping
  => replace the curChar with mapped string
  => increase curPos accordingly
- if in the Unicode mapping table (UM)
  => perform Unicode to ASCII mapping
  => replace the curChar with mapped string
  => increase curPos accordingly
- if splittable ligature (SL)
  => replace the curChar with split string
- if strippable diacritic (SD)
  => replace the curChar with the diacritic-stripped character
- else (NO)
  => increase curPos
- Call itself again until curPos >= the length of the string
else
=> stop
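
The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Lexical Tools code: the four tables are tiny hypothetical stand-ins for the real data, the recursion is written as a loop, and the recursion limit is modeled as a cap (MAX_REWRITES, an assumed value) on successive rewrites at one position. The punctuation and symbol mapping step is recorded as PM, which appears to be the step the examples table labels SM.

```python
from typing import List, Tuple

# Hypothetical miniature tables for illustration only; the real
# Lexical Tools tables are far larger and are not reproduced here.
SYMBOL_MAP = {"\u2044": "/"}     # FRACTION SLASH -> "/"
UNICODE_MAP = {"\u00d8": "O"}    # assumed entry, for illustration
LIGATURES = {"\u00be": "3\u20444", "\u01c6": "d\u017e", "\u00c6": "AE"}
DIACRITICS = {"\u017e": "z", "\u01e2": "\u00c6"}
MAX_REWRITES = 10                # assumed cap; the source gives no value


def core_norm(text: str, max_rewrites: int = MAX_REWRITES) -> Tuple[str, List[str]]:
    """Return the normalized string and the list of operations applied."""
    ops: List[str] = []
    pos = 0
    rewrites = 0  # successive in-place rewrites at the current position
    while pos < len(text):
        ch = text[pos]
        if ord(ch) < 128:                        # AS: already ASCII
            ops.append("AS")
            pos, rewrites = pos + 1, 0
        elif rewrites >= max_rewrites:           # limit reached, give up
            ops.append("NO")
            pos, rewrites = pos + 1, 0
        elif ch in SYMBOL_MAP or ch in UNICODE_MAP:
            # PM / UM: replace, then skip past the mapped string
            table, op = (SYMBOL_MAP, "PM") if ch in SYMBOL_MAP else (UNICODE_MAP, "UM")
            mapped = table[ch]
            text = text[:pos] + mapped + text[pos + 1:]
            ops.append(op)
            pos, rewrites = pos + len(mapped), 0
        elif ch in LIGATURES or ch in DIACRITICS:
            # SL / SD: replace in place and re-examine the result
            table, op = (LIGATURES, "SL") if ch in LIGATURES else (DIACRITICS, "SD")
            text = text[:pos] + table[ch] + text[pos + 1:]
            ops.append(op)
            rewrites += 1
        else:                                    # NO: leave unchanged
            ops.append("NO")
            pos, rewrites = pos + 1, 0
    return text, ops


for sample in ("\u00be", "\u01c6", "\u01e2"):
    result, ops = core_norm(sample)
    print(result, "+".join(ops))
```

With these stand-in tables, the three inputs from the examples table normalize to 3/4, dz, and AE, and the recorded operation sequences follow the same order as the Operations column. A character found in none of the tables is left in place, so the output can still contain non-ASCII characters.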
