Unicode Core Norm
until:
Some Unicode characters require multiple normalization steps to convert to ASCII. For example:
Input | Intermediate Unicode/ASCII Results | Final ASCII Results | Operations |
---|---|---|---|
¾ [U+00BE] | 3 [U+0033] + ⁄ [U+2044] + 4 [U+0034] | 3 [U+0033] + / [U+002F] + 4 [U+0034] | SL+AS+SM+AS |
dž [U+01C6] | d [U+0064] + ž [U+017E] | d [U+0064] + z [U+007A] | SL+AS+SD+AS |
Ǣ [U+01E2] | Æ [U+00C6] | A [U+0041] + E [U+0045] | SD+SL+AS+AS |
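
The multi-step behavior in the table can be sketched as a recursive pass over each character. This is a minimal illustration, not the actual implementation: the lookup tables `SYMBOL_MAP` and `LIGATURE_MAP`, the helper names, the lookup order, and the recursion limit are all assumptions, and the tables hold only the entries needed for the three rows above.

```python
import unicodedata

# Hypothetical, tiny lookup tables for illustration only.
SYMBOL_MAP = {"\u2044": "/"}       # FRACTION SLASH -> SOLIDUS
LIGATURE_MAP = {
    "\u00BE": "3\u20444",          # U+00BE -> 3 + U+2044 + 4
    "\u01C6": "d\u017E",           # U+01C6 -> d + U+017E
    "\u00C6": "AE",                # U+00C6 -> A + E
}


def strip_diacritics(ch):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def core_norm_char(ch, max_depth=10):
    """Normalize one character, recursing until it is ASCII or unmappable."""
    if ord(ch) < 128 or max_depth == 0:
        return ch
    if ch in SYMBOL_MAP:
        mapped = SYMBOL_MAP[ch]
    elif ch in LIGATURE_MAP:
        mapped = LIGATURE_MAP[ch]
    else:
        mapped = strip_diacritics(ch)
    if mapped == ch:
        # No step applies: the character stays non-ASCII.
        return ch
    # The result of one step may itself need further steps.
    return "".join(core_norm_char(c, max_depth - 1) for c in mapped)


def core_norm(text):
    """Apply the per-character normalization to a whole string."""
    return "".join(core_norm_char(c) for c in text)


if __name__ == "__main__":
    for sample in ("\u00BE", "\u01C6", "\u01E2"):
        print("U+%04X -> %s" % (ord(sample), core_norm(sample)))
```

In this sketch, U+01E2 first loses its macron through diacritic stripping, and the resulting U+00C6 is then split by the ligature table, which mirrors the two-stage path shown in the last table row.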
The output of this core norm may still contain non-ASCII Unicode characters. Theoretically, these characters have no mapping in the ASCII domain. If the output is required to be pure ASCII, a further conversion follows, such as converting the non-ASCII characters to ![Unicode name]! or stripping them. In our implementation, however, this core normalization converts the most commonly used (~1500) Unicode characters within the range of U+0000 to U+FFFF to ASCII, as shown in the table of Unicode Core Norm results. This operation is the core of non-ASCII Unicode to ASCII normalization.
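
The follow-up conversion to pure ASCII might look like the sketch below. It assumes that the `![Unicode name]!` placeholder wraps the character's official Unicode name, as reported by Python's `unicodedata.name`; the function name `to_pure_ascii`, the `strip` flag, and the `U+XXXX` fallback for unnamed code points are hypothetical choices made for this example.

```python
import unicodedata


def to_pure_ascii(text, strip=False):
    """Replace or strip every character outside the ASCII range.

    With strip=False, each non-ASCII character becomes ![NAME]!,
    where NAME is its Unicode name (assumed placeholder format).
    With strip=True, non-ASCII characters are removed.
    """
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif not strip:
            # Fall back to the code point when the character has no name.
            name = unicodedata.name(ch, "U+%04X" % ord(ch))
            out.append("![" + name + "]!")
    return "".join(out)


if __name__ == "__main__":
    sample = "a\u2603b"  # U+2603 is SNOWMAN
    print(to_pure_ascii(sample))
    print(to_pure_ascii(sample, strip=True))
```

Such a step would run only on what the core norm leaves behind, so characters that already have an ASCII mapping are never replaced by a name placeholder.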