Unicode Core Norm
This flow normalizes the Unicode characters in the input term to ASCII by recursively applying the following operations: split ligatures, strip diacritics, symbol mapping, and Unicode mapping.
Some Unicode characters require multiple normalization operations to convert to ASCII. For example:
Input | Intermediate Unicode/ASCII Results | Final ASCII Results | Operations |
---|---|---|---|
¾ [U+00BE] | 3 [U+0033] + ⁄ [U+2044] + 4 [U+0034] | 3 [U+0033] + / [U+002F] + 4 [U+0034] | SL+AS+SM+AS |
Dž [U+01C5] | D [U+0044] + ž [U+017E] | D [U+0044] + z [U+007A] | SL+AS+SD+AS |
ǽ [U+01FD] | æ [U+00E6] | a [U+0061] + e [U+0065] | SD+SL+AS+AS |
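The recursion behind the table above can be sketched in Python. This is a minimal illustration, not the lvg implementation: the four dictionaries are toy stand-ins for the configurable mapping tables, the order in which the tables are consulted is an assumption, and `MAX_DEPTH`, `core_norm_char`, and `core_norm` are hypothetical names.

```python
# Toy mapping tables (hypothetical contents; the real flow reads configurable tables).
SYMBOL_MAP = {"\u2044": "/", "\u201C": '"', "\u201D": '"'}   # SM: exits the loop
UNICODE_MAP = {}                                             # UM: exits the loop (left empty here)
LIGATURE_MAP = {"\u00BE": "3\u20444", "\u01C5": "D\u017E", "\u00E6": "ae"}  # SL: continues
DIACRITIC_MAP = {"\u017E": "z", "\u01FD": "\u00E6"}          # SD: continues
MAX_DEPTH = 10  # assumed recursion limit; exceeding it yields ERR


def core_norm_char(ch, depth=0):
    """Return (normalized_text, operation_codes) for a single character."""
    if depth > MAX_DEPTH:
        return ch, ["ERR"]
    if ch.isascii():
        return ch, ["AS"]            # preserve an ASCII character
    if ch in SYMBOL_MAP:
        return SYMBOL_MAP[ch], ["SM"]
    if ch in UNICODE_MAP:
        return UNICODE_MAP[ch], ["UM"]
    if ch in LIGATURE_MAP:
        step, op = LIGATURE_MAP[ch], "SL"
    elif ch in DIACRITIC_MAP:
        step, op = DIACRITIC_MAP[ch], "SD"
    else:
        return ch, ["NMO"]           # nothing applies; output stays non-ASCII
    # SL and SD stay in the loop: every resulting character is normalized again.
    pieces, ops = [], [op]
    for c in step:
        text, sub_ops = core_norm_char(c, depth + 1)
        pieces.append(text)
        ops.extend(sub_ops)
    return "".join(pieces), ops


def core_norm(term):
    """Normalize every character of a term and return the joined result."""
    return "".join(core_norm_char(ch)[0] for ch in term)


for ch in ("\u00BE", "\u01C5", "\u01FD"):
    text, ops = core_norm_char(ch)
    print(f"U+{ord(ch):04X} -> {text} ({'+'.join(ops)})")
```

With these toy tables, the three inputs from the table yield the same operation sequences listed in its Operations column.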
This flow uses four configurable mapping tables:
Please note that the output of this flow is not necessarily ASCII. The flows Get Unicode Names or Strip or Map Unicode to ASCII can follow this flow to complete the normalization of Unicode to pure ASCII. For example, the most commonly used Unicode-to-ASCII normalization flows, Norm Unicode to ASCII and Norm Unicode to ASCII with Synonym Option, use this core norm flow together with Get Unicode Names. Please refer to the design documents of Unicode core norm for details.
When the -m flag is specified, the detailed mutate operations for each character of the input string are added after the standard set of lvg output fields. There are seven basic mutate operations, as shown in the following table:
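To see whether a core-normalized string still contains non-ASCII characters, a check along the following lines may help. It is only a sketch that uses Python's standard `unicodedata` module as a stand-in for a follow-up step; it is not the lvg Get Unicode Names flow, and `non_ascii_report` is a hypothetical helper name.

```python
import unicodedata


def non_ascii_report(text):
    """List (character, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if not ch.isascii()
    ]


# A Greek alpha stands in for a character that core norm might leave unchanged.
print(non_ascii_report("a\u03B1"))
```

An empty list means the string is already pure ASCII and needs no further flow.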
Operations | Descriptions |
---|---|
Operations to continue in the recursive loop | |
SL | Split Ligatures |
SD | Strip Diacritics |
Operations to exit the recursive loop | |
SM | Symbol mapping |
UM | Unicode mapping |
AS | Preserve an ASCII character |
NMO | No More Operation |
ERR | Error (exceeds the maximum number of recursions) |
None.
Core normalization of converting Unicode to ASCII of the input term by
shell> lvg -f:q7
“Quote”
“Quote”|"Quote"|2047|16777215|q7|1|
⅝
⅝|5/8|2047|16777215|q7|1|
Déjà Vu
Déjà Vu|Deja Vu|2047|16777215|q7|1|
spælsau
spælsau|spaelsau|2047|16777215|q7|1|
Ǽ
Ǽ|AE|2047|16777215|q7|1|
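The pipe-delimited result lines in the session above can be split into fields for further processing. The sketch below assumes the six fields are input term, output term, categories, inflections, flow history, and flow number; the `LvgRecord` type and `parse_lvg_line` function are hypothetical names, and terms that themselves contain `|` are not handled.

```python
from typing import NamedTuple


class LvgRecord(NamedTuple):
    # Field names are assumptions based on the standard lvg output layout.
    input_term: str
    output_term: str
    categories: int
    inflections: int
    flow_history: str
    flow_number: int


def parse_lvg_line(line):
    """Parse one result line such as '⅝|5/8|2047|16777215|q7|1|'."""
    fields = line.rstrip("\n").split("|")
    if fields and fields[-1] == "":
        fields.pop()  # the trailing '|' leaves an empty final element
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    return LvgRecord(
        fields[0], fields[1], int(fields[2]), int(fields[3]), fields[4], int(fields[5])
    )


record = parse_lvg_line("\u215D|5/8|2047|16777215|q7|1|")
print(record.output_term, record.flow_history)
```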