Lexical Tools

Unicode Core Norm

Short Description: The core normalization of Unicode to ASCII.

Full Description:

This flow normalizes Unicode characters in the input term to ASCII by recursively performing the following operations:

Map Unicode symbols and punctuation to ASCII
Map Unicode to ASCII
Split ligatures
Strip diacritics

until either:

the output is ASCII
no further normalization is possible

Some Unicode characters require multiple normalization operations to convert to ASCII. For example:

Input	Intermediate Unicode/ASCII Results	Final ASCII Results	Operations
¾ [U+00BE]	3 [U+0033] + ⁄ [U+2044] + 4 [U+0034]	3 [U+0033] + / [U+002F] + 4 [U+0034]	SL+AS+SM+AS
ǅ [U+01C5]	D [U+0044] + ž [U+017E]	D [U+0044] + z [U+007A]	SL+AS+SD+AS
ǽ [U+01FD]	æ [U+00E6]	a [U+0061] + e [U+0065]	SD+SL+AS+AS
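
The need for several passes can be illustrated with the single-step decompositions in the Unicode Character Database. The sketch below uses Python's standard unicodedata module, not the lvg mapping tables, so its intermediate results only approximate the table above; the helper names one_step and trace are hypothetical.

```python
import unicodedata

def one_step(ch):
    """Return the single-step Unicode decomposition of ch, or ch itself if it has none."""
    fields = unicodedata.decomposition(ch).split()
    # Drop compatibility tags such as "<compat>" or "<fraction>".
    code_points = [f for f in fields if not f.startswith("<")]
    if not code_points:
        return ch
    return "".join(chr(int(cp, 16)) for cp in code_points)

def trace(ch, limit=5):
    """Apply one_step repeatedly and return every intermediate string."""
    steps = [ch]
    for _ in range(limit):
        nxt = "".join(one_step(c) for c in steps[-1])
        if nxt == steps[-1]:
            break
        steps.append(nxt)
    return steps

for ch in ("\u00BE", "\u01C5", "\u01FD"):
    print([s.encode("unicode_escape").decode("ascii") for s in trace(ch)])
```

Decomposition alone leaves U+2044 (fraction slash) in the first case and U+00E6 (æ) in the last, which is why this flow also needs symbol mapping and ligature splitting.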

This flow uses four configurable mapping tables:

$LVG/data/Unicode/symbolMap.data
$LVG/data/Unicode/unicodeMap.data
$LVG/data/Unicode/ligatureMap.data
$LVG/data/Unicode/diacriticMap.data

Please note that the output of this flow is not necessarily ASCII. The flows Get Unicode Names or Strip or map Unicode to ASCII can follow this flow to complete the normalization of Unicode to pure ASCII. For example, the most commonly used Unicode-to-ASCII normalization flows, Norm Unicode to ASCII and Norm Unicode to ASCII with Synonym Option, use this core norm flow together with Get Unicode Names. Please refer to the design documents of Unicode core norm for details.
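
A caller can check whether a follow-up flow is needed by looking for characters that are still outside ASCII. The following is a minimal sketch using Python's standard library; the function name remaining_non_ascii is hypothetical, and it does not reproduce what the Get Unicode Names flow itself outputs.

```python
import unicodedata

def remaining_non_ascii(text):
    """List (character, code point, Unicode name) for each non-ASCII character in text."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if ord(ch) > 127
    ]

print(remaining_non_ascii("Deja Vu"))   # nothing left to normalize
print(remaining_non_ascii("a\u03B1"))   # the Greek letter is reported
```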

When the -m flag is specified, the detailed mutate operations for each character of the input string are added after the standard set of lvg output fields. There are seven basic mutate operations, as shown in the following table:

Operations	Descriptions
Operations to continue in the recursive loop
SL	Split Ligatures
SD	Strip Diacritics
Operations to exit the recursive loop
SM	Symbol mapping
UM	Unicode mapping
AS	Preserve an ASCII character
NMO	No More Operation
ERR	Error (maximum recursion limit exceeded)
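
The table can be encoded as a small lookup for reading operation strings. This sketch assumes the operations arrive joined by "+", as in the Operations column of the examples earlier in this document; the actual layout of the -m output may differ, and the names OPERATIONS and describe are hypothetical.

```python
# code -> (description, whether the recursive loop continues after it)
OPERATIONS = {
    "SL": ("Split Ligatures", True),
    "SD": ("Strip Diacritics", True),
    "SM": ("Symbol mapping", False),
    "UM": ("Unicode mapping", False),
    "AS": ("Preserve an ASCII character", False),
    "NMO": ("No More Operation", False),
    "ERR": ("Error (recursion limit exceeded)", False),
}

def describe(op_string):
    """Expand a string such as "SD+SL+AS+AS" into (code, description, effect) tuples."""
    result = []
    for code in op_string.split("+"):
        description, continues = OPERATIONS[code]
        result.append((code, description, "continue" if continues else "exit"))
    return result

for row in describe("SD+SL+AS+AS"):
    print(row)
```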

Difference:
None.
Features:
Core normalization that converts Unicode in the input term to ASCII by:
- mapping symbols and punctuation to ASCII
- mapping Unicode to ASCII
- splitting ligatures
- stripping diacritics
Symbol: q7

Examples:


shell> lvg -f:q7
“Quote”
“Quote”|"Quote"|2047|16777215|q7|1|

⅝
⅝|5/8|2047|16777215|q7|1|

Déjà Vu
Déjà Vu|Deja Vu|2047|16777215|q7|1|

spælsau
spælsau|spaelsau|2047|16777215|q7|1|

Ǽ
Ǽ|AE|2047|16777215|q7|1|
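
Each output line above is a set of fields separated by "|". The sketch below splits one line into named fields; the field names are assumptions based on the usual lvg output layout (input term, output term, categories, inflections, flow history, flow number) and should be checked against the lvg output documentation. It does not handle terms that themselves contain "|".

```python
# Assumed field order of a standard lvg output line.
FIELD_NAMES = (
    "input", "output", "categories", "inflections", "flow_history", "flow_number",
)

def parse_lvg_line(line):
    """Split one lvg output line into a dict keyed by the assumed field names."""
    fields = line.rstrip("\n").split("|")
    # The trailing "|" produces an empty last element; drop it.
    if fields and fields[-1] == "":
        fields.pop()
    return dict(zip(FIELD_NAMES, fields))

record = parse_lvg_line("\u215D|5/8|2047|16777215|q7|1|")
print(record["output"], record["flow_history"])
```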


Implementation Logic:
1. Assign the input string to the current string
2. Check whether the maximum recursion limit is exceeded
  - if yes, exit and mark as "ERR".
3. Go through each character of the current string, exiting when the current position is greater than the length of the current string:
  - if the current character is ASCII
    => preserve the current character
    => increase the current position to the next character
    => mark as "AS"
  - else if symbolMap contains the current character
    => use the mapped string
    => increase the current position by the length of the mapped string
    => mark as "SM"
  - else if unicodeMap contains the current character
    => use the mapped string
    => increase the current position by the length of the mapped string
    => mark as "UM"
  - else if the current character is a splittable ligature
    => use the split ligature string
    => mark as "SL"
  - else if the current character has a strippable diacritic
    => use the stripped diacritic character
    => mark as "SD"
  - else (no more operation)
    => keep the current character unchanged
    => mark as "NMO"
4. Recursively perform the above operations.
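
The logic above can be sketched in Python as follows. This is one possible reading of the steps, written as per-character recursion, and is not the code in ToUnicodeCoreNorm.java. The four dictionaries are tiny hypothetical stand-ins for the mapping tables, MAX_DEPTH is an assumed limit, and the function names are illustrative.

```python
MAX_DEPTH = 10  # assumed recursion limit; the real value is not given here

# Hypothetical excerpts standing in for the four mapping tables.
SYMBOL_MAP = {"\u2044": "/", "\u201C": '"', "\u201D": '"'}
UNICODE_MAP = {}
LIGATURE_MAP = {"\u00BE": "3\u20444", "\u01C5": "D\u017E", "\u00E6": "ae", "\u00C6": "AE"}
DIACRITIC_MAP = {"\u017E": "z", "\u01FD": "\u00E6", "\u01FC": "\u00C6",
                 "\u00E9": "e", "\u00E0": "a"}

def core_norm_char(ch, depth=0):
    """Return (normalized string, operation codes) for a single character."""
    if depth > MAX_DEPTH:
        return ch, ["ERR"]
    if ord(ch) < 128:
        return ch, ["AS"]               # preserve ASCII, leave the loop
    if ch in SYMBOL_MAP:
        return SYMBOL_MAP[ch], ["SM"]   # mapped string is not processed again
    if ch in UNICODE_MAP:
        return UNICODE_MAP[ch], ["UM"]
    if ch in LIGATURE_MAP:
        op, replacement = "SL", LIGATURE_MAP[ch]
    elif ch in DIACRITIC_MAP:
        op, replacement = "SD", DIACRITIC_MAP[ch]
    else:
        return ch, ["NMO"]              # nothing applies; keep the character
    # SL and SD stay in the loop: process every character of the result again.
    out, ops = [], [op]
    for c in replacement:
        text, sub_ops = core_norm_char(c, depth + 1)
        out.append(text)
        ops.extend(sub_ops)
    return "".join(out), ops

def core_norm(term):
    """Normalize a whole term; return the output and one operation string per character."""
    parts = [core_norm_char(ch) for ch in term]
    return "".join(text for text, _ in parts), ["+".join(ops) for _, ops in parts]

print(core_norm("\u00BE"))
print(core_norm("D\u00E9j\u00E0 Vu")[0])
```

With these stand-in tables, the sketch reproduces the operation sequences of the three multi-step examples earlier in this document (SL+AS+SM+AS, SL+AS+SD+AS, and SD+SL+AS+AS).
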
Source Code: ToUnicodeCoreNorm.java
Hierarchy: Object -> Transformation -> ToUnicodeCoreNorm