Lexical Tools

Unicode Core Norm

  • Short Description: The core normalization of Unicode to ASCII.

  • Full Description:

    This flow normalizes Unicode characters in the input term to ASCII by recursively performing the following operations:

    • Map Unicode symbols and punctuation to ASCII
    • Map Unicode to ASCII
    • Split ligatures
    • Strip diacritics
    until either
    • the output is ASCII, or
    • no further normalized result is obtained

    Some Unicode characters require multiple normalization operations to convert to ASCII. For example:

    Input | Intermediate Unicode/ASCII Results | Final ASCII Results | Operations
    ¾ [U+00BE] | 3 [U+0033] + ⁄ [U+2044] + 4 [U+0034] | 3 [U+0033] + / [U+002F] + 4 [U+0034] | SL+AS+SM+AS
    ǅ [U+01C5] | D [U+0044] + ž [U+017E] | D [U+0044] + z [U+007A] | SL+AS+SD+AS
    ǽ [U+01FD] | æ [U+00E6] | a [U+0061] + e [U+0065] | SD+SL+AS+AS

    This flow uses four configurable mapping tables:

    • $LVG/data/Unicode/symbolMap.data
    • $LVG/data/Unicode/unicodeMap.data
    • $LVG/data/Unicode/ligatureMap.data
    • $LVG/data/Unicode/diacriticMap.data

    Please note that the output of this flow is not necessarily ASCII. The flows Get Unicode Names or Strip or Map Unicode to ASCII can follow this flow to complete the normalization of Unicode to pure ASCII. For example, the most commonly used Unicode-to-ASCII normalization flows, Norm Unicode to ASCII and Norm Unicode to ASCII with Synonym Option, use this core norm flow together with Get Unicode Names. Please refer to the design documents of Unicode core norm for details.

    When the -m flag is specified, the detailed mutate operations for each character of the input string are added after the standard set of lvg output fields. There are seven basic mutate operations, as shown in the following table:

    Operation | Description
    Operations to continue in the recursive loop:
    SL | Split ligatures
    SD | Strip diacritics
    Operations to exit the recursive loop:
    SM | Symbol mapping
    UM | Unicode mapping
    AS | Preserve an ASCII character
    NMO | No more operation
    ERR | Error (the maximum recursion limit is exceeded)
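
    The operation codes in the table above can be modeled as a small Java enum. This is a minimal sketch, not part of the Lexical Tools API: the names MutateOp, continuesLoop, and parse are hypothetical, and the "+"-joined sequence format is an assumption borrowed from the Operations column of the examples table, not a documented -m output format.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the seven mutate operation codes; not part of lvg. */
enum MutateOp {
    SL("Split ligatures", true),
    SD("Strip diacritics", true),
    SM("Symbol mapping", false),
    UM("Unicode mapping", false),
    AS("Preserve an ASCII character", false),
    NMO("No more operation", false),
    ERR("Error (recursion limit exceeded)", false);

    final String description;
    /** True if the result of this operation is examined again by the loop. */
    final boolean continuesLoop;

    MutateOp(String description, boolean continuesLoop) {
        this.description = description;
        this.continuesLoop = continuesLoop;
    }

    /** Parses a "+"-joined sequence such as "SL+AS+SM+AS" (assumed format). */
    static List<MutateOp> parse(String sequence) {
        List<MutateOp> ops = new ArrayList<>();
        for (String code : sequence.split("\\+")) {
            // valueOf throws IllegalArgumentException for an unknown code.
            ops.add(MutateOp.valueOf(code.trim()));
        }
        return ops;
    }
}
```

    Under these assumptions, MutateOp.parse("SL+AS+SM+AS") yields four operations, of which only the first (SL) continues the loop.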


  • Difference:

    None.

  • Features:
    Core normalization that converts Unicode characters in the input term to ASCII by:
    • mapping symbols and punctuation to ASCII
    • mapping Unicode to ASCII
    • splitting ligatures
    • stripping diacritics


  • Symbol: q7

  • Examples:
    
    shell> lvg -f:q7
    “Quote”
    “Quote”|"Quote"|2047|16777215|q7|1|
    
    ⅝
    ⅝|5/8|2047|16777215|q7|1|
    
    Déjà Vu
    Déjà Vu|Deja Vu|2047|16777215|q7|1|
    
    spælsau
    spælsau|spaelsau|2047|16777215|q7|1|
    
    Ǽ
    Ǽ|AE|2047|16777215|q7|1|
    
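
    Output lines like the ones above are pipe-delimited, so they are easy to split in client code. The following is a minimal sketch; the class name LvgOutputLine and the field names (input, output, categories, inflections, flowHistory, flowNumber) are assumptions inferred from the example values, not names defined in this document.

```java
/** Hypothetical holder for one pipe-delimited lvg output line. */
final class LvgOutputLine {
    final String input;        // e.g. the original term
    final String output;       // e.g. the normalized term
    final String categories;   // e.g. "2047" (assumed meaning)
    final String inflections;  // e.g. "16777215" (assumed meaning)
    final String flowHistory;  // e.g. "q7"
    final String flowNumber;   // e.g. "1"

    private LvgOutputLine(String[] f) {
        input = f[0];
        output = f[1];
        categories = f[2];
        inflections = f[3];
        flowHistory = f[4];
        flowNumber = f[5];
    }

    /** Splits a line on '|', keeping trailing empty fields. */
    static LvgOutputLine parse(String line) {
        String[] fields = line.split("\\|", -1);
        if (fields.length < 6) {
            throw new IllegalArgumentException("Expected at least 6 fields: " + line);
        }
        return new LvgOutputLine(fields);
    }
}
```

    For instance, parsing the line for "Déjà Vu" gives an output field of "Deja Vu" and a flow history of "q7".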

  • Implementation Logic:
    1. Assign the input string to the current string
    2. Check whether the maximum recursion limit has been exceeded
      • if yes, exit and mark as "ERR".
    3. Go through each character of the input string, and exit when the current position is greater than the length of the current string:
      • if the current character is ASCII
        => preserve the current character
        => increase current position to next
        => mark as "AS"
      • else if symbolMap contains current character
        => use mapped string
        => increase current position by the length of mapped string
        => mark as "SM"
      • else if unicodeMap contains current character
        => use mapped string
        => increase current position by the length of mapped string
        => mark as "UM"
      • else if a splittable ligature
        => use split ligature string
        => mark as "SL"
      • else if a strippable diacritic
        => use stripped diacritic character
        => mark as "SD"
      • else (no more operation)
        => use stripped diacritic character
        => mark as "NMO"
    4. Recursively perform the above operations.
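
    The steps above can be sketched in Java as follows. This is an illustration under stated assumptions, not the ToUnicodeCoreNorm source: the three maps are tiny hypothetical stand-ins for the configurable mapping tables, java.text.Normalizer (NFD decomposition followed by removal of combining marks) is swapped in for diacriticMap.data, the limit MAX_REWRITES is an assumed value, and the recursion is written as an in-place loop in which SL and SD results are examined again while AS, SM, UM, and NMO advance the position. Only characters in the Basic Multilingual Plane are handled.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; names and table contents are hypothetical. */
final class CoreNormSketch {
    // Tiny stand-ins for symbolMap.data, unicodeMap.data, and ligatureMap.data.
    static final Map<Character, String> SYMBOL_MAP =
        Map.of('\u201C', "\"", '\u201D', "\"", '\u2044', "/");
    static final Map<Character, String> UNICODE_MAP = Map.of('\u00A9', "(c)");
    static final Map<Character, String> LIGATURE_MAP = Map.of(
        '\u00E6', "ae", '\u00C6', "AE", '\u00BE', "3\u20444",
        '\u215D', "5\u20448", '\u01C5', "D\u017E");
    static final int MAX_REWRITES = 10; // assumed limit

    /** Normalized output plus a "+"-joined operation trace. */
    static final class Result {
        final String output;
        final String operations;
        Result(String output, String operations) {
            this.output = output;
            this.operations = operations;
        }
    }

    static Result normalize(String input) {
        StringBuilder cur = new StringBuilder(input);
        List<String> ops = new ArrayList<>();
        int pos = 0;
        int rewrites = 0; // consecutive SL/SD rewrites at the current position
        while (pos < cur.length()) {
            char c = cur.charAt(pos);
            String mapped;
            if (rewrites > MAX_REWRITES) {                        // ERR: give up here
                ops.add("ERR"); pos++; rewrites = 0;
            } else if (c < 128) {                                 // AS: keep as is
                ops.add("AS"); pos++; rewrites = 0;
            } else if ((mapped = SYMBOL_MAP.get(c)) != null) {    // SM: skip mapped text
                cur.replace(pos, pos + 1, mapped);
                ops.add("SM"); pos += mapped.length(); rewrites = 0;
            } else if ((mapped = UNICODE_MAP.get(c)) != null) {   // UM: skip mapped text
                cur.replace(pos, pos + 1, mapped);
                ops.add("UM"); pos += mapped.length(); rewrites = 0;
            } else if ((mapped = LIGATURE_MAP.get(c)) != null) {  // SL: examine again
                cur.replace(pos, pos + 1, mapped);
                ops.add("SL"); rewrites++;
            } else if (!(mapped = stripDiacritics(c)).equals(String.valueOf(c))) {
                cur.replace(pos, pos + 1, mapped);                // SD: examine again
                ops.add("SD"); rewrites++;
            } else {                                              // NMO: keep as is
                ops.add("NMO"); pos++; rewrites = 0;
            }
        }
        return new Result(cur.toString(), String.join("+", ops));
    }

    /** Returns c without combining marks, or c itself if none were removed. */
    static String stripDiacritics(char c) {
        String original = String.valueOf(c);
        if (Character.isSurrogate(c)) {
            return original; // non-BMP characters are out of scope for this sketch
        }
        String decomposed = Normalizer.normalize(original, Normalizer.Form.NFD);
        String stripped = decomposed.replaceAll("\\p{M}", "");
        return stripped.length() < decomposed.length() ? stripped : original;
    }
}
```

    With these stand-in tables, CoreNormSketch.normalize("ǅ") produces the output "Dz" with the trace "SL+AS+SD+AS", and normalize("Déjà Vu") produces "Deja Vu", in line with the examples in this document.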

  • Source Code: ToUnicodeCoreNorm.java

  • Hierarchy: Object -> Transformation -> ToUnicodeCoreNorm