Lexical Tools

Normalize Unicode to ASCII

  • Short Description: Normalize Unicode characters of the input term to pure ASCII.

  • Full Description:

    This flow normalizes the characters of the input term to pure ASCII. It first applies Unicode core norm, then replaces any character that is still not ASCII with its Unicode symbol name. This flow is equivalent to the combined flow option -f:q7:q3. Please refer to the design documents of Normalize Unicode characters to ASCII for details.

    The -m option has no effect on this flow other than adding "none" at the end of the output.

  • Difference:

    Uses the recursive algorithm of Unicode core norm (-f:q7) instead of the combined flows of stripping diacritics (-f:q) and splitting ligatures (-f:q2) used in the previous version.

  • Features:
    Normalize Unicode characters of the input term to pure ASCII:
    • Unicode core norm, recursively perform:
      • Map Unicode symbols and punctuation to ASCII
      • Map Unicode to ASCII
      • Split ligatures
      • Strip diacritics
    • Get Unicode symbol name if the character is not ASCII

  • Symbol: q5

  • Examples:
    
    shell> lvg -f:q5
    Evolène ©2002
    Evolène ©2002|Evolene ![COPYRIGHT SIGN]!2002|2047|16777215|q5|1|
    
    Heavenly Bathrobes®
    Heavenly Bathrobes®|Heavenly Bathrobes![REGISTERED SIGN]!|2047|16777215|q5|1|
    
    More examples
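
    The output lines above are pipe-delimited. The following is a minimal sketch of splitting one such line to pick out the normalized term and the flow symbol. The class and method names are hypothetical, and the field positions (input term first, output term second, flow symbol fifth) are read off the examples above rather than taken from a specification.

```java
// Hypothetical helper for reading one line of lvg output; not part of lvg.
class LvgOutputSketch {

    // Split a pipe-delimited output line into its fields.
    // String.split drops the trailing empty field after the final '|'.
    static String[] splitFields(String line) {
        return line.split("\\|");
    }

    public static void main(String[] args) {
        // First example line from this page; \u00e8 is e-grave, \u00a9 is the copyright sign.
        String line = "Evol\u00e8ne \u00a92002|Evolene ![COPYRIGHT SIGN]!2002|2047|16777215|q5|1|";
        String[] fields = splitFields(line);
        // Assumed positions: [1] normalized term, [4] flow symbol.
        System.out.println("normalized term: " + fields[1]);
        System.out.println("flow symbol:     " + fields[4]);
    }
}
```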

  • Implementation Logic:
    1. Utilize Unicode core norm
    2. Get Unicode symbol name if the character is not ASCII
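
    The two steps above can be approximated with the standard JDK alone. This is a rough sketch, not the code in ToNormUnicode.java: it assumes that the JDK's NFKD decomposition followed by dropping combining marks is an acceptable stand-in for Unicode core norm (lvg uses its own mapping tables, so results can differ for some characters), and the class name, method name, and "UNKNOWN" fallback are hypothetical. The ![NAME]! wrapping follows the examples on this page.

```java
import java.text.Normalizer;

// Illustrative sketch only; the real flow is implemented inside lvg.
class NormUnicodeSketch {

    static String toAscii(String term) {
        // Step 1 (stand-in for Unicode core norm): compatibility decomposition
        // splits ligatures and separates base letters from their diacritics.
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFKD);

        StringBuilder out = new StringBuilder();
        decomposed.codePoints().forEach(cp -> {
            if (cp < 128) {
                // Already ASCII: keep as is.
                out.appendCodePoint(cp);
            } else if (Character.getType(cp) == Character.NON_SPACING_MARK) {
                // Combining diacritic left over from decomposition: strip it.
            } else {
                // Step 2: still not ASCII, so substitute the Unicode symbol name.
                String name = Character.getName(cp);
                out.append("![").append(name == null ? "UNKNOWN" : name).append("]!");
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00e8 is e-grave, \u00a9 is the copyright sign, \u00ae is the registered sign.
        System.out.println(toAscii("Evol\u00e8ne \u00a92002"));
        System.out.println(toAscii("Heavenly Bathrobes\u00ae"));
    }
}
```

    For the two example terms on this page, the sketch yields the same output terms as lvg -f:q5; agreement on other input is not guaranteed.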

  • Source Code: ToNormUnicode.java

  • Hierarchy: Object -> Transformation -> ToNormUnicode