Lexical Tools

Normalize Unicode with Synonym Option

  • Short Description: Normalize Unicode characters of the input term to pure ASCII with synonym option.

  • Full Description:

    This flow normalizes the characters of the input term to pure ASCII with the synonym option. It first gets Unicode synonyms, then applies Unicode core norm, and finally gets Unicode symbol names for any characters that are still not ASCII. This flow is equivalent to the combined flow option -f:q4:q7:q3. Please refer to the design documents of Normalize Unicode characters to ASCII with synonym option for details.

    This flow has no effect on the -m option; "none" is added at the end of the output.

  • Difference:

    Uses the recursive algorithm of Unicode core norm (-f:q7) instead of the combined flows of stripping diacritics (-f:q) and splitting ligatures (-f:q2) used in previous versions.

  • Features:
    Normalize Unicode characters of the input term to pure ASCII with synonym option:
    • Get Unicode synonyms
    • Unicode core norm, recursively perform:
      • Map Unicode symbols and punctuation to ASCII
      • Map Unicode to ASCII
      • Split ligatures
      • Strip diacritics
    • Get Unicode symbol name if the character is not ASCII

  • Symbol: q6

  • Examples:
    
    shell> lvg -f:q6
    Østland
    Østland|Ostland|2047|16777215|q6|1|
    
    Déjà ©1999
    Déjà ©1999|Deja ![COPYRIGHT SIGN]!1999|2047|16777215|q6|1|
    
    μ
    μ|![MICRO SIGN]!|2047|16777215|q6|1|
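
    Each output line above is pipe-delimited. The following is a minimal sketch of splitting one such line, assuming the usual lvg field order (input term, output term, category, inflection, flow history, flow number). The class and field names are hypothetical and are not part of the Lexical Tools API.

```java
public class LvgOutputLineSketch {
    // Hypothetical holder for the six fields of one output line.
    public static final class Fields {
        public final String input;
        public final String output;
        public final long category;    // bit mask; 2047 = 2^11 - 1 (all 11 bits set)
        public final long inflection;  // bit mask; 16777215 = 2^24 - 1 (all 24 bits set)
        public final String flowHistory;
        public final int flowNumber;

        Fields(String[] f) {
            input = f[0];
            output = f[1];
            category = Long.parseLong(f[2]);
            inflection = Long.parseLong(f[3]);
            flowHistory = f[4];
            flowNumber = Integer.parseInt(f[5]);
        }
    }

    // Splits on '|'; String.split drops the empty string after the final '|'.
    public static Fields parse(String line) {
        String[] f = line.split("\\|");
        if (f.length < 6) {
            throw new IllegalArgumentException("expected at least 6 fields: " + line);
        }
        return new Fields(f);
    }

    public static void main(String[] args) {
        Fields r = parse("\u00D8stland|Ostland|2047|16777215|q6|1|");
        System.out.println(r.output + " via " + r.flowHistory); // prints "Ostland via q6"
    }
}
```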
    

  • Implementation Logic:
    1. Get Unicode synonym
    2. Utilize Unicode core norm
    3. Get Unicode symbol name if the character is not ASCII
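
    The three steps above can be sketched as follows. This is not the code in ToNormUnicodeWithSynonym.java: the class name, the two small mapping tables, and the use of java.text.Normalizer (NFKD) as a stand-in for ligature splitting and diacritic stripping are all assumptions made for illustration. The ![NAME]! form for symbol names is copied from the examples above.

```java
import java.text.Normalizer;
import java.util.Map;

public class NormUnicodeSketch {
    // Step 1 data: hypothetical synonym entries, not the tool's real table.
    private static final Map<Integer, String> SYNONYMS = Map.of(
        0x2264, "<=",   // LESS-THAN OR EQUAL TO
        0x2265, ">=");  // GREATER-THAN OR EQUAL TO

    // Step 2 data: hypothetical direct mappings for characters that have
    // no Unicode decomposition.
    private static final Map<Integer, String> TO_ASCII = Map.of(
        0x00D8, "O",    // LATIN CAPITAL LETTER O WITH STROKE
        0x00F8, "o",    // LATIN SMALL LETTER O WITH STROKE
        0x2019, "'");   // RIGHT SINGLE QUOTATION MARK

    private static String str(int cp) {
        return new String(Character.toChars(cp));
    }

    // Step 1: replace characters that have a synonym entry.
    public static String synonyms(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> out.append(SYNONYMS.getOrDefault(cp, str(cp))));
        return out.toString();
    }

    // Step 2: recursive core norm. A non-ASCII character is mapped through
    // the table, or decomposed with NFKD (splits ligatures) and stripped of
    // combining marks (diacritics). The result is normalized again until
    // nothing changes.
    public static String coreNorm(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            String single = str(cp);
            if (cp < 128) {
                out.append(single);
                return;
            }
            String mapped = TO_ASCII.get(cp);
            if (mapped == null) {
                mapped = Normalizer.normalize(single, Normalizer.Form.NFKD)
                                   .replaceAll("\\p{M}", "");
            }
            out.append(mapped.equals(single) ? single : coreNorm(mapped));
        });
        return out.toString();
    }

    // Step 3: any character that is still not ASCII becomes its Unicode name.
    public static String symbolNames(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp < 128) {
                out.appendCodePoint(cp);
            } else {
                out.append("![").append(Character.getName(cp)).append("]!");
            }
        });
        return out.toString();
    }

    public static String normalize(String term) {
        return symbolNames(coreNorm(synonyms(term)));
    }

    public static void main(String[] args) {
        System.out.println(normalize("\u00D8stland"));             // Ostland
        System.out.println(normalize("D\u00E9j\u00E0 \u00A91999")); // Deja ![COPYRIGHT SIGN]!1999
    }
}
```

    Running the steps in this order means the symbol-name fallback only sees characters that neither the synonym table nor core norm could reduce to ASCII.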

  • Source Code: ToNormUnicodeWithSynonym.java

  • Hierarchy: Object -> Transformation -> ToNormUnicodeWithSynonym