Lexical Tools

Get Unicode Names

  • Short Description: Get Unicode names from the input.

  • Full Description:

    For each character in the input term, this flow returns:

    • ![Unicode name]! if the character is not ASCII (value > 127)
    • the original character if it is ASCII

    A Unicode name is written as a starting tag ![ , followed by the Unicode name, and an ending tag ]! . Please refer to the design documents of Get Unicode Name for details. This flow component is used in NLP to convert UTF-8 text to pure ASCII, since all Unicode names consist of ASCII characters. In addition, this flow preserves the information of the original Unicode character.
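
    To illustrate why the tagged form preserves the original character, here is a minimal sketch that maps ![Unicode name]! tags back to characters. It is not part of lvg: the class name UnicodeNameDecodeSketch and the method restore are hypothetical, and it assumes Java 9 or later for Character.codePointOf.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of lvg: reverses the ![Unicode name]! tagging.
final class UnicodeNameDecodeSketch {
    // Matches a starting tag "![", a name without "]", and an ending tag "]!".
    private static final Pattern TAG = Pattern.compile("!\\[([^\\]]+)\\]!");

    static String restore(String ascii) {
        Matcher m = TAG.matcher(ascii);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the untagged text before this match unchanged.
            out.append(ascii, last, m.start());
            try {
                // Character.codePointOf (Java 9+) looks a code point up by its Unicode name.
                out.appendCodePoint(Character.codePointOf(m.group(1)));
            } catch (IllegalArgumentException e) {
                // Assumption: leave tags with unrecognized names as they are.
                out.append(m.group());
            }
            last = m.end();
        }
        out.append(ascii, last, ascii.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(restore("lvg ![COPYRIGHT SIGN]!2008"));
    }
}
```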

    When the -m flag is specified, detailed Unicode information for each character of the input string is added after the standard set of lvg output fields. Three pieces of information are included:

    Unicode Hex Value, Unicode Name, Unicode Block
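
    As a rough illustration of these three fields, the sketch below builds a comparable detail string with the standard Java Character API. It is not the lvg implementation; the class name UnicodeDetailSketch and the method describe are hypothetical, and the comma-separated layout is copied from the -m examples on this page.

```java
// Hypothetical helper, not part of lvg: formats hex value, name, and block.
final class UnicodeDetailSketch {
    static String describe(int codePoint) {
        // Unicode hex value in the conventional U+XXXX form.
        String hex = String.format("U+%04X", codePoint);
        // Unicode name; null for unassigned code points.
        String name = Character.getName(codePoint);
        // Unicode block; its toString() is the Java constant name.
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return hex + ", " + name + ", " + block;
    }

    public static void main(String[] args) {
        System.out.println(describe(0x00B5)); // micro sign
        System.out.println(describe(0x03BC)); // Greek small letter mu
    }
}
```

    For the two characters in the -m examples, this prints U+00B5, MICRO SIGN, LATIN_1_SUPPLEMENT and U+03BC, GREEK SMALL LETTER MU, GREEK.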

  • Difference:

    The lvg-defined symbols have been removed since 2008. This flow is now simplified to return the Unicode name for any input character that is not ASCII (regardless of whether it is a diacritic, ligature, or symbol).

  • Features:
    1. Get the Unicode name for each non-ASCII character in the input term.

  • Symbol: q3

  • Examples:
    shell> lvg -f:q3
    lvg ©2008
    lvg ©2008|lvg ![COPYRIGHT SIGN]!2008|2047|16777215|q3|1|
    shell> lvg -f:q3 -m
    µ|![MICRO SIGN]!|2047|16777215|q3|1|U+00B5, MICRO SIGN, LATIN_1_SUPPLEMENT|
    μ|![GREEK SMALL LETTER MU]!|2047|16777215|q3|1|U+03BC, GREEK SMALL LETTER MU, GREEK|
    More examples

  • Implementation Logic:
    1. Check whether each character is ASCII.
      • if yes, return the original character
      • if no, return the Unicode name
    2. Add the starting and ending tags around each Unicode name. The format is ![Unicode name]!.
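
    The two steps above can be sketched in a few lines of Java. This is a minimal illustration and not the code in ToGetUnicodeNames.java: the class name UnicodeNameSketch and the method toUnicodeNames are hypothetical, and the U+XXXX fallback for code points without a name is an assumption.

```java
// Hypothetical sketch of the q3 logic, not the actual lvg source.
final class UnicodeNameSketch {
    static String toUnicodeNames(String term) {
        StringBuilder out = new StringBuilder();
        // Walk by code point so supplementary characters are handled as one unit.
        for (int i = 0; i < term.length(); ) {
            int cp = term.codePointAt(i);
            if (cp <= 127) {
                // Step 1, ASCII: keep the original character.
                out.appendCodePoint(cp);
            } else {
                // Step 1, non-ASCII: look up the Unicode name.
                String name = Character.getName(cp);
                if (name == null) {
                    // Assumption: fall back to the hex value for unassigned code points.
                    name = String.format("U+%04X", cp);
                }
                // Step 2: wrap the name in the starting and ending tags.
                out.append("![").append(name).append("]!");
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toUnicodeNames("lvg \u00A92008"));
    }
}
```

    For the input lvg ©2008 this prints lvg ![COPYRIGHT SIGN]!2008, which matches the output term in the first example above.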

  • Source Code: ToGetUnicodeNames.java

  • Hierarchy: Object -> Transformation -> ToGetUnicodeNames