
Lexical Tools

Get Unicode Names

  • Short Description: Get Unicode names from the input.

  • Full Description:

    This flow returns:

    • ![Unicode name]! for each character that is not ASCII (value > 127)
    • the original character for each character that is ASCII
    from the input term.

    A Unicode name is written as a starting tag ![ , followed by the Unicode name, followed by an ending tag ]! . Please refer to the design documents of Get Unicode Name for details. This flow component is used in NLP to convert UTF-8 text to pure ASCII, since all Unicode names are ASCII. In addition, this flow preserves the information of the original Unicode character.
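    Because the tagged names keep the identity of each original character, the ASCII output can in principle be mapped back to the original text. The sketch below illustrates that idea only; it is not part of lvg. The class and method names are hypothetical, and it assumes Java 9 or later, that the names in the tags are ones the running JDK recognizes, and that the input contains no literal ![ ... ]! sequence of its own.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of lvg: reverses the ![Unicode name]! tagging.
final class UnicodeNameDecodeSketch {
    // Matches one tagged name such as ![COPYRIGHT SIGN]! and captures the name.
    private static final Pattern TAGGED_NAME = Pattern.compile("!\\[([^\\]]+)\\]!");

    static String fromUnicodeNames(String ascii) {
        Matcher m = TAGGED_NAME.matcher(ascii);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            // Character.codePointOf throws IllegalArgumentException for unknown names.
            int codePoint = Character.codePointOf(m.group(1));
            String original = new String(Character.toChars(codePoint));
            m.appendReplacement(out, Matcher.quoteReplacement(original));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

    For example, fromUnicodeNames("lvg ![COPYRIGHT SIGN]!2008") gives back the string with the copyright sign restored.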

    When the -m flag is specified, detailed Unicode information for each character of the input string is added after the standard set of lvg output fields. Three pieces of information are included:

    
    Unicode Hex Value, Unicode Name, Unicode Block
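    The three detail fields can be approximated with the standard Java character API. The following is a minimal sketch, assuming the fields are joined with ", " as in the -m examples on this page; the class and method names are hypothetical, and the actual lvg code may obtain these values differently.

```java
// Hypothetical helper, not part of lvg: builds "hex value, name, block" for one code point.
final class UnicodeDetailSketch {
    static String describe(int codePoint) {
        // Unicode hex value in the usual U+XXXX notation (at least four hex digits).
        String hex = String.format("U+%04X", codePoint);
        // Unicode name from the JDK's character database; null if the code point is unassigned.
        String name = Character.getName(codePoint);
        // Unicode block as named by java.lang.Character.UnicodeBlock; may be null for unassigned ranges.
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return hex + ", " + name + ", " + block;
    }
}
```

    With this sketch, describe(0x00B5) yields "U+00B5, MICRO SIGN, LATIN_1_SUPPLEMENT" and describe(0x03BC) yields "U+03BC, GREEK SMALL LETTER MU, GREEK".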

  • Difference:

    The lvg-defined symbols have been removed since 2008. This flow has been simplified to return the Unicode name whenever the input character is not ASCII (regardless of whether it is a diacritic, ligature, or symbol).

  • Features:
    1. Get the Unicode name for each character in the input term that is not ASCII.


  • Symbol: q3

  • Examples:
    
    shell> lvg -f:q3
    lvg ©2008
    lvg ©2008|lvg ![COPYRIGHT SIGN]!2008|2047|16777215|q3|1|
    
    shell> lvg -f:q3 -m
    µ
    µ|![MICRO SIGN]!|2047|16777215|q3|1|U+00B5, MICRO SIGN, LATIN_1_SUPPLEMENT|
    
    μ
    μ|![GREEK SMALL LETTER MU]!|2047|16777215|q3|1|U+03BC, GREEK SMALL LETTER MU, GREEK|
    
    More examples

  • Implementation Logic:
    1. Check if the character is ASCII.
      • if yes, return the original character
      • if no, return the Unicode symbol name
    2. Add starting and ending tags. The format is ![Unicode name]!.
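
    The two steps above can be sketched in Java as follows. This is an illustration under stated assumptions, not the code in ToGetUnicodeNames.java: the class and method names are hypothetical, the input is walked by code point, names come from Character.getName, and the U+XXXX fallback for unassigned code points is an assumption of this sketch.

```java
// Hypothetical sketch, not the lvg implementation: ASCII passes through, everything else is named.
final class UnicodeNameSketch {
    private static final String START_TAG = "![";
    private static final String END_TAG = "]!";

    static String toUnicodeNames(String term) {
        StringBuilder out = new StringBuilder();
        // Iterate by code point so a supplementary character is treated as one unit.
        term.codePoints().forEach(cp -> {
            if (cp <= 127) {
                // Step 1, "yes" branch: keep the original ASCII character.
                out.appendCodePoint(cp);
            } else {
                // Step 1, "no" branch: look up the Unicode name.
                String name = Character.getName(cp);
                if (name == null) {
                    // Assumption of this sketch: fall back to the hex value if unassigned.
                    name = String.format("U+%04X", cp);
                }
                // Step 2: wrap the name in the starting and ending tags.
                out.append(START_TAG).append(name).append(END_TAG);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00A9 is the copyright sign used in the first example on this page.
        System.out.println(toUnicodeNames("lvg \u00A92008"));
    }
}
```

    Running main prints lvg ![COPYRIGHT SIGN]!2008, which matches the second field of the first example above.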

  • Source Code: ToGetUnicodeNames.java

  • Hierarchy: Object -> Transformation -> ToGetUnicodeNames