Lexical Tools

Get Unicode Name

  • Introduction:
    Most Unicode characters have no semantic or graphic similarity to any ASCII character, so they cannot be logically normalized to ASCII: there is no ASCII character to map them to. In such cases, Lexical Tools converts the Unicode character to the ![Unicode name]! format so that the normalized result is entirely ASCII. This conversion is unique, reversible, and meaningful.

  • Algorithm:
    Get the unique Unicode name of the Unicode character and return the result in ![Unicode name]! format.

    This method is usually applied when a character cannot be normalized any further by the Unicode core norm algorithm, that is, when the character:

    • Is not an ASCII character (value > 127, i.e. above U+007F)
    • Can't be mapped by the symbols & punctuation mapping table
    • Can't be mapped by the Unicode mapping table
    • Can't be split as a ligature
    • Can't be stripped of diacritics

    ![ and ]! are the starting tag and ending tag, respectively. Both tags are configurable by users. Note that this method can also be used independently for other purposes, such as verifying the name and value of a Unicode character.
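
    The step above can be sketched in Java as follows. This is a minimal illustration, not the Lexical Tools source: it assumes the standard library's Character.getName (Java 7+) for the name lookup, and the class and method names (UnicodeNameTag, toNameTag, describe) are hypothetical. The tags are passed as parameters to reflect that they are configurable.

```java
final class UnicodeNameTag {
    /** Wraps the Unicode name of a code point in the given start and end tags. */
    static String toNameTag(int codePoint, String startTag, String endTag) {
        // Character.getName returns null for unassigned code points.
        String name = Character.getName(codePoint);
        if (name == null) {
            throw new IllegalArgumentException(
                String.format("No Unicode name for U+%04X", codePoint));
        }
        return startTag + name + endTag;
    }

    /** Independent use: report the value and name of a character. */
    static String describe(int codePoint) {
        return String.format("U+%04X %s", codePoint, Character.getName(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(toNameTag(0x00A9, "![", "]!")); // ![COPYRIGHT SIGN]!
        System.out.println(toNameTag(0x00A9, "<<", ">>")); // <<COPYRIGHT SIGN>>
        System.out.println(describe(0x00A9));              // U+00A9 COPYRIGHT SIGN
    }
}
```

    The second call only illustrates swapping in different tags; "<<" and ">>" are arbitrary example values.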

  • Samples:
    The following table shows some commonly seen characters:

    Char  Unicode  Unicode Name
    ©     U+00A9   ![COPYRIGHT SIGN]!
    ®     U+00AE   ![REGISTERED SIGN]!
    µ     U+00B5   ![MICRO SIGN]!
    ¶     U+00B6   ![PILCROW SIGN]!
    ß     U+00DF   ![LATIN SMALL LETTER SHARP S]!
    β     U+03B2   ![GREEK SMALL LETTER BETA]!
    μ     U+03BC   ![GREEK SMALL LETTER MU]!
    π     U+03C0   ![GREEK SMALL LETTER PI]!
    ℠     U+2120   ![SERVICE MARK]!
    ℡     U+2121   ![TELEPHONE SIGN]!
    ™     U+2122   ![TRADE MARK SIGN]!
    ﺂ     U+FE82   ![ARABIC LETTER ALEF WITH MADDA ABOVE FINAL FORM]!
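
    Because each name identifies exactly one character, the tagged form can be mapped back to the original character. The sketch below illustrates this for the default ![ and ]! tags. It assumes Character.codePointOf from the standard library (Java 9+); the class and method names (UnicodeNameReverse, fromNameTag) are hypothetical and not part of Lexical Tools.

```java
final class UnicodeNameReverse {
    static final String START_TAG = "![";
    static final String END_TAG = "]!";

    /** Returns the code point named inside a tagged string such as "![MICRO SIGN]!". */
    static int fromNameTag(String tagged) {
        boolean wellFormed = tagged.length() >= START_TAG.length() + END_TAG.length()
                && tagged.startsWith(START_TAG)
                && tagged.endsWith(END_TAG);
        if (!wellFormed) {
            throw new IllegalArgumentException("Not a tagged Unicode name: " + tagged);
        }
        String name = tagged.substring(
                START_TAG.length(), tagged.length() - END_TAG.length());
        // Throws IllegalArgumentException if the name is not a known Unicode name.
        return Character.codePointOf(name);
    }

    public static void main(String[] args) {
        // The two visually similar "mu" characters map back to distinct code points.
        System.out.printf("U+%04X%n", fromNameTag("![MICRO SIGN]!"));            // U+00B5
        System.out.printf("U+%04X%n", fromNameTag("![GREEK SMALL LETTER MU]!")); // U+03BC
    }
}
```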

  • Java Code Implementation:
    • Download ICU4J (icu4j.jar).
    • Include icu4j.jar in the Java CLASSPATH
    • import com.ibm.icu.lang.*;
    • If the character is ASCII
      • return the original character
    • else
      • String unicodeName = UCharacter.getName(inChar);
      • Send result as ![unicodeName]!
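
    The steps above can be sketched as a method applied to a whole string. To keep the example runnable without an extra jar, this sketch uses the standard library's Character.getName instead of ICU4J's UCharacter.getName; with ICU4J on the CLASSPATH, UCharacter.getName(cp) could be substituted, although the two may return different names for some characters. The class and method names (GetUnicodeNameSketch, normalize) are hypothetical. Leaving unnamed characters unchanged is an assumption of this sketch, and it skips the earlier mapping, ligature, and diacritic steps.

```java
final class GetUnicodeNameSketch {
    /** Replaces every non-ASCII character in the input with ![Unicode name]!. */
    static String normalize(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);   // handles supplementary characters
            i += Character.charCount(cp);
            if (cp <= 0x7F) {
                // ASCII: return the original character.
                out.appendCodePoint(cp);
                continue;
            }
            String unicodeName = Character.getName(cp);
            if (unicodeName == null) {
                // Assumption: unassigned code points are left as they are.
                out.appendCodePoint(cp);
            } else {
                out.append("![").append(unicodeName).append("]!");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("5 \u00B5g"));  // 5 ![MICRO SIGN]!g
        System.out.println(normalize("plain text")); // plain text
    }
}
```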

  • References: