Lexical Tools

Get Unicode Synonym

  • Introduction:
    Some Unicode characters with different code points have a similar graphical appearance (visual representation). Strictly speaking, using one in place of another is a typo. However, these characters have been used interchangeably by mistake for so long that they are now treated as synonyms of each other. The base of a set of synonyms is obtained by table mapping.
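
    As a small illustration of the problem, the sketch below compares the two look-alike characters used as the example in this document, GREEK SMALL LETTER MU (U+03BC) and MICRO SIGN (U+00B5). The class and method names are hypothetical and are not part of the Lexical Tools API.

```java
final class LookAlikeDemo {
    // Two characters that render almost identically but have different values.
    static final String GREEK_MU = "\u03BC";   // GREEK SMALL LETTER MU
    static final String MICRO_SIGN = "\u00B5"; // MICRO SIGN

    /** Returns the U+XXXX notation and the Unicode name of the first code point of s. */
    static String describe(String s) {
        int cp = s.codePointAt(0);
        return String.format("U+%04X %s", cp, Character.getName(cp));
    }

    public static void main(String[] args) {
        System.out.println(describe(GREEK_MU));
        System.out.println(describe(MICRO_SIGN));
        // A plain string comparison treats the two as different characters.
        System.out.println("equal? " + GREEK_MU.equals(MICRO_SIGN));
    }
}
```

    Because a plain comparison sees two unrelated characters, text that mixes them will not match unless one is first mapped to the other.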

  • Algorithm:
    The table mapping method converts a Unicode character to the base of its synonyms. The mapping is straightforward: it replaces a Unicode character with the Unicode character assigned to it in a configurable mapping table. The default Unicode synonym mapping table provided by the Lexical Tools is located at ${LVG}/data/Unicode/synonymMap.data. The format is as follows:

    Unicode | Synonym Base | Char | SB Char | Unicode Name          | SB Name
    U+03BC  | U+00B5       | μ    | µ       | GREEK SMALL LETTER MU | MICRO SIGN

    Please note:

    • Fields 1 and 2 can each be an ASCII or non-ASCII Unicode character, given as its Unicode hex value (for example, U+03BC).
    • Fields 3 and 5 are the Unicode character and name of field 1. They are for annotation only and are not used by the program.
    • Fields 4 and 6 are the Unicode character and name of field 2 (the synonym base). They are for annotation only and are not used by the program.
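
    Since only fields 1 and 2 are used by the program, reading an entry comes down to turning two U+XXXX values into code points. The sketch below shows one way to do that for a single field value. It is an illustration only: the class name UnicodeHexField and its methods are hypothetical, and the sketch does not read synonymMap.data itself because the on-disk field separator is not described here.

```java
final class UnicodeHexField {
    /** Parses a field value such as "U+03BC" into its code point (0x03BC). */
    static int parse(String field) {
        String s = field.trim();
        // Require the "U+" prefix (case-insensitive) followed by at least one hex digit.
        if (s.length() < 3 || !s.regionMatches(true, 0, "U+", 0, 2)) {
            throw new IllegalArgumentException("Expected U+XXXX, got: " + field);
        }
        // NumberFormatException (a subclass of IllegalArgumentException) signals bad hex.
        int codePoint = Integer.parseInt(s.substring(2), 16);
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + field);
        }
        return codePoint;
    }

    /** Formats a code point back into U+XXXX notation. */
    static String format(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        int synonym = parse("U+03BC"); // field 1 of the default entry
        int base = parse("U+00B5");    // field 2 of the default entry
        System.out.println(format(synonym) + " -> " + format(base));
    }
}
```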

    The table below shows two other commonly used Unicode synonyms. Please note that they are not included as default synonyms in the Lexical Tools.

    Unicode | Synonym Base | Char | SB Char | Unicode Name               | SB Name
    U+00DF  | U+03B2       | ß    | β       | LATIN SMALL LETTER SHARP S | GREEK SMALL LETTER BETA
    U+00B6  | U+03C0       | ¶    | π       | PILCROW SIGN               | GREEK SMALL LETTER PI

  • Java Code Implementation:
    • Map the character to its synonym base if it is in the Unicode mapping table; otherwise, leave it unchanged
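
    The lookup step can be sketched roughly as follows. This is a minimal sketch, not the Lexical Tools source: the class UnicodeSynonymSketch and its methods are hypothetical, and the table is filled in code with the single default entry documented above rather than loaded from synonymMap.data.

```java
import java.util.HashMap;
import java.util.Map;

final class UnicodeSynonymSketch {
    // Key: synonym code point (field 1). Value: synonym base code point (field 2).
    private final Map<Integer, Integer> synonymToBase = new HashMap<>();

    /** Registers one table entry (hypothetical helper, not the real loader). */
    void addMapping(int synonym, int base) {
        synonymToBase.put(synonym, base);
    }

    /** Returns the synonym base if the code point is in the table, else the input. */
    int toBase(int codePoint) {
        return synonymToBase.getOrDefault(codePoint, codePoint);
    }

    /** Applies the per-character mapping to every code point of the text. */
    String toBase(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> out.appendCodePoint(toBase(cp)));
        return out.toString();
    }

    /** Builds a table holding only the entry shown in this document. */
    static UnicodeSynonymSketch withDocumentedDefault() {
        UnicodeSynonymSketch table = new UnicodeSynonymSketch();
        table.addMapping(0x03BC, 0x00B5); // GREEK SMALL LETTER MU -> MICRO SIGN
        return table;
    }

    public static void main(String[] args) {
        UnicodeSynonymSketch table = withDocumentedDefault();
        String mapped = table.toBase("5 \u03BCg");
        // The Greek mu has been replaced by the micro sign; other characters are untouched.
        System.out.println(mapped.equals("5 \u00B5g"));
    }
}
```

    Iterating over code points rather than char values keeps the sketch correct for characters outside the Basic Multilingual Plane, should such entries ever be added to the table.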

  • References: