Lexical Tools

Unicode Core Norm

  • Introduction:
    Non-ASCII Unicode characters are normalized to ASCII on the basis of semantic or graphic similarity. Four basic operations are applied recursively:
    • Symbols and punctuation mapping
    • Unicode mapping
    • Split ligatures
    • Strip diacritics

    until:

    • the characters are normalized to ASCII, or
    • no further normalization can be done (ASCII or non-ASCII)

    Some Unicode characters require multiple normalization steps to convert to ASCII. For example:

    Input        | Intermediate Unicode/ASCII Results     | Final ASCII Results                    | Operations
    ¾ [U+00BE]   | 3 [U+0033] + ⁄ [U+2044] + 4 [U+0034]   | 3 [U+0033] + / [U+002F] + 4 [U+0034]   | SL+AS+SM+AS
    dž [U+01C6]   | d [U+0064] + ž [U+017E]                | d [U+0064] + z [U+007A]                | SL+AS+SD+AS
    Ǣ [U+01E2]   | Æ [U+00C6]                             | A [U+0041] + E [U+0045]                | SD+SL+AS+AS
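
    The intermediate results in these examples agree with the decomposition mappings published in the Unicode Character Database. As a rough illustration only, Python's standard `unicodedata` module can show a single decomposition step. Lexical Tools uses its own mapping tables, and the helper name `decompose_once` is hypothetical:

```python
import unicodedata

def decompose_once(ch):
    """Return the characters in ch's Unicode decomposition mapping (empty list if none)."""
    fields = unicodedata.decomposition(ch).split()
    # Drop the compatibility tag such as <fraction> or <compat>, if present.
    if fields and fields[0].startswith("<"):
        fields = fields[1:]
    return [chr(int(code, 16)) for code in fields]

for ch in ("\u00BE", "\u01C6", "\u01E2"):
    parts = " + ".join(f"U+{ord(p):04X}" for p in decompose_once(ch))
    print(f"U+{ord(ch):04X} -> {parts}")
```

    For Ǣ the mapping contains Æ followed by a combining macron (U+0304); removing that combining mark corresponds to the strip-diacritics step.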

The output of this core norm may still contain non-ASCII Unicode characters. In principle, these characters have no mapping in the ASCII domain. If the output must be pure ASCII, a further conversion follows, such as converting each remaining non-ASCII character to ![Unicode name]! or stripping it. In our implementation, this core normalization converts the most commonly used Unicode characters (about 1,500) in the range U+0000 to U+FFFF to ASCII, as shown in the table of Unicode Core Norm results. This operation is the core of non-ASCII Unicode to ASCII normalization.
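
The two follow-up options for characters that survive the core norm can be sketched as below. This is a minimal illustration, not the Lexical Tools code: the function name `to_pure_ascii` and its `strip` flag are hypothetical, and the exact spelling of the name placed inside `![...]!` is assumed to be the standard Unicode character name.

```python
import unicodedata

def to_pure_ascii(text, strip=False):
    """Replace or strip every non-ASCII character left in text."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)                      # ASCII passes through unchanged
        elif not strip:
            # Fall back to the code point when the character has no name.
            name = unicodedata.name(ch, f"U+{ord(ch):04X}")
            out.append(f"![{name}]!")
        # With strip=True the character is simply dropped.
    return "".join(out)

print(to_pure_ascii("a\u03B1"))              # name replacement
print(to_pure_ascii("a\u03B1", strip=True))  # stripping
```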

  • Algorithm:
    • If the recursion count is less than the maximum recursion count
      => Go through each character (curChar) in the string:
      • if ASCII (AS)
        => increase curPos
      • else if in the punctuation and symbol mapping table (PM)
        => perform punctuation and symbol mapping
        => replace curChar with the mapped string
        => increase curPos accordingly
      • else if in the Unicode mapping table (UM)
        => perform Unicode to ASCII mapping
        => replace curChar with the mapped string
        => increase curPos accordingly
      • else if a splittable ligature (SL)
        => replace curChar with the split string
      • else if a strippable diacritic (SD)
        => replace curChar with the character stripped of its diacritics
      • else (NO)
        => increase curPos

      • Call itself again until curPos >= the length of the string
    • else
      => stop
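
    The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Lexical Tools implementation: the recursion is written as a loop, with a step budget (`max_steps`, a hypothetical parameter) standing in for the maximum recursion count; the four tables are tiny hypothetical stand-ins holding only the characters from the examples table, plus one made-up Unicode mapping entry; and the symbol step is tagged SM as in the examples table (PM in the list above).

```python
# Hypothetical miniature tables; the real tables are far larger.
SYMBOL_MAP = {"\u2044": "/"}                    # symbol and punctuation mapping
UNICODE_MAP = {"\u00D8": "O"}                   # illustrative entry only
LIGATURES = {"\u00BE": "3\u20444", "\u01C6": "d\u017E", "\u00C6": "AE"}
DIACRITICS = {"\u017E": "z", "\u01E2": "\u00C6"}

def core_norm(text, max_steps=1000):
    """Return (normalized text, list of operation tags applied)."""
    chars, ops, cur_pos, steps = list(text), [], 0, 0
    while cur_pos < len(chars) and steps < max_steps:
        steps += 1
        ch = chars[cur_pos]
        if ord(ch) < 128:                       # AS: already ASCII
            ops.append("AS")
            cur_pos += 1
        elif ch in SYMBOL_MAP or ch in UNICODE_MAP:
            table, tag = (SYMBOL_MAP, "SM") if ch in SYMBOL_MAP else (UNICODE_MAP, "UM")
            mapped = table[ch]
            chars[cur_pos:cur_pos + 1] = mapped  # replace curChar with mapped string
            ops.append(tag)
            cur_pos += len(mapped)               # skip past the mapped string
        elif ch in LIGATURES:                    # SL: result is re-examined
            chars[cur_pos:cur_pos + 1] = LIGATURES[ch]
            ops.append("SL")
        elif ch in DIACRITICS:                   # SD: result is re-examined
            chars[cur_pos:cur_pos + 1] = DIACRITICS[ch]
            ops.append("SD")
        else:                                    # NO: nothing applies, keep as is
            ops.append("NO")
            cur_pos += 1
    return "".join(chars), ops

for sample in ("\u00BE", "\u01C6", "\u01E2"):
    result, ops = core_norm(sample)
    print(result, "+".join(ops))
```

    Because the ligature and diacritic branches do not advance curPos, the replacement is examined again on the next step, which is how a single input character can pass through several operations.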

  • References: