Lexical Tools

Unicode Normalization

Unicode normalization is an algorithm used in Lexical Tools to normalize Unicode characters. If you are not interested in how the software works, you may skip this page.

Decomposition:

  • Compatibility Decomposition:
    It maps a character to one or more other characters that can be used as a replacement for it. This process is not necessarily reversible.

    For example: [ ˚ :U+02DA RING ABOVE] = [ :U+0020 SPACE] + [ ̊ :U+030A COMBINING RING ABOVE]

  • Canonical Decomposition:
    It maps the character to its canonical equivalent character or characters. A canonical mapping is reversible without losing any information.

    For example: [ Å :U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE] = [ A :U+0041 LATIN CAPITAL LETTER A] + [ ̊ :U+030A COMBINING RING ABOVE]
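
The two kinds of mapping can be inspected directly. The following is a minimal sketch that uses Python's standard `unicodedata` module as a stand-in; it is an illustration under that assumption, not the Lexical Tools implementation. The constant names are hypothetical.

```python
import unicodedata

RING_ABOVE = "\u02DA"  # RING ABOVE
A_RING = "\u00C5"      # LATIN CAPITAL LETTER A WITH RING ABOVE

# Raw mappings from the Unicode Character Database. A leading tag such as
# "<compat>" marks a compatibility mapping; no tag means a canonical mapping.
compat_mapping = unicodedata.decomposition(RING_ABOVE)
canonical_mapping = unicodedata.decomposition(A_RING)
print(compat_mapping)     # <compat> 0020 030A
print(canonical_mapping)  # 0041 030A

# Canonical decomposition round-trips through canonical composition.
decomposed = unicodedata.normalize("NFD", A_RING)
print(unicodedata.normalize("NFC", decomposed) == A_RING)  # True

# Compatibility decomposition does not: the original character is not restored.
compat = unicodedata.normalize("NFKD", RING_ABOVE)
print(unicodedata.normalize("NFC", compat) == RING_ABOVE)  # False
```

The last two checks show the reversibility difference described above: the canonical mapping can be undone, while the compatibility mapping loses the identity of the original character.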

Composition:

  • Canonical Composition:
    Canonical composition (the inverse of canonical decomposition) replaces a sequence of a base character plus combining characters with a single precomposed character.

    For example: [ A :U+0041 LATIN CAPITAL LETTER A] + [ ̊ :U+030A COMBINING RING ABOVE] = [ Å :U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE]
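
As a rough illustration of canonical composition, the sketch below again assumes Python's `unicodedata` module rather than the Lexical Tools code. The helper name `compose` is hypothetical.

```python
import unicodedata


def compose(text: str) -> str:
    """Canonically decompose, then canonically compose the text (form C)."""
    return unicodedata.normalize("NFC", text)


sequence = "A\u030A"  # A + COMBINING RING ABOVE (two code points)
precomposed = compose(sequence)

print(len(sequence), len(precomposed))  # 2 1
print(unicodedata.name(precomposed))    # LATIN CAPITAL LETTER A WITH RING ABOVE
```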

Unicode Normalization:

The Unicode Normalization Forms define four forms of normalized text. The D, C, KD, and KC forms differ both in whether they result from an initial canonical or compatibility decomposition, and in whether the decomposed text is then recomposed with canonical composed characters wherever possible.

  • D: Canonical decomposition, not followed by canonical composition
  • C: Canonical decomposition, followed by canonical composition
  • KD: Compatibility decomposition, not followed by canonical composition
  • KC: Compatibility decomposition, followed by canonical composition

Example:

  • Input: [aﬀairé] (written with the single-character ligature ﬀ, U+FB00 LATIN SMALL LIGATURE FF)
  • D: a + ﬀ + a + i + r + e + ́
  • C: a + ﬀ + a + i + r + é
  • KD: a + f + f + a + i + r + e + ́
  • KC: a + f + f + a + i + r + é
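
The four results above can be reproduced with a short sketch. It assumes the input is spelled with the ligature ﬀ (U+FB00) and a precomposed é (U+00E9), and it uses Python's `unicodedata` module purely for illustration. The variable names are hypothetical.

```python
import unicodedata

# Assumption: the word is spelled with U+FB00 (ligature ff) and U+00E9 (e acute).
word = "a\uFB00air\u00E9"

forms = {
    "D": unicodedata.normalize("NFD", word),
    "C": unicodedata.normalize("NFC", word),
    "KD": unicodedata.normalize("NFKD", word),
    "KC": unicodedata.normalize("NFKC", word),
}

# Print each form as a sequence of code points so that the ligature and the
# combining acute accent (U+0301) are visible even if the terminal merges them.
for label, text in forms.items():
    code_points = " + ".join(f"U+{ord(ch):04X}" for ch in text)
    print(f"{label}: {code_points}")
```

Only the K forms split the ligature into f + f, and only the forms without composition (D and KD) leave the accent as a separate combining character.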

Based on the above, the normalization algorithm can be used to:

  • Strip diacritics:
    • Normalization D
    • Base characters (non-diacritics) always come before their diacritics
    • Diacritics are characters in the Combining Diacritical Marks block
    • The results, base characters and diacritics, are not necessarily ASCII
  • Split ligatures:
    • Normalization KC
    • A ligature is split into multiple characters
    • Some split results contain a space, which should be trimmed
    • The results are not necessarily ASCII
  • Strip diacritics & split ligatures:
    • Normalization KD
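
The three uses listed above might be sketched as follows. This is an illustration built on Python's `unicodedata` module, not the Lexical Tools code; all function names are hypothetical. It assumes that "diacritic" means a character in the Combining Diacritical Marks block (U+0300 to U+036F) and that trimming is applied to the whole result.

```python
import unicodedata


def is_combining_diacritic(ch: str) -> bool:
    """True if ch lies in the Combining Diacritical Marks block (assumed range)."""
    return "\u0300" <= ch <= "\u036F"


def strip_diacritics(text: str) -> str:
    """Normalization D, then drop the combining marks that follow each base."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not is_combining_diacritic(ch))


def split_ligatures(text: str) -> str:
    """Normalization KC, then trim spaces introduced by some mappings."""
    return unicodedata.normalize("NFKC", text).strip()


def strip_diacritics_and_split_ligatures(text: str) -> str:
    """Normalization KD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not is_combining_diacritic(ch))


word = "a\uFB00air\u00E9"  # spelled with the ligature U+FB00 and U+00E9
print(strip_diacritics(word))                      # ligature kept, accent removed
print(split_ligatures(word))                       # ligature split, accent kept
print(strip_diacritics_and_split_ligatures(word))  # affaire
```

Note that trimming the whole string also removes any leading or trailing whitespace that was already in the input; a per-character mapping would avoid that, at the cost of a longer example.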

References: