Lexical Tools

Split Ligatures in Unicode

  • Introduction: Ligature characters appear in English documents. Splitting ligatures is a common step in normalizing non-ASCII Unicode characters to ASCII.

  • Algorithm:
    Since 2008, Lexical Tools has split ligatures with an enhanced algorithm that applies the following processes:

    • Table mapping:
      The mapping method overrides the default ligature-splitting result. It is a straightforward method that replaces a Unicode ligature character with its assigned mapped string (usually ASCII). A configurable mapping table, located at ${LVG}/data/Unicode/ligatureMap.data, is used for this purpose. The table covers Unicode ligatures that cannot be split by the Unicode normalization algorithm. This file is the default split-ligature mapping table provided by Lexical Tools. Its format is as follows:

      Unicode   Mapped String   Char   Unicode Name
      U+00C6    AE              Æ      LATIN CAPITAL LETTER AE

      Please note:

      • Field 1 must be a non-ASCII Unicode character (given as its Unicode hex value).
      • Field 2 must be a Unicode (ASCII or non-ASCII) string.
      • Fields 3 and 4 are the Unicode character and name for field 1. They serve as annotations only and are not used by the program.
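
      As a rough illustration of the table format, the following Java sketch parses mapping lines into a lookup map. It assumes that fields are separated by a '|' character and that lines starting with '#' are comments; neither is stated above. The class and method names (LigatureMapParser, parse) are hypothetical and are not part of Lexical Tools.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LigatureMapParser {
    // Assumption: fields are '|'-separated; the page does not state the delimiter.
    private static final String SEPARATOR = "\\|";

    /** Parses mapping lines into a map from code point (field 1) to mapped string (field 2). */
    static Map<Integer, String> parse(List<String> lines) {
        Map<Integer, String> map = new LinkedHashMap<>();
        for (String line : lines) {
            // Assumption: blank lines and lines starting with '#' carry no mapping.
            if (line.trim().isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] fields = line.split(SEPARATOR, -1);
            if (fields.length < 2 || !fields[0].startsWith("U+")) {
                throw new IllegalArgumentException("Malformed mapping line: " + line);
            }
            int codePoint = Integer.parseInt(fields[0].substring(2), 16);
            if (codePoint < 0x80) {
                throw new IllegalArgumentException("Field 1 must be non-ASCII: " + line);
            }
            // Fields 3 and 4 are annotations only, so they are ignored here.
            map.put(codePoint, fields[1]);
        }
        return map;
    }

    public static void main(String[] args) {
        Map<Integer, String> map = parse(Arrays.asList(
            "U+00C6|AE|\u00C6|LATIN CAPITAL LETTER AE",
            "U+00B5|\u00B5|\u00B5|MICRO SIGN"));
        System.out.println(map.get(0x00C6)); // prints "AE"
    }
}
```

      Keying the map by code point rather than by the hex text keeps the later lookup independent of how field 1 is written (upper or lower case hex digits).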

    • Unicode normalization KC algorithm (KC: Compatibility Decomposition, followed by Canonical Composition):

      As discussed under Unicode Normalization, Unicode normalization KC can be used to split ligatures: it decomposes a ligature into several Unicode characters. This process is applied after the table mapping. Please note:

      • The characters produced by the split are not necessarily ASCII. For example,
        [½] (U+00BD) is split into [1] + [⁄] (U+2044) + [2],
        where [⁄] (U+2044) is not an ASCII character. Further normalization operations are needed to complete the normalization.
      • Unexpected results from the KC algorithm:
        [µ] (U+00B5) is converted to [μ] (U+03BC),
        which is not expected. To fix this issue, a mapping line is added to the default ligature mapping table:

        Unicode   Mapped String   Char   Unicode Name
        U+00B5    µ               µ      MICRO SIGN
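
      Both notes can be checked with a small Java sketch. It uses the JDK's java.text.Normalizer in place of icu4j so that it runs without extra jars; the class and helper names (NfkcDemo, nfkc, isAscii, codePoints) are hypothetical.

```java
import java.text.Normalizer;

final class NfkcDemo {
    /** Applies Unicode normalization form KC (JDK implementation, not icu4j). */
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    /** Returns true when every character is in the 7-bit ASCII range. */
    static boolean isAscii(String s) {
        return s.chars().allMatch(c -> c < 0x80);
    }

    /** Formats each code point as U+XXXX so the result can be inspected. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // LATIN SMALL LIGATURE FI, VULGAR FRACTION ONE HALF, MICRO SIGN, LATIN CAPITAL LETTER AE
        String[] samples = {"\uFB01", "\u00BD", "\u00B5", "\u00C6"};
        for (String s : samples) {
            String out = nfkc(s);
            System.out.println(codePoints(s) + " -> " + codePoints(out)
                + " ascii=" + isAscii(out));
        }
    }
}
```

      In this sketch U+FB01 becomes the ASCII pair U+0066 U+0069, whereas U+00BD yields U+0031 U+2044 U+0032 and U+00B5 yields U+03BC, neither of which is pure ASCII. U+00C6 comes back unchanged, which is why it needs an entry in the mapping table.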

    • Samples:
      A list of sample split ligatures is shown in the following table:

      Unicode   Mapped ASCII   Char   Unicode Name
      U+00BC    1/4            ¼      VULGAR FRACTION ONE QUARTER
      U+00BD    1/2            ½      VULGAR FRACTION ONE HALF
      U+00BE    3/4            ¾      VULGAR FRACTION THREE QUARTERS
      U+00C6    AE             Æ      LATIN CAPITAL LETTER AE
      U+00E6    ae             æ      LATIN SMALL LETTER AE
      U+0132    IJ             Ĳ      LATIN CAPITAL LIGATURE IJ
      U+0133    ij             ĳ      LATIN SMALL LIGATURE IJ
      U+0152    OE             Œ      LATIN CAPITAL LIGATURE OE
      U+0153    oe             œ      LATIN SMALL LIGATURE OE
      U+FB00    ff             ﬀ      LATIN SMALL LIGATURE FF
      U+FB01    fi             ﬁ      LATIN SMALL LIGATURE FI
      U+FB02    fl             ﬂ      LATIN SMALL LIGATURE FL
      U+FB03    ffi            ﬃ      LATIN SMALL LIGATURE FFI
      U+FB04    ffl            ﬄ      LATIN SMALL LIGATURE FFL
      U+FB05    st             ﬅ      LATIN SMALL LIGATURE LONG S T
      U+FB06    st             ﬆ      LATIN SMALL LIGATURE ST

    • Java Code Implementation:
      • Download icu4j from the Internet.
      • Include icu4j.jar in the Java CLASSPATH.
      • import com.ibm.icu.text.*;
      • If the character is in the ligature mapping table:
        • Perform the mapping.
      • Else:
        • String normStr = Normalizer.normalize(inChar, Normalizer.NFKC);
        • Set the split string to normStr.trim().
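
      The steps above can be sketched as a small class. This is a minimal illustration, not the Lexical Tools source: it substitutes the JDK's java.text.Normalizer for icu4j so that it runs without extra jars, it fills the table by hand instead of loading ligatureMap.data, and it passes ASCII characters through unchanged (an added assumption, so that trim() does not remove ordinary spaces). The names LigatureSplitter, splitChar and split are hypothetical.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

final class LigatureSplitter {
    // Code point -> replacement string; stands in for the loaded mapping table.
    private final Map<Integer, String> table;

    LigatureSplitter(Map<Integer, String> table) {
        this.table = table;
    }

    /** Splits one code point: table mapping first, NFKC as the fallback. */
    String splitChar(int codePoint) {
        String inChar = new String(Character.toChars(codePoint));
        if (codePoint < 0x80) {
            return inChar; // assumption: ASCII input is left untouched
        }
        String mapped = table.get(codePoint);
        if (mapped != null) {
            return mapped; // the table entry overrides the NFKC result
        }
        // JDK equivalent of the icu4j call shown in the steps above.
        String normStr = Normalizer.normalize(inChar, Normalizer.Form.NFKC);
        return normStr.trim();
    }

    /** Applies splitChar to every code point of the input. */
    String split(String input) {
        StringBuilder sb = new StringBuilder();
        input.codePoints().forEach(cp -> sb.append(splitChar(cp)));
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> table = new HashMap<>();
        table.put(0x00C6, "AE");     // LATIN CAPITAL LETTER AE: not decomposed by NFKC
        table.put(0x00B5, "\u00B5"); // MICRO SIGN: kept as is instead of U+03BC
        LigatureSplitter splitter = new LigatureSplitter(table);
        // Input contains U+00C6 and U+FB03 (LATIN SMALL LIGATURE FFI).
        System.out.println(splitter.split("\u00C6ther e\uFB03cient")); // prints "AEther efficient"
    }
}
```

      As the KC notes point out, the output of this step may still contain non-ASCII characters (for example U+2044 from a split fraction), so further normalization may be needed afterwards.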

    • References: