Lexical Tools

Split Ligatures in Unicode

Introduction: Ligature characters appear in English documents. Splitting ligatures is a common step when normalizing non-ASCII Unicode characters to ASCII.

Algorithm:
Since 2008, the lexical tools have split ligatures with an enhanced algorithm that applies the following processes in order:

Table mapping:
The mapping method overrides the default ligature splitting result. It is straightforward: a Unicode ligature character is replaced with its assigned mapped string. A configurable mapping table, located at ${LVG}/data/Unicode/ligatureMap.data, is used for this purpose. It covers the Unicode ligatures that cannot be split by the Unicode normalization algorithm. This file is the default split-ligature mapping table provided by the lexical tools. Its format is shown below:

Unicode	Mapped String	Char	Unicode Name
U+00C6	AE	Æ	LATIN CAPITAL LETTER AE

Please note:
- Field 1 must be a non-ASCII Unicode character (given as its Unicode hex value).
- Field 2 must be a Unicode (ASCII or non-ASCII) string.
- Fields 3 and 4 are the Unicode character and name of field 1. They serve as annotation only and are not used by the program.
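
The table format above could be read along the following lines. This is a minimal sketch only: it assumes the fields are tab-separated (the actual delimiter of ligatureMap.data is not stated here), and the class LigatureMapParser and its parse method are hypothetical names, not part of the lexical tools.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the lexical tools.
final class LigatureMapParser {

    // Parses lines such as "U+00C6<TAB>AE<TAB>Æ<TAB>LATIN CAPITAL LETTER AE".
    // Assumption: fields are tab-separated; the real file may use another delimiter.
    static Map<Integer, String> parse(Iterable<String> lines) {
        Map<Integer, String> map = new LinkedHashMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length < 2 || !fields[0].startsWith("U+")) {
                continue; // skip the header line and anything malformed
            }
            // Field 1: code point in hex, which must be non-ASCII.
            int codePoint = Integer.parseInt(fields[0].substring(2), 16);
            if (codePoint < 0x80) {
                throw new IllegalArgumentException("Field 1 must be non-ASCII: " + fields[0]);
            }
            // Field 2: the mapped string. Fields 3 and 4 are annotation and are ignored.
            map.put(codePoint, fields[1]);
        }
        return map;
    }
}
```

Keying the map by code point rather than by char keeps the lookup valid for characters outside the Basic Multilingual Plane, should the table ever contain any.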

Unicode normalization KC algorithm (KC: Compatibility Decomposition, followed by Canonical Composition):
As discussed in Unicode Normalization, Unicode normalization form KC can be used for splitting ligatures. It decomposes a ligature into several Unicode characters. This process is applied after the table mapping. Please note:
- The split Unicode characters are not necessarily ASCII. For example,
  [½] (U+00BD) is split into [1] + [⁄] (U+2044) + [2],
  where [⁄] (U+2044) is not an ASCII character. Further normalization operations must follow to complete the normalization.
- Unexpected results from the KC algorithm:
  [µ] (U+00B5) is converted to [μ] (U+03BC),
  which is not expected. To fix this issue, a mapping line is added to the default ligature mapping table:

  Unicode	Mapped String	Char	Unicode Name
  U+00B5	µ	µ	MICRO SIGN
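
The two notes above can be reproduced with a short Java sketch. It uses the JDK's java.text.Normalizer in place of ICU4J so that it runs without extra libraries. The class name NfkcDemo and its methods are illustrative only.

```java
import java.text.Normalizer;

// Illustrative only: shows what NFKC does to a few of the characters discussed above.
final class NfkcDemo {

    // Applies Unicode normalization form KC to the input.
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Returns true when every char of the string is in the ASCII range.
    static boolean isAscii(String s) {
        return s.chars().allMatch(c -> c < 0x80);
    }

    public static void main(String[] args) {
        String half = nfkc("\u00BD");  // 1 + U+2044 (FRACTION SLASH) + 2
        System.out.println(half.length() + " chars, ASCII only: " + isAscii(half));

        String micro = nfkc("\u00B5"); // MICRO SIGN becomes GREEK SMALL LETTER MU
        System.out.println("U+" + Integer.toHexString(micro.codePointAt(0)).toUpperCase());

        System.out.println(nfkc("\uFB01")); // the fi ligature splits into "fi"
    }
}
```

Because the fraction slash survives NFKC, a check such as isAscii is one way to decide whether further normalization is still needed.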

Samples:
Sample split ligatures are shown in the following table:

Unicode	Mapped ASCII	Char	Unicode Name
U+00BC	1/4	¼	VULGAR FRACTION ONE QUARTER
U+00BD	1/2	½	VULGAR FRACTION ONE HALF
U+00BE	3/4	¾	VULGAR FRACTION THREE QUARTERS
U+00C6	AE	Æ	LATIN CAPITAL LETTER AE
U+00E6	ae	æ	LATIN SMALL LETTER AE
U+0132	IJ	Ĳ	LATIN CAPITAL LIGATURE IJ
U+0133	ij	ĳ	LATIN SMALL LIGATURE IJ
U+0152	OE	Œ	LATIN CAPITAL LIGATURE OE
U+0153	oe	œ	LATIN SMALL LIGATURE OE
U+FB00	ff	ﬀ	LATIN SMALL LIGATURE FF
U+FB01	fi	ﬁ	LATIN SMALL LIGATURE FI
U+FB02	fl	ﬂ	LATIN SMALL LIGATURE FL
U+FB03	ffi	ﬃ	LATIN SMALL LIGATURE FFI
U+FB04	ffl	ﬄ	LATIN SMALL LIGATURE FFL
U+FB05	st	ﬅ	LATIN SMALL LIGATURE LONG S T
U+FB06	st	ﬆ	LATIN SMALL LIGATURE ST

Java Code Implementation:
- Download ICU4J (icu4j.jar).
- Include icu4j.jar in the Java CLASSPATH.
- import com.ibm.icu.text.*;
- If the character is in the ligature mapping table, perform the mapping.
- Otherwise, call String normStr = Normalizer.normalize(inChar, Normalizer.NFKC); and set the split string to normStr.trim().
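
The steps above can be sketched as a small class. This is not the lexical tools source: it substitutes the JDK's java.text.Normalizer for ICU4J so that it runs without extra jars, and the class LigatureSplitter, its methods, and the choice to pass ASCII characters through unchanged are assumptions made for illustration.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of "table mapping first, NFKC as fallback".
final class LigatureSplitter {
    private final Map<Integer, String> table;

    // table: code point -> mapped string, e.g. loaded from a ligature mapping table.
    LigatureSplitter(Map<Integer, String> table) {
        this.table = new HashMap<>(table);
    }

    // Splits a single code point.
    String splitChar(int codePoint) {
        String original = new String(Character.toChars(codePoint));
        if (codePoint < 0x80) {
            return original; // assumption: ASCII is never a ligature, so leave it alone
        }
        String mapped = table.get(codePoint);
        if (mapped != null) {
            return mapped; // the mapping table overrides the normalization result
        }
        // Fallback: Unicode normalization form KC, trimmed as in the steps above.
        return Normalizer.normalize(original, Normalizer.Form.NFKC).trim();
    }

    // Applies splitChar to every code point of the input.
    String split(String input) {
        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(cp -> out.append(splitChar(cp)));
        return out.toString();
    }
}
```

With a table that maps U+00C6 to AE and U+00B5 to itself, split would turn Æ into AE, leave µ unchanged, and split ﬃ into ffi through the NFKC fallback. Characters such as ½ would still contain U+2044 afterwards and need further normalization.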

References:
- Unicode
- Unicode Normalization