Lexical Tools

Strip Diacritics in Unicode

Introduction
Diacritics are used in Spanish, French, and other languages, and also appear in English documents. Stripping diacritics is a common step in normalizing non-ASCII Unicode text to ASCII.
Algorithm:
Since 2008, Lexical Tools has stripped diacritics with an enhanced algorithm that applies the following processes:
- Table mapping (configurable):
  The table mapping method overrides the default diacritics-stripping result. It is a straightforward method that replaces a non-ASCII Unicode diacritic character with its assigned ASCII character, as defined in a configurable mapping table. This table is located at ${LVG}/data/Unicode/diacriticMap.data and covers those Unicode diacritic characters that cannot be stripped by the Unicode normalization algorithm. The file shipped with Lexical Tools is the default strip-diacritics mapping table. Its format is shown below:
  
  Unicode   Mapped ASCII   Char   Unicode Name
  U+00F8    o              ø      LATIN SMALL LETTER O WITH STROKE
  
  Please note:
  - Field 1 must be a non-ASCII Unicode character (given as its Unicode hex value)
  - Field 2 must be an ASCII character
  - Fields 3 and 4 are the Unicode character and name of field 1. They serve as annotation only and are not used by the program
- Unicode normalization D algorithm (D: Canonical Decomposition, not followed by Canonical Composition):
  As discussed in Unicode Normalization, Unicode Normalization D can be used for stripping diacritics. It decomposes a diacritic character into a base character followed by one or more combining diacritical marks. The combining marks can then be stripped by keeping only the leading base character. This algorithm works very well on most diacritics, such as characters in the Latin-1 Supplement, Latin Extended-A, and Latin Extended-B blocks.
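The difference between the two processes above can be seen with a small Java sketch. It is only an illustration: it uses the JDK's built-in java.text.Normalizer rather than ICU4J so that it runs without extra libraries, and the class and method names (NfdDemo, decompose, describe) are made up for this example.

```java
import java.text.Normalizer;

class NfdDemo {
    // Apply Unicode Normalization Form D (canonical decomposition only).
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Render a string as a space-separated list of code points, e.g. "U+0065 U+0301".
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute) decomposes into a base letter plus a combining mark.
        System.out.println(describe(decompose("\u00E9"))); // U+0065 U+0301
        // U+00F8 (o with stroke) has no canonical decomposition, so NFD leaves it unchanged.
        System.out.println(describe(decompose("\u00F8"))); // U+00F8
    }
}
```

Because U+00F8 comes back unchanged, normalization alone cannot reduce it to "o"; characters of this kind are the ones the mapping table is meant to handle.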
Java Code Implementation:
- Download icu4j from the internet
- Include icu4j.jar in the Java CLASSPATH
- import com.ibm.icu.text.*;
- If the character is in the diacritic mapping table, perform the mapping
- else
  - Strip the combining diacritical marks from normStr (the character's Unicode normalization D form)
    - If the length of normStr is more than 1 (a decomposed diacritic character)
    - and normStr contains combining diacritical mark characters (this excludes non-Latin-based characters)
      => Keep the first character (the base character always comes first)
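The steps above can be sketched roughly as follows. This is not the Lexical Tools source. It makes several assumptions: it uses the JDK's java.text.Normalizer instead of ICU4J, it replaces diacriticMap.data with a hypothetical in-memory map holding only the sample entry, and it treats "contains a combining diacritical mark" as membership in the Combining Diacritical Marks block (U+0300 to U+036F). All class and method names are illustrative.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

class StripDiacriticsSketch {
    // Hypothetical stand-in for the mapping table; the real table is read from a file.
    private static final Map<Integer, Character> MAPPING = new HashMap<>();
    static {
        MAPPING.put(0x00F8, 'o'); // LATIN SMALL LETTER O WITH STROKE
    }

    // Strip diacritics from every code point of the input string.
    static String strip(String in) {
        StringBuilder out = new StringBuilder();
        in.codePoints().forEach(cp -> out.append(stripCodePoint(cp)));
        return out.toString();
    }

    static String stripCodePoint(int cp) {
        // Step 1: table mapping takes precedence over normalization.
        Character mapped = MAPPING.get(cp);
        if (mapped != null) {
            return mapped.toString();
        }
        // Step 2: decompose with Unicode Normalization D.
        String original = new String(Character.toChars(cp));
        String normStr = Normalizer.normalize(original, Normalizer.Form.NFD);
        // Keep only the base character when the decomposition has more than one
        // character and includes a combining diacritical mark.
        if (normStr.length() > 1 && hasCombiningDiacriticalMark(normStr)) {
            return new String(Character.toChars(normStr.codePointAt(0)));
        }
        return original;
    }

    // Assumption: a "combining diacritical mark" is any character in U+0300..U+036F.
    static boolean hasCombiningDiacriticalMark(String s) {
        return s.codePoints().anyMatch(cp ->
                Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.COMBINING_DIACRITICAL_MARKS);
    }

    public static void main(String[] args) {
        System.out.println(strip("caf\u00E9"));   // cafe
        System.out.println(strip("S\u00F8ren")); // Soren
    }
}
```

The block check is what leaves characters such as Hangul syllables untouched: they also decompose into several characters under Normalization D, but none of the parts is a combining diacritical mark, so the original character is returned.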
References:
- Lexical Tools 2002 ~ 2003 used a user-defined file to strip diacritics by the table mapping method. The default list can be accessed in the 2002~2003 Lvg strip diacritic mapping table.
- Lexical Tools 2004 ~ 2007 used Unicode normalization algorithm D and a user-defined file to strip diacritics. The sample list can be accessed in the 2004~2007 Lvg sample list of strip diacritics.