Unicode is an industry standard that allows computers to consistently represent and manipulate text expressed in most of the world's writing systems. It is widely used in multilingual NLP (natural language processing) projects. However, some NLP projects still handle only ASCII characters. The Lexical Tools provide a new tool, toAscii, to convert Unicode characters (UTF-8) to 7-bit ASCII characters.
Since 2008, the SPECIALIST Lexical Tools, distributed by the National Library of Medicine (NLM), have provided several functions, called LVG (Lexical Variant Generation) flow components, to convert Unicode characters to ASCII. In general, ASCII conversion either preserves the semantic and/or graphic representation of a character or facilitates NLP. Different NLP applications may apply different methods for ASCII conversion because their requirements and objectives differ. There is no single standard method for ASCII conversion. For example, the character ™ can be converted in the following ways:
The tool, toAscii, encapsulates the lvg flow options -f:q7:q8.
That is,
The tool, toAscii, takes an input line, converts the entire line to ASCII, and sends the result to output. Each input produces at most one output. There are many different ways to convert Unicode characters to ASCII. The SPECIALIST Lexical Tools provide various powerful methods for ASCII conversion and allow users to configure the tools to their specifications. Users may define or modify their own Unicode mappings in the following tables to obtain the ASCII conversion results they need.
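The general idea of a table-driven conversion can be illustrated with a minimal Python sketch. This is not the LVG implementation: the function name to_ascii_sketch and the contents of SYMBOL_MAP are hypothetical, and the strategy (consult a mapping table first, then strip diacritics through canonical decomposition, then drop whatever is still non-ASCII) is only one of the many possible approaches mentioned above.

```python
import unicodedata

# Hypothetical user-defined mapping table (not taken from the Lexical Tools).
# Characters listed here are replaced by the given ASCII string.
SYMBOL_MAP = {
    "\u0251": "alpha",  # LATIN SMALL LETTER ALPHA
    "\u00e6": "ae",     # LATIN SMALL LETTER AE
}


def to_ascii_sketch(line: str) -> str:
    """Convert one input line to 7-bit ASCII, producing one output line."""
    out = []
    for ch in line:
        if ord(ch) < 128:
            # Already ASCII: keep as-is.
            out.append(ch)
        elif ch in SYMBOL_MAP:
            # Explicit mapping takes priority over any other rule.
            out.append(SYMBOL_MAP[ch])
        else:
            # Canonical decomposition splits a base letter from its
            # combining marks; keep only the ASCII parts. Characters with
            # no ASCII part (for example a trademark sign) are dropped.
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed if ord(c) < 128))
    return "".join(out)


if __name__ == "__main__":
    print(to_ascii_sketch("\u0251-Best\u2122"))
```

Changing the entries in SYMBOL_MAP changes the result, which mirrors the way different applications may want different conversions for the same character.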
Follow the installation instructions to install the Lexical Tools and run the toAscii program. Check the following items only if you did not use the provided script to install the Lexical Tools.
Enter the command:
shell> toAscii -p
- Please input a term (type "Ctl-d" to quit) >
ɑ-Best™
alpha-Best
- Please input a term (type "Ctl-d" to quit) >
Xigris®|spælsau|Evolène ©2002
Xigris|spaelsau|Evolene 2002
where:
toAscii takes its input (the entire line) from standard input, performs ASCII conversion, and then sends the results to standard output.
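Because toAscii reads from standard input and writes to standard output, it can in principle be driven from another program. The following Python sketch shows one way to do that. It assumes that toAscii is on the PATH and that, when run without the -p option, it emits only the converted lines; this page does not confirm the second assumption. The function name convert_lines is hypothetical.

```python
import subprocess


def convert_lines(lines, command=("toAscii",)):
    """Pipe lines to a stdin-to-stdout filter and return its output lines.

    The default command assumes a toAscii executable on the PATH; any other
    line-oriented filter can be passed instead.
    """
    payload = "\n".join(lines) + "\n"
    result = subprocess.run(
        list(command),
        input=payload,
        capture_output=True,
        text=True,
        encoding="utf-8",  # toAscii expects UTF-8 input
        check=True,        # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    for converted in convert_lines(["\u0251-Best\u2122"]):
        print(converted)
```

Passing the command as a parameter keeps the wrapper usable with a differently named or located launcher script.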
Please refer to design documents