Unicode is an industry standard that allows computers to consistently represent and manipulate text expressed in most of the world's writing systems. It is widely used in multilingual NLP (natural language processing) projects. However, some NLP projects still handle only ASCII characters. The Lexical Tools provide a new tool, toAscii, to convert Unicode characters (UTF-8) to ASCII (7-bit) characters.
Since 2008, the SPECIALIST Lexical Tools, distributed by the National Library of Medicine (NLM), have provided several functions, called LVG (Lexical Variant Generation) flow components, to convert Unicode characters to ASCII. In general, ASCII conversion either preserves the semantic and/or graphic representation of a character or facilitates NLP. Different NLP applications may apply different methods of ASCII conversion because their requirements and objectives differ. There is no single standard method for ASCII conversion. For example, the character ™ can be converted in the following ways:
The tool, toAscii, encapsulates the lvg flow options -f:q7:q8.
That is,
The tool, toAscii, takes the input, converts the entire input line to ASCII, and sends the result to output. Each input produces at most one output. There are many different ways to convert Unicode characters to ASCII. The SPECIALIST Lexical Tools provide various powerful methods for ASCII conversion and allow users to configure the tools to their specifications. Users may define or modify their own Unicode mappings in the mapping tables to obtain ASCII conversion results that meet their specifications.
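As a rough illustration of how a table-driven conversion of this kind can work, the following Python sketch applies a small user-editable mapping first, then falls back to canonical decomposition and drops whatever is still not ASCII. It is not the Lexical Tools implementation: the function name `to_ascii_sketch`, the table `CUSTOM_MAP`, and its entries are assumptions made for this example, chosen so that the sketch reproduces the sample inputs shown on this page.

```python
import unicodedata

# Hypothetical user-editable mapping table (an assumption for this sketch,
# not the actual table shipped with the Lexical Tools).
CUSTOM_MAP = {
    "\u0251": "alpha",  # LATIN SMALL LETTER ALPHA
    "\u03b1": "alpha",  # GREEK SMALL LETTER ALPHA
    "\u00e6": "ae",     # LATIN SMALL LETTER AE
}


def to_ascii_sketch(line, table=CUSTOM_MAP):
    """Convert one input line to 7-bit ASCII, character by character."""
    out = []
    for ch in line:
        if ord(ch) < 128:
            # Already ASCII: keep as is.
            out.append(ch)
        elif ch in table:
            # A user-defined mapping takes priority.
            out.append(table[ch])
        else:
            # Canonical decomposition splits a base letter from its
            # combining marks; anything still non-ASCII is dropped.
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed if ord(c) < 128))
    return "".join(out)


print(to_ascii_sketch("\u0251-Best\u2122"))  # alpha-Best
print(to_ascii_sketch("Xigris\u00ae|sp\u00e6lsau|Evol\u00e8ne \u00a92002"))  # Xigris|spaelsau|Evolene 2002
```

Adding or changing entries in the table changes the result for those characters only, which is the kind of per-project customization the paragraph above describes. Canonical (NFD) rather than compatibility (NFKD) decomposition is used here because NFKD would turn ™ into "TM" instead of removing it.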
Follow the installation instructions to install the Lexical Tools and run the toAscii program. Check the following items only if you did not use the provided script to install the Lexical Tools.
Enter the command:
shell> toAscii -p
- Please input a term (type "Ctl-d" to quit) >
ɑ-Best™
alpha-Best
- Please input a term (type "Ctl-d" to quit) >
Xigris®|spælsau|Evolène ©2002
Xigris|spaelsau|Evolene 2002
where:
toAscii takes its input (an entire line) from standard input, performs ASCII conversion, and then sends the result to standard output.
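The line-in, line-out behavior described above can be sketched in Python as a simple stream filter. This is only an illustration of the input/output pattern, not the tool's code: `filter_lines` and `strip_to_ascii` are hypothetical names, and the converter used here is deliberately crude (it decomposes accented letters and drops everything else that is not ASCII).

```python
import io
import unicodedata


def strip_to_ascii(text):
    """Crude stand-in converter: decompose, then drop non-ASCII characters."""
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def filter_lines(source, sink, convert=strip_to_ascii):
    """Read text line by line and write exactly one converted line per input line."""
    for line in source:
        sink.write(convert(line.rstrip("\n")) + "\n")


# In-memory streams stand in for standard input and standard output.
source = io.StringIO("Evol\u00e8ne\ncaf\u00e9\n")
sink = io.StringIO()
filter_lines(source, sink)
print(sink.getvalue(), end="")
```

To use the same function as a command-line filter, `sys.stdin` and `sys.stdout` could be passed in place of the in-memory streams.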
Please refer to design documents