Lexical Tools: toAscii (Java)


Introduction

Unicode is an industry standard that allows computers to consistently represent and manipulate text expressed in most of the world's writing systems. It is widely used in multilingual NLP (natural language processing) projects. However, some NLP projects still handle only ASCII characters. The Lexical Tools provide a new tool, toAscii, to convert Unicode characters (UTF-8) to 7-bit ASCII characters.

Since 2008, the SPECIALIST Lexical Tools, distributed by the National Library of Medicine (NLM), have provided several functions, called LVG (Lexical Variant Generation) flow components, to convert Unicode characters to ASCII. In general, an ASCII conversion either preserves the semantic and/or graphic representation of a character or facilitates NLP. Different NLP applications may apply different conversion methods because their requirements and objectives differ, so there is no single standard method for ASCII conversion. For example, the character ™ can be converted in the following ways:

  • Graphic: TM
  • Semantic: ![TRADE MARK SIGN]!
  • Graphic and Semantic: (TM), or (tm)
  • NLP: empty string (™ is treated as a stopword)
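
The four conversion styles listed above can be pictured as a small lookup keyed by strategy. The following is only an illustrative sketch: the class name, the enum, and its constant names are hypothetical and are not part of the Lexical Tools API. The replacement strings are the ones given in the list.

```java
import java.util.EnumMap;
import java.util.Map;

class TradeMarkConversionSketch {
    // Hypothetical strategy names that mirror the four bullets above.
    enum Strategy { GRAPHIC, SEMANTIC, GRAPHIC_AND_SEMANTIC, NLP }

    // Replacement for the trade mark sign (U+2122) under each strategy.
    static final Map<Strategy, String> TRADE_MARK = new EnumMap<>(Strategy.class);
    static {
        TRADE_MARK.put(Strategy.GRAPHIC, "TM");
        TRADE_MARK.put(Strategy.SEMANTIC, "![TRADE MARK SIGN]!");
        TRADE_MARK.put(Strategy.GRAPHIC_AND_SEMANTIC, "(TM)");
        TRADE_MARK.put(Strategy.NLP, "");  // dropped, as a stopword would be
    }

    // Replaces every trade mark sign in the text according to the chosen strategy.
    static String convert(String text, Strategy strategy) {
        return text.replace("\u2122", TRADE_MARK.get(strategy));
    }

    public static void main(String[] args) {
        for (Strategy s : Strategy.values()) {
            System.out.println(s + ": " + convert("Best\u2122", s));
        }
    }
}
```

Which strategy is appropriate depends on whether the downstream application needs the look of the character, its meaning, or neither.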

The toAscii tool encapsulates the LVG flow options -f:q7:q8. That is:

  1. q7: performs Unicode Core Norm, which will
    • map Unicode symbols and punctuation to ASCII
    • map other Unicode characters to ASCII
    • split ligatures
    • strip diacritics
  2. q8: then strips or maps any remaining non-ASCII Unicode characters
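
To make the two-stage flow concrete, here is a minimal sketch of the same idea in plain Java. It is not the LVG implementation: the class and method names are hypothetical, the three tiny maps stand in for the tool's much larger mapping tables, diacritics are stripped with the standard java.text.Normalizer rather than a table, and the q8 stand-in only strips (it does not map). The order of the q7 sub-steps simply follows the list above.

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Map;

class ToAsciiPipelineSketch {
    // Tiny illustrative tables (assumptions, not the contents of the real data files).
    static final Map<String, String> SYMBOL_MAP = new LinkedHashMap<>();
    static final Map<String, String> UNICODE_MAP = new LinkedHashMap<>();
    static final Map<String, String> LIGATURE_MAP = new LinkedHashMap<>();
    static {
        SYMBOL_MAP.put("\u201C", "\"");      // left double quotation mark
        SYMBOL_MAP.put("\u201D", "\"");      // right double quotation mark
        UNICODE_MAP.put("\u0251", "alpha");  // Latin small letter alpha
        LIGATURE_MAP.put("\u00E6", "ae");    // Latin small letter ae
    }

    // Applies every entry of a mapping table to the text.
    static String applyMap(String text, Map<String, String> map) {
        for (Map.Entry<String, String> e : map.entrySet()) {
            text = text.replace(e.getKey(), e.getValue());
        }
        return text;
    }

    // Decomposes accented letters (NFD), then removes the combining marks.
    static String stripDiacritics(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Stand-in for q7: symbols, other characters, ligatures, diacritics.
    static String coreNorm(String text) {
        text = applyMap(text, SYMBOL_MAP);
        text = applyMap(text, UNICODE_MAP);
        text = applyMap(text, LIGATURE_MAP);
        return stripDiacritics(text);
    }

    // Stand-in for q8: drops whatever is still outside 7-bit ASCII.
    static String stripNonAscii(String text) {
        return text.replaceAll("[^\\x00-\\x7F]", "");
    }

    static String toAscii(String text) {
        return stripNonAscii(coreNorm(text));
    }

    public static void main(String[] args) {
        System.out.println(toAscii("\u0251-Best\u2122"));  // prints alpha-Best
    }
}
```

Running the mapping steps before the final strip matters: a character that has a useful ASCII equivalent is converted first, and only characters with no mapping are removed.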

The toAscii tool reads each input line, converts the entire line to ASCII, and sends the result to the output. Each input line produces at most one output line. Because there are many ways to convert Unicode characters to ASCII, the SPECIALIST Lexical Tools provide several conversion methods and allow users to configure them. Users may define or modify the Unicode mappings in the following tables to obtain the ASCII conversion results they need:

  • q7: Unicode Core Norm
    • symbolMap.data
    • unicodeMap.data
    • diacriticMap.data
    • ligatureMap.data
  • q8: strip or map non-ASCII Unicode characters
    • nonStripMap.data

Setup

Follow the installation instructions to install the Lexical Tools and run the toAscii program. Check the following items only if you did not use the provided script to install the Lexical Tools.

  • CLASSPATH:
    1. include the Lexical tools distribution jar file, ${LVG_DIR}/lib/lvg${YEAR}dist.jar, in your CLASSPATH
    2. include the lvg top directory, ${LVG_DIR}, in your CLASSPATH

  • Configuration File: assign the full path of the top directory of lvg${YEAR} to a variable named LVG_DIR in the configuration file, ${LVG_DIR}/data/config/lvg.properties.
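
As a quick sanity check of the configuration item above, the sketch below reads LVG_DIR from a configuration file and reports whether the value looks like a full path. It assumes that lvg.properties uses standard Java properties syntax (KEY=value), which this page does not state explicitly. The class and method names are hypothetical and are not part of the Lexical Tools.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

class LvgConfigCheckSketch {
    // Returns the LVG_DIR value from the configuration text, or null if the key is absent.
    // Assumes Java properties syntax (an assumption, see the note above).
    static String readLvgDir(Reader config) {
        Properties props = new Properties();
        try {
            props.load(config);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String value = props.getProperty("LVG_DIR");
        return value == null ? null : value.trim();
    }

    // True when the value is a non-empty absolute path on the current platform.
    static boolean looksLikeFullPath(String lvgDir) {
        return lvgDir != null && !lvgDir.isEmpty() && Paths.get(lvgDir).isAbsolute();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java LvgConfigCheckSketch.java <path to lvg.properties>");
            return;
        }
        try (Reader reader = Files.newBufferedReader(Paths.get(args[0]))) {
            String lvgDir = readLvgDir(reader);
            System.out.println("LVG_DIR = " + lvgDir);
            System.out.println("looks like a full path: " + looksLikeFullPath(lvgDir));
        }
    }
}
```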

Test Run

  • Run the Java program

    Enter the command:

    
    shell> toAscii -p
    - Please input a term (type "Ctl-d" to quit) >
    ɑ-Best™
    alpha-Best
    - Please input a term (type "Ctl-d" to quit) >
    Xigris®|spælsau|Evolène ©2002
    Xigris|spaelsau|Evolene 2002
    

    where:

    • toAscii: script for ASCII conversion
    • -p: toAscii system option that shows the input prompt (also try the -h option)

Output Format

toAscii takes its input (an entire line at a time) from standard input, performs the ASCII conversion, and then sends the result to standard output.
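
The line-in, line-out contract described above can be sketched as a simple filter loop. This is not the toAscii source: the class name and the filter method are hypothetical, and the converter passed in main is a stand-in that merely drops non-ASCII characters instead of applying the q7 and q8 flows.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.function.UnaryOperator;

class LineFilterSketch {
    // Reads lines until end of input and writes one converted line per input line.
    static void filter(Reader in, Writer out, UnaryOperator<String> convert) {
        BufferedReader reader = new BufferedReader(in);
        PrintWriter writer = new PrintWriter(out);
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.print(convert.apply(line));
                writer.print('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        writer.flush();
    }

    public static void main(String[] args) {
        // Stand-in converter (an assumption): drop every non-ASCII character.
        UnaryOperator<String> standIn = s -> s.replaceAll("[^\\x00-\\x7F]", "");
        filter(new InputStreamReader(System.in, StandardCharsets.UTF_8),
               new OutputStreamWriter(System.out, StandardCharsets.US_ASCII),
               standIn);
    }
}
```

Because the whole line is converted as a unit, field separators such as the | characters in the Test Run example pass through unchanged.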

Global Behavior Options

Please refer to the design documents.

Output Field Options

Please refer to the design documents.