SPECIALIST Lexicon

ASCII LEXICON, 1st Version (09~10)

I. Introduction

The SPECIALIST LEXICON is distributed annually with the UMLS in UTF-8 format. Some NLP projects that use the SPECIALIST LEXICON still handle only ASCII characters. In response to requests from user groups, a pure ASCII version of the LEXICON has been distributed since 2009.

II. Algorithm

Convert LEXICON from UTF-8 to ASCII (7-bit):
Use the Java API class, ToAsciiApi( ), from Lexical Tools (after 2009) to perform the conversion.
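
The effect of this step can be illustrated with a minimal sketch. It does not call ToAsciiApi( ); it uses the JDK class java.text.Normalizer as a stand-in, and the names AsciiSketch and toAscii are hypothetical. It only strips diacritics and drops any character that is still non-ASCII, so the real Lexical Tools conversion may handle more cases (for example ligatures or symbols) differently.

```java
import java.text.Normalizer;

// Stand-in sketch only: this is NOT the Lexical Tools ToAsciiApi.
// AsciiSketch and toAscii are hypothetical names used for illustration.
class AsciiSketch {
    // Decompose accented letters (NFD), remove the combining marks,
    // then drop any character that is still outside 7-bit ASCII.
    static String toAscii(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        String withoutMarks = decomposed.replaceAll("\\p{M}+", "");
        return withoutMarks.replaceAll("[^\\x00-\\x7F]", "");
    }

    public static void main(String[] args) {
        // "r\u00e9sum\u00e9" is the spelling variant with e-acute, written
        // with Unicode escapes so the source file itself stays ASCII.
        System.out.println(toAscii("r\u00e9sum\u00e9")); // prints "resume"
    }
}
```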

Automatic/manual cleanup:
After the conversion, some data in the lexical records need to be cleaned up. For example, the spelling variant résumé is converted to resume and should be removed, since it is then the same as the base form. In the 2009 LEXICON, we found the ASCII conversion cases that need cleanup listed in the following table. LexCheck.CheckContent.Check( ) is used to remove the duplicates.

LEXICON content	Action	Notes & Example
{base=filler	N/A	All base forms are unique
spelling_variant=filler	remove if it is duplicated	spelling_variant=résumé
abbreviation_of=abbreviation	remove if it is duplicated	None
acronym_of=acronyms	remove if it is duplicated	None
nominalization_of=filler	remove if it is duplicated	None
variants=irreg	remove if it is duplicated	irreg|saute|sautes|sauted|sauted|sauteing|
compl=pphr(	N/A	Needs manual cleanup (none)
trademark=filler(	N/A	Needs manual cleanup (none)
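
The "remove if it is duplicated" action for spelling variants can be sketched as follows. This is an illustration under assumptions, not the LexCheck.CheckContent.Check( ) implementation: the class VariantCleanupSketch and the method cleanSpellingVariants are hypothetical, and a record is reduced here to a base form plus a list of already converted spelling variants.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of LexCheck.
class VariantCleanupSketch {
    // Keep each spelling variant once, in its original order, and drop
    // any variant that is identical to the base form after conversion.
    static List<String> cleanSpellingVariants(String base, List<String> variants) {
        Set<String> kept = new LinkedHashSet<>();
        for (String variant : variants) {
            if (!variant.equals(base)) {
                kept.add(variant); // a set ignores repeated values
            }
        }
        return new ArrayList<>(kept);
    }

    public static void main(String[] args) {
        // After ASCII conversion the variant of base "resume" is also
        // "resume", so nothing is left to keep.
        List<String> cleaned =
            cleanSpellingVariants("resume", Arrays.asList("resume"));
        System.out.println(cleaned); // prints "[]"
    }
}
```

The same comparison could be applied to the other fields marked "remove if it is duplicated" in the table; fields marked as needing manual cleanup are outside the scope of this sketch.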
