Lexical Tools

SD-Rules Set - From Lexicon To General English

I. Introduction

The SPECIALIST Lexicon is a large syntactic lexicon of biomedical and general English. The Lexicon is considered a representative subset of general English because its characteristics resemble those of general English, such as the growth of words over time, part-of-speech distribution versus word count, the morphology of inflections and derivations, and the orthography of spelling variants. The Lexicon has been widely used to provide the lexical information needed for Natural Language Processing (NLP) research.

The automatic generation of derivational variants is one of the most challenging processes in Lexical Variants Generation (LVG). A systematic approach to automatically generating derivational variants based on the SPECIALIST Lexicon was proposed and implemented in Lexical Tools in 2011 [1]. This approach addressed prefix derivations (PD), zero derivations (ZD), and suffix derivations (SD) from nominalizations (nomD). In 2012, the Lexical Systems Group (LSG) added a derivation generation system that uses a new methodology to generate Lexicon-based SD variants not covered by nomD [2].

The LVG derivational-variant generation function is based on a Facts and Rules paradigm designed to capture the morphological relations between terms. It uses a list of known SD-pairs (SD-Facts) and a set of rules (SD-Rules). The SD-Facts process performs a normalized, indexed keyword lookup on a database table to retrieve known SDs. This table collects all SD-pairs known to the Lexicon and is updated with each annual Lexicon release; 44,832 validated SD-pairs were collected in the 2013 release. In addition to SD-Facts, LVG uses SD-Rules to generate SD variants for pairs that are not in the Lexicon. The SD-Rules are stored and retrieved through a persistent Trie [3], combined with heuristic rules, to generate SD variants in LVG's rule-based generation.
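
The two-stage lookup can be pictured with the minimal Python sketch below. The names (SD_FACTS, SD_RULES, generate_sd_variants) are hypothetical simplifications, not the actual LVG API, and a flat rule list stands in for the persistent Trie.

    # Minimal sketch of the Facts-then-Rules lookup described above.
    # SD_FACTS, SD_RULES, and generate_sd_variants are hypothetical
    # simplifications; a flat list stands in for the persistent Trie.

    # SD-Facts: validated SD-pairs keyed by (term, POS), as in the database table.
    SD_FACTS = {
        ("retire", "verb"): [("retirement", "noun")],
    }

    # SD-Rules: (input suffix, input POS, output suffix, output POS).
    # An empty input suffix matches any word of that POS ("$" in LVG notation).
    SD_RULES = [
        ("", "verb", "ment", "noun"),    # $|verb|ment$|noun
        ("ism", "noun", "ist", "noun"),  # ism$|noun|ist$|noun
    ]

    def generate_sd_variants(term, pos):
        """Return SD variants: SD-Facts lookup first, SD-Rules as fallback."""
        if (term, pos) in SD_FACTS:      # Facts: indexed keyword lookup
            return SD_FACTS[(term, pos)]
        variants = []                    # Rules: suffix substitution
        for in_suf, in_pos, out_suf, out_pos in SD_RULES:
            if pos == in_pos and term.endswith(in_suf):
                stem = term[:len(term) - len(in_suf)]
                variants.append((stem + out_suf, out_pos))
        return variants

    print(generate_sd_variants("retire", "verb"))   # facts: [('retirement', 'noun')]
    print(generate_sd_variants("heroism", "noun"))  # rules: [('heroist', 'noun')]

Note that a rule produces only a raw candidate: “heroist” is not a valid derivation, which is why the per-rule accuracy rates discussed next matter.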

LSG derived 96 unique SD-Rules from the most common English suffixes. For example, the suffix “ment” can be added to a verb to create a noun, and the resulting noun is a suffix derivational variant of the verb. Thus, adding “ment” to “retire” creates “retirement”, expressible as the SD-pair retire|verb|retirement|noun. This SD-Rule is coded in the following format in LVG: $|verb|ment$|noun, where “$” marks the end of the word. Applying this SD-Rule to the Lexicon (as the corpus) yields 585 raw SD-pairs, 558 of which are tagged semi-automatically as valid SD-pairs, so the accuracy rate of this rule reaches 95.38% (558/585). On the other hand, some SD-Rules obtain low accuracy rates. For example, the SD-Rule ism$|noun|ist$|noun has only a 1.20% accuracy rate, with 4 valid SD-pairs out of 334 raw SD-pairs when applied to the Lexicon.

An optimization method is used to exclude SD-Rules with low accuracy rates so that the system achieves the desired performance, with both the accumulated coverage rate and the accumulated accuracy rate above 95% on the Lexicon corpus (a sketch of this rule-selection computation appears after the list below). If the derivational characteristics of the Lexicon are similar to those of general English, the same Trie algorithm and heuristic SD-Rules can be used to generate SDs in other corpora (outside the Lexicon) with a similarly high accuracy rate. This study proposes a system to test the derivational characteristics of two corpora, one taken from the Lexicon and the other from MEDLINE, to find:

  • The coverage rate of SD-Facts from the Lexicon (LVG) on the new corpus
  • The similarity of the accuracy rates of the SD-Rules
  • The similarity of the overall system performance
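
The rule-selection step can be sketched as follows, using the counts for the two example rules quoted above. The accuracy threshold and the definition of the accumulated coverage rate (the fraction of all valid SD-pairs retained by the kept rules) are assumed simplifications of the actual optimization method; RULE_COUNTS and select_rules are hypothetical names.

    # Sketch of per-rule accuracy and the rule-exclusion optimization.
    # Counts come from the two example rules in the text; the threshold
    # filter and the coverage definition are assumed simplifications.

    # rule -> (valid SD-pairs, raw SD-pairs) when applied to the Lexicon
    RULE_COUNTS = {
        "$|verb|ment$|noun": (558, 585),    # 95.38% accuracy
        "ism$|noun|ist$|noun": (4, 334),    #  1.20% accuracy
    }

    def select_rules(rule_counts, min_accuracy=0.95):
        """Exclude low-accuracy rules; report accumulated accuracy and coverage."""
        kept = {rule: (valid, raw) for rule, (valid, raw) in rule_counts.items()
                if valid / raw >= min_accuracy}
        kept_valid = sum(valid for valid, _ in kept.values())
        kept_raw = sum(raw for _, raw in kept.values())
        all_valid = sum(valid for valid, _ in rule_counts.values())
        accuracy = kept_valid / kept_raw if kept_raw else 0.0
        coverage = kept_valid / all_valid if all_valid else 0.0
        return kept, accuracy, coverage

    kept, accuracy, coverage = select_rules(RULE_COUNTS)
    print(sorted(kept))                       # ['$|verb|ment$|noun']
    print(f"{accuracy:.2%}, {coverage:.2%}")  # 95.38%, 99.29%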

If this study cannot be performed, we should instead compare the results of the above tests between Lexicon releases to confirm that the similarity is close to 1.0. Theoretically, the derivational characteristics of different Lexicon releases should be very similar.

II. Main Research Steps
This section briefly describes the high-level procedure for this study:

  • Define target corpus (MEDLINE):
    • Collect 1-3 years of MEDLINE abstracts and titles
    • Find all words
    • Tag all words with part of speech (POS)
  • Find and tag raw SD-pairs
    • Apply the 96 candidate SD-Rules from Lexical Tools (2013) to the word list and find all potential raw SD-pairs
    • Auto-tag these raw SD-pairs (from the SD-Facts of Lexical Tools, 2013)
    • Manually tag the rest (by linguists)
  • Analyze the results
    • Calculate the coverage rate of SD-Facts on the new data set
    • Compare the accuracy rate of each SD-Rule across the two corpora
    • Compare the system performance (accumulated accuracy and coverage rates) of the overall SD-Rule set
    • Use the above similarity results to decide whether the Lexicon is a representative subset of medical and general English for suffix derivation generation
      Apply the cosine coefficient, treating each SD-Rule as one vector dimension and its accuracy rate (or coverage, or performance) as the magnitude in that dimension (see the sketch below).
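
For the similarity measure in the last step, the cosine coefficient can be sketched as below: each SD-Rule is one vector dimension, and a corpus is represented by its vector of per-rule accuracy rates. The rate values shown are made-up placeholders, and the same computation applies to the coverage-rate and performance vectors.

    # Sketch of the cosine coefficient between two corpora: each SD-Rule is
    # one dimension; a corpus is the vector of its per-rule accuracy rates.
    # The rate values below are placeholders, not measured data.
    import math

    def cosine(u, v):
        """Cosine coefficient of two equal-length vectors."""
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    lexicon_acc = [0.9538, 0.0120, 0.9700]  # per-rule accuracy on the Lexicon
    medline_acc = [0.9400, 0.0200, 0.9600]  # per-rule accuracy on MEDLINE

    print(f"{cosine(lexicon_acc, medline_acc):.4f}")  # close to 1.0 => similar

A coefficient near 1.0 across the accuracy, coverage, and performance vectors would support treating the Lexicon as a representative subset for SD generation.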

III. References

  1. C.J. Lu, L. McCreedy, D. Tormey, and A.C. Browne, “A Systematic Approach for Automatically Generating Derivational Variants in Lexical Tools Based on the SPECIALIST Lexicon”, IEEE IT Professional, May/June 2012, pp. 36-42.
  2. C.J. Lu, D. Tormey, and A.C. Browne, “Completion on the Enhanced Features in Lexical Tools - Derivations”, AMIA 2013 Annual Symposium, Washington, DC, November 16-20, 2013, p. 904.
  3. A.V. Aho, J.E. Hopcroft, and J.D. Ullman, “Data Structures and Algorithms”, Addison-Wesley, 1983, pp. 163-169.