SD-Rules Set - From Lexicon To General English
I. Introduction
The SPECIALIST Lexicon is a large syntactic lexicon of biomedical and general English. The Lexicon is considered a representative subset of general English because its characteristics are similar to those of general English, such as the growth of words over time, the distribution of parts of speech by word count, the morphology of inflections and derivations, and the orthography of spelling variants. The Lexicon has been widely used to provide the lexical information needed for Natural Language Processing (NLP) research.
The automatic generation of derivational variants is one of the most challenging processes in Lexical Variants Generation (LVG). A systematic approach to automatically generating derivational variants based on the SPECIALIST Lexicon was proposed and implemented in Lexical Tools in 2011 [1]. This approach addressed prefix derivations (PD), zero derivations (ZD), and suffix derivations (SD) from nominalizations (nomD). In 2012, the Lexical System Group (LSG) added a derivation generation system that uses a new methodology to generate Lexicon-based SD variants not covered by nomD [2].
The LVG derivational variant generation function is based on a Facts and Rules paradigm designed to capture the morphological relations between terms. It uses a list of known SD-pairs (SD-Facts) and a set of rules (SD-Rules). The SD-Facts process performs a normalized, indexed keyword lookup against a database table to retrieve known SDs. This table collects all SD-pairs known to the Lexicon and is updated with each annual Lexicon release; the 2013 release contains 44,832 validated SD-pairs. In addition to SD-Facts, LVG uses SD-Rules to generate SD variants for SD-pairs that are not in the Lexicon. The SD-Rules are stored and retrieved through a persistent Trie [3] mechanism, with heuristic rules for generating SD variants in the LVG rule-based generation.
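The Facts and Rules lookup described above can be illustrated with a minimal Python sketch. It is not the LVG implementation: the in-memory `SD_FACTS` dictionary stands in for the database table, the `SuffixTrie` class is a plain (non-persistent) trie keyed on reversed suffixes, the names `derive` and `build_trie` are hypothetical, the LVG heuristics are not modeled, and the sketch assumes that SD-Facts take precedence over SD-Rules.

```python
from typing import Dict, List, Optional, Set, Tuple

Term = Tuple[str, str]             # (word, part of speech)
Rule = Tuple[str, str, str, str]   # (input suffix, input POS, output suffix, output POS)

# Hypothetical stand-in for the SD-Facts table, keyed by normalized (lowercased) term.
SD_FACTS: Dict[Term, Set[Term]] = {
    ("retire", "verb"): {("retirement", "noun")},
}

# Two rules taken from the text; an empty input suffix means "append to the whole word".
# The ism/ist rule is included only to exercise the trie walk.
SD_RULES: List[Rule] = [
    ("", "verb", "ment", "noun"),
    ("ism", "noun", "ist", "noun"),
]


class SuffixTrie:
    """Trie keyed on reversed input suffixes, so a lookup walks a word from its end."""

    def __init__(self) -> None:
        self.children: Dict[str, "SuffixTrie"] = {}
        self.rules: List[Rule] = []

    def insert(self, rule: Rule) -> None:
        node = self
        for ch in reversed(rule[0]):
            node = node.children.setdefault(ch, SuffixTrie())
        node.rules.append(rule)

    def matching_rules(self, word: str) -> List[Rule]:
        node: Optional["SuffixTrie"] = self
        found = list(self.rules)  # rules with an empty input suffix match any word
        for ch in reversed(word):
            node = node.children.get(ch)
            if node is None:
                break
            found.extend(node.rules)
        return found


def build_trie(rules: List[Rule]) -> SuffixTrie:
    trie = SuffixTrie()
    for rule in rules:
        trie.insert(rule)
    return trie


def derive(word: str, pos: str, trie: SuffixTrie) -> Set[Term]:
    """Return SD variants: known facts first, otherwise rule-generated candidates."""
    key = (word.lower(), pos)
    if key in SD_FACTS:
        return set(SD_FACTS[key])
    results: Set[Term] = set()
    for in_suffix, in_pos, out_suffix, out_pos in trie.matching_rules(key[0]):
        if in_pos == pos:
            stem = key[0][: len(key[0]) - len(in_suffix)]
            results.add((stem + out_suffix, out_pos))
    return results


TRIE = build_trie(SD_RULES)
```

In this sketch, `derive("retire", "verb", TRIE)` is answered from the facts table, while `derive("treat", "verb", TRIE)` falls through to the rules and yields a candidate that would still need validation.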
LSG derives 96 unique SD-Rules from the most common English suffixes. For example, the suffix “ment” can be added to a verb to create a noun, which is then a suffix derivational variant of that verb. Adding “ment” to “retire” creates “retirement”, expressible as the SD-pair retire|verb|retirement|noun. This SD-Rule is coded in LVG in the following format: $|verb|ment$|noun, where “$” marks the end of the word. Applying this SD-Rule to the Lexicon (as the corpus) finds 585 raw SD-pairs, 558 of which are tagged semi-automatically as valid, giving the rule an accuracy rate of 95.38% (558/585).
On the other hand, some SD-Rules have low accuracy rates. For example, the SD-Rule ism$|noun|ist$|noun has an accuracy rate of only 1.20%, with 4 valid SD-pairs out of 334 raw SD-pairs, when applied to the Lexicon. An optimization method excludes SD-Rules with low accuracy rates so that the system reaches the desired performance, with both the accumulated coverage rate and the accumulated accuracy rate above 95% on the Lexicon corpus.
If the derivational characteristics of the Lexicon are similar to those of general English, the same Trie algorithm and heuristic SD-Rules can be used to generate SDs in other corpora (outside the Lexicon) with a similarly high accuracy rate. This study proposes a system to test the derivational characteristics of two corpora, one taken from the Lexicon and the other from MEDLINE, to find:
If this study cannot be performed, we should compare the results of the above test between Lexicon releases to ensure that the similarity is close to 1.0. Theoretically, the derivational characteristics of different Lexicon releases should be very similar.
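The SD-Rule format and the accuracy figures quoted in this section can be sketched in Python as follows. This is an illustration only: the `SdRule` type, the `parse_rule`, `raw_pairs`, and `accuracy` functions, and the toy `LEXICON` set are hypothetical names and data, not part of LVG. Only the rule strings and the counts 558/585 and 4/334 come from the text.

```python
from typing import List, NamedTuple, Set, Tuple


class SdRule(NamedTuple):
    in_suffix: str
    in_pos: str
    out_suffix: str
    out_pos: str


def parse_rule(text: str) -> SdRule:
    """Parse a rule such as '$|verb|ment$|noun'; '$' marks the end of the word."""
    in_suffix, in_pos, out_suffix, out_pos = text.split("|")
    return SdRule(in_suffix.rstrip("$"), in_pos, out_suffix.rstrip("$"), out_pos)


def raw_pairs(rule: SdRule, lexicon: Set[Tuple[str, str]]) -> List[Tuple[str, str, str, str]]:
    """Find raw SD-pairs: both the base and the rule-generated variant are in the corpus."""
    pairs = []
    for word, pos in sorted(lexicon):
        if pos != rule.in_pos or not word.endswith(rule.in_suffix):
            continue
        stem = word[: len(word) - len(rule.in_suffix)]
        variant = stem + rule.out_suffix
        if (variant, rule.out_pos) in lexicon:
            pairs.append((word, pos, variant, rule.out_pos))
    return pairs


def accuracy(valid: int, raw: int) -> float:
    """Accuracy rate of a rule: valid SD-pairs divided by raw SD-pairs."""
    return valid / raw if raw else 0.0


# Toy corpus (an assumption for illustration; the real corpus is the Lexicon).
LEXICON = {
    ("retire", "verb"), ("retirement", "noun"),
    ("treat", "verb"), ("treatment", "noun"),
    ("baptism", "noun"), ("baptist", "noun"),
}

ment_rule = parse_rule("$|verb|ment$|noun")
for pair in raw_pairs(ment_rule, LEXICON):
    print("|".join(pair))

# Accuracy rates reported in the text for the two example rules.
print(f"{accuracy(558, 585):.2%}")  # $|verb|ment$|noun
print(f"{accuracy(4, 334):.2%}")    # ism$|noun|ist$|noun
```

Raw pairs found this way are only candidates. In the process described above, they are then tagged semi-automatically as valid or invalid before a rule's accuracy rate is computed.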
II. Main Research Steps
This section briefly describes the high-level procedures for this study:
III. References