4.8 Lexical Programs

The lexical variant generation package consists of three primary programs (a normalizer, a word index generator, and a lexical variant generator), together with a set of ancillary programs for normalization.

This package is implemented in Java (V1.2). Updates and bug fixes to this version may be found at the URL: http://umlsks.nlm.nih.gov/KSS/LVG.

The distributions come with install programs (for Solaris, Linux, and Windows) and a ReadMe.txt file that describes how to install and configure the lexical programs and gives a brief description of each program.

The docs directory contains user guides, Java API documents, and design documents describing in detail the use of Lexical tools. This document is a general introduction to the programs in the lexical variant generation package for the 2002 version.

The compressed lexical programs are as follows:

lvg2002.tar.gz

- The official 2002 distribution of LVG. This includes the source code for the programs, the data and tables the programs use (stored in a pure Java embedded database, Instant DB), full documentation, installation instructions, and jar files of the programs. See the documents contained within this distribution for a more complete description of this product.

Normalization (norm)

The lexical program norm generates the normalized strings that are used in the normalized string index, MRXNS. Thus norm must be used before MRXNS can be searched.

The normalization process involves stripping possessives, replacing punctuation with spaces, removing stop words, lower-casing each word, breaking a string into its constituent words, and sorting the words in alphabetic order. Each word is also reduced to its uninflected form. The uninflected forms are generated using the SPECIALIST lexicon if words appear in the lexicon; otherwise they are generated algorithmically. When a form could be an inflection of more than one base form, the new normalization process returns multiple uninflected forms. If a string to be normalized contains multiple ambiguous forms, and the permutations of these ambiguous forms yield more than 10 output forms, norm instead returns the input form lowercased, with punctuation replaced and words sorted, but not uninflected. The upper limit on the number of permutations (10) can be changed in the configuration file. The program luiNorm has the behavior of the prior year's normalization and is distributed for those who need it.
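
The string-level steps of this process can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the norm implementation: the function name simple_norm is invented here, the stop-word list is a small hypothetical sample, only the 's form of the possessive is handled, and the lexicon-based uninflection step is omitted entirely, so a plural such as Blasts stays plural.

```python
import re

# Hypothetical stop-word sample (an assumption); the list that norm
# actually uses ships with the distribution.
STOP_WORDS = {"of", "and", "with", "for", "to", "in", "by", "on", "the"}


def simple_norm(term):
    """Approximate the string-level steps of normalization.

    Uninflection via the SPECIALIST lexicon is NOT performed here.
    """
    text = re.sub(r"'s\b", "", term)        # strip possessive 's
    text = re.sub(r"[^\w\s]|_", " ", text)  # replace punctuation with spaces
    words = text.lower().split()            # lowercase, break into words
    # Lowercasing first makes the stop-word test case-insensitive.
    words = [w for w in words if w not in STOP_WORDS]
    return " ".join(sorted(words))          # sort the words alphabetically


print(simple_norm("Syndrome, anterior, compartment"))
# anterior compartment syndrome
```

Because the sketch skips uninflection, simple_norm("Anemia, Refractory, with Excess of Blasts") gives "anemia blasts excess refractory", whereas norm itself would reduce blasts to blast.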

Norm reads its standard input and writes to standard output. It expects input lines to be records separated into fields. The field separator is |. The string to be normalized is identified to norm using the -t option. -t takes a numerical argument which denotes the field in which the input string is to be found. If no -t option appears, norm assumes that the input string is in the first field (-t:1). There need not be more than one field, so lines consisting only of input strings are properly understood.

Norm output records include all the fields of the input record with an additional field to the right containing the normalized form of the input string.
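
The record handling described above might be sketched as follows. The function name norm_records and the normalize parameter are hypothetical: normalize stands in for the real normalizer and is assumed to return a list of normalized forms, since an ambiguous input can yield more than one.

```python
def norm_records(line, normalize, t=1, sep="|"):
    """Return one output record per normalized form of field t.

    `normalize` is a stand-in callable (an assumption, not part of norm)
    that maps a string to a list of normalized strings.
    """
    fields = line.rstrip("\n").split(sep)
    term = fields[t - 1]  # the -t argument is 1-based
    # Every input field is kept; the normalized form is added on the right.
    return [sep.join(fields + [form]) for form in normalize(term)]


# Usage with a trivial stand-in normalizer that only lowercases:
print(norm_records("UI1|Left Atrium", lambda s: [s.lower()], t=2))
# ['UI1|Left Atrium|left atrium']
```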

For example, if the user had a list of terms to be looked up via the normalized string index in a file called terms, he or she could use norm -i:terms -o:terms.nrm to get the normalized form of each term. If the input file terms contained the following:

    2, 4-Dichlorophenoxyacetic acid
    Syndrome, anterior, compartment
    Abnormal, weight, gain
    Anemia, Refractory, with Excess of Blasts
    left atriums

the file terms.nrm would contain:

    2, 4-Dichlorophenoxyacetic acid|2 4 acid dichlorophenoxyacetic
    Syndrome, anterior, compartment|anterior compartment syndrome
    Abnormal, weight, gain|abnormal gain weight
    Anemia, Refractory, with Excess of Blasts|anemia blast excess refractory
    left atriums|atrium left
    left atriums|atrium leave

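The last input term produces two output records because left is ambiguous: it may be a base form in its own right or an inflection of leave. The string in the second field of each line of terms.nrm is now suitable for matching to MRXNS.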

Word Index (wordInd)

The lexical program wordInd breaks strings into words for use with the word index in MRXW. Users of the word index should use wordInd to break strings into words before searching in the word index. This assures congruence between the words to be looked up and the word index.

A word, for this purpose, is defined as a token of one or more alphanumeric characters. The wordInd program lowercases the output words.

The wordInd program reads its standard input and writes to its standard output. Like norm and lvg, it expects each input line to be a record separated into fields by |. The field containing the input string is identified using the -t option. The numerical argument of -t denotes the field in which the input string may be found. If no -t option is given, the input string is expected to be in the first field (-t:1). There need not be more than one field, so lines consisting only of input strings are properly understood.

The wordInd program outputs one line of output for each word found in the input string. Input fields are not repeated in the output unless specified in a -F option. Applying wordInd to the input string Heart Disease, Acute would result in three output lines:

    heart
    disease
    acute
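
The word definition above can be illustrated with a short Python sketch. It is a simplified stand-in, not the wordInd implementation: the function name word_ind is invented here, and alphanumeric is assumed to mean the ASCII letters and digits.

```python
import re


def word_ind(text):
    """Split text into lowercase words.

    A word is taken to be a maximal run of one or more ASCII
    alphanumeric characters (an assumption about wordInd's rule).
    """
    return [w.lower() for w in re.findall(r"[A-Za-z0-9]+", text)]


print(word_ind("Heart Disease, Acute"))
# ['heart', 'disease', 'acute']
```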

The numerical argument of -F indicates an input field to be repeated in the output. A numerical argument to the -F option is required for each input field that is to be repeated. Fields are repeated in the order in which the numerical arguments of the -F option appear. The output words always appear as an additional field to the right of any repeated input fields. For example, applying wordInd -t:2 -F:2:1 to a record of the form UI23456|tooth, canine|definition.....; would result in the following output:

    tooth, canine|UI23456|tooth
    tooth, canine|UI23456|canine

The third field of each of those records contains a word extracted from the input term, which is repeated in the first field (-t:2, -F:2). The -F:1 option repeats the UI number from the first field of the input. Because the option was given as -F:2:1, the UI number (field 1) is placed after the input string (field 2).
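
The combined effect of -t and -F can be sketched in Python as follows. The function name word_ind_records and the repeat parameter (standing in for the arguments of -F) are hypothetical, and words are again assumed to be runs of ASCII letters and digits.

```python
import re


def word_ind_records(line, t=1, repeat=(), sep="|"):
    """Emit one output record per word found in field t.

    `repeat` lists the 1-based input fields to copy into each output
    record, in the order given (mirroring -F:2:1 as repeat=(2, 1)).
    """
    fields = line.rstrip("\n").split(sep)
    words = re.findall(r"[A-Za-z0-9]+", fields[t - 1])
    prefix = [fields[n - 1] for n in repeat]
    # The word always goes to the right of any repeated fields.
    return [sep.join(prefix + [w.lower()]) for w in words]


for rec in word_ind_records("UI23456|tooth, canine|definition", t=2, repeat=(2, 1)):
    print(rec)
# tooth, canine|UI23456|tooth
# tooth, canine|UI23456|canine
```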

Lexical Variant Generation (lvg)

The lvg program generates lexical variants of input words. It consists of several different flow components that can be combined in various ways to produce lexical variants. The user of lvg chooses flow components and combines them into a flow. (The normalizer program, norm, is essentially the lvg program with a pre-selected flow option: lvg -f:N.) The arguments of the -f flag are used to specify a flow. Each flow can be thought of as a pipeline with each flow component feeding the next. For example, the flow -f:i simply generates inflectional variants and -f:l:i generates lowercase inflectional variants. Each of the flow component options is discussed in the documents for lvg.

The lvg program reads from its standard input and writes to its standard output. Input records may be typed in at the keyboard after typing the command on the command line (lvg -f:i), read from a file (lvg -f:i -i:file), or piped to lvg from another command (COMMAND | lvg -f:i). Output records may be directed to the screen (default), sent to a file (lvg -f:i -i:INFILE -o:OUTFILE), or piped to another command (lvg -f:i -i:infile | COMMAND).

Input

The lvg program is designed to work with one-line input records divided into fields. The default field separator is |. The field separator can be changed using the -s option. The field containing the input term, whose variants are to be generated, can be specified with the -t option. In the absence of a -t flag, the input term is assumed to be in the first field of the input. So both dog and dog|canine|UI4567 would generate variants of dog. With the -t flag set to 2, dog|canine|UI4567 would generate variants of canine. In the case of single-field input (dog), lvg generates variants from the only field regardless of the setting of -t.
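
The field selection rule just described might look like this in Python. The function name select_term is hypothetical, and the sketch makes no claim about what lvg does when -t names a field that a multi-field record lacks; here that case simply raises an IndexError.

```python
def select_term(line, t=1, sep="|"):
    """Pick the input term from a record, following the -t and -s rules.

    With a single-field record the only field is used whatever t is.
    """
    fields = line.rstrip("\n").split(sep)
    if len(fields) == 1:
        return fields[0]
    return fields[t - 1]  # the -t argument is 1-based


print(select_term("dog|canine|UI4567"))       # dog
print(select_term("dog|canine|UI4567", t=2))  # canine
print(select_term("dog", t=2))                # dog
```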

The lvg program can read category (part of speech) and inflection information from the input record. The numerical argument to the -cf option indicates the field in which category information is located. In the input record, category information needs to be encoded as a number according to the scheme described in the documents for lvg. The numerical argument to the -if option indicates the field in which inflection information is located. In the input record, inflection information needs to be encoded as a number according to the scheme described in the documents for Lexical tools.

Output

The lvg program adds five new fields to the input record and outputs a record for each variant generated. For example, if dog|canine|UI4567 is given to the standard input of lvg -f:i, the output sent to standard output will be:

    dog|canine|UI4567|dog|128|1|i|1|
    dog|canine|UI4567|dog|128|512|i|1|
    dog|canine|UI4567|dogs|128|8|i|1|
    dog|canine|UI4567|dog|1024|1|i|1|
    dog|canine|UI4567|dog|1024|262144|i|1|
    dog|canine|UI4567|dog|1024|1024|i|1|
    dog|canine|UI4567|dogs|1024|128|i|1|
    dog|canine|UI4567|dogged|1024|64|i|1|
    dog|canine|UI4567|dogged|1024|32|i|1|
    dog|canine|UI4567|dogging|1024|16|i|1|

The first three fields of each record above are identical to the input record; the rest are supplied by lvg. The first additional field is the variant form lvg has generated. The second additional field is the syntactic category of the variant encoded as a number. The third additional field is the inflection of the variant encoded as a number. The fourth additional field indicates the flow that was selected. The fifth additional field is the number of the flow which generated this variant. Output category (part of speech) and inflection information are encoded in the same scheme used for input category and inflection information.
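
A program consuming lvg output could split each record along the lines of the Python sketch below. The names LvgVariant and parse_lvg_line are invented for this illustration, and the caller is assumed to know how many fields the input record had. The category and inflection codes are kept as plain integers; decoding them requires the scheme in the lvg documents.

```python
from typing import NamedTuple


class LvgVariant(NamedTuple):
    input_fields: list  # the original input record, field by field
    variant: str        # variant form generated by lvg
    category: int       # syntactic category code
    inflection: int     # inflection code
    flow: str           # flow that was selected, e.g. "i"
    flow_number: int    # number of the flow that produced the variant


def parse_lvg_line(line, n_input_fields, sep="|"):
    """Split one lvg output record into input fields and the five added fields."""
    fields = line.rstrip("\n").split(sep)
    if fields and fields[-1] == "":
        fields.pop()  # drop the empty piece after the trailing separator
    head, added = fields[:n_input_fields], fields[n_input_fields:]
    if len(added) != 5:
        raise ValueError("expected five added fields, got %d" % len(added))
    variant, category, inflection, flow, flow_number = added
    return LvgVariant(head, variant, int(category), int(inflection),
                      flow, int(flow_number))


rec = parse_lvg_line("dog|canine|UI4567|dogs|128|8|i|1|", n_input_fields=3)
print(rec.variant, rec.category, rec.inflection)
# dogs 128 8
```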

For a more detailed technical discussion of lvg, norm, and wordInd see the documents for Lexical tools.