Canonize Algorithm
Introduction:
Program, Canonize, creates the canonical or equivalent class forms, given a set of words. This program is run to create the luiNorm canonical forms when a new or updated set of words is added. Typically this program is used (run) once a year. The words are in the forthcoming UMLS Metathesaurus and Specialist Lexicon are the input words for this Canonize program.
A core LVG technique is to "uninflect" input terms to their base forms. This process occasionally results in two legitimate uninflected forms for the same inflected input. For example, left uninflects to both left and leave reflecting its ambiguity as either adjective or verb. A technique to manage this ambiguity produces only one "canonical" base form for any given input term. The process of canonicalization pre-computes all uninflected forms (of all bases and their spelling variants) and then arranges these into classes composed of terms that could be expanded to the same inflected form. The canonical form is an arbitrarily chosen member of this class and represents all the members of the class. For example, the terms left, leave, and leaf are all included in one such class, and the canonical form is leaf, known to lexicon first, pure ASCII, shortest, and then the alphabetical order. The member of the class is chosen to be a form from the lexicon if possible. This is an attempt to limit the number of word fragments that shows up as canonical representations of the class of terms.
In the Java implementation, spelling variants knowledge is also implemented in the algorithm. In other words, spelling variants will have same canonical form. For example, "dependent" and "dependant" are spelling variants and have same canonical form as "dependant".
The canonical form is, in reality, a representation of a class of terms, rather than a word. As such, its numeric representation of that class is just as valid.
Input
The input file includes a list of words, which are all words known in the target that you are going to. In our case, it's all the words from the Lexicon plus all the words from the Metathesaurus (atoms.data), put through wordInd -s:1, then sorted and uniqued. The sorting and uniquing make the canonize program run more efficiently.
Output:
A file includes all canonized data for all uninflected words from UMLS Metathesaurus and The Specialist Lexicon. The format of the output is as follows:
uninflected form | canonical form | canonical Id |
For example:
leaf | leaf | 436283 |
left | leaf | 436283 |
leave | leaf | 436283 |
Algorithm:
I. Pre-Process
II. Core-Process
Check if this new equivalent class is related to any existed equivalent class:
Notes:
BASE | |||
---|---|---|---|
Name | Type | Properties | Notes |
base | varchar(100) | Not null, primary key (Unique) | base (spelling vars) |
classId | int | Not null, index | Equivalent class id |
INFLECTION | |||
---|---|---|---|
Name | Type | Properties | Notes |
inflection | varchar(100) | Not null, primary key (Unique) | Base |
classId | int | Not null, index | Equivalent class id |
CANON | |||
---|---|---|---|
Name | Type | Properties | Notes |
base | varchar(100) | Not null, index | Base |
classId | int | Not null, index | Equivalent class id |
III. Post-Process
Data Structure of equivalent class:
Comparison to the algorithm in C version:
The Java version uses a forward algorithm to compute all canonical forms. In other words, the Java version retrieve all citation forms for all base words (input); retrieve uninflected form for all citation forms; then compares citation forms and inflected forms to see if words are related.
The implementation in C uses reversed algorithm. Again, it define two based forms are related when they have same inflected forms. However, it limits the domain of all uninflected forms to the words list from Lexicon and Metathesaurus. In other words, it go through all words list to check if they have same base. If so, make the base form related.
The algorithm can be summarized as follows:
The major differences between C and Java implementation are Java have wider range for a related equivalent class. This is caused by those words only exist in Metathesaurus and not in Lexicon. The inflectional forms (generated by rules) may make it related to a class which is known by Lexicon, and vice versa.