Norm creates an abstract representation of text strings that allows users to ignore alphabetic case, inflection, spelling variants, punctuation, genitive markers, stop words, diacritics, symbols, ligatures, and word order. The normalized string is a version of the original string in lower case, without punctuation, genitive markers, stop words, diacritics, or ligatures, with each word in its uninflected (citation) form and the words sorted in alphabetical order. Non-ASCII Unicode characters are normalized to ASCII by mapping punctuation and symbols to ASCII, mapping Unicode to ASCII, stripping diacritics, splitting ligatures, and stripping any remaining non-ASCII Unicode characters. Lexical variants that differ only in those ways have the same normalized form. Norm is used to create the normalized string and word indexes to the UMLS Metathesaurus and to access those indexes.
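The surface-level part of this description can be illustrated with a minimal Python sketch. It is not the Norm implementation: it performs no lexicon lookup (so nothing is uninflected), the function name toy_norm is hypothetical, and the stop-word list is an illustrative assumption rather than the list shipped with the Lexical Tools.

```python
import re

# Illustrative stop-word list (an assumption, not the Lexical Tools list).
STOP_WORDS = {"of", "and", "with", "for", "to", "in", "by", "on", "the"}


def toy_norm(term):
    """Approximate the surface steps of normalization, without a lexicon."""
    # Remove genitive markers: "Crohn's" -> "Crohn", "patients'" -> "patients".
    text = re.sub(r"'s\b", "", term)
    text = re.sub(r"(?<=s)'(?!\w)", "", text)
    # Replace remaining punctuation with spaces.
    text = re.sub(r"[^\w\s]", " ", text)
    # Lowercase, drop stop words, and sort the words alphabetically.
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return " ".join(sorted(words))


print(toy_norm("Disease of the Kidney, Crohn's"))  # crohn disease kidney
```

Two strings that differ only in case, punctuation, stop words, or word order produce the same result under this sketch, which is the property the paragraph above describes.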
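Normalization encapsulates the lvg flow options -f:q0:g:rs:o:t:l:B:Ct:q7:q8:w. That is, the following flow components are applied in order: map Unicode symbols and punctuation to ASCII (q0), remove genitives (g), remove parenthetic plural forms (rs), replace punctuation with spaces (o), remove stop words (t), lowercase (l), uninflect each word (B), retrieve the citation form (Ct), apply Unicode core norm (q7), strip or map non-ASCII Unicode characters (q8), and sort the words (w).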
There may be more than one normalized form for a particular string. Some English inflected forms have more than one uninflected form. For example, "scleroses" could be the plural of the noun "sclerosis" or the third person singular of the verb "sclerose". In this version of Norm, multiple uninflected forms are returned for ambiguously inflected forms as recorded in the SPECIALIST lexicon.
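How ambiguous inflections lead to several normalized forms can be sketched as follows. The mini-lexicon and the function name normalized_forms are hypothetical stand-ins for the SPECIALIST lexicon lookup; only the two example entries come from this document.

```python
from itertools import product

# Hypothetical mini-lexicon: inflected form -> possible uninflected forms.
TOY_LEXICON = {
    "scleroses": ["sclerosis", "sclerose"],
    "left": ["left", "leave"],
}


def normalized_forms(term):
    """Return every normalized string implied by the toy lexicon."""
    # Each word contributes one or more candidate base forms.
    per_word = [TOY_LEXICON.get(w, [w]) for w in term.lower().split()]
    # Combine the candidates, sort the words of each combination,
    # and de-duplicate the resulting strings.
    forms = {" ".join(sorted(combo)) for combo in product(*per_word)}
    return sorted(forms)


print(normalized_forms("multiple scleroses"))
```

The sketch returns the forms in alphabetical order, which is a choice made here for determinism and not a claim about the order in which Norm emits them.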
In the 2004 release, Norm was enhanced to normalize spelling variants by returning the citation form of the uninflected base form instead of the base form itself. For example, both "coloring" and "colouring" are normalized to "color". In the same release, Norm was also enhanced to strip diacritics, split ligatures, and return the synonyms of Unicode symbols for characters that are neither ASCII nor a diacritic or ligature. This feature was modified starting with the 2008 release (see below).
The minimum word size for normalization differs from the lvg default word size. In the normalization process, the minimum word size is one character, so words of any length are processed.
In the 2006 release, Norm was enhanced to remove the parenthetic plural forms (s), (es), (ies), (S), (ES), and (IES). However, Norm does not remove these patterns when they are not plural forms, as in chemical terms, protein names, or mathematical equations. For example, "Inj oth musc(s)/tend(s)" is normalized to "inj musc oth tend" and "Abdomen CT Adrenal Mass(es) Bilateral" is normalized to "abdomen adrenal bilateral ct mass". Norm was also enhanced to better handle terms with irregular inflectional variants. For example, "proofread", "proof-read", and "proof read" are all normalized to "proof read".
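The removal of parenthetic plural forms can be approximated with a regular expression. This is a simplified sketch under stated assumptions: it strips the six listed patterns whenever they directly follow a letter, and it does not attempt the exceptions for chemical terms, proteins, or equations that Norm applies. The names PAREN_PLURAL and strip_parenthetic_plurals are hypothetical.

```python
import re

# Matches (s), (es), (ies), (S), (ES), (IES) only when attached to a letter.
PAREN_PLURAL = re.compile(r"(?<=[A-Za-z])\((?:s|es|ies|S|ES|IES)\)")


def strip_parenthetic_plurals(term):
    """Remove parenthetic plural suffixes; no chemical-name exceptions."""
    return PAREN_PLURAL.sub("", term)


print(strip_parenthetic_plurals("Inj oth musc(s)/tend(s)"))  # Inj oth musc/tend
```

The remaining steps (replacing punctuation, lowercasing, sorting) would then turn "Inj oth musc/tend" into the "inj musc oth tend" shown above.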
Unicode has come into common use in recent years. UTF-8 has been the default input and output format in norm (lvg) since 2004, and UTF-8 was introduced into the SPECIALIST Lexicon at the same time. As a result, a citation form could contain non-ASCII characters, so the output of norm could also contain non-ASCII characters. For example, the citation form of "varon" is "varón". Norm was enhanced in 2007 to resolve this issue and produce ASCII-only output.
In the 2008 release, Norm was enhanced to use Unicode core norm (-f:q7) to convert non-ASCII Unicode characters to ASCII. This operation includes mapping Unicode symbols and punctuation to ASCII, mapping Unicode to ASCII, splitting ligatures, and stripping diacritics. A further flow component (-f:q8) is then applied to strip or map any remaining non-ASCII Unicode characters and ensure pure ASCII output.
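The general idea of converting Unicode text to pure ASCII can be sketched with the Python standard library. This is not how the q7 and q8 flow components are implemented: the sketch swaps in Unicode NFKD decomposition for diacritics and compatibility ligatures, uses a small illustrative mapping table (an assumption) for characters NFKD does not decompose, and simply strips whatever non-ASCII characters remain.

```python
import unicodedata

# Small illustrative table (an assumption): characters that NFKD does not
# decompose need explicit ASCII mappings.
EXPLICIT_MAP = {
    "\u00e6": "ae",  # ae ligature
    "\u00c6": "AE",
    "\u0153": "oe",  # oe ligature
    "\u0152": "OE",
    "\u2019": "'",   # right single quotation mark
    "\u2013": "-",   # en dash
}


def to_ascii(text):
    """Map, decompose, and strip so that the result is pure ASCII."""
    mapped = "".join(EXPLICIT_MAP.get(ch, ch) for ch in text)
    # NFKD splits compatibility ligatures and separates diacritics.
    decomposed = unicodedata.normalize("NFKD", mapped)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Final safety net: drop anything that is still non-ASCII.
    return no_marks.encode("ascii", "ignore").decode("ascii")


print(to_ascii("var\u00f3n"))  # varon
```

The last step mirrors the role described for the final flow component: whatever the earlier mappings miss is removed so that the output is guaranteed to be ASCII.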
Follow the installation instructions to install the Lexical Tools and run the norm program. Check the following items only if you did not use the provided script to install the Lexical Tools.
Enter the command:
shell> norm -p
- Please input a term (type "Ctl-d" to quit) > left
left|left
left|leave
where:
-p shows the input prompt.
Norm copies its input from standard input to standard output with the normalized term appended. Output consists of:
Input line | Output term |
This may be one or more fields. | This is the normalized term from the input line. |
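Because the normalized term is always the last field, output lines can be parsed by splitting on the final "|". The following Python sketch assumes the "|" separator shown in the example session and skips lines that contain no separator; the function name parse_norm_output is hypothetical.

```python
from collections import defaultdict


def parse_norm_output(lines):
    """Group normalized terms by the input line that produced them."""
    results = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if "|" not in line:
            continue  # assumption: such lines carry no normalized term
        # The input may itself contain several fields, so split from the right.
        input_part, normalized = line.rsplit("|", 1)
        results[input_part].append(normalized)
    return dict(results)


print(parse_norm_output(["left|left\n", "left|leave\n"]))
# {'left': ['left', 'leave']}
```

Splitting from the right keeps multi-field input lines intact, so an identifier column placed before the term is preserved as part of the key.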
Please refer to the design document.