Lexical Tools


  • Short Description: Normalize and canonicalize the input text.

  • Full Description: This is the luiNorm flow. This flow component is based on the -fN flow from the C version. It is a version of Norm used in the UMLS Metathesaurus to collect strings into terms (represented by LUIs).

    This flow is provided for backward compatibility, along with some feature enhancements. A better flow to use for normalization is the -f:N flow. The process abstracts away from case, inflection, and word order. It also removes stop words, possessives, and parenthetic plural forms, strips diacritics, splits ligatures, maps non-ASCII symbols to ASCII, replaces punctuation with spaces, and strips or maps non-ASCII Unicode characters in the input term. Specifically, this normalization is more or less equivalent to the combined flow options, in this order: -f:q7:g:rs:o:t:l:B:C:q8:w. That is: normalize non-ASCII Unicode characters to ASCII, remove genitives, remove parenthetic plural forms, replace punctuation with spaces, remove stop words, lowercase, uninflect each word, map each normalized, uninflected word to its canonical form, strip or map non-ASCII Unicode characters to ASCII, and finally sort the words.

    Only one output record is generated for one input term.

    The -m option flag has no effect on this flow: "none" is added at the end of the output.

    This flow option is useful for making a compact yet retrieval-enhancing index on a set of terms. It is used to build the UMLS Metathesaurus normalized word and string indexes. It is therefore useful to transform one's queries using this flow option when retrieving from the Metathesaurus normalized word and string indexes.
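
    The indexing idea above can be sketched in Java as follows. This is a minimal illustration, not the Metathesaurus implementation: NormIndexSketch and toyKey are hypothetical names, and toyKey is only a toy stand-in for the N3 flow (it lowercases, replaces non-alphanumerics with spaces, and sorts words, but does not uninflect, canonicalize, or remove stop words). The point is that terms and queries are passed through the same normalizer before lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class NormIndexSketch {
    // Toy stand-in for the N3 flow (assumption): lowercase, replace
    // non-alphanumerics with spaces, then sort the words.
    static String toyKey(String term) {
        String cleaned = term.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9]+", " ")
                .trim();
        if (cleaned.isEmpty()) {
            return "";
        }
        String[] words = cleaned.split(" ");
        Arrays.sort(words);
        return String.join(" ", words);
    }

    // Group the original terms under their normalized key.
    static Map<String, List<String>> buildIndex(List<String> terms) {
        Map<String, List<String>> index = new LinkedHashMap<>();
        for (String term : terms) {
            index.computeIfAbsent(toyKey(term), k -> new ArrayList<>()).add(term);
        }
        return index;
    }

    // Normalize the query the same way before looking it up.
    static List<String> lookup(Map<String, List<String>> index, String query) {
        return index.getOrDefault(toyKey(query), Collections.emptyList());
    }

    public static void main(String[] args) {
        Map<String, List<String>> index =
                buildIndex(Arrays.asList("Skin Burn", "burn, skin", "Heart attack"));
        System.out.println(lookup(index, "BURN skin"));
    }
}
```

    In a real setting, toyKey would be replaced by a call into the actual normalizer so that the index and the queries agree on the key.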

    Note. The normalized form is, in reality, a representation of a class of terms, rather than a word. As such, a numeric representation of that class is just as valid.
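
    To illustrate the note above, the following Java sketch assigns an integer to each distinct normalized string, so that the number stands for the class of terms. ClassIdSketch is a hypothetical helper; the numbering scheme (first-seen order, starting at 1) is an assumption for illustration and is not how LUIs are assigned in the UMLS.

```java
import java.util.HashMap;
import java.util.Map;

final class ClassIdSketch {
    private final Map<String, Integer> ids = new HashMap<>();

    // Returns the same integer for the same normalized string,
    // allocating new IDs in first-seen order (illustrative scheme).
    int idFor(String normalized) {
        Integer existing = ids.get(normalized);
        if (existing != null) {
            return existing;
        }
        int next = ids.size() + 1;
        ids.put(normalized, next);
        return next;
    }

    public static void main(String[] args) {
        ClassIdSketch classes = new ClassIdSketch();
        System.out.println(classes.idFor("burn skin"));
        System.out.println(classes.idFor("aigue amibienne dysenterie"));
        System.out.println(classes.idFor("burn skin"));
    }
}
```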

  • Difference:
    1. The Java version uses new Canonize data. This causes differences in results, which fall into three cases:

      • A single citation form is found for spelling variants
        The new canonical form handles spelling variants by using the citation form in place of the base form. This feature is an enhancement to LuiNorm that resolves the base spelling variant issue. For example, "coloring" and "colouring" have the uninflected forms "color" and "colour", respectively. "color" and "colour" are base spelling variants, and both are normalized to "color" by choosing the citation form.

      • No citation form is found (words not in Lexicon)
        In other words, no uninflected form can be found from facts (the lexicon). In this case, lexicon morphology rules are used to find all uninflected forms of the input. The form that comes first in alphabetical order in the output list (and is not known to the lexicon) is chosen as the base form. This base form is then used to find the canonical form. The canonical form may differ from the C version's result, since the new canonicalization algorithm includes more words in an equivalence class (refer to the design document).

        For example, "buckls" is not known to the lexicon. Lexicon morphology rules are then used to find the uninflected terms "buckl" and "buckls". "buckl" is chosen as the base form since it comes first in alphabetical order. The canonical form found for "buckls" is "buckl".

      • Multiple citation forms are found
        A word can have multiple citation forms in different categories. In this case, the form that comes first in alphabetical order in the output list is chosen as the canonical form for the base form.

        For example, the citation forms of "block" are "bloc" and "block" when it is a noun and a verb, respectively. "bloc" is chosen as the canonical form for the base form since it comes first in alphabetical order.
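
    The tie-breaking rule used in the last two cases (choose the candidate that comes first in alphabetical order) can be sketched in Java as below. LeastFormSketch is a hypothetical helper, and the use of natural String ordering (String.compareTo) is an assumption; the source does not state which comparison the tool applies.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;

final class LeastFormSketch {
    // Picks the candidate that sorts first under natural String order.
    // Throws NoSuchElementException if the collection is empty.
    static String least(Collection<String> candidates) {
        return Collections.min(candidates);
    }

    public static void main(String[] args) {
        // Candidates taken from the two examples in the text.
        System.out.println(least(Arrays.asList("buckls", "buckl")));
        System.out.println(least(Arrays.asList("block", "bloc")));
    }
}
```

    Under this ordering a string sorts before any longer string it is a prefix of, which is why "buckl" precedes "buckls" and "bloc" precedes "block".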

  • Features:
    1. Unicode core norm to
      • Map Unicode symbols and punctuation to ASCII
      • Map Unicode to ASCII
      • split ligatures
      • strip diacritics
    2. remove genitives
    3. remove parenthetic plural forms
    4. replace punctuation with spaces
    5. remove stop words
    6. lowercase
    7. uninflect each word in a term
    8. canonicalize words
    9. strip or map non-ASCII Unicode characters
    10. sort words in ASCII order
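
    Two of the Unicode features listed above, stripping diacritics and splitting ligatures, can be approximated with the standard Java library. The sketch below is an illustration under assumptions: UnicodeToAsciiSketch is a hypothetical class, diacritics are removed by NFD decomposition followed by dropping combining marks, and the ligature table is a small hand-picked sample rather than the mapping data used by lvg.

```java
import java.text.Normalizer;

final class UnicodeToAsciiSketch {
    // Decompose characters (NFD), then drop the combining marks.
    static String stripDiacritics(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Small illustrative ligature table (assumption, not the lvg table).
    static String splitLigatures(String s) {
        return s.replace("\u00E6", "ae")   // ae ligature, lowercase
                .replace("\u00C6", "AE")   // AE ligature, uppercase
                .replace("\u0153", "oe")   // oe ligature, lowercase
                .replace("\u0152", "OE")   // OE ligature, uppercase
                .replace("\uFB01", "fi")   // fi ligature
                .replace("\uFB02", "fl");  // fl ligature
    }

    public static void main(String[] args) {
        System.out.println(stripDiacritics("aigu\u00EB"));
        System.out.println(splitLigatures("\u00E6ther"));
    }
}
```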

  • Symbol: N3

  • Examples:
    shell> lvg -f:N3
    lvg© 2008
    lvg© 2008|2008 lvg|2047|1|q7+g+rs+o+t+l+B+C+q8+w|1|
    Dysenterie amibienne (aiguë)
    Dysenterie amibienne (aiguë)|aigue amibienne dysenterie|2047|1|q7+g+rs+o+t+l+B+C+q8+w|1|
    Burn(s);skin
    Burn(s);skin|burn skin|2047|1|q7+g+rs+o+t+l+B+C+q8+w|1|
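
    The output records in the examples above are pipe-delimited. A minimal Java sketch for splitting such a record is shown below. LvgRecordSketch is a hypothetical helper, not part of the lvg API, and the field names (input, output, category, inflection, flow history, flow number) are assumptions based on field position.

```java
import java.util.Arrays;
import java.util.List;

final class LvgRecordSketch {
    // Field names are assumed from position in the record.
    final String input;
    final String output;
    final String category;
    final String inflection;
    final String flowHistory;
    final String flowNumber;

    private LvgRecordSketch(String[] f) {
        input = f[0];
        output = f[1];
        category = f[2];
        inflection = f[3];
        flowHistory = f[4];
        flowNumber = f[5];
    }

    static LvgRecordSketch parse(String line) {
        // Limit -1 keeps the trailing empty field after the final '|'.
        String[] fields = line.split("\\|", -1);
        if (fields.length < 6) {
            throw new IllegalArgumentException("expected at least 6 fields: " + line);
        }
        return new LvgRecordSketch(fields);
    }

    // The flow history lists the applied components joined by '+'.
    List<String> flowComponents() {
        return Arrays.asList(flowHistory.split("\\+"));
    }

    public static void main(String[] args) {
        LvgRecordSketch r =
                parse("Burn(s);skin|burn skin|2047|1|q7+g+rs+o+t+l+B+C+q8+w|1|");
        System.out.println(r.input + " -> " + r.output + " via " + r.flowComponents());
    }
}
```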

  • Implementation Logic:
    1. use flow component q7 to normalize Unicode
    2. use flow component g to remove genitives
    3. use flow component rs to remove parenthetic plural forms
    4. use flow component o to replace punctuation with spaces
    5. use flow component t to strip stop words
    6. use flow component l to lowercase all characters
    7. use flow component B to uninflect words
    8. use flow component C to canonicalize words. This step transforms only the first LexItem from the output of the previous step, since all words have the same canonical form if they are known to the lexicon.
    9. use flow component q8 to strip or map non-ASCII Unicode characters
    10. use flow component w to sort words by ASCII order
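
    The chaining of steps above can be sketched in Java as an ordered list of string transformations. This is a simplified illustration, not the ToLuiNormalize source: FlowPipelineSketch is a hypothetical class, the regular expressions and the four-word stop list are toy assumptions, and the steps that need Unicode mapping tables or lexicon data (q7, B, C, q8) are identity placeholders that only mark their position in the sequence.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

final class FlowPipelineSketch {
    // Tiny illustrative stop list (assumption, not the lvg stop-word list).
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("of", "and", "the", "with"));

    // Insertion order of the map is the execution order of the steps.
    static final Map<String, UnaryOperator<String>> STAGES = new LinkedHashMap<>();
    static {
        STAGES.put("q7", UnaryOperator.identity()); // placeholder: Unicode core norm
        STAGES.put("g", s -> s.replaceAll("'s\\b", "")
                              .replaceAll("s'(?=\\s|$)", "s")); // toy genitive removal
        STAGES.put("rs", s -> s.replaceAll("\\((s|es|ies)\\)", "")); // toy plural removal
        STAGES.put("o", s -> s.replaceAll("\\p{Punct}", " "));
        STAGES.put("t", s -> Arrays.stream(s.trim().split("\\s+"))
                .filter(word -> !STOP_WORDS.contains(word.toLowerCase(Locale.ROOT)))
                .collect(Collectors.joining(" ")));
        STAGES.put("l", s -> s.toLowerCase(Locale.ROOT));
        STAGES.put("B", UnaryOperator.identity());  // placeholder: uninflect (needs lexicon)
        STAGES.put("C", UnaryOperator.identity());  // placeholder: canonicalize (needs data)
        STAGES.put("q8", UnaryOperator.identity()); // placeholder: strip or map non-ASCII
        STAGES.put("w", s -> Arrays.stream(s.trim().split("\\s+"))
                .sorted()
                .collect(Collectors.joining(" ")));
    }

    // Apply every stage in order, feeding each output into the next stage.
    static String normalize(String term) {
        String current = term;
        for (UnaryOperator<String> stage : STAGES.values()) {
            current = stage.apply(current);
        }
        return current;
    }

    // Join the stage names with '+', in the style of the flow history field.
    static String history() {
        return String.join("+", STAGES.keySet());
    }

    public static void main(String[] args) {
        System.out.println(normalize("Burn(s);skin") + " | " + history());
    }
}
```

    Keeping the stages in an ordered map makes the sequence explicit and lets the history string be derived from the same structure that drives execution.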

  • Source Code: ToLuiNormalize.java

  • Hierarchy: Object -> Transformation -> ToLuiNormalize