Lexical Tools

LuiAssignment Analysis

I. Summary

This test/analysis is based on feedback reports from the OCCS UMLS group (Soma Lanka). The OCCS UMLS group uses the new release of the Lexical Tools (luiNorm) to assign new LUIs to UMLS strings for the new release of the UMLS. Every string is assigned to a LUI based on its luiNorm form. The OCCS UMLS group runs a program that compares the LUI assignments of the new release with those of the previous release and produces three files:

  • merge.rpt: strings that merged into a LUI
  • split.rpt: strings that split into different LUIs
  • splite_merge.rpt: strings that were split and partly (or entirely) merged into another LUI

All three files share the same format. Each line has four fields:

Old LUI | New LUI | SUI | String
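
A minimal sketch of reading one line of these reports is shown below. It assumes the four fields are separated by a pipe character, which the source does not state; the record type, function names, and sample identifiers are all hypothetical and used only for illustration.

```python
from typing import Iterable, List, NamedTuple


class LuiChange(NamedTuple):
    """One report line: Old LUI, New LUI, SUI, String."""
    old_lui: str
    new_lui: str
    sui: str
    string: str


def parse_report_line(line: str, sep: str = "|") -> LuiChange:
    # ASSUMPTION: fields are separated by `sep` (pipe by default).
    # maxsplit=3 keeps any separator characters inside the String field.
    fields = line.rstrip("\n").split(sep, 3)
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    return LuiChange(*fields)


def parse_report(lines: Iterable[str], sep: str = "|") -> List[LuiChange]:
    # Blank lines are skipped; every other line must have four fields.
    return [parse_report_line(ln, sep) for ln in lines if ln.strip()]
```

If the real files use a different delimiter, only the `sep` argument needs to change.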

Based on these three files, the Lexical Systems Group analyzes the causes of the changes, fixes bugs, and enhances features of the luiNorm flow.

II. Analysis

A change of LUI (luiNorm form) can be caused by a change in the software algorithm or in the Lexicon data. We would like to know as much detail as possible to make sure luiNorm behaves the way we expect. The analysis is straightforward. The following steps are used to identify which flow component causes the change:

  1. Get luiNorm from previous Lexical Tools:
    Get the luiNorm forms of the UMLS strings from the previous release of the Lexical Tools. This process needs to be done separately, in a different Java file, because it uses the lvg jar file of the previous release.

  2. Tag line with different luiNorm forms:
    Compare the results from step 1 with the luiNorm forms from the new release of the Lexical Tools. Mark each line as shown in the following table:
    Tag | Condition
    S   | Same luiNorm forms
    C   | Change in luiNorm forms
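
The tagging in step 2 can be sketched as follows. This is only an illustration: it assumes the luiNorm forms of both releases have already been collected into dictionaries keyed by the UMLS string, and the function name and sample values are hypothetical rather than actual luiNorm output.

```python
from typing import Dict, List, Tuple


def tag_changes(prev_norms: Dict[str, str],
                new_norms: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (tag, string) pairs: 'S' for same luiNorm form, 'C' for changed."""
    tagged = []
    for string, prev_form in prev_norms.items():
        # A string missing from the new results is treated as changed.
        tag = "S" if new_norms.get(string) == prev_form else "C"
        tagged.append((tag, string))
    return tagged


# Made-up normalized forms, for illustration only.
prev = {"Heart Attacks": "attack heart", "Tumor(S)": "s tumor"}
new = {"Heart Attacks": "attack heart", "Tumor(S)": "tumor"}
changed = [s for tag, s in tag_changes(prev, new) if tag == "C"]
```

Only the lines collected in `changed` need to go through the component-level analysis in the later steps.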

  3. Get results of all flow components of luiNorm from previous Lexical tools:
    For all lines tagged with C (results changed), get the results of all flow components from the previous luiNorm. These changes are the source of the new merges, splits, and split_merges. The flow components differ between the 2007 and 2008 versions. We would like to find out what causes the change, so we use the 2008 version as the base for comparison. The following 10 flow components are used in this analysis:
    Flow component | Tag | Cause                                 | Testing Flows | Prev Flows
    -f:q7          | q7  | Unicode Core Norm                     | -f:T:q7       | -f:T:q:q2
    -f:g           | g   | Remove genitive                       | -f:g          | -f:g
    -f:rs          | rs  | Remove parenthetical plural forms     | -f:rs         | -f:rs
    -f:o           | o   | Remove punctuation                    | -f:o          | -f:o
    -f:t           | t   | Remove stopWords                      | -f:t          | -f:t
    -f:l           | l   | Lowercase                             | -f:l          | -f:l
    -f:B           | B   | Retrieve the uninflected form         | -f:B          | -f:B
    -f:C           | C   | Retrieve the Canonical form           | -f:C          | -f:C
    -f:q8          | q8  | Strip or Map Unicode to ASCII         | -f:q8         | -f:g4
    -f:w           | w   | Sort words by order                   | -f:w          | -f:w

  4. Tag line on different flow component result:
    For all lines tagged with C, get the results of the above 10 flow components from the new luiNorm, and then compare them with the results from the previous year to identify the causes of the different luiNorm forms.

    As mentioned above, a change can be caused by a change in the software algorithm or in the Lexicon data. Both are discussed below:

    • Software Algorithm Change
      Software changes include bug fixes and feature enhancements, and happen only occasionally. This type of change should be monitored closely to ensure that luiNorm behaves as expected. For example, the -f:rs flow component was enhanced in the 2007 release to remove upper case parenthetical plural forms, such as (S), (ES), and (IES). Thus, lines tagged with 'C' (change in luiNorm) whose strings contain upper case parenthetical plural forms should be tagged with 'rs'. On the other hand, the -f:l flow component has not changed since the previous release; accordingly, no line should be tagged with 'l'.

      In 2008, luiNorm was enhanced to use the 10 flow components listed in step 3.
      The comparison is made on the input and output of the same flow component in each release.

    • Lexicon Data Change
      The SPECIALIST Lexicon is updated continuously. A snapshot of the Lexicon data (the frozen Lexicon) is used for each annual release of the Lexical Tools. Newly added lexical records (with new EUIs), deleted lexical records, and modified records in the Lexicon result in different base forms (uninflected forms), spelling variants, EUIs, inflections, etc. in the output of the Lexical Tools. These changes show up in two flow components of luiNorm, -f:B and -f:C. They are considered enhancements of the Lexical Tools: this is routine, normal behavior that recurs with every new release.
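
The component-level tagging of step 4 can be sketched as below. The tags and flows are taken from the table in step 3; everything else is an assumption made for illustration: the outputs of each flow component are presumed to be already collected per tag for both releases, and the function name is hypothetical.

```python
from typing import Dict, List

# (tag, testing flow, previous flow), as listed in the step 3 table.
FLOW_COMPONENTS = [
    ("q7", "-f:T:q7", "-f:T:q:q2"),
    ("g", "-f:g", "-f:g"),
    ("rs", "-f:rs", "-f:rs"),
    ("o", "-f:o", "-f:o"),
    ("t", "-f:t", "-f:t"),
    ("l", "-f:l", "-f:l"),
    ("B", "-f:B", "-f:B"),
    ("C", "-f:C", "-f:C"),
    ("q8", "-f:q8", "-f:g4"),
    ("w", "-f:w", "-f:w"),
]


def changed_components(prev_out: Dict[str, str],
                       new_out: Dict[str, str]) -> List[str]:
    """Return the tags of flow components whose output differs between releases.

    prev_out / new_out map a component tag to that component's output for
    the same input string (ASSUMPTION: collected beforehand by running the
    previous and new Lexical Tools).
    """
    return [tag for tag, _testing, _prev in FLOW_COMPONENTS
            if prev_out.get(tag) != new_out.get(tag)]
```

Under this sketch, a line whose only differing component is 'B' or 'C' points to a Lexicon data change, while a tag such as 'l' appearing at all would signal unexpected software behavior.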

  5. Get detail causes of different Canonical results from previous Lexical Tools:
    Canonization is a complicated computation. The canonical form is determined by the canonize algorithm and the Lexicon data from the base forms, inflectional forms, spelling variants, etc. of all words in the UMLS and the Lexicon. To identify the cause of a different canonical form, we first get the EUIs and spelling variants from the previous Lexical Tools:
    • EUI: for newly added records
    • Spelling variants: for new spelling variants
    • Others: new words in the UMLS or new rules in the Lexicon

  6. Tag detail causes of different Canonical results
    For all lines tagged with 'C|C' (Canonical|Change), we get the EUIs and spelling variants from the new Lexical Tools, and then compare them with the results from the previous step to identify the cause, as shown in the table below:
    Flow Component | Tag | Detail Cause
    -f:E           | CE  | New EUI, lexical records
    -f:s -CR:o     | Cs  | New spelling variants
    -f:C           | C   | New words in UMLS/Lexicon or new rules in Lexicon
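
The detail tagging of step 6 might look like the sketch below. It assumes the EUIs and spelling variants of both releases are available as sets, and it checks for new EUIs before new spelling variants; that precedence and the function name are assumptions, not something the source specifies. The sample EUIs and variants are hypothetical.

```python
from typing import Set


def canonical_cause(prev_euis: Set[str], new_euis: Set[str],
                    prev_variants: Set[str], new_variants: Set[str]) -> str:
    """Return the detail tag for a line whose canonical form changed."""
    if new_euis - prev_euis:
        return "CE"  # new EUI, i.e. new lexical records
    if new_variants - prev_variants:
        return "Cs"  # new spelling variants
    # Neither explains the change: new words in UMLS/Lexicon or new rules.
    return "C"


# Hypothetical values, for illustration only.
tag = canonical_cause({"E0000001"}, {"E0000001", "E0000002"}, set(), set())
```

Returning a single tag keeps one cause per line; a variant that reports every matching cause would return a list instead.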

III. Procedures: Run the Analysis

  • shell> cd ${TEST}/LVG/ComponentTest/luiNorm/luiAssignment
  • shell> cp -r ${PREV_YEAR} ${YEAR}
  • shell> cd ${YEAR}

  • Setup:
    • update "project.year" in build.xml
    • update "lvg.jar" in build.xml
    • copy the latest lvg${YEAR}dist.jar to ./lib
    • keep the last lvg${PREV_YEAR}dist.jar
    • update software in ./source/*.java if needed
    • build the software
  • Input:
    • ./orgData/merge.rpt
    • ./orgData/split.rpt
    • ./orgData/splite_merge.rpt
  • Program:
    Select the options to run the analysis
    • ./bin/Merge ${YEAR}
      8
    • ./bin/Split ${YEAR}
      8
    • ./bin/SpliteMerge ${YEAR}
      8
    • ./bin/GetReports ${YEAR}
  • Outputs:
    • ./data/Merge/*
    • ./data/Split/*
    • ./data/SplitMerge/*
    • ./data/summary.rpt

    Use summary.rpt to analyze the data and prepare the report.

IV. Reports