LuiAssignment Analysis
I. Summary
This analysis is based on the feedback reports from the OCCS UMLS group (Soma Lanka). For each new release of UMLS, the OCCS UMLS group uses the newly released Lexical Tools (luiNorm) to assign LUIs to UMLS strings. Every string is assigned to a LUI based on its luiNorm form. The group then runs a program that compares the LUI assignments of the new release with those of the previous release and produces three files, as described below:
- merge.rpt: for strings that merge into a LUI
- splite.rpt: for strings that split into different LUIs
- splite_merge.rpt: for strings that were split, with part (or all) of them merged into another LUI
All three files share the same format, which has 4 fields.
Using these three files, the Lexical Systems Group analyzes the causes of the changes, fixes bugs, and enhances the luiNorm flow.
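The comparison between two releases can be sketched as follows. This is only an illustration, not the OCCS program: the `{string: LUI}` dictionaries, the function names, and the exact rules used to decide what counts as a merge, a split, or a split_merge are all assumptions made for this example.

```python
from collections import defaultdict


def group_by_lui(assignment):
    """Invert a {string: LUI} mapping into {LUI: set of strings}."""
    groups = defaultdict(set)
    for string, lui in assignment.items():
        groups[lui].add(string)
    return groups


def classify_changes(prev, new):
    """Classify strings as merge, split, or split_merge (assumed rules).

    prev and new are hypothetical {string: LUI} mappings for the previous
    and the new release; only strings present in both are compared.
    """
    common = prev.keys() & new.keys()
    prev_groups = group_by_lui({s: prev[s] for s in common})
    new_groups = group_by_lui({s: new[s] for s in common})

    # A previous LUI is split when its strings now carry more than one LUI.
    split_luis = {lui for lui, strings in prev_groups.items()
                  if len({new[s] for s in strings}) > 1}
    # A new LUI is a merge target when its strings came from several previous LUIs.
    merged_luis = {lui for lui, strings in new_groups.items()
                   if len({prev[s] for s in strings}) > 1}

    report = {"merge": set(), "split": set(), "split_merge": set()}
    for s in common:
        was_split = prev[s] in split_luis
        was_merged = new[s] in merged_luis
        if was_split and was_merged:
            report["split_merge"].add(s)
        elif was_split:
            report["split"].add(s)
        elif was_merged:
            report["merge"].add(s)
    return report


# Illustrative data only: the strings and LUI labels are placeholders.
prev = {"a": "L1", "b": "L1", "c": "L2", "d": "L3", "e": "L3", "f": "L4"}
new = {"a": "L1", "b": "L5", "c": "L6", "d": "L6", "e": "L3", "f": "L4"}
report = classify_changes(prev, new)
print(sorted(report["merge"]))        # ['c']
print(sorted(report["split"]))        # ['a', 'b', 'e']
print(sorted(report["split_merge"]))  # ['d']
```

Strings whose LUI group is unchanged, such as `f` in the example, do not appear in any of the three sets.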
II. Analysis
A change of LUI (luiNorm form) can be caused by a change in the software algorithm or in the Lexicon data. We would like to know as much detail as possible to make sure luiNorm behaves the way we expect. The analysis is straightforward. The following steps are used to identify which flow component causes the change:
- Get luiNorm from previous Lexical Tools:
Get the luiNorm forms of UMLS strings from the previous release of Lexical Tools. This process needs to be run separately, in a different Java file, because it uses the lvg jar file from the previous release.
- Tag line with different luiNorm forms:
Compare the results from step 1 with the luiNorm forms from the new release of Lexical Tools. Mark each line as shown in the following table:
Tag | Condition
---|---
S | Same luiNorm forms
C | Change in luiNorm forms
- Get results of all flow components of luiNorm from previous Lexical tools:
For every line tagged with C (results changed), get the results of all flow components from the previous luiNorm. These changes are the source of the new merges, splits, and split_merges. The flow components differ between the 2007 and 2008 versions. Because we want to find out what causes the change, we use the 2008 version as the base for comparison. The following 10 flow components are used in this analysis:
Flow component | Tag | Cause | Testing Flows | Prev Flows
---|---|---|---|---
-f:q7 | q7 | Unicode Core Norm | -f:T:q7 | -f:T:q:q2
-f:g | g | Remove genitive | -f:g | -f:g
-f:rs | rs | Remove parenthetical plural forms | -f:rs | -f:rs
-f:o | o | Remove punctuation | -f:o | -f:o
-f:t | t | Remove stopWords | -f:t | -f:t
-f:l | l | Lowercase | -f:l | -f:l
-f:B | B | Retrieve the uninflected form | -f:B | -f:B
-f:C | C | Retrieve the Canonical form | -f:C | -f:C
-f:q8 | q8 | Strip or Map Unicode to ASCII | -f:q8 | -f:g4
-f:w | w | Sort words by order | -f:w | -f:w
- Tag line on different flow component result:
For every line tagged with C, get the results of the above 10 flow components from the new luiNorm, then compare them with the results from the previous year to identify the causes of the different luiNorm forms.
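The two tagging steps above can be sketched as follows. This is a minimal illustration, not the analysis program itself: it assumes the per-component results have already been collected from both releases into dictionaries keyed by the component tag, and the function names and sample values are hypothetical.

```python
# Component tags taken from the flow component table, in table order.
COMPONENTS = ["q7", "g", "rs", "o", "t", "l", "B", "C", "q8", "w"]


def tag_line(prev_norm, new_norm):
    """Return 'S' when the luiNorm forms are the same, 'C' when they changed."""
    return "S" if prev_norm == new_norm else "C"


def changed_components(prev_results, new_results):
    """List the component tags whose result differs between the two releases.

    Both arguments are assumed to be {component tag: result} dictionaries;
    a component missing from one side counts as a difference.
    """
    return [c for c in COMPONENTS
            if prev_results.get(c) != new_results.get(c)]


# Placeholder values, not real Lexical Tools output.
prev_results = {c: "unchanged result" for c in COMPONENTS}
new_results = dict(prev_results, C="different canonical form")

print(tag_line("same form", "same form"))             # S
print(tag_line("old form", "new form"))               # C
print(changed_components(prev_results, new_results))  # ['C']
```

Only lines tagged C need the per-component comparison, so `tag_line` acts as a cheap filter before `changed_components` is called.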
As mentioned above, a change might be caused by a change in the software algorithm or in the Lexicon data. The detailed causes are identified as follows:
- Get detail causes of different Canonical results from previous Lexical Tools:
Canonization is a complicated computation. A canonical form is determined by the base forms, inflectional forms, spelling variants, etc. of all words in UMLS and the Lexicon, using the canonize algorithm and the Lexicon data. To identify the cause of a different canonical form, we first get the EUIs and spelling variants from the previous Lexical Tools:
- EUI: for newly added records
- Spelling variants: for new spelling variants
- Others: new word in UMLS or new rules in Lexicon
- Tag detail causes of different Canonical results:
For every line tagged with 'C|C' (Canonical|Change), we get the EUIs and spelling variants from the new Lexical Tools, then compare them with the results from the step above to identify the cause, as shown in the table below:
Flow Component | Tag | Detail Cause
---|---|---
-f:E | CE | New EUI, lexical records
-f:s -CR:o | Cs | New Spelling variants
-f:C | C | New words in UMLS/Lexicon or new rules in Lexicon
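The detail tagging could look roughly like the sketch below. It assumes the EUIs and spelling variants for a line have already been retrieved from both releases; the function name, the choice to report CE and Cs together when both apply, and the sample values are assumptions for illustration, not the behavior of the actual program.

```python
def detail_cause(prev_euis, new_euis, prev_variants, new_variants):
    """Return the detail-cause tags for a line tagged 'C|C'.

    CE: the new release returns EUIs that the previous one did not.
    Cs: the new release returns spelling variants that the previous one did not.
    C:  neither of the above, so the cause is attributed to new words
        in UMLS/Lexicon or new rules in the Lexicon.
    """
    tags = []
    if set(new_euis) - set(prev_euis):
        tags.append("CE")
    if set(new_variants) - set(prev_variants):
        tags.append("Cs")
    return tags or ["C"]


# Placeholder identifiers and variants, for illustration only.
print(detail_cause(["E0000001"], ["E0000001", "E0000002"], [], []))  # ['CE']
print(detail_cause(["E0000001"], ["E0000001"], ["color"], ["color", "colour"]))  # ['Cs']
print(detail_cause(["E0000001"], ["E0000001"], ["color"], ["color"]))  # ['C']
```

Falling back to C only when no new EUI or spelling variant is found mirrors the "Others" case in the list above.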
III. Procedures: Run the Analysis
- shell> cd ${TEST}/LVG/ComponentTest/luiNorm/luiAssignment
- shell> cp -r ${PREV_YEAR} ${YEAR}
- shell> cd ${YEAR}
- Setup:
- update "project.year" in build.xml
- update "lvg.jar" in build.xml
- copy the latest lvg${YEAR}dist.jar to ./lib
- keep the last lvg${PREV_YEAR}dist.jar
- update software in ./source/*.java if needed
- build the software
- Input:
- ./orgData/merge.rpt
- ./orgData/split.rpt
- ./orgData/splite_merge.rpt
- Program
Select the options to run the analysis
- ./bin/Merge ${YEAR}
8
- ./bin/Split ${YEAR}
8
- ./bin/SpliteMerge ${YEAR}
8
- ./bin/GetReports ${YEAR}
- Outputs:
- ./data/Merge/*
- ./data/Split/*
- ./data/SplitMerge/*
- ./data/summary.rpt
Use summary.rpt for analyzing data and report.
IV. Reports