WSD Test2 (MSH WSD Set)
This test uses the MSH WSD Set data. Once the new database and data files are ready, run WSD Test2 as follows:
- Set up input data:
- Run WSD Test:
- shell> cd ${TC_TEST}/WsdTest/bin
- shell> 2.TestWsd
- Get the Statistic Report:
- shell> cd ${TC_TEST}/WsdTest/bin
- shell> 3.TestWsdStats
- Run WSD and Get the Statistic Report:
- shell> cd ${TC_TEST}/WsdTest/bin
- shell> 4.TestAll
This step can replace the two steps above (Run WSD Test and Get the Statistic Report).
- Get all Statistic Reports:
- shell> cd ${TC_TEST}/WsdTest/bin
- shell> 5.GetAllStats
Generates all stats.rpt.* files.
- Data files:
- ${TC_TEST}/WsdTest2/data/Output/${YEAR}/${CASE}/${SCORE}/${WORD}.out
- ${TC_TEST}/WsdTest2/data/Output/${YEAR}/${CASE}/${SCORE}/Stats.rpt.*
- Please refer to MSH WSD Test results for test details
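After a run, the report files can be collected by following the output path pattern listed under "Data files". The snippet below is a minimal sketch, not part of the TC package: the function name `list_stats_reports` and the placeholder values for YEAR, CASE, and SCORE are hypothetical, and it assumes the reports use the `Stats.rpt.*` spelling shown in the data file path.

```python
import os
from pathlib import Path


def list_stats_reports(tc_test, year, case, score):
    """Return the Stats.rpt.* files for one YEAR/CASE/SCORE combination, sorted by name.

    Hypothetical helper: the directory layout follows the documented pattern
    ${TC_TEST}/WsdTest2/data/Output/${YEAR}/${CASE}/${SCORE}/Stats.rpt.*
    """
    out_dir = Path(tc_test) / "WsdTest2" / "data" / "Output" / year / case / score
    return sorted(out_dir.glob("Stats.rpt.*"))


if __name__ == "__main__":
    # TC_TEST comes from the environment; the other three values are placeholders.
    tc_test = os.environ.get("TC_TEST", ".")
    for report in list_stats_reports(tc_test, "2011", "SomeCase", "SomeScore"):
        print(report)
```

Replace the placeholder arguments with the YEAR, CASE, and SCORE values actually used in the test run.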
- MSH WSD Set Background
- Details (Antonio Jimeno-Yepes): MSH WSD Set
- Algorithm for generating the MSH WSD Set:
- Find ambiguous words in MRCONSO that are English, come from MSH, and map to multiple CUIs
- Go through all MEDLINE citations (2009AB, ~ 28 million citations) that contain an ambiguous word
- Get the assigned MH (MeSH heading)
- Map each MH to a CUI (MRCONSO, 1-1 relationship, not ambiguous)
- Filter out citations whose assigned MHs map to multiple CUIs (ambiguous)
- Summary:
- 203 ambiguous entities
- 106 ambiguous abbreviations
- 88 ambiguous terms
- 9 combinations of ambiguous abbreviations and terms
- Total 37,888 instances (203 entities x 2~3 senses x 100 max. instances)
- Summary for count:
- Document count for each sense of every ambiguous word in the MSH WSD Set
- Used for weighted precision
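The algorithm for generating the MSH WSD Set described above can be illustrated with a small sketch. This is not the original implementation: the function names, the tuple and dictionary layouts, the whitespace-based word matching, and the toy MeSH-to-CUI mapping are all assumptions made for illustration. The cap of 100 instances per sense follows the summary above.

```python
from collections import defaultdict


def find_ambiguous_words(mrconso_rows):
    """Step 1: strings that are English, come from MSH, and map to more than one CUI.

    Each row is assumed to be a (cui, language, source, string) tuple.
    """
    cuis_by_word = defaultdict(set)
    for cui, lang, source, string in mrconso_rows:
        if lang == "ENG" and source == "MSH":
            cuis_by_word[string.lower()].add(cui)
    return {w: cuis for w, cuis in cuis_by_word.items() if len(cuis) > 1}


def label_citations(word, candidate_cuis, citations, mh_to_cui, max_per_sense=100):
    """Steps 2-5: keep citations whose MeSH headings select exactly one candidate CUI."""
    instances = defaultdict(list)
    for citation in citations:
        # Step 2: the citation must contain the ambiguous word (naive token match).
        if word not in citation["text"].lower().split():
            continue
        # Steps 3-4: map each assigned MeSH heading to its CUI (1-1 mapping).
        cuis = {mh_to_cui[mh] for mh in citation["mesh_headings"] if mh in mh_to_cui}
        senses = cuis & candidate_cuis
        # Step 5: skip citations with no candidate sense or with several (ambiguous).
        if len(senses) != 1:
            continue
        sense = next(iter(senses))
        if len(instances[sense]) < max_per_sense:
            instances[sense].append(citation["pmid"])
    return dict(instances)


if __name__ == "__main__":
    # Toy data for illustration only; not taken from MRCONSO or MEDLINE.
    rows = [
        ("C0024117", "ENG", "MSH", "COLD"),
        ("C0009443", "ENG", "MSH", "Cold"),
    ]
    mh_to_cui = {
        "Pulmonary Disease, Chronic Obstructive": "C0024117",
        "Common Cold": "C0009443",
    }
    citations = [
        {"pmid": 1, "text": "Patients with cold symptoms", "mesh_headings": ["Common Cold"]},
        {"pmid": 2, "text": "Cold and smoking",
         "mesh_headings": ["Pulmonary Disease, Chronic Obstructive"]},
        {"pmid": 3, "text": "Cold overlap",
         "mesh_headings": ["Common Cold", "Pulmonary Disease, Chronic Obstructive"]},
    ]
    for word, cuis in find_ambiguous_words(rows).items():
        print(word, label_citations(word, cuis, citations, mh_to_cui))
```

In this toy run, citation 3 is dropped because its headings map to both candidate CUIs, which mirrors the final filtering step.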
- Discussion & Future Work:
- Length of Ambiguous Words:
StWsd forces the ambiguous word to be a legal word. However, the TC package ignores all words with length less than 3 (<= 3). In other words, the TC tables have no entry for any word with length less than 3. So a StWsd instance may have no legal words when the ambiguous word has length less than 3, in which case no answer is found.
- Legal ST:
The MSH WSD Set uses the 2009AB Metathesaurus. However, not every mapped ST exists in every version of StWsd. For example, for 2011/Wasp/M2, C0043041|T009|Invertebrate does not exist in TC.2011, so the ST is not legal. In such cases, all answers are found to be M1, which results in 50% precision.
- Multiple CUIs mapped to same ST:
In this data set, some CUIs map to the same ST (TUI) for a word instance. For example, Cold/M2 (C0024117) and Cold/M3 (C0009443) are both mapped to T047|Disease or Syndrome. In such cases, StWsd cannot disambiguate because the STs are the same.
- In this test set, the same (or a similar) number of instances is created for each sense of an ambiguous word. However, StWsd (JDI/STI) is based on a word/document count statistics tool, so it cannot achieve high precision if the frequencies of the senses differ significantly in the source (Metathesaurus).
=> Use weighted precision (sum of Avg. precision x doc count percentage)
- How does performance compare across all ambiguous words, ambiguous terms, ambiguous abbreviations, and combinations of ambiguous terms and abbreviations?
- Further result data analysis and research on the sources are suggested to complete this study.
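The weighted precision suggested above (sum of Avg. precision x doc count percentage) can be sketched as follows. This is one possible reading of that formula, not the TC package's implementation: the function name `weighted_precision`, the dictionary inputs, and the example numbers are assumptions. Each sense's precision is weighted by that sense's share of the total document count for the word.

```python
def weighted_precision(precision_by_sense, doc_count_by_sense):
    """Sum over senses of precision x (document count / total document count).

    Both arguments are dicts keyed by sense label (for example "M1", "M2").
    """
    total = sum(doc_count_by_sense[s] for s in precision_by_sense)
    if total == 0:
        raise ValueError("total document count must be positive")
    return sum(
        precision_by_sense[s] * doc_count_by_sense[s] / total
        for s in precision_by_sense
    )


if __name__ == "__main__":
    # Hypothetical numbers: every answer is M1, so M1 is always right and M2 never is.
    precision = {"M1": 1.0, "M2": 0.0}
    doc_counts = {"M1": 900, "M2": 100}
    unweighted = sum(precision.values()) / len(precision)
    print("unweighted:", unweighted)
    print("weighted:", weighted_precision(precision, doc_counts))
```

With these hypothetical counts, the unweighted average is 0.5 while the weighted value reflects that M1 dominates the source, which is the effect the weighting is meant to capture.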