Text Categorization

WSD Test2 (MSH WSD Set)

This test uses the MSH WSD Set data. Once the new database and data files are ready, we can run WSD Test2 as follows:

  • Setup input data:
    • Files:
      • release -> ../Org/release/ (MSH WSD Set from Antonio Yepes)
      • MRSTY.2009AB

    • shell> cd /nfsvol/crfiler-lex/Lu/Test/TC/WsdTest2/bin/
    • shell> runProg ModifyAmbiguousWords
    • shell> 1.SetupInputs

  • Run WSD Test:
    • shell> cd ${TC_TEST}/WsdTest/bin
    • shell> 2.TestWsd

  • Get the Statistics Report:
    • shell> cd ${TC_TEST}/WsdTest/bin
    • shell> 3.TestWsdStats

  • Run WSD and Get the Statistics Report:
    • shell> cd ${TC_TEST}/WsdTest/bin
    • shell> 4.TestAll
      This step can replace the two steps above.

  • Get all Statistics Reports:
    • shell> cd ${TC_TEST}/WsdTest/bin
    • shell> 5.GetAllStats
      Generates all stats.rpt.* files

  • Data files:
    • ${TC_TEST}/WsdTest2/data/Output/${YEAR}/${CASE}/${SCORE}/${WORD}.out
    • ${TC_TEST}/WsdTest2/data/Output/${YEAR}/${CASE}/${SCORE}/Stats.rpt.*
    • Please refer to MSH WSD Test results for test details

  • MSH WSD Set Background
    • Details (Antonio Jimeno-Yepes): MSH WSD Set
    • Algorithm for generating the MSH WSD Set:
      1. Find ambiguous words in MRCONSO: English, MSH source, and multiple CUIs
      2. Go through all MEDLINE citations (2009AB, ~28 million citations) that contain the ambiguous words
      3. Get the assigned MHs (MeSH headings)
      4. Map each MH to a CUI (MRCONSO, 1-1 relationship, not ambiguous)
      5. Filter out citations whose assigned MHs contain multiple CUIs (ambiguous)
    • Summary:
      • 203 ambiguous entities
        • 106 ambiguous abbreviations
        • 88 ambiguous terms
        • 9 combinations of ambiguous abbreviations and terms
      • Total 37,888 instances (203 entities x 2~3 senses x 100 max. instances)
    • Summary for count:
      • Document count for each sense of every ambiguous word in the MSH WSD Set
      • Used for weighted precision
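
The labeling idea behind steps 3 to 5 of the algorithm above can be sketched as follows. This is only an illustration under stated assumptions, not the code used to build the set: the function name `label_citation`, the `mh_to_cui` dictionary (standing in for the MRCONSO lookup), the heading names `MH_A`/`MH_B`/`MH_C`, and the CUI `C0000000` are hypothetical placeholders. Restricting the check to the CUIs of the ambiguous word is also an assumption about how step 5 is applied.

```python
from typing import Dict, Iterable, Optional, Set


def label_citation(
    assigned_mhs: Iterable[str],
    mh_to_cui: Dict[str, str],
    candidate_cuis: Set[str],
) -> Optional[str]:
    """Return the single candidate CUI indicated by the assigned MHs, else None."""
    # Step 4: map each assigned MH to its CUI (assumed to be a 1-1 lookup).
    cuis = {mh_to_cui[mh] for mh in assigned_mhs if mh in mh_to_cui}
    # Assumption: only CUIs that are senses of the ambiguous word matter here.
    matched = cuis & candidate_cuis
    # Step 5: keep the citation only if exactly one sense is indicated.
    if len(matched) == 1:
        return next(iter(matched))
    return None


# Hypothetical lookup table; MH names and C0000000 are placeholders.
mh_to_cui = {"MH_A": "C0024117", "MH_B": "C0009443", "MH_C": "C0000000"}
senses = {"C0024117", "C0009443"}

print(label_citation(["MH_A", "MH_C"], mh_to_cui, senses))  # one sense matched
print(label_citation(["MH_A", "MH_B"], mh_to_cui, senses))  # two senses: filtered out
```

A citation that matches no sense, or more than one, returns `None` and would be dropped from the set in this sketch.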

  • Discussion & Future Work:
    • Length of Ambiguous Words:
      StWsd forces ambiguous words to be legal words. However, the TC package ignores all words with length less than 3 (< 3). In other words, there is no entry in the TC tables for any word with length less than 3. So, it is possible that no legal word exists for a StWsd instance when the ambiguous word is shorter than 3 characters. In such cases, no answer is found.
    • Legal ST:
      The MSH WSD Set uses the 2009AB Metathesaurus. However, not every mapped ST exists in every version of StWsd. For example, for 2011/Wasp/M2, C0043041|T009|Invertebrate does not exist in TC.2011, so the ST is not legal. In such cases, all answers are found to be M1, which results in 50% precision.
    • Multiple CUIs mapped to same ST:
      In this data set, some CUIs are mapped to the same ST (TUI) for a word instance. For example, Cold/M2 (C0024117) and Cold/M3 (C0009443) are both mapped to T047|Disease or Syndrome. In such cases, StWsd cannot find the answer (disambiguate) because the STs are the same.
    • In this test set, the same (or a similar) number of instances is created for each sense of an ambiguous word. However, StWsd (JDI/STI) is a tool based on word/document count statistics, so it cannot achieve high precision if the frequencies of the senses differ significantly in the source (Metathesaurus).
      => Use weighted precision (sum of Avg. precision x doc count percentage)
    • How does performance compare across all ambiguous words, ambiguous terms, ambiguous abbreviations, and combinations of ambiguous terms and abbreviations?
    • Further analysis of the result data and research on the sources are suggested to complete this study.
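
The weighted precision suggested above (sum of Avg. precision x doc count percentage) could be computed along these lines. This is a minimal sketch, assuming the sum runs over the senses of one ambiguous word and that the percentage is each sense's share of the total document count. The function name `weighted_precision` and all numbers are made up for illustration and are not test results.

```python
from typing import Dict


def weighted_precision(precision: Dict[str, float], doc_count: Dict[str, int]) -> float:
    """Sum of per-sense precision times that sense's share of the documents."""
    total = sum(doc_count[sense] for sense in precision)
    if total == 0:
        raise ValueError("document counts sum to zero")
    return sum(precision[s] * doc_count[s] / total for s in precision)


# Made-up values: sense M1 is three times as frequent as M2 in the source.
precision = {"M1": 0.9, "M2": 0.5}
doc_count = {"M1": 300, "M2": 100}

print(weighted_precision(precision, doc_count))
```

With these made-up numbers the weighted value is about 0.8, against an unweighted mean of 0.7, because the more frequent sense carries more of the weight.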