Text Categorization

WSD Test Suite Design

WsdTest (for the NLM WSD Collection) and WsdTest2 (for the MSH WSD Set) share a similar design. This page uses WsdTest2 as the example to illustrate the structure of the WSD test suite.

  • Test Suite Software:
    • The test suite consists of the following components:

      Component   Description                        Values
      Year        TC release                         2007|2008|2009|2010|2011
      Case        StWsd case type                    Sentence|TiAb|Sentences
      Score       Score type                         Cs|Dc|Rdc|Wc|Rwc
      TestCase    A collection of WSD test instances for a specified word-score-case-year
      Instance    A single case (PMID) for the WSD test

    • Simplified UML diagram of the Java classes (figure)
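
As a rough illustration of how the components in the table relate to each other, the following Java sketch models them as plain types. All class and member names here (WsdModelSketch, CaseType, ScoreType, TestCase, Instance) are hypothetical and are not taken from the actual WsdTest2 source; the sample word and PMID are placeholders.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative only: names are hypothetical, not the real WsdTest2 API.
final class WsdModelSketch {
    /** StWsd case type: Sentence | TiAb | Sentences. */
    enum CaseType { SENTENCE, TIAB, SENTENCES }

    /** Score type: Cs | Dc | Rdc | Wc | Rwc. */
    enum ScoreType { CS, DC, RDC, WC, RWC }

    /** A single WSD test case, identified by its PMID. */
    static final class Instance {
        final String pmid;

        Instance(String pmid) {
            this.pmid = pmid;
        }
    }

    /** The test instances for one word-score-case-year combination. */
    static final class TestCase {
        final String word;
        final ScoreType score;
        final CaseType caseType;
        final int year;
        private final List<Instance> instances = new ArrayList<>();

        TestCase(String word, ScoreType score, CaseType caseType, int year) {
            // The component table lists TC releases 2007 through 2011.
            if (year < 2007 || year > 2011) {
                throw new IllegalArgumentException("Unknown TC release: " + year);
            }
            this.word = word;
            this.score = score;
            this.caseType = caseType;
            this.year = year;
        }

        void add(Instance instance) {
            instances.add(instance);
        }

        List<Instance> instances() {
            return Collections.unmodifiableList(instances);
        }
    }

    public static void main(String[] args) {
        // "cold" and the PMID below are placeholder values.
        TestCase testCase = new TestCase("cold", ScoreType.CS, CaseType.SENTENCE, 2011);
        testCase.add(new Instance("00000000"));
        System.out.println(testCase.word + ": " + testCase.instances().size() + " instance(s)");
    }
}
```

Keeping the year, case type, and score type as fields of the test case mirrors the word-score-case-year key in the table; the real classes may organize this differently.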

  • Input Data:
    ${WSD_TEST}/WsdTest2/data/Input/
    • allAmbiguousWords.txt: list of all ambiguous words
      (spaces within a word are replaced with "_")
    • MRSTY: maps each CUI to its semantic type (ST)
    • release: original data from the MSH WSD Set
    • TestSet: modified data set for StWsd
      => ${AMBIGUOUS_WORD}/ (one directory per ambiguous word)
      • answers: gold-standard WSD meaning for each PMID
        PMID | Meaning (sense ID: M1, M2, etc.)
      • choices: all possible senses, ST candidates
        Sense ID | CUI | TUI | ST Name
      • testCase.Sentence: ambiguous sentence (the sentence that contains the ambiguous word to be disambiguated)
        PMID | Ambiguous sentence
      • testCase.TiAb: MEDLINE title and abstract. These can be used to retrieve the ambiguous sentences (all sentences that contain the ambiguous word or one of its inflections)
        PMID | Title & abstract (TiAb)
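
The naming rule and directory layout above can be sketched in Java as follows. This is a minimal sketch under assumptions: WsdInputSketch and its methods are hypothetical names, and the layout Input/TestSet/<word>/<file> is inferred from the outline above rather than confirmed against the actual data set.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of WsdTest2.
final class WsdInputSketch {
    /** Applies the naming rule from allAmbiguousWords.txt: " " becomes "_". */
    static String toFileName(String word) {
        return word.replace(' ', '_');
    }

    /**
     * Resolves a per-word input file such as "answers", "choices",
     * "testCase.Sentence", or "testCase.TiAb".
     * Assumed layout: <inputDir>/TestSet/<word with underscores>/<fileName>.
     */
    static Path testSetFile(Path inputDir, String word, String fileName) {
        return inputDir.resolve("TestSet").resolve(toFileName(word)).resolve(fileName);
    }

    public static void main(String[] args) {
        // The input directory is passed in; "two words" is a placeholder term.
        Path inputDir = Paths.get("WsdTest2", "data", "Input");
        System.out.println(testSetFile(inputDir, "two words", "answers"));
    }
}
```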

  • Output Data:
    ${WSD_TEST}/WsdTest2/data/Output/
    • ${YEAR}/: results for each year (version) of StWsd
      • ${YEAR}:
        2007 | 2008 | 2009 | 2010 | 2011
    • ${YEAR}/${TEST_CASE}/:
      • ${TEST_CASE}:
        Ambiguous Sentence | Title & Abstract (TiAb, citation) | Ambiguous sentences
    • ${YEAR}/${TEST_CASE}/${SCORE_TYPE}/:
      • ${SCORE_TYPE}:
        Cs (Combined Score) | Dc (Document counts) | Rdc (Real-time Dc) | Rwc (Real-time Wc) | Wc (Word counts)
      • ${AMBIGUOUS_WORDS}.out: detailed WSD results for each ambiguous word
      • Stats.rpt.abbr: statistics report for all ambiguous abbreviations
      • Stats.rpt.all: statistics report for all ambiguous words
      • Stats.rpt.both: statistics report for all ambiguous abbreviations and terms
      • Stats.rpt.term: statistics report for all ambiguous terms
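
To make the output layout concrete, the Java sketch below enumerates every ${YEAR}/${TEST_CASE}/${SCORE_TYPE}/ directory and resolves a per-word result file inside one of them. WsdOutputSketch and its methods are hypothetical names. It also assumes that the directory names match the component values listed earlier (for example Sentence, TiAb, Cs) and that the .out file name uses the underscore form of the word; neither assumption is confirmed by this page.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of WsdTest2.
final class WsdOutputSketch {
    static final int[] YEARS = {2007, 2008, 2009, 2010, 2011};
    // Assumption: directory names equal the component values.
    static final String[] TEST_CASES = {"Sentence", "TiAb", "Sentences"};
    static final String[] SCORE_TYPES = {"Cs", "Dc", "Rdc", "Rwc", "Wc"};

    /** Lists every ${YEAR}/${TEST_CASE}/${SCORE_TYPE}/ directory under outputDir. */
    static List<Path> resultDirs(Path outputDir) {
        List<Path> dirs = new ArrayList<>();
        for (int year : YEARS) {
            for (String testCase : TEST_CASES) {
                for (String scoreType : SCORE_TYPES) {
                    dirs.add(outputDir.resolve(String.valueOf(year))
                                      .resolve(testCase)
                                      .resolve(scoreType));
                }
            }
        }
        return dirs;
    }

    /** Assumption: the result file is named <word with underscores>.out. */
    static Path wordResult(Path resultDir, String word) {
        return resultDir.resolve(word.replace(' ', '_') + ".out");
    }

    public static void main(String[] args) {
        List<Path> dirs = resultDirs(Paths.get("WsdTest2", "data", "Output"));
        System.out.println(dirs.size() + " result directories");
        System.out.println(wordResult(dirs.get(0), "two words"));
    }
}
```

With five years, three test cases, and five score types, the enumeration yields 75 result directories, each of which would hold its own set of .out files and Stats.rpt.* reports.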