Text Categorization

St-Documents Enhancement Approach and Results

We used the latest TC version (2008) as the baseline and applied the algorithm discussed in the previous section to find the best St-Documents. The detailed approach and results are described below:

I. Test Suite Setup
Generating and refining St-Documents, integrating them into the TC package, and running the WSD test is a complicated and tedious process. A test suite for the WSD test was developed to ease the process. The test suite is summarized as follows:

  • Test data set
    We wanted to test the WSD tool on the largest data set available. Thus, we tested WSD on both the Training set (67 instances) and the Test set (33 instances) of NLM's WSD Test collection (100 instances) to evaluate the overall performance (precision, variation, etc.).
  • Ambiguous Words list
    There are 50 ambiguous words in the NLM WSD test collection. Five of them ("association", "cold", "man", "sex", "weight") are excluded from the test because of multiple concepts mapping to the same ST and the lack of a valid gold standard answer in the test set.
  • Score types on STI and STRI:
    Both score types from STI and STRI, plus an expert system score, are tested. That is:
    • WC: STI words count score
    • DC: STI documents count score
    • RWC: STRI words count score
    • RDC: STRI documents count score
    • ES: Expert System score (WSD enhancement)
  • Input contexts:
    Three types of input contexts are used:
    • Target sentence: just the sentence in which the ambiguous word appears
    • Entire citation: includes the title and abstract of the article
    • Ambiguous Sentences: all sentences in the title and abstract of the article that contain the ambiguous word or its variants (WSD enhancement)
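
The three input contexts can be illustrated with a minimal Python sketch. It is not the WSD tool's implementation: the function names, the naive sentence splitter, and the example citation are all assumptions made for illustration.

```python
import re

def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def build_contexts(title, abstract, target_sentence, variants):
    """Return the three input contexts for one WSD instance."""
    citation = f"{title} {abstract}".strip()
    # Match the ambiguous word or any of its variants as whole words.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(v) for v in variants) + r")\b",
        re.IGNORECASE,
    )
    ambiguous = [s for s in split_sentences(citation) if pattern.search(s)]
    return {
        "target_sentence": target_sentence,
        "entire_citation": citation,
        "ambiguous_sentences": " ".join(ambiguous),
    }

# Hypothetical citation, not taken from the WSD Test collection.
ctx = build_contexts(
    title="Blood culture results.",
    abstract=("Samples were collected daily. "
              "Cultures were incubated for 48 hours. No growth was seen."),
    target_sentence="Cultures were incubated for 48 hours.",
    variants=["culture", "cultures"],
)
print(ctx["ambiguous_sentences"])
```

The "Ambiguous Sentences" context sits between the other two: it is wider than the single target sentence but drops citation sentences that never mention the ambiguous word.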

II. Approach

  • Weighted Frequency
    First, we ran the WSD test with DC after adding occurrence information to the St-Documents. In other words, a word may appear several times in an St-Document if it is associated with the ST several times. The precision of the WSD test on the new St-Documents improves from 73.67% to 76.01%, as shown in the Baseline and frequency rows of the following table. This gain of 2.34 percentage points is a substantial improvement and confirms our assumption about the importance of weighted frequency.

                          Target-Sentence   Entire Citation   Avg.
    St-Document\Score     DC                DC                DC
    Baseline              73.81%            73.52%            73.67%
    frequency             76.29%            75.73%            76.01%
    frequency-1StGroup    76.85%            76.27%            76.56%
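
The idea of weighted frequency can be sketched in Python as follows. This is only an illustration under stated assumptions: the `(word, st)` pair input format, the function name, and the sample associations are hypothetical, and the sketch assumes the baseline St-Documents list each word once per ST.

```python
from collections import Counter, defaultdict

def build_st_documents(associations, keep_frequency=True):
    """Build one bag of words per Semantic Type (ST).

    associations: iterable of (word, st) pairs, one pair per observed
    association, so the same pair may repeat (hypothetical input format).
    """
    docs = defaultdict(Counter)
    for word, st in associations:
        if keep_frequency:
            docs[st][word] += 1   # weighted: keep every occurrence
        else:
            docs[st][word] = 1    # unweighted: each word listed once
    return docs

# Illustrative associations only; "lbpr" and "idcn" are ST abbreviations.
associations = [
    ("culture", "lbpr"), ("culture", "lbpr"),
    ("incubate", "lbpr"), ("culture", "idcn"),
]
weighted = build_st_documents(associations)
flat = build_st_documents(associations, keep_frequency=False)
print(weighted["lbpr"])
print(flat["lbpr"])
```

With frequency kept, a word that is associated with an ST more often contributes more weight to that St-Document, which is the effect the frequency row of the table measures.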

  • Prioritizing ST Group
    As discussed before, the words in an St-Document should be those that best represent the associated ST. An (ambiguous) word can have multiple CUIs, which associate it with multiple Semantic Types in multiple St-Groups. We tried St-Documents (with frequency) containing only words that belong to a single St-Group. The average precision of the WSD test improves from 76.01% to 76.56%, as shown in the frequency and frequency-1StGroup rows of the above table. This result confirms that words whose associated STs all belong to one St-Group are the core words of St-Documents and should have higher priority when forming an St-Document.
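
A minimal Python sketch of separating single-group words from multi-group words is shown here. The function name and the two lookup tables are assumptions for illustration; a real run would derive them from the UMLS data rather than from hand-written dictionaries.

```python
def split_by_group_count(word_to_sts, st_to_group):
    """Split words into those whose STs fall in one St-Group and the rest."""
    single, multiple = {}, {}
    for word, sts in word_to_sts.items():
        groups = {st_to_group[st] for st in sts}
        if len(groups) == 1:
            single[word] = groups      # core word: higher priority
        else:
            multiple[word] = groups    # spans several groups: lower priority
    return single, multiple

# Illustrative mappings only (hypothetical subset).
st_to_group = {"lbpr": "PROC", "diap": "PROC", "idcn": "CONC"}
word_to_sts = {
    "assay": {"lbpr", "diap"},     # two STs, one St-Group
    "culture": {"lbpr", "idcn"},   # two STs, two St-Groups
}
single, multiple = split_by_group_count(word_to_sts, st_to_group)
print(sorted(single), sorted(multiple))
```

Note that a word may map to several STs and still count as a single-group word, as long as all of those STs belong to the same St-Group.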

  • STRI-Filter:
    • Refine St-Documents by basic criteria
      The STRI filter can be used to refine St-Documents by filtering out words that are not significantly associated with the ST (low STRI score or rank). First, we tried keeping only the top 5 and top 10 ranked words (DC) in the St-Documents with frequency and 1 St-Group. Precision dropped in both WSD tests, as shown in the top 5 and top 10 rows of the table below. This implies that these criteria are too tight and many good words have been filtered out. Second, we tried keeping words whose STRI score is within 1 standard deviation of the top-ranked score (DC). The average precision of the WSD test improves from 76.56% to 76.89%, as shown in the StdDev row of the table below. This means that this criterion filters out bad words from the St-Documents.

                                    Target-Sentence   Entire Citation   Avg.
      St-Document\Score             DC                DC                DC
      frequency-1StGroup, top 5     74.30%            74.87%            74.59%
      frequency-1StGroup, top 10    75.95%            75.33%            75.64%
      frequency-1StGroup, StdDev    77.54%            76.24%            76.89%
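
The two basic criteria can be sketched in Python as follows. This is an interpretation, not the tool's code: it assumes that "within 1 standard deviation from the top rank score" means `score >= top_score - stddev`, that the standard deviation is the population standard deviation over all word scores for one ST, and the function names and scores are hypothetical.

```python
from statistics import pstdev

def top_n(scores, n):
    """Keep the n highest-scoring words."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n])

def within_one_std(scores):
    """Keep words scoring within one standard deviation of the top score."""
    if not scores:
        return set()
    cutoff = max(scores.values()) - pstdev(list(scores.values()))
    return {word for word, score in scores.items() if score >= cutoff}

# Hypothetical DC scores for the words of one St-Document.
scores = {"a": 10.0, "b": 9.0, "c": 5.0, "d": 1.0}
print(top_n(scores, 3))
print(within_one_std(scores))
```

A fixed top-N cut ignores how the scores are distributed, while the standard deviation cut adapts to each ST: it keeps more words when many scores cluster near the top and fewer when the top word stands alone.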

    • Further refine St-Documents by combined criteria
      Based on the observations above, we also tried combining two STRI filter criteria:
      • STRI score is within 1 Standard deviation from the top rank score (DC)
        and
      • top rank (DC): 5, 10, 15, 20, 25

      The results are shown in the following table. The StdDev & Top 15 row has the best average precision on the WSD test, improving from 76.89% (frequency-1StGroup, StdDev) to 77.59%.

                                              Target-Sentence   Entire Citation   Avg.
      St-Document\Score                       DC                DC                DC
      frequency, 1StGroup: StdDev & Top 5     76.26%            76.16%            76.21%
      frequency, 1StGroup: StdDev & Top 10    77.95%            76.99%            76.47%
      frequency, 1StGroup: StdDev & Top 15    78.07%            77.10%            77.59%
      frequency, 1StGroup: StdDev & Top 20    77.65%            76.68%            77.17%
      frequency, 1StGroup: StdDev & Top 25    77.61%            76.31%            76.96%
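
A combined criterion keeps a word only if it passes both tests. The Python sketch below shows one way to express that, under the same assumptions as before: the cutoff is the top score minus the population standard deviation, and the function name and scores are hypothetical.

```python
from statistics import pstdev

def stddev_and_top_n(scores, n):
    """Keep words that are in the top n AND within one std dev of the top score."""
    if not scores:
        return []
    ranked = sorted(scores, key=scores.get, reverse=True)
    cutoff = scores[ranked[0]] - pstdev(list(scores.values()))
    return [word for word in ranked[:n] if scores[word] >= cutoff]

# Hypothetical DC scores for the words of one St-Document.
scores = {"a": 10.0, "b": 9.0, "c": 5.0, "d": 1.0}
for n in (1, 3):
    print(n, stddev_and_top_n(scores, n))
```

Sweeping `n` over 5, 10, 15, 20, and 25 and rerunning the WSD test for each setting corresponds to the rows of the table above.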

    • Final refinement of St-Documents with words belonging to multiple St-Groups
      Some good words belong to multiple St-Groups and should be added to the St-Documents. We applied a similar approach and ran the STRI filter on these words before adding them to the St-Documents obtained above. As discussed before, words that belong to multiple St-Groups should have lower priority, so the filter criteria should be tighter. A top-rank filter (top 1 to top 5) was used for this test. The results show that the St-Documents built with frequency, 1StGroup: StdDev & Top 15, and mStGroups: Top 3 have the best average precision on the WSD test (78.40%), as shown in the mStGroups: Top 3 row of the following table.

                                                                Target-Sentence   Entire Citation   Avg.
      St-Document\Score                                         DC                DC                DC
      frequency, 1StGroup: StdDev & Top 15; mStGroups: Top 1    78.60%            78.06%            78.33%
      frequency, 1StGroup: StdDev & Top 15; mStGroups: Top 2    78.60%            78.17%            78.39%
      frequency, 1StGroup: StdDev & Top 15; mStGroups: Top 3    78.71%            78.08%            78.40%
      frequency, 1StGroup: StdDev & Top 15; mStGroups: Top 4    78.37%            77.46%            77.92%
      frequency, 1StGroup: StdDev & Top 15; mStGroups: Top 5    77.49%            75.90%            76.70%
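
Adding the best multi-group words to an already filtered St-Document can be sketched in Python as below. The function name, the word lists, and the scores are hypothetical; the sketch only assumes that multi-group words are ranked by their DC score and that the top `k` are appended.

```python
def merge_st_document(single_group_words, multi_group_scores, k=3):
    """Append the k best multi-group words to the single-group word list."""
    ranked = sorted(multi_group_scores, key=multi_group_scores.get, reverse=True)
    # Skip any word that is already present in the St-Document.
    extra = [word for word in ranked[:k] if word not in single_group_words]
    return list(single_group_words) + extra

# Hypothetical inputs: words kept by the single-group filter, and DC
# scores of candidate words that belong to several St-Groups.
kept = ["assay", "incubate"]
candidates = {"culture": 8.0, "growth": 6.0, "cell": 5.0, "test": 1.0}
print(merge_st_document(kept, candidates, k=3))
```

Using a small `k` reflects the lower priority of multi-group words: only the strongest of them are allowed into the St-Document.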

III. Results - Best St-Documents
In conclusion, by applying weighted frequency, prioritizing St-Groups, and using the STRI filter, we obtained optimum St-Documents that improve the average precision on the WSD test from 73.67% (baseline) to 78.40%. The final optimum St-Documents are obtained by the following rules:

  • Add frequency information to St-Documents
  • Words associated with only 1 St-Group: DC score within 1 standard deviation of the top score and rank in the top 15
  • Words associated with multiple St-Groups: top 3 rank

The next section discusses the design and improvement of the WSD tool to ease its use and reach even higher precision on the WSD test.