Text Categorization

St-Documents Enhancement Approach and Results

We used the latest TC version (2008) as the baseline and applied the algorithm discussed in the previous section to find the best St-Documents. The detailed approach and results are described below:

I. Test Suite Setup
Generating and refining St-Documents, integrating them into the TC package, and running the WSD test is a complicated and tedious process. A test suite for the WSD test was developed to ease the process. The test suite is summarized as follows:

Test data set
We wanted to test the WSD tool on the largest data set available. Thus, we tested WSD on both the Training set (67 instances) and the Test set (33 instances) of NLM's WSD Test collection (100 instances) to evaluate the overall performance (precision, variation, etc.).
Ambiguous Words list
There are 50 ambiguous words in the NLM WSD test collection. Five of them ("association", "cold", "man", "sex", "weight") are excluded from the test because they have multiple concepts mapped to the same ST and no valid gold standard answer in the test set.
Score types on STI and STRI:
Both score types from STI and STRI, plus an expert system score, are tested. That is:
- WC: STI words count score
- DC: STI documents count score
- RWC: STRI words count score
- RDC: STRI documents count score
- ES: Expert System score (WSD enhancement)
Input contexts:
Three types of input contexts are used:
- Target sentence: just the sentence in which the ambiguous word appears
- Entire citation: includes the title and abstract of the article
- Ambiguous Sentences: all sentences from the title and abstract of the article that contain the ambiguous word or its variants (WSD enhancement)
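
The test dimensions listed above can be sketched as a small configuration. This is only an illustration: the names `SCORE_TYPES`, `INPUT_CONTEXTS`, `EXCLUDED_WORDS`, and `enumerate_runs` are hypothetical, the sample word list is a placeholder, and the actual test suite may not run every combination.

```python
from itertools import product

# Labels taken from the lists above; the variable names are assumptions.
SCORE_TYPES = ["WC", "DC", "RWC", "RDC", "ES"]
INPUT_CONTEXTS = ["target sentence", "entire citation", "ambiguous sentences"]
EXCLUDED_WORDS = {"association", "cold", "man", "sex", "weight"}


def enumerate_runs(ambiguous_words):
    """Return every (word, score type, input context) combination to test."""
    words = [w for w in ambiguous_words if w not in EXCLUDED_WORDS]
    return list(product(words, SCORE_TYPES, INPUT_CONTEXTS))


# Placeholder sample; "cold" is dropped because it is on the excluded list.
runs = enumerate_runs(["cold", "growth", "nutrition"])
```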

II. Approach

Weighted Frequency
First, we ran the WSD test with DC after adding occurrence information to the St-Documents. In other words, a word may appear several times in an St-Document if it is associated with the ST several times. The precision of the WSD test on these new St-Documents improves from 73.67% to 76.01%, as shown in the 3rd and 4th rows of the following table. This increase of 2.34 percentage points is a substantial improvement and confirms our assumption about the importance of weighted frequency.

	Target-Sentence	Entire Citation	Avg.
St-Document\Score	DC	DC	DC
Baseline	73.81%	73.52%	73.67%
frequency	76.29%	75.73%	76.01%
frequency-1StGroup	76.85%	76.27%	76.56%
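
As an illustration of the weighted frequency idea, the following minimal sketch builds St-Documents with and without occurrence counts. The function name `build_st_documents`, the `(word, ST)` pair input, and the sample data are assumptions for illustration and are not taken from the TC package.

```python
from collections import Counter


def build_st_documents(associations, keep_frequency=True):
    """Group words by semantic type (ST).

    associations: iterable of (word, st) pairs, one pair per association.
    Returns a dict mapping each ST to a Counter of word occurrences.
    """
    docs = {}
    for word, st in associations:
        counts = docs.setdefault(st, Counter())
        if keep_frequency:
            counts[word] += 1   # word is repeated once per association
        else:
            counts[word] = 1    # baseline-style: each word listed once
    return docs


# Hypothetical sample associations.
pairs = [("fever", "sosy"), ("fever", "sosy"), ("cough", "sosy"), ("aspirin", "phsu")]
weighted = build_st_documents(pairs)
flat = build_st_documents(pairs, keep_frequency=False)
```

In the weighted version, "fever" counts twice toward its ST, so it contributes more than "cough" when a score is computed over the St-Document.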
Prioritizing ST Group
As discussed before, the words in an St-Document should be the words that best represent the associated ST. An (ambiguous) word can have multiple CUIs, which are associated with multiple semantic types belonging to multiple St-Groups. We tried the St-Documents (with frequency) restricted to words that belong to only one St-Group. The average precision of the WSD test improves from 76.01% to 76.56%, as shown in the 4th and 5th rows of the table above. This result confirms that words whose associated STs all belong to a single St-Group are the core words of St-Documents and should have higher priority when forming an St-Document.
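
The single-group restriction can be sketched as a partition of words by the number of St-Groups their STs fall into. The function name, the input mappings, and the sample data below are illustrative assumptions, not the tool's actual data structures.

```python
def split_by_group_count(word_to_sts, st_to_group):
    """Partition words into those tied to one St-Group and those tied to several.

    word_to_sts: dict mapping a word to the set of STs it is associated with.
    st_to_group: dict mapping an ST to its St-Group.
    """
    single, multiple = {}, {}
    for word, sts in word_to_sts.items():
        groups = {st_to_group[st] for st in sts}
        target = single if len(groups) == 1 else multiple
        target[word] = groups
    return single, multiple


# Hypothetical sample data.
word_to_sts = {
    "fever": {"sosy"},
    "cold": {"dsyn", "npop"},
    "aspirin": {"phsu", "orch"},
}
st_to_group = {"sosy": "DISO", "dsyn": "DISO", "npop": "PHEN", "phsu": "CHEM", "orch": "CHEM"}

single, multiple = split_by_group_count(word_to_sts, st_to_group)
```

Note that "aspirin" has two STs here but still counts as a single-group word, because both STs map to the same St-Group.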

STRI-Filter:

Refine St-Documents by basic criteria
The STRI filter can be used to refine St-Documents by filtering out words that are not significantly associated with the ST (low STRI score or rank). First, we tried keeping only the top 5 and top 10 words by rank (DC) in the St-Documents with frequency and one St-Group. The precision of both WSD tests dropped, as shown in the 3rd and 4th rows of the table below. This implies that these criteria are too tight and that many good words are filtered out. Second, we tried keeping words whose STRI score is within 1 standard deviation of the top-ranked score (DC). The average precision of the WSD test improves from 76.56% to 76.89%, as shown in the 5th row of the table below. This suggests that this criterion filters out bad words from the St-Documents.

	Target-Sentence	Entire Citation	Avg.
St-Document\Score	DC	DC	DC
frequency-1StGroup, top 5	74.30%	74.87%	74.59%
frequency-1StGroup, top 10	75.95%	75.33%	75.64%
frequency-1StGroup, StdDev	77.54%	76.24%	76.89%
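
The two basic criteria can be sketched as follows. This is a minimal illustration under stated assumptions: scores are given as a word-to-score mapping for one ST, the standard deviation is the population standard deviation over all of that ST's word scores, and the function names are hypothetical. The source does not specify how the standard deviation is computed.

```python
from statistics import pstdev


def top_n(scores, n):
    """Keep the n highest-scoring words (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:n])


def within_one_stddev(scores):
    """Keep words whose score is within one standard deviation of the top score."""
    if not scores:
        return {}
    values = list(scores.values())
    cutoff = max(values) - pstdev(values)  # assumed definition of the cutoff
    return {w: s for w, s in scores.items() if s >= cutoff}


# Hypothetical scores for one ST.
scores = {"fever": 10.0, "cough": 9.0, "pain": 4.0, "the": 1.0}
kept_by_rank = top_n(scores, 3)
kept_by_stddev = within_one_stddev(scores)
```

A fixed top-N cut keeps the same number of words regardless of how the scores are distributed, while the standard deviation cut adapts to each ST's score spread.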

Further refined St-Documents by combined criteria
Based on the observations above, we also tried combining the following STRI filter criteria:

STRI score is within 1 Standard deviation from the top rank score (DC)
and
top rank (DC): 5, 10, 15, 20, 25

The results are shown in the following table. The 5th row (frequency, 1StGroup: StdDev & Top 15) has the best average precision on the WSD test, which improves from 76.89% (frequency-1StGroup, StdDev) to 77.59%.

	Target-Sentence	Entire Citation	Avg.
St-Document\Score	DC	DC	DC
frequency, 1StGroup: StdDev & Top 5	76.26%	76.16%	76.21%
frequency, 1StGroup: StdDev & Top 10	77.95%	76.99%	76.47%
frequency, 1StGroup: StdDev & Top 15	78.07%	77.10%	77.59%
frequency, 1StGroup: StdDev & Top 20	77.65%	76.68%	77.17%
frequency, 1StGroup: StdDev & Top 25	77.61%	76.31%	76.96%
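
The combined criterion (a word must pass both the standard deviation cut and the top rank cut) can be sketched as below. As before, the function name and the use of the population standard deviation over one ST's word scores are assumptions made for illustration.

```python
from statistics import pstdev


def stddev_and_top_n(scores, n):
    """Keep words that are both within one standard deviation of the top
    score and among the n highest-ranked words."""
    if not scores:
        return {}
    values = list(scores.values())
    cutoff = max(values) - pstdev(values)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return {w: s for w, s in ranked[:n] if s >= cutoff}


# Hypothetical scores for one ST.
scores = {"a": 10.0, "b": 9.5, "c": 9.0, "d": 2.0, "e": 1.0}
tight = stddev_and_top_n(scores, 2)    # rank cut is the binding constraint
loose = stddev_and_top_n(scores, 15)   # standard deviation cut is binding
```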

Final refined St-Documents with words belonging to multiple St-Groups
There are good words that belong to multiple St-Groups and should be added to the St-Documents. We applied a similar approach and ran the STRI filter on these words before adding them to the St-Documents from above. As discussed before, words that belong to multiple St-Groups should have lower priority. Accordingly, the filter criteria should be tighter. A top rank filter (1-5) was used for this test. The results show that the St-Documents with frequency, 1StGroup: StdDev & Top 15, and multiple StGroups: Top 3 have the best average precision on the WSD test (78.40%), as shown in the 5th row of the following table.

	Target-Sentence	Entire Citation	Avg.
St-Document\Score	DC	DC	DC
frequency, 1StGroup: StdDev & Top 15; mStGroups: Top 1	78.60%	78.06%	78.33%
frequency, 1StGroup: StdDev & Top 15; mStGroups: Top 2	78.60%	78.17%	78.39%
frequency, 1StGroup: StdDev & Top 15; mStGroups: Top 3	78.71%	78.08%	78.40%
frequency, 1StGroup: StdDev & Top 15; mStGroups: Top 4	78.37%	77.46%	77.92%
frequency, 1StGroup: StdDev & Top 15; mStGroups: Top 5	77.49%	75.90%	76.70%

III. Results - Best St-Documents
In conclusion, by applying weighted frequency, St-Group prioritization, and the STRI filter, we obtained optimum St-Documents and improved the average precision on the WSD test from 73.67% (baseline) to 78.40% (optimum St-Documents). The final optimum St-Documents are obtained by the following rules:

Add frequency information to St-Documents
Words associated with only 1 St-Group: DC score within 1 standard deviation of the top score and within the top 15 ranks
Words associated with multiple St-Groups: top 3 ranks
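
The word selection rules above can be put together in a short sketch for a single ST. It assumes that frequency information is already reflected in the DC scores, that scores are passed as word-to-score mappings, and that the standard deviation is the population standard deviation over the single-group scores. The names `rank` and `select_words` are hypothetical.

```python
from statistics import pstdev


def rank(scores):
    """Return (word, score) pairs, highest score first (ties alphabetical)."""
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def select_words(single_scores, multi_scores, single_top=15, multi_top=3):
    """Select the words of one St-Document from its candidate scores.

    single_scores: DC scores of words associated with only one St-Group.
    multi_scores:  DC scores of words associated with multiple St-Groups.
    """
    selected = []
    if single_scores:
        values = list(single_scores.values())
        cutoff = max(values) - pstdev(values)
        selected += [w for w, s in rank(single_scores)[:single_top] if s >= cutoff]
    # Multi-group words get the tighter, rank-only filter.
    selected += [w for w, _ in rank(multi_scores)[:multi_top] if w not in selected]
    return selected


# Hypothetical scores for one ST.
single = {"fever": 10.0, "cough": 9.0, "the": 1.0}
multi = {"cold": 8.0, "pain": 7.0, "mass": 6.0, "lead": 5.0}
words = select_words(single, multi)
```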

The next section discusses the design and improvement of the WSD tool to ease its use and reach even higher precision on the WSD test.
