Background
I. Two major factors of STI/STRI results:
Based on the study "Word sense disambiguation by selecting the best semantic type based on Journal Descriptor Indexing: preliminary experiment" by Humphrey SM, Rogers WJ, Kilicoglu H, Demner-Fushman D, Rindflesch TC, a WSD tool can be developed using Semantic Type Indexing (STI) or Semantic Type Real-Time Indexing (STRI). A candidate semantic type with a higher rank/score in the STI/STRI results indicates the better sense for the ambiguous word. The results of both STI and STRI are calculated as follows:
semantic type | list of words that belong to (represent) this ST |
semantic type | JD scores |
word | St scores |
The results of STI and STRI are based on the Word-Sts table and the St-Jds table, respectively. Accordingly, these results are based on two root factors: JDI and St-Documents.
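The selection rule described above, in which the highest-scoring candidate semantic type is taken as the sense of the ambiguous word, can be sketched as follows. This is only a minimal illustration and not the TC package's API: the function name `pick_sense`, the treatment of missing scores as 0.0, and the score values are all assumptions made for the example. The semantic type abbreviations (`dsyn`, `npop`, `phsu`) are standard UMLS ones.

```python
def pick_sense(candidate_sts, st_scores):
    """Return the candidate semantic type with the highest score.

    candidate_sts: semantic types of the senses of the ambiguous word.
    st_scores: mapping from semantic type to its STI/STRI score.
    Candidates missing from st_scores are treated as scoring 0.0
    (an assumption made for this sketch).
    """
    if not candidate_sts:
        raise ValueError("at least one candidate semantic type is required")
    return max(candidate_sts, key=lambda st: st_scores.get(st, 0.0))


# Hypothetical scores for a context such as "the patient caught a cold".
scores = {"dsyn": 0.71, "npop": 0.42, "phsu": 0.15}

# "cold" is ambiguous between Disease or Syndrome (dsyn) and
# Natural Phenomenon or Process (npop); only those two are candidates.
best = pick_sense(["dsyn", "npop"], scores)
print(best)
```

Only the candidate types compete, so a high score for an unrelated semantic type does not affect the choice.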
II. St-Documents is the key factor of STI/STRI results:
Since 2007, the Lexical Systems Group has released two public TC packages, tc.2007 and tc.2008. The TC package includes the Java versions of JDI, STI, and STRI. As discussed above, the results of STI and STRI are mainly based on JDI and St-Documents. The study "A method for verifying a vector-based text classification system" by Lu CJ, Humphrey SM, Browne AC points out that JDI is a well-defined methodology and is considered a stable system with steady results. In other words, the results of JDI.2007 and JDI.2008 are very similar. Thus, the key factor of STI and STRI is the St-Documents. To confirm this assumption, we ran every combination of JDI and St-Documents from TC.2007 and TC.2008 on all 100 instances (both training set and test set) of the NLM's WSD collection, for three WSD algorithms with two scores (Document Count, DC, and Word Count, WC). The setup and results of this test are described in the following sections:
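The four JDI and St-Documents combinations used in this test can be enumerated as in the short sketch below. The variable names are illustrative only and are not taken from the TC package.

```python
from itertools import product

# Versions of the two root factors shipped with tc.2007 and tc.2008.
jdi_versions = ["JDI.2007", "JDI.2008"]
st_doc_versions = ["St-Documents.2007", "St-Documents.2008"]

# Cross every JDI version with every St-Documents version (2 x 2 = 4 runs).
configurations = list(product(jdi_versions, st_doc_versions))

for jdi, st_docs in configurations:
    print(f"{jdi} + {st_docs}")
```

Mixing versions in this way lets the effect of the St-Documents be separated from the effect of JDI, because each factor is varied while the other is held fixed.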
III. Setup of WSD testing suite:
Accordingly,
IV. Results of WSD test on 2007 vs. 2008:
JDI\St-Documents | St-Documents.2007 | St-Documents.2008 |
JDI.2007 | 75.00% | 74.58% |
JDI.2008 | 75.20% | 74.93% |
JDI\St-Documents | St-Documents.2007 | St-Documents.2008 |
JDI.2007 | 74.61% | 74.09% |
JDI.2008 | 74.96% | 73.81% |
JDI\St-Documents | St-Documents.2007 | St-Documents.2008 |
JDI.2007 | 74.32% | 73.87% |
JDI.2008 | 75.21% | 74.44% |
JDI\St-Documents | St-Documents.2007 | St-Documents.2008 |
JDI.2007 | 74.05% | 73.77% |
JDI.2008 | 74.33% | 73.52% |
As shown in all four tables above, the precision in the first column (St-Documents.2007) is better than the precision in the second column (St-Documents.2008) for both the target-sentence and context tests on both WC and DC scores. This implies that a better set of St-Documents (St-Documents.2007) can lead to better precision in WSD applications. Theoretically, the words in the St-Documents should be only those that best represent the associated ST. WSD precision from STI/STRI can be improved if we can enhance the St-Documents by finding the best word list for each ST.
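The column comparison can be checked directly from the reported numbers. The sketch below copies the precision values from the four tables in the order they appear and computes the St-Documents.2007 minus St-Documents.2008 difference for every row. The variable names are illustrative, and the tables are indexed by position only.

```python
# Precision values (%) copied from the four tables above, in order.
# Each entry: (JDI version, St-Documents.2007, St-Documents.2008).
tables = [
    [("JDI.2007", 75.00, 74.58), ("JDI.2008", 75.20, 74.93)],
    [("JDI.2007", 74.61, 74.09), ("JDI.2008", 74.96, 73.81)],
    [("JDI.2007", 74.32, 73.87), ("JDI.2008", 75.21, 74.44)],
    [("JDI.2007", 74.05, 73.77), ("JDI.2008", 74.33, 73.52)],
]

# Difference in percentage points for each row; rounding removes
# floating-point noise from the two-decimal inputs.
diffs = [
    round(p2007 - p2008, 2)
    for table in tables
    for _, p2007, p2008 in table
]

all_favor_2007 = all(d > 0 for d in diffs)
mean_diff = sum(diffs) / len(diffs)

print(diffs)
print(all_favor_2007, round(mean_diff, 2))
```

Every one of the eight differences is positive, with an average gap of roughly 0.58 percentage points in favor of St-Documents.2007.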