Text Categorization

Background

I. Two major factors of STI/STRI results:

Based on the study "Word sense disambiguation by selecting the best semantic type based on Journal Descriptor Indexing: preliminary experiment" by Humphrey SM, Rogers WJ, Kilicoglu H, Demner-Fushman D, Rindflesch TC, a WSD tool can be developed by utilizing Semantic Type Indexing (STI) or Semantic Type Real-Time Indexing (STRI). A candidate semantic type with a high rank/score in the STI/STRI results implies a better sense for the ambiguous word. The results of both STI and STRI are calculated as follows:

  • Generate Semantic Type Documents (St-Documents)
    St-Documents consist of one-word Metathesaurus strings belonging to a semantic type. Theoretically, these words (one-word strings) should be the words that best represent the associated semantic type. The format of the St-Documents is:
    semantic type|list of words belonging to (representing) this ST

  • Get JDI results on the St-Documents
    Use the word list of each ST as the input to JDI and save the results to the St-JDs table. The format of the St-JDs table is:
    semantic type|JD scores

  • Compute the similarity of the Word-JDs table and the St-JDs table and save the results to the Word-STs table (a sketch of this step follows the list below). The format of the Word-STs table is:
    word|ST scores
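
The following is a minimal Java sketch (Java because the TC package ships Java versions of these tools) of the similarity step above: one word is scored against every semantic type by taking the cosine coefficient of the word's JD vector and the corresponding St-JD vector, producing one row of a Word-STs style table. The class and method names are hypothetical; this is not the TC package API.

import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not the TC package API.
public class WordStScorer {

    // Cosine coefficient of two JD vectors of equal length.
    static double cosineCoefficient(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // One row of a Word-STs style table: the word's score for every semantic type,
    // computed from the word's JD vector and each St-JD vector.
    static Map<String, Double> scoreWordAgainstSts(double[] wordJdVector,
                                                   Map<String, double[]> stJdTable) {
        Map<String, Double> stScores = new HashMap<>();
        for (Map.Entry<String, double[]> st : stJdTable.entrySet()) {
            stScores.put(st.getKey(), cosineCoefficient(wordJdVector, st.getValue()));
        }
        return stScores;
    }
}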

The results of STI and STRI are based on the Word-STs table and the St-JDs table, respectively. Accordingly, these results depend on two root factors:

  • JDI
  • St-Documents

II. St-Documents are the key factor in STI/STRI results:

Since 2007, the Lexical Systems Group has released two public TC packages, tc.2007 and tc.2008. Each TC package includes the Java versions of JDI, STI, and STRI. As discussed above, the results of STI and STRI are mainly based on JDI and the St-Documents. The study "A method for verifying a vector-based text classification system" by Lu CJ, Humphrey SM, Browne AC points out that JDI is a well-defined methodology and is considered a stable system with steady results. In other words, the results of JDI.2007 and JDI.2008 are very similar. Thus, the key factor for STI and STRI is the St-Documents. To confirm this assumption, we ran each combination of JDI and St-Documents from TC.2007 and TC.2008 on all 100 instances (both training set and test set) of the NLM WSD collection for three WSD algorithms with two score types (document count and word count). The setup and results of this test are described in the following sections:

III. Setup of WSD testing suite:

  • Test data set
    We would like to test the WSD tool on the largest available data set. Thus, we tested WSD on both the training set (67 instances) and the test set (33 instances) of the NLM WSD test collection (100 instances) to evaluate the overall performance (precision, variation, etc.).
  • Ambiguous Words list
    There are 50 ambiguous words in the NLM WSD test collection. Five of them are excluded from this test (association, cold, man, sex, weight) because multiple concepts map to the same ST or there is no valid gold-standard answer in the test set.
  • Score types:
    There are two types of scores in the results of STI and STRI:
    • WC: word count score
    • DC: document count score

    Please refer to "Journal Descriptor Indexing tool for categorizing text according to discipline or semantic type" by Humphrey SM, Lu CJ, Rogers WJ, Browne AC for details.
  • STI or STRI
    There are two semantic type indexing tools in the TC package: STI and STRI.
    • STRI (Semantic Type Real-Time Indexing):
      STRI uses JDI to index all input words first, and then computes the cosine coefficient of the resulting JDI vector and each St-JD vector (from the St-JDs DB table) in real time.
    • STI (Semantic Type Indexing):
      STI pre-calculates and stores the cosine coefficients of all words against the St-JDs DB table in the database, and then calculates the average ST score over all legal input words.

    Accordingly,

    • if the input is a single word (word_1), the results are the same:
      • STRI: cos coef of (word_1-JDs, St-JD)
      • STI: cos coef of (word_1-JDs, St-JD)
    • if the input contains multiple words (say, word_1 and word_2), the results differ (but are similar):
      • STRI: cos coef of ((word_1-JDs + word_2-JDs)/2, St-JD)
      • STI: (cos coef of (word_1-JDs, St-JD) + cos coef of (word_2-JDs, St-JD))/2
    In this test, we only used STI; a sketch contrasting the two scoring orders follows this setup list.
  • Input contexts:
    Two types of input contexts are used:
    • Target sentence: just the sentence in which the ambiguous word appears
    • Entire citation: the title and abstract of the article
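
As a companion to the STI/STRI comparison above, the following minimal Java sketch contrasts the two scoring orders for a multi-word input and one semantic type: STRI averages the word JD vectors first and then takes a single cosine coefficient, while STI takes one cosine coefficient per word and then averages the scores. The names are hypothetical and do not come from the TC package.

// Hypothetical helper, not the TC package API.
public class StiVsStri {

    // Cosine coefficient of two JD vectors of equal length.
    static double cosineCoefficient(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0.0 || normB == 0.0)
                ? 0.0
                : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // STRI-style score: average the word JD vectors first, then take one
    // cosine coefficient against the St-JD vector.
    static double striScore(double[][] wordJdVectors, double[] stJdVector) {
        double[] mean = new double[stJdVector.length];
        for (double[] wordJd : wordJdVectors) {
            for (int i = 0; i < mean.length; i++) {
                mean[i] += wordJd[i] / wordJdVectors.length;
            }
        }
        return cosineCoefficient(mean, stJdVector);
    }

    // STI-style score: take the cosine coefficient of each word's JD vector
    // and the St-JD vector, then average the scores.
    static double stiScore(double[][] wordJdVectors, double[] stJdVector) {
        double sum = 0.0;
        for (double[] wordJd : wordJdVectors) {
            sum += cosineCoefficient(wordJd, stJdVector);
        }
        return sum / wordJdVectors.length;
    }
}

For a single input word both methods return the same value; for multiple words they generally differ slightly, matching the note in the STI/STRI item above.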

IV. Results of WSD test on 2007 vs. 2008:

  • Target Sentence - WC
    JDI \ St-Documents    St-Documents.2007    St-Documents.2008
    JDI.2007              75.00%               74.58%
    JDI.2008              75.20%               74.93%

  • Target Sentence - DC
    JDI \ St-Documents    St-Documents.2007    St-Documents.2008
    JDI.2007              74.61%               74.09%
    JDI.2008              74.96%               73.81%

  • Entire Citation - WC
    JDI \ St-Documents    St-Documents.2007    St-Documents.2008
    JDI.2007              74.32%               73.87%
    JDI.2008              75.21%               74.44%

  • Entire Citation - DC
    JDI \ St-Documents    St-Documents.2007    St-Documents.2008
    JDI.2007              74.05%               73.77%
    JDI.2008              74.33%               73.52%

As shown in all four tables above, the precision in the first column (St-Documents.2007) is better than the precision in the second column (St-Documents.2008) for both the target-sentence and entire-citation tests on both WC and DC scores. This implies that a better set of St-Documents (St-Documents.2007) leads to better precision in WSD applications. Theoretically, the words in the St-Documents should be only those words that best represent the associated ST. WSD precision from STI/STRI can therefore be improved if we enhance the St-Documents by finding the set of words that best represents each ST.