Text Categorization

St-Documents Enhancement

As discussed in the previous section, WSD precision can be improved by finding better St-Documents. In this study, we used St-Document.2008 as the baseline and applied several enhancement algorithms to obtain better St-Documents and higher WSD precision. This section summarizes the St-Documents enhancement algorithms. The research approach and results are detailed in the approach and results section.

Weighted Frequency
In the original St-Documents generation algorithm, each word appears only once in the St-Documents for an ST, even if the word is associated with that ST multiple times (from multiple sources). From a statistical point of view, the frequency of occurrence should be taken into account: good St-Documents should give more weight to words associated with an ST multiple times than to words associated with that ST only once. For example, "aspirin" is assigned to the ST Pharmacologic Substance (phsu) 19 times, because it is associated with phsu in different sources (LCH, MTH, RCD, SNM, NCI, MTHSPL, SNOMEDCT, PSY, USPMG, MSH, etc.), while "dayquil" is assigned to phsu only once. Accordingly, "aspirin" should carry more weight for phsu than "dayquil".
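
The weighting idea can be sketched as a simple count of (word, ST) associations across sources. This is a minimal illustration, not the algorithm used in the study: the function name `weighted_frequency`, the row layout, and the sample rows (including the placeholder source `SOURCE_X`) are assumptions made for the example.

```python
from collections import Counter

# Hypothetical (word, ST, source) associations. The rows are illustrative
# only; "SOURCE_X" is a placeholder, not a real source abbreviation.
associations = [
    ("aspirin", "phsu", "MSH"),
    ("aspirin", "phsu", "NCI"),
    ("aspirin", "phsu", "SNOMEDCT"),
    ("dayquil", "phsu", "SOURCE_X"),
]


def weighted_frequency(rows):
    """Count how many times each word is associated with each ST."""
    return Counter((word, st) for word, st, _source in rows)


weights = weighted_frequency(associations)
print(weights[("aspirin", "phsu")])  # 3
print(weights[("dayquil", "phsu")])  # 1
```

With counts like these, a word such as "aspirin" can be given a proportionally larger weight for phsu than a word that is associated with phsu only once.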

Prioritize with ST Groups
As discussed before, St-Documents should contain only the words that best represent the associated ST. A word can be associated with multiple concepts (CUIs) and thus with multiple Semantic Types (STs). The associated STs can belong to different Semantic Type Groups or to a single Semantic Type Group. A word associated with only one CUI is not ambiguous. Such words are associated with only one Semantic Type and one Semantic Type Group, and they should have higher priority for inclusion in the St-Documents than ambiguous words whose Semantic Types belong to different Semantic Type Groups. These criteria should be added to obtain better St-Documents.

For example, the word "cold" maps to 4 CUIs, which are associated with 4 STs in 4 different ST Groups.

CUI	Instances	ST	ST Group
C0009264	6	npop (T070: Natural Phenomenon or Process)	PHEN|Phenomena
C0009443	3	dsyn (T047: Disease or Syndrome)	DISO|Disorders
C0010412	1	topp (T061: Therapeutic or Preventive Procedure)	PROC|Procedures
C0234192	1	phsf (T039: Physiologic Function)	PHYS|Physiology

As another example, the word "fine" maps to 2 CUIs associated with 2 STs. Both STs fall into the same ST Group, CONC. Words of this type should have higher priority for inclusion in the St-Documents than words associated with multiple ST Groups.

CUI	Instances	ST	ST Group
C0205232	3	qlco (T080: Qualitative Concept)	CONC|Concepts & Ideas
C0687757	1	rnlw (T089: Regulation or Law)	CONC|Concepts & Ideas
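
The prioritization rule can be sketched as a check on how many distinct ST Groups a word maps to. The rows below are copied from the two tables above; the names `MAPPINGS`, `st_groups`, and `has_priority` are hypothetical helpers introduced for illustration, not part of the original tools.

```python
# (CUI, instances, ST, ST Group) rows taken from the "cold" and "fine" tables.
MAPPINGS = {
    "cold": [
        ("C0009264", 6, "npop", "PHEN"),
        ("C0009443", 3, "dsyn", "DISO"),
        ("C0010412", 1, "topp", "PROC"),
        ("C0234192", 1, "phsf", "PHYS"),
    ],
    "fine": [
        ("C0205232", 3, "qlco", "CONC"),
        ("C0687757", 1, "rnlw", "CONC"),
    ],
}


def st_groups(word, mappings=MAPPINGS):
    """Return the set of distinct ST Groups the word is associated with."""
    return {group for _cui, _count, _st, group in mappings[word]}


def has_priority(word, mappings=MAPPINGS):
    """True when all STs of the word fall into a single ST Group."""
    return len(st_groups(word, mappings)) == 1


print(has_priority("fine"))  # True
print(has_priority("cold"))  # False
```

Under this sketch, "fine" would be preferred over "cold" as a candidate for the St-Documents, because all of its STs belong to CONC.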

STRI-Filter
After the St-Documents are generated, we run JDI on them to generate the St-Jds table, which is needed to run the STRI (Semantic Type Index, Real-time) tool. A STRI filter can then be applied by running STRI on each word in the St-Documents and examining the score of the associated ST. A good word to represent an ST in the St-Documents should have a high STRI score (rank) for that ST. This recursive filter algorithm is used to refine the St-Documents.
In this study, we used three criteria in the STRI recursive filter algorithm:
- Standard Deviation:
  The associated ST score of the target word should be within one standard deviation of the top score.
- Top rank:
  The associated ST score of the target word should be within the designated range of top ranks.
- Combination:
  A combination of the two criteria above. For example, the STRI score should be within one standard deviation of the top score and within the top 10 ranks.
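
The three criteria above can be sketched as predicates over a mapping from ST to STRI score for one word. This is an illustrative sketch under stated assumptions: the function names, the `scores` dictionary layout, and the choice of population standard deviation (`statistics.pstdev`) are not taken from the STRI tool, and the sample scores are made up.

```python
from statistics import pstdev


def passes_std_dev(scores, st):
    """ST score lies within one standard deviation of the top score."""
    values = list(scores.values())
    # Assumption: population standard deviation over all ST scores of the word.
    return max(values) - scores[st] <= pstdev(values)


def passes_top_rank(scores, st, top_n=10):
    """ST is among the top_n STs when ranked by descending score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return st in ranked[:top_n]


def passes_combination(scores, st, top_n=10):
    """Both the standard deviation and the top rank criteria hold."""
    return passes_std_dev(scores, st) and passes_top_rank(scores, st, top_n)


# Made-up scores for a single word, for illustration only.
scores = {"phsu": 0.9, "orch": 0.8, "dsyn": 0.2, "topp": 0.1}
print(passes_std_dev(scores, "orch"))          # True
print(passes_std_dev(scores, "dsyn"))          # False
print(passes_top_rank(scores, "orch", 1))      # False
print(passes_combination(scores, "orch", 2))   # True
```

A word would be kept in the St-Documents for a given ST only if the chosen predicate holds for that ST; repeating the generate-and-filter cycle gives the recursive refinement described above.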

Combination of Prioritizing ST Groups and STRI-Filter
Different STRI filter criteria should be applied depending on whether a word is associated with one ST Group or with multiple ST Groups (prioritizing). In summary:
- One St-group words: looser STRI-filter criteria
- Multiple St-group words: tighter STRI-filter criteria
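
One way to express this combination is to select the filter parameters from the number of distinct ST Groups a word has. The sketch below is hypothetical: `StriCriteria`, `LOOSE`, `TIGHT`, and the threshold values are placeholders chosen for illustration and are not the values used in the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StriCriteria:
    max_std_devs: float  # allowed distance from the top score, in std devs
    top_n: int           # allowed rank range


# Placeholder thresholds (assumptions, not values from the study).
LOOSE = StriCriteria(max_std_devs=2.0, top_n=20)
TIGHT = StriCriteria(max_std_devs=1.0, top_n=10)


def criteria_for(groups):
    """Looser criteria for one-ST-Group words, tighter ones otherwise."""
    return LOOSE if len(set(groups)) == 1 else TIGHT


print(criteria_for(["CONC", "CONC"]) is LOOSE)                  # True
print(criteria_for(["PHEN", "DISO", "PROC", "PHYS"]) is TIGHT)  # True
```

Keeping the thresholds in one small structure makes it easy to try different loose and tight settings without changing the filter logic.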

The approach and testing results are described in the next section.