Text Categorization

stWsd Design and Algorithm

I. Design

Inputs
A tool's inputs should be simple and straightforward: only essential information is required, and everything else is optional with a default value. The stWsd tool requires only three inputs:
- Ambiguous word
- Possible choices (ST candidates)
- The sentence/paragraph containing the ambiguous word.
  It can be a sentence (the target sentence) or a paragraph (the entire citation). If it is a paragraph, stWsd provides an option to convert the input text to ambiguous sentences.
Outputs
The only output is the best selected sense (ST).
Options
- backend program: STI, STRI
- score type: WC, DC, ES
- sentence type: convert to ambiguous sentences or use the original input
- show detailed information
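
As an illustration only, the inputs and options above could be grouped into a single request object. The following Python sketch is not the stWsd API: the class and field names (`WsdRequest`, `Backend`, `ScoreType`) and every default value are assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Backend(Enum):
    """Backend program option (names taken from the options list)."""
    STI = "STI"
    STRI = "STRI"


class ScoreType(Enum):
    """Score type option: word count, document count, or expert score."""
    WC = "WC"
    DC = "DC"
    ES = "ES"


@dataclass
class WsdRequest:
    """Hypothetical container for the three required inputs plus options."""
    ambiguous_word: str                   # required
    st_candidates: List[str]              # required, e.g. ["elii", "lbpr"]
    text: str                             # required: sentence or paragraph
    backend: Backend = Backend.STI        # assumed default
    score_type: ScoreType = ScoreType.ES  # assumed default
    use_ambiguous_sentences: bool = True  # assumed default
    show_details: bool = False            # assumed default
```

Only the three required fields need to be supplied, for example `WsdRequest("lead", ["elii", "lbpr"], "blood lead levels")`; the options fall back to their (assumed) defaults.
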

II. Algorithm

Tokenization
STI and STRI are word-based programs that calculate the St score at the word level. Accordingly, the first step of StWsd is to tokenize the input text into words, using the same input filter as STI/STRI. After the input is tokenized into words, the ambiguous variants and forced legal words are identified.
Forced legal words
StWsd uses STI/STRI as the backend program to find the best St-Candidate. When STI/STRI is applied, only legal words are used to calculate the ST scores. Stop words, restricted words, and signal frequency are the three major constraints on legal words. We must avoid the case where no result is found because the input text contains no legal words. To handle this case, the ambiguous word and its morphological variants are treated as legal words so that STI/STRI can still find the best St-Candidate. This list of words is called the forced legal words, and the algorithm is as follows:
- Tokenize the ambiguous word and the ambiguous variants into words
- Unify words (remove duplicates)
- Sort words
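
The three steps above can be sketched in Python as follows. This is a minimal illustration, not the actual implementation: `tokenize_words` is a hypothetical stand-in for the STI/STRI input filter (whose exact rules are not described here), and the function names are assumptions.

```python
import re


def tokenize_words(text):
    """Split text into lowercase word tokens.

    Assumption: a simple alphanumeric split stands in for the real
    STI/STRI input filter.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def forced_legal_words(ambiguous_word, ambiguous_variants):
    """Tokenize the word and its variants, remove duplicates, and sort."""
    words = set()
    for term in [ambiguous_word, *ambiguous_variants]:
        words.update(tokenize_words(term))  # the set removes duplicates
    return sorted(words)
```

For instance, `forced_legal_words("lead", ["lead", "leads", "leading", "leaded"])` yields one sorted entry per distinct word.
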
Morphological variants of the ambiguous word
As discussed above, the morphological variants of the ambiguous word are used as forced legal words. These morphological variants are called ambiguous variants; the list is unique and includes the ambiguous word itself. The fruitful variants flow component from Lexical Tools is used for this purpose. The algorithm is summarized below:
- Get the fruitful variants of the ambiguous word (including the ambiguous word itself)
- Lowercase the found variants
- Remove variants with punctuation
- Unify variants (remove duplicates)
- Sort variants
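
A minimal Python sketch of these steps is shown below. The call into Lexical Tools is not reproduced: `get_fruitful_variants` is a hypothetical callable supplied by the caller that returns the fruitful variants of a word, and the function name `ambiguous_variants` is likewise an assumption.

```python
import string


def ambiguous_variants(ambiguous_word, get_fruitful_variants):
    """Build the sorted, unique list of ambiguous variants.

    get_fruitful_variants is a hypothetical stand-in for the fruitful
    variants flow of Lexical Tools.
    """
    # Get the variants and make sure the ambiguous word itself is included.
    variants = list(get_fruitful_variants(ambiguous_word)) + [ambiguous_word]
    # Lowercase; the set also removes duplicates.
    lowered = {v.lower() for v in variants}
    # Remove variants that contain any punctuation character.
    clean = {v for v in lowered
             if not any(ch in string.punctuation for ch in v)}
    return sorted(clean)
```

In a test, the second argument can be replaced by a stub such as `lambda w: ["Lead", "leads", "lead's"]`.
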
Ambiguous Sentences
If the input text is a paragraph, such as a title and abstract, some sentences have nothing to do with the meaning of the ambiguous word and should be removed. A good way to filter out such sentences is to remove those that do not contain any ambiguous variant. In other words, only sentences that contain ambiguous variants are used as the input for StWsd. These filtered sentences are called ambiguous sentences. The algorithm is summarized as follows:
- Tokenize the input into sentences
- Remove sentences that do not contain ambiguous variants
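
The filter can be sketched in Python as follows. This is an illustration under stated assumptions: the regular-expression sentence splitter and the word-level matching are simplifications chosen for the example (they handle single-word variants only), and the function names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace.

    Assumption: the real tool's sentence tokenizer is more robust.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def ambiguous_sentences(text, variants):
    """Keep only the sentences that contain at least one ambiguous variant."""
    variant_set = {v.lower() for v in variants}
    kept = []
    for sentence in split_sentences(text):
        words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        if words & variant_set:  # at least one variant occurs as a word
            kept.append(sentence)
    return kept
```

Matching on whole words rather than substrings avoids keeping a sentence only because a variant such as "lead" appears inside an unrelated word such as "misleading".
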
Optimum score
Two programs, STI and STRI, are used to calculate the St score/rank in StWsd. Both programs provide WC (word count) and DC (document count) scores. In total, this gives four score systems for the same input:
- STI-DC
- STI-WC
- STRI-DC
- STRI-WC
Sometimes the scores from the above four systems do not agree with each other. Which one should we use to get the best result? StWsd provides a new score system, ES (Expert Score system), which combines the above four systems into one. In our tests, we found that:
- STI always has better precision than STRI
- The results from STI-DC and STI-WC are very similar but not the same. In general, STI-WC is slightly better. However, there are cases where STI-DC is better than STI-WC.
Based on the above observations, we use the following algorithm to generate a new score system, ES:
- When WC and DC have the same best St-Candidate, use it.
- When WC and DC have different best St-Candidates, use the one with the highest relative score. The relative score is calculated as the difference between the scores of the St-Candidates within the same score system (WC or DC).
For example:
- Ambiguous word:
  - lead
- St-Candidates:
  - elii: T196|Element, Ion, or Isotope
  - lbpr: T059|Laboratory Procedure
- Input Text:
  - "These data indicate that mouthing behaviors are an important mechanism of exposure among urban children with low-level elevations in blood lead and that lead-based paint is a more important contributor of lead to house dust than is lead-contaminated soil."
- Detailed scores for the St-Candidates:
```
--- ST scores (x 1) and rank based on word count ---
7|0.5590|elii|T196|Element, Ion, or Isotope
16|0.4703|lbpr|T059|Laboratory Procedure
--- ST scores (x 1) and rank based on document count ---
11|0.5257|lbpr|T059|Laboratory Procedure
13|0.5145|elii|T196|Element, Ion, or Isotope
```
  - WC relative score of elii: 0.0887 (= 0.5590 - 0.4703)
  - DC relative score of lbpr: 0.0112 (= 0.5257 - 0.5145)
  Thus, ES chooses [elii], which has the higher relative score (0.0887).
- Correct choice (reviewed):
  - [elii], same as ES
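
The ES selection rule and the worked example can be sketched in Python as follows. This is a minimal illustration, not the StWsd implementation: the function names are hypothetical, and two details are assumptions made here because the text does not specify them: with more than two candidates the relative score is taken as the margin of the top candidate over the runner-up, and a tie between the two relative scores is resolved in favor of WC.

```python
def relative_score(scores):
    """Return (best candidate, margin over the runner-up) for one score system.

    scores maps each St-Candidate to its ST score.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best, best_score = ranked[0]
    # Assumption: with a single candidate the margin is its own score.
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0
    return best, best_score - runner_up_score


def expert_score_choice(wc_scores, dc_scores):
    """Choose the best St-Candidate from the WC and DC scores (ES rule)."""
    wc_best, wc_margin = relative_score(wc_scores)
    dc_best, dc_margin = relative_score(dc_scores)
    if wc_best == dc_best:
        return wc_best  # WC and DC agree
    # WC and DC disagree: take the one with the higher relative score.
    # Assumption: ties go to WC.
    return wc_best if wc_margin >= dc_margin else dc_best


# Scores from the "lead" example.
wc = {"elii": 0.5590, "lbpr": 0.4703}
dc = {"lbpr": 0.5257, "elii": 0.5145}
choice = expert_score_choice(wc, dc)
```

With the example scores, WC favors elii by 0.0887 and DC favors lbpr by 0.0112, so `choice` is `"elii"`, matching the reviewed answer.
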

In our tests, the STI-ES system achieved the best performance (79.08%). The details are discussed in the next section.