Text Categorization

sti



Introduction

The Sti tool uses the JDI methodology to calculate word-St scores. The calculation draws on words in the MEDLINE training set and on ST (semantic type) documents; an ST document (stDocument) is the set of one-word UMLS Metathesaurus strings belonging to an ST. The word-St scores for a word are calculated by comparing the JDI of the word with the JDI of each ST document. These pre-calculated word-St scores are loaded into a database. Sti takes text phrases as input and applies filters such as word extraction algorithms, stopwords, and minimum word length. It then calculates the average word-St scores for all inputs and sends the ranked STs with their scores to the output.
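The filter, average, and rank steps described above can be illustrated with a minimal Java sketch. It is not the Sti source code: the class name StiSketch, both method names, the tokenization rule, and the in-memory map standing in for the word-St score database are all assumptions made for illustration. The sketch also assumes that words without stored scores are skipped and that an ST missing for a scored word contributes zero to the average.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class StiSketch {

    /** Hypothetical filter step: lowercase, split on non-alphanumerics, drop stopwords and short words. */
    public static List<String> extractWords(String text, Set<String> stopwords, int minLength) {
        List<String> words = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.length() >= minLength && !stopwords.contains(token)) {
                words.add(token);
            }
        }
        return words;
    }

    /**
     * Averages the per-word ST scores over the scored input words and
     * returns the STs sorted from highest to lowest average score.
     * wordStScores stands in for the pre-calculated score database (assumption).
     */
    public static List<Map.Entry<String, Double>> rankSemanticTypes(
            List<String> words, Map<String, Map<String, Double>> wordStScores) {
        Map<String, Double> sums = new LinkedHashMap<>();
        int scoredWords = 0;
        for (String word : words) {
            Map<String, Double> scores = wordStScores.get(word);
            if (scores == null) {
                continue; // assumption: words with no stored scores are ignored
            }
            scoredWords++;
            for (Map.Entry<String, Double> e : scores.entrySet()) {
                sums.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        List<Map.Entry<String, Double>> ranked = new ArrayList<>();
        for (Map.Entry<String, Double> e : sums.entrySet()) {
            // sums is empty when scoredWords is 0, so this division is safe
            ranked.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue() / scoredWords));
        }
        ranked.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return ranked;
    }
}
```

Keeping the filter and the ranking in separate methods mirrors the order in the description: the input is reduced to words first, and only the surviving words take part in the average.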

Set Up

Follow the installation instructions to install the Text Categorization tools and run the sti program. Check the following items only if you did not use the provided script to install the Text Categorization tools.

  • CLASSPATH:
    1. Include the Text Categorization tools distribution jar file, ${TC_DIR}/lib/tc2011dist.jar, in your CLASSPATH.
    2. Include the TC top directory in your CLASSPATH.

  • Configuration File: assign the full path of the top directory of tc2011 to a variable named ROOT_DIR in the configuration file, data/Config/tc.properties.
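A quick way to verify the configuration item is to read ROOT_DIR back from the file. The sketch below is a hypothetical helper, not part of the Text Categorization distribution: the class name ConfigCheck and the method readRootDir are invented here, and it assumes tc.properties uses the standard Java properties format (key=value).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

public class ConfigCheck {

    /** Reads ROOT_DIR from a properties file; returns null if the key is absent. */
    public static String readRootDir(Path configFile) throws IOException {
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(configFile)) {
            props.load(in);
        }
        String value = props.getProperty("ROOT_DIR");
        return value == null ? null : value.trim();
    }

    public static void main(String[] args) throws IOException {
        // Default path is relative to the TC top directory (assumption: run from there).
        Path configFile = Paths.get(args.length > 0 ? args[0] : "data/Config/tc.properties");
        String rootDir = readRootDir(configFile);
        if (rootDir == null || !Files.isDirectory(Paths.get(rootDir))) {
            System.err.println("ROOT_DIR is missing or not a directory: " + rootDir);
        } else {
            System.out.println("ROOT_DIR = " + rootDir);
        }
    }
}
```

Run from the TC top directory, the program reports the configured value or flags a missing or invalid path before sti is started.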

Test Run

Input

Sti takes text as input.

Output

Sti calculates the average ST scores of the input text for both word counts and document counts and sends the top-ranked ST to the output. If the detail flag, -d, is used, the results include the rank and ST scores in the following format:

Rank    ST score    ST abbreviation    ST name

Sti Options

Please refer to the design document.