Text Categorization

stri



Introduction

The Stri tool is based on the Jdi methodology. It relies on ST (semantic type) documents; an ST document is the set of one-word UMLS Metathesaurus strings that belong to a given ST. Stri accepts text phrases or MeSH terms as input. Text input is first passed through filters such as word extraction algorithms, stopwords, and minimum word length. Stri then ranks the STs for the input by comparing the JDI of the input (the result of running the Jdi tool on the input in real time) with the pre-calculated JDI of each ST document, and sends the ranked STs with their scores to the output.
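The ranking step described above can be sketched in a few lines of Python. This is only an illustration, not the tool's implementation: the similarity measure is assumed here to be cosine similarity, the function names (extract_words, cosine, rank_semantic_types) are hypothetical, the stopword list is a tiny placeholder, and the score vectors are made-up numbers standing in for the JDI results that the Jdi tool would produce.

```python
import math
import re

# Illustrative stopword list only; the real tool ships its own.
STOPWORDS = {"the", "of", "and", "in", "a", "to"}


def extract_words(text, min_length=3):
    """Lowercase the text, split it into words, drop stopwords and short words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS and len(w) >= min_length]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts (an assumed measure)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_semantic_types(input_vector, st_vectors):
    """Return (ST, score) pairs sorted from most to least similar to the input."""
    scored = [(st, cosine(input_vector, vec)) for st, vec in st_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical pre-calculated vectors for two ST documents (made-up numbers).
st_vectors = {
    "dsyn": {"Cardiology": 0.8, "Neurology": 0.2},
    "phsu": {"Pharmacology": 0.9, "Cardiology": 0.1},
}
# Hypothetical vector for the input text, as if produced by the Jdi step.
input_vector = {"Cardiology": 0.7, "Neurology": 0.3}

ranking = rank_semantic_types(input_vector, st_vectors)
for rank, (st, score) in enumerate(ranking, start=1):
    print(rank, st, round(score, 4))
```

In this sketch the filtering step and the ranking step are kept separate, so a different similarity measure could be swapped in without touching the word extraction.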

Set Up

Follow the installation instructions to install the Text Categorization tools and run the sti program. Check the following items only if you did not use the provided script to install the Text Categorization tools.

  • CLASSPATH:
    1. include the Text Categorization tools distribution jar file, ${TC_DIR}/lib/tc2011dist.jar, in your CLASSPATH.
    2. include the TC top directory in your CLASSPATH.

  • Configuration File: assign the full path of the top directory of tc2011 to a variable named ROOT_DIR in the configuration file, data/Config/tc.properties.
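For a manual installation, the items above can be checked with a small script. The following is a minimal sketch, not part of the distribution: check_setup and read_properties are hypothetical helpers, and the sketch assumes that tc.properties uses plain key=value lines.

```python
import os


def read_properties(path):
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    props = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line[0] in "#!":
                continue
            key, sep, value = line.partition("=")
            if sep:
                props[key.strip()] = value.strip()
    return props


def check_setup(tc_dir, classpath):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    top = os.path.normpath(tc_dir)
    jar = os.path.normpath(os.path.join(tc_dir, "lib", "tc2011dist.jar"))
    entries = [os.path.normpath(p) for p in classpath.split(os.pathsep) if p]

    # CLASSPATH must list both the distribution jar and the TC top directory.
    if jar not in entries:
        problems.append("jar not in CLASSPATH: " + jar)
    if top not in entries:
        problems.append("TC top directory not in CLASSPATH: " + top)

    # ROOT_DIR in the configuration file must be the TC top directory.
    config = os.path.join(tc_dir, "data", "Config", "tc.properties")
    if not os.path.isfile(config):
        problems.append("missing configuration file: " + config)
    else:
        root_dir = read_properties(config).get("ROOT_DIR", "")
        if not root_dir or os.path.normpath(root_dir) != top:
            problems.append("ROOT_DIR does not point to " + top)
    return problems
```

A call such as check_setup(os.environ["TC_DIR"], os.environ.get("CLASSPATH", "")) would then report anything that still needs to be fixed, assuming a TC_DIR environment variable holds the top directory.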

Test Run

Input

Sti takes text as input.

Output

Stri calculates the average ST scores of the input text for both word counts and document counts and sends the top-ranked ST to the output. If the detail flag, -d, is used, the results include the rank and ST scores in the following format:

Rank    ST Scores    ST abbreviation    ST name

Stri Options

Please refer to the design document.