Text Categorization

sti

The Sti tool uses the JDI (Journal Descriptor Indexing) methodology to calculate word-ST scores. The calculation uses the words in the MEDLINE training set and the ST (semantic type) documents; an ST document is the set of one-word UMLS Metathesaurus strings belonging to an ST. The word-ST scores for a word are calculated by comparing the JDI of the word with the JDI of each ST document. The pre-calculated word-ST scores are loaded into a database. Sti takes text phrases as input and applies filters such as word extraction algorithms, stopwords, and minimum word length. Sti then calculates the average word-ST scores for the input and sends the ranked STs with their scores to the output.
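The filter, average, and rank steps described above can be sketched roughly as follows. This is a minimal illustration in Python, not the Sti implementation: the score table, the stopword list, the minimum word length, and the whitespace tokenizer are invented stand-ins for the pre-calculated database and Sti's real filters. Skipping words that have no scores is also an assumption.

```python
from collections import defaultdict

# Hypothetical word-ST scores; the real values come from the pre-calculated database.
WORD_ST_SCORES = {
    "heart": {"clna": 0.60, "patf": 0.50},
    "valve": {"clna": 0.40, "medd": 0.70},
}
STOPWORDS = {"the", "of", "a"}  # assumed stopword list
MIN_WORD_LENGTH = 2             # assumed minimum word length


def extract_words(text):
    """Lowercase, split on whitespace, and apply the stopword and length filters."""
    return [
        w for w in text.lower().split()
        if w not in STOPWORDS and len(w) >= MIN_WORD_LENGTH
    ]


def rank_sts(text, scores=WORD_ST_SCORES):
    """Average each ST's score over the scored input words, then rank high to low."""
    # Assumption: words without any pre-calculated score are ignored.
    words = [w for w in extract_words(text) if w in scores]
    if not words:
        return []
    totals = defaultdict(float)
    for w in words:
        for st, s in scores[w].items():
            totals[st] += s
    averaged = {st: total / len(words) for st, total in totals.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)
```

Calling rank_sts("the heart valve") with this toy table drops the stopword, averages the scores of "heart" and "valve" per ST, and returns the STs ordered from highest to lowest average.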

Set Up

Follow the installation instructions to install the Text Categorization tools and run the sti program. Check the following items only if you did not use the provided script to install the Text Categorization tools.

CLASSPATH:
1. Include the Text Categorization tools distribution jar file, ${TC_DIR}/lib/tc2011dist.jar, in your CLASSPATH.
2. Include the TC top directory in your CLASSPATH.

Configuration File:
Assign the full path of the tc2011 top directory to the variable ROOT_DIR in the configuration file, data/Config/tc.properties.
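As a quick sanity check of the configuration step, the sketch below reads the text of a properties file and reports whether ROOT_DIR holds a full path to an existing directory. It assumes the file uses plain key=value lines; the function names are made up for illustration and are not part of the Text Categorization tools.

```python
import os


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_root_dir(props):
    """Return a problem description, or None if ROOT_DIR is an existing full path."""
    root = props.get("ROOT_DIR")
    if root is None:
        return "ROOT_DIR is not set"
    if not os.path.isabs(root):
        return "ROOT_DIR is not a full path"
    if not os.path.isdir(root):
        return "ROOT_DIR does not exist"
    return None
```

To use it, pass the contents of data/Config/tc.properties to parse_properties and print whatever check_root_dir returns.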

Test Run

Run the Java program

Enter the command:


> sti -p
- Please input a term (type "Ctl-d" to quit) >
heart valve
--> Input: [heart valve]
--- ST scores (x 1) and rank based on word count ---
clna|T201|Clinical Attribute
1|0.6059|clna|T201|Clinical Attribute
2|0.5329|spco|T082|Spatial Concept
3|0.5262|patf|T046|Pathologic Function
4|0.4441|drdd|T203|Drug Delivery Device
5|0.4395|medd|T074|Medical Device
6|0.4024|fndg|T033|Finding
7|0.3594|ftcn|T169|Functional Concept
8|0.3028|clas|T185|Classification
9|0.2972|sosy|T184|Sign or Symptom
10|0.2882|aapp|T116|Amino Acid, Peptide, or Protein
--- ST scores (x 1) and rank based on document count ---
clna|T201|Clinical Attribute
1|0.7184|clna|T201|Clinical Attribute
2|0.5808|spco|T082|Spatial Concept
3|0.5700|patf|T046|Pathologic Function
4|0.4924|medd|T074|Medical Device
5|0.4792|drdd|T203|Drug Delivery Device
6|0.4592|fndg|T033|Finding
7|0.3950|ftcn|T169|Functional Concept
8|0.3607|sosy|T184|Sign or Symptom
9|0.3432|clas|T185|Classification
10|0.3399|diap|T060|Diagnostic Procedure
--- Overall ST rank ---
clna|T201|Clinical Attribute|dc

where:

sti: the script that runs the Sti Java class
-p: sets the Sti system option to show the input prompt (also try the -h option)

Input

Sti takes text as input, for example a term or phrase such as "heart valve" in the test run above.

Output

Sti calculates the average ST scores of the input text for both word counts and document counts and sends the top-ranked ST to the output. If the detail flag, -d, is used, the results also include the rank and ST scores, in the following format:

Rank|ST score|ST abbreviation|ST ID|ST name
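The ranked lines in the test run above (for example, 1|0.6059|clna|T201|Clinical Attribute) are pipe-delimited and carry five fields: rank, score, ST abbreviation, ST ID, and ST name. A minimal Python sketch for reading such lines is shown below. The class and function names are hypothetical, and the five-field layout is inferred from that sample output rather than from a format specification.

```python
from typing import NamedTuple


class StResult(NamedTuple):
    rank: int
    score: float
    abbreviation: str
    st_id: str
    name: str


def parse_detail_line(line):
    """Split one 'rank|score|abbreviation|ID|name' line into typed fields."""
    rank, score, abbreviation, st_id, name = line.strip().split("|", 4)
    return StResult(int(rank), float(score), abbreviation, st_id, name)


def parse_detail_lines(lines):
    """Parse only lines that start with a numeric rank; skip headers and summaries."""
    results = []
    for line in lines:
        first = line.split("|", 1)[0].strip()
        if first.isdigit() and line.count("|") >= 4:
            results.append(parse_detail_line(line))
    return results
```

Header lines such as "--- Overall ST rank ---" and summary lines that do not begin with a rank are skipped, so the captured output of a run can be passed to parse_detail_lines line by line.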

Sti Options

Please refer to the design document.