Text Categorization

jdi

The jdi tool uses statistical associations between words and Journal Descriptors (JDs), between MeSH main headings (MHs) and JDs, and between MeSH subheadings (SHs) and JDs, derived from a training set of MEDLINE citations. The word-JD, MH-JD, and SH-JD scores are precalculated and loaded into a database. Jdi accepts input that may be text phrases, MeSH terms, or a combination of both. Filters such as word extraction, stopword removal, and a minimum word length are applied to text input. Jdi then calculates the average score over all inputs and sends the ranked JDs with their scores to the output.
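The filtering, averaging, and ranking steps described above can be sketched as follows. This is a minimal illustration and not the jdi implementation: the class name JdScoreSketch, the in-memory score table, the stopword list, and the minimum word length are all assumptions made for the example, and the scores are made-up numbers. It also assumes that words missing from the table are skipped and that the average is a per-JD mean over the scored words. The distinction between word-count and document-count scores is not modeled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: filter words, average per-word JD scores, rank the JDs.
public class JdScoreSketch {
    // Assumed filter settings; the real tool's values may differ.
    static final Set<String> STOPWORDS = Set.of("the", "of", "and", "a", "in");
    static final int MIN_WORD_LENGTH = 2;

    // Lowercase the text, split on non-alphanumeric runs, drop stopwords and short words.
    public static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.length() >= MIN_WORD_LENGTH && !STOPWORDS.contains(token)) {
                words.add(token);
            }
        }
        return words;
    }

    // Average the JD scores over all words found in the table, then sort descending.
    public static List<Map.Entry<String, Double>> rank(
            List<String> words, Map<String, Map<String, Double>> wordJdScores) {
        Map<String, Double> sums = new HashMap<>();
        int scored = 0;
        for (String word : words) {
            Map<String, Double> jdScores = wordJdScores.get(word);
            if (jdScores == null) {
                continue; // assumption: words without scores are skipped
            }
            scored++;
            jdScores.forEach((jd, score) -> sums.merge(jd, score, Double::sum));
        }
        List<Map.Entry<String, Double>> ranked = new ArrayList<>();
        for (Map.Entry<String, Double> e : sums.entrySet()) {
            // sums is non-empty only when scored > 0, so this never divides by zero
            ranked.add(Map.entry(e.getKey(), e.getValue() / scored));
        }
        ranked.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        return ranked;
    }

    public static void main(String[] args) {
        // Made-up scores for illustration; real scores come from the precalculated database.
        Map<String, Map<String, Double>> table = Map.of(
                "heart", Map.of("JD018", 0.10, "JD124", 0.04),
                "valve", Map.of("JD018", 0.06, "JD144", 0.02));
        List<Map.Entry<String, Double>> ranked = rank(extractWords("heart valve"), table);
        for (int i = 0; i < ranked.size(); i++) {
            System.out.printf(Locale.ROOT, "%d|%.7f|%s%n",
                    i + 1, ranked.get(i).getValue(), ranked.get(i).getKey());
        }
    }
}
```

In this sketch a JD that is scored for only one of the two words still has its sum divided by the number of scored words, so JDs supported by more of the input rank higher.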

Jdi is the core methodology of the TC tools. It is used in the Sti and Stri programs, and it supports text categorization, content indexing, record retrieval, and word sense disambiguation.

Set Up

Follow the installation instructions to install the Text Categorization tools and run the jdi program. Check the following items only if you did not use the provided script to install the Text Categorization tools.

CLASSPATH:
1. Include the Text Categorization tools distribution jar file, ${TC_DIR}/lib/tc2011dist.jar, in your CLASSPATH.
2. Include the tc top directory in your CLASSPATH.

Configuration File: Assign the full path of the top directory of tc2011 to a variable named ROOT_DIR in the configuration file, data/Config/tc.properties.
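
One way to confirm that the configuration file carries the expected setting is to load it with java.util.Properties. The sketch below assumes the file uses standard Java properties syntax (ROOT_DIR=<path>), which the .properties extension suggests but the text above does not state. The class name ConfigCheckSketch and the example path are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper: report the ROOT_DIR value found in a properties source.
public class ConfigCheckSketch {
    // Returns the ROOT_DIR value, or null when the key is absent.
    public static String readRootDir(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("ROOT_DIR");
    }

    public static void main(String[] args) throws IOException {
        // With a path argument, check that file; otherwise use an in-memory example.
        try (Reader in = args.length > 0
                ? Files.newBufferedReader(Paths.get(args[0]))
                : new StringReader("ROOT_DIR=/hypothetical/path/to/tc2011\n")) {
            String rootDir = readRootDir(in);
            System.out.println(rootDir == null ? "ROOT_DIR is not set" : "ROOT_DIR = " + rootDir);
        }
    }
}
```

Under these assumptions, passing the path to data/Config/tc.properties as the first argument prints the value that would be picked up.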

Test Run

Run the Java program

Enter the command:


> jdi -p
- Please input a term (type "Ctl-d" to quit) >
heart valve
--> Input: [heart valve]
--- JD scores (x 1) and rank based on word count ---
JD018|Cardiology
1|0.0858526|JD018|Cardiology
2|0.0624434|JD148|Pulmonary Medicine
3|0.0495025|JD124|Vascular Diseases
4|0.0251979|JD144|General Surgery
5|0.0209033|JD030|Diagnostic Imaging
6|0.0108041|JD120|Transplantation
7|0.0090153|JD005|Anesthesiology
8|0.0086425|JD014|Biomedical Engineering
9|0.0067363|JD100|Radiology
10|0.0064961|JD118|Therapeutics
--- JD scores (x 1) and rank based on document count ---
JD018|Cardiology
1|0.1564322|JD018|Cardiology
2|0.0979494|JD148|Pulmonary Medicine
3|0.0891969|JD124|Vascular Diseases
4|0.0438102|JD030|Diagnostic Imaging
5|0.0400007|JD144|General Surgery
6|0.0236169|JD005|Anesthesiology
7|0.0187880|JD120|Transplantation
8|0.0158293|JD014|Biomedical Engineering
9|0.0151241|JD092|Physiology
10|0.0133293|JD118|Therapeutics
--- Overall JD rank ---
JD018|Cardiology|dc

where:

jdi: the script that runs the jdi Java class
-p: sets the jdi system option to show the prompt (try the -h option!)

Input

jdi takes two types of input:

Text:
  Phrase
  Title
  Abstract
  Title and abstract

MeSH: MeSH main headings or subheadings, separated by "|".

Output

jdi calculates the average JD scores of the input text for both word counts and document counts, then displays the top 10 JDs with their scores for each count. The top-ranked JD by document count is shown at the end as the overall JD rank.

Each ranked line contains four fields separated by "|": Rank, JD score, JD ID, JD name.
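
For downstream processing, each ranked line in the sample session (for example 1|0.0858526|JD018|Cardiology) can be split on "|" into the four fields listed above. The sketch below is one hypothetical way to do that. JdResultLine and its field names are illustrative and are not part of the Text Categorization tools.

```java
// Hypothetical value class for one ranked output line of the form rank|score|JD ID|JD name.
public class JdResultLine {
    public final int rank;
    public final double score;
    public final String jdId;
    public final String jdName;

    public JdResultLine(int rank, double score, String jdId, String jdName) {
        this.rank = rank;
        this.score = score;
        this.jdId = jdId;
        this.jdName = jdName;
    }

    // Parses a ranked line; throws IllegalArgumentException for any other line shape,
    // such as the "--- JD scores ..." banners or the two-field "JD018|Cardiology" line.
    public static JdResultLine parse(String line) {
        String[] fields = line.trim().split("\\|", -1);
        if (fields.length != 4) {
            throw new IllegalArgumentException(
                    "Expected 4 fields but found " + fields.length + ": " + line);
        }
        // NumberFormatException is a subclass of IllegalArgumentException.
        return new JdResultLine(
                Integer.parseInt(fields[0]),
                Double.parseDouble(fields[1]),
                fields[2],
                fields[3]);
    }

    public static void main(String[] args) {
        JdResultLine r = parse("1|0.0858526|JD018|Cardiology");
        System.out.println(r.jdId + " (" + r.jdName + "), rank " + r.rank + ", score " + r.score);
    }
}
```

Rejecting lines that do not have exactly four fields keeps the section banners and the overall-rank line from being mistaken for ranked results.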

jdi Options

Please refer to the design document.