Text Categorization

JDI: Text and MeSH

Description:
Read in text and MeSH terms, then perform JD (Journal Descriptor) indexing based on
- word frequency count
- document count for each word
In the input, text, MH, and SH are separated by '|'.
Inputs:
- text and MeSH terms: text, MH, and SH are separated by '|'
- a file, such as 9801.2004.TIABMH.in
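
As a rough illustration of the '|'-separated input described above, the following Python sketch splits one input line into its text, MH, and SH fields. The function name parse_tmh_line and the handling of missing fields are assumptions made for this example; they are not part of the JDI tool, and the sketch does not guess how several MeSH terms inside one field are delimited.

```python
def parse_tmh_line(line):
    """Split one 'text|MH|SH' input line into its three fields."""
    fields = [field.strip() for field in line.rstrip("\n").split("|")]
    # Assumption: a line with fewer than three fields is padded with
    # empty strings. This fallback is not documented JDI behavior.
    fields += [""] * (3 - len(fields))
    text, mh, sh = fields[:3]
    return text, mh, sh


if __name__ == "__main__":
    # Illustrative input only, not taken from 9801.2004.TIABMH.in
    print(parse_tmh_line("heart attack in adults|Myocardial Infarction|therapy"))
```

Returning plain strings keeps the split step separate from the later tokenizing and filtering steps, which have their own rules.
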
Algorithm:
- Pre-Process (Input Filter):
  - Separate phrase and MeSH
  - Tokenize all words of the input phrase
  - Apply Word Extraction Filter (if the input is a MEDLINE TI or AB field)
  - Apply acronym filter (TBD)
  - Filter out illegal words
  - Filter out duplicated words if unique flag is true
  - Assign the final words for processing
  - Tokenize SH and MH from the input MeSH terms
  - Filter out illegal MeSH terms (not in Mh-Jd Table or Sh-Jd Table)
  - Assign legal MeSH terms
- Process:
  - Get JD scores for each (legal) word in the text from DB: WORD_JD_SCORES table
  - Get JD scores for each (legal) MeSH term from DB: MH_JD_SCORES table and SH_JD_SCORES table
  - Calculate average JD scores for the phrase and the legal MeSH terms, for both word frequency count and document count.
- Post-process (Output Filter):
  - Print out Input term (text and MeSH)
  - Output filter details
  - Score entries display number
  - No output message
  - Cluster option
  - JD candidates
  - Use alphabetical order for JDs that have the same score (Ex: "taylor", "assault")
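
The averaging and tie-breaking steps listed above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the JDI implementation: an in-memory dictionary stands in for the WORD_JD_SCORES, MH_JD_SCORES, and SH_JD_SCORES tables, the function names are hypothetical, the scores and JD names are made-up illustrative values, and a plain arithmetic mean over the legal items is assumed for the "Avg. JD scores".

```python
from collections import defaultdict


def average_jd_scores(items, score_table):
    """Average per-JD scores over the items that appear in score_table.

    Items missing from score_table are treated as illegal and skipped.
    """
    legal = [item for item in items if item in score_table]
    if not legal:
        return {}
    totals = defaultdict(float)
    for item in legal:
        for jd, score in score_table[item].items():
            totals[jd] += score
    return {jd: total / len(legal) for jd, total in totals.items()}


def rank_jds(avg_scores):
    """Sort JDs by descending score; equal scores fall back to alphabetical order."""
    return sorted(avg_scores.items(), key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    # Hypothetical stand-in for a JD score table (values are invented).
    word_scores = {
        "heart": {"Cardiology": 0.5, "Pediatrics": 0.25},
        "child": {"Cardiology": 0.0, "Pediatrics": 0.25},
    }
    averages = average_jd_scores(["heart", "child", "the"], word_scores)
    print(rank_jds(averages))
    # prints [('Cardiology', 0.25), ('Pediatrics', 0.25)]
```

Sorting on the pair (negated score, JD name) gives the alphabetical tie-break in a single pass. The same two functions could be run once with scores based on word frequency count and once with scores based on document count.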

Sample commands:

> jdi -itmh -d -p
=> index text and MeSH terms read from standard input, with a prompt and detailed scores

> jdi -itmh -d -f:ml -i:9801.2004.TIABMH.in -o:9801.2004.TIABMH.out
=> index text and MeSH terms from the file 9801.2004.TIABMH.in, using the MEDLINE filter options with detailed scores, and send the results to the file 9801.2004.TIABMH.out

Sample Outputs:
- a file, such as 9801.2004.TIABMH.out