Text Categorization

PreProcess: Mh-Sh-Jdid-Dc (from MEDLINE)

Description:
JDI-MeSH is built from a training set of MEDLINE citations. The first step in establishing this training set is to extract Jdids, JDs, and starred MeSH (Mh & Sh) information from MEDLINE. Two files are generated for MeSH main headings and two for MeSH subheadings: one with the total document count and one with the document count per Jdid. These files are used to calculate the final scores of the MH-Jdid-Dc and SH-Jdid-Dc tables.
Input:
MEDLINE training set for tc2007
- MEDLINE 2004: /nfsvol/indaux/MEDLINE_baseline/2004/medline04n${NUM}.txt
- Date created (DA) in years: 1999, 2000, 2001
- ${NUM} identifies the files that include citations with a DA in 1999, 2000, or 2001
- jds.txt
- jidTaJds.txt
- shs.txt
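
The DA-year restriction on the input can be sketched as follows. This is only an illustration: it assumes the citation files use the tagged MEDLINE text layout in which the creation date appears on a line such as "DA  - 19990312", and the class and method names (DaYearFilter, daYear, inRange) are hypothetical, not taken from the original source.

```java
import java.util.Set;

// Hypothetical helper; the "DA  - YYYYMMDD" line layout is an assumption.
final class DaYearFilter {
    // Years accepted for the training set, as listed in the Input section.
    static final Set<Integer> YEARS = Set.of(1999, 2000, 2001);

    // Returns the year of a DA line, or -1 if the line is not a usable DA line.
    static int daYear(String line) {
        if (!line.startsWith("DA  - ") || line.length() < 10) {
            return -1;
        }
        try {
            return Integer.parseInt(line.substring(6, 10));
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    // True if the line is a DA line whose year is one of the accepted years.
    static boolean inRange(String line) {
        return YEARS.contains(daYear(line));
    }
}
```

A citation would be kept only if inRange returns true for its DA line.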
Java File & Algorithm:
- GenerateFilesFromMedLine.java:
  - Read in all fields (PMID, TI, AB, TA, JID, RN, MH) from MEDLINE citations if DA is within the specified range
  - Read in the JD information for each citation through its JID
  - Check if DA (date created) is in the specified years
  - Check if the citation has JDs
- Update MH document count and MH-JD document count
- Update SH document count and SH-JD document count
- Print out total document count for MH, MH-JDID, SH, SH-JDID, respectively:
  - Send MH, DC to mhDc.txt
  - Send MH-JDID, DC to mhJdidDc.txt
  - Send SH, DC to shDc.txt
  - Send SH-JDID, DC to shJdidDc.txt
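
The counting steps above can be sketched roughly as follows. This is a minimal illustration, not the original GenerateFilesFromMedLine.java: the class HeadingCounts, its method names, and the "|" separator used to build heading-JDID keys are all assumptions. The same class could be instantiated once for MH and once for SH.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch of the document-count bookkeeping.
final class HeadingCounts {
    private final Set<Integer> years;                            // accepted DA years
    final Map<String, Integer> headingDc = new TreeMap<>();      // e.g. content of mhDc.txt
    final Map<String, Integer> headingJdidDc = new TreeMap<>();  // e.g. content of mhJdidDc.txt

    HeadingCounts(Set<Integer> years) {
        this.years = years;
    }

    // Counts one citation. Headings and JDIDs are de-duplicated so that a
    // citation adds at most 1 to any document count.
    void add(int daYear, List<String> jdids, List<String> headings) {
        if (!years.contains(daYear) || jdids.isEmpty()) {
            return; // skip: DA outside the range, or the citation has no JDs
        }
        Set<String> uniqueJdids = new HashSet<>(jdids);
        for (String heading : new HashSet<>(headings)) {
            headingDc.merge(heading, 1, Integer::sum);
            for (String jdid : uniqueJdids) {
                // The "|" key separator is an assumption made for this sketch.
                headingJdidDc.merge(heading + "|" + jdid, 1, Integer::sum);
            }
        }
    }
}
```

Writing each map entry as one line would then give the contents of the four output files; the actual line format of those files is not specified here.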

Output File:

Notes:

Make sure all JDs are defined in both files: jds.txt and jidTaJds.txt. Otherwise, this program will generate an error message when it reaches a JD that appears in jidTaJds.txt but not in the JDs list.
These files are generated along with all other files from MEDLINE.
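
The consistency requirement between jds.txt and jidTaJds.txt noted above could be checked before running the program with something like the sketch below. It assumes both files have already been parsed into in-memory collections (the file formats are not described here), and the names JdConsistencyCheck and undefinedJds are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical pre-check; parsing of jds.txt and jidTaJds.txt is left out.
final class JdConsistencyCheck {
    // definedJds: JDs listed in jds.txt.
    // jdsByJid: for each JID in jidTaJds.txt, the JDs assigned to that journal.
    // Returns, in sorted order, every JD that is used but not defined.
    static Set<String> undefinedJds(Set<String> definedJds,
                                    Map<String, List<String>> jdsByJid) {
        Set<String> missing = new TreeSet<>();
        for (List<String> jds : jdsByJid.values()) {
            for (String jd : jds) {
                if (!definedJds.contains(jd)) {
                    missing.add(jd);
                }
            }
        }
        return missing;
    }
}
```

An empty result means every JD referenced in jidTaJds.txt is present in the JDs list.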