PreProcess: Files from MEDLINE baseline
- Description:
JDI is based on a training set built from MEDLINE citations.
The first step in establishing this training set is to extract titles, abstracts, JIDs, JDs, and starred MeSH information from MEDLINE.
- Input:
MEDLINE training set for tc2007
- MEDLINE 2004: /nfsvol/indaux/MEDLINE_baseline/2004/medline04n${NUM}.txt
- Date created (DA) in years: 1999, 2000, 2001
- ${NUM} identifies the files that include citations with a DA in 1999, 2000, or 2001
- jds.txt
- jidTaJds.txt
- contractions.txt
- shs.txt
- Java File & Algorithm:
- GenerateFilesFromMedLine.java:
- Read in all fields (PMID, TI, AB, TA, JID, RN, MH) from MEDLINE citations if DA is within the specified range
- Look up JD information by JID for each citation
- Check if DA (created date) is in specified years
- Check if this citation has JDs
- Send PMID, TI, AB, TA, JID, RNs, MHs, JDs to pmidJd${NUM}.txt
- Send filtered tokenized words (rules and algorithm) from the title to uiTiWords.${NUM}.txt
- Send filtered tokenized words (rules and algorithm) from the abstract to uiAbWords.${NUM}.txt
- Send PMID, JID, JDs to uiJidJds${NUM}.txt
- Update MH document count and MH-JD document count
- Update SH document count and SH-JD document count
- Print out total document counts for MH, MH-JDID, SH, and SH-JDID, respectively:
- Send MH, MH_DC, JDs, JD_DC to mhStarJd.txt
- Send SH, SH_DC, JDs, JD_DC to shStarJd.txt
- Send MH, DC to mhDc.txt
- Send MH-JDID, DC to mhJdidDc.txt
- Send SH, DC to shDc.txt
- Send SH-JDID, DC to shJdidDc.txt
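The filtering and counting steps above can be sketched roughly as follows. This is an illustrative sketch only, not the actual GenerateFilesFromMedLine.java: the class name DocCountSketch, its method names, the "MH|JD" key format, and the assumption that a DA value is a YYYYMMDD string such as 19990115 are all hypothetical. It shows a citation being counted only when its DA falls in the specified years and it has at least one JD, with each MH and each MH-JD pair counted once per citation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; names and formats are assumptions, not the real program.
class DocCountSketch {
    // MH -> number of accepted citations that carry this heading
    final Map<String, Integer> mhDc = new HashMap<>();
    // "MH|JD" -> number of accepted citations with this heading and this JD
    // (the "|" separator is an assumption made for this sketch)
    final Map<String, Integer> mhJdDc = new HashMap<>();

    // Assumes DA looks like "19990115" (YYYYMMDD); anything else is rejected.
    static boolean daInYears(String da, Set<Integer> years) {
        if (da == null || da.length() < 4) {
            return false;
        }
        try {
            return years.contains(Integer.parseInt(da.substring(0, 4)));
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Updates the counts for one citation. Returns false, and counts nothing,
    // if DA is outside the specified years or the citation has no JDs.
    boolean addCitation(String da, Set<Integer> years,
                        Set<String> mhs, Set<String> jds) {
        if (!daInYears(da, years) || jds == null || jds.isEmpty()) {
            return false;
        }
        for (String mh : mhs) {
            mhDc.merge(mh, 1, Integer::sum);
            for (String jd : jds) {
                mhJdDc.merge(mh + "|" + jd, 1, Integer::sum);
            }
        }
        return true;
    }
}
```

The same pattern would apply to the SH and SH-JD counts. Passing the headings and JDs as sets keeps each count a document count rather than an occurrence count.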
- Output File:
- Notes:
- Make sure all JDs are defined in both files: jds.txt and jidTaJds.txt.
Otherwise, this program generates an error message when it reaches a JD that appears in jidTaJds.txt but not in the JD list.
- Some files are formatted for comparison with Susanne's files. They are not used in generating a new training set.
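The JD consistency check in the first note could be run ahead of time with something like the sketch below. It is only an assumption about how such a check might look: the class JdConsistencyCheck and its method are hypothetical, and because the layouts of jds.txt and jidTaJds.txt are not described here, the sketch takes already-parsed data (the set of JDs defined in jds.txt, and a JID-to-JDs map loaded from jidTaJds.txt) rather than reading the files itself.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical pre-check; not part of GenerateFilesFromMedLine.java.
class JdConsistencyCheck {
    // Returns every JD that is referenced for some JID but is missing
    // from the defined JD list, sorted for stable reporting.
    static Set<String> undefinedJds(Set<String> definedJds,
                                    Map<String, List<String>> jidToJds) {
        Set<String> missing = new TreeSet<>();
        for (List<String> jds : jidToJds.values()) {
            for (String jd : jds) {
                if (!definedJds.contains(jd)) {
                    missing.add(jd);
                }
            }
        }
        return missing;
    }
}
```

An empty result means every JD referenced in the JID mapping is defined, so the error described above should not be triggered by the input files.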