Text Categorization

PreProcess: Files from MEDLINE baseline

  • Description:
    JDI is based on a training set built from MEDLINE citations. The first step in establishing this training set is to extract titles, abstracts, JIDs, JDs, and starred MeSH information from MEDLINE.

  • Input:
    MEDLINE training set for tc2007
    • MEDLINE 2004: /nfsvol/indaux/MEDLINE_baseline/2004/medline04n${NUM}.txt
    • Date created (DA) years: 1999, 2000, 2001
    • ${NUM} identifies the files that include citations with a DA in 1999, 2000, or 2001

    • jds.txt
    • jidTaJds.txt
    • contractions.txt
    • shs.txt

  • Java File & Algorithm:
    • GenerateFilesFromMedLine.java:
      • Read in all fields (PMID, TI, AB, TA, JID, RN, MH) from MEDLINE citations if DA is within the specified range
      • Look up the JDs for each citation through its JID
      • Check if DA (created date) is in the specified years
      • Check if this citation has JDs
        • Send PMID, TI, AB, TA, JID, RNs, MHs, JDs to pmidJd${NUM}.txt
        • Send filtered tokenized words (rules and algorithm) from the title to uiTiWords.${NUM}.txt
        • Send filtered tokenized words (rules and algorithm) from the abstract to uiAbWords.${NUM}.txt
        • Send PMID, JID, JDs to uiJidJds${NUM}.txt
      • Update MH document count and MH-JD document count
      • Update SH document count and SH-JD document count
      • Print out the total document count for MH, MH-JDID, SH, and SH-JDID, respectively:
        • Send MH, MH_DC, JDs, JD_DC to mhStarJd.txt
        • Send SH, SH_DC, JDs, JD_DC to shStarJd.txt

        • Send MH, DC to mhDc.txt
        • Send MH-JDID, DC to mhJdidDc.txt
        • Send SH, DC to shDc.txt
        • Send SH-JDID, DC to shJdidDc.txt
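
The DA check and the document-count updates described above can be sketched roughly as follows. This is a minimal illustration, not the code in GenerateFilesFromMedLine.java: the class name DocCountSketch, the method names, the "|" key separator, the assumption that DA is a YYYYMMDD string, and the sample citations are all hypothetical.

```java
import java.util.*;

class DocCountSketch {
    // Returns true when the DA value (assumed here to be YYYYMMDD)
    // falls in one of the requested years.
    static boolean isDaInYears(String da, Set<Integer> years) {
        if (da == null || da.length() < 4) {
            return false;
        }
        try {
            return years.contains(Integer.parseInt(da.substring(0, 4)));
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Adds one citation to the MH and MH-JD document counts.
    // Each MH and each MH-JD pair is counted at most once per citation,
    // so the totals are document counts rather than occurrence counts.
    static void updateCounts(Map<String, Integer> mhDc,
                             Map<String, Integer> mhJdDc,
                             Collection<String> mhs,
                             Collection<String> jds) {
        Set<String> uniqueJds = new LinkedHashSet<>(jds);
        for (String mh : new LinkedHashSet<>(mhs)) {
            mhDc.merge(mh, 1, Integer::sum);
            for (String jd : uniqueJds) {
                // "|" is an arbitrary separator chosen for this sketch.
                mhJdDc.merge(mh + "|" + jd, 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        Set<Integer> years = new HashSet<>(Arrays.asList(1999, 2000, 2001));
        Map<String, Integer> mhDc = new TreeMap<>();
        Map<String, Integer> mhJdDc = new TreeMap<>();

        // Made-up citation inside the DA range: it is counted.
        if (isDaInYears("19990115", years)) {
            updateCounts(mhDc, mhJdDc,
                    Arrays.asList("Apoptosis", "Neoplasms"),
                    Arrays.asList("Oncology"));
        }
        // Made-up citation outside the DA range: it is skipped.
        if (isDaInYears("20030602", years)) {
            updateCounts(mhDc, mhJdDc,
                    Arrays.asList("Apoptosis"),
                    Arrays.asList("Cell Biology"));
        }

        System.out.println(mhDc);
        System.out.println(mhJdDc);
    }
}
```

The same pattern would apply to the SH and SH-JD counts, with a second pair of maps.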

  • Output File:
  • Notes:
    • Make sure all JDs are defined in both files, jds.txt and jidTaJds.txt. Otherwise, the program generates an error message when it reaches a JD that appears in jidTaJds.txt but not in the JDs list.
    • Some files are formatted so that they can be compared with Susanne's files. They are not used in generating a new training set.
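
The consistency requirement in the first note could be checked before running the program, along these lines. This is only a sketch: the class JdConsistencySketch and the method findUndefinedJds are hypothetical, the layouts of jds.txt and jidTaJds.txt are not described here, so the example works on collections that are assumed to have been loaded from those files already, and the JIDs and JDs shown are made-up sample values.

```java
import java.util.*;

class JdConsistencySketch {
    // Returns every JD that is referenced by some JID but missing from
    // the list of defined JDs, in sorted order.
    static SortedSet<String> findUndefinedJds(Set<String> definedJds,
                                              Map<String, List<String>> jidToJds) {
        SortedSet<String> undefined = new TreeSet<>();
        for (List<String> jds : jidToJds.values()) {
            for (String jd : jds) {
                if (!definedJds.contains(jd)) {
                    undefined.add(jd);
                }
            }
        }
        return undefined;
    }

    public static void main(String[] args) {
        // Stand-in for the JDs list (assumed to come from jds.txt).
        Set<String> definedJds = new HashSet<>(Arrays.asList("Cardiology", "Neurology"));

        // Stand-in for the JID-to-JDs mapping (assumed to come from jidTaJds.txt).
        Map<String, List<String>> jidToJds = new LinkedHashMap<>();
        jidToJds.put("JID-A", Arrays.asList("Cardiology"));
        jidToJds.put("JID-B", Arrays.asList("Neurology", "Oncology"));

        SortedSet<String> undefined = findUndefinedJds(definedJds, jidToJds);
        if (undefined.isEmpty()) {
            System.out.println("All JDs are defined.");
        } else {
            System.out.println("JDs missing from the JDs list: " + undefined);
        }
    }
}
```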