Text Categorization

PreProcess: Files from MEDLINE baseline

  • Description:
    JDI is based on a training set built from MEDLINE citations. The first step in establishing this training set is to extract titles, abstracts, JIDs, JDs, and starred MeSH information from MEDLINE.

  • Input:
    MEDLINE training set for tc2007
    • MEDLINE 2004: /nfsvol/indaux/MEDLINE_baseline/2004/medline04n${NUM}.txt
    • Date created (DA) in the years 1999, 2000, 2001
    • ${NUM} identifies each file that includes citations with a DA in 1999, 2000, or 2001

    • jds.txt
    • jidTaJds.txt
    • contractions.txt
    • shs.txt
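
The date filter described in the input list can be sketched as follows. This is a minimal illustration, not the code in GenerateFilesFromMedLine.java. It assumes the baseline files use a tagged text layout in which the date created appears on a line such as `DA  - 19990415`; the class and method names (`DaYearFilterSketch`, `yearOfDaLine`, `isTrainingYear`) are hypothetical.

```java
class DaYearFilterSketch {
    // Years named in the input description above.
    static final int FIRST_YEAR = 1999;
    static final int LAST_YEAR = 2001;

    // Returns the 4-digit year of a "DA  - YYYYMMDD" line (assumed layout),
    // or -1 if the line is not a DA field or carries no parsable year.
    static int yearOfDaLine(String line) {
        if (line == null) {
            return -1;
        }
        int dash = line.indexOf("- ");
        if (dash < 0 || line.length() < dash + 6) {
            return -1;
        }
        // The field tag is whatever precedes the first "- ", without padding.
        String tag = line.substring(0, dash).trim();
        if (!tag.equals("DA")) {
            return -1;
        }
        try {
            return Integer.parseInt(line.substring(dash + 2, dash + 6));
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    // True when the year falls in the training range 1999 to 2001.
    static boolean isTrainingYear(int year) {
        return year >= FIRST_YEAR && year <= LAST_YEAR;
    }

    public static void main(String[] args) {
        // Made-up sample lines, for illustration only.
        String[] samples = { "DA  - 19990415", "DA  - 20030102", "AB  - text - 1999" };
        for (String line : samples) {
            int year = yearOfDaLine(line);
            System.out.println(line + " => keep: " + isTrainingYear(year));
        }
    }
}
```

Returning -1 for anything that is not a DA line lets the caller treat malformed or unrelated lines the same way as out-of-range dates.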

  • Java File & Algorithm:
    • GenerateFilesFromMedLine.java:
      • Read in all fields (PMID, TI, AB, TA, JID, RN, MH) from MEDLINE citations if DA is within the specified range
      • Look up JD information by JID for each citation
      • Check whether DA (date created) is in the specified years
      • Check whether this citation has JDs
        • Send PMID, TI, AB, TA, JID, RNs, MHs, JDs to pmidJd${NUM}.txt
        • Send filtered tokenized words (rules and algorithm) from the title to uiTiWords.${NUM}.txt
        • Send filtered tokenized words (rules and algorithm) from the abstract to uiAbWords.${NUM}.txt
        • Send PMID, JID, JDs to uiJidJds${NUM}.txt
      • Update MH document count and MH-JD document count
      • Update SH document count and SH-JD document count
      • Print out the total document counts for MH, MH-JDID, SH, and SH-JDID, respectively:
        • Send MH, MH_DC, JDs, JD_DC to mhStarJd.txt
        • Send SH, SH_DC, JDs, JD_DC to shStarJd.txt

        • Send MH, DC to mhDc.txt
        • Send MH-JDID, DC to mhJdidDc.txt
        • Send SH, DC to shDc.txt
        • Send SH-JDID, DC to shJdidDc.txt
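
The document-count updates in the steps above can be sketched as follows. This is an assumption-laden illustration rather than the actual implementation: it treats a document count as the number of citations in which a term (or term and JD pair) occurs at least once, it keys pairs on JD names joined with `|` instead of JD IDs, and the class name `MhJdCountSketch`, its fields, and the sample terms are all hypothetical. The same logic would apply to SH and SH-JD counts.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class MhJdCountSketch {
    // Document count per MH and per MH-JD pair; TreeMap keeps keys sorted.
    final Map<String, Integer> mhDc = new TreeMap<String, Integer>();
    final Map<String, Integer> mhJdDc = new TreeMap<String, Integer>();

    // Counts one citation. Duplicates within the citation are collapsed first,
    // so each distinct MH and each distinct MH-JD pair adds at most 1.
    void addCitation(List<String> starredMhs, List<String> jds) {
        Set<String> uniqueMhs = new LinkedHashSet<String>(starredMhs);
        Set<String> uniqueJds = new LinkedHashSet<String>(jds);
        for (String mh : uniqueMhs) {
            mhDc.merge(mh, 1, Integer::sum);
            for (String jd : uniqueJds) {
                // "|" as the pair separator is an assumption of this sketch.
                mhJdDc.merge(mh + "|" + jd, 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        MhJdCountSketch counts = new MhJdCountSketch();
        // Made-up sample citations, for illustration only.
        counts.addCitation(Arrays.asList("Heart", "Heart", "Aorta"),
                           Arrays.asList("Cardiology"));
        counts.addCitation(Arrays.asList("Heart"),
                           Arrays.asList("Cardiology", "Physiology"));
        for (Map.Entry<String, Integer> e : counts.mhDc.entrySet()) {
            System.out.println(e.getKey() + "|" + e.getValue());
        }
        for (Map.Entry<String, Integer> e : counts.mhJdDc.entrySet()) {
            System.out.println(e.getKey() + "|" + e.getValue());
        }
    }
}
```

Collapsing duplicates per citation before counting is what makes these document counts rather than raw term frequencies.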

  • Output File:
  • Notes:
    • Make sure all JDs are defined in both files: jds.txt and jidTaJds.txt. Otherwise, this program generates an error message when it reaches a JD that appears in jidTaJds.txt but not in the JDs list.
    • Some files are formatted for comparison with Susanne's files. They are not used in generating a new training set.
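
The consistency requirement in the first note can be checked before a run with a sketch like the one below. It is a hedged illustration only: the layouts of jds.txt and jidTaJds.txt are not described here, so the sketch works on data already loaded into memory, and the class name `JdConsistencySketch`, the method `undefinedJds`, and the sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class JdConsistencySketch {
    // Returns, in sorted order, every JD referenced by some JID
    // that is missing from the set of defined JDs.
    static List<String> undefinedJds(Set<String> definedJds,
                                     Map<String, List<String>> jidToJds) {
        Set<String> missing = new TreeSet<String>();
        for (List<String> jds : jidToJds.values()) {
            for (String jd : jds) {
                if (!definedJds.contains(jd)) {
                    missing.add(jd);
                }
            }
        }
        return new ArrayList<String>(missing);
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for the contents of the two files.
        Set<String> defined = new HashSet<String>(Arrays.asList("Cardiology"));
        Map<String, List<String>> jidToJds = new HashMap<String, List<String>>();
        jidToJds.put("JID_A", Arrays.asList("Cardiology", "Physiology"));
        jidToJds.put("JID_B", Arrays.asList("Cardiology"));
        System.out.println("Undefined JDs: " + undefinedJds(defined, jidToJds));
    }
}
```

An empty result means every JD referenced through a JID is defined, which is the condition the note asks for.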