Text Categorization

PreProcess - JDI, Phase I

I. Word-Jdid-Wc-Dc
The JDI training data set is obtained from MEDLINE records. The detailed procedure for obtaining it is as follows:

  1. Retrieve MEDLINE records for the target years:
    For the 2004 data, records from 1999, 2000, and 2001 are retrieved.
    • Source: /net/indfiler/vol/vol3/aux/MEDLINE_baseline/2004/medline04nxxxx*.txt
    • Program: retrieve the PMID, TI, AB, TA, JID, RN, and MH fields if the year (DA field) is within the target range
    • Target: suesun:/export/home/humphrey/JD/data/2004/PMID/ALL/*.l (*.table)

    • New target: /export/home/lu/Development/TC/tc2006/data/MedLine/2004/TrainSet/Results/PMIDJD

  2. Add Journal Descriptors to retrieved records:
    This step adds JDs to the retrieved records by joining them with the journal-to-JD mapping file.
    • Source: suesun:/export/home/humphrey/JD/data/2004/PMID/ALL/*.l
    • Source: suesun:/aux/humphrey/JD/serfile/ser/jid-ta-jd.in.20031201.mod.fixed.l
    • Target: suesun:/export/home/humphrey/JD/data/2004/PMID/PMIDWJD/*.l
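
Adding JDs to records amounts to a lookup keyed on the journal ID (JID). The sketch below assumes the records and the mapping file have already been parsed into Python dicts; the layout of jid-ta-jd.in.20031201.mod.fixed.l is not described here, so no parser is shown. Dropping records whose journal has no JD is also an assumption, and the JD name in the usage note is an example only.

```python
from typing import Dict, List


def add_journal_descriptors(
    records: List[Dict[str, List[str]]],
    jid_to_jds: Dict[str, List[str]],
) -> List[Dict[str, List[str]]]:
    """Return copies of the records with a 'JD' field added.

    records    : parsed records, each with a 'JID' field (list of one value)
    jid_to_jds : hypothetical in-memory form of the JID -> JDs mapping file
    """
    out = []
    for rec in records:
        jid = rec.get("JID", [None])[0]
        jds = jid_to_jds.get(jid)
        if jds:  # assumption: records without any JD are skipped
            out.append({**rec, "JD": list(jds)})
    return out
```

For example, a record with `"JID": ["0001"]` and a mapping `{"0001": ["Cardiology"]}` yields the same record with `"JD": ["Cardiology"]` attached.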

  3. Retrieve words by splitting the records into three files, one per field: TI, AB, JD
    • Source: suesun:/export/home/humphrey/JD/data/2004/PMID/PMIDWJD/*.l
    • Program: based on the word tokenizer rules and algorithm
    • Target: suesun:/export/home/humphrey/JD/data/2004/PMID/ALLWRODS/ui-ti-w.xxxx.l
    • Target: suesun:/export/home/humphrey/JD/data/2004/PMID/ALLWRODS/ui-ab-w.xxxx.l
    • Target: suesun:/export/home/humphrey/JD/data/2004/PMID/ALLWRODS/ui-jid-jd.xxxx.l

  4. Calculate word count and document count for each word and JD (w-jd-wc-dc)
    • Program: calculate word count and document count of Journal Descriptors for all words:
      • Combine ui-ti-w.xxxx.l and ui-ab-w.xxxx.l to ui-ti-ab-w.xxxx.l
      • Load all words from total count file
      • Load PMID (ui) and JDs data from file
      • Get the JDs for each PMID
      • Calculate the total word count per JD for every word in the all-words list
      • Calculate the total document count per JD for every word in the all-words list
      • Print w-jd-wc-dc sorted by word (alphabetically), then document count, then JD ID number
    • Target: w-jd.tot.abbr.gt1.l
      Word | Jd Id | Word Count | Document Count
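
The counting in this step can be sketched as below. It is an illustration under stated assumptions: the per-document word lists (TI and AB combined), the PMID-to-JD map, and the all-words list are taken to be already loaded into memory, and the ascending sort direction for document count is a guess. All names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def word_jd_counts(
    doc_words: Dict[str, List[str]],   # PMID -> words from TI and AB
    doc_jds: Dict[str, List[str]],     # PMID -> JDs of the document's journal
    vocabulary: Set[str],              # the "all words" list
) -> Tuple[Counter, Counter]:
    """Return (word count, document count), both keyed by (word, JD)."""
    wc: Counter = Counter()
    dc: Counter = Counter()
    for ui, words in doc_words.items():
        local = Counter(w for w in words if w in vocabulary)
        for word, n in local.items():
            for jd in doc_jds.get(ui, []):
                wc[(word, jd)] += n    # every occurrence counts
                dc[(word, jd)] += 1    # each document counts once
    return wc, dc


def sorted_rows(wc: Counter, dc: Counter) -> List[Tuple[str, str, int, int]]:
    """Rows of (word, JD, word count, document count), sorted by word,
    then document count, then JD (ascending order is an assumption)."""
    rows = [(w, jd, wc[(w, jd)], dc[(w, jd)]) for (w, jd) in wc]
    return sorted(rows, key=lambda r: (r[0], r[3], r[1]))
```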

  5. Calculate words and normalized total word count (w-signal.lw.gt1.l)
    • Source: suesun:/export/home/humphrey/JD/data/2004/PMID/ui-wtot.xxxx.gt1.l
    • Source: total word count w-wc-dc.tot.gt1.l
    • Program: calculate word normalized count
      • calculate local word count for each document
      • get global word count from total word count file
      • get noise = (local word count/global word count) * log2(global word count/local word count)
      • get signal = log2(global word count) - noise
      • get signal weight = local word count * signal
      • get normalized signal = log2(1 + signal weight)
      • word normalized signal = summation of (normalized signal) for the word
    • Target: w-signal-1w.gt1.l
      Word | Word Normalized Count
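
The signal calculation listed above can be written out as a short sketch. It follows the bullets literally, computing noise and signal per document and then summing the normalized signal over documents. The function name and the input shape (a list of per-document counts for one word) are assumptions for illustration.

```python
import math
from typing import Iterable


def word_normalized_signal(local_counts: Iterable[int], global_count: int) -> float:
    """Sum of normalized signals for one word.

    local_counts : occurrences of the word in each document containing it
    global_count : total occurrences of the word over all documents
    """
    total = 0.0
    for local in local_counts:
        if local <= 0:
            continue  # documents without the word contribute nothing
        noise = (local / global_count) * math.log2(global_count / local)
        signal = math.log2(global_count) - noise
        signal_weight = local * signal
        total += math.log2(1 + signal_weight)
    return total
```

As a check by hand: a word that occurs twice in each of two documents has global count 4, so noise = 0.5 * log2(2) = 0.5, signal = 2 - 0.5 = 1.5, signal weight = 3, and normalized signal = log2(4) = 2 per document, giving 4 in total.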

  6. Calculate JD vs. document count (jd-dc.gt1.l)
    • Source: suesun:/export/home/humphrey/JD/data/2004/PMID/ALLWRODS/ui-jid-jd.xxxx.l
    • Source: JournalDescriptors
    • Program: calculate document count for all Journal Descriptors
      • Go through all ui-jid-jd.xxxx.l files and calculate the document count for each Journal Descriptor
      • Print jd-dc sorted by JD ID
    • Target: "jd-dc.gt1.l": contains JD names and document count. This file is used to normalize the score.
      Jd Name | Document Count
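
The document count per JD is a plain tally. A minimal sketch, assuming the ui-jid-jd.xxxx.l files have been read into a PMID-to-JDs map (the file layout is not specified here, and the function name is hypothetical):

```python
from collections import Counter
from typing import Dict, List, Tuple


def jd_document_counts(doc_jds: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """Return (JD, document count) pairs sorted by JD identifier."""
    counts: Counter = Counter()
    for jds in doc_jds.values():
        counts.update(set(jds))  # a document counts once per JD
    return sorted(counts.items())
```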

II. Mh-Jdid-Dc & Sh-Jdid-Dc
This training data set is calculated from the MeSH headings (MH) and subheadings (SH) of MEDLINE records. The detailed procedure for obtaining it is as follows:

  1. Retrieve MEDLINE records for the target years:
    For the 2004 data, records from 1999, 2000, and 2001 are retrieved.
    • Source: /net/indfiler/vol/vol3/aux/MEDLINE_baseline/2004/medline04nxxxx*.txt
    • Program: retrieve the PMID, TI, AB, TA, JID, RN, and MH fields if the year (DA field) is within the target range
    • Target: suesun:/export/home/humphrey/JD/data/2004/PMID/ALL/*.l (*.table)
    • Same as for the JD word and document counts (Section I)

  2. Get all Journal Descriptors for all Journals:
    • This information is in the file: suesun:/aux/humphrey/JD/serfile/ser/jid-ta-jd.in.20031201.mod.fixed.l
    • The above file can be derived from (TBD)
      • /aux/humphrey/JD/serfile/ser/2005/lsi2005.xml
      • /aux/humphrey/JD/serfile/ser/2005/rule
    • Same as for the JD word and document counts (Section I)

  3. Add Journal Descriptors to retrieved records:
    This step adds JDs to the retrieved records by combining the outputs of the two steps above.
    • Source: suesun:/export/home/humphrey/JD/data/2004/PMID/ALL/*.l
    • Source: suesun:/aux/humphrey/JD/serfile/ser/jid-ta-jd.in.20031201.mod.fixed.l
      or
    • Source: suesun:/export/home/humphrey/JD/data/2004/PMID/ALLWORDS/ui-jid-jd.*.l
    • Target: suesun:/export/home/humphrey/JD/data/2004/MH/*.l

  4. Retrieve and count starred MeSH headings and subheadings by filtering records into two files based on the MH field
    • Source: suesun:/export/home/humphrey/JD/data/2004/MH/*.l
    • Source: suesun:/export/home/humphrey/JD/sh-abbrev.l

    • Program & algorithm:
      • tokenize MH with * in MH field (include both *MH and *SH)
      • update count in all JDs (from JID) for each MH

      • Heading | Total doc count of this MH | Total doc count of all JDs

      • tokenize SH only with * in the *SH
      • translate SH into sub-heading abbreviation
      • update count in all JDs (from JID) for each SH

      • SubHeading | Total doc count of this SH | Total doc count of all JDs

    • Target: suesun:/export/home/humphrey/JD/data/2004/MH/mhstar-jd.l
    • Target: suesun:/export/home/humphrey/JD/data/2004/MH/shstar-jd.l
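
The starred MH/SH counting in step 4 can be sketched as follows. This is an illustration, not the original program. It assumes each MH value has the form "Heading/subheading/subheading" with "*" marking major topics, that a heading counts as starred when either it or one of its subheadings carries the star, and that each record already has a JD field. The abbreviation table passed in stands for the contents of sh-abbrev.l, whose layout is not described here; all function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def starred_terms(mh_value: str) -> Tuple[Optional[str], List[str]]:
    """Split one MH value into (starred heading or None, starred subheadings)."""
    parts = mh_value.split("/")
    starred_subs = [p[1:] for p in parts[1:] if p.startswith("*")]
    is_starred = parts[0].startswith("*") or bool(starred_subs)
    heading = parts[0].lstrip("*") if is_starred else None
    return heading, starred_subs


def count_starred(
    records: List[Dict[str, List[str]]],
    sh_abbrev: Dict[str, str],
) -> Tuple[Counter, Counter]:
    """Return document counts keyed by (MH, JD) and by (SH abbreviation, JD)."""
    mh_jd: Counter = Counter()
    sh_jd: Counter = Counter()
    for rec in records:
        mhs, shs = set(), set()
        for value in rec.get("MH", []):
            heading, subs = starred_terms(value)
            if heading:
                mhs.add(heading)
            # unknown subheadings are kept unabbreviated (an assumption)
            shs.update(sh_abbrev.get(s, s) for s in subs)
        for jd in rec.get("JD", []):
            for mh in mhs:
                mh_jd[(mh, jd)] += 1   # each document counts once per MH and JD
            for sh in shs:
                sh_jd[(sh, jd)] += 1
    return mh_jd, sh_jd
```

For example, the MH value "Neoplasms/*drug therapy/pathology" yields the starred heading "Neoplasms" and the starred subheading "drug therapy", which is then replaced by its abbreviation if the table contains one.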