Text Categorization

PreProcess - JDI, Phase I

I. Word-Jdid-Wc-Dc
The JDI training data set is obtained from MEDLINE records. The detailed procedure for obtaining it is as follows:

  1. Retrieve MEDLINE records for the target years:
    For the 2004 data, we retrieve the 1999, 2000, and 2001 MEDLINE records.
    • Source: /net/indfiler/vol/vol3/aux/MEDLINE_baseline/2004/medline04nxxxx*.txt
    • Program: retrieve the PMID, TI, AB, TA, JID, RN, and MH fields if the year (DA field) is within the target range
    • Target: suesun:/export/home/humphrey/JD/data/2004/PMID/ALL/*.l (*.table)

    • New target: /export/home/lu/Development/TC/tc2006/data/MedLine/2004/TrainSet/Results/PMIDJD
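
The year filter in this step can be sketched as follows. This is a minimal illustration, not the program actually used: it assumes the source files use the tagged MEDLINE text layout ("TAG - value", continuation lines indented, records separated by blank lines) and that DA holds a date starting with a four-digit year. The names parse_records and select_record are hypothetical.

```python
KEEP_FIELDS = {"PMID", "TI", "AB", "TA", "JID", "RN", "MH"}
TARGET_YEARS = {"1999", "2000", "2001"}


def parse_records(lines):
    """Yield one dict per record, mapping tag -> list of values.

    Assumed layout: 4-character tag, then "- ", then the value;
    continuation lines are indented; a blank line ends a record.
    """
    record, tag = {}, None
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if record:
                yield record
            record, tag = {}, None
        elif line[:4].strip() and line[4:6] == "- ":
            tag = line[:4].strip()
            record.setdefault(tag, []).append(line[6:].strip())
        elif tag is not None:
            # Continuation of the previous field value.
            record[tag][-1] += " " + line.strip()
    if record:
        yield record


def select_record(record, years=TARGET_YEARS):
    """Return the kept fields if the DA year is in range, else None."""
    da = record.get("DA", [""])[0]
    if da[:4] not in years:
        return None
    return {t: v for t, v in record.items() if t in KEEP_FIELDS}
```

Records outside the target years return None, so a caller can simply drop them before writing the target files.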

  2. Add Journal Descriptors to retrieved records:
    This step adds JDs to the retrieved records by combining the two sources below: the retrieved records and the journal-to-JD file.
    • Source: suesun:/export/home/humphrey/JD/data/2004/PMID/ALL/*.l
    • Source: suesun:/aux/humphrey/JD/serfile/ser/jid-ta-jd.in.20031201.mod.fixed.l
    • Target: suesun:/export/home/humphrey/JD/data/2004/PMID/PMIDWJD/*.l
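
Adding JDs amounts to a lookup of each record's JID in the journal-to-JD file. The sketch below assumes that file has already been loaded into a dict from JID to a list of JDs (its on-disk layout is not described here), that each record is a dict with a single "JID" string, and that records whose journal has no JD are skipped. All three are assumptions, and add_journal_descriptors is a hypothetical name.

```python
def add_journal_descriptors(records, jid_to_jds):
    """Yield a copy of each record with a 'JD' list attached.

    Records whose JID has no Journal Descriptors are skipped
    (an assumption; the original program may treat them differently).
    """
    for record in records:
        jds = jid_to_jds.get(record.get("JID"))
        if jds:
            yield {**record, "JD": list(jds)}
```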

  3. Retrieve words by splitting the records into 3 files, one per field: TI, AB, JD
    • Source: suesun:/export/home/humphrey/JD/data/2004/PMID/PMIDWJD/*.l
    • Program: based on the word tokenizer rules and algorithm
    • Target: suesun:/export/home/humphrey/JD/data/2004/PMID/ALLWRODS/ui-ti-w.xxxx.l
    • Target: suesun:/export/home/humphrey/JD/data/2004/PMID/ALLWRODS/ui-ab-w.xxxx.l
    • Target: suesun:/export/home/humphrey/JD/data/2004/PMID/ALLWRODS/ui-jid-jd.xxxx.l
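
The split into three files can be sketched as below. The tokenizer here is only a stand-in (lowercase runs of letters and digits); it is not the word tokenizer rules referred to above. The row layouts, (ui, word) and (ui, jid, jd), are likewise assumptions chosen to match the target file names.

```python
import re

# Stand-in rule, NOT the project's tokenizer: lowercase alphanumeric runs.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


def split_record(record):
    """Return (ti_rows, ab_rows, jd_rows) for one record.

    ti_rows and ab_rows hold (ui, word) pairs; jd_rows holds
    (ui, jid, jd) triples. The record is assumed to be a dict with
    string values for PMID, TI, AB, JID and a list of JDs under "JD".
    """
    ui = record["PMID"]
    ti_rows = [(ui, w) for w in tokenize(record.get("TI", ""))]
    ab_rows = [(ui, w) for w in tokenize(record.get("AB", ""))]
    jd_rows = [(ui, record.get("JID", ""), jd) for jd in record.get("JD", [])]
    return ti_rows, ab_rows, jd_rows
```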

  4. Calculate Word vs. Jd, word count and document count (w-jd-wc-dc)
    • Program: for each word, calculate the word count and document count per Journal Descriptor:
      • Combine ui-ti-w.xxxx.l and ui-ab-w.xxxx.l into ui-ti-ab-w.xxxx.l
      • Load all words from the total count file
      • Load the PMID (ui) and JD data from file
      • Get the JDs of each document by its PMID
      • For each word in the all-words list, add its word count to the total for each JD of the document
      • For each word in the all-words list, add one to the document count for each JD of the document
      • Print out w-jd-wc-dc sorted by word (alphabetically), then document count, then JD ID number
    • Target: w-jd.tot.abbr.gt1.l
      Columns: Word, Jd Id, Word Count, Document Count
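
The counting in this step can be sketched as follows, assuming the combined title and abstract words and the JDs are already in memory as dicts keyed by PMID (ui). The function name and data shapes are hypothetical, and the ascending sort direction for document count is an assumption, since the text only names the sort keys.

```python
from collections import Counter, defaultdict


def word_jd_counts(doc_words, doc_jds, vocabulary):
    """Return rows (word, jd, word_count, doc_count).

    doc_words:  ui -> list of words (title + abstract)
    doc_jds:    ui -> list of JD identifiers
    vocabulary: set of words to keep (the "all words" list)
    """
    wc = defaultdict(int)
    dc = defaultdict(int)
    for ui, words in doc_words.items():
        counts = Counter(w for w in words if w in vocabulary)
        for jd in doc_jds.get(ui, []):
            for word, n in counts.items():
                wc[(word, jd)] += n   # every occurrence counts
                dc[(word, jd)] += 1   # each document counts once
    rows = [(w, jd, wc[(w, jd)], dc[(w, jd)]) for (w, jd) in wc]
    # Sort by word, then document count, then JD identifier.
    rows.sort(key=lambda r: (r[0], r[3], r[1]))
    return rows
```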

  5. Calculate words and normalized total word count (w-signal.lw.gt1.l)
    • Source: suesun:/export/home/humphrey/JD/data/2004/PMID/ui-wtot.xxxx.gt1.l
    • Source: total word count w-wc-dc.tot.gt1.l
    • Program: calculate the normalized word count
      • calculate the local word count of the word in each document
      • get the global word count from the total word count file
      • noise = (local word count / global word count) * log2(global word count / local word count)
      • signal = log2(global word count) - noise
      • signal weight = local word count * signal
      • normalized signal = log2(1 + signal weight)
      • word normalized signal = sum of the per-document normalized signals for the word
    • Target: w-signal-1w.gt1.l
      Columns: Word, Word Normalized Count
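
The formulas in this step translate directly into code. The sketch below follows them literally, one document at a time, and sums the result over the documents in which the word occurs; the function names are hypothetical and file loading is left out.

```python
import math


def normalized_signal(local_count, global_count):
    """Normalized signal of one word in one document."""
    ratio = local_count / global_count
    noise = ratio * math.log2(global_count / local_count)
    signal = math.log2(global_count) - noise
    weight = local_count * signal
    return math.log2(1 + weight)


def word_normalized_signal(local_counts, global_count):
    """Sum the per-document normalized signals for one word.

    local_counts: the word's count in each document where it occurs
    global_count: the word's total count over all documents
    """
    return sum(normalized_signal(c, global_count) for c in local_counts if c > 0)
```

For example, a word with global count 8 that occurs twice in one document has noise 0.25 * log2(4) = 0.5, signal 3 - 0.5 = 2.5, signal weight 5, and normalized signal log2(6) for that document.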

  6. Calculate JD vs. document count (jd-dc.gt1.l)
    • Source: suesun:/export/home/humphrey/JD/data/2004/PMID/ALLWRODS/ui-jid-jd.xxxx.l
    • Source: JournalDescriptors
    • Program: calculate the document count for all Journal Descriptors
      • Go through all ui-jid-jd.xxxx.l files and count the documents for each Journal Descriptor
      • Print out jd-dc sorted by JD ID
    • Target: "jd-dc.gt1.l": contains JD names and document count. This file is used to normalize the score.
      Columns: Jd Name, Document Count
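
A minimal sketch of the JD document count, assuming the ui-jid-jd rows have been read into (ui, jid, jd) triples. That layout and the function name are assumptions. Each document is counted once per JD, even if the same triple appears more than once.

```python
from collections import defaultdict


def jd_document_counts(jd_rows):
    """Return (jd, document_count) pairs sorted by JD identifier."""
    docs_per_jd = defaultdict(set)
    for ui, _jid, jd in jd_rows:
        docs_per_jd[jd].add(ui)
    return sorted((jd, len(uis)) for jd, uis in docs_per_jd.items())
```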

II. Mh-Jdid-Dc & Sh-Jdid-Dc
The training data set is calculated from the MeSH headings (MH) and subheadings (SH) of MEDLINE records. The detailed procedure for obtaining it is as follows:

  1. Retrieve MEDLINE records for the target years:
    For the 2004 data, we retrieve the 1999, 2000, and 2001 MEDLINE records.
    • Source: /net/indfiler/vol/vol3/aux/MEDLINE_baseline/2004/medline04nxxxx*.txt
    • Program: retrieve the PMID, TI, AB, TA, JID, RN, and MH fields if the year (DA field) is within the target range
    • Target: suesun:/export/home/humphrey/JD/data/2004/PMID/ALL/*.l (*.table)
    • Same as for the JD word and document counts (Section I)

  2. Get all Journal Descriptors for all Journals:
    • This information is in the file: suesun:/aux/humphrey/JD/serfile/ser/jid-ta-jd.in.20031201.mod.fixed.l
    • The above file can be derived from the following (TBD):
      • /aux/humphrey/JD/serfile/ser/2005/lsi2005.xml
      • /aux/humphrey/JD/serfile/ser/2005/rule
    • Same as for the JD word and document counts (Section I)

  3. Add Journal Descriptors to retrieved records:
    This step adds JDs to the retrieved records by combining the results of the two steps above.
    • Source: suesun:/export/home/humphrey/JD/data/2004/PMID/ALL/*.l
    • Source: suesun:/aux/humphrey/JD/serfile/ser/jid-ta-jd.in.20031201.mod.fixed.l
    • Source: suesun:/export/home/humphrey/JD/data/2004/PMID/ALLWORDS/ui-jid-jd.*.l
    • Target: suesun:/export/home/humphrey/JD/data/2004/MH/*.l

  4. Retrieve and count the starred MeSH headings and subheadings by splitting the records into 2 files based on the MH field
    • Source: suesun:/export/home/humphrey/JD/data/2004/MH/*.l
    • Source: suesun:/export/home/humphrey/JD/sh-abbrev.l

    • Program & algorithm:
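      • extract each MH in the MH field that carries a star, either on the heading itself (*MH) or on one of its subheadings (*SH)
      • for each MH, update the count for every JD of the record (looked up from its JID)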

      • Columns: Heading, Total doc count of this MH, Total doc count of all JDs

      • extract only the subheadings that carry a star (*SH)
      • translate each SH into its subheading abbreviation
      • for each SH, update the count for every JD of the record (looked up from its JID)

      • Columns: SubHeading, Total doc count of this SH, Total doc count of all JDs

    • Target: suesun:/export/home/humphrey/JD/data/2004/MH/mhstar-jd.l
    • Target: suesun:/export/home/humphrey/JD/data/2004/MH/shstar-jd.l
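
The starred heading and subheading counts can be sketched as below. It assumes each MH value has the form "Heading/subheading/..." with a leading * on starred parts, that records already carry a "JD" list, and that the subheading abbreviations have been loaded into a dict; an unknown subheading is kept unabbreviated. These shapes and both function names are assumptions, and each document counts once per heading and JD.

```python
from collections import defaultdict


def starred_terms(mh_value):
    """Return (heading or None, starred subheadings) for one MH value.

    The heading is kept if it, or any of its subheadings, is starred.
    """
    heading, *subs = mh_value.split("/")
    starred_subs = [s.lstrip("*") for s in subs if s.startswith("*")]
    is_major = heading.startswith("*") or bool(starred_subs)
    return (heading.lstrip("*") if is_major else None), starred_subs


def count_starred(records, sh_abbrev):
    """Return (mh_counts, sh_counts), each mapping term -> jd -> doc count."""
    mh_counts = defaultdict(lambda: defaultdict(int))
    sh_counts = defaultdict(lambda: defaultdict(int))
    for record in records:
        headings, subheadings = set(), set()
        for value in record.get("MH", []):
            heading, subs = starred_terms(value)
            if heading:
                headings.add(heading)
            # Translate each starred subheading to its abbreviation.
            subheadings.update(sh_abbrev.get(s, s) for s in subs)
        for jd in record.get("JD", []):
            for h in headings:
                mh_counts[h][jd] += 1
            for s in subheadings:
                sh_counts[s][jd] += 1
    return mh_counts, sh_counts
```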