Text Categorization

PreProcess - JDI, Phase I

I. Word-Jdid-Wc-Dc
The JDI training data set is obtained from MEDLINE records. The detailed procedure for obtaining the training data set is as follows:

Retrieve MEDLINE records for the target years:
For the 2004 data, we retrieve the MEDLINE records from 1999, 2000, and 2001.
- Source: /net/indfiler/vol/vol3/aux/MEDLINE_baseline/2004/medline04nxxxx*.txt
- Program: retrieve the PMID, TI, AB, TA, JID, RN, and MH fields if the year (DA field) is within the target range
- Target: suesun:/export/home/humphrey/JD/data/2004/PMID/ALL/*.l (*.table)
- New target: /export/home/lu/Development/TC/tc2006/data/MedLine/2004/TrainSet/Results/PMIDJD
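The retrieval step can be sketched as follows. This is a minimal Python sketch, not the program actually used. It assumes the baseline files use the tagged MEDLINE display layout (a short tag, then "- ", with continuation lines indented by six spaces and a blank line between records). The function names parse_records, in_target_years, and select_fields are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List

# Fields kept by the retrieval step, and the target years for the 2004 data.
WANTED_FIELDS = {"PMID", "TI", "AB", "TA", "JID", "RN", "MH"}
TARGET_YEARS = {"1999", "2000", "2001"}


def parse_records(lines: Iterable[str]) -> Iterator[Dict[str, List[str]]]:
    """Yield one dict per record, mapping a field tag to its list of values.

    Assumption: records are separated by blank lines and continuation
    lines start with six spaces.
    """
    record: Dict[str, List[str]] = {}
    last_tag = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():  # a blank line ends the current record
            if record:
                yield record
            record, last_tag = {}, None
            continue
        if line.startswith("      ") and last_tag:  # continuation line
            record[last_tag][-1] += " " + line.strip()
            continue
        tag, sep, value = line.partition("-")
        if not sep:
            continue  # not a "TAG - value" line; skip it
        last_tag = tag.strip()
        record.setdefault(last_tag, []).append(value.strip())
    if record:
        yield record


def in_target_years(record: Dict[str, List[str]], years=TARGET_YEARS) -> bool:
    """True if the year part of the DA field (YYYYMMDD) is a target year."""
    da = record.get("DA", [""])[0]
    return da[:4] in years


def select_fields(record: Dict[str, List[str]], wanted=WANTED_FIELDS):
    """Keep only the fields needed for the training data."""
    return {tag: vals for tag, vals in record.items() if tag in wanted}
```

A caller would iterate over each baseline file, keep the records for which in_target_years is true, and write the result of select_fields to the target directory.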
Add Journal Descriptors to retrieved records:
This step adds JDs to the retrieved records by joining them with the journal-to-JD file listed below.
- Source: suesun:/export/home/humphrey/JD/data/2004/PMID/ALL/*.l
- Source: suesun:/aux/humphrey/JD/serfile/ser/jid-ta-jd.in.20031201.mod.fixed.l
- Target: suesun:/export/home/humphrey/JD/data/2004/PMID/PMIDWJD/*.l
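The join of records and Journal Descriptors might look like the sketch below. The layout of the jid-ta-jd file is not described here, so the sketch assumes a hypothetical pipe-delimited line of the form JID|TA|JD, with one line per JD. Dropping records whose journal has no JD is also an assumption, and both function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def load_jid_to_jds(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Build a JID -> list of JDs map from lines of the assumed form JID|TA|JD."""
    jid_to_jds: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        parts = line.rstrip("\n").split("|")
        if len(parts) < 3:
            continue  # skip lines that do not match the assumed layout
        jid, jd = parts[0], parts[2]
        if jd not in jid_to_jds[jid]:
            jid_to_jds[jid].append(jd)
    return dict(jid_to_jds)


def add_jds(records: Iterable[dict], jid_to_jds: Dict[str, List[str]]) -> List[dict]:
    """Attach the JDs of each record's journal under the key "JD"."""
    out = []
    for rec in records:
        jds = jid_to_jds.get(rec.get("JID", ""), [])
        if jds:  # assumption: records without any JD are dropped
            out.append({**rec, "JD": list(jds)})
    return out
```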
Retrieve words by filtering records into 3 files based on the fields TI, AB, and JD:
- Source: suesun:/export/home/humphrey/JD/data/2004/PMID/PMIDWJD/*.l
- Program: based on the word tokenizer rules and algorithm
- Target: suesun:/export/home/humphrey/JD/data/2004/PMID/ALLWRODS/ui-ti-w.xxxx.l
- Target: suesun:/export/home/humphrey/JD/data/2004/PMID/ALLWRODS/ui-ab-w.xxxx.l
- Target: suesun:/export/home/humphrey/JD/data/2004/PMID/ALLWRODS/ui-jid-jd.xxxx.l
Calculate word count and document count for each word and JD (w-jd-wc-dc):
- Program: calculate word count and document count of Journal Descriptors for all words:
  - Combine ui-ti-w.xxxx.l and ui-ab-w.xxxx.l into ui-ti-ab-w.xxxx.l
  - Load all words from the total count file
  - Load PMID (ui) and JD data from file
  - Get the JDs for each PMID
  - Calculate the total word count per JD for every word that is in the all-words list
  - Calculate the total document count per JD for every word that is in the all-words list
  - Print out w-jd-wc-dc sorted by word (alphabetically), then document count, then JD ID number
- Target: w-jd.tot.abbr.gt1.l
  
  Word | Jd Id | Word Count | Document Count
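The counting described above can be sketched in Python as follows. The input shapes (a PMID-to-words map and a PMID-to-JDs map) and the function names are assumptions made for illustration. The direction of the document-count sort is not stated, so descending order is assumed here.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple


def word_jd_counts(
    doc_words: Dict[str, List[str]],
    doc_jds: Dict[str, Iterable[int]],
    vocabulary: Set[str],
) -> Tuple[Counter, Counter]:
    """Return (word count, document count) counters keyed by (word, JD ID).

    doc_words: PMID -> word tokens from TI and AB combined
    doc_jds:   PMID -> JD IDs of the document's journal
    vocabulary: the all-words list; words outside it are ignored
    """
    wc: Counter = Counter()
    dc: Counter = Counter()
    for pmid, words in doc_words.items():
        jds = set(doc_jds.get(pmid, ()))
        local = Counter(w for w in words if w in vocabulary)
        for word, n in local.items():
            for jd in jds:
                wc[(word, jd)] += n  # every occurrence counts
                dc[(word, jd)] += 1  # each document counts once
    return wc, dc


def sorted_rows(wc: Counter, dc: Counter) -> List[Tuple[str, int, int, int]]:
    """Rows of (word, JD ID, word count, document count) in output order."""
    rows = [(w, jd, wc[(w, jd)], dc[(w, jd)]) for (w, jd) in wc]
    # Assumption: word ascending, document count descending, JD ID ascending.
    rows.sort(key=lambda r: (r[0], -r[3], r[1]))
    return rows
```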
Calculate words and normalized total word count (w-signal.lw.gt1.l):
- Source: suesun:/export/home/humphrey/JD/data/2004/PMID/ui-wtot.xxxx.gt1.l
- Source: total word count w-wc-dc.tot.gt1.l
- Program: calculate the normalized word count
  - calculate the local word count for each document
  - get the global word count from the total word count file
  - get noise = (local word count / global word count) * log₂(global word count / local word count)
  - get signal = log₂(global word count) - noise
  - get signal weight = local word count * signal
  - get normalized signal = log₂(1 + signal weight)
  - word normalized signal = sum of the normalized signal over all documents containing the word
- Target: w-signal-1w.gt1.l
  
  Word | Word Normalized Count
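The formulas above translate directly into code. The following Python sketch applies them as written. The function names and the in-memory input shapes are assumptions, and words missing from the global count table are simply skipped.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


def normalized_signal(local_count: int, global_count: int) -> float:
    """Normalized signal of one word in one document, per the steps above."""
    ratio = local_count / global_count
    noise = ratio * math.log2(global_count / local_count)
    signal = math.log2(global_count) - noise
    signal_weight = local_count * signal
    return math.log2(1 + signal_weight)


def word_normalized_signals(
    doc_words: Dict[str, List[str]], global_counts: Dict[str, int]
) -> Dict[str, float]:
    """Sum the normalized signal of each word over all documents.

    doc_words: PMID -> word tokens of the document
    global_counts: word -> total word count over the whole collection
    """
    totals: Dict[str, float] = defaultdict(float)
    for words in doc_words.values():
        for word, local in Counter(words).items():
            g = global_counts.get(word)
            if g:  # assumption: skip words absent from the total count file
                totals[word] += normalized_signal(local, g)
    return dict(totals)
```

As a worked case, a word with local count 2 and global count 4 has noise 0.5 * log₂(2) = 0.5, signal 2 - 0.5 = 1.5, signal weight 3, and normalized signal log₂(4) = 2.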
Calculate JD vs. document count (jd-dc.gt1.l):
- Source: suesun:/export/home/humphrey/JD/data/2004/PMID/ALLWRODS/ui-jid-jd.xxxx.l
- Source: JournalDescriptors
- Program: calculate document count for all Journal Descriptors
  - Go through all ui-jid-jd.xxxx.l files and calculate document count for all Journal descriptors
  - Print out jd-dc in the order of JD-ID
- Target: "jd-dc.gt1.l": contains JD names and document count. This file is used to normalize the score.
  
  Jd Name | Document Count
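A minimal Python sketch of the JD document count follows. It assumes each row of the ui-jid-jd files has already been parsed into a (PMID, JID, JD ID) tuple and that a JD ID to JD name table is available from the JournalDescriptors source. The function name and both input shapes are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def jd_document_counts(
    rows: Iterable[Tuple[str, str, int]], jd_names: Dict[int, str]
) -> List[Tuple[str, int]]:
    """Return (JD name, document count) pairs in JD ID order.

    rows: (PMID, JID, JD ID) tuples; a document is counted once per JD.
    jd_names: JD ID -> JD name (assumed lookup table).
    """
    seen = set()
    counts: Counter = Counter()
    for pmid, _jid, jd in rows:
        if (pmid, jd) not in seen:  # guard against repeated rows
            seen.add((pmid, jd))
            counts[jd] += 1
    return [(jd_names.get(jd, str(jd)), counts[jd]) for jd in sorted(counts)]
```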

II. Mh-Jdid-Dc & Sh-Jdid-Dc
This training data set is calculated from the MeSH headings (MH) and subheadings (SH) in MEDLINE records. The detailed procedure for obtaining the training data set is as follows:

Retrieve MEDLINE records for the target years:
For the 2004 data, we retrieve the MEDLINE records from 1999, 2000, and 2001.
- Source: /net/indfiler/vol/vol3/aux/MEDLINE_baseline/2004/medline04nxxxx*.txt
- Program: retrieve the PMID, TI, AB, TA, JID, RN, and MH fields if the year (DA field) is within the target range
- Target: suesun:/export/home/humphrey/JD/data/2004/PMID/ALL/*.l (*.table)
- Same as in the JD word and document counts (Section I)
Get all Journal Descriptors for all Journals:
- This information is in the file: suesun:/aux/humphrey/JD/serfile/ser/jid-ta-jd.in.20031201.mod.fixed.l
- The above file can be derived from (TBD):
  - /aux/humphrey/JD/serfile/ser/2005/lsi2005.xml
  - /aux/humphrey/JD/serfile/ser/2005/rule
- Same as in the JD word and document counts (Section I)
Add Journal Descriptors to retrieved records:
This step adds JDs to the retrieved records by combining the results of the two steps above.
- Source: suesun:/export/home/humphrey/JD/data/2004/PMID/ALL/*.l
- Source: suesun:/aux/humphrey/JD/serfile/ser/jid-ta-jd.in.20031201.mod.fixed.l
  or
- Source: suesun:/export/home/humphrey/JD/data/2004/PMID/ALLWORDS/ui-jid-jd.*.l
- Target: suesun:/export/home/humphrey/JD/data/2004/MH/*.l
Retrieve and count the starred MeSH headings and subheadings by filtering records into 2 files based on the MH field:
- Source: suesun:/export/home/humphrey/JD/data/2004/MH/*.l
- Source: suesun:/export/home/humphrey/JD/sh-abbrev.l
- Program & algorithm:
  - tokenize the MHs marked with * in the MH field (including both *MH and *SH)
  - update the count in all JDs (from JID) for each MH
  - output: Heading | total doc count of this MH | total doc count of all JDs
  - tokenize only the SHs marked with * (*SH)
  - translate each SH into its subheading abbreviation
  - update the count in all JDs (from JID) for each SH
  - output: SubHeading | total doc count of this SH | total doc count of all JDs
- Target: suesun:/export/home/humphrey/JD/data/2004/MH/mhstar-jd.l
- Target: suesun:/export/home/humphrey/JD/data/2004/MH/shstar-jd.l
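The starred heading and subheading counts can be sketched in Python as below. The sketch assumes MH values have the usual MEDLINE form Heading/subheading/subheading with * marking major topics, and it reads "include both *MH and *SH" as counting a heading when either the heading itself or one of its subheadings is starred. That reading, the record shape, the function names, and the abbreviation table passed in (standing in for sh-abbrev.l, whose layout is not described) are all assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def starred_terms(mh_value: str) -> Tuple[Optional[str], List[str]]:
    """Split one MH value into (starred heading or None, starred subheadings)."""
    heading, *subs = mh_value.split("/")
    starred_subs = [s[1:] for s in subs if s.startswith("*")]
    # Assumption: a star on any subheading also marks the heading as starred.
    if heading.startswith("*") or starred_subs:
        return heading.lstrip("*"), starred_subs
    return None, starred_subs


def count_by_jd(records: Iterable[dict], sh_abbrev: Dict[str, str]):
    """Return (mh_counts, sh_counts): term -> Counter of JD -> document count.

    records: dicts with "MH" (list of MH values) and "JD" (list of JDs).
    sh_abbrev: subheading name -> abbreviation (assumed lookup table).
    """
    mh_counts: Dict[str, Counter] = defaultdict(Counter)
    sh_counts: Dict[str, Counter] = defaultdict(Counter)
    for rec in records:
        headings, subheads = set(), set()
        for mh in rec.get("MH", []):
            heading, subs = starred_terms(mh)
            if heading:
                headings.add(heading)
            subheads.update(sh_abbrev.get(s, s) for s in subs)
        for jd in set(rec.get("JD", [])):  # each document counts once per JD
            for h in headings:
                mh_counts[h][jd] += 1
            for s in subheads:
                sh_counts[s][jd] += 1
    return mh_counts, sh_counts
```

The total document count of a term is then the sum of its per-JD counter, for example sum(mh_counts[heading].values()).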