PreProcess - JDI, Phase I
I. Word-Jdid-Wc-Dc
The JDI training data set is obtained from MEDLINE records. Below are the detailed procedures to obtain the training data set:
- Retrieve MedLine records for the target years:
For the 2004 data, we retrieve the 1999, 2000, and 2001 MedLine records.
- Source: /net/indfiler/vol/vol3/aux/MEDLINE_baseline/2004/medline04nxxxx*.txt
- Program: retrieve the PMID, TI, AB, TA, JID, RN, and MH fields
if the year (DA field) is within the target range
- Target: suesun:/export/home/humphrey/JD/data/2004/PMID/ALL/*.l (*.table)
- New target: /export/home/lu/Development/TC/tc2006/data/MedLine/2004/TrainSet/Results/PMIDJD
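The retrieval step above can be sketched as follows. This is a minimal sketch, not the original program: it assumes the source files use the tagged MEDLINE text layout (a tag padded to four characters, then "- ", with continuation lines indented and records separated by blank lines) and that the DA value starts with a four-digit year. The names `parse_medline`, `select_records`, `WANTED_FIELDS`, and `TARGET_YEARS` are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List

# Fields and years named in the step above; the constant names are illustrative.
WANTED_FIELDS = ("PMID", "TI", "AB", "TA", "JID", "RN", "MH")
TARGET_YEARS = {"1999", "2000", "2001"}


def parse_medline(lines: Iterable[str]) -> Iterator[Dict[str, List[str]]]:
    """Yield one dict per record, mapping a field tag to its list of values."""
    record: Dict[str, List[str]] = {}
    tag = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():                          # blank line ends a record
            if record:
                yield record
            record, tag = {}, None
        elif line[:4].strip() and line[4:6] == "- ":  # new "TAG - value" line
            tag = line[:4].strip()
            record.setdefault(tag, []).append(line[6:].strip())
        elif tag is not None:                         # continuation line
            record[tag][-1] += " " + line.strip()
    if record:
        yield record


def select_records(lines: Iterable[str], years=TARGET_YEARS):
    """Keep records whose DA year is in `years`, reduced to WANTED_FIELDS."""
    for rec in parse_medline(lines):
        da = rec.get("DA", [""])[0]
        if da[:4] in years:                           # assumes DA starts with YYYY
            yield {f: rec[f] for f in WANTED_FIELDS if f in rec}
```

Repeated fields such as MH are kept as lists, so a later step can iterate over every heading of a record.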
- Add Journal Descriptors to retrieved records:
This step adds JDs to the retrieved records by combining the retrieved records with the journal-to-JD file listed below.
- Source: suesun:/export/home/humphrey/JD/data/2004/PMID/ALL/*.l
- Source: suesun:/aux/humphrey/JD/serfile/ser/jid-ta-jd.in.20031201.mod.fixed.l
- Target: suesun:/export/home/humphrey/JD/data/2004/PMID/PMIDWJD/*.l
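Adding JDs to records amounts to a lookup by journal ID. The sketch below is illustrative only: the layout of the jid-ta-jd file is not described here, so it assumes one `JID|TA|JD` triple per line, and it assumes each record is a dict with a `JID` key. The function names `load_jid_to_jds` and `add_jds` are hypothetical.

```python
def load_jid_to_jds(lines, sep="|"):
    """Build {JID: [JD, ...]} from lines assumed to look like 'JID|TA|JD'."""
    jid_to_jds = {}
    for line in lines:
        parts = line.rstrip("\n").split(sep)
        if len(parts) < 3:
            continue                      # skip lines that do not fit the assumed layout
        jid, jd = parts[0], parts[2]
        jds = jid_to_jds.setdefault(jid, [])
        if jd not in jds:                 # a journal may carry several JDs
            jds.append(jd)
    return jid_to_jds


def add_jds(records, jid_to_jds):
    """Return copies of the records with a 'JD' list attached.

    Records whose journal has no known JDs get an empty list.
    """
    return [dict(rec, JD=list(jid_to_jds.get(rec.get("JID"), [])))
            for rec in records]
```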
- Retrieve words by filtering records into 3 files based on fields: TI, AB, JD
- Source: suesun:/export/home/humphrey/JD/data/2004/PMID/PMIDWJD/*.l
- Program: based on the word tokenizer rules and algorithm
- Target: suesun:/export/home/humphrey/JD/data/2004/PMID/ALLWRODS/ui-ti-w.xxxx.l
- Target: suesun:/export/home/humphrey/JD/data/2004/PMID/ALLWRODS/ui-ab-w.xxxx.l
- Target: suesun:/export/home/humphrey/JD/data/2004/PMID/ALLWRODS/ui-jid-jd.xxxx.l
- Calculate Word vs. Jd, word count and document count (w-jd-wc-dc)
- Program: calculate word count and document count of Journal Descriptors for all words:
- Combine ui-ti-w.xxxx.l and ui-ab-w.xxxx.l to ui-ti-ab-w.xxxx.l
- Load all words from total count file
- Load PMID (ui) and JDs data from file
- Get JDs based on PMID
- Calculate the total word count per JD for every word that is in the all-words list
- Calculate the total document count per JD for every word that is in the all-words list
- Print out w-jd-wc-dc sorted by word (alphabetically), then document count, then JD ID number
- Target: w-jd.tot.abbr.gt1.l
| Word | Jd Id | Word Count | Document Count |
|---|---|---|---|
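The w-jd-wc-dc calculation above can be sketched in memory as follows. This is a sketch under stated assumptions, not the original program: documents are given as `{pmid: [word, ...]}` (title and abstract words already combined), JDs as `{pmid: [jd_id, ...]}`, and the output rows are pipe-delimited and sorted ascending on each key. The names `word_jd_counts` and `format_rows` are hypothetical.

```python
from collections import Counter, defaultdict


def word_jd_counts(doc_words, doc_jds, vocabulary):
    """Return {(word, jd_id): [word_count, doc_count]}.

    doc_words:  {pmid: [word, ...]}
    doc_jds:    {pmid: [jd_id, ...]}
    vocabulary: set of words to keep (the all-words list)
    """
    counts = defaultdict(lambda: [0, 0])
    for pmid, words in doc_words.items():
        local = Counter(w for w in words if w in vocabulary)
        for jd in set(doc_jds.get(pmid, ())):
            for word, n in local.items():
                counts[(word, jd)][0] += n   # every occurrence adds to the word count
                counts[(word, jd)][1] += 1   # each document counts once per word and JD
    return counts


def format_rows(counts):
    """Sort by word, then document count, then JD ID (ascending is an assumption)."""
    rows = sorted(counts.items(), key=lambda kv: (kv[0][0], kv[1][1], kv[0][1]))
    return [f"{w}|{jd}|{wc}|{dc}" for (w, jd), (wc, dc) in rows]
```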
- Calculate words and normalized total word count (w-signal.lw.gt1.l)
- Source: suesun:/export/home/humphrey/JD/data/2004/PMID/ui-wtot.xxxx.gt1.l
- Source: total word count w-wc-dc.tot.gt1.l
- Program: calculate word normalized count
- calculate the local word count for each document
- get the global word count from the total word count file
- get noise = (local word count / global word count) * log2(global word count / local word count)
- get signal = log2(global word count) - noise
- get signal weight = local word count * signal
- get normalized signal = log2(1 + signal weight)
- word normalized signal = sum of the normalized signal over all documents containing the word
- Target: w-signal-1w.gt1.l
| Word | Word Normalized Count |
|---|---|
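The signal formulas above translate directly into code. The following is a minimal sketch that applies them exactly as listed, per document, and then sums per word. The input shapes (`{pmid: [word, ...]}` and `{word: global_count}`) and the function name `word_normalized_signal` are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def word_normalized_signal(doc_words, global_counts):
    """Return {word: sum over documents of log2(1 + local * signal)}.

    doc_words:     {pmid: [word, ...]}
    global_counts: {word: total count of the word over all documents}
    """
    totals = defaultdict(float)
    for words in doc_words.values():
        for word, local in Counter(words).items():
            glob = global_counts.get(word)
            if not glob:
                continue                      # word missing from the total count file
            noise = (local / glob) * math.log2(glob / local)
            signal = math.log2(glob) - noise
            weight = local * signal           # signal weight
            totals[word] += math.log2(1 + weight)
    return dict(totals)
```

For example, a word that occurs twice in each of two documents (global count 4) has noise 0.5, signal 1.5, and signal weight 3 in each document, so each document contributes log2(4) = 2 and the word's normalized signal is 4. A word whose only occurrence is a single token has signal log2(1) = 0 and therefore contributes nothing.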
- Calculate JD vs. document count (jd-dc.gt1.l)
- Source: suesun:/export/home/humphrey/JD/data/2004/PMID/ALLWRODS/ui-jid-jd.xxxx.l
- Source: JournalDescriptors
- Program: calculate document count for all Journal Descriptors
- Go through all ui-jid-jd.xxxx.l files and calculate document count for all Journal descriptors
- Print out jd-dc sorted by JD ID
- Target: "jd-dc.gt1.l": contains JD names and document counts. This file is used to normalize the score.
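The JD document count above is a simple tally. A minimal sketch, assuming the ui-jid-jd data has already been loaded as `{pmid: [jd_id, ...]}` (the function name `jd_document_counts` is hypothetical):

```python
from collections import Counter


def jd_document_counts(doc_jds):
    """Return [(jd_id, document_count), ...] sorted by JD ID.

    doc_jds: {pmid: [jd_id, ...]}
    """
    counts = Counter()
    for jds in doc_jds.values():
        counts.update(set(jds))   # a document counts once per JD, even if listed twice
    return sorted(counts.items())
```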
II. Mh-Jdid-Dc & Sh-Jdid-Dc
This training data set is calculated from the MeSH Headings (MH) and Subheadings (SH) in MedLine records. Below are the detailed procedures to obtain the training data set:
- Retrieve MedLine records for the target years:
For the 2004 data, we retrieve the 1999, 2000, and 2001 MedLine records.
- Source: /net/indfiler/vol/vol3/aux/MEDLINE_baseline/2004/medline04nxxxx*.txt
- Program: retrieve the PMID, TI, AB, TA, JID, RN, and MH fields
if the year (DA field) is within the target range
- Target: suesun:/export/home/humphrey/JD/data/2004/PMID/ALL/*.l (*.table)
- Same as in the JD word and document counts (Section I above)
- Get all Journal Descriptors for all Journals:
- This information is in the file:
suesun:/aux/humphrey/JD/serfile/ser/jid-ta-jd.in.20031201.mod.fixed.l
- The above file can be derived from (TBD)
- /aux/humphrey/JD/serfile/ser/2005/lsi2005.xml
- /aux/humphrey/JD/serfile/ser/2005/rule
- Same as in the JD word and document counts (Section I above)
- Add Journal Descriptors to retrieved records:
This step adds JDs to the retrieved records by combining the results of the two steps above.
- Source: suesun:/export/home/humphrey/JD/data/2004/PMID/ALL/*.l
- Source: suesun:/aux/humphrey/JD/serfile/ser/jid-ta-jd.in.20031201.mod.fixed.l
or
- Source: suesun:/export/home/humphrey/JD/data/2004/PMID/ALLWORDS/ui-jid-jd.*.l
- Target: suesun:/export/home/humphrey/JD/data/2004/MH/*.l
- Retrieve and count starred MeSH headings and subheadings by filtering records into 2 files based on the MH field
- Source: suesun:/export/home/humphrey/JD/data/2004/MH/*.l
- Source: suesun:/export/home/humphrey/JD/sh-abbrev.l
- Program & algorithm:
- tokenize each MH entry that carries a star (*), whether the star is on the heading (*MH) or on a subheading (*SH)
- update the count for each MH under every JD of the record's journal (from JID)
| Heading | Total doc count of this MH | Total doc count of all JDs |
|---|---|---|
- tokenize an SH only when the star (*) is on the subheading itself (*SH)
- translate the SH into its subheading abbreviation
- update the count for each SH under every JD of the record's journal (from JID)
| SubHeading | Total doc count of this SH | Total doc count of all JDs |
|---|---|---|
- Target: suesun:/export/home/humphrey/JD/data/2004/MH/mhstar-jd.l
- Target: suesun:/export/home/humphrey/JD/data/2004/MH/shstar-jd.l
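The starred MH and SH counting above can be sketched as follows. This is a sketch under stated assumptions rather than the original program: an MH value is assumed to look like `Heading/*subheading/subheading`, with `*` marking a starred term; each document is given as a pair `(mh_values, jd_ids)`; and the subheading abbreviations are passed in as a plain dict, since the layout of sh-abbrev.l is not described here. The names `starred_terms` and `count_starred` are hypothetical.

```python
from collections import defaultdict


def starred_terms(mh_value):
    """Split one MH value into (heading_or_None, [starred subheadings]).

    The heading is returned when the star is on the heading itself
    or on any of its subheadings; otherwise None.
    """
    heading, *subs = mh_value.split("/")
    starred_subs = [s[1:] for s in subs if s.startswith("*")]
    is_starred = heading.startswith("*") or bool(starred_subs)
    return (heading.lstrip("*") if is_starred else None), starred_subs


def count_starred(records, sh_abbrev):
    """Return (mh_counts, sh_counts), each shaped {term: {jd_id: doc_count}}.

    records:   iterable of (mh_values, jd_ids), one pair per document
    sh_abbrev: {subheading name: abbreviation}; unknown names are kept as-is
    """
    mh_counts = defaultdict(lambda: defaultdict(int))
    sh_counts = defaultdict(lambda: defaultdict(int))
    for mh_values, jds in records:
        headings, subs = set(), set()
        for value in mh_values:
            heading, starred = starred_terms(value)
            if heading:
                headings.add(heading)
            subs.update(sh_abbrev.get(s, s) for s in starred)
        for jd in set(jds):                 # one count per document and JD
            for h in headings:
                mh_counts[h][jd] += 1
            for s in subs:
                sh_counts[s][jd] += 1
    return mh_counts, sh_counts
```

Collecting headings and subheadings into sets before counting keeps these document counts rather than occurrence counts, which matches the "doc count" columns in the tables above.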