Text Categorization

PreProcess: JDs (Journal Descriptors)

Description:
Journal Descriptors (JDs) are preferred MeSH terms that describe journals. Each journal has an ID, called a JID, and each JID is associated with one or more JDs. In the Lisp system, 122 JDs are stored in jd-abbr-table (preferred MeSH terms). This information is also included in the List of Serials Indexed file (lsi${YEAR}.xml). Since the 2007 release, the JD file (jds.txt) is derived from lsi${YEAR}.xml.
Input:
- ftp://ftp.nlm.nih.gov/online/journals/lsi2007.xml
- jds.txt (from previous version)
Java File & Algorithm:
- GenerateJidTaJdsFromLsi.java
  - parse the lsi.xml file
  - find xml tag <NlmUniqueID> for Journal ID, JID
  - find xml tag <MedlineTA> for Journal Title Abbreviation, TA
  - find xml tag <BroadJournalHeadingList> for the beginning of JDs
  - find xml tag <BroadJournalHeading> for each Journal Descriptor, JD
  - print out information in the new format to file: jidTaJds.out
  - print out information in the new format to file: jds.txt
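
The parsing steps above can be sketched roughly as follows. This is an illustrative sketch only, not the actual GenerateJidTaJdsFromLsi.java: the class name JidTaJdsSketch, the Journal holder, and the "JID|TA|JD1;JD2" line layout are hypothetical, and the code assumes that <NlmUniqueID> is the first of the three tags to appear within each journal record.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch; not the real GenerateJidTaJdsFromLsi.java.
final class JidTaJdsSketch {

    /** One journal record: JID, title abbreviation (TA), and its JDs. */
    static final class Journal {
        String jid = "";
        String ta = "";
        final List<String> jds = new ArrayList<>();

        // Assumed output layout (JID|TA|JD1;JD2); the real format may differ.
        String toLine() {
            return jid + "|" + ta + "|" + String.join(";", jds);
        }
    }

    /** Streams through the XML and collects JID, TA, and JDs per journal. */
    static List<Journal> parse(InputStream in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Ask the parser not to process DTDs; only element text is needed.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        XMLStreamReader reader = factory.createXMLStreamReader(in);

        List<Journal> journals = new ArrayList<>();
        Journal current = null;
        while (reader.hasNext()) {
            if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                continue;
            }
            String tag = reader.getLocalName();
            if (tag.equals("NlmUniqueID")) {
                // Assumption: NlmUniqueID opens each journal record.
                current = new Journal();
                current.jid = reader.getElementText().trim();
                journals.add(current);
            } else if (current != null && tag.equals("MedlineTA")) {
                current.ta = reader.getElementText().trim();
            } else if (current != null && tag.equals("BroadJournalHeading")) {
                // Each BroadJournalHeading inside the list is one JD.
                current.jds.add(reader.getElementText().trim());
            }
        }
        reader.close();
        return journals;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java JidTaJdsSketch <lsi.xml>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            for (Journal journal : parse(in)) {
                System.out.println(journal.toLine());
            }
        }
    }
}
```

A streaming (StAX) reader is used here because the serials file can be large and only three tags are of interest; a DOM parser would work equally well for a sketch of this kind.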
Output File:
jds.txt, used in TC.JDI and TC.STRI

Columns: Index, JD Id, JD Name, Status

Notes:

There are differences in JDs between versions:

Susanne's file (used in the 2004 training set) & lsi2006.xml:

jd-abbr-table	lsi2006.xml	Notes
Anthropology, Physical	Anthropology
Antibiotics	Anti-Bacterial Agents
Behavior	Behavioral Sciences
Delivery of Health Care	Health Services
Family Planning	Family Planning Services
Genetics, Behavioral	Behavioral Sciences; Genetics
	Library Science
	Research	Not a valid JD, should be removed
	Tuberculosis	Not a valid JD, should be removed

lsi2006.xml & lsi2007.xml:

lsi2006.xml	lsi2007.xml
Nutrition	Nutritional Sciences

Different JD sets produce different JDI training sets and results. To compare results across versions, we measure similarity only on the JDs that the versions have in common.
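
As a rough illustration of restricting a comparison to common JDs, the sketch below intersects two JD name sets by exact name. The class and method names are hypothetical, the sample sets are made up for the example rather than taken from the real files, and the similarity measure itself is not shown because it is not specified here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: find the JDs shared by two versions (exact name match).
final class CommonJdsSketch {

    /** Returns the JD names present in both sets, sorted alphabetically. */
    static Set<String> commonJds(Set<String> older, Set<String> newer) {
        Set<String> common = new TreeSet<>(older);
        common.retainAll(newer); // keep only names that also occur in the newer set
        return common;
    }

    public static void main(String[] args) {
        // Illustrative sample values only; not the full JD lists.
        Set<String> older = new HashSet<>(
            Arrays.asList("Nutrition", "Genetics", "Behavioral Sciences"));
        Set<String> newer = new HashSet<>(
            Arrays.asList("Nutritional Sciences", "Genetics", "Behavioral Sciences"));

        // A renamed JD (Nutrition vs. Nutritional Sciences) is not treated as common.
        System.out.println(commonJds(older, newer));
    }
}
```

With exact-name matching, a renamed JD drops out of the common set, which is consistent with comparing results only on JDs that are unchanged between versions.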