Text Categorization

PreProcess: JDs (Journal Descriptors)

  • Description:
    Journal Descriptors (JDs) are preferred MeSH terms that describe journals. Each journal has an ID, called a JID, and each JID is associated with one or more JDs. In the Lisp system, 122 JDs (preferred MeSH terms) are stored in jd-abbr-table. This information is also included in the List of Serials Indexed file (lsi${YEAR}.xml). Since the 2007 release, the JD file (jds.txt) has been derived from lsi${YEAR}.xml.

  • Input:
    • ftp://ftp.nlm.nih.gov/online/journals/lsi2007.xml
    • jds.txt (from previous version)

  • Java File & Algorithm:
    • GenerateJidTaJdsFromLsi.java
      • Parse the lsi.xml file
      • Find the XML tag <NlmUniqueID> for the journal ID (JID)
      • Find the XML tag <MedlineTA> for the journal title (TA)
      • Find the XML tag <BroadJournalHeading> for the Journal Descriptors (JDs)
      • Find the XML tag <BroadJournalHeadingList> for the beginning of the JDs
      • Write the information in the new format to the file jidTaJds.out
      • Write the information in the new format to the file jds.txt
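
The tag-lookup steps above can be sketched in Java roughly as follows. This is a minimal illustration, not the actual GenerateJidTaJdsFromLsi.java: the class name LsiJdSketch, the <Serial> wrapper element, and the "|"-separated output line are all assumptions made for the example, and only the three tag names come from the steps listed above.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch; names and output layout are assumptions.
class LsiJdSketch {

    /** One journal: its JID, its title (TA), and its Journal Descriptors. */
    static final class Journal {
        final String jid;
        final String ta;
        final List<String> jds;

        Journal(String jid, String ta, List<String> jds) {
            this.jid = jid;
            this.ta = ta;
            this.jds = jds;
        }
    }

    /** Reads every journal entry and collects its JID, TA, and JDs. */
    static List<Journal> parse(InputStream in) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(in);
            List<Journal> journals = new ArrayList<>();
            // Assumption: each journal entry is wrapped in a <Serial> element.
            NodeList serials = doc.getElementsByTagName("Serial");
            for (int i = 0; i < serials.getLength(); i++) {
                Element serial = (Element) serials.item(i);
                List<String> jds = new ArrayList<>();
                // Exact tag match, so <BroadJournalHeadingList> is not returned.
                NodeList headings = serial.getElementsByTagName("BroadJournalHeading");
                for (int j = 0; j < headings.getLength(); j++) {
                    jds.add(headings.item(j).getTextContent().trim());
                }
                journals.add(new Journal(
                    firstText(serial, "NlmUniqueID"),
                    firstText(serial, "MedlineTA"),
                    jds));
            }
            return journals;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse LSI XML", e);
        }
    }

    /** Text of the first descendant with the given tag, or "" if absent. */
    static String firstText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    /** Assumed output layout: JID|TA|JD1|JD2|... */
    static String format(Journal j) {
        List<String> fields = new ArrayList<>();
        fields.add(j.jid);
        fields.add(j.ta);
        fields.addAll(j.jds);
        return String.join("|", fields);
    }

    public static void main(String[] args) {
        // Made-up sample record, for illustration only.
        String xml =
            "<SerialsSet><Serial>"
            + "<NlmUniqueID>0000001</NlmUniqueID>"
            + "<MedlineTA>J Example</MedlineTA>"
            + "<BroadJournalHeadingList>"
            + "<BroadJournalHeading>Genetics</BroadJournalHeading>"
            + "<BroadJournalHeading>Behavioral Sciences</BroadJournalHeading>"
            + "</BroadJournalHeadingList>"
            + "</Serial></SerialsSet>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        for (Journal j : parse(in)) {
            System.out.println(format(j));
        }
    }
}
```

A DOM parser keeps the sketch short. For a full lsi${YEAR}.xml file, a streaming parser may be a better fit, and a DOCTYPE declaration in the real file may need extra parser configuration.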

  • Output File:
    jds.txt, used in TC.JDI and TC.STRI
    Fields: Index, JD Id, JD Name, Status
  • Notes:
    • Journal descriptors change from year to year.
    • The file is sorted by JD ID (by version, then alphabetically)
    • Status: Active, Inactive
    • There are differences in JDs between versions:
      • Susanne's file (used in 2004 training set) & lsi2006.xml:
        jd-abbr-table            | lsi2006.xml                    | Notes
        -------------------------+--------------------------------+----------------------------------
        Anthropology, Physical   | Anthropology                   |
        Antibiotics              | Anti-Bacterial Agents          |
        Behavior                 | Behavioral Sciences            |
        Delivery of Health Care  | Health Services                |
        Family Planning          | Family Planning Services       |
        Genetics, Behavioral     | Behavioral Sciences; Genetics  |
                                 | Library Science                |
        Research                 |                                | Not a valid JD, should be removed
        Tuberculosis             |                                | Not a valid JD, should be removed
      • lsi2006.xml & lsi2007.xml:
        lsi2006.xml  | lsi2007.xml
        -------------+---------------------
        Nutrition    | Nutritional Sciences

      • Different JD sets produce different JDI training sets and results. To compare results across versions, we use the similarity computed on the JDs that the versions have in common.
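
As a rough illustration of comparing results on common JDs only, the sketch below restricts two JD-to-score maps to their shared JD names before computing a similarity. The text above does not say which similarity measure is used, so cosine similarity is an assumption here, as are the class name CommonJdSketch, the map representation, and the sample scores.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; the similarity measure and data layout are assumptions.
class CommonJdSketch {

    /** JD names present in both result sets, in alphabetical order. */
    static Set<String> commonJds(Map<String, Double> a, Map<String, Double> b) {
        Set<String> common = new TreeSet<>(a.keySet());
        common.retainAll(b.keySet());
        return common;
    }

    /** Cosine similarity of the two score vectors, using only the common JDs. */
    static double cosineOnCommon(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (String jd : commonJds(a, b)) {
            double x = a.get(jd);
            double y = b.get(jd);
            dot += x * y;
            normA += x * x;
            normB += y * y;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // no overlap, or an all-zero vector on the common JDs
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Made-up scores; "Nutrition" was renamed between versions, so it is skipped.
        Map<String, Double> older = new HashMap<>();
        older.put("Genetics", 0.8);
        older.put("Behavioral Sciences", 0.3);
        older.put("Nutrition", 0.5);

        Map<String, Double> newer = new HashMap<>();
        newer.put("Genetics", 0.7);
        newer.put("Behavioral Sciences", 0.4);
        newer.put("Nutritional Sciences", 0.5);

        System.out.println("Common JDs: " + commonJds(older, newer));
        System.out.println("Similarity: " + cosineOnCommon(older, newer));
    }
}
```

Renamed JDs (for example Nutrition and Nutritional Sciences) drop out of the comparison unless they are mapped to a single name first.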