Text Categorization

Pre-Process: Jid-Ta-Jds

  • Description:
    This file lists the Journal ID (JID), Journal Title (TA), and associated Journal Descriptors (JDs) for each journal in the List of Serials Indexed file, lsi${YEAR}.xml. For the 2004 training set, the file was static: it was maintained manually by NLM and Susanne, and provided by Susanne as "jid-ta-jd.im.20031201.mod.fixed.l". In the Java 2007 release, we derived this file from the List of Serials Indexed file, lsi2006.xml. We use lsi2007.xml for the 2008 release.

  • Input:
    • By NLM:
      • ftp://ftp.nlm.nih.gov/online/journals/lsi2007.xml

  • Java File & Algorithm:
    • GenerateJidTaJdsFromLsi.java
      • Parse the lsi.xml file
      • Find XML tag <NlmUniqueID> for Journal ID, JID
      • Find XML tag <MedlineTA> for Journal Title, TA
      • Find XML tag <BroadJournalHeading> for Journal Descriptors, JDs
      • Find XML tag <BroadJournalHeadingList> for the beginning of JDs
      • Print the information in the new format to file: jidTaJds.out
    • Perform a unique sort on jidTaJds.out to get jidTaJds.txt (sort -u jidTaJds.out > jidTaJds.txt)
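
The steps above can be sketched in Java roughly as follows. This is a minimal illustration, not the actual GenerateJidTaJdsFromLsi.java: the class name JidTaJdsSketch is hypothetical, the enclosing record element <Serial> and the "|" field separator are assumptions not stated on this page, and a DOM parser is used for brevity. A TreeSet stands in for the separate "sort -u" step; its ordering may differ from sort's under some locales.

```java
import java.io.InputStream;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch; not the real GenerateJidTaJdsFromLsi.java.
final class JidTaJdsSketch {
    // Assumption: each journal record is wrapped in a <Serial> element.
    static final String RECORD_TAG = "Serial";
    // Assumption: output fields are separated by '|'.
    static final String SEP = "|";

    /** Returns one line per journal (JID, TA, JDs...), sorted and de-duplicated. */
    static List<String> extract(InputStream xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Resolve any external DTD to an empty source so parsing stays offline.
            builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
            Document doc = builder.parse(xml);

            // TreeSet gives sorted, unique lines (stand-in for "sort -u").
            TreeSet<String> lines = new TreeSet<>();
            NodeList records = doc.getElementsByTagName(RECORD_TAG);
            for (int i = 0; i < records.getLength(); i++) {
                Element rec = (Element) records.item(i);
                StringBuilder line = new StringBuilder();
                line.append(firstText(rec, "NlmUniqueID"));           // JID
                line.append(SEP).append(firstText(rec, "MedlineTA")); // TA

                // <BroadJournalHeadingList> marks the beginning of the JDs.
                NodeList lists = rec.getElementsByTagName("BroadJournalHeadingList");
                if (lists.getLength() > 0) {
                    NodeList jds = ((Element) lists.item(0))
                        .getElementsByTagName("BroadJournalHeading");
                    for (int j = 0; j < jds.getLength(); j++) {
                        line.append(SEP).append(jds.item(j).getTextContent().trim());
                    }
                }
                lines.add(line.toString());
            }
            return new ArrayList<>(lines);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse LSI XML", e);
        }
    }

    // Text of the first descendant element with the given tag, or "" if absent.
    private static String firstText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    // Usage: java JidTaJdsSketch lsi2007.xml > jidTaJds.txt
    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            for (String line : extract(in)) {
                System.out.println(line);
            }
        }
    }
}
```

Collecting lines in a sorted set folds the two documented steps (write jidTaJds.out, then sort -u) into one pass; keeping them separate, as the original process does, leaves the unsorted intermediate file available for inspection.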

  • Output File:
    • jidTaJds.txt, used in TC.MLT
      Fields, in order: JID, TA, JD 1, JD 2, ...

  • Notes:
    • None