Text Categorization

TC - JDI Train Set Test

I. Testing target:

Three tables are generated for the TC database in the JDI train set. They are:

  • WordJdidWcDcTable.txt
  • MhJdidDcTable.txt
  • ShJdidDcTable.txt

After they are generated, these tables are tested by comparing them with the previous release. The source files used for each release are:

Release    | JDs         | MEDLINE & Year                     | MRCON
2007       | Lisp file   | MEDLINE 2004, DA: 1999, 2000, 2001 | 2003AC
2004.99.01 | lsi2006.xml | MEDLINE 2004, DA: 1999, 2000, 2001 | 2003AC
2004.00.02 | lsi2006.xml | MEDLINE 2004, DA: 2000, 2001, 2002 | 2003AC
2004.01.03 | lsi2006.xml | MEDLINE 2004, DA: 2001, 2002, 2003 | 2003AC
2005.02.04 | lsi2006.xml | MEDLINE 2005, DA: 2002, 2003, 2004 | 2004AC
2006.03.05 | lsi2006.xml | MEDLINE 2006, DA: 2003, 2004, 2005 | 2005AC
2007.04.06 | lsi2006.xml | MEDLINE 2007, DA: 2004, 2005, 2006 | 2006AD
2008       | lsi2007.xml | MEDLINE 2008, DA: 2005, 2006, 2007 | 2007AC

Release    | JDs         | MEDLINE & Year                     | MRCON
2007       | Lisp file   | MEDLINE 2004, DA: 1999, 2000, 2001 | 2003AC
2008.99.01 | lsi2007.xml | MEDLINE 2008, DA: 1999, 2000, 2001 | 2007AC
2008.00.02 | lsi2007.xml | MEDLINE 2008, DA: 2000, 2001, 2002 | 2007AC
2008.01.03 | lsi2007.xml | MEDLINE 2008, DA: 2001, 2002, 2003 | 2007AC
2008.02.04 | lsi2007.xml | MEDLINE 2008, DA: 2002, 2003, 2004 | 2007AC
2008.03.05 | lsi2007.xml | MEDLINE 2008, DA: 2003, 2004, 2005 | 2007AC
2008.04.06 | lsi2007.xml | MEDLINE 2008, DA: 2004, 2005, 2006 | 2007AC
2008.05.07 | lsi2007.xml | MEDLINE 2008, DA: 2005, 2006, 2007 | 2007AC

II. Testing Procedures:

  • Convert to standard format:
    Convert WordJdidWcDcTable.txt to WordJdidWcTable.txt and WordJdidDcTable.txt, each in the following standard format:
    entity name: word/Mh/Sh | JDID | count score
    	shell> flds 1,2,3 WordJdidWcDcTable.txt > WordJdidWcTable.txt
    	shell> flds 1,2,4 WordJdidWcDcTable.txt > WordJdidDcTable.txt
    	
  • Compare four files:
    • WordJdidWcTable.txt (converted from WordJdidWcDcTable.txt)
    • WordJdidDcTable.txt (converted from WordJdidWcDcTable.txt)
    • MhJdidDcTable.txt
    • ShJdidDcTable.txt

  • Compare procedures:
    • Get common JDs (GetCommonJds.java):
      Journal descriptors (JDs) can differ between versions because each version may use a different lsi${year}.xml. Different JDs result in different word counts and document counts for words, MeSH main headings, and MeSH subheadings. To simplify the comparison, only the counts for JDs common to both releases are compared, so the first step is to get the list of common JDs.
    • Compare JD vectors (CompareJdiVectors.java):
      Use the similarity distance (cosine coefficient) to compare the common-JD scores (vectors) of all words, MHs, and SHs. The output is written to ${TC_TEST}/TrainSetTest/data/Output/${SRC_YEAR}-${TEST_YEAR}/Jdi/ in the following format:
      entity name: word/Mh/Sh | similarity distance

      Please note that some entities do not have any common-JD score in one or both releases. In such cases, the similarity distance is NaN and should not be compared. For example, in the comparison of the 2007 and 2007+ releases, four MeSH main headings fall into this category:

      Main Heading                                                      | Not common JDs, 2007 | Not common JDs, 2007+
      butirosin sulfate                                                 | JD007                | JD136
      capreomycin sulfate                                               | JD007                | JD136
      certificate of need                                               | JD027                |
      congenital, hereditary, and neonatal diseases and abnormalities   | JD006                |
    • Analyze & summarize reports (AnalyzeSimilarityResults.java):
      Analyze the comparison results from the files above:
      • Total entity number
      • Total similarity distance
        Calculate the similarity distance between the per-entity similarity results and a 100% similar result (every entity at 1.0). A value of 1.0 means completely similar; this value should be above 0.9.
      • Similarity distribution in increments of 0.05.
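
The comparison steps above (find the common JDs, then compute a cosine coefficient per entity over those JDs) can be sketched as follows. This is a minimal illustration only, not the code of GetCommonJds.java or CompareJdiVectors.java: the class name JdiCompareSketch, its method names, and the in-memory representation of one entity's scores as a map from JDID to score are all assumptions made for this example.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; names and data layout are assumptions, not the actual test code.
final class JdiCompareSketch {

    // JDs that appear in both releases.
    static Set<String> commonJds(Set<String> srcJds, Set<String> testJds) {
        Set<String> common = new HashSet<>(srcJds);
        common.retainAll(testJds);
        return common;
    }

    // Cosine coefficient of one entity's scores, restricted to the common JDs.
    // A JD with no score for the entity in a release is treated as 0 (assumption).
    static double cosine(Map<String, Double> src, Map<String, Double> test, Set<String> common) {
        double dot = 0.0;
        double srcSq = 0.0;
        double testSq = 0.0;
        for (String jd : common) {
            double a = src.getOrDefault(jd, 0.0);
            double b = test.getOrDefault(jd, 0.0);
            dot += a * b;
            srcSq += a * a;
            testSq += b * b;
        }
        // If either restricted vector is all zeros, this is 0.0 / 0.0, which is NaN in Java.
        return dot / (Math.sqrt(srcSq) * Math.sqrt(testSq));
    }
}
```

Under these assumptions, an entity such as butirosin sulfate, whose scores fall only on JDs that are not common to both releases, yields NaN because both restricted vectors are all zeros; such entities would be skipped in the summary.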

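The analysis step (total similarity distance and the 0.05-increment distribution) can be sketched as follows. The sketch reads the "total similarity distance" as the cosine coefficient between the list of per-entity similarities and an all-ones list of the same length, and it skips NaN entries. That reading, the class name SimilaritySummarySketch, the method names, and the bin layout are assumptions for illustration, not a description of AnalyzeSimilarityResults.java.

```java
import java.util.List;

// Hypothetical sketch; names, NaN handling, and bin layout are assumptions.
final class SimilaritySummarySketch {

    // Cosine coefficient between the per-entity similarities and an all-ones vector.
    // NaN entries are skipped; returns NaN if no usable entry remains.
    static double totalSimilarity(List<Double> sims) {
        double sum = 0.0;
        double sumSq = 0.0;
        int n = 0;
        for (double s : sims) {
            if (Double.isNaN(s)) {
                continue;
            }
            sum += s;
            sumSq += s * s;
            n++;
        }
        return sum / (Math.sqrt(sumSq) * Math.sqrt(n));
    }

    // Counts per 0.05-wide bin: bin i covers [0.05 * i, 0.05 * (i + 1)), and bin 20 holds 1.0.
    // Values are rounded to four decimals first to avoid floating-point edge effects.
    static int[] distribution(List<Double> sims) {
        int[] bins = new int[21];
        for (double s : sims) {
            if (Double.isNaN(s)) {
                continue;
            }
            int i = (int) (Math.round(s * 10000.0) / 500);
            bins[Math.max(0, Math.min(20, i))]++;
        }
        return bins;
    }
}
```

With this reading, a run in which every entity has similarity 1.0 gives a total of 1.0, and the result drops toward 0 as the per-entity similarities become more uneven or smaller.
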
III. Testing Results:

Please refer to JDI similarity tests for the TC annual release.

2007 to 2008

Versions              | Mh-Dc  | Sh-Dc  | Word-Dc | Word-Wc
2007-2008.99.01       | 0.9913 | 0.9949 | 0.9923  | 0.9911
2008.99.01-2008.00.02 | 0.9903 | 0.9970 | 0.9793  | 0.9737
2008.00.02-2008.01.03 | 0.9892 | 1.0000 | 0.9772  | 0.9723
2008.01.03-2008.02.04 | 0.9894 | 1.0000 | 0.9795  | 0.9739
2008.02.04-2008.03.05 | 0.9918 | 1.0000 | 0.9808  | 0.9753
2008.03.05-2008.04.06 | 0.9900 | 1.0000 | 0.9795  | 0.9742
2008.04.06-2008.05.07 | 0.9894 | 1.0000 | 0.9797  | 0.9742

Compare to 2007

Versions        | Mh-Dc  | Sh-Dc  | Word-Dc | Word-Wc
2007-2008.99.07 | 0.     | 0.     | 0.      | 0.
2007-2008.00.07 | 0.     | 0.     | 0.      | 0.
2007-2008.01.07 | 0.     | 0.     | 0.      | 0.
2007-2008.02.07 | 0.9585 | 0.9929 | 0.8739  | 0.8618
2007-2008.03.07 | 0.9532 | 0.9929 | 0.8568  | 0.8443
2007-2008.04.07 | 0.9507 | 0.9931 | 0.8516  | 0.8387
2007-2008.05.07 | 0.9464 | 0.9932 | 0.8472  | 0.8337
2007-2008.06.07 | 0.9398 | 0.9930 | 0.8432  | 0.8293
2007-2008.07.07 | 0.9275 | 0.9923 | 0.8428  | 0.8283

Compare to 2008

Versions        | Mh-Dc  | Sh-Dc  | Word-Dc | Word-Wc
2008-2008.99.07 | 0.     | 0.     | 0.      | 0.
2008-2008.00.07 | 0.     | 0.     | 0.      | 0.
2008-2008.01.07 | 0.     | 0.     | 0.      | 0.
2008-2008.02.07 | 0.9936 | 1.0000 | 0.9893  | 0.9857
2008-2008.03.07 | 0.9948 | 1.0000 | 0.9920  | 0.9891
2008-2008.04.07 | 0.9969 | 1.0000 | 0.9957  | 0.9939
2008-2008.05.07 | 1.0000 | 1.0000 | 1.0000  | 1.0000
2008-2008.06.07 | 0.9962 | 1.0000 | 0.9935  | 0.9906
2008-2008.07.07 | 0.9848 | 1.0000 | 0.9803  | 0.9736