Text Categorization

Install/Run Other Versions of the Data Set

Since the TC.2009 release, TC can be run with other versions of the data set in Jdi, Sti, Stri, and StWsd. The supported data sets include TC.2007, TC.2008, and TC.2009. The procedure is as follows:

I. Install TC

II. Install Data Set

  • Download data set from TC web site:
  • Uncompress and unarchive this file into the top directory of TC. On a Linux machine, using the 2008 version as an example, this would look like:
    	> mv tcData.2008.tgz ${TC_DIR}
    	> cd ${TC_DIR}
    	> gtar -xzvf tcData.2008.tgz
    	
  • Notes:
    After this step, you should see the data directory data.2008 under ${TC_DIR}. This directory includes a complete data set for TC.2008:
    • Config: default configuration file (for reference)
    • HSqlDb: Database for TC.2008
    • Jdi: files used for Jdi
      • contractions.txt
      • jds.txt
      • jidTaJds.txt
      • restrictWords.txt
      • shs.txt
      • stopWords.txt
      • wordSignalWcDcGt1.txt
    • Sti: files used for Sti
      • sts.txt
    • Stri: files used for Stri
      • stJdTable.txt
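
The layout listed above can be checked after unpacking. The following is a minimal sketch, not part of TC: check_tc_data_dir is a hypothetical helper name, and the list of sub-directories and files is simply copied from the notes above (it assumes each listed file sits directly inside the sub-directory it is listed under).

```shell
# Sketch only: check_tc_data_dir is a hypothetical helper, not shipped with TC.
# It reports any sub-directory or file from the notes above that is missing
# and returns a non-zero status if at least one entry is absent.
check_tc_data_dir() {
    dir=$1
    missing=0
    # Sub-directories named in the notes above.
    for sub in Config HSqlDb Jdi Sti Stri; do
        if [ ! -d "$dir/$sub" ]; then
            echo "missing directory: $dir/$sub" >&2
            missing=1
        fi
    done
    # Input files named in the notes above (assumed to sit directly
    # inside their sub-directory).
    for f in Jdi/contractions.txt Jdi/jds.txt Jdi/jidTaJds.txt \
             Jdi/restrictWords.txt Jdi/shs.txt Jdi/stopWords.txt \
             Jdi/wordSignalWcDcGt1.txt Sti/sts.txt Stri/stJdTable.txt; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing file: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For the 2008 example, the helper would be called on ${TC_DIR}/data.2008; a zero exit status means every listed entry was found.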

III. Run Program with Specified Data Set

  • Use the run-specified-version option (-rv:STR). On a Linux machine, using Jdi with the 2008 version as an example, this would look like:
    	> cd ${TC_DIR}/bin
    	> jdi -rv:2008 -p
    	
  • Notes:
    The -rv:STR option performs the following steps in the background to run a different version:
    • Database: overrides the database name in the configuration file according to the specified version; see the table below.
    • Data files directory: overrides the directory name of all TC input files in the configuration file according to the specified version; see the table below.
    • Default Max. normalized signal: the value of the Max. normalized signal is updated yearly. The option updates the value of Max. signal in inputFilterOption.legalWordOption according to the specified version; see the table below.

      Version   DATA_DIR     DB_NAME   Max. Signal
      2011      data/        tc2011    792054
      2010      data.2010    tc2010    754648
      2009      data.2009    tc2009    705815
      2008      data.2008/   tc2008    645881
      2007      data.2007/   tc2004    510754
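
The per-version overrides in the table above amount to a simple lookup keyed on the version string. The sketch below illustrates that lookup only; tc_version_settings is a hypothetical name, the values are copied from the table as given, and how TC itself implements -rv:STR is not described here.

```shell
# Sketch only: tc_version_settings is a hypothetical helper, not part of TC.
# It prints "DATA_DIR DB_NAME MAX_SIGNAL" for a version listed in the table
# above and returns a non-zero status for any other version.
tc_version_settings() {
    case "$1" in
        2011) echo "data/ tc2011 792054" ;;
        2010) echo "data.2010 tc2010 754648" ;;
        2009) echo "data.2009 tc2009 705815" ;;
        2008) echo "data.2008/ tc2008 645881" ;;
        2007) echo "data.2007/ tc2004 510754" ;;
        *)    echo "unsupported version: $1" >&2
              return 1 ;;
    esac
}
```

Called with 2008, the helper prints the data directory, database name, and Max. signal from the 2008 row; an unlisted version produces an error message on standard error instead.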