Text Categorization

Installation Instructions



Introduction

Thank you for downloading the Java Text Categorization Tools. This package consists of one compressed file: tc2011.tgz.

Requirements

Installation

Download the tc2011.tgz file from the Text Categorization Tool web site.

Uncompress and unarchive this file into the location where you intend to install it. On a Linux machine with GNU tar (gtar), this would look like:

  > gtar -xzvf tc2011.tgz

Alternatively, gunzip and tar may be used on a Linux platform to uncompress and unarchive this file in two steps:

  > gunzip tc2011.tgz
  > tar -xvf tc2011.tar
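As a minimal sketch of the extraction step, the function below first checks that the download is a readable gzip'ed tar archive and then extracts it into a chosen directory. The function name extract_tc and its arguments are hypothetical and not part of the distribution; it assumes a tar that supports the -z and -C options (GNU tar and most current tar implementations do).

```shell
# Hypothetical helper (not shipped with the tools): extract_tc ARCHIVE DEST_DIR
# Example call: extract_tc tc2011.tgz "$HOME/tools"
extract_tc() {
    archive=$1
    dest=$2
    # Listing the archive fails if the file is missing or is not a gzip'ed tar.
    if ! tar -tzf "$archive" > /dev/null 2>&1; then
        echo "not a readable .tgz archive: $archive" >&2
        return 1
    fi
    mkdir -p "$dest" || return 1
    # -C extracts into DEST_DIR; the directory structure inside the archive is kept.
    tar -xzf "$archive" -C "$dest"
}
```

Checking the archive before extracting catches a truncated download or a wrong file extension early, before any files are written.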

If you are on a Windows platform, PKZIP or WinZip may be used to uncompress and unarchive this file. Make sure the file you downloaded has the right extension (tgz) before you unzip it. WinZip looks inside the file to check whether it contains a tar file, and asks whether it should create a temporary file holding that tar file. The proper response is to allow this. Once the temporary file has been created, WinZip reads it and displays the archive as it would any other zipped file. Unzip the contents to the location where you want to install, and make sure that you preserve the directory structure when unarchiving.

Once the files are in place, change directories to the top level, which should be tc2011. If you are on a Windows platform, this involves opening a DOS command window (Start->Run->cmd) and changing directories to where you have put these files. We will henceforth refer to this top-level directory as the TC_DIR directory.

From the TC_DIR, invoke the appropriate installation command. These are the installation scripts for the platforms we know about:

Platform        Installation Script
Linux i586      install/bin/install_linux.sh
Windows         install\bin\install_win.bat
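On Linux, invoking the installer from the top-level directory could be sketched as follows. The function name run_tc_install is hypothetical, and running the script through sh is an assumption; the script path is the one listed for Linux above.

```shell
# Hypothetical helper (not shipped with the tools): run_tc_install TC_DIR
# Example call: run_tc_install "$HOME/tools/tc2011"
run_tc_install() {
    tc_dir=$1
    script="install/bin/install_linux.sh"   # Linux script path listed above
    if [ ! -f "$tc_dir/$script" ]; then
        echo "install script not found under $tc_dir" >&2
        return 1
    fi
    # Run from TC_DIR in a subshell so the caller's working directory is unchanged.
    # Invoking the script with sh is an assumption, not documented behavior.
    ( cd "$tc_dir" && sh "$script" )
}
```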

The scripts shown above install a copy of the JRE and then update the configuration file settings with the install location. If you install tc on a PC platform, choose NOT to restart your computer when the JRE installation finishes. You may restart your computer once the tc installation is complete.

The script creates a shell file in the ${TC_DIR}/bin directory for each of the text categorization tools, with the proper environment set up. It also creates the tc configuration file at ${TC_DIR}/data/config/tc.properties.

Once done, the script will attempt to verify that the installation was successful.

If the installation was successful, a congratulatory message is displayed at the end of the process.

If the installation is not successful, there will be a message indicating so. Please refer to the ${TC_DIR}/logs directory for a complete transcript of the process. It is likely that there were error messages indicating the source of the failure.
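A quick way to scan that transcript is sketched below. The function name show_install_errors is hypothetical, and because the names of the log files are not documented here, the sketch simply searches every file under the logs directory for lines mentioning "error" or "exception". It assumes a grep that supports the -r option.

```shell
# Hypothetical helper (not shipped with the tools): show_install_errors TC_DIR
show_install_errors() {
    log_dir="$1/logs"
    if [ ! -d "$log_dir" ]; then
        echo "no logs directory at $log_dir" >&2
        return 1
    fi
    # -r: search all files under the directory, -i: ignore case, -n: show line numbers
    grep -rin -e 'error' -e 'exception' "$log_dir"
}
```

The function returns a non-zero status when nothing matches, so an empty result means the search terms did not appear in the logs, not that the installation succeeded.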

After a successful installation

The Text Categorization tools are now ready to be used. They can be found in the ${TC_DIR}/bin directory. On the Linux platforms, they include the following shell scripts:

On the Windows platform, they include the following batch files:

  • mlt.bat
  • jdi.bat
  • sti.bat
  • stri.bat
  • stWsd.bat

Each may be used from any location on the machine. If you put the ${TC_DIR}/bin path in your $PATH environment variable, the tools do not have to be referenced with the whole path.

An Example


> jdi -p
- Please input a term (type "Ctl-d" to quit) >
heart valve
--> Input: [heart valve]
--- JD scores (x 1) and rank based on word count ---
JD018|Cardiology
1|0.0858526|JD018|Cardiology
2|0.0624434|JD148|Pulmonary Medicine
3|0.0495025|JD124|Vascular Diseases
4|0.0251979|JD144|General Surgery
5|0.0209033|JD030|Diagnostic Imaging
6|0.0108041|JD120|Transplantation
7|0.0090153|JD005|Anesthesiology
8|0.0086425|JD014|Biomedical Engineering
9|0.0067363|JD100|Radiology
10|0.0064961|JD118|Therapeutics
--- JD scores (x 1) and rank based on document count ---
JD018|Cardiology
1|0.1564322|JD018|Cardiology
2|0.0979494|JD148|Pulmonary Medicine
3|0.0891969|JD124|Vascular Diseases
4|0.0438102|JD030|Diagnostic Imaging
5|0.0400007|JD144|General Surgery
6|0.0236169|JD005|Anesthesiology
7|0.0187880|JD120|Transplantation
8|0.0158293|JD014|Biomedical Engineering
9|0.0151241|JD092|Physiology
10|0.0133293|JD118|Therapeutics
--- Overall JD rank ---
JD018|Cardiology|dc
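
The ranked lines in this output are pipe-delimited (rank|score|JD code|JD name), so they are easy to post-process. The sketch below is a hypothetical filter, not part of the tools, that reads saved jdi output on standard input and keeps the top N entries of each ranking. It assumes the output format is exactly as shown above.

```shell
# Hypothetical helper (not shipped with the tools): top_jds N
# Reads saved jdi output on stdin; prints "code<TAB>name<TAB>score" per entry.
top_jds() {
    # Rank lines have exactly four fields and a numeric first field.
    # Header lines and the shorter summary lines are skipped.
    awk -F'|' -v n="$1" \
        'NF == 4 && $1 ~ /^[0-9]+$/ && $1 <= n { print $3 "\t" $4 "\t" $2 }'
}
```

Because the output contains both a word-count ranking and a document-count ranking, the filter prints up to N lines for each of them.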

An Example Installation

Altering your environment to use the text categorization tools, once they have been installed

You may invoke these tools from a command line. These tools are found in the ${TC_DIR}/bin directory. You can add this ${TC_DIR}/bin path to your $PATH environment variable. This would enable you to find and run these tools from any location.

In Linux, this would be done by adding this path to your ~/.cshrc or ~/.profile startup script.
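
For a Bourne-style shell, the lines added to ~/.profile might look like the sketch below. The default location $HOME/tc2011 is a placeholder and should be replaced with the actual install directory. The case test only guards against adding the same entry twice.

```shell
# TC_DIR is the top-level install directory; $HOME/tc2011 is only a placeholder.
TC_DIR="${TC_DIR:-$HOME/tc2011}"

# Append ${TC_DIR}/bin to PATH unless it is already listed.
case ":$PATH:" in
    *":$TC_DIR/bin:"*) ;;
    *) PATH="$PATH:$TC_DIR/bin" ;;
esac
export TC_DIR PATH
```

csh users would put the equivalent setenv commands in ~/.cshrc instead.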

In Windows, this would be done by appending this path to the PATH variable through the "Control Panel\System\Advanced\Environment Variables\Edit" menus.

Manual installation
The manual installation page shows the details of how the installation script configures and installs the Text Categorization tools. Users may skip this section if they use the installation script to install the tools.