
Text Categorization

PreProcess: Word-ST Table


  • Description:

    A table (file) that stores the Word-ST scores is generated and loaded into a DB table. This table is then used to perform ST indexing on phrases. There are two types of scores:

    • word count
    • document count

  • Input:

  • Java File & Algorithm:
    • Read in the WC and DC scores for all Word-Jdid from wordJdidWcDcTable (wordJdidWcDcTable.txt)
      • The JD scores are not sorted
      • A JD score is omitted from the table if it is 0
    • Read in the WC and DC scores for all ST-Jdid from stJdsTable (stJdsTable.txt)
      • The JD scores are sorted
      • A JD score is in the table even if it is 0
    • Calculate the cosine coefficient on the WC and DC vectors for all Word-Jdid and ST-Jdid to form the Word-St-Wc-Dc tables
      • Make sure all JD vectors have the same number of vector components
    • Print out the tables
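
The cosine step above can be sketched as follows. This is a minimal illustration, not the tool's actual source: the class name WordStCosineSketch, the method names toDense and cosine, and the use of a 0-based JD index as the map key are all assumptions made for the example. It shows why the sparse, unsorted Word-Jdid scores have to be expanded (missing JDs filled with 0) so that they have the same number of components as the sorted, complete ST-Jdid scores before the cosine coefficient is taken.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: names and data layout are assumptions, not the tool's real code.
class WordStCosineSketch {

    // Expand sparse (JD index -> score) entries into a dense vector of length jdCount.
    // JDs that are absent from the map keep the default score 0, and input order is irrelevant.
    static double[] toDense(Map<Integer, Double> sparse, int jdCount) {
        double[] dense = new double[jdCount];
        for (Map.Entry<Integer, Double> e : sparse.entrySet()) {
            dense[e.getKey()] = e.getValue();
        }
        return dense;
    }

    // Cosine coefficient: dot(a, b) / (|a| * |b|). Returns 0 if either vector is all zeros.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("JD vectors must have the same number of components");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        // Word scores: unsorted, zero entries omitted (JD 1 is missing).
        Map<Integer, Double> wordWc = new HashMap<>();
        wordWc.put(2, 3.0);
        wordWc.put(0, 1.0);
        // ST scores: sorted by JD, zeros included.
        double[] stWc = {2.0, 0.0, 6.0};
        System.out.println(cosine(toDense(wordWc, stWc.length), stWc));
    }
}
```

The same two methods would be applied once to the WC vectors and once to the DC vectors of each Word-ST pair, giving the two scores stored per row.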

  • Output Files:
    • wordStTable.txt
      Fields: Word, ST index, ST Abbreviation, TUI, Word scores, Document scores

  • Notes: