Text Categorization

Pre-Process: Word-Jdid-Wc-Dc table

  • Description:
    This file contains the final Word-Jdid-Wc-Dc scores for all words in the training set (MEDLINE). It is used as the input file for the JDI database.

  • Input:

  • Procedures & Java files:
    • GenerateWordJdidWcDcTable.java
    • Read the word counts and document counts for all word-Jdid pairs from the input files, calculate their scores, and then send the results to the output file
      • Read total word count and document count for each word-Jdid from wordJdidWcDcGt1.txt
      • Read total (normalized) Wc signal and total Dc for all words from wordSignalWcDcScores.txt
      • Read jdDcNFactor for each Jdid from jdidDcNFactor.txt
      • Calculate word count scores and document count scores for all word-Jdid:
        • word count score = (word count/total normalized Wc signal) * NFactor
        • document count score = (document count/total Dc) * NFactor
    • Print out Word-Jdid-Wc-Dc scores
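
The two score formulas above can be sketched in Java as follows. This is a minimal illustration, not the code in GenerateWordJdidWcDcTable.java: the class name, the method name, the sample values, and the "|" output separator are all hypothetical. It assumes that the same per-Jdid NFactor applies to both scores, and that the totals are the per-word values read from wordSignalWcDcScores.txt. Returning 0 when a total is not positive is also an assumption of this sketch.

```java
class WordJdidScoreSketch {

    /**
     * Shared form of both formulas: (count / total) * nFactor.
     * Returning 0.0 for a non-positive total is an assumption of this sketch,
     * not documented behavior.
     */
    static double score(double count, double total, double nFactor) {
        if (total <= 0.0) {
            return 0.0;
        }
        return (count / total) * nFactor;
    }

    public static void main(String[] args) {
        // Hypothetical values for a single word-Jdid pair.
        String word = "exampleWord";
        String jdid = "exampleJdid";
        double wordCount = 4.0;      // word count for this word-Jdid pair
        double docCount = 3.0;       // document count for this word-Jdid pair
        double totalWcSignal = 8.0;  // total normalized Wc signal for the word
        double totalDc = 6.0;        // total Dc for the word
        double nFactor = 2.0;        // NFactor for this Jdid

        double wcScore = score(wordCount, totalWcSignal, nFactor);
        double dcScore = score(docCount, totalDc, nFactor);

        // Print one Word-Jdid-Wc-Dc record (the field separator is hypothetical).
        System.out.println(word + "|" + jdid + "|" + wcScore + "|" + dcScore);
    }
}
```

Both scores go through one helper because the two formulas differ only in which count and which total they use.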

  • Output file: