Text Categorization

Pre-Process: Word-Wc-Dc (Gt1)

  • Description:
    This file lists the total word count (Wc) and total document count (Dc) for every word that appears in the titles and abstracts of the training set (MEDLINE). Words with a document count of less than 2 are filtered out, so only words with a document count greater than 1 (Gt1) are kept.

  • Input:

  • Java Files & Algorithm:
    • GenerateWordWcDc.java
      • Read data from uiTiAbWords.${NUM}.txt
      • Calculate the total word count for each word
      • Calculate the total document count for each word
      • Filter out words with a document count of less than 2
      • Print word-wc-dc records sorted by document count, then alphabetically by word
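
The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual GenerateWordWcDc.java: the class name WordWcDcSketch and the method wordWcDc are hypothetical, the "word|wc|dc" output layout is an assumption, and document count is assumed to sort in descending order (the description does not state the direction). Reading uiTiAbWords.${NUM}.txt is left out because its layout is not described here; the sketch takes each document as a list of words already held in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the counting, filtering, and sorting steps.
class WordWcDcSketch {

    // Each inner list holds the title and abstract words of one document.
    // Returns one "word|wc|dc" line per kept word (the layout is an assumption).
    static List<String> wordWcDc(List<List<String>> docs) {
        Map<String, int[]> counts = new HashMap<>(); // word -> {wc, dc}
        for (List<String> doc : docs) {
            Set<String> seen = new HashSet<>();
            for (String word : doc) {
                int[] c = counts.computeIfAbsent(word, k -> new int[2]);
                c[0]++;                // every occurrence adds to the word count
                if (seen.add(word)) {
                    c[1]++;            // first occurrence in this document adds to the document count
                }
            }
        }

        // Keep only words with a document count greater than 1 (Gt1).
        List<Map.Entry<String, int[]>> kept = new ArrayList<>();
        for (Map.Entry<String, int[]> e : counts.entrySet()) {
            if (e.getValue()[1] >= 2) {
                kept.add(e);
            }
        }

        // Assumed order: document count descending, then word ascending.
        kept.sort((a, b) -> {
            int byDc = Integer.compare(b.getValue()[1], a.getValue()[1]);
            return byDc != 0 ? byDc : a.getKey().compareTo(b.getKey());
        });

        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, int[]> e : kept) {
            lines.add(e.getKey() + "|" + e.getValue()[0] + "|" + e.getValue()[1]);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("gene", "protein", "gene"),
            Arrays.asList("gene", "cell"),
            Arrays.asList("rare"));
        for (String line : wordWcDc(docs)) {
            System.out.println(line);
        }
    }
}
```

In the small example in main, only "gene" occurs in two documents, so it is the only word that passes the Gt1 filter; "protein", "cell", and "rare" each have a document count of 1 and are dropped.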

  • Output file: