Text Categorization

Pre-Process: Word-Signal-Wc-Dc (Gt1)

  • Description:
    This file contains the word signal, word count (Wc), and document count (Dc) for every word in the training set (MEDLINE) whose document count is greater than 1. The word signal is the normalized total word count and is used in place of the raw word count in JDI.

  • Input:
    • wordWcDcGt1.txt
    • uiTiAbWords.${NUM}.txt

  • Java Files & Algorithm:
    • GenerateWordSignalWcDc.java
    • Read global word count and document count from wordWcDcGt1.txt
    • Calculate word signal
      • Read in all words (Gt1) of titles and abstracts from uiTiAbWords.${NUM}.txt
      • For each citation (UI/PMID) in uiTiAbWords.${NUM}.txt, update the local word count of every word that also appears in wordWcDcGt1.txt
      • Calculate normalized global word signal:

        global Wc: the total count of the word over the entire training set

        Go through all documents in the training set and, for each word, calculate:
        • local Wc: the count of the word in the document being processed
        • local noise = (local Wc/global Wc)*log2 (global Wc/local Wc);
        • local signal = log2 (global Wc) - local noise;
        • local signal weight = (local Wc) * local signal;
        • local normalized signal = log2 (1 + local signal weight);
      • Add local normalized signal to global word signal (used as normalized word count in the next step)
    • Print out Word-Signal-Wc-Dc
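
    The calculation above can be sketched in Java as follows. This is a minimal illustration, not the actual GenerateWordSignalWcDc.java: the class name WordSignalSketch, its method names, and the use of in-memory maps instead of wordWcDcGt1.txt and uiTiAbWords.${NUM}.txt are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the word-signal formulas; not the real JDI source.
public class WordSignalSketch {

    // Base-2 logarithm (Math has no log2 method).
    public static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // Local normalized signal of one word in one document.
    // localWc: count of the word in the document (must be >= 1).
    // globalWc: count of the word over the whole training set.
    public static double localNormalizedSignal(long localWc, long globalWc) {
        // Cast to double first: integer division would truncate the ratio to 0.
        double ratio = (double) localWc / globalWc;
        double localNoise = ratio * log2((double) globalWc / localWc);
        double localSignal = log2(globalWc) - localNoise;
        double localSignalWeight = localWc * localSignal;
        return log2(1.0 + localSignalWeight);
    }

    // Sums the local normalized signal of each word over all documents.
    // docs: one map per citation, word -> local word count (assumed input shape).
    // globalWc: word -> global word count, standing in for wordWcDcGt1.txt.
    public static Map<String, Double> wordSignals(
            List<Map<String, Integer>> docs, Map<String, Long> globalWc) {
        Map<String, Double> signals = new HashMap<>();
        for (Map<String, Integer> doc : docs) {
            for (Map.Entry<String, Integer> entry : doc.entrySet()) {
                Long global = globalWc.get(entry.getKey());
                if (global == null) {
                    continue; // word is not in the Gt1 list, so it is skipped
                }
                double local = localNormalizedSignal(entry.getValue(), global);
                signals.merge(entry.getKey(), local, Double::sum);
            }
        }
        return signals;
    }

    public static void main(String[] args) {
        // Made-up counts: "gene" occurs once in one citation, three times in another.
        Map<String, Long> globalWc = Map.of("gene", 4L);
        List<Map<String, Integer>> docs =
                List.of(Map.of("gene", 1), Map.of("gene", 3, "rare", 1));
        System.out.println(wordSignals(docs, globalWc));
    }
}
```

    In this sketch a word concentrated in a single document has no local noise (log2 of 1 is 0), while a word spread thinly over many documents contributes a smaller signal per occurrence.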

  • Output file:
    • wordSignalWcDcGt1.txt, used in JDI
      Columns: Word, word signal (normalized), total word count, total document count