Text Categorization

Pre-Process: Word-Signal-Wc-Dc (Gt1)

  • Description:
    This file contains Word-Signal-Wc-Dc information for all words in the training set (MEDLINE) with a document count greater than 1. The word signal is the normalized total word count and is used in place of the raw word count in JDI.

  • Input:
    • wordWcDcGt1.txt
    • uiTiAbWords.${NUM}.txt

  • Java Files & Algorithm:
    • GenerateWordSignalWcDc.java
    • Read global word count and document count from wordWcDcGt1.txt
    • Calculate word signal
      • Read in all words (Gt1) of titles and abstracts from uiTiAbWords.${NUM}.txt
      • Update the local word count of each word for each citation (UI/PMID) from uiTiAbWords.${NUM}.txt if the word is in wordWcDcGt1.txt
      • Calculate normalized global word signal:

        global Wc: the total count of the word over the entire training set

        For each document in the training set, calculate:
        • local Wc: the count of the word in the current document
        • local noise = (local Wc / global Wc) * log2(global Wc / local Wc)
        • local signal = log2(global Wc) - local noise
        • local signal weight = local Wc * local signal
        • local normalized signal = log2(1 + local signal weight)
      • Add local normalized signal to global word signal (used as normalized word count in the next step)
    • Print out Word-Signal-Wc-Dc
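
The per-document calculation above can be sketched in Java as follows. This is a minimal illustration, not the code in GenerateWordSignalWcDc.java. The class name WordSignalSketch, the method names, and the in-memory maps standing in for wordWcDcGt1.txt and uiTiAbWords.${NUM}.txt are all assumptions. It also assumes that local Wc is the count of one word in one citation and that global Wc is the count of that word over the whole training set.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the word-signal steps; not the actual GenerateWordSignalWcDc.java.
class WordSignalSketch {

    // Base-2 logarithm, since java.lang.Math only offers natural and base-10 logs.
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // Normalized signal contributed by one document for one word.
    // Assumes 1 <= localWc <= globalWc.
    static double localNormalizedSignal(long localWc, long globalWc) {
        double localNoise = ((double) localWc / globalWc) * log2((double) globalWc / localWc);
        double localSignal = log2(globalWc) - localNoise;
        double localSignalWeight = localWc * localSignal;
        return log2(1.0 + localSignalWeight);
    }

    // Sums the local normalized signals of every word over all documents.
    // globalWc: word -> global word count (stand-in for wordWcDcGt1.txt).
    // docs: one map per citation, word -> local word count (stand-in for uiTiAbWords files).
    static Map<String, Double> wordSignals(Map<String, Long> globalWc,
                                           List<Map<String, Long>> docs) {
        Map<String, Double> signals = new HashMap<>();
        for (Map<String, Long> doc : docs) {
            for (Map.Entry<String, Long> entry : doc.entrySet()) {
                Long global = globalWc.get(entry.getKey());
                if (global == null) {
                    continue; // word is not in the Gt1 list, so it is skipped
                }
                double local = localNormalizedSignal(entry.getValue(), global);
                signals.merge(entry.getKey(), local, Double::sum);
            }
        }
        return signals;
    }

    public static void main(String[] args) {
        // Made-up data: "gene" occurs twice in each of two citations (global Wc = 4).
        Map<String, Long> globalWc = Map.of("gene", 4L);
        List<Map<String, Long>> docs = List.of(
                Map.of("gene", 2L, "rare", 1L),
                Map.of("gene", 2L));
        System.out.println(wordSignals(globalWc, docs));
    }
}
```

For the made-up word in main, each citation gives local noise = (2/4) * log2(4/2) = 0.5, local signal = log2(4) - 0.5 = 1.5, local signal weight = 2 * 1.5 = 3, and local normalized signal = log2(1 + 3) = 2, so the word signal summed over both citations is about 4.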

  • Output file:
    • wordSignalWcDcGt1.txt, used in JDI
      Word | Word signal (normalized) | Total word count | Total document count