Text Categorization

Pre-Process: Ui-Ti-Ab-Words

  • Description:
    This file contains all words from the titles and abstracts in the training set (MEDLINE). Words are tokenized and then filtered according to the rules and algorithm.

  • Input files:
    • uiTiWords.${NUM}.txt
    • uiAbWords.${NUM}.txt

  • Java Files & Algorithm:
    • Read in titles from uiTiWords.${NUM}.txt
    • Read in abstracts from uiAbWords.${NUM}.txt
    • Combine uiTiWords.${NUM}.txt and uiAbWords.${NUM}.txt by PMID
    • Print to uiTiAbWords.${NUM}.txt
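
    The combine step above can be sketched as follows. This is a minimal illustration, not the project's actual Java source: the class name UiTiAbWordsSketch, its methods, and the "PMID|words" line layout are all assumptions made for the example, since the real file format is not specified here. The sketch works on lists of lines in memory so the merge logic stays visible.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of combining title and abstract words by PMID.
public class UiTiAbWordsSketch {
    // Assumed field separator; the real delimiter is not documented here.
    static final String SEP = "|";

    // Parse "PMID|words" lines into a PMID -> words map, keeping input order.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> byPmid = new LinkedHashMap<>();
        for (String line : lines) {
            int i = line.indexOf(SEP);
            if (i < 0) {
                continue; // skip lines that have no separator
            }
            byPmid.put(line.substring(0, i), line.substring(i + 1).trim());
        }
        return byPmid;
    }

    // Join title words and abstract words for each PMID.
    static List<String> combine(List<String> titleLines, List<String> abstractLines) {
        Map<String, String> combined = parse(titleLines);
        for (Map.Entry<String, String> e : parse(abstractLines).entrySet()) {
            // Append abstract words to the title words, or add the PMID
            // as a new entry if it had no title line.
            combined.merge(e.getKey(), e.getValue(),
                    (ti, ab) -> (ti + " " + ab).trim());
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : combined.entrySet()) {
            out.add(e.getKey() + SEP + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample lines, for illustration only.
        List<String> ti = List.of("100|heart failure", "101|gene expression");
        List<String> ab = List.of("100|patients cardiac", "102|protein");
        combine(ti, ab).forEach(System.out::println);
    }
}
```

    In a file-based version, the input lines could be loaded with java.nio.file.Files.readAllLines and the result written with Files.write. A PMID that appears in only one of the two inputs is kept with whatever words it has, which is a design choice of this sketch rather than documented behavior.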

  • Output:
    • TIAB/uiTiAbWords.${NUM}.txt, used to generate Wc and Dc
      Columns: PMID; words from title and abstract