Text Categorization

Pre-Process: Ui-Ti-Ab-Words

Description:
This file contains all words from the titles and abstracts in the training set (MEDLINE). The words are tokenized and filtered according to the pre-processing rules and algorithm.
Input files:
- uiTiWords.${NUM}.txt
- uiAbWords.${NUM}.txt
Java Files & Algorithm:
- Read in title words from uiTiWords.${NUM}.txt
- Read in abstract words from uiAbWords.${NUM}.txt
- Combine uiTiWords.${NUM}.txt and uiAbWords.${NUM}.txt by PMID
- Print to uiTiAbWords.${NUM}.txt
Output:
- TIAB/uiTiAbWords.${NUM}.txt, used to generate Wc and Dc
  
  Fields: PMID, words from title and abstract
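
The combine-by-PMID step described above can be sketched in Java roughly as follows. This is a minimal illustration and not the project's actual code: the class name UiTiAbWordsSketch, the methods combine and format, and the "|" separator between the PMID and the word list are all assumptions, since the source does not specify the file layout. The sketch works on lines that have already been read into memory, appends abstract words after title words for the same PMID, and keeps PMIDs that appear in only one of the two inputs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class UiTiAbWordsSketch {
    // Assumed separator between the PMID and its words (not confirmed by the source).
    static final String SEP = "|";

    // Combine title lines and abstract lines into one word string per PMID.
    // Insertion order is preserved: PMIDs appear in the order they are first seen.
    static Map<String, String> combine(List<String> tiLines, List<String> abLines) {
        Map<String, String> byPmid = new LinkedHashMap<>();
        addAll(byPmid, tiLines);  // title words first
        addAll(byPmid, abLines);  // abstract words are appended after them
        return byPmid;
    }

    private static void addAll(Map<String, String> byPmid, List<String> lines) {
        for (String line : lines) {
            int cut = line.indexOf(SEP);
            if (cut < 0) {
                continue; // skip lines that do not contain the assumed separator
            }
            String pmid = line.substring(0, cut).trim();
            String words = line.substring(cut + SEP.length()).trim();
            if (pmid.isEmpty()) {
                continue; // skip lines without a PMID
            }
            // Join with a single space, avoiding stray spaces when one side is empty.
            byPmid.merge(pmid, words,
                    (a, b) -> b.isEmpty() ? a : (a.isEmpty() ? b : a + " " + b));
        }
    }

    // Render one output line per PMID, in the same assumed "PMID|words" layout.
    static List<String> format(Map<String, String> byPmid) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : byPmid.entrySet()) {
            out.add(e.getKey() + SEP + e.getValue());
        }
        return out;
    }
}
```

In a real run, the two input lists would come from uiTiWords.${NUM}.txt and uiAbWords.${NUM}.txt (for example via java.nio.file.Files.readAllLines), and the formatted lines would be written to TIAB/uiTiAbWords.${NUM}.txt. A LinkedHashMap is used here only so that the output order follows the input order; the actual ordering in the original pipeline is not stated.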