
Text Categorization

Pre-Process: Ui-Ti-Ab-Words

  • Description:
    This file contains all words from the titles and abstracts in the training set (MEDLINE). The words are tokenized and filtered according to the pre-processing rules and algorithm.

  • Input files:
    • uiTiWords.${NUM}.txt
    • uiAbWords.${NUM}.txt

  • Java Files & Algorithm:
    • Read in title words from uiTiWords.${NUM}.txt
    • Read in abstract words from uiAbWords.${NUM}.txt
    • Combine uiTiWords.${NUM}.txt and uiAbWords.${NUM}.txt by PMID
    • Print to uiTiAbWords.${NUM}.txt
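
The combine step above can be sketched in Java roughly as follows. This is a minimal illustration, not the project's actual source. It assumes each input line has the form PMID|word1 word2 ... (the real field separator is not stated here), and the class and method names (UiTiAbCombiner, combine, combineFiles) are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; not the project's actual Java file.
class UiTiAbCombiner {
    // ASSUMPTION: PMID and words are separated by '|' on each line.
    private static final String SEP = "|";

    /** Appends the words on each line to the entry for that line's PMID. */
    private static void addLines(List<String> lines, Map<String, StringBuilder> byPmid) {
        for (String line : lines) {
            int cut = line.indexOf(SEP);
            if (cut < 0) {
                continue; // skip blank or malformed lines
            }
            String pmid = line.substring(0, cut).trim();
            String words = line.substring(cut + 1).trim();
            StringBuilder sb = byPmid.computeIfAbsent(pmid, k -> new StringBuilder());
            if (!words.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(words);
            }
        }
    }

    /** Merges title and abstract lines by PMID, title words first. */
    static List<String> combine(List<String> titleLines, List<String> abstractLines) {
        // LinkedHashMap keeps PMIDs in the order they are first seen.
        Map<String, StringBuilder> byPmid = new LinkedHashMap<>();
        addLines(titleLines, byPmid);
        addLines(abstractLines, byPmid);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, StringBuilder> e : byPmid.entrySet()) {
            out.add(e.getKey() + SEP + e.getValue());
        }
        return out;
    }

    /** Reads both input files, merges them, and writes the combined file. */
    static void combineFiles(Path tiFile, Path abFile, Path outFile) throws IOException {
        List<String> merged = combine(
                Files.readAllLines(tiFile, StandardCharsets.UTF_8),
                Files.readAllLines(abFile, StandardCharsets.UTF_8));
        Files.write(outFile, merged, StandardCharsets.UTF_8);
    }
}
```

In this sketch, a PMID that appears in only one of the two inputs is still written out with whatever words it has. Whether the original program keeps or drops such records is not stated in this section.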

  • Output:
    • TIAB/uiTiAbWords.${NUM}.txt, used to generate Wc and Dc
      Columns: PMID, Words from title and abstract