
Text Categorization

PreProcess: Restrictwords

  • Description:
    Restricted words are the set of words in the working domain. In the JDI project, the restricted words of the tc2007 release (2004 MEDLINE) are words from UMLS-Metathesaurus, 2003AC, MRCON. For the tc2008 release, UMLS-Metathesaurus, 2007AC, MRCON is used. Users are allowed to define their own restrictwords list.

  • Input:

  • Java File & Algorithm:
    • GenerateRestrictWords.java
      • Read in MRCON (2.7.1.3.4 Concept Names and Sources) and take the 7th field (STR) when
        • 2nd field (LAT) is |ENG|, language of term is English
        • 3rd field (TS) is |P| (or |S|: TBD), term status is preferred LUI of the CUI
        • 5th field (STT) is |PF|, string type is preferred form
          Field   CUI  LAT  TS   LUI  STT  SUI  STR  LRL
          Filter       ENG  P/S       PF
        • break up the string into words
          • ignore case
          • strip non-alphanumeric characters from the start and end of each word
          • expand contractions
          • remove punctuation
          • remove words that start with a number
          • remove words shorter than 3 characters
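
    The filter and word-splitting steps above can be sketched roughly as follows. This is a minimal illustration, not the actual GenerateRestrictWords.java: the class and method names are hypothetical, the sample rows in main are made up, only |P| is accepted for TS (|S| is marked TBD above), the order of the cleaning steps is a guess, and contraction expansion is left out because the page does not say which mapping is used. Word counts are kept so that a frequency cutoff could be applied later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the MRCON filter and word clean-up; not the real tool.
class RestrictWordsSketch {

    // MRCON row layout: CUI|LAT|TS|LUI|STT|SUI|STR|LRL|
    // Returns STR (7th field) when the row passes the filter, otherwise null.
    static String selectStr(String mrconLine) {
        String[] f = mrconLine.split("\\|", -1);
        if (f.length < 7) {
            return null;
        }
        boolean english = f[1].equals("ENG");      // 2nd field (LAT)
        boolean preferredTerm = f[2].equals("P");  // 3rd field (TS); |S| is TBD
        boolean preferredForm = f[4].equals("PF"); // 5th field (STT)
        return (english && preferredTerm && preferredForm) ? f[6] : null;
    }

    // Splits one STR value into cleaned words.
    // Assumption: only ASCII letters and digits count as alphanumeric.
    static List<String> toWords(String str) {
        List<String> words = new ArrayList<>();
        for (String token : str.toLowerCase(Locale.ROOT).split("\\s+")) {
            // Strip non-alphanumeric characters at the start and end.
            String w = token.replaceAll("^[^a-z0-9]+|[^a-z0-9]+$", "");
            // Remove any punctuation left inside the word.
            w = w.replaceAll("[^a-z0-9]", "");
            if (w.length() < 3) {
                continue; // fewer than 3 characters
            }
            if (Character.isDigit(w.charAt(0))) {
                continue; // starts with a number
            }
            words.add(w);
        }
        return words;
    }

    // Counts how often each cleaned word occurs across all accepted rows.
    static Map<String, Integer> countWords(List<String> mrconLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : mrconLines) {
            String str = selectStr(line);
            if (str == null) {
                continue;
            }
            for (String w : toWords(str)) {
                counts.merge(w, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up rows for illustration only; not taken from a UMLS release.
        List<String> rows = Arrays.asList(
            "C0000001|ENG|P|L0000001|PF|S0000001|Heart Failure, Congestive|0|",
            "C0000002|FRE|P|L0000002|PF|S0000002|Insuffisance cardiaque|0|");
        System.out.println(countWords(rows));
    }
}
```

    In this sketch a row is dropped as soon as any one of the three filter fields fails, so the French row in main contributes no words.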

  • Output file:
    • restrictWords.txt
      Restrictword

    • restrictWordsGt1.txt, apply frequency, used in TC.JDI
      Restrictword
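
    One possible way to produce the two output files from a table of word counts is sketched below. It rests on assumptions the page does not confirm: that each file holds one restrict word per line, and that "Gt1" together with "apply frequency" means keeping only words that occur more than once. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of writing the two word lists; not the real tool.
class RestrictWordsOutputSketch {

    // Every word, sorted alphabetically.
    static List<String> allWords(Map<String, Integer> counts) {
        return counts.keySet().stream().sorted().collect(Collectors.toList());
    }

    // Words whose count is strictly greater than the threshold, sorted.
    // Assumption: restrictWordsGt1.txt corresponds to threshold = 1.
    static List<String> frequentWords(Map<String, Integer> counts, int threshold) {
        return counts.entrySet().stream()
            .filter(e -> e.getValue() > threshold)
            .map(Map.Entry::getKey)
            .sorted()
            .collect(Collectors.toList());
    }

    // Writes both files into dir, one word per line, UTF-8 encoded.
    static void writeOutputs(Map<String, Integer> counts, Path dir) throws IOException {
        Files.write(dir.resolve("restrictWords.txt"), allWords(counts));
        Files.write(dir.resolve("restrictWordsGt1.txt"), frequentWords(counts, 1));
    }
}
```

    Under these assumptions restrictWordsGt1.txt is always a subset of restrictWords.txt.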

  • Notes: