Text Categorization

PreProcess: Restrictwords

  • Description:
    Restricted words are the set of words in the working domain. In the JDI project, the restricted words for the tc2007 release (2004 MEDLINE) are words from UMLS-Metathesaurus, 2003AC, MRCON. Users are allowed to define their own restrict words list. The tc2008 release uses UMLS-Metathesaurus, 2007AC, MRCON.

  • Input:

  • Java File & Algorithm:
    • GenerateRestrictWords.java
      • Read MRCON (2.7.1.3.4 Concept Names and Sources) and take the 7th field (STR) when
        • 2nd field (LAT) is |ENG|, language of term is English
        • 3rd field (TS) is |P| (or |S|: TBD), term status is preferred LUI of the CUI
        • 5th field (STT) is |PF|, string type is preferred form
          Field   CUI  LAT  TS   LUI  STT  SUI  STR  LRL
          Filter       ENG  P/S       PF
        • break the string into words
          • ignore case
          • strip non-alphanumeric characters from the start and end of each word
          • expand contractions
          • remove punctuation
          • remove words that start with a digit
          • remove words shorter than 3 characters
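
      The filtering and word-splitting steps above can be sketched roughly as follows. This is a minimal illustration, not the project's GenerateRestrictWords.java: the class name RestrictWordsSketch, the methods preferredEnglishString and toWords, the small contraction table, and the sample rows are all hypothetical. It also assumes that only TS = P is accepted (the text leaves S as TBD), that words are ASCII, and that punctuation inside a word is deleted rather than treated as a separator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the steps listed above; not the project's actual code.
class RestrictWordsSketch {

    // Assumption: a tiny illustrative table; the real contraction list is not given.
    static final Map<String, String> CONTRACTIONS = Map.of(
            "can't", "cannot",
            "don't", "do not",
            "it's", "it is");

    // Returns STR (7th field) when LAT is ENG, TS is P and STT is PF; otherwise null.
    static String preferredEnglishString(String mrconLine) {
        // MRCON row layout: CUI|LAT|TS|LUI|STT|SUI|STR|LRL|
        String[] f = mrconLine.split("\\|", -1);
        if (f.length < 7) {
            return null;
        }
        boolean keep = f[1].equals("ENG")
                && f[2].equals("P")   // whether S also qualifies is TBD in the text
                && f[4].equals("PF");
        return keep ? f[6] : null;
    }

    // Breaks one STR value into candidate restrict words.
    static List<String> toWords(String str) {
        List<String> words = new ArrayList<>();
        // Ignore case, then split on whitespace.
        for (String token : str.toLowerCase(Locale.ROOT).split("\\s+")) {
            // Strip non-alphanumeric characters from the start and end of the word.
            String trimmed = token.replaceAll("^[^a-z0-9]+|[^a-z0-9]+$", "");
            // Expand contractions; an expansion may yield more than one word.
            String expanded = CONTRACTIONS.getOrDefault(trimmed, trimmed);
            for (String part : expanded.split(" ")) {
                // Remove any remaining (inner) ASCII punctuation.
                String word = part.replaceAll("\\p{Punct}", "");
                // Drop words shorter than 3 characters or starting with a digit.
                if (word.length() < 3 || Character.isDigit(word.charAt(0))) {
                    continue;
                }
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Made-up rows in MRCON layout, for illustration only.
        String[] rows = {
            "C0000001|ENG|P|L0000001|PF|S0000001|Aspirin tablets|0|",
            "C0000001|FRE|P|L0000002|PF|S0000002|Aspirine|0|"
        };
        Set<String> restrictWords = new TreeSet<>();  // unique, sorted
        for (String row : rows) {
            String str = preferredEnglishString(row);
            if (str != null) {
                restrictWords.addAll(toWords(str));
            }
        }
        restrictWords.forEach(System.out::println);
    }
}
```

      Collecting the words in a set keeps each restrict word once. Accepting TS = S as well would only require widening the check on the third field.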

  • Output file:
    • restrictWords.txt
      Restrictword

    • restrictWordsGt1.txt: restrict words after the frequency filter is applied; used in TC.JDI
      Restrictword

  • Notes: