PreProcess: Restrictwords
- Description:
Restricted words are the set of words in the working domain. In the JDI project, the restricted words for the tc2007 release (2004 MEDLINE) are taken from UMLS-Metathesaurus, 2003AC, MRCON; the tc2008 release uses UMLS-Metathesaurus, 2007AC, MRCON. Users may also define their own restricted-words list.
- Input:
- Java File & Algorithm:
- GenerateRestrictWords.java
- Read MRCON (2.7.1.3.4 Concept Names and Sources) and take the 7th field (STR) of each row where:
- 2nd field (LAT) is |ENG|: the language of the term is English
- 3rd field (TS) is |P| (or |S|: TBD): the term status is preferred LUI of the CUI
- 5th field (STT) is |PF|: the string type is preferred form
| Field | CUI | LAT | TS | LUI | STT | SUI | STR | LRL |
|---|---|---|---|---|---|---|---|---|
| Filter | | ENG | P/S | | PF | | | |
- break up string into words
- ignore case
- strip non-alphanumeric characters from the start and end of each word
- expand contractions
- remove punctuation
- remove words that start with a number
- remove words shorter than 3 characters
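The filtering and clean-up steps listed above can be sketched in Java roughly as follows. This is a minimal illustration, not the actual GenerateRestrictWords.java: the class and method names (`RestrictWordsSketch`, `selectStr`, `toWords`) are hypothetical, only TS = P is accepted (the source leaves S as TBD), whitespace tokenization and ASCII-only character classes are assumptions, and contraction expansion is omitted because its rules are not specified here. The sample row in `main` is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of the MRCON filter and word clean-up steps.
final class RestrictWordsSketch {

    // Returns the STR field (7th) when LAT=ENG, TS=P and STT=PF; otherwise null.
    static String selectStr(String mrconLine) {
        // MRCON rows are '|'-delimited: CUI|LAT|TS|LUI|STT|SUI|STR|LRL|
        String[] f = mrconLine.split("\\|", -1);
        if (f.length < 8) {
            return null; // malformed row
        }
        boolean english = f[1].equals("ENG");
        boolean preferredTerm = f[2].equals("P"); // assumption: S is not accepted
        boolean preferredForm = f[4].equals("PF");
        return (english && preferredTerm && preferredForm) ? f[6] : null;
    }

    // Breaks a string into cleaned, lowercased words.
    static List<String> toWords(String str) {
        List<String> words = new ArrayList<>();
        // Ignore case, then split on whitespace (assumed word boundary).
        for (String token : str.toLowerCase(Locale.ROOT).split("\\s+")) {
            // Strip non-alphanumeric characters from both ends.
            String w = token.replaceAll("^[^a-z0-9]+|[^a-z0-9]+$", "");
            // Remove remaining (ASCII) punctuation inside the word.
            w = w.replaceAll("\\p{Punct}", "");
            if (w.length() < 3) {
                continue; // shorter than 3 characters
            }
            if (Character.isDigit(w.charAt(0))) {
                continue; // starts with a number
            }
            words.add(w);
        }
        return words;
    }

    public static void main(String[] args) {
        // Made-up sample row, for illustration only.
        String row = "C0000001|ENG|P|L0000001|PF|S0000001|Example Term, 2nd form|0|";
        String str = selectStr(row);
        if (str != null) {
            System.out.println(toWords(str)); // prints [example, term, form]
        }
    }
}
```

Rows that fail any of the three field tests yield `null` and contribute no words.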
- Output file:
- restrictWords.txt
- restrictWordsGt1.txt: the word list after frequency is applied; used in TC.JDI
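How frequency is applied to produce restrictWordsGt1.txt is not spelled out here. One plausible reading of the file name, shown below purely as an assumption, is that words are counted across all selected strings and only those occurring more than once are kept. The class and method names (`RestrictWordFrequencySketch`, `countWords`, `moreFrequentThan`) are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical sketch: count words, then keep those above a threshold.
final class RestrictWordFrequencySketch {

    // Counts occurrences of each word; TreeMap keeps the output sorted.
    static Map<String, Integer> countWords(List<String> words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    // Returns words whose count is strictly greater than minExclusive.
    // Assumption: "Gt1" means count > 1, i.e. minExclusive = 1.
    static List<String> moreFrequentThan(Map<String, Integer> counts, int minExclusive) {
        return counts.entrySet().stream()
                .filter(e -> e.getValue() > minExclusive)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

Under this reading, restrictWords.txt would correspond to the keys of `countWords(...)` and restrictWordsGt1.txt to `moreFrequentThan(counts, 1)`.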
- Notes: