Text Categorization

PreProcess: ST Documents

Description:
An ST "Document" collects all words associated with a Semantic Type (ST) from MRCONSO.RRF (which replaces MRCON) and MRSTY. It is the ST-Word table for the domain data. A string contributes to an ST "Document" only if it meets the following criteria:
- 2nd field (LAT): language is |ENG| (English)
- 17th field (SUPPRESS): suppressible flag is 'N' (that is, not O, E, or Y)
- 12th field (SAB) is not in the SAB_Out list
- 13th field (TTY) must be in the TTY_In list
- 15th field (STR):
  - Remove acronyms (except those from all-capital sources)
  - Normalize STR
  - Filter out illegal normalized strings
- 2nd run:
  - Remove acronyms from all-capital sources (SAB)
  - Remove redundant strings with a duplicate normalized string
- Collect the ST-words as the stDocument
Input:
- ash:/u03/umls/Releases/2007AC/Full/RRF/META/MRCONSO.RRF
  
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
  CUI LAT TS LUI STT SUI ISPREF AUI SAUI SCUI SDUI SAB TTY CODE STR SRL SUPPRESS CVF
- ash:/u03/umls/Releases/2007AC/Full/ORF/META/MRSTY
  
  CUI TUI STY
- Semantic Type
  
  Index TUI ST abbreviation ST Name
- Semantic Type Groups
  
  ST GROUP ABBR ST GROUP NAME TUI STY
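
The 18-field MRCONSO.RRF layout above can be read with a small helper. This is a minimal sketch, not code from GenerateStDocument.java: the class name MrconsoLine and its accessors are hypothetical. It assumes the standard RRF convention that fields are separated by '|' and that each line ends with a trailing '|'.

```java
import java.util.Arrays;

/** Hypothetical helper: splits one pipe-delimited MRCONSO.RRF line into 18 fields. */
final class MrconsoLine {
    static final int FIELD_COUNT = 18;

    private final String[] fields;

    private MrconsoLine(String[] fields) {
        this.fields = fields;
    }

    static MrconsoLine parse(String line) {
        // A limit of -1 keeps empty fields (for example an empty SAUI or CVF).
        String[] parts = line.split("\\|", -1);
        if (parts.length < FIELD_COUNT) {
            throw new IllegalArgumentException(
                    "Expected at least " + FIELD_COUNT + " fields, got " + parts.length);
        }
        // Drop the empty element produced by the trailing '|'.
        return new MrconsoLine(Arrays.copyOf(parts, FIELD_COUNT));
    }

    /** Returns a field by the 1-based index used throughout this document. */
    String field(int oneBasedIndex) {
        return fields[oneBasedIndex - 1];
    }

    String cui()      { return field(1); }
    String lat()      { return field(2); }
    String sab()      { return field(12); }
    String tty()      { return field(13); }
    String str()      { return field(15); }
    String suppress() { return field(17); }
}
```

Using 1-based indexes in field(...) keeps the code aligned with the "2nd field", "12th field" wording used in the algorithm description.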


Java File & Algorithm:

GenerateStDocument.java

Get ST-words with the following procedure (1st run, from MRCONSO.RRF):
- 2nd field (LAT) is |ENG| (English)
- 17th field (SUPPRESS) is 'N' (not suppressed)
- 12th field (SAB) is not in the SAB_Out list
- 13th field (TTY) must be in the TTY_In list
- 15th field (STR)
  - Remove STR if it is an acronym, and save the acronym in the Acronym_Out list. An acronym is defined as a string whose first two characters are uppercase letters. Exceptions are:
    - Acronyms in Acronyms_In list
    - All capital STR sources (SAB/TTY):
      - CDT
      - CCPSS
      - SPN
      - MTHFDA
      - VANDF
      - COSTAR/PT (All capital STR, and no Acronyms)
  - Normalize STR
    - trim: remove spaces at the beginning and the end
    - remove NEC, NOS, and their combinations at the end
    - remove ambiguity tag, <n>
    - lower case
    - replace punctuation with space
    - tokenize into words and recompose the words into a normalized string
  - Filter normalized string
    - Filter out multi-word strings (use single words only)
    - Filter out single words of length <= 2 (min. length is 3)
    - Filter out single words without any alphabetic characters
    - Filter out single words that begin with a numeric character
  - Save the normalized string in the NormStr_Out list if lowercase STR = normalized STR
- Write the results to preNormMRCONSO.txt, with the normalized string in the 19th field and the legal flag in the 20th field.
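
The string-level steps of the 1st run (acronym test, normalization, one-word filter) can be sketched as follows. This is an illustration under stated assumptions, not the code in GenerateStDocument.java: the class and method names are hypothetical, the regular expression for trailing NEC/NOS combinations is a guess at what "their combinations" covers, the ambiguity tag is assumed to look like <1> or <2>, and the acronym exceptions (Acronyms_In list, all-capital sources) are left out.

```java
import java.util.Locale;

/** Hypothetical sketch of the first-run string checks. */
final class StWordNormalizer {
    private StWordNormalizer() {}

    /** Acronym test as described: the first two characters are uppercase letters. */
    static boolean isAcronym(String str) {
        return str.length() >= 2
                && Character.isUpperCase(str.charAt(0))
                && Character.isUpperCase(str.charAt(1));
    }

    /** Applies the normalization steps in the order they are listed. */
    static String normalize(String str) {
        String s = str.trim();
        // Trailing NEC / NOS, including forms such as ", NOS" or "NEC/NOS" (assumed pattern).
        s = s.replaceAll("(?:[\\s,/(-]*\\b(?:NEC|NOS)\\b\\)?)+\\s*$", "");
        // Ambiguity tag, assumed to be a number in angle brackets.
        s = s.replaceAll("\\s*<\\d+>", "");
        s = s.toLowerCase(Locale.ENGLISH);
        // \p{Punct} covers ASCII punctuation only.
        s = s.replaceAll("\\p{Punct}", " ");
        // Tokenize on whitespace and rejoin with single spaces.
        return String.join(" ", s.trim().split("\\s+"));
    }

    /** One-word filter: single token, length >= 3, has a letter, no leading digit. */
    static boolean isLegalWord(String normStr) {
        if (normStr.indexOf(' ') >= 0) {
            return false; // multi-word string
        }
        if (normStr.length() < 3) {
            return false;
        }
        if (Character.isDigit(normStr.charAt(0))) {
            return false;
        }
        return normStr.chars().anyMatch(Character::isLetter);
    }
}
```

A caller would run isAcronym on the raw STR first, then pass normalize(str) to isLegalWord, and add the result to NormStr_Out when the lowercased STR already equals the normalized string.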
2nd run filter (goes through preNormMRCONSO.txt):
- Do nothing if the 20th field (legal flag) is not T (not legal)
- If the 20th field is T (legal)
  - Remove lines whose STR is an acronym (in the saved Acronym_Out list) and comes from an all-capital STR source (SAB)
    - CDT
    - CCPSS
    - SPN
    - MTHFDA
    - VANDF
  - Remove lines whose normalized STR (19th field) is in the saved NormStr_Out list when the lowercase STR (15th field) != normalized STR
  - Mark the 21st field with the 2nd legal flag
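
The 2nd-run decision can be sketched as a single function that returns the flag values listed under Output Files (F, A, N, T). The class and method names are hypothetical. The sketch also assumes that an acronym is matched by comparing the raw STR exactly against the saved Acronym_Out list, which the description does not spell out.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of the second-run legal flag. */
final class SecondRunFlag {
    private SecondRunFlag() {}

    /** SABs the document lists as all-capital STR sources for the 2nd run. */
    static final Set<String> ALL_CAPITAL_SABS = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList("CDT", "CCPSS", "SPN", "MTHFDA", "VANDF")));

    static String flag(String firstRunFlag, String sab, String str, String normStr,
                       Set<String> acronymOut, Set<String> normStrOut) {
        if (!"T".equals(firstRunFlag)) {
            return "F"; // not legal after the 1st run
        }
        // Assumption: exact match of the raw STR against saved acronyms.
        if (ALL_CAPITAL_SABS.contains(sab) && acronymOut.contains(str)) {
            return "A";
        }
        boolean alreadyNormalized = str.toLowerCase(Locale.ENGLISH).equals(normStr);
        if (normStrOut.contains(normStr) && !alreadyNormalized) {
            return "N"; // duplicate of a string that was normalized to begin with
        }
        return "T";
    }
}
```

Both lists must be complete before this pass starts, which is why the filtering is split into two runs.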
For each string, the same word is counted only once in the word list
Use CUI to get Semantic Type (TUI) from MRSTY
Use TUI to find ST Groups and send to different files
For each ST, print out the ST-Documents (ST-words), sorted alphabetically:
- Words (stDocument.txt)
- Words only related to one ST Group (stDocument1.txt)
- Words related to multiple ST Groups (stDocument2.txt)
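
The CUI to TUI to ST Group lookup, and the split between words tied to one ST Group and words tied to several, can be sketched as follows. The class name, method names, and in-memory map shapes are assumptions made for illustration. The document does not say how GenerateStDocument.java holds MRSTY or the ST Groups table.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch: map words to ST groups and split by group count. */
final class StGroupSplitter {
    private StGroupSplitter() {}

    /** Each pair is {CUI, word}; sets ensure a word is counted once per group. */
    static SortedMap<String, SortedSet<String>> groupsByWord(
            List<String[]> cuiWordPairs,
            Map<String, Set<String>> tuisByCui,
            Map<String, String> groupByTui) {
        SortedMap<String, SortedSet<String>> result = new TreeMap<>();
        for (String[] pair : cuiWordPairs) {
            String cui = pair[0];
            String word = pair[1];
            for (String tui : tuisByCui.getOrDefault(cui, Collections.<String>emptySet())) {
                String group = groupByTui.get(tui);
                if (group != null) {
                    result.computeIfAbsent(word, w -> new TreeSet<>()).add(group);
                }
            }
        }
        return result;
    }

    /** Words related to exactly one ST group, in alphabetical order. */
    static SortedSet<String> wordsInOneGroup(Map<String, SortedSet<String>> groupsByWord) {
        return select(groupsByWord, false);
    }

    /** Words related to more than one ST group, in alphabetical order. */
    static SortedSet<String> wordsInSeveralGroups(Map<String, SortedSet<String>> groupsByWord) {
        return select(groupsByWord, true);
    }

    private static SortedSet<String> select(
            Map<String, SortedSet<String>> groupsByWord, boolean several) {
        SortedSet<String> words = new TreeSet<>();
        for (Map.Entry<String, SortedSet<String>> entry : groupsByWord.entrySet()) {
            if ((entry.getValue().size() > 1) == several) {
                words.add(entry.getKey());
            }
        }
        return words;
    }
}
```

TreeMap and TreeSet keep the words in alphabetical order, which matches the sorted output described for the stDocument files.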

1st run algorithm table:

Field	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18
Field	CUI	LAT	TS	LUI	STT	SUI	ISPREF	AUI	SAUI	SCUI	SDUI	SAB	TTY	CODE	STR	SRL	SUPPRESS	CVF
Filter		ENG										Not in SAB_Out list	Must be in TTY_In list		Not an acronym; filter normalized string		N

Output Files:
- stDocument.txt
  
  ST one word String
- preNormMRCONSO.txt and normMRCONSO.txt
  - The first 18 fields are the same as MRCONSO.
  - 19th field: normalized STR (from the 15th field), using the algorithm above
  - 20th field: legal flag:
    - 0: default setting (should not happen)
    - 2: not English
    - 12: Illegal SAB
    - 13: Illegal TTY
    - 15-A: STR is an acronym
    - 15-N: Illegal normalized STR
    - T: legal is true
  - 21st field: 2nd (final) run legal flag:
    - If the 20th field is not "T"
      - F: legal is false
    - If the 20th field is "T"
      - A: STR is an acronym from an all-capital source (SAB)
      - N: Duplicated normalized STR (lowercase STR != normalized string)
      - T: legal is true (normalized string is used in stDocument)
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
  CUI LAT TS LUI STT SUI ISPREF AUI SAUI SCUI SDUI SAB TTY CODE STR SRL SUPPRESS CVF Normalized STR 1st Run Flag 2nd Run Flag
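
The 1st-run flag codes above can be produced by checking the fields in index order. This is a sketch under assumptions: the class name is hypothetical, SAB_Out and TTY_In are passed in as sets because their contents are not given here, and the check order is inferred from the field numbering. The SUPPRESS check is left out because the list above assigns it no code.

```java
import java.util.Set;

/** Hypothetical sketch: derive the 20th-field (1st run) legal flag. */
final class FirstRunFlag {
    private FirstRunFlag() {}

    static String flag(String lat, String sab, String tty,
                       boolean strIsAcronym, boolean normStrIsLegal,
                       Set<String> sabOut, Set<String> ttyIn) {
        if (!"ENG".equals(lat)) {
            return "2";     // not English
        }
        if (sabOut.contains(sab)) {
            return "12";    // illegal SAB
        }
        if (!ttyIn.contains(tty)) {
            return "13";    // illegal TTY
        }
        if (strIsAcronym) {
            return "15-A";  // STR is an acronym
        }
        if (!normStrIsLegal) {
            return "15-N";  // illegal normalized STR
        }
        return "T";         // legal
    }
}
```

Returning on the first failed check means each line carries the code of the lowest-numbered field that rejected it, which makes the flag column easy to tally per filter.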
Notes:
- TBD