Text Categorization

PreProcess: ST Documents

  • Description:

    An ST "Document" collects all words associated with a Semantic Type (ST), taken from MRCONSO.RRF (which replaces MRCON) and MRSTY. It serves as the ST-Word table for the domain data. A MRCONSO.RRF record contributes to an ST "Document" only if it meets the following criteria:

    • 2nd field (LAT): language is |ENG| (English)
    • 17th field (SUPPRESS): suppressible flag is 'N', that is, not 'O', 'E', or 'Y'
    • 12th field (SAB) is not in the SAB_Out list
    • 13th field (TTY) must be in the TTY_In list
    • 15th field (STR):
      • Remove acronyms (except for all-capital sources)
      • Normalize STR
      • Filter the normalized string

    • 2nd run to:
      • Remove acronyms from all-capital sources (SAB)
      • Remove redundant strings that share a duplicate normalized string

    • Collect the resulting ST-words as the stDocument
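
The record-level criteria above (LAT, SUPPRESS, SAB, TTY) can be sketched as a single predicate. This is a minimal illustration and not the code in GenerateStDocument.java: the class name MrconsoRowFilter and the method passes are hypothetical, the contents of SAB_Out and TTY_In are not listed here and are therefore passed in as sets, and the input is assumed to be one pipe-delimited MRCONSO.RRF line.

```java
import java.util.Set;

// Hypothetical helper: checks the four record-level criteria for one MRCONSO.RRF line.
final class MrconsoRowFilter {
    // 0-based indexes of the 2nd, 12th, 13th, and 17th fields.
    static final int LAT = 1, SAB = 11, TTY = 12, SUPPRESS = 16;

    /** Returns true if the row passes the LAT, SUPPRESS, SAB, and TTY criteria. */
    static boolean passes(String line, Set<String> sabOut, Set<String> ttyIn) {
        // Limit -1 keeps trailing empty fields (for example an empty CVF).
        String[] f = line.split("\\|", -1);
        if (f.length < 18) {
            return false; // assumption: rows with fewer than 18 fields are rejected
        }
        return "ENG".equals(f[LAT])
                && "N".equals(f[SUPPRESS])
                && !sabOut.contains(f[SAB])
                && ttyIn.contains(f[TTY]);
    }
}
```

The STR-level steps (acronym removal, normalization, filtering) are deliberately left out of this predicate, since they operate on the string itself rather than on the record's metadata.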

  • Input:
    • ash:/u03/umls/Releases/2007AC/Full/RRF/META/MRCONSO.RRF
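      1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18
      CUI | LAT | TS | LUI | STT | SUI | ISPREF | AUI | SAUI | SCUI | SDUI | SAB | TTY | CODE | STR | SRL | SUPPRESS | CVF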
    • ash:/u03/umls/Releases/2007AC/Full/ORF/META/MRSTY
      CUI | TUI | STY
    • Semantic Type
      Index | TUI | ST abbreviation | ST Name
    • Semantic Type Groups
      ST GROUP ABBR | ST GROUP NAME | TUI | STY
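
Since the algorithm looks up Semantic Types by CUI, the MRSTY input is naturally held as a CUI-to-TUI index. The following is a minimal sketch under stated assumptions: the class name MrstyIndex and the method load are hypothetical, each line is assumed to be pipe-delimited in the CUI | TUI | STY layout shown above, and a CUI may map to more than one TUI.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: builds a CUI -> TUIs lookup from MRSTY lines.
final class MrstyIndex {

    /** Builds a CUI -> set-of-TUIs map from lines of the form CUI|TUI|STY. */
    static Map<String, Set<String>> load(List<String> lines) {
        Map<String, Set<String>> cuiToTuis = new HashMap<>();
        for (String line : lines) {
            String[] f = line.split("\\|", -1);
            if (f.length < 2 || f[0].isEmpty()) {
                continue; // assumption: blank or malformed lines are skipped
            }
            // TreeSet keeps the TUIs of each CUI in sorted order.
            cuiToTuis.computeIfAbsent(f[0], k -> new TreeSet<>()).add(f[1]);
        }
        return cuiToTuis;
    }
}
```

In practice the lines would be read from the MRSTY file, for example with java.nio.file.Files.readAllLines; the list parameter keeps the sketch independent of any particular path.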

  • Java File & Algorithm:
    • GenerateStDocument.java
      • Get ST-words by the following procedure (1st run, from MRCONSO.RRF):
        • 2nd field (LAT) is |ENG| (English)
        • 17th field (SUPPRESS) is 'N' (not suppressed)
        • 12th field (SAB) is not in the SAB_Out list
        • 13th field (TTY) must be in the TTY_In list
        • 15th field (STR)
          • Remove STR if it is an acronym, and save it in the Acronym_Out list. An acronym is defined as a string whose first two characters are uppercase letters. Exceptions are:
            • Acronyms in Acronyms_In list
            • All capital STR sources (SAB/TTY):
              • CDT
              • CCPSS
              • SPN
              • MTHFDA
              • VANDF
              • COSTAR/PT (All capital STR, and no Acronyms)
          • Normalize STR
            • trim: remove spaces at both the beginning and the end
            • remove NEC, NOS, and their combinations at the end
            • remove the ambiguity tag, <n>
            • convert to lower case
            • replace punctuation with spaces
            • tokenize into words and recompose the words into a normalized string
          • Filter normalized string
            • Filter out multi-word strings (use single words only)
            • Filter out single words of length <= 2 (minimum length is 3)
            • Filter out single words without any alphabetic characters
            • Filter out single words that begin with a numeric character
          • If lowercase STR = normalized STR, save the normalized string in the NormStr_Out list
        • Write the results to preNormMRCONSO.txt, with the normalized string in the 19th field and the legal flag in the 20th field.
      • 2nd run filter (goes through preNormMRCONSO.txt):
        • Do nothing if the 20th field (legal flag) is not T (not legal)
        • If the 20th field is T (legal)
          • Remove lines whose STR is an acronym (in the saved Acronym_Out list) and that come from all-capital STR sources (SAB):
            • CDT
            • CCPSS
            • SPN
            • MTHFDA
            • VANDF
          • Remove lines whose normalized STR (19th field) is in the saved NormStr_Out list when the lowercase STR (15th field) != normalized STR
          • Set the 2nd legal flag in the 21st field
      • For each string, count the same word only once in the word list
      • Use CUI to get Semantic Type (TUI) from MRSTY
      • Use TUI to find ST Groups and send to different files
      • For each ST, print out the ST-Documents (ST-Words), sorted alphabetically:
        • Words (stDocument.txt)
        • Words only related to one ST Group (stDocument1.txt)
        • Words related to multiple ST Groups (stDocument2.txt)

      • 1st run algorithm table:
        Field | Name | Filter
        2 | LAT | ENG
        12 | SAB | Not in SAB_Out list
        13 | TTY | Must be in TTY_In list
        15 | STR | Not an acronym; filter normalized string
        17 | SUPPRESS | N
        Fields 1, 3-11, 14, 16, and 18 are not filtered.
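
The STR handling in the 1st run (acronym test, normalization, and word filter) can be sketched as three small functions. This is an illustration under stated assumptions rather than the actual GenerateStDocument.java code: the class and method names are hypothetical, and the regular expressions for trailing NEC/NOS and for the ambiguity tag <n> are guesses at patterns the text only describes in words. The exception lists (Acronyms_In and the all-capital sources) are not modeled.

```java
import java.util.Locale;

// Hypothetical helper: STR-level steps of the 1st run.
final class StrNormalizer {

    /** Acronym test from the text: the first two characters are uppercase letters. */
    static boolean isAcronym(String str) {
        return str.length() >= 2
                && Character.isUpperCase(str.charAt(0))
                && Character.isUpperCase(str.charAt(1));
    }

    /** Applies the normalization steps in the order listed above. */
    static String normalize(String str) {
        String s = str.trim();
        // Assumed pattern: trailing NEC/NOS (or runs of them) preceded by spaces or commas.
        s = s.replaceAll("(?:[\\s,]+(?:NEC|NOS))+$", "");
        // Assumed pattern: the ambiguity tag is a number in angle brackets, e.g. <1>.
        s = s.replaceAll("<\\d+>", " ");
        s = s.toLowerCase(Locale.ROOT);
        // \p{Punct} covers ASCII punctuation only.
        s = s.replaceAll("\\p{Punct}", " ");
        // Tokenize on whitespace and rejoin with single spaces.
        return String.join(" ", s.trim().split("\\s+"));
    }

    /** Word filter: one word, at least 3 characters, has a letter, no leading digit. */
    static boolean isLegalWord(String norm) {
        if (norm.contains(" ")) return false;                 // multi-word string
        if (norm.length() < 3) return false;                  // minimum length is 3
        if (Character.isDigit(norm.charAt(0))) return false;  // begins with a digit
        return norm.chars().anyMatch(Character::isLetter);    // needs an alphabetic character
    }
}
```

For example, normalize("  Fracture NOS ") yields "fracture", which passes isLegalWord, while normalize("Heart-Lung, NEC") yields the two-word string "heart lung", which is filtered out.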

  • Output Files:
    • stDocument.txt
      ST | one-word string
    • preNormMRCONSO.txt and normMRCONSO.txt
      • The first 18 fields are the same as MRCONSO.
      • 19th field: normalized STR (derived from the 15th field), using the above algorithm
      • 20th field: legal flag:
        • 0: default setting (should not happen)
        • 2: not English
        • 12: Illegal SAB
        • 13: Illegal TTY
        • 15-A: STR is an acronym
        • 15-N: illegal normalized STR
        • T: legal is true
      • 21st field: 2nd (final) run legal flag:
        • If 20th field is not "T"
          • F: legal is false
        • If 20th field is "T"
          • A: STR is an acronym from all capital source (SAB)
          • N: duplicate normalized STR (lowercase STR != normalized string)
          • T: legal is true (normalized string is used in stDocument)

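      1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21
      CUI | LAT | TS | LUI | STT | SUI | ISPREF | AUI | SAUI | SCUI | SDUI | SAB | TTY | CODE | STR | SRL | SUPPRESS | CVF | Normalized STR | 1st Run Flag | 2nd Run Flag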
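
The 21st-field values listed above can be derived from a preNormMRCONSO.txt line roughly as follows. This is a sketch under stated assumptions, not the actual implementation: the class name SecondRunFlag and the method flag are hypothetical, the file is assumed to stay pipe-delimited with the normalized STR and 1st run flag appended as the 19th and 20th fields, and the Acronym_Out and NormStr_Out lists saved in the 1st run are passed in as sets.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: computes the 2nd (final) run legal flag for one line.
final class SecondRunFlag {
    // All-capital STR sources (SAB) named in the algorithm section.
    static final Set<String> ALL_CAPITAL_SABS =
            Set.of("CDT", "CCPSS", "SPN", "MTHFDA", "VANDF");

    /** Returns "F", "A", "N", or "T" for one line of preNormMRCONSO.txt. */
    static String flag(String line, Set<String> acronymOut, Set<String> normStrOut) {
        String[] f = line.split("\\|", -1);
        if (f.length < 20) {
            // Assumption: a line without the 19th and 20th fields is a caller error.
            throw new IllegalArgumentException("expected at least 20 fields");
        }
        String sab = f[11];        // 12th field
        String str = f[14];        // 15th field
        String norm = f[18];       // 19th field: normalized STR
        String firstFlag = f[19];  // 20th field: 1st run legal flag

        if (!"T".equals(firstFlag)) {
            return "F"; // not legal after the 1st run
        }
        if (ALL_CAPITAL_SABS.contains(sab) && acronymOut.contains(str)) {
            return "A"; // acronym from an all-capital source
        }
        if (normStrOut.contains(norm) && !str.toLowerCase(Locale.ROOT).equals(norm)) {
            return "N"; // duplicate normalized STR
        }
        return "T";
    }
}
```

Checking the 1st run flag first mirrors the two-level structure of the flag list: "A" and "N" are only considered for lines that were legal after the 1st run.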

  • Notes:
    • TBD