Text Categorization

PreProcess: ST Documents

  • Description:

    The ST "Document" collects all words associated with each Semantic Type (ST) from MRCONSO.RRF (which replaces MRCON) and MRSTY. It serves as the ST-Word table for the domain data. The criteria for the ST "Document" are defined as follows:

    • 2nd field (LAT): language is |ENG| (English)
    • 17th field (SUPPRESS): suppressible flag is 'N'. 'N' means not O, E, or Y.
    • 12th field (SAB) is not in the SAB_Out list
    • 13th field (TTY) must be in the TTY_In list
    • 15th field (STR):
      • Remove acronyms (except for all-capital sources)
      • Normalize STR
      • Filter the normalized string

    • 2nd run to:
      • Remove acronyms from all-capital sources (SAB)
      • Remove redundant strings that have a duplicate normalized string

    • Find ST-word as stDocument
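
The record-level criteria above (LAT, SUPPRESS, SAB, TTY) can be sketched as a single predicate over one pipe-delimited MRCONSO.RRF line. This is a minimal illustration, not the code in GenerateStDocument.java: the class name `MrconsoRecordFilter`, the method `passes`, and the contents of the `sabOut` and `ttyIn` sets are assumptions, since the actual SAB_Out and TTY_In lists are not given here. The sample line in `main` is made up.

```java
import java.util.Set;

class MrconsoRecordFilter {
    // 0-based array indexes for the 1-based field numbers used in the description.
    static final int LAT = 1;       // 2nd field
    static final int SAB = 11;      // 12th field
    static final int TTY = 12;      // 13th field
    static final int SUPPRESS = 16; // 17th field

    /** Returns true when the line passes the LAT, SUPPRESS, SAB, and TTY criteria. */
    static boolean passes(String line, Set<String> sabOut, Set<String> ttyIn) {
        // A limit of -1 keeps trailing empty fields such as an empty CVF.
        String[] f = line.split("\\|", -1);
        if (f.length < 18) {
            return false; // malformed line: fewer than the 18 MRCONSO fields
        }
        return "ENG".equals(f[LAT])
                && "N".equals(f[SUPPRESS])
                && !sabOut.contains(f[SAB])
                && ttyIn.contains(f[TTY]);
    }

    public static void main(String[] args) {
        Set<String> sabOut = Set.of("SAB_X"); // hypothetical SAB_Out list
        Set<String> ttyIn = Set.of("PT");     // hypothetical TTY_In list
        // Made-up sample record with 18 fields and a trailing delimiter.
        String line = "C0000001|ENG|P|L0000001|PF|S0000001|Y|A0000001||||SRC_A|PT|X1|Aspirin|0|N||";
        System.out.println(passes(line, sabOut, ttyIn));
    }
}
```

The STR-level steps (acronym removal, normalization, word filtering) are deliberately left out of this predicate because they depend on state collected across the whole file.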

  • Input:
    • ash:/u03/umls/Releases/2007AC/Full/RRF/META/MRCONSO.RRF
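      1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18
      CUI | LAT | TS | LUI | STT | SUI | ISPREF | AUI | SAUI | SCUI | SDUI | SAB | TTY | CODE | STR | SRL | SUPPRESS | CVF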
    • ash:/u03/umls/Releases/2007AC/Full/ORF/META/MRSTY
      CUI | TUI | STY
    • Semantic Type
      Index | TUI | ST abbreviation | ST Name
    • Semantic Type Groups
      ST GROUP ABBR | ST GROUP NAME | TUI | STY
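
As an illustration of how the MRSTY input can be consumed, the sketch below loads pipe-delimited MRSTY lines (CUI, TUI, STY) into a CUI-to-TUI lookup. The class name `MrstyIndex` and method `cuiToTuis` are hypothetical, and the CUIs and TUIs in `main` are made-up placeholders rather than real UMLS identifiers.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class MrstyIndex {
    /** Builds a CUI -> set of TUIs map from pipe-delimited MRSTY lines (CUI|TUI|STY|). */
    static Map<String, Set<String>> cuiToTuis(List<String> lines) {
        Map<String, Set<String>> map = new HashMap<>();
        for (String line : lines) {
            String[] f = line.split("\\|", -1);
            if (f.length < 2) {
                continue; // skip malformed lines
            }
            // A concept (CUI) may carry more than one semantic type (TUI).
            map.computeIfAbsent(f[0], k -> new TreeSet<>()).add(f[1]);
        }
        return map;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "C0000001|T901|Sample Type A|",
                "C0000001|T902|Sample Type B|",
                "C0000002|T901|Sample Type A|");
        System.out.println(cuiToTuis(lines));
    }
}
```

A sorted set is used per CUI only so that printed output is stable; any set implementation would serve for lookup.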

  • Java File & Algorithm:
    • GenerateStDocument.java
      • Get ST-word by following procedures (1st run from MRCONSO.RRF):
        • 2nd field (LAT) is |ENG| (English)
        • 17th field (SUPPRESS) is 'N' (not suppressed)
        • 12th field (SAB) is not in the SAB_Out list
        • 13th field (TTY) must be in the TTY_In list
        • 15th field (STR)
          • Remove STR if it is an acronym, and save these acronyms in the Acronym_Out list. An acronym is defined as a string whose first two characters are uppercase letters. Exceptions are:
            • Acronyms in Acronyms_In list
            • All capital STR sources (SAB/TTY):
              • CDT
              • CCPSS
              • SPN
              • MTHFDA
              • VANDF
              • COSTAR/PT (All capital STR, and no Acronyms)
          • Normalize STR
            • trim: remove spaces at the beginning and the end
            • remove NEC, NOS, and their combinations at the end
            • remove ambiguity tag, <n>
            • lower case
            • replace punctuation with space
            • tokenize into words and recompose the words into a normalized string
          • Filter normalized string
            • Filter out multi-word strings (use only one-word strings)
            • Filter out one-word strings with length <= 2 (min. length is 3)
            • Filter out one-word strings without any alphabetic characters
            • Filter out one-word strings that begin with a numeric character
          • Save the normalized string in the NormStr_Out list if lowercase STR = normalized STR
        • Send the results to preNormMRCONSO.txt, with the normalized string in the 19th field and the legal flag in the 20th field.
      • 2nd run filter to (go through preNormMRCONSO.txt):
        • Do nothing if the 20th field (legal flag) is not T (not legal)
        • If the 20th field is T (legal)
          • Remove lines whose STR is an acronym (in the saved Acronym_Out list) from all-capital STR sources (SAB)
            • CDT
            • CCPSS
            • SPN
            • MTHFDA
            • VANDF
          • Remove lines whose normalized STR (19th field) is in the saved NormStr_Out list when the lowercase STR (15th field) != normalized STR
          • Mark the 21st field with the 2nd legal flag
      • For each string, count the same word only once in the word list
      • Use CUI to get Semantic Type (TUI) from MRSTY
      • Use TUI to find ST Groups and send to different files
      • For each ST, print out ST-Documents (ST-Words) sorted alphabetically
        • Words (stDocument.txt)
        • Words only related to one ST Group (stDocument1.txt)
        • Words related to multiple ST Groups (stDocument2.txt)
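
The "Normalize STR" and "Filter normalized string" steps listed above can be sketched as two small functions. This is an illustration under stated assumptions, not the logic of GenerateStDocument.java: the class name `StrNormalizer`, the regular expressions, and the choice to treat NEC/NOS as case-sensitive tokens preceded by spaces or commas are all guesses at details the description leaves open. The steps are applied in the order listed.

```java
import java.util.Locale;

class StrNormalizer {
    /** Applies the listed normalization steps, in order, to one STR value. */
    static String normalize(String str) {
        String s = str.trim();
        // Remove NEC, NOS, and their combinations at the end (assumed form: ", NOS", " NEC NOS").
        s = s.replaceAll("(?:[\\s,]+(?:NEC|NOS))+$", "");
        // Remove ambiguity tags such as <1>, <2>.
        s = s.replaceAll("\\s*<\\d+>", "");
        s = s.toLowerCase(Locale.ROOT);
        // Replace ASCII punctuation with a space.
        s = s.replaceAll("\\p{Punct}", " ");
        // Tokenize on whitespace and recompose with single spaces.
        return String.join(" ", s.trim().split("\\s+"));
    }

    /** Returns true when the normalized string is a usable single word. */
    static boolean isLegalWord(String norm) {
        if (norm.isEmpty() || norm.contains(" ")) {
            return false; // only one-word strings are used
        }
        if (norm.length() < 3) {
            return false; // minimum length is 3
        }
        if (Character.isDigit(norm.charAt(0))) {
            return false; // must not begin with a numeric character
        }
        return norm.chars().anyMatch(Character::isLetter); // needs an alphabetic character
    }

    public static void main(String[] args) {
        for (String s : new String[] {"Fever, NOS", "Cold <1>", "Beta-Blocker", "B12"}) {
            String n = normalize(s);
            System.out.println(s + " -> " + n + " (legal: " + isLegalWord(n) + ")");
        }
    }
}
```

Keeping normalization and filtering separate mirrors the description, where the normalized string is written to the 19th field even when the word filter rejects it.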

      • 1st run algorithm table:
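        Field # | Field name | Filter
        1 | CUI |
        2 | LAT | ENG
        3 | TS |
        4 | LUI |
        5 | STT |
        6 | SUI |
        7 | ISPREF |
        8 | AUI |
        9 | SAUI |
        10 | SCUI |
        11 | SDUI |
        12 | SAB | Not in SAB_Out list
        13 | TTY | Must be in TTY_In list
        14 | CODE |
        15 | STR | Not Acronyms; Filter normalized String
        16 | SRL |
        17 | SUPPRESS | N
        18 | CVF |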
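
The "Not Acronyms" filter on the STR field in the 1st run can be sketched as follows. The rule itself (first two characters uppercase, with exceptions for the Acronyms_In list, the all-capital sources, and COSTAR/PT) comes from the algorithm description; the class name `AcronymCheck`, the method names, and the `acronymsIn` parameter are hypothetical, and the contents of the real Acronyms_In list are not given here.

```java
import java.util.Set;

class AcronymCheck {
    // All-capital STR sources (SAB) named in the algorithm description.
    static final Set<String> ALL_CAPITAL_SABS =
            Set.of("CDT", "CCPSS", "SPN", "MTHFDA", "VANDF");

    /** An acronym is defined here as a string whose first two characters are uppercase letters. */
    static boolean looksLikeAcronym(String str) {
        return str.length() >= 2
                && Character.isUpperCase(str.charAt(0))
                && Character.isUpperCase(str.charAt(1));
    }

    /** Returns true when the 1st run should drop this STR as an acronym. */
    static boolean removeInFirstRun(String str, String sab, String tty, Set<String> acronymsIn) {
        if (!looksLikeAcronym(str)) {
            return false;
        }
        if (acronymsIn.contains(str)) {
            return false; // exception: listed in Acronyms_In
        }
        if (ALL_CAPITAL_SABS.contains(sab)) {
            return false; // exception: all-capital source, handled in the 2nd run
        }
        if ("COSTAR".equals(sab) && "PT".equals(tty)) {
            return false; // exception: COSTAR/PT
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> acronymsIn = Set.of(); // hypothetical, empty Acronyms_In list
        System.out.println(removeInFirstRun("AIDS", "SRC_A", "PT", acronymsIn));
        System.out.println(removeInFirstRun("ASPIRIN", "VANDF", "PT", acronymsIn));
    }
}
```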

  • Output Files:
    • stDocument.txt
      ST | one-word String
    • preNormMRCONSO.txt and normMRCONSO.txt
      • The first 18 fields are the same as MRCONSO.
      • 19th field: normalized STR (from the 15th field), using the algorithm above
      • 20th field: legal flag:
        • 0: default setting (should not happen)
        • 2: not English
        • 12: Illegal SAB
        • 13: Illegal TTY
        • 15-A: STR is an acronym
        • 15-N: Illegal normalized STR

        • T: legal is true
      • 21st field: 2nd (final) run legal flag:
        • If the 20th field is not "T"
          • F: legal is false
        • If the 20th field is "T"
          • A: STR is an acronym from an all-capital source (SAB)
          • N: Duplicated normalized STR (lowercase STR != normalized string)
          • T: legal is true (normalized string is used in stDocument)

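      1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21
      CUI | LAT | TS | LUI | STT | SUI | ISPREF | AUI | SAUI | SCUI | SDUI | SAB | TTY | CODE | STR | SRL | SUPPRESS | CVF | Normalized STR | 1st Run Flag | 2nd Run Flag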
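
The relationship between the 20th and 21st fields can be sketched as a function from the 1st-run flag to the 2nd-run flag. The flag values (F, A, N, T) follow the list above; the class name `SecondRunFlag`, the parameter names, and the order in which the A and N checks are applied are assumptions, and the `acronymOut` and `normStrOut` sets stand in for the Acronym_Out and NormStr_Out lists saved during the 1st run.

```java
import java.util.Locale;
import java.util.Set;

class SecondRunFlag {
    // All-capital STR sources (SAB) checked in the 2nd run.
    static final Set<String> ALL_CAPITAL_SABS =
            Set.of("CDT", "CCPSS", "SPN", "MTHFDA", "VANDF");

    /** Computes the 21st field from the 20th field and the lists saved in the 1st run. */
    static String flag(String firstRunFlag, String sab, String str, String normStr,
                       Set<String> acronymOut, Set<String> normStrOut) {
        if (!"T".equals(firstRunFlag)) {
            return "F"; // not legal after the 1st run
        }
        if (ALL_CAPITAL_SABS.contains(sab) && acronymOut.contains(str)) {
            return "A"; // acronym coming from an all-capital source
        }
        boolean changedByNormalization = !str.toLowerCase(Locale.ROOT).equals(normStr);
        if (changedByNormalization && normStrOut.contains(normStr)) {
            return "N"; // duplicate of a string that needed no normalization
        }
        return "T"; // legal: the normalized string is used in stDocument
    }

    public static void main(String[] args) {
        // Made-up values for illustration only.
        System.out.println(flag("T", "SRC_A", "Fever, NOS", "fever", Set.of(), Set.of("fever")));
    }
}
```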

  • Notes:
    • TBD