The SPECIALIST Lexicon

Consumer Health Terminology Acquisition
from UMLS Metathesaurus and MEDLINE N-gram Set

I. Introduction

In addition to the terms in the Lexicon, medical terms are retrieved from the UMLS for better coverage. The Lexicon and these retrieved medical terms are used together as the dictionary in cSpell.

II. Algorithm for Medical Terms

Medical terms are generated by the following steps:

  • Terms from the latest UMLS:
    • Go through MRCONSO.RRF
    • English String (LAT = ENG)
    • Preferred Term (TS = P, STT = PF, ISPREF = Y)
    • Not Obsolete (not used)
    • Not Abb/Acr (Not used)
    • Semantic type match (from STDEF: St_abb -> STI, MRSTY.RRF: STI -> CUI)
      Category        Semantic Types
      Problem         acab, anab, bact, cgab, dsyn, inpo, mobd, neop, patf, sosy
      Interventions   diap, lbpr, topp
      Drugs           aapp, antb, clnd, drdd, phsu, vita
                      nsba, strd (removed after 2014AA-)
      Anatomy         bdsy, blor, bpoc, bsoj, gngm, orga, orgf, phsf, tisu
      Population      aggp, famg, podg, popg
    • Lowercase terms only (mixed-case and uppercase terms contain many Abb/Acr and mostly overlap with the lowercase-only terms)

    • Program:

      shell>cd ${PRE_PROCESS}/bin
      2017AB
      3
      35

    • output:
      ${PRE_PROCESS}/data/Umls/${RELEASE_AX}/outData/umlsDicBySt.data.ewLc
      (ew: element word, Lc: lowercase)
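The filters listed for this step can be sketched in Python as follows. This is only an illustration under stated assumptions, not the actual PreProcess program: the function names are hypothetical, the mapping from semantic type abbreviation to type identifier is assumed to be loaded already and passed in as a dictionary, and the input lines are assumed to be in the standard pipe-delimited MRSTY.RRF and MRCONSO.RRF layouts.

```python
from typing import Dict, Iterable, Set

# Field positions in a pipe-delimited MRCONSO.RRF row.
CUI, LAT, TS, STT, ISPREF, STR = 0, 1, 2, 4, 6, 14


def cuis_for_semantic_types(
    mrsty_lines: Iterable[str],
    abbr_to_tui: Dict[str, str],   # assumed: abbreviation -> type identifier
    wanted_abbrs: Iterable[str],
) -> Set[str]:
    """Collect CUIs whose semantic type is one of the wanted abbreviations."""
    wanted_tuis = {abbr_to_tui[a] for a in wanted_abbrs if a in abbr_to_tui}
    cuis = set()
    for line in mrsty_lines:
        fields = line.rstrip("\n").split("|")
        # MRSTY.RRF: field 0 is the CUI, field 1 is the type identifier.
        if len(fields) > 1 and fields[1] in wanted_tuis:
            cuis.add(fields[0])
    return cuis


def select_terms(mrconso_lines: Iterable[str], cuis: Set[str]) -> Set[str]:
    """Keep English, preferred, all-lowercase strings for the given CUIs."""
    terms = set()
    for line in mrconso_lines:
        f = line.rstrip("\n").split("|")
        if len(f) <= STR:
            continue  # skip malformed rows
        if f[LAT] != "ENG":
            continue  # English strings only
        if f[TS] != "P" or f[STT] != "PF" or f[ISPREF] != "Y":
            continue  # preferred terms only
        if f[CUI] not in cuis:
            continue  # semantic type must match
        term = f[STR]
        if term != term.lower():
            continue  # drop mixed-case and uppercase terms
        terms.add(term)
    return terms
```

Both functions accept any iterable of lines, so open file handles for MRSTY.RRF and MRCONSO.RRF can be passed directly.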

  • Terms from the useful legacy data (from Gopher and the problem list). These terms are static and are retrieved from the baseline dictionary (4 files):
    • Input:
      • noCui.data.expo
      • noCui.data.prob
    • Program:

      shell>cd ${PRE_PROCESS}/bin
      2017
      4
      45
      46

    • output:
      ${PRE_PROCESS}/data/Baseline/outData/noCui.ewLc.data

  • Retrieve unigrams from UMLS resources and combine them with the legacy data:
    • Algorithm:
      • Tokenize terms into unigrams
      • coreTerm: remove leading and trailing punctuation
      • Filter out digit/punctuation, numbers, unit/measurement
      • Filter out terms/possessive already in Lexicon (or general English dictionary)
      • Customized dictionary (add and remove terms)
    • Program:

      shell>cd ${PRE_PROCESS}/bin
      2017
      5
      53
      54

    • input (${PRE_PROCESS}/data/cSpellDic/${YEAR}/inData/):
      • ewToBeAdded.data
      • ewToBeRemoved.data
      • umlsDicBySt.data.ewLc
      • noCui.ewLc.data
      • lexicon.enEwLc.dic.addRm
    • output (${PRE_PROCESS}/data/cSpellDic/${YEAR}/outData/):
      • Med.l.dic (l: for using lexicon as English dictionary)
      • EngMed.l.dic
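The unigram algorithm in this step might look roughly like the Python sketch below. It is an illustration only: the function names, the small `UNITS` set, the measurement pattern, and the order in which the customized add/remove lists are applied are all assumptions, not the behavior of the actual program.

```python
import re
import string
from typing import Iterable, Set

# Hypothetical unit list; the real filter is not documented here.
UNITS = {"mg", "ml", "kg", "cm", "mm"}
# Assumed shape of a measurement token such as "5mg" or "2.5ml".
MEASUREMENT = re.compile(r"^\d+(?:\.\d+)?[a-z]+$")


def core_term(token: str) -> str:
    """Remove punctuation at the leading and ending positions."""
    return token.strip(string.punctuation)


def build_med_words(
    terms: Iterable[str],
    lexicon: Set[str],
    to_add: Iterable[str] = (),
    to_remove: Iterable[str] = (),
) -> Set[str]:
    words = set()
    for term in terms:
        for token in term.split():          # tokenize into unigrams
            word = core_term(token)
            if not word or not any(ch.isalpha() for ch in word):
                continue                    # digits and punctuation only
            if word in UNITS or MEASUREMENT.match(word):
                continue                    # units and measurements
            base = word[:-2] if word.endswith("'s") else word
            if word in lexicon or base in lexicon:
                continue                    # term or possessive already known
            words.add(word)
    # Customized dictionary: add first, then remove (assumed order).
    return (words | set(to_add)) - set(to_remove)
```

Here `lexicon` stands in for the English dictionary derived from the Lexicon, and `to_add` and `to_remove` play the role of ewToBeAdded.data and ewToBeRemoved.data.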

  • Finalize words from the Med.l.dic derived above:
    • Get words that are in both Med.l.dic and medline.dic
      56
    • Exclude words that are in the Lexicon to generate the dictionary
      55
      57

    • output (${PRE_PROCESS}/data/cSpellDic/${YEAR}/outData/):
      • Med.cm-l.dic (cm: consumer and medline)
      • EngMed.cm-l.dic (-l: exclude lexicon)
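The two finalization operations amount to a set intersection followed by a set difference. A minimal Python sketch, assuming each dictionary file holds one word per line (the file format and the function names are assumptions):

```python
from typing import Iterable, Set


def load_words(lines: Iterable[str]) -> Set[str]:
    """Read one word per line, ignoring blank lines."""
    return {line.strip() for line in lines if line.strip()}


def finalize_med_words(
    med: Set[str], medline: Set[str], lexicon: Set[str]
) -> Set[str]:
    """Keep words found in both Med.l.dic and medline.dic, minus Lexicon words."""
    in_both = med & medline
    return in_both - lexicon
```

For example, `finalize_med_words(load_words(open("Med.l.dic")), ...)` would produce the word set for a file such as Med.cm-l.dic, given matching inputs for the other two arguments.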

  • Generate LexBuild Candidate list from Med.cm-l.dic:
    • Generate candidate list from Med.cm-l.dic
      60
      Output: ${PreProcess}/data/LexBuild/${YEAR}/outData/cCandidates.data
    • Sort by grouping base and plural forms of element words
      61
      Output: ${PreProcess}/data/LexBuild/${YEAR}/outData/cCandidates.data.gbp
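Sorting so that base and plural forms sit next to each other can be sketched with a sort key that maps each word to a guessed base form. The suffix rules below are a deliberately naive, hypothetical heuristic for illustration; the actual program may derive base forms quite differently.

```python
from typing import Iterable, List


def base_form(word: str) -> str:
    """Guess a singular base form with simple suffix rules (hypothetical)."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"                     # biopsies -> biopsy
    if word.endswith("es") and word[:-2].endswith(("s", "x", "z", "ch", "sh")):
        return word[:-2]                           # boxes -> box
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]                           # stents -> stent
    return word


def sort_by_base_and_plural(words: Iterable[str]) -> List[str]:
    """Sort so each plural directly follows its guessed base form."""
    return sorted(set(words), key=lambda w: (base_form(w), len(w), w))
```

Given `["stents", "biopsy", "stent", "biopsies"]`, the key groups `biopsy` with `biopsies` and `stent` with `stents`, with the shorter base form listed first in each group.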