The SPECIALIST Lexicon

CSpell - Consumer Medical Terms from UMLS

I. Pre-Process - Source: CSpell Dictionary, Medical Terms

Medical terms that are not in the SPECIALIST Lexicon were collected and added to the CSpell dictionary. These terms are also used as a source for the Lexicon build. Please see CSpell Dictionary - Medical Terms for details. The basic algorithm is as follows:

  • Retrieve terms from UMLS (MRCONSO.RRF) that are English and preferred terms
  • Match terms against 34 semantic types from 5 categories (problem, interventions, drugs, anatomy, population)
  • Lowercase the terms
  • Combine with static data from gopher and the problem list

  • Retrieve unigrams
  • Convert to coreTerm
  • Filter out digits, punctuation, numbers, units, and measurements
  • Filter out terms already in the Lexicon

  • Output: Med.cm-l.dic
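
The unigram filtering steps above can be sketched as follows. This is a minimal illustration, not the actual CSpell pre-processing code: the function names are hypothetical, core_term() is only a rough stand-in for the coreTerm conversion (it strips leading and trailing punctuation), and the unit and measurement filters are omitted.

```python
import string


def core_term(word):
    # Rough stand-in for the coreTerm conversion (assumption):
    # strip leading and trailing punctuation only.
    return word.strip(string.punctuation)


def preprocess(terms, lexicon):
    """Return sorted unigrams that are not already in the lexicon.

    terms   -- iterable of term strings (single words or multiwords)
    lexicon -- set of lowercased words already in the Lexicon
    """
    candidates = set()
    for term in terms:
        # Lowercase, then split each term into its element words (unigrams).
        for word in term.lower().split():
            core = core_term(word)
            if not core:
                continue  # the word was pure punctuation
            if any(ch.isdigit() for ch in core):
                continue  # drop digits and numbers
            if core in lexicon:
                continue  # drop words already in the Lexicon
            candidates.add(core)
    return sorted(candidates)
```

A real implementation would also need the unit and measurement lists, which are not described here.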

II. Process - Generate LMW candidates

  • Input Files:
    • Med.cm-l.dic (source of terms)
    • noCui.data.all (to assign type if no CUI)
    • umlsDicBySt.data.all (to retrieve CUI)
    • nGram.2017.noPipe.core.lc (to retrieve frequency)
  • Run Program:
    • shell>cd ${MULTIWORDS}/bin/20.CSpellMedTerms
      1
      2
  • Algorithm:
    • Get unigrams, with their frequency, from the sources
    • Associate terms, with their frequency, with each unigram
    • Assign CUI or source (CUI_expo or CUI_prob)

    • Re-arrange by grouping singulars and plurals together, then by frequency (this makes it easier for linguists to tag forms of the same term at the same time)
  • Output Files:
    • ${MULTIWORDS}/data/2017/outData/20.CSpellMedTerms/Cand_List
    • Format:
      Element Word | Frequency of Element Word | Term | Frequency of Term | CUI/Sources*

      * Field 5: CUI, CUI_expo, CUI_prob

    • cCandidates.data
    • cCandidates.data.gbp
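
The re-arrangement step in the algorithm (group singulars and plurals together, then order by frequency) can be sketched as below. This is only an illustration under stated assumptions: naive_base() is a hypothetical helper that strips a trailing "s", whereas the real program presumably uses proper inflection data, and ordering groups by their highest frequency is one possible reading of "then frequency".

```python
from collections import defaultdict


def naive_base(word):
    # Hypothetical singular/plural normalisation (assumption):
    # strip one trailing "s", but leave short words and "-ss" words alone.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def group_candidates(freq_by_word):
    """Return (word, frequency) pairs with singular/plural forms adjacent.

    Groups are ordered by their highest frequency, descending;
    within a group the shorter (singular) form comes first.
    """
    groups = defaultdict(list)
    for word, freq in freq_by_word.items():
        groups[naive_base(word)].append((word, freq))

    ordered = sorted(
        groups.values(),
        key=lambda group: max(freq for _, freq in group),
        reverse=True,
    )

    result = []
    for group in ordered:
        result.extend(sorted(group, key=lambda wf: (len(wf[0]), wf[0])))
    return result
```

Keeping both forms next to each other means a tagging decision made for the singular can be applied to the plural in the same pass.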

III. Post-Process

  • Keep the unigrams that have a frequency greater than 1500 (the same threshold as used with MEDLINE)
  • Get the unigrams (field 1), then use option 10 in 12.CandidateList to auto-tag
  • Get the multiwords (field 3), then use option 10 in 12.CandidateList to auto-tag

  • Use option 3 in 12.CandidateList to add invalid LMWs to the invalidBaseLmw.data file.
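
The frequency cut and the field extraction above could look roughly like this. It is a sketch, not the 12.CandidateList tool itself: it assumes each candidate record is a single line with five fields separated by "|", and the function name and the sep parameter are hypothetical.

```python
def select_for_tagging(lines, min_freq=1500, sep="|"):
    """Return (unigrams, multiwords) for records above the frequency cut.

    Each record is assumed to have five fields:
    element word, frequency of element word, term, frequency of term, CUI/source.
    """
    unigrams, multiwords = [], []
    seen_unigrams, seen_multiwords = set(), set()

    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) < 5:
            continue  # skip malformed or empty lines

        word, word_freq, term = fields[0], int(fields[1]), fields[2]
        if word_freq <= min_freq:
            continue  # keep only frequency strictly greater than the cut

        # Field 1: the unigram, collected once in first-seen order.
        if word not in seen_unigrams:
            seen_unigrams.add(word)
            unigrams.append(word)

        # Field 3: the multiword term that contains the unigram.
        if term not in seen_multiwords:
            seen_multiwords.add(term)
            multiwords.append(term)

    return unigrams, multiwords
```

The two returned lists correspond to the inputs for the two auto-tag passes described above.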

IV. Results

Year  Min. frequency  Type     Total Candidates  Valid LMW  Invalid LMW
2017  1500            Unigram  125               87         38
                      Terms    1233              37         1196
                      Total    1358              124        1234
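
The counts in the results can be cross-checked with simple arithmetic: each row's candidates should equal valid plus invalid LMWs, and the Total row should equal the column sums. The short check below uses only the figures reported above; the variable and function names are illustrative.

```python
# (total candidates, valid LMW, invalid LMW) for each type, as reported for 2017
rows = {
    "Unigram": (125, 87, 38),
    "Terms": (1233, 37, 1196),
}
reported_total = (1358, 124, 1234)


def is_consistent(rows, reported_total):
    # Every row: candidates == valid + invalid
    for candidates, valid, invalid in rows.values():
        if candidates != valid + invalid:
            return False
    # Total row: column-wise sum of the per-type rows
    column_sums = tuple(sum(column) for column in zip(*rows.values()))
    return column_sums == reported_total


consistent = is_consistent(rows, reported_total)
```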