SPECIALIST Lexicon

Consumer Health Terminology Acquisition
(from UMLS Metathesaurus and MEDLINE N-gram Set)

I. Introduction

In addition to the terms in the Lexicon, medical terms are retrieved from the UMLS to improve coverage. The Lexicon and these retrieved medical terms are used together as the dictionary in cSpell.

II. Algorithm for Medical Terms

Medical terms are generated by the following steps:

Terms from the latest UMLS:

Go through MRCONSO.RRF
English String (LAT = ENG)
Preferred Term (TS = P, STT = PF, ISPREF = Y)
Not Obsolete (not used)
Not Abb/Acr (Not used)

Semantic type match (from STDEF: St_abb -> STI, MRSTY.RRF: STI -> CUI)

Category	Semantic Types
Problem	acab, anab, bact, cgab, dsyn, inpo, mobd, neop, patf, sosy
Interventions	diap, lbpr, topp
Drugs	aapp, antb, clnd, drdd, phsu, vita; nsba, strd (removed after 2014AA-)
Anatomy	bdsy, blor, bpoc, bsoj, gngm, orga, orgf, phsf, tisu
Population	aggp, famg, podg, popg

Only lowercase terms (mixed-case and uppercase terms contain many abbreviations/acronyms and mostly overlap with the lowercase-only terms)
Program:
shell>cd ${PRE_PROCESS}/bin
2017AB
3
35
output:
${PRE_PROCESS}/data/Umls/${RELEASE_AX}/outData/umlsDicBySt.data.ewLc
(ew: element word, Lc: lowercase)
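
The selection criteria above can be sketched in Python as follows. This is a minimal illustration, not the actual pre-process program: the function name `preferred_lowercase_terms` is hypothetical, the column positions assume the standard pipe-delimited MRCONSO.RRF and MRSTY.RRF layouts (string type is the STT column), and the semantic types are assumed to be passed in as type identifiers such as T047 (dsyn).

```python
# Assumed column positions in the standard MRCONSO.RRF layout.
CUI, LAT, TS, STT, ISPREF, STR = 0, 1, 2, 4, 6, 14


def preferred_lowercase_terms(mrconso_lines, mrsty_lines, wanted_types):
    """Return sorted lowercase English preferred terms for the wanted types."""
    # MRSTY.RRF rows start with CUI|TUI|...; collect CUIs of a wanted type.
    cuis = set()
    for line in mrsty_lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) > 1 and fields[1] in wanted_types:
            cuis.add(fields[0])

    terms = set()
    for line in mrconso_lines:
        f = line.rstrip("\n").split("|")
        if len(f) <= STR:
            continue  # skip malformed rows
        if (
            f[LAT] == "ENG"              # English string
            and f[TS] == "P"             # preferred term status
            and f[STT] == "PF"           # preferred form of the term
            and f[ISPREF] == "Y"         # preferred atom
            and f[CUI] in cuis           # semantic type match
            and f[STR] == f[STR].lower() # keep lowercase-only strings
        ):
            terms.add(f[STR])
    return sorted(terms)
```

Both inputs are any iterables of lines, so open file handles for MRCONSO.RRF and MRSTY.RRF can be passed directly.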

Terms from the useful legacy data (from Gopher and the problem list). These terms are static and are retrieved from the baseline dictionary (4 files):
- Input:
  - noCui.data.expo
  - noCui.data.prob
- Program:
  shell>cd ${PRE_PROCESS}/bin
  2017
  4
  45
  46
- output:
  ${PRE_PROCESS}/data/Baseline/outData/noCui.ewLc.data
Retrieve unigrams from the UMLS resources and combine them with the legacy data:
- Algorithm:
  - Tokenize terms into unigrams
  - coreTerm: remove leading and trailing punctuation
  - Filter out digits/punctuation, numbers, and units/measurements
  - Filter out terms and possessives already in the Lexicon (or a general English dictionary)
  - Customized dictionary (add and remove terms)
- Program:
  shell>cd ${PRE_PROCESS}/bin
  2017
  5
  53
  54
- input (${PRE_PROCESS}/data/cSpellDic/${YEAR}/inData/):
  - ewToBeAdded.data
  - ewToBeRemoved.data
  - umlsDicBySt.data.ewLc
  - noCui.ewLc.data
  - lexicon.enEwLc.dic.addRm
- output (${PRE_PROCESS}/data/cSpellDic/${YEAR}/outData/):
  - Med.l.dic (l: for using lexicon as English dictionary)
  - EngMed.l.dic
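
The unigram algorithm in this step can be sketched as below. It is only an approximation of the listed rules: `build_med_unigrams`, `core_term`, and the `units` set are hypothetical names, whitespace tokenization is assumed, and the possessive check only handles a trailing `'s`.

```python
import string


def core_term(token):
    """Strip punctuation from the leading and ending positions only."""
    return token.strip(string.punctuation)


def is_noise(word):
    """True for digits, numbers, and punctuation-only tokens (no letters)."""
    return not any(ch.isalpha() for ch in word)


def build_med_unigrams(terms, lexicon, to_add=(), to_remove=(), units=()):
    """Derive medical unigrams that are not already covered by the lexicon."""
    lexicon = set(lexicon)
    units = set(units)
    words = set()
    for term in terms:
        for token in term.split():  # assumed: whitespace tokenization
            word = core_term(token).lower()
            if not word or is_noise(word) or word in units:
                continue
            # Treat "x's" as covered when "x" is already in the lexicon.
            base = word[:-2] if word.endswith("'s") else word
            if word in lexicon or base in lexicon:
                continue
            words.add(word)
    # Customized dictionary: apply manual additions, then removals.
    return sorted((words | set(to_add)) - set(to_remove))
```

Here `to_add` and `to_remove` stand in for the contents of ewToBeAdded.data and ewToBeRemoved.data, assuming one word per entry.
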
Finalize words from the Med.l.dic derived above:
- Get words that are in both Med.l.dic and medline.dic
  56
- Exclude words that are in the Lexicon to generate the dictionary
  55
  57
- output (${PRE_PROCESS}/data/cSpellDic/${YEAR}/outData/):
  - Med.cm-l.dic (cm: consumer and medline)
  - EngMed.cm-l.dic (-l: exclude lexicon)
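
The finalization step amounts to a set intersection followed by a set difference. A minimal sketch, assuming each dictionary has been loaded as a collection of words (the function name `consumer_medline_words` is hypothetical):

```python
def consumer_medline_words(med_words, medline_words, lexicon_words):
    """Keep words found in both Med and MEDLINE lists, minus Lexicon words."""
    in_both = set(med_words) & set(medline_words)
    return sorted(in_both - set(lexicon_words))
```
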
Generate LexBuild Candidate list from Med.cm-l.dic:
- Generate candidate list from Med.cm-l.dic
  60
  Output: ${PreProcess}/data/LexBuild/${YEAR}/outData/cCandidates.data
- Sort by grouping base and plural forms of element words
  61
  Output: ${PreProcess}/data/LexBuild/${YEAR}/outData/cCandidates.data.gbp
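
One way to keep a base form next to its plural when sorting is sketched below. The pluralization rules are a naive assumption (`-s`, `-es`, `-ies`), the names `naive_base` and `group_base_plural` are hypothetical, and the actual layout of the .gbp file is not reproduced here.

```python
def naive_base(word, vocab):
    """Guess the base form, accepting it only if it is in the same list."""
    if word.endswith("ies") and word[:-3] + "y" in vocab:
        return word[:-3] + "y"
    if word.endswith("es") and word[:-2] in vocab:
        return word[:-2]
    if word.endswith("s") and word[:-1] in vocab:
        return word[:-1]
    return word


def group_base_plural(words):
    """Sort words so each base form is immediately followed by its plural."""
    vocab = set(words)

    def key(word):
        base = naive_base(word, vocab)
        # Order by base, then the base itself before its plural.
        return (base, word != base, word)

    return sorted(vocab, key=key)
```

With a plain alphabetical sort, "stenting" would fall between "stent" and "stents"; the grouping key keeps the pair adjacent.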
