CSpell - Consumer Medical Terms from UMLS
I. Pre-Process - Source: CSpell Dictionary, Medical Terms
Medical terms that are not in the SPECIALIST Lexicon were collected and added to the CSpell dictionary. These terms are also used as a source for the Lexicon build. Please see CSpell Dictionary - Medical Terms for details. The basic algorithm is described as follows:
- Retrieve terms from UMLS (MRCONSO.RRF) that are English preferred terms
- Keep terms that match one of 34 semantic types from 5 categories (problem, interventions, drugs, anatomy, population)
- Lowercase
- Combine with static data from gopher and problem list
- Retrieve unigrams
- Convert to coreTerm
- Filter out digits, punctuation, numbers, units, and measurements
- Filter out terms already in the Lexicon
- Med.cm-l.dic
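The filtering steps above can be sketched in Python. This is only an illustration under stated assumptions, not the actual pre-process program: the names `to_core_term` and `preprocess` are hypothetical, the coreTerm conversion is approximated by stripping leading and trailing punctuation, and the Lexicon and the unit list are passed in as plain sets. Retrieval from MRCONSO.RRF and the semantic type match are assumed to have happened already.

```python
import string

def to_core_term(token):
    # Assumption: approximate "coreTerm" by stripping leading and trailing
    # punctuation; the real conversion may do more.
    return token.strip(string.punctuation)

def preprocess(terms, lexicon_words, units=frozenset()):
    """Return the sorted unigrams that survive the filters listed above."""
    kept = set()
    for term in terms:
        # Lowercase the term, then split it into unigrams.
        for token in term.lower().split():
            core = to_core_term(token)
            # Drop empty tokens and tokens with no letters
            # (digits, numbers, bare punctuation).
            if not any(ch.isalpha() for ch in core):
                continue
            # Drop units/measurements and words already in the Lexicon.
            if core in units or core in lexicon_words:
                continue
            kept.add(core)
    return sorted(kept)
```

A call might look like `preprocess(["Heart Attack", "Aspirin 81 mg", "(Stent)"], {"heart", "attack"}, {"mg"})`, where the second and third arguments stand in for the Lexicon word list and a unit list.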
II. Process - Generate LMW candidates
- Input Files:
- Med.cm-l.dic (source of terms)
- noCui.data.all (to assign type if no CUI)
- umlsDicBySt.data.all (to retrieve CUI)
- nGram.2017.noPipe.core.lc (to retrieve frequency)
- Run Program:
shell>cd ${MULTIWORDS}/bin/20.CSpellMedTerms
1
2
- Algorithm:
- Get unigrams from the sources, with frequency
- Associate terms (with frequency) to each unigram
- Assign CUI or source (CUI_expo or CUI_prob)
- Re-arrange by grouping singulars and plurals together, then by frequency (so that linguists can tag the forms of the same term at the same time)
- Output Files:
- ${MULTIWORDS}/data/2017/outData/20.CSpellMedTerms/Cand_List
- Format:
Element Word | Frequency of Element Word | Term | Frequency of Term | CUI/Sources*
* Field 5: CUI, CUI_expo, or CUI_prob
- cCandidates.data
- cCandidates.data.gbp
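As a rough illustration of the five-field format and the re-arrangement step, the Python sketch below parses candidate lines and orders them so that singular and plural element words sit together. Several things here are assumptions rather than facts from this page: the files are taken to be pipe-delimited text, the helper names (`Candidate`, `parse_line`, `singular_key`, `arrange`) are hypothetical, singular/plural grouping is approximated by dropping a trailing "s", and groups are ordered by their highest element-word frequency. The actual program may use Lexicon inflection data and a different ordering.

```python
from collections import namedtuple

Candidate = namedtuple(
    "Candidate", "element element_freq term term_freq source")

def parse_line(line):
    # Assumption: fields are separated by "|" in the order shown above.
    f = [field.strip() for field in line.split("|")]
    return Candidate(f[0], int(f[1]), f[2], int(f[3]), f[4])

def singular_key(word):
    # Naive stand-in for singular/plural grouping: drop a trailing "s".
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word

def arrange(candidates):
    """Group singulars with plurals, then order groups by frequency."""
    # A group's frequency is the highest element-word frequency in it.
    group_freq = {}
    for c in candidates:
        key = singular_key(c.element)
        group_freq[key] = max(group_freq.get(key, 0), c.element_freq)
    return sorted(
        candidates,
        key=lambda c: (-group_freq[singular_key(c.element)],
                       singular_key(c.element),
                       c.element,
                       -c.term_freq))
```

With this ordering, all terms for "stent" and "stents" end up adjacent, which is the property the re-arrangement step is meant to give the linguist.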
III. Post-Process
- Get the unigrams that have a frequency greater than 1500 (the same threshold as used for MEDLINE)
- Get unigrams (field 1), then use option 10 in 12.CandidateList to auto tag
- Get multiwords (field 3), then use option 10 in 12.CandidateList to auto tag
- Use option 3 in 12.CandidateList to add invalid LMWs to the invalidBaseLmw.data file.
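The selection in the first three steps can be sketched as follows. This is a minimal Python sketch under assumptions: the candidate file is taken to be pipe-delimited with the five fields described in section II, `select_for_tagging` is a hypothetical name, and a term is treated as a multiword only if it contains a space. The auto-tagging itself (option 10 in 12.CandidateList) is not reproduced here.

```python
def select_for_tagging(lines, min_freq=1500):
    """Return (unigrams, multiwords) for element words above min_freq."""
    unigrams, multiwords = [], []
    seen_unigrams, seen_multiwords = set(), set()
    for line in lines:
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        element, element_freq, term = fields[0], int(fields[1]), fields[2]
        # Keep only element words with frequency strictly greater
        # than the threshold.
        if element_freq <= min_freq:
            continue
        # Field 1: the unigram, listed once.
        if element not in seen_unigrams:
            seen_unigrams.add(element)
            unigrams.append(element)
        # Field 3: the term, kept only if it has more than one word.
        if " " in term and term not in seen_multiwords:
            seen_multiwords.add(term)
            multiwords.append(term)
    return unigrams, multiwords
```

The two returned lists correspond to the two inputs that are then passed to the auto-tag option.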
IV. Results
Year | Min. frequency | Type | Total Candidates | Valid LMW | Invalid LMW
---|---|---|---|---|---
2017 | 1500 | Unigram | 125 | 87 | 38
 | | Terms | 1233 | 37 | 1196
 | | Total | 1358 | 124 | 1234