SPECIALIST Lexicon

Synonym Candidates

I. Setup

Program: GetSynonymCandidates.java
Inputs:
- MRCONSO.RRF
- inflVars.data
- cuiPreferredTerm.data
- MRSTY.RRF
- SemGroups.filter.txt
- LRABR.f1.uSort
- LRNOM
Outputs:
- synonymCan.data.*

II. Algorithm

Go through all lines in MRCONSO.RRF to generate sClasses (synonym classes). An sClass includes:

key: CUI and preferred term
- In UMLS, a preferred term may be capitalized, may be a plural form (not the base form), and may not be in the Lexicon
- The preferred term associated with the CUI is used as the reference for tagging synonym candidates in the sClass
values: candidates, that is, terms that have the same CUI and meet the following requirements
- format: POS|EUI|coreTerm.lc
- The EUI is used to uniquely identify the term, because different terms with different EUIs can have the same spelling and POS
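The grouping step can be pictured with a minimal Java sketch. It is not the code of GetSynonymCandidates.java: the class and method names are hypothetical, and it only assumes the standard pipe-delimited MRCONSO.RRF column order (CUI in field 1, LAT in field 2, STR in field 15). The Lexicon lookup that turns each string into a POS|EUI|base candidate is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of GetSynonymCandidates.java. */
final class MrconsoGrouping {
    // Zero-based positions of the MRCONSO.RRF columns used here.
    private static final int CUI = 0;
    private static final int LAT = 1;
    private static final int STR = 14;

    /** Groups the English strings of each CUI, keeping the input order. */
    static Map<String, List<String>> groupEnglishByCui(List<String> mrconsoLines) {
        Map<String, List<String>> byCui = new LinkedHashMap<>();
        for (String line : mrconsoLines) {
            // limit -1 keeps empty trailing fields so the column count stays stable
            String[] fields = line.split("\\|", -1);
            if (fields.length <= STR || !"ENG".equals(fields[LAT])) {
                continue; // skip malformed and non-English rows
            }
            byCui.computeIfAbsent(fields[CUI], k -> new ArrayList<>()).add(fields[STR]);
        }
        return byCui;
    }
}
```

Each map entry then corresponds to one potential sClass: the key is the CUI and the list holds the raw strings that still have to be matched against the Lexicon.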

Example of a synonym class (sClass):

#SYNONYM_CLASS|C0000715|Abattoirs
128|E0203495|abattoir|
128|E0205229|slaughterhouse|

#SYNONYM_CLASS|C0000744|Abetalipoproteinemia
128|E0006481|abetalipoproteinemia|
128|E0217186|acanthocytosis|
128|E0430334|Bassen-Kornzweig syndrome|
128|E0441749|Bassen-Kornzweig disease|

Please note that candidates are case-sensitive, to preserve the original base form as it appears in the Lexicon.
They are sent to linguists, who tag each candidate [y|n] as a valid or invalid synonym for the CUI|PT (preferred term).
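For readers who want to load such a file, the following is a minimal Java sketch of a reader for the sClass text format shown in the example. The type and method names (SynonymClassParser, SynonymClass, Candidate, parse) are hypothetical and do not come from the Lexicon tools; the sketch assumes Java 16 or later for records and treats any line starting with "#" as a class header.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reader for the sClass text format; names are illustrative only. */
final class SynonymClassParser {
    /** One candidate line: POS|EUI|term| */
    record Candidate(int pos, String eui, String term) {}

    /** One sClass: a header line (tag|CUI|preferred term) plus its candidate lines. */
    record SynonymClass(String cui, String preferredTerm, List<Candidate> candidates) {}

    static List<SynonymClass> parse(List<String> lines) {
        List<SynonymClass> classes = new ArrayList<>();
        SynonymClass current = null;
        for (String line : lines) {
            if (line.isBlank()) {
                continue; // blank lines only separate classes
            }
            String[] fields = line.split("\\|", -1);
            if (fields.length < 3) {
                continue; // not a header or candidate line
            }
            if (line.startsWith("#")) {
                current = new SynonymClass(fields[1], fields[2], new ArrayList<>());
                classes.add(current);
            } else if (current != null) {
                // The term is stored as written, so its case is preserved.
                current.candidates().add(
                    new Candidate(Integer.parseInt(fields[0]), fields[1], fields[2]));
            }
        }
        return classes;
    }
}
```

Applied to the two example classes above, parse would return two SynonymClass values with two and four candidates respectively.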

The numbers below are based on UMLS.2014 (11,936,143 terms in total).

Descriptions and output logs for each step:

Retrieve all English terms in the Lexicon with the same CUI
- English term: Field-2, LAT = ENG (English only)
  non-English: 2,913,250, English: 9,022,893
- Same CUI (the definition of a synonym: same concept)
- Normalized to coreTermLc (strip initial and final punctuation, then lowercase), used as the key for lexRecord lookup (inflVars.data)
- Known to the Lexicon
  not in Lexicon: 5,704,964, Lexicon: 450,234
- Inflection of a base form, with POS of noun, verb, or adj
  Disallowed Lexicon POS: 620,085, Good Lexicon POS: 453,261
- Output format:
  #SYNONYM_CLASS|CUI|Preferred Term
  POS-1|EUI-1|Base-1
  POS-2|EUI-2|Base-2
  ...
Output log: SynonymCan.data.1.all

Exclude terms with a disallowed STI, such as Chemicals and Drugs
- CuiStiMap: use ./inData/MRSTY.RRF to map CUI to STI
- disallowedStiSet: ./inData/SemGroups.filter.txt specifies the disallowed STIs (tagged by linguists), such as those whose SemGroup is CHEM
  disallowed: 2,867,695, allowed: 6,155,198
- Example-1: the following synonym class is removed because of a disallowed STI
  #SYNONYM_CUI|C0000098|1-Methyl-4-phenylpyridinium
  128|E0020400|cyperquat|
  128|E0319735|mpp|
Output log: SynonymCan.data.2.disallow

Exclude terms that are acronyms or abbreviations, because they lower precision too much. They have too many expansions; for example, "AA" has 39 expansions in the Lexicon.
- Preprocess:
  shell> flds 1 LRABR | sort -u > LRABR.f1.uSort
- Use LRABR.f1.uSort to check whether a term is an abbreviation or acronym
  AcrAbb: 26,596, NotAcrAbb: 426,665
- Example-2: lines with abbreviations are removed
  128|E0006443|abdomen|
  128|E0554771|abdominal|
  128|E0689526|abd|
  128|E0689531|abd|
  1|E0006444|abdominal|
  1|E0692924|abd|
- Example-3: after acad is removed, this class has only one candidate, so the whole synonym class is removed
  #SYNONYM_CUI|C0000876|Academies
  128|E0006659|academy|
  128|E0417973|acad|
  128|E0722828|acad|
Output log: SynonymCan.data.3.abb

Remove spVars to reduce the manual tagging effort. If a term has A as a synonym, all spVars of that term are also synonyms of A.
- Do not add a term to the sClass if its EUI already exists in the sClass (it is a spVar)
- Use the EUI in inflVars.data
- Use any base form for terms that have spVars (same EUI)
  spVarNo: 274,469, number after removing spVars: 152,196
- SpVars should be added back in the post-process
- Example-4: lines that are spVars are removed
  #SYNONYM_CUI|C0000934|Acclimatization
  128|E0006730|acclimation|
  128|E0006731|acclimatisation|
  128|E0006731|acclimatization|
  128|E0007239|adaptation|
  128|E0422110|adaption|
  In the post-process, the deleted spVars are added back in (if the tag of acclimatisation is [y]), so the record becomes (assuming all tags are [y]):
  #SYNONYM_CUI|C0000934|Acclimatization
  128|E0006730|acclimation|
  128|E0006731|acclimatisation|
  128|E0006731|acclimatization|
  128|E0007239|adaptation|
  128|E0422110|adaption|
- Example-5: after the spVar is removed, this class has only one candidate, so the whole synonym class is removed
  #SYNONYM_CUI|C0000880|Acanthamoeba Keratitis
  128|E0429790|acanthameba keratitis|
  128|E0429790|acanthamoeba keratitis|
  => In the post-process, no synonyms will be generated for this sClass.
Output log: SynonymCan.data.4.spVar

Remove nominalizations of a term. If a term has A as a synonym, all nominalizations of that term are also synonyms of A.
- Sort sClass by CUI (key)
- Use nomMap: ./inData/LRNOM; the key is the EUI of a noun, and the value is the set of EUIs of its nominalizations (adj and verb)
- For implementation, keep the noun and remove its adj and verb nominalizations
  nomNo: 819, passNomNo: 151,377
- All nominalizations are synonyms (use LRNOM)
- Example-6: lines that are nominalizations of a noun are removed
  #SYNONYM_CUI|C0001807|Aggressive behavior
  128|E0007791|aggression|
  128|E0007793|aggressiveness|
  128|E0528674|aggressive|
  1|E0007792|aggressive|
  => In the post-process, the nominalizations of all lines are added as follows:
  #SYNONYM_CUI|C0001807|Aggressive behavior
  128|E0007791|aggression|
  128|E0007793|aggressiveness|
  128|E0528674|aggressive|
  1|E0007792|aggressive|
  1024|E02212219|aggress|
  1|E0007792|aggressive|
Output log: SynonymCan.data.5.nom

Print sClasses with multiple candidates (an sClass must have more than 1 term)
  notMultiCanNo: 96,455, multiCanNo: 54,922
Output log: SynonymCan.data
  sClassNo: 21,655
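Some of the filtering steps above can be sketched in Java as follows. This is an illustration under stated assumptions, not the code of GetSynonymCandidates.java: all names are hypothetical, the punctuation set used for coreTermLc is Java's \p{Punct} class rather than whatever the Lexicon tools use, and the abbreviation check is passed in as a predicate because the text does not say how a candidate is matched against LRABR.f1.uSort. Keeping the first candidate seen for each EUI is one way to read "use any base form". Records require Java 16 or later.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical filters for illustration; not the actual implementation. */
final class CandidateFilters {
    /** One candidate line: POS|EUI|term| */
    record Candidate(int pos, String eui, String term) {}

    /** coreTermLc: strip leading and trailing punctuation, then lowercase. */
    static String coreTermLc(String term) {
        return term.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "").toLowerCase(Locale.ROOT);
    }

    /** Drops every candidate that the supplied check flags as an acronym or abbreviation. */
    static List<Candidate> dropAbbreviations(List<Candidate> in, Predicate<Candidate> isAbbreviation) {
        List<Candidate> out = new ArrayList<>();
        for (Candidate c : in) {
            if (!isAbbreviation.test(c)) {
                out.add(c);
            }
        }
        return out;
    }

    /** Keeps the first candidate per EUI; later candidates with the same EUI are spVars. */
    static List<Candidate> dropSpVars(List<Candidate> in) {
        Set<String> seenEuis = new HashSet<>();
        List<Candidate> out = new ArrayList<>();
        for (Candidate c : in) {
            if (seenEuis.add(c.eui())) { // add() returns false for an EUI already seen
                out.add(c);
            }
        }
        return out;
    }

    /** An sClass is printed only if more than one candidate survives the filters. */
    static boolean hasMultipleCandidates(List<Candidate> candidates) {
        return candidates.size() > 1;
    }
}
```

With the candidates of Example-5, dropSpVars leaves a single entry for E0429790, so hasMultipleCandidates returns false and the class would not be printed. The STI and nominalization filters are omitted because they depend on details of MRSTY.RRF and LRNOM handling that are not spelled out here.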
