The SPECIALIST Lexicon

Synonym Candidates

I. Setup

  • program: GetSynonymCandidates.java
  • Inputs:
    • MRCONSO.RRF
    • inflVars.data
    • cuiPreferredTerm.data
    • MRSTY.RRF
    • SemGroups.filter.txt
    • LRABR.f1.uSort
    • LRNOM
  • Outputs:
    • synonymCan.data.*

II. Algorithm

Go through all lines in MRCONSO.RRF to generate sClasses (synonym classes). An sClass includes:

  • key: CUI and preferred term
    • In UMLS, a preferred term may be capitalized, may be a plural form (not the base form), and may not be in the Lexicon
    • The preferred term associated with the CUI is used as the reference for tagging synonym candidates in the sClass
  • values: candidates, i.e. terms that have the same CUI and meet the requirements described below
    • format: POS|EUI|coreTerm.lc
    • The EUI is used to uniquely identify the term because some terms with different EUIs have the same spelling and POS
  • Example of a synonym class (sClass):
    #SYNONYM_CLASS|C0000715|Abattoirs
    128|E0203495|abattoir|
    128|E0205229|slaughterhouse|
    
    #SYNONYM_CLASS|C0000744|Abetalipoproteinemia
    128|E0006481|abetalipoproteinemia|
    128|E0217186|acanthocytosis|
    128|E0430334|Bassen-Kornzweig syndrome|
    128|E0441749|Bassen-Kornzweig disease|
    
  • Please note that candidates are case sensitive, to preserve the original base form as recorded in the Lexicon
  • The candidates are sent to linguists, who tag each one [y|n] as a valid or invalid synonym for the CUI|PT
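
The sClass layout above can be read back into memory with a few lines of Java. This is a minimal sketch, not code from GetSynonymCandidates.java: the names SClassParser, SClass, Candidate, and posName are hypothetical. It accepts any header starting with #SYNONYM_ because later examples in this document use #SYNONYM_CUI. The POS decoding assumes the numbers are category bit values (1 = adj, 128 = noun, 1024 = verb), which matches the examples here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; names are illustrative and not taken from GetSynonymCandidates.java.
class SClassParser {
    // One candidate line: POS|EUI|term|
    record Candidate(int pos, String eui, String term) {}

    // One synonym class: the CUI|preferred-term key plus its candidates.
    record SClass(String cui, String preferredTerm, List<Candidate> candidates) {}

    // Parses lines in the format shown above into sClass objects.
    static List<SClass> parse(List<String> lines) {
        List<SClass> result = new ArrayList<>();
        SClass current = null;
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // blank lines only separate classes
            }
            String[] f = trimmed.split("\\|");
            if (trimmed.startsWith("#SYNONYM_") && f.length >= 3) {
                current = new SClass(f[1], f[2], new ArrayList<>());
                result.add(current);
            } else if (current != null && f.length >= 3) {
                current.candidates().add(new Candidate(Integer.parseInt(f[0]), f[1], f[2]));
            }
        }
        return result;
    }

    // Assumption: POS numbers are category bit values; only the three used in this document are named.
    static String posName(int pos) {
        switch (pos) {
            case 1: return "adj";
            case 128: return "noun";
            case 1024: return "verb";
            default: return "other";
        }
    }

    public static void main(String[] args) {
        List<SClass> classes = parse(List.of(
            "#SYNONYM_CLASS|C0000715|Abattoirs",
            "128|E0203495|abattoir|",
            "128|E0205229|slaughterhouse|"));
        for (SClass sc : classes) {
            System.out.println(sc.cui() + " " + sc.preferredTerm());
            for (Candidate c : sc.candidates()) {
                System.out.println("  " + posName(c.pos()) + " " + c.eui() + " " + c.term());
            }
        }
    }
}
```

Keeping the EUI as its own field makes the later steps (spVar and nominalization removal) simple set operations on EUIs.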

The numbers below are based on UMLS.2014 (11,936,143 terms in MRCONSO.RRF)

Descriptions | Output Logs
Retrieve all English terms that are in the Lexicon and share the same CUI
  • English term: Field-2, LAT = ENG (English only)
  • non-English: 2,913,250, English: 9,022,893
  • same CUI (definition of synonym: same concept)
  • Normalized to coreTermLc (strip leading and trailing punctuation, then lowercase), which is used as the key for lexRecord lookup (inflVars.data)
  • Known to Lexicon
  • not in Lexicon: 5,704,964, in Lexicon: 450,234
  • Must be an inflection of a base form, with a POS of noun, verb, or adj.
  • Disallowed Lexicon POS: 620,085, Good lexicon POS: 453,261
  • Output format:
    #SYNONYM_CLASS|CUI|Preferred Term
    POS-1|EUI-1|Base-1
    POS-2|EUI-2|Base-2
    ...
    
SynonymCan.data.1.all
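
The language filter and the coreTermLc normalization described above can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: the class and method names are hypothetical, and "punctuation" is approximated as any leading or trailing character that is not a letter or digit, which may differ from the rule the real program applies. The field positions (LAT is field 2, STR is field 15 of the pipe-delimited MRCONSO.RRF row) follow the standard MRCONSO layout. The sample row in main is synthetic, with placeholder identifiers.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper; not taken from GetSynonymCandidates.java.
class CoreTerm {
    static final int LAT_INDEX = 1;  // field 2: language of term
    static final int STR_INDEX = 14; // field 15: the term string

    // Strips leading and trailing non-alphanumeric characters, then lowercases.
    // Assumption: this approximates "strip initial and final punctuation".
    static String toCoreTermLc(String term) {
        int start = 0;
        int end = term.length();
        while (start < end && !Character.isLetterOrDigit(term.charAt(start))) {
            start++;
        }
        while (end > start && !Character.isLetterOrDigit(term.charAt(end - 1))) {
            end--;
        }
        return term.substring(start, end).toLowerCase(Locale.ROOT);
    }

    // Returns the normalized term for an English row, or empty for any other row.
    static Optional<String> englishCoreTerm(String mrconsoLine) {
        String[] f = mrconsoLine.split("\\|", -1);
        if (f.length <= STR_INDEX || !"ENG".equals(f[LAT_INDEX])) {
            return Optional.empty();
        }
        return Optional.of(toCoreTermLc(f[STR_INDEX]));
    }

    public static void main(String[] args) {
        // Synthetic row: only CUI, LAT, and STR matter here; other fields are placeholders.
        String row = String.join("|", "C0000715", "ENG", "P", "L0", "PF", "S0", "Y", "A0",
            "", "", "", "SAB", "TTY", "CODE", "Abattoirs", "0", "N", "");
        System.out.println(englishCoreTerm(row).orElse("(skipped)"));
    }
}
```

The normalized string is then the lookup key into inflVars.data, which supplies the EUI, base form, and POS for the candidate line.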
Exclude terms with a disallowed STI, such as Chemicals and Drugs
  • CuiStiMap: use ./inData/MRSTY.RRF to map CUI to STI
  • disallowedStiSet: ./inData/SemGroups.filter.txt specifies the disallowed STIs (tagged by linguists), such as those whose SemGroup is CHEM.
  • disallowed: 2,867,695, allowed: 6,155,198
  • Example-1: The following synonym class is removed because of a disallowed STI
    #SYNONYM_CUI|C0000098|1-Methyl-4-phenylpyridinium
    128|E0020400|cyperquat|
    128|E0319735|mpp|
SynonymCan.data.2.disallow
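
The STI check above amounts to a map lookup followed by a set membership test. The sketch below shows one way to do it, with several assumptions: the names StiFilter, buildCuiStiMap, and isAllowed are hypothetical; the STI is taken to be the second field of each MRSTY.RRF row; and a CUI is rejected as soon as any one of its STIs is disallowed (the source does not say whether "any" or "all" is the rule). The layout of SemGroups.filter.txt is not described here, so the disallowed set is passed in ready-made.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; not taken from GetSynonymCandidates.java.
class StiFilter {
    // Builds CUI -> set of STIs from pipe-delimited MRSTY.RRF rows.
    // Assumption: field 1 is the CUI and field 2 is the STI.
    static Map<String, Set<String>> buildCuiStiMap(List<String> mrstyLines) {
        Map<String, Set<String>> map = new HashMap<>();
        for (String line : mrstyLines) {
            String[] f = line.split("\\|", -1);
            if (f.length < 2) {
                continue; // skip malformed rows
            }
            map.computeIfAbsent(f[0], k -> new HashSet<>()).add(f[1]);
        }
        return map;
    }

    // Assumption: one disallowed STI is enough to exclude the CUI.
    // A CUI with no known STI is treated as allowed.
    static boolean isAllowed(String cui, Map<String, Set<String>> cuiStiMap, Set<String> disallowedStiSet) {
        for (String sti : cuiStiMap.getOrDefault(cui, Set.of())) {
            if (disallowedStiSet.contains(sti)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Synthetic rows; CUI_A and CUI_B are placeholders, and the STI codes are sample values only.
        Map<String, Set<String>> map = buildCuiStiMap(List.of("CUI_A|T109|", "CUI_B|T047|"));
        Set<String> disallowed = Set.of("T109");
        System.out.println("CUI_A allowed: " + isAllowed("CUI_A", map, disallowed));
        System.out.println("CUI_B allowed: " + isAllowed("CUI_B", map, disallowed));
    }
}
```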
Exclude terms that are acronyms or abbreviations because they reduce precision too much.
  • They have too many expansions; for example, "AA" has 39 expansions in the Lexicon.
  • Preprocess:
    shell> flds 1 LRABR | sort -u > LRABR.f1.uSort
  • Use LRABR.f1.uSort to check whether a term is an abbreviation or acronym.
  • AcrAbb: 26,596, NotAcrAbb: 426,665
  • Example-2: lines with abbreviations are removed
    128|E0006443|abdomen|
    128|E0554771|abdominal|
    128|E0689526|abd|
    128|E0689531|abd|
    1|E0006444|abdominal|
    1|E0692924|abd|

  • Example-3: After acad is removed, this synonym class has only one candidate, so the class is removed.
    #SYNONYM_CUI|C0000876|Academies
    128|E0006659|academy|
    128|E0417973|acad|
    128|E0722828|acad|
SynonymCan.data.3.abb
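
The abbreviation step can be sketched as a filter over one sClass. The names below are hypothetical, and the sketch assumes that each line of LRABR.f1.uSort is the EUI of an abbreviation or acronym (field 1 of LRABR), so the check is done on the candidate's EUI. The demo data are the three candidate lines of Example-3.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper; not taken from GetSynonymCandidates.java.
class AbbFilter {
    record Candidate(int pos, String eui, String term) {}

    // Keeps only candidates whose EUI is not in the abbreviation/acronym set.
    // Assumption: the set holds EUIs read from LRABR.f1.uSort, one per line.
    static List<Candidate> removeAbbreviations(List<Candidate> candidates, Set<String> abbEuis) {
        List<Candidate> kept = new ArrayList<>();
        for (Candidate c : candidates) {
            if (!abbEuis.contains(c.eui())) {
                kept.add(c);
            }
        }
        return kept;
    }

    // A synonym class is only useful if more than one candidate remains.
    static boolean isStillClass(List<Candidate> kept) {
        return kept.size() > 1;
    }

    public static void main(String[] args) {
        List<Candidate> academies = List.of(
            new Candidate(128, "E0006659", "academy"),
            new Candidate(128, "E0417973", "acad"),
            new Candidate(128, "E0722828", "acad"));
        List<Candidate> kept = removeAbbreviations(academies, Set.of("E0417973", "E0722828"));
        System.out.println("kept=" + kept.size() + ", keep class=" + isStillClass(kept));
    }
}
```

In practice the set could be loaded once with java.nio.file.Files.readAllLines and wrapped in a HashSet, so each lookup stays constant time.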
Remove spVars (spelling variants) to reduce the manual tagging effort.
  • If a term has a synonym A, all spVars of that term are also synonyms of A.
  • Do not add a term to the sClass if its EUI already exists in the sClass (it is a spVar)
  • Use the EUI in inflVars.data
  • Any base form may be used for terms that have spVars (same EUI).
  • spVarNo: 274,469, after remove spVar no: 152,196
  • SpVars should be added back in the post-process
  • Example-4: lines that are spVars are removed
    #SYNONYM_CUI|C0000934|Acclimatization
    128|E0006730|acclimation|
    128|E0006731|acclimatisation|
    128|E0006731|acclimatization|
    128|E0007239|adaptation|
    128|E0422110|adaption|

    In the post-process, the deleted spVars are added back (if the tag of acclimatisation is [y]), so the record becomes (assuming all tags are [y]):
    #SYNONYM_CUI|C0000934|Acclimatization
    128|E0006730|acclimation|
    128|E0006731|acclimatisation|
    128|E0006731|acclimatization|
    128|E0007239|adaptation|
    128|E0422110|adaption|

  • Example-5: After the spVar is removed, this synonym class has only one candidate, so the class is removed.
    #SYNONYM_CUI|C0000880|Acanthamoeba Keratitis
    128|E0429790|acanthameba keratitis|
    128|E0429790|acanthamoeba keratitis|
    => In the post-process, no synonyms will be generated for this sClass.
SynonymCan.data.4.spVar
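
Because spVars share one EUI, the rule "do not add if the EUI already exists in the sClass" reduces to keeping the first candidate seen for each EUI. A minimal sketch, with hypothetical names and the candidate lines of Example-4 as demo data:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not taken from GetSynonymCandidates.java.
class SpVarFilter {
    record Candidate(int pos, String eui, String term) {}

    // Keeps the first candidate for each EUI and drops later ones (spelling variants).
    // LinkedHashMap preserves the original order of the surviving candidates.
    static List<Candidate> removeSpVars(List<Candidate> candidates) {
        Map<String, Candidate> byEui = new LinkedHashMap<>();
        for (Candidate c : candidates) {
            byEui.putIfAbsent(c.eui(), c);
        }
        return new ArrayList<>(byEui.values());
    }

    public static void main(String[] args) {
        List<Candidate> acclimatization = List.of(
            new Candidate(128, "E0006730", "acclimation"),
            new Candidate(128, "E0006731", "acclimatisation"),
            new Candidate(128, "E0006731", "acclimatization"),
            new Candidate(128, "E0007239", "adaptation"),
            new Candidate(128, "E0422110", "adaption"));
        for (Candidate c : removeSpVars(acclimatization)) {
            System.out.println(c.pos() + "|" + c.eui() + "|" + c.term() + "|");
        }
    }
}
```

Which spelling survives depends only on input order, which is consistent with the note that any base form may be used for terms with spVars.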
Remove nominalizations of a term.
  • If a term has a synonym A, all nominalizations of that term are also synonyms of A.
  • Sort sClass by CUI (key)
  • Use nomMap: ./inData/LRNOM; key: EUI of the noun, value: the set of EUIs of its nominalizations (adj and verb).
  • For implementation, keep the noun and remove its adj and verb nominalizations
  • nomNo: 819, passNomNo: 151,377
  • All nominalizations are synonyms (use LRNOM).
  • Example-6: lines that are nominalizations of a noun are removed
    #SYNONYM_CUI|C0001807|Aggressive behavior
    128|E0007791|aggression|
    128|E0007793|aggressiveness|
    128|E0528674|aggressive|
    1|E0007792|aggressive|
    => In the post-process, nominalizations of all lines are added as follows:
    #SYNONYM_CUI|C0001807|Aggressive behavior
    128|E0007791|aggression|
    128|E0007793|aggressiveness|
    128|E0528674|aggressive|
    1|E0007792|aggressive|
    1024|E02212219|aggress|
    1|E0007792|aggressive|
SynonymCan.data.5.nom
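
The nominalization step can be sketched as follows: build nomMap from LRNOM, then drop every candidate whose EUI is listed as the adj or verb counterpart of a noun already in the sClass. The names are hypothetical. The sketch assumes each LRNOM row has the noun EUI in field 1 and the related adj or verb EUI in field 4. The single LRNOM row in the demo is illustrative only, chosen to match the EUIs in Example-6, and is not quoted from the real file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; not taken from GetSynonymCandidates.java.
class NomFilter {
    record Candidate(int pos, String eui, String term) {}

    // Builds noun EUI -> EUIs of its adj/verb counterparts.
    // Assumption: LRNOM rows look like nounEUI|noun|cat|otherEUI|other|cat|
    static Map<String, Set<String>> buildNomMap(List<String> lrnomLines) {
        Map<String, Set<String>> nomMap = new HashMap<>();
        for (String line : lrnomLines) {
            String[] f = line.split("\\|", -1);
            if (f.length < 4) {
                continue; // skip malformed rows
            }
            nomMap.computeIfAbsent(f[0], k -> new HashSet<>()).add(f[3]);
        }
        return nomMap;
    }

    // Keeps nouns; removes candidates that nomMap links to a noun present in the same class.
    static List<Candidate> removeNominalizations(List<Candidate> candidates, Map<String, Set<String>> nomMap) {
        Set<String> toRemove = new HashSet<>();
        for (Candidate c : candidates) {
            toRemove.addAll(nomMap.getOrDefault(c.eui(), Set.of()));
        }
        List<Candidate> kept = new ArrayList<>();
        for (Candidate c : candidates) {
            if (!toRemove.contains(c.eui())) {
                kept.add(c);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative LRNOM row (assumed, not quoted from the real file).
        Map<String, Set<String>> nomMap =
            buildNomMap(List.of("E0007793|aggressiveness|noun|E0007792|aggressive|adj|"));
        List<Candidate> kept = removeNominalizations(List.of(
            new Candidate(128, "E0007791", "aggression"),
            new Candidate(128, "E0007793", "aggressiveness"),
            new Candidate(128, "E0528674", "aggressive"),
            new Candidate(1, "E0007792", "aggressive")), nomMap);
        for (Candidate c : kept) {
            System.out.println(c.pos() + "|" + c.eui() + "|" + c.term() + "|");
        }
    }
}
```

The same nomMap can be reused in the post-process to add the removed entries back once the noun has been tagged.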
Print sClasses with multiple candidates (an sClass must have more than 1 term)
  • notMultiCanNo: 96,455, multiCanNo: 54,922
SynonymCan.data
  • sClassNo: 21,655