The SPECIALIST Lexicon

Generating Synonym Candidates from the Metathesaurus

I. Pre-Process

  • Directory: ${LEXICON_SYNONYMS}
  • program: ./Meta/GetSynonymCandidates.java
  • Inputs:
    • MRCONSO.RRF
    • MRSTY.RRF
    • SemGroups.filter.txt
    • inflVars.data
    • LRABR -> LRABR.f1.uSort
    • LRNOM

    • cuiPreferredTerm.data (generated from step 1)
  • Outputs:
    • synonymCan.data.*

II. Process

  • Directory: ${LEXICON_SYNONYMS}/bin
  • program: GetSynonyms ${year}

    Option | Descriptions | Inputs | Outputs
    0. Set up directories and files
    • LEXICON${YEAR}
    • Metathesaurs${PRE_YEAR}AA
    • ./inData/*
    • ./outData
    1. Get CUI|Preferred Terms from MRCONSO.RRF
    • Meta.GetCuiPtFromMrConso.java
    • Meta.MrConsoUtil.java
    • field-2: LAT = ENG (language of term)
    • field-3: TS = P (term status)
    • field-5: STT = PF (String type)
    • field-7: ISPREF = Y (Atom status, preferred: Y)
    • ./inData/MRCONSO.RRF
    • ./outData/Candidates/cuiPreferredTerm.data
      CUI|Preferred Term
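As an illustration of the Step 1 field filters, the following Java sketch keeps only rows with LAT = ENG, TS = P, STT = PF, and ISPREF = Y, and emits CUI|Preferred Term. It is not the code in Meta.GetCuiPtFromMrConso.java: the class and method names are hypothetical, the sample rows are made up, and it assumes the standard MRCONSO.RRF column order with the term string (STR) in field-15.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; names are hypothetical, not the Lexicon tool's API.
class CuiPreferredTermSketch {
    // 0-based MRCONSO.RRF columns used here (assumed standard layout):
    // 0 = CUI, 1 = LAT, 2 = TS, 4 = STT, 6 = ISPREF, 14 = STR.
    static String toCuiPreferredTerm(String line) {
        String[] f = line.split("\\|", -1);
        if (f.length < 15) {
            return null; // malformed row: too few fields
        }
        boolean preferred = f[1].equals("ENG") // field-2: language of term
                && f[2].equals("P")            // field-3: term status
                && f[4].equals("PF")           // field-5: string type
                && f[6].equals("Y");           // field-7: atom status
        return preferred ? f[0] + "|" + f[14] : null;
    }

    // Apply the filter to every row and collect the CUI|Preferred Term lines.
    static List<String> filter(List<String> rows) {
        List<String> out = new ArrayList<>();
        for (String row : rows) {
            String result = toCuiPreferredTerm(row);
            if (result != null) {
                out.add(result);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up rows in MRCONSO.RRF layout, for demonstration only.
        List<String> rows = Arrays.asList(
            "C0000001|ENG|P|L0000001|PF|S0000001|Y|A0000001||||SRC|PT|X1|example term|0|N||",
            "C0000001|ENG|S|L0000002|PF|S0000002|Y|A0000002||||SRC|SY|X1|another term|0|N||");
        for (String line : filter(rows)) {
            System.out.println(line);
        }
    }
}
```

Only the first sample row passes all four filters; the second is dropped because its TS is S rather than P.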
    2. Get raw synonym candidates from MRCONSO.RRF
    • Meta.GetSynonymCandidates.java
    • same CUI
    • English term: Field-2, LAT = ENG
    • STI is not disallowed (e.g., Chemicals & Drugs); disallowed STIs are defined in SemGroups.filter.txt, and MRSTY.RRF is used to map CUI to STI
    • Must be known to the Lexicon
    • Must have a POS of adj, noun, or verb, and the inflection must be base
    • Remove acronyms => they lower precision
    • Remove spVars => they will be added in Post-process
    • Remove nominalizations => they will be added in Post-process
    • Remove classes with only a single candidate => removes pure spVar & nom classes
    • same CUI (definition of synonym, same concept)
    • Field-2, LAT = ENG (English only)
    • Terms are normalized into lowercased core-terms (strip initial and final punctuation, then lowercase) and used as keys in the lookup mapping for Lexical records
    • known to Lexicon (design spec.)
    • have POS of adj, noun, or verb (design spec.)
    • infl is base (design spec.)
    • Base forms are used in the output
    • ./outData/Candidates/SynonymCan.raw.data
      This is the overall raw synonym candidate list from the specified (latest) Lexicon and UMLS. This is NOT the candidate list sent to linguists. Use Step-10 and Step-11 for the candidate list sent to linguists.
      #SYNONYM_CLASS|CUI|Preferred Term
      POS-1|EUI-1|Base-1
      POS-2|EUI-2|Base-2
      ...
      
    • Remove nominalization:
    • The log should not have any error message, such as: **ERR@CuiFilterBySti.FilterCui( ): No STI found for CXXXXXXX
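Two of the Step 2 rules lend themselves to a short Java sketch: the core-term lookup key (strip initial and final punctuation, then lowercase) and the removal of classes that are left with a single candidate. This is a minimal sketch under stated assumptions, not the logic of Meta.GetSynonymCandidates.java: the class and method names are hypothetical, "punctuation" is assumed to mean ASCII punctuation, surrounding whitespace is also trimmed, and the EUIs and base forms in the sample data are made up.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only; names are hypothetical, not the Lexicon tool's API.
class SynonymClassSketch {
    // Build a lookup key: trim, strip leading and trailing ASCII punctuation,
    // then lowercase. Punctuation inside the term is kept.
    static String coreTermKey(String term) {
        String stripped = term.trim().replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
        return stripped.toLowerCase(Locale.ROOT);
    }

    // Keep only synonym classes (CUI -> candidate lines) with two or more
    // candidates; a class with a single candidate has nothing to be a synonym of.
    static Map<String, List<String>> dropSingletons(Map<String, List<String>> classes) {
        Map<String, List<String>> kept = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : classes.entrySet()) {
            if (e.getValue().size() >= 2) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(coreTermKey("(Heart)"));

        // Made-up classes with POS|EUI|Base candidate lines.
        Map<String, List<String>> classes = new LinkedHashMap<>();
        classes.put("C0000001", Arrays.asList("noun|E0000001|base-a", "noun|E0000002|base-b"));
        classes.put("C0000002", Arrays.asList("noun|E0000003|base-c"));
        System.out.println(dropSingletons(classes).keySet());
    }
}
```

Applying the singleton filter after spVars and nominalizations have been removed is what drops the classes that consisted only of those variants.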
    3. Analyze and check raw synonym candidate list
    • Read the list in and print it out, then check that the two are the same.
    • ./outData/Candidates/synonymCan.data
    • ./outData/Candidates/synonymCan.raw.data.out
    • ./outData/Candidates/diff (must be 0)
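The read-in, print-out, and compare check in Step 3 can be sketched in Java as a round trip over the class format shown for Step 2. This is only a sketch: the class and method names are hypothetical, the sample lines are made up, and it assumes each class starts with a header line of the form #SYNONYM_CLASS|CUI|Preferred Term followed by its POS|EUI|Base lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Illustrative sketch only; names are hypothetical, not the Lexicon tool's API.
class RoundTripSketch {
    static final String HEADER = "#SYNONYM_CLASS";

    // One synonym class: its header fields plus the candidate lines under it.
    static class SynClass {
        final String cui;
        final String preferredTerm;
        final List<String[]> candidates = new ArrayList<>();

        SynClass(String cui, String preferredTerm) {
            this.cui = cui;
            this.preferredTerm = preferredTerm;
        }
    }

    // Read lines into classes; each header line starts a new class.
    static List<SynClass> read(List<String> lines) {
        List<SynClass> classes = new ArrayList<>();
        for (String line : lines) {
            String[] f = line.split("\\|", -1);
            if (f[0].equals(HEADER) && f.length >= 3) {
                classes.add(new SynClass(f[1], f[2]));
            } else if (!classes.isEmpty()) {
                classes.get(classes.size() - 1).candidates.add(f);
            }
        }
        return classes;
    }

    // Print the classes back out in the same line format.
    static List<String> write(List<SynClass> classes) {
        List<String> out = new ArrayList<>();
        for (SynClass c : classes) {
            out.add(HEADER + "|" + c.cui + "|" + c.preferredTerm);
            for (String[] candidate : c.candidates) {
                out.add(String.join("|", candidate));
            }
        }
        return out;
    }

    // Count positions where the two line lists differ (including length mismatch).
    static int countDiffs(List<String> a, List<String> b) {
        int diffs = 0;
        for (int i = 0; i < Math.max(a.size(), b.size()); i++) {
            String x = i < a.size() ? a.get(i) : null;
            String y = i < b.size() ? b.get(i) : null;
            if (!Objects.equals(x, y)) {
                diffs++;
            }
        }
        return diffs;
    }

    public static void main(String[] args) {
        // Made-up candidate file content.
        List<String> lines = Arrays.asList(
            "#SYNONYM_CLASS|C0000001|example term",
            "noun|E0000001|base-a",
            "noun|E0000002|base-b");
        System.out.println("diff = " + countDiffs(lines, write(read(lines))));
    }
}
```

A nonzero count would indicate lines that the reader dropped or altered, for example content before the first header or a header with extra fields.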

      Next, go to Step-10.

  • Send the result (SynonymCan.data) to linguists to tag [y|n] for valid and invalid synonyms.
  • This step is used only when all the synonym candidates are completely tagged (2022+). Accordingly, it is skipped for the 2018 and 2019 releases, for which we are still working on the first synonym candidate list, generated for 2017. That work is expected to be completed in the next couple of years.