The SPECIALIST Lexicon

Inclusive Filter: N-Gram with CUIs

I. Introduction

A LexMultiWord (LMW) must have a meaning (concept). If an n-gram has a concept (CUI) in the Metathesaurus, it is a good LMW candidate.
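The idea can be illustrated with a minimal sketch in Python. It is not the project's implementation (the programs listed below are Java tools and the smt lookup); the function name, the in-memory mapping, and the CUI values are hypothetical placeholders in CUI format, used only to show the inclusive filter: an n-gram is kept when at least one CUI is known for it.

```python
# Minimal sketch of an inclusive CUI filter (all names and data are hypothetical).
# In the real pipeline the term-to-CUI mapping comes from running smt.

def filter_ngrams_with_cui(ngrams, term_to_cuis):
    """Keep only the n-grams that map to at least one CUI."""
    kept = []
    for term in ngrams:
        # Look the term up case-insensitively; a missing term has no CUIs.
        cuis = term_to_cuis.get(term.lower(), set())
        if cuis:
            kept.append((term, sorted(cuis)))
    return kept


# Placeholder identifiers, not taken from any UMLS release.
term_to_cuis = {
    "heart attack": {"C0000001"},
    "blood pressure": {"C0000002", "C0000003"},
}
ngrams = ["heart attack", "of the patient", "blood pressure"]

candidates = filter_ngrams_with_cui(ngrams, term_to_cuis)
for term, cuis in candidates:
    print(term, cuis)
```

Here "of the patient" is dropped because no CUI is recorded for it, while the other two n-grams remain as LMW candidates.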

II. Procedure
The following procedure is used to find valid multiwords from n-grams that have Metathesaurus CUIs:

  • Dir: ${MEDLINE_WORDS}/bin
  • Program:

    shell> 09.MatcherCui ${YEAR}

    Step | Description | Inputs | Outputs | Notes
    Get CUI stats on Lexicon
    1. Add CUI to Lexicon (InflVars)
    • shell> cd ${STMT_DIR}/bin
    • smt.AA
    • Time: 1 hr. 10 min.
    • ${IN_DIR}inflVars.data.f1
    • inflVars.data.cui
    • Use smt to get CUIs for all terms from inflVars.data
    2. Analyze and get stats of CUI in Lexicon
    • AnalyzeCuiMapping.java
    • inflVars.data.cui
    • inflVars.data.cui.rpt
    • Get stats for single words|multiwords with CUIs
    Get CUI stats on MEDLINE n-gram set
    10. Add CUI to nGram
    • shell> cd ${STMT_DIR}/bin
    • smt.AA
    • Time: 16 hr.
    • distilledNGram.${YEAR}.core.lc
      =>run 06.NGramUtil ${YEAR}
      11
    • distilledNGram.${YEAR}.core.f2
    • distilledNGram.${YEAR}.core.cui
    • Get terms from distilled nGrams (field 2)
    • Use smt to get CUIs for all terms from nGrams
    11. Filter out nGrams without CUI
    • FilterCuiFromFile.java
    • distilledNGram.${YEAR}.core
    • distilledNGram.${YEAR}.core.cui
    • distilledNGram.${YEAR}.core.cui.out
    • Filter out nGrams without CUIs (including 1,2,3 substitutions)
    12. Tag results of step-11
    • TagCuiTerm.java
    • distilledNGram.${YEAR}.core.cui.out
    • Max. WC (2000000)
    • Min. WC (0)

    • inflVars.data.${YEAR}
    • inflVars.data.current
    • notMwFromCuiTerm.data.${YEAR}
    • notMwFromCuiTerm.data.current
    • distilledNGram.${YEAR}.core.lc.cui.out.stats (the stats between init year and current)
    • distilledNGram.${YEAR}.core.lc.cui.out.${YEAR}.tag.${MIN_WC}-${MAX_WC}
    • distilledNGram.${YEAR}.core.lc.cui.out.${YEAR}.tbd.${MIN_WC}-${MAX_WC}
    • distilledNGram.${YEAR}.core.lc.cui.out.current.tag.${MIN_WC}-${MAX_WC}
    • distilledNGram.${YEAR}.core.lc.cui.out.current.tbd.${MIN_WC}-${MAX_WC}
    Tag and calculate precision:
    • Send distilledNGram.${YEAR}.core.cui.out.current.tbd.${MIN_WC}-${MAX_WC} to linguist:
      • tag yes|no|exp
      • Add valid MW to Lexicon
    • Update files from tag result of "yes|no"
      • Update inflVars.data.current from Lexicon
      • Update notMwFromCuiTerm.data.current from no-tag
    • Rerun this step until current.tbd is 0
    • Check precision
    20. Apply Matcher-Cui on nGram TBD
    • distilledNGram.${YEAR}.core
    Process: Generate multiword candidates from the distilled n-gram set
    30. Proc: Apply filter of Lexicon on Distilled nGram (core)
    • ${N_GRAM}/distilledNGram.${YEAR}.core
    • ${IN_DIR}/inflVars.data
    • 30.disNGram.Core.lexicon.out
    • Use core-term of n-gram
    31. Proc: Apply matcher of Multiword on nGram (core) from results of Step 30
    • 30.disNGram.Core.lexicon.out
    • 31.disNGram.Core.lexicon.multiword.out
      => Core multiwords from n-grams, not in the Lexicon
    • Remove single word
    • This is used for ML models
    32-0. PreProc: Get unique English String from UMLS - MRCONSO.RRF.ENG
    • MRCONSO.RRF.ENG
      => link to MRCONSO.RRF.ENG.${YEAR}AA/AB
    • umlsStr.data
    • Preprocess to get English UMLS String for step 32
    32. Proc: Apply matcher of UMLS-Str on distilled nGram (must run 31)
    • 30.disNGram.Core.lexicon.out
    • umlsStr.data
    • 32.disNGram.Core.umlsStr.out
      =>n-grams with CUIs (UMLS-String)
    • A simple hashTable lookup to match n-gram to UMLS String
    33. Proc: Apply matcher of Multiword on nGram (core) from results of Step 32
    • 32.disNGram.Core.umlsStr.out
    • 33.disNGram.Core.multiword.out
      => Core multiwords from n-grams with CUIs
    • Remove single word
    • This is used for ML models
    34. Proc: Apply matcher of EndWord (top 33) on nGram (core)
    • 33.disNGram.Core.multiword.out
    • endWords.top33.data
      => Must run 10.MatchEndWord first
      option 1, to get the top endword list

      => Manually create endWords.top${NN}.data
      => link endWords.top.data (used for top endWords)
    • 34.disNGram.Core.endword.out
      => Candidate with CUI and top endwords
    • Use the top endWord for matcher
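
As a rough illustration of the matchers in Steps 32 to 34, the sketch below chains a hash-set lookup against UMLS strings, a single-word removal, and an end-word match. It is written in Python for brevity and is not the project's code; the function names, the whitespace-based definition of a multiword, and the sample data are assumptions made for this example.

```python
# Hedged sketch of Steps 32-34 (hypothetical names and in-memory data).

def match_umls_strings(ngrams, umls_strings):
    """Step 32 idea: keep n-grams found in a hash set of UMLS strings."""
    lookup = {s.lower() for s in umls_strings}
    return [t for t in ngrams if t.lower() in lookup]


def keep_multiwords(terms):
    """Step 33 idea: drop single words (assumes words are space-separated)."""
    return [t for t in terms if len(t.split()) > 1]


def match_end_words(terms, top_end_words):
    """Step 34 idea: keep terms whose last word is in the top end-word list."""
    ends = {w.lower() for w in top_end_words}
    return [t for t in terms if t.split() and t.split()[-1].lower() in ends]


ngrams = ["aspirin", "heart disease", "renal failure", "in the study"]
umls_strings = ["Aspirin", "Heart Disease", "Renal failure"]
top_end_words = ["disease", "syndrome"]

step32 = match_umls_strings(ngrams, umls_strings)
step33 = keep_multiwords(step32)
step34 = match_end_words(step33, top_end_words)
print(step34)
```

Each stage only narrows the previous stage's output, which mirrors how each numbered output file above feeds the next step.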
    Post-Process: Auto remove, tag, and re-sort to final format
    35. Post-Proc: Automatically remove candidates that are already in the Lexicon, and remove or tag candidates that are invalid LMWs based on previous tags
    • 34.disNGram.Core.endword.out
    • Use 00.CandidateList
      => File updates are required; run Steps 1-3
      • validFile: ${CAND_DIR}/0.LexiconInflVars/inflVars.data.current.1.uSort
      • invalidFile: ${CAND_DIR}/totalTerms.all.lmw.no
    • 35.disNGram.Core.endword.out.autoTag
    • 35.disNGram.Core.endword.out.rmYesNo
    • 35.disNGram.Core.endword.out.rmYesTagNo
    • Filter and tag candidates that are in the previous years' lists
    • Make sure to update the data and re-run 00.CandidateList for the latest updates
    36. Post-Proc: Rearrange (re-sort) canList by grouping singulars/plurals
    • 35.disNGram.Core.endword.out.rmYesNo
    • 35.disNGram.Core.endword.out.rmYesTagNo
    • 36.disNGram.Core.endword.out.rmYesNo.gsp
      => cp to 36.disNGram.Core.endword.out.rmYesNo.gsp.${YEAR}
    • 36.disNGram.Core.endword.out.rmYesTagNo.gsp
      => cp to 36.disNGram.Core.endword.out.rmYesTagNo.gsp.${YEAR}
      => This file is used for annual candidate list
    • Re-sort and group the list to put singular and plural forms together
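
The post-processing in Steps 35 and 36 can be sketched as follows. This Python example is only an approximation built on assumptions: the meaning of the rmYesNo and rmYesTagNo outputs is inferred from the file names, the function names are hypothetical, and the plural handling is a crude trailing-"s" rule rather than the Lexicon's inflection data.

```python
# Hedged sketch of Steps 35-36 (hypothetical names; behavior inferred, not confirmed).

def auto_tag(candidates, valid_terms, invalid_terms):
    """Return (rm_yes_no, rm_yes_tag_no).

    rm_yes_no:     candidates with known-valid and known-invalid terms removed.
    rm_yes_tag_no: known-valid terms removed, known-invalid terms tagged "no".
    """
    valid = {t.lower() for t in valid_terms}
    invalid = {t.lower() for t in invalid_terms}
    rm_yes_no, rm_yes_tag_no = [], []
    for term in candidates:
        key = term.lower()
        if key in valid:
            continue  # already in the Lexicon: dropped from both outputs
        if key in invalid:
            rm_yes_tag_no.append((term, "no"))
            continue
        rm_yes_no.append(term)
        rm_yes_tag_no.append((term, ""))
    return rm_yes_no, rm_yes_tag_no


def singular_key(term):
    """Crude grouping key: strip one trailing 's' from the last word."""
    words = term.lower().split()
    last = words[-1]
    if len(last) > 3 and last.endswith("s") and not last.endswith("ss"):
        words[-1] = last[:-1]
    return " ".join(words)


def group_singular_plural(terms):
    """Sort so that a singular form and its plural end up adjacent."""
    return sorted(terms, key=lambda t: (singular_key(t), t.lower()))


candidates = ["stem cells", "heart disease", "bone marrow", "stem cell", "study shows"]
rm_yes_no, rm_yes_tag_no = auto_tag(
    candidates, valid_terms=["heart disease"], invalid_terms=["study shows"]
)
grouped = group_singular_plural(rm_yes_no)
print(grouped)
```

In this sample, "heart disease" is dropped as already valid, "study shows" is removed from one output and tagged "no" in the other, and "stem cell" and "stem cells" become neighbors after grouping.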
    Future Usage
    40. PreProc: Get nGram spVar from result of 8.MatcherSpVar
    • medline.2.byM2CES.2.out.30.spVars.2016
    • nGramSpVars.data
    • Get the n-grams that match spVar patterns
    41. Proc: Apply matcher of nGram-SpVar on nGram
    • 34.disNGram.Core.endword.out
    • nGramSpVars.data
    • 36.disNGram.Core.spVar
    • Loses recall; not used for now