The SPECIALIST Lexicon

Matcher: Parenthetic Acronyms (ACR)

I. Introduction

Acronym expansions are good multiword candidates. For example, "Wolf Motor Function Test" is the expansion of the acronym "WMFT" and is also a valid multiword in the Lexicon (LexMultiWord, LMW). On the other hand, some acronym expansions are not valid LMWs, such as "GLP|good laboratory practice" and "SOC|sense of coherence". More than 7,000 expansions of acronyms and abbreviations in the Lexicon are invalid LMWs.

If an n-gram contains a parenthetic acronym pattern (ACR), the term before the left parenthesis can be the expansion of the acronym. The parenthetic acronym pattern is an uppercase term enclosed in parentheses, (ACR). This pattern is used in exclusive filters to remove invalid terms. The expansion can also be used to generate multiword candidates.

II. Procedure
The following procedure is used to find valid multiwords from n-grams by the parenthetic acronym pattern. The process was enhanced in 2019 to integrate with all LMW candidate generation models in 00.CandidateList. The process used before 2018 was much more complicated and was not integrated with the other LMW candidate models.

  • Dir: ${MEDLINE_WORDS}/bin
  • Program:

    shell> 07.MatcherParAcr ${YEAR}

    Step | Description | Inputs | Outputs | Notes
    Get Candidates from acronym expansions
    1. Get n-grams that match the pattern [acronym expansion (ACR)]
    • GetParAcrFromNGram.java
    • ${OUT_DATA}/02.NGram/nGrams/nGram.${YEAR}
    • n-gram.${YEAR}.parAcr.rpt
    • n-gram.${YEAR}.parAcr.exp
    • n-gram.${YEAR}.parAcr.pass
    • n-gram.${YEAR}.parAcr.trap
      => matches pattern of [acronym expansion (ACR)], case sensitive
    • Uses the raw nGram Set (not distilled)
    • Uses FilterParentheticAcronym to trap terms with (ACR) pattern from n-gram set
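The pass/trap split in Step 1 can be sketched as follows. This is only an illustration in Python, not the Java code of GetParAcrFromNGram or FilterParentheticAcronym. The regular expression, the assumption that the (ACR) token ends the n-gram, and the function name `split_by_par_acr` are hypothetical.

```python
import re

# Assumed pattern: expansion text, whitespace, then an uppercase token in
# parentheses at the end of the n-gram, optionally with a plural "s".
# Matching is case sensitive, as noted for the .trap file.
PAR_ACR = re.compile(r"^.+\S\s+\([A-Z]{2,}s?\)$")


def split_by_par_acr(terms):
    """Return (passed, trapped): trapped terms carry the (ACR) pattern."""
    passed, trapped = [], []
    for term in terms:
        if PAR_ACR.match(term):
            trapped.append(term)
        else:
            passed.append(term)
    return passed, trapped


if __name__ == "__main__":
    passed, trapped = split_by_par_acr([
        "Wolf Motor Function Test (WMFT)",
        "good laboratory practice",
    ])
    print("pass:", passed)
    print("trap:", trapped)
```

Keeping both lists mirrors the .pass and .trap outputs listed for this step; only the trapped terms feed Step 2.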
    2. Get acronym|acronym expansion from n-grams with the (ACR) pattern and identify illegal acronym expansions by checking:
    • Initial characters of the first and last words of the expansion match the first and last characters of the acronym
    • Word count of the acronym expansion > 1
    • Word count of the acronym expansion matches the number of characters of the acronym? (TBD)
    • GetAcronymFromParAcrFile.java
    • n-gram.${YEAR}.parAcr.trap
    • acronymExp.pattern.acr
      => lines match acr
    • acronymExp.pattern.notAcr
      => lines not match acr

    • acronymExp.pattern
      => acronym|acronym expansion, unique
    • find the (ACR) or (ACRs) from n-grams
    • get the ACR
    • get singular form ACR from plural acronym (ACRs)
    • get the expansion
    • check if valid acronym:
      • the initials of the first and last words of the expansion match the first and last characters of the acronym (ignoring case)
      • the number of words in the expansion > 1 (must be a multiword)
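The extraction and validity checks listed for Step 2 can be sketched as below. This is a Python illustration under stated assumptions, not the code of GetAcronymFromParAcrFile.java. The regular expression and the function names `parse_par_acr`, `is_valid_expansion`, and `to_record` are hypothetical, and the plural form is assumed to be the acronym followed by a lowercase "s".

```python
import re

# Assumed shape of a trapped n-gram: "<expansion> (ACR)" or "<expansion> (ACRs)".
PAR_ACR = re.compile(r"^(?P<exp>.+\S)\s+\((?P<acr>[A-Z]{2,})s?\)$")


def parse_par_acr(term):
    """Return (acronym, expansion), or None when the pattern is absent.

    A plural acronym such as (MRIs) is reduced to its singular form MRI.
    """
    match = PAR_ACR.match(term)
    if match is None:
        return None
    return match.group("acr"), match.group("exp")


def is_valid_expansion(acr, exp):
    """Apply the two checks from the notes: multiword, and matching initials."""
    words = exp.split()
    if len(words) < 2:  # the expansion must be a multiword
        return False
    first_ok = words[0][0].lower() == acr[0].lower()
    last_ok = words[-1][0].lower() == acr[-1].lower()
    return first_ok and last_ok


def to_record(acr, exp):
    """Format a pair as 'acronym|acronym expansion'."""
    return f"{acr}|{exp}"


if __name__ == "__main__":
    acr, exp = parse_par_acr("Wolf Motor Function Test (WMFT)")
    print(to_record(acr, exp), is_valid_expansion(acr, exp))
```

The word-count-equals-acronym-length check is left out because the notes mark it as TBD.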
    3. Filter out (exclude) sub-terms of expansions
    • ExcludeSubTermExpFromParAcrFil.java
    • acronymExp.pattern
    • acronymExp.subterm.raw
      => candidates (ACR expansion)
    • acronymExp.subterm.pass
      => candidates (ACR|ACR expansion)
    • acronymExp.subterm.trap
      => subterm of others, invalid
    • if an expansion is a sub-term of another expansion with the same acronym, it is not a valid expansion and should be excluded
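The sub-term rule in Step 3 can be sketched as follows, assuming that "sub-term" means a shorter, contiguous run of words inside another expansion and that the comparison ignores case. Neither assumption is stated in these notes. The function names and the sample records are illustrative only and do not come from the program or its data files.

```python
from collections import defaultdict


def is_subterm(short, long):
    """True if the words of `short` appear contiguously inside `long`."""
    s, l = short.lower().split(), long.lower().split()
    if len(s) >= len(l):
        return False
    return any(l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1))


def exclude_subterms(records):
    """Split (acronym, expansion) pairs into (passed, trapped).

    A pair is trapped when its expansion is a sub-term of another
    expansion recorded for the same acronym.
    """
    by_acr = defaultdict(list)
    for acr, exp in records:
        by_acr[acr].append(exp)
    passed, trapped = [], []
    for acr, exp in records:
        if any(is_subterm(exp, other) for other in by_acr[acr]):
            trapped.append((acr, exp))
        else:
            passed.append((acr, exp))
    return passed, trapped


if __name__ == "__main__":
    # Made-up records, for illustration only.
    sample = [
        ("SOC", "sense of coherence"),
        ("SOC", "strong sense of coherence"),
        ("GLP", "good laboratory practice"),
    ]
    print(exclude_subterms(sample))
```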
    4. Get core-terms for the candidates input file
    • acronymExp.subterm.raw
    • acronymExp.subterm.raw.core
    • Get lowercase core-term of candidates
    5. Generate candidate list:
    • Automatically remove valid LMWs and remove/tag invalid LMWs from the candidates (subterm.raw.core):
    • This step has to run after the previous candidate list is completed (so it can tag all LMWs as valid)
    • acronymExp.subterm.raw.core

    • Auto remove LMWs in the Lexicon and tag invalid LMWs for the derived candidate list:
      Use ${MULTIWORDS}/bin/00.CandidateList
      Follow the instructions to:
      • update the previous year ACR/ABB candidate list in step 1.
      • update all candidate lists in step 1.
        • validFile: ${CAND_DIR}/0.LexiconInflVars/inflVars.data.current
          => AutoGen ${CAND_DIR}/0.LexiconInflVars/inflVars.data.current.1.uSort
      • update all invalid LMW files in step 2.
        • invalidFile: ${CAND_DIR}/totalTerms.all.lmw.no

    • ./GenCandList/acronymExp.tag.data.tag.final.tbd.${YEAR}.org.1.autoTag
    • ./GenCandList/acronymExp.tag.data.tag.final.tbd.${YEAR}.org.1.rmYesNo
      => Manually cp to ./CandList/acronymExp.tag.data.tag.final.tbd.${YEAR}.Used.rmYesNo
      => This candidate list includes only new candidates; link it to ${CANDIDATE}/2.MNSMatcherParAcr/acronymExp.tag.data.tag.final.tbd.${YEAR}

    • ./GenCandList/acronymExp.tag.data.tag.final.tbd.${YEAR}.org.1.rmYesTagNo
      => Manually cp to ./CandList/acronymExp.tag.data.tag.final.tbd.${YEAR}.Used.rmYesTagNo
      => This is the Candidate list, sent to linguists for lexBuilding (no need to tag after 2018+)
    Algorithm:

    • mkdir GenCandList
    • cd GenCandList
    • Auto remove valid and tag invalid LMWs

    • No need to tag after 2018+
    • After completing the candidate list, re-run Step 5 and update the candidate list in 00.CandidateList. The file acronymExp.tag.data.tag.final.tbd.2019.org.1.rmYesNo should be empty (0) because valid terms are in the Lexicon and invalid terms are tagged.
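The two derived lists in Step 5 (rmYesNo and rmYesTagNo) and the "should be empty" check can be sketched as below. This is a Python illustration, not the logic of 00.CandidateList. The function names, the in-memory sets, and the tuple output are hypothetical, and the tag strings "n" and "tbd" are borrowed from the auto-tag description in Step 30.

```python
def rm_yes_no(candidates, valid, invalid):
    """Drop known valid and known invalid terms; only new candidates remain."""
    return [c for c in candidates if c not in valid and c not in invalid]


def rm_yes_tag_no(candidates, valid, invalid):
    """Drop known valid terms and tag the rest as invalid ('n') or open ('tbd')."""
    tagged = []
    for c in candidates:
        if c in valid:
            continue
        tagged.append((c, "n" if c in invalid else "tbd"))
    return tagged


if __name__ == "__main__":
    candidates = ["wolf motor function test", "good laboratory practice"]
    valid = {"wolf motor function test"}
    invalid = {"good laboratory practice"}
    # Once every candidate is either in the Lexicon or tagged invalid,
    # the rmYesNo list is expected to be empty.
    print(rm_yes_no(candidates, valid, invalid))
```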
    Further Analysis on newly tagged candidates: pre-process
    10. Add WC to tagged candidates
    • UpdateWcToTaggedCandidates.java
    • acronymExp.tag.data.tag.new.yesNo
    • ./2.NGram/nGrams/nGramSet.2014.30.core
      => Must run coreterm on n-gram set from option 10 of 6.NGramUtil
    • acronymExp.tag.data.tag.new.yesNo.Wc
    • Update WC information from nGramSet.2014.30.core
    11. Get CUIs for tagged candidates
    • UpdateCuiToTaggedCandidates.java
    • acronymExp.tag.data.tag.new.yesNo.Wc
    • SMT configuration file
      => ${STMT_DIR}data/Config/smt.properties
      => Use the latest installed STMT
    • acronymExp.tag.data.tag.new.yesNo.Wc.Cui
    • Update CUI information from SMT:
      • 0: found CUIs with 0 subterm substitutions
      • 1: found CUIs with 1 subterm substitution
      • 2: found CUIs with 2 subterm substitutions
      • 3: no CUI found within 3 substitutions
    12. Tag Distilled for tagged candidates
    • UpdateDistToTaggedCandidates.java
    • acronymExp.tag.data.tag.new.yesNo.Wc.Cui
    • ./2.NGram/nGrams/distilledNGram.2014.core
      => Must run coreterm on distilledNGram set from option 11 of 6.NGramUtil
    • acronymExp.tag.data.tag.new.yesNo.WcCui.Dist
    • Update Distilled information (true|false) from nGramSet.2014.30.core
    13. Tag SpVar for tagged candidates
    • UpdateSpVarToTaggedCandidates.java
    • acronymExp.tag.data.tag.new.yesNo.Wc.Cui.Dist
    • TBD ./08.Matcher/nGrams/distilledNGram.2014.core
      => TBD. Must run spVar 6.NGramUtil
    • acronymExp.tag.data.tag.new.yesNo.WcCui.Dist.SpVar
    • Update SpVar information (true|false) from TBD
    Further Analysis on newly tagged candidates: process
    15. Analyze precision, recall, and F1 on candidates
    • CandidateUtil.AnalyzeCandidatePRF
    • acronymExp.tag.data.tag.new.yesNo.rpt
    • Developed, not used for analysis yet
    16. Analyze WC histogram on candidates
    • CandidateUtil.AnalyzeCandidateHistogram
    • acronymExp.tag.data.tag.new.yesNo.his
    • Developed, not used for analysis yet
    17. Analyze WC histogram details (smaller range on lower frequency) on candidates
    • CandidateUtil.AnalyzeCandidateHistogram
    • acronymExp.tag.data.tag.new.yesNo.his.min-max.sec.csv
    • Developed, not used for analysis yet
    Precision and Recall Analysis for AMIA full paper
    30. Tag Matcher-(ACR): baseline to be used as gold standard
    • Must get the latest inflVars.data from Lexicon
    • Must run Step 7-9 first (for invalidMwForParAcr.data.final)
    • TagCandidateFile.java auto-tag:
      • [y]: if it is in Lexicon (inflVars.data)
      • [n]: invalidMwForParAcr.data.final
      • [tbd]: otherwise
    • inFile: acronymExp.subterm.raw.core
    • validFile: inflVars.data.current
    • invalidFile: invalidMwForParAcr.data.current
    • acronymExp.subterm.raw.core.tag.${YEAR}
    • acronymExp.subterm.raw.core.tag.${YEAR}.no
    • acronymExp.subterm.raw.core.tag.${YEAR}.tbd
    • acronymExp.subterm.raw.core.tag.${YEAR}.yes
    • acronymExp.subterm.raw.core.tag.${YEAR}.yesNo
      => Used as the gold standard for precision and recall
    • Must run step 7-9 first (to update invalidMwForParAcr.data.current)
    • Must update inflVars.data.current from the Lexicon (approve all submitted records)
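The auto-tag rule in Step 30 ([y] if in the Lexicon, [n] if in the invalid list, [tbd] otherwise) and the split into .yes, .no, .tbd, and .yesNo outputs can be sketched as follows. This is a Python illustration, not TagCandidateFile.java. The function names, the in-memory sets, and the dictionary layout are assumptions.

```python
def tag_candidate(term, valid, invalid):
    """Return 'y', 'n', or 'tbd' following the auto-tag rule."""
    if term in valid:      # found in the Lexicon (inflVars.data)
        return "y"
    if term in invalid:    # found in the invalid multiword list
        return "n"
    return "tbd"


def split_by_tag(terms, valid, invalid):
    """Group terms by tag; 'yesNo' holds every term tagged y or n, in order."""
    buckets = {"yes": [], "no": [], "tbd": [], "yesNo": []}
    for term in terms:
        tag = tag_candidate(term, valid, invalid)
        if tag == "y":
            buckets["yes"].append(term)
            buckets["yesNo"].append((term, tag))
        elif tag == "n":
            buckets["no"].append(term)
            buckets["yesNo"].append((term, tag))
        else:
            buckets["tbd"].append(term)
    return buckets


if __name__ == "__main__":
    terms = ["wolf motor function test", "sense of coherence", "unseen term"]
    print(split_by_tag(terms, {"wolf motor function test"}, {"sense of coherence"}))
```

The 'yesNo' bucket corresponds to the file used as the gold standard for precision and recall.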
    31. Get precision, recall, F1 for Baseline (acronym expansion)
    • GetPRF.java
    • Must run step-30 first
    • goldStd: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo
    • test: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1
    • Output (PRF) is on the screen
    • not goldStd No: must be 0
    • err tag No: must be 0
    • Must finish step 30 first
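The precision, recall, and F1 figures in Step 31 can be sketched with the standard definitions. This is a Python illustration and may differ from what GetPRF.java computes. It assumes the gold standard maps each term to "y" or "n" and that the test set holds the terms a matcher or filter keeps as candidates. The function name `prf` and the `not_in_gold` counter (a guess at the "not goldStd No" check) are hypothetical.

```python
def prf(gold, predicted):
    """Return (precision, recall, f1, not_in_gold).

    gold: dict mapping term -> 'y' (valid LMW) or 'n' (invalid LMW)
    predicted: set of terms kept as candidates by the matcher under test
    """
    tp = sum(1 for t in predicted if gold.get(t) == "y")
    fp = sum(1 for t in predicted if gold.get(t) == "n")
    fn = sum(1 for t, tag in gold.items() if tag == "y" and t not in predicted)
    not_in_gold = sum(1 for t in predicted if t not in gold)  # expected to be 0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1, not_in_gold


if __name__ == "__main__":
    gold = {"a b": "y", "c d": "y", "e f": "n", "g h": "n"}
    # Baseline: every gold-standard term is kept, so recall is 1.0 by construction.
    print(prf(gold, set(gold)))
```

Under these assumptions the baseline run always has recall 1.0, and the later steps that add Distilled, SpVar, CUI, or EndWord filters trade recall for precision.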
    32. Tag (ACR) + Distilled set, PRF
    • CandidateUtil.ApplyDistToFile
    • CandidateUtil.GetPRF
    • in: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1
    • dist: nGrams/distilledNGram.${YEAR}.core

    • goldStd: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo
    • acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.dist
    • Must finish steps 30 and 31 first
    33. Tag (ACR) + SpVar, PRF
    • CandidateUtil.ApplySpVarToFile
    • CandidateUtil.GetPRF
    • in: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1
    • spVar: Candidates/distilledNGram.2014.core.150.sort.term.spVars.latest

    • goldStd: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo
    • acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.spVar
    • Must finish steps 30 and 31 first
    34. Tag (ACR) + CUI, PRF
    • CandidateUtil.ApplyCuiToFile
    • CandidateUtil.GetPRF
    • in: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1
    • smt: data/Config/smt.properties

    • goldStd: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo
    • acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.cui
    • Must finish steps 30 and 31 first
    35. Tag (ACR) + EndWord, PRF
    • CandidateUtil.ApplyEndWordToFile
    • CandidateUtil.GetPRF
    • in: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1
    • endWord: inFilterEndWord.data.used

    • goldStd: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo
    • acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.endWord
    • Must finish steps 30 and 31 first
    36. Tag (ACR) + CUI + SpVar, PRF
    • CandidateUtil.ApplySpVarToFile
    • CandidateUtil.GetPRF
    • in: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.cui
    • spVar: Candidates/distilledNGram.2014.core.150.sort.term.spVars.latest

    • goldStd: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo
    • acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.cui.spVar
    • Must finish steps 30, 31, and 34 first
    37. Tag (ACR) + CUI + SpVar + EndWord, PRF
    • CandidateUtil.ApplyEndWordToFile
    • CandidateUtil.GetPRF
    • in: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.cui.spVar
    • endWord: inFilterEndWord.data.used

    • goldStd: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo
    • acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.cui.spVar.endWord
    • Must finish steps 30, 31, 34, and 36 first
    Frequency (WC) Analysis on (ACR) for AMIA poster paper
    40. Add WC to GoldStd
    • CandidateUtil.AddWcToTermTagFile
    • in: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo
    • ngram_wc: nGrams/nGramSet.${YEAR}.30.core
    • acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.wc
    • Must finish the n-gram core-term step first
    • Must finish step 30 first
    41. Get Histogram of GoldStd
    • CandidateUtil.GetPRFHistogram
    • acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.wc
    • acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.wc.minWc-maxWc.increment.prfHis.csv
    • Run this once first to get the maximum WC, then use that value as input
    Frequency (WC) Analysis on LEXICON for AMIA poster paper
    45. Add WC to LMWs and LSWs
    • CandidateUtil.GetSwMwFromLexicon
      => get LMWs and LSWs from Lexicon (inflVars.data)
    • CandidateUtil.AddWcToTermFile
      => Add WC to LSWs
    • CandidateUtil.AddWcToTermFile
      => Add WC to LMWs
    • inflVars.data

    • nGrams/nGramSet.${YEAR}.30.core
    • ./10.LexWords/inflVars.data.lsw
    • ./10.LexWords/inflVars.data.lmw

    • ./10.LexWords/inflVars.data.lsw.wc
    • ./10.LexWords/inflVars.data.lmw.wc
    • This data is used in Figure-1 WC spectrum: no. of terms vs. WC class
    46. Get Histogram of LSWs
    • CandidateUtil.GetHistogram
    • ./10.LexWords/inflVars.data.lsw.wc
    • ./10.LexWords/inflVars.data.lsw.wc/minWc-maxWc.incWc.his.csv
    • Run this once first to get the maximum WC, then use that value as input
    47. Get Histogram of LMWs
    • ./10.LexWords/inflVars.data.lmw.wc
    • ./10.LexWords/inflVars.data.lmw.wc/minWc-maxWc.incWc.his.csv
    • Run this once first to get the maximum WC, then use that value as input
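The two-pass use of the histogram steps (first find the maximum WC, then bin with minWc, maxWc, and an increment) can be sketched as below. This is a Python illustration, not CandidateUtil.GetHistogram. The function name `wc_histogram`, the fixed-width binning, and the handling of values outside the range are assumptions.

```python
def wc_histogram(wcs, min_wc, max_wc, inc):
    """Count terms per WC class.

    Returns a list of (class_start, count) pairs, where each class covers
    [class_start, class_start + inc). Values outside [min_wc, max_wc] are
    ignored.
    """
    n_bins = (max_wc - min_wc) // inc + 1
    counts = [0] * n_bins
    for wc in wcs:
        if min_wc <= wc <= max_wc:
            counts[(wc - min_wc) // inc] += 1
    return [(min_wc + i * inc, counts[i]) for i in range(n_bins)]


if __name__ == "__main__":
    wcs = [30, 35, 49, 50, 120]      # made-up word counts
    top = max(wcs)                   # first pass: find the maximum WC
    for start, count in wc_histogram(wcs, 30, top, 20):  # second pass
        print(start, count)
```

Finding the maximum first keeps the number of classes fixed before binning, which is one plausible reason the notes ask for an initial run.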