SPECIALIST Lexicon

Matcher: Parenthetic Acronyms (ACR)

I. Introduction

Acronym expansions are good candidates for multiwords. For example, "Wolf Motor Function Test" is the expansion of the acronym "WMFT" and is also a valid multiword in the Lexicon (LexMultiWord, LMW). On the other hand, some acronym expansions are not valid LMWs, such as "GLP|good laboratory practice" and "SOC|sense of coherence". More than 7,000 expansions of acronyms and abbreviations in the Lexicon are invalid LMWs.

If an n-gram contains a parenthetic acronym pattern (ACR), the term before the left parenthesis may be the expansion of the acronym. The parenthetic acronym pattern is an uppercase term enclosed in parentheses, (ACR). This pattern is used in exclusive filters to remove invalid terms. The expansion can also be used to generate multiword candidates.
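The pattern described above can be sketched with a regular expression. This is a minimal illustration, not the Lexicon's FilterParentheticAcronym implementation: the function name `split_par_acr` is hypothetical, and the sketch assumes the n-gram ends with the parenthetic acronym and that the acronym consists of at least two uppercase letters.

```python
import re

# Assumption: the n-gram ends with an all-uppercase token in parentheses,
# e.g. "Wolf Motor Function Test (WMFT)". The actual filter may accept
# other shapes (digits, hyphens, acronyms in the middle of the n-gram).
PAR_ACR = re.compile(r"^(?P<expansion>.+?)\s*\((?P<acronym>[A-Z]{2,})\)$")


def split_par_acr(ngram):
    """Return (expansion, acronym) if the n-gram ends with (ACR), else None."""
    match = PAR_ACR.match(ngram.strip())
    if match is None:
        return None
    return match.group("expansion"), match.group("acronym")
```

The match is case sensitive, so an n-gram such as "good laboratory practice (glp)" is not trapped by this sketch, while "Wolf Motor Function Test (WMFT)" yields the candidate expansion and its acronym.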

II. Procedure
The following procedure finds valid multiwords from n-grams using the parenthetic acronym pattern. The process was enhanced in 2019 to integrate with all LMW candidate generation models in 00.CandidateList. The process used before 2018 was much more complicated and was not integrated with the other LMW candidate models.

Dir: ${MEDLINE_WORDS}/bin

Program:

shell> 07.MatcherParAcr ${YEAR}

Step	Description	Inputs	Outputs	Notes
Get Candidates from acronym expansions
1	Get n-grams that match the pattern [acronym expansion (ACR)] `GetParAcrFromNGram.java`	${OUT_DATA}/02.NGram/nGrams/nGram.${YEAR}	n-gram.${YEAR}.parAcr.rpt; n-gram.${YEAR}.parAcr.exp; n-gram.${YEAR}.parAcr.pass; n-gram.${YEAR}.parAcr.trap => matches the pattern [acronym expansion (ACR)], case sensitive	Uses the raw n-gram set (not distilled). Uses FilterParentheticAcronym to trap terms with the (ACR) pattern from the n-gram set.
2	Get acronym|acronym expansion from n-grams with the (ACR) pattern and identify illegal acronym expansions by these checks: (1) the initial characters of the first and last words of the expansion match the first and last characters of the acronym; (2) the word count of the expansion is > 1; (3) the word count of the expansion matches the number of characters of the acronym? (TBD) `GetAcronymFromParAcrFile.java`	n-gram.${YEAR}.parAcr.trap	acronymExp.pattern.acr => lines that match acr; acronymExp.pattern.notAcr => lines that do not match acr; acronymExp.pattern => acronym|acronym expansion, unique	Find the (ACR) or (ACRs) in the n-grams; get the ACR; get the singular form ACR from a plural acronym (ACRs); get the expansion; check whether the acronym is valid: the first and last initials of the expansion match the acronym (ignoring case), and the word count of the expansion is > 1 (must be a multiword).
3	Filter out (exclude) sub-terms of expansions `ExcludeSubTermExpFromParAcrFil.java`	acronymExp.pattern	acronymExp.subterm.raw => candidates (ACR expansion); acronymExp.subterm.pass => candidates (ACR|ACR expansion); acronymExp.subterm.trap => sub-terms of other expansions, invalid	If an expansion is a sub-term of another expansion with the same acronym, it is not a valid expansion and should be excluded.
4	Get core-terms for the candidate input file	acronymExp.subterm.raw	acronymExp.subterm.raw.core	Get the lowercase core-term of each candidate
5	Generate candidate list: automatically remove valid LMWs and remove/tag invalid LMWs from the candidates (subterm.raw.core). This step has to run after the previous candidate list is completed (so it can tag all LMWs as valid).	acronymExp.subterm.raw.core. Automatically remove LMWs in the Lexicon and tag invalid LMWs for the derived candidate list: use ${MULTIWORDS}/bin/00.CandidateList and follow the instructions to: update the previous year's ACR/ABB candidate list in step 1; update all candidate lists in step 1. validFile: ${CAND_DIR}/0.LexiconInflVars/inflVars.data.current => AutoGen ${CAND_DIR}/0.LexiconInflVars/inflVars.data.current.1.uSort; update all invalid LMW files in step 2. invalidFile: ${CAND_DIR}/totalTerms.all.lmw.no `1 2 3 4`	./GenCandList/acronymExp.tag.data.tag.final.tbd.${YEAR}.org.1.autoTag; ./GenCandList/acronymExp.tag.data.tag.final.tbd.${YEAR}.org.1.rmYesNo => manually cp to ./CandList/acronymExp.tag.data.tag.final.tbd.${YEAR}.Used.rmYesNo => this candidate list includes only new candidates; link to ${CANDIDATE}/2.MNSMatcherParAcr/acronymExp.tag.data.tag.final.tbd.${YEAR}; ./GenCandList/acronymExp.tag.data.tag.final.tbd.${YEAR}.org.1.rmYesTagNo => manually cp to ./CandList/acronymExp.tag.data.tag.final.tbd.${YEAR}.Used.rmYesTagNo => this is the candidate list sent to linguists for lexBuilding (no need to tag after 2018+)	Algorithm: mkdir GenCandList; cd GenCandList; automatically remove valid LMWs and tag invalid LMWs (no need to tag after 2018+). After the candidate list is completed, re-run step 5 and update the candidate list in 00.CandidateList. The file acronymExp.tag.data.tag.final.tbd.2019.org.1.rmYesNo should be empty (0) because valid terms are in the Lexicon and invalid terms are tagged.
Further Analysis on newly tagged candidates: pre-process
10	Add WC to tagged candidates `UpdateWcToTaggedCandidates.java`	acronymExp.tag.data.tag.new.yesNo ./2.NGram/nGrams/nGramSet.2014.30.core => Must run coreterm on n-gram set from option 10 of `6.NGramUtil`	acronymExp.tag.data.tag.new.yesNo.Wc	Update WC information from nGramSet.2014.30.core
11	Get CUIs for tagged candidates `UpdateCuiToTaggedCandidates.java`	acronymExp.tag.data.tag.new.yesNo.Wc; SMT configuration file => ${STMT_DIR}data/Config/smt.properties => use the latest installed STMT	acronymExp.tag.data.tag.new.yesNo.Wc.Cui	Update CUI information from SMT: 0: found CUIs with 0 subterm substitutions; 1: found CUIs with 1 subterm substitution; 2: found CUIs with 2 subterm substitutions; 3: no CUI found within 3 substitutions
12	Tag Distilled for tagged candidates `UpdateDistToTaggedCandidates.java`	acronymExp.tag.data.tag.new.yesNo.Wc.Cui; ./2.NGram/nGrams/distilledNGram.2014.core => Must run coreterm on distilledNGram set from option 11 of `6.NGramUtil`	acronymExp.tag.data.tag.new.yesNo.WcCui.Dist	Update Distilled information (true|false) from nGramSet.2014.30.core
13	Tag SpVar for tagged candidates `UpdateSpVarToTaggedCandidates.java`	acronymExp.tag.data.tag.new.yesNo.Wc.Cui.Dist; TBD ./08.Matcher/nGrams/distilledNGram.2014.core => TBD. Must run spVar `6.NGramUtil`	acronymExp.tag.data.tag.new.yesNo.WcCui.Dist.SpVar	Update SpVar information (true|false) from TBD
Further Analysis on newly tagged candidates: process
15	Analyze precision, recall, and F1 on candidates	`CandidateUtil.AnalyzeCandidatePRF`	acronymExp.tag.data.tag.new.yesNo.rpt	Developed, not used for analysis yet
16	Analyze WC histogram on candidates	`CandidateUtil.AnalyzeCandidateHistogram`	acronymExp.tag.data.tag.new.yesNo.his	Developed, not used for analysis yet
17	Analyze WC histogram details (smaller range on lower frequency) on candidates	`CandidateUtil.AnalyzeCandidateHistogram`	acronymExp.tag.data.tag.new.yesNo.his.min-max.sec.csv	Developed, not used for analysis yet
Precision and Recall Analysis for AMIA full paper
30	Tag Matcher-(ACR): baseline to be used as the gold standard. Must get the latest inflVars.data from the Lexicon. Must run steps 7-9 first (for invalidMwForParAcr.data.final). `TagCandidateFile.java` auto-tag: [y]: if it is in the Lexicon (inflVars.data); [n]: if it is in invalidMwForParAcr.data.final; [tbd]: otherwise	inFile: acronymExp.subterm.raw.core; validFile: inflVars.data.current; invalidFile: invalidMwForParAcr.data.current	acronymExp.subterm.raw.core.tag.${YEAR}; acronymExp.subterm.raw.core.tag.${YEAR}.no; acronymExp.subterm.raw.core.tag.${YEAR}.tbd; acronymExp.subterm.raw.core.tag.${YEAR}.yes; acronymExp.subterm.raw.core.tag.${YEAR}.yesNo => used as the gold standard for precision and recall	Must run steps 7-9 first (to update invalidMwForParAcr.data.current). Must update inflVars.data.current from the Lexicon (approve all submitted records).
31	Get precision, recall, and F1 for the baseline (acronym expansion) `GetPRF.java`. Must run step 30 first.	goldStd: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo; test: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1	Output (PRF) is printed to the screen	not goldStd no.: must be 0; err tag no.: must be 0. Must finish step 30 first.
32	Tag (ACR) + Distilled set, PRF `CandidateUtil.ApplyDistToFile` `CandidateUtil.GetPRF`	in: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1; dist: nGrams/distilledNGram.${YEAR}.core; goldStd: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo	acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.dist	Must finish steps 30 and 31 first
33	Tag (ACR) + SpVar, PRF `CandidateUtil.ApplySpVarToFile` `CandidateUtil.GetPRF`	in: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1; spVar: Candidates/distilledNGram.2014.core.150.sort.term.spVars.latest; goldStd: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo	acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.spVar	Must finish steps 30 and 31 first
34	Tag (ACR) + CUI, PRF `CandidateUtil.ApplyCuiToFile` `CandidateUtil.GetPRF`	in: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1; smt: data/Config/smt.properties; goldStd: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo	acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.cui	Must finish steps 30 and 31 first
35	Tag (ACR) + EndWord, PRF `CandidateUtil.ApplyEndWordToFile` `CandidateUtil.GetPRF`	in: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1; endWord: inFilterEndWord.data.used; goldStd: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo	acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.endWord	Must finish steps 30 and 31 first
36	Tag (ACR) + CUI + SpVar, PRF `CandidateUtil.ApplySpVarToFile` `CandidateUtil.GetPRF`	in: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.cui; spVar: Candidates/distilledNGram.2014.core.150.sort.term.spVars.latest; goldStd: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo	acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.cui.spVar	Must finish steps 30, 31, and 34 first
37	Tag (ACR) + CUI + SpVar + EndWord, PRF `CandidateUtil.ApplyEndWordToFile` `CandidateUtil.GetPRF`	in: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.cui.spVar; endWord: inFilterEndWord.data.used; goldStd: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo	acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.cui.spVar.endWord	Must finish steps 30, 31, 34, and 36 first
Frequency (WC) Analysis on (ACR) for AMIA poster paper
40	Add WC to GoldStd `CandidateUtil.AddWcToTermTagFile`	in: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo; ngram_wc: nGrams/nGramSet.${YEAR}.30.core	acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.wc	Must finish the n-gram core-term step first. Must finish step 30 first.
41	Get Histogram of GoldStd `CandidateUtil.GetPRFHistogram`	acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.wc	acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.wc.minWc-maxWc.increment.prfHis.csv	Run this once to get the max. WC, then use that value as input
Frequency (WC) Analysis on LEXICON for AMIA poster paper
45	Add WC to LMWs and LSWs: `CandidateUtil.GetSwMwFromLexicon` => get LMWs and LSWs from the Lexicon (inflVars.data); `CandidateUtil.AddWcToTermFile` => add WC to LSWs; `CandidateUtil.AddWcToTermFile` => add WC to LMWs	inflVars.data; nGrams/nGramSet.${YEAR}.30.core	./10.LexWords/inflVars.data.lsw; ./10.LexWords/inflVars.data.lmw; ./10.LexWords/inflVars.data.lsw.wc; ./10.LexWords/inflVars.data.lmw.wc	This data is used in Figure-1 WC spectrum: no. of terms vs. WC class
46	Get Histogram of LSWs `CandidateUtil.GetHistogram`	./10.LexWords/inflVars.data.lsw.wc	./10.LexWords/inflVars.data.lsw.wc/minWc-maxWc.incWc.his.csv	Run this once to get the max. WC, then use that value as input
47	Get Histogram of LMWs	./10.LexWords/inflVars.data.lmw.wc	./10.LexWords/inflVars.data.lmw.wc/minWc-maxWc.incWc.his.csv	Run this once to get the max. WC, then use that value as input
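
The validity checks of step 2 and the sub-term exclusion of step 3 in the table above can be sketched as follows. This is an illustrative approximation only, not the code in `GetAcronymFromParAcrFile.java` or the step 3 program: all function names are hypothetical, the plural rule (a trailing lowercase "s" after an uppercase acronym) is an assumption, and a sub-term is assumed to mean a contiguous, whole-word sub-sequence compared case-insensitively.

```python
def singular_acronym(acronym):
    """Map a plural form such as 'RCTs' to 'RCT'; leave other forms unchanged."""
    if len(acronym) > 2 and acronym.endswith("s") and acronym[:-1].isupper():
        return acronym[:-1]
    return acronym


def is_valid_expansion(acronym, expansion):
    """Step 2 checks: the expansion is a multiword and its first and last
    word initials match the first and last characters of the acronym."""
    words = expansion.split()
    if len(words) < 2:
        return False
    acr = singular_acronym(acronym).lower()
    return words[0][0].lower() == acr[0] and words[-1][0].lower() == acr[-1]


def exclude_subterms(pairs):
    """Step 3: drop an expansion that is a sub-term of another expansion
    recorded for the same acronym. `pairs` is a list of (acronym, expansion)."""
    kept = []
    for acronym, expansion in pairs:
        padded = f" {expansion.lower()} "  # pad so only whole words match
        is_subterm = any(
            other_acr == acronym
            and other_exp.lower() != expansion.lower()
            and padded in f" {other_exp.lower()} "
            for other_acr, other_exp in pairs
        )
        if not is_subterm:
            kept.append((acronym, expansion))
    return kept
```

With these assumptions, `is_valid_expansion("WMFT", "Wolf Motor Function Test")` is true, while `is_valid_expansion("WMFT", "Motor Function Test")` is false because the first initial does not match. Passing these checks does not make an expansion a valid LMW: "good laboratory practice" passes for "GLP" but is still tagged invalid in the Lexicon.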

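The auto-tagging of step 30 and the precision, recall, and F1 calculation of step 31 can be sketched as below. This is a hedged illustration under stated assumptions rather than the logic of `TagCandidateFile.java` or `GetPRF.java`: the function names are hypothetical, the gold standard is assumed to be a mapping from term to a 'y' or 'n' tag, and precision and recall use the standard definitions over the terms a filter retains.

```python
def auto_tag(term, valid_terms, invalid_terms):
    """Tag a candidate: 'y' if it is in the valid (Lexicon) set,
    'n' if it is in the known-invalid set, 'tbd' otherwise."""
    if term in valid_terms:
        return "y"
    if term in invalid_terms:
        return "n"
    return "tbd"


def precision_recall_f1(gold, retained):
    """Compute (precision, recall, F1) for the retained terms.

    gold: dict mapping term -> 'y' or 'n' (the yesNo gold standard).
    retained: iterable of terms kept by the model under test.
    """
    retained = set(retained)
    true_pos = sum(1 for term in retained if gold.get(term) == "y")
    false_pos = sum(1 for term in retained if gold.get(term) == "n")
    gold_pos = sum(1 for tag in gold.values() if tag == "y")
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

Terms tagged 'tbd' are left out of the gold standard in this sketch, which mirrors the use of the yesNo file as the gold standard in step 30.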