SPECIALIST Lexicon

Matcher: Parenthetic Acronyms (ACR)

I. Introduction

Acronym expansions are good candidates for multiwords. For example, "Wolf Motor Function Test" is the expansion of the acronym "WMFT" and is also a valid multiword in the Lexicon (LexMultiWord, LMW). On the other hand, some acronym expansions are not LMWs: "GLP|good laboratory practice" and "SOC|sense of coherence" are valid expansions but not valid LMWs. More than 7,000 valid expansions of acronyms and abbreviations in the Lexicon are invalid LMWs.

If an n-gram contains a parenthetic acronym pattern (ACR), the term before the left parenthesis can be the expansion of the acronym. The parenthetic acronym pattern is an uppercase term enclosed in parentheses, (ACR). This pattern is used in exclusive filters to remove invalid terms. The expansion can also be used to generate multiword candidates.
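The pattern described above, together with the validity checks listed for Step 2 of the procedure, can be sketched in Java as follows. This is a minimal illustration and not the code of `GetParAcrFromNGram.java` or `GetAcronymFromParAcrFile.java`: the class name `ParAcrSketch`, the regular expression, and the assumption that the (ACR) pattern sits at the end of the n-gram are all hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; names and the regular expression are assumptions.
class ParAcrSketch {
    // Assumed pattern: expansion text followed by an uppercase acronym in
    // parentheses at the end of the n-gram, optionally pluralized as (ACRs).
    private static final Pattern PAR_ACR =
        Pattern.compile("^(.+?)\\s*\\(([A-Z]{2,})(s?)\\)$");

    /** Returns "ACR|expansion" when the n-gram passes the checks, else empty. */
    public static Optional<String> toAcronymExpansion(String nGram) {
        Matcher m = PAR_ACR.matcher(nGram.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        String expansion = m.group(1).trim();
        // Group 3 holds the plural "s"; dropping it yields the singular ACR.
        String acronym = m.group(2);
        return isValidExpansion(acronym, expansion)
            ? Optional.of(acronym + "|" + expansion)
            : Optional.empty();
    }

    /** Checks from Step 2: multiword, and first/last initials match (ignore case). */
    public static boolean isValidExpansion(String acronym, String expansion) {
        String[] words = expansion.trim().split("\\s+");
        if (words.length < 2) {
            return false; // the expansion must be a multiword
        }
        String acr = acronym.toLowerCase();
        char firstInitial = Character.toLowerCase(words[0].charAt(0));
        char lastInitial = Character.toLowerCase(words[words.length - 1].charAt(0));
        return firstInitial == acr.charAt(0)
            && lastInitial == acr.charAt(acr.length() - 1);
    }

    public static void main(String[] args) {
        System.out.println(toAcronymExpansion("Wolf Motor Function Test (WMFT)"));
        System.out.println(toAcronymExpansion("randomized controlled trials (RCTs)"));
        System.out.println(toAcronymExpansion("test (WMFT)"));
    }
}
```

Passing this check only makes an expansion a candidate. As noted above, a valid expansion such as "good laboratory practice" can still be an invalid LMW, which is why the later tagging steps are needed.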

II. Procedure
The following procedure finds valid multiwords from n-grams by using the parenthetic acronym pattern:

Dir: ${MEDLINE_WRODS}/bin

Program:

shell> 07.MatcherParAcr ${YEAR}

Step	Description	Inputs	Outputs	Notes
Get Candidates from acronym expansions
1	Get nGrams matching the pattern [acronym expansion (ACR)] `GetParAcrFromNGram.java`	${OUT_DATA}/02.NGram/nGrams/nGram.${YEAR}	n-gram.${YEAR}.parAcr.rpt n-gram.${YEAR}.parAcr.exp n-gram.${YEAR}.parAcr.pass n-gram.${YEAR}.parAcr.trap => matches (ACR)	Uses the raw nGram set (not distilled). Uses FilterParentheticAcronym to trap terms with the (ACR) pattern from the n-gram set
2	Get acronym\|acronym expansion from n-grams with the (ACR) pattern and identify illegal acronym expansions by these checks: (a) the initial characters of the first and last words of the expansion match the first and last characters of the acronym; (b) the word count of the acronym expansion is > 1; (c) the word count of the acronym expansion matches the number of characters of the acronym? (TBD) `GetAcronymFromParAcrFile.java`	n-gram.${YEAR}.parAcr.trap	acronymExp.pattern.acr => lines match acr acronymExp.pattern.notAcr => lines not match acr acronymExp.pattern => acronym\|acronym expansion, unified	Find the (ACR) or (ACRs) in the n-grams; get the ACR; get the singular form ACR from a plural acronym (ACRs); get the expansion; check that it is a valid acronym: the first and last initials of the expansion match the acronym (ignore case), and the word count of the expansion is > 1 (must be a multiword)
3	Filter out (exclude) sub-terms of expansions `ExcludeSubTermExpFromParAcrFil.java`	acronymExp.pattern	acronymExp.subterm.raw => candidates (ACR expansion) acronymExp.subterm.pass => candidates (ACR\|ACR expansion) acronymExp.subterm.trap => subterm of others, invalid	If an expansion is a sub-term of another expansion with the same acronym, it is not a valid expansion and should be excluded
Pre-process: core-term, valid/invalid files
4	Get core-term for candidates input file	acronymExp.subterm.raw	acronymExp.subterm.raw.core	Get lowercase core-term of candidates
5	Prepare valid/invalid MW files for auto-tagging	All known valid and invalid MWs should be updated: Initial data (${MULTIWORDS}/data/current/inData/), used as baseline to calculate precision and recall valid: 5.1: inflVars.data.${YEAR} => Use inflVars.data.current (link to the daily LexBuild backup) => known valid MW in Lexicon invalid (invalidMwForParAcr.data.${PREV_YEAR}): 5.2.1: notBaseForm.data.${PREV_YEAR} => link to ${LEX_CHECK}/data/Files/notBaseForm.data.${PREV_YEAR} => Known invalid ACR/ABB expansion from LexCheck 5.2.2: invalidMwFromParAcrTag.data.${PREV_YEAR} => invalid tag from (ACR) pattern from previous tagging result => none if it is the first time (no 2013 data for 2014) => Needs to be updated after completing tags in the previous year post-process by running through steps 7-9 Final current data (used to calculate precision and recall) valid: 5.3: inflVars.data.final => known valid MW in the latest Lexicon, a snapshot from the latest Lexicon. By default, it is automatically linked to the auto-generated file at 4:00 AM at ${BACKUP}/Routine.lexBuild/Lexicon/${YEAR}/InflVars. If a more recent version is needed, run ${LB_DIR}/Tools/LoadDb/GenScript (2). invalid (invalidMwForParAcr.data.final): 5.4.1: ${LEX_CHECK}/data/Files/notBaseForm.data.${YEAR} => Known invalid ACR/ABB expansion from LexCheck invalidMwFromParAcrTag.data.${YEAR} => invalid tag from (ACR) pattern from current tagging result => need to complete all TBD n-grams => Needs to be updated by running through steps 7-9 after completing tags for this year 5.4.2: invalidMwForParAcr.data.${YEAR} => ln -sf ./invalidMwForParAcr.data.${YEAR} invalidMwForParAcr.data.final => This file is not available when generating the candidate list	invalidMwForParAcr.data.${PREV_YEAR} invalidMwForParAcr.data.${YEAR} invalidMwForParAcr.data.final	File 5.4.2 is not available when generating the candidate list in Step 6. These files should be updated after completing annual tagging.
Process: Generate Candidates and Stats
6	Generate LMW candidate list for linguists: Tag candidate list (yes\|no\|tbd) Get stats and precision `TagAcronymExp.java` `${YEAR}`	6.1: candidates: acronymExp.subterm.raw.core valid MW (${MULTIWORDS}/data/current/inData/): 6.2.1: inflVars.data.${INIT_YEAR} => initial Lexicon year, default: 2014 6.2.2: inflVars.data.current => final (current): generate the latest version from LexBuild invalid MW (${MULTIWORDS}/data/current/inData/): invalidMwForParAcr.data.${PREV_YEAR} => initial: known invalid MW for ACR expansion 6.3.1: notBaseForm.data.${PREV_YEAR}.f1 => initial: not base form from LexCheck (only field 1) invalidMwFromParAcrTag.data.${PREV_YEAR} => initial: invalid MW form from ParAcr Tag (previous year) invalidMwForParAcr.data.${YEAR} => Final (current): known invalid MW for ACR expansion 6.3.2: notBaseForm.data.${YEAR}.f1 => Final: not base form from LexCheck (only field 1) invalidMwFromParAcrTag.data.${PREV_YEAR} => Final: invalid MW form from ParAcr Tag (current latest)	Tag results from initial Lexicon ${YEAR} acronymExp.tag.data.${YEAR} acronymExp.tag.data.${YEAR}.yes acronymExp.tag.data.${YEAR}.no acronymExp.tag.data.${YEAR}.tbd Tag results from final Lexicon (current) acronymExp.tag.data.final acronymExp.tag.data.final.yes acronymExp.tag.data.final.no acronymExp.tag.data.final.yesNo => = (acronymExp.tag.data.final.yes + acronymExp.tag.data.final.no) acronymExp.tag.data.final.tbd => Make sure to complete the previous year's ACR/ABB candidate list before generating this year's. acronymExp.tag.data.final.tbd should have the same number of lines as reported in the screen output from the program. This is the candidate list before the final filter and tag in step 6.1 Tag results of the new candidates (init-TBD) acronymExp.tag.data.tag.new.yesNo (with yes/no tag for further analysis) acronymExp.tag.data.tag.new.yes acronymExp.tag.data.tag.new.no acronymExp.tag.data.tag.new.tbd acronymExp.tag.data.stats => summary: stats and precision	Algorithm: Initial tag: tag candidates based on the data of ${YEAR}. Final tag: tag candidates based on the data of final. New candidates: TBD from the initial tag; if valid => added to Lexicon; if no => from LexCheck (notBaseForm.data) or tag results (invalidMwFromParAcrTag.data, continuously updated) => tagged result is at ${OUT_DATA}/Tagged. Others: tagged as TBD in the final set, sent to linguists. Repeat steps 6-9 until the number of "Final-TBD" is 0. Get the precision when "Final-TBD" is 0. Manually copy acronymExp.tag.data.final.tbd to acronymExp.tag.data.tag.final.tbd.${YEAR}.org
6.1	Auto remove valid and remove/tag invalid LMWs from candidates:	acronymExp.tag.data.final.tbd acronymExp.tag.data.tag.final.tbd.${YEAR}.org Auto remove LMWs in the Lexicon and tag invalid LMWs for the derived candidate list: Use ${MULTIWORDS}/bin/00.CandidateList Follow the instructions to: update the previous year ACR/ABB candidate list in step 1. update all candidate lists in step 1. update all invalid LMW files in step 2. `1 2 3` validFile: ${CAND_DIR}/0.LexiconInflVars/inflVars.data.current.1.uSort invalidFile: ${CAND_DIR}/totalTerms.all.lmw.no	./GenCandList/acronymExp.tag.data.tag.final.tbd.2019.org.1 ./GenCandList/acronymExp.tag.data.tag.final.tbd.2019.org.1.autoTag ./GenCandList/acronymExp.tag.data.tag.final.tbd.2019.org.1.rmYesNo => Manually cp to acronymExp.tag.data.tag.final.tbd.2019.used => This candidate list includes only new candidates, link to ${CANDIDATE}/2.MNSMatcherParAcr/acronymExp.tag.data.tag.final.tbd.2018 ./GenCandList/acronymExp.tag.data.tag.final.tbd.2019.org.1.rmYesTagNo => Manually cp to acronymExp.tag.data.tag.final.tbd.2019.rmYesTagNo => This is the candidate list sent to linguists for lexBuilding (no need to tag from 2018 on, after implementing candExceptions)	Algorithm: mkdir GenCandList cd GenCandList ln -sf ../acronymExp.tag.data.tag.final.tbd.${YEAR}.org . flds 1 acronymExp.tag.data.tag.final.tbd.${YEAR}.org > acronymExp.tag.data.tag.final.tbd.${YEAR}.org.1 Auto remove valid and tag invalid LMWs
Post-Tagging process: Update invalid MWs file
7	Validate linguist's tagged file (to get invalid LMWs) .. `CandidateUtil.GetInvalidMwFromTagFile` Get invalid tag from linguist's tag file `lexAccessLb -n -i:input -o:output` Check linguist's invalid tags by comparing to Lexicon	dir: ${DATA}/current/tagData/07.MatcherParAcr/ ./TagData/${YEAR}/ParAcr.data.${YEAR}.tag (link to the tagged candidate file) => Manually update the ParAcr.data.${YEAR}.tag file: get tagged files from linguists, convert to *.txt, append them to the end of the previous ParAcr.data.${PRE_YEAR}.tag, uSort (ParAcr.data.${PRE_YEAR}.uSort), link ./ParAcr.tagged.data to the new uSort file (./TagData/${YEAR}/..)	ParAcr.tagged.data.yes => ParAcr.tagged.data.yes.out ParAcr.tagged.data.no => ParAcr.tagged.data.no.out	Steps 7~9 are needed to update data after new tags are done! [y]: valid expansion and LMW in Lexicon [n]: invalid LMW (not in Lexicon), valid expansion [o]: invalid expansion (not used after 2016+) => converted to [n] in the program automatically [e]: valid expansion in Lexicon (not used after 2016+) => converted to [y] in the program automatically Follow the instructions from the screen result to find invalid (y\|n) tags, and send them to linguists for revision. This step is not used after 2016 due to limited resources. Instead of validating, we use the Lexicon to automatically assign the tags [y] and [n] in 7.1
7.1	Assign tag to ACR/ABB candidate file .. `CandidateUtil.AssignTagToTermFile` Update and assign tag [Y] and [N] to the ACR/ABB candidate file Run this until the acronymExp.tag.data.tag.final.tbd.${YEAR} is done (in the Lexicon)	dir: ${DATA}/current/tagData/07.MatcherParAcr/TagData/${YEAR} ParAcr.data.${YEAR} Manually generate ParAcr.data.${YEAR} `shell>cp -rp ../${PRE_YEAR}/ParAcr.data.${PRE_YEAR}.tag .` get acronymExp.tag.data.tag.final.tbd.${YEAR} combine the above two files: `cat ParAcr.data.${PRE_YEAR}.tag acronymExp.tag.data.tag.final.tbd.${YEAR} > ParAcr.data.${YEAR}` Make sure there are no duplicate candidates `sort -u ParAcr.data.${YEAR} > ParAcr.data.${YEAR}.uSort`	ParAcr.data.${YEAR}.tag ParAcr.data.${YEAR}.yes ParAcr.data.${YEAR}.no	Steps 7.1 ~ 9 are the post-process that should be done after the ACR/ABB candidate list is completed and before generating the next ACR/ABB candidate list
8	Get unique lowercased core-term for no-tag (invalidMw) file .. CandidateUtil.ToCoreTerm	dir: ${DATA}/current/tagData/07.MatcherParAcr/ ParAcr.tagged.data.no `ln -sf ./ParAcr.data.2017.no ./TagData/${YEAR}/ParAcr.tagged.data.no`	ParAcr.tagged.data.no.core	Get the core-term form of invalid MW from tag-file
9	Update invalid MWs file invalidMwFromParAcrTag.data.${YEAR} invalidMwFromParAcrTag.data.${PREV_YEAR} ParAcr.tagged.data.no.core invalidMwForParAcr.data.final => Link to invalidMwForParAcr.data.${YEAR} notBaseForm.data.${YEAR}.f1 invalidMwFromParAcrTag.data.${YEAR}	ParAcr.tagged.data.no.core invalidMwFromParAcrTag.data.${PREV_YEAR} notBaseForm.data.${YEAR}	invalidMwForParAcr.data.final invalidMwFromParAcrTag.data.${YEAR}	Update invalidMwForParAcr.data.final => Run steps 7.1-9 first, then go back to Step 5 and re-run Step 6 to generate candidates for ${YEAR}
Further Analysis on newly tagged candidates: pre-process
10	Add WC to tagged candidates `UpdateWcToTaggedCandidates.java`	acronymExp.tag.data.tag.new.yesNo ./2.NGram/nGrams/nGramSet.2014.30.core => Must run coreterm on n-gram set from option 10 of `6.NGramUtil`	acronymExp.tag.data.tag.new.yesNo.Wc	Update WC information from nGramSet.2014.30.core
11	Get CUIs for tagged candidates `UpdateCuiToTaggedCandidates.java`	acronymExp.tag.data.tag.new.yesNo.Wc SMT configuration file => ${STMT_DIR}data/Config/smt.properties => Use the latest installed STMT	acronymExp.tag.data.tag.new.yesNo.Wc.Cui	Update CUI information from SMT: 0: found CUIs with 0 subterm substitutions 1: found CUIs with 1 subterm substitution 2: found CUIs with 2 subterm substitutions 3: No CUI found within 3 substitutions
12	Tag Distilled for tagged candidates `UpdateDistToTaggedCandidates.java`	acronymExp.tag.data.tag.new.yesNo.Wc.Cui ./2.NGram/nGrams/distilledNGram.2014.core => Must run coreterm on distilledNGram set from option 11 of `6.NGramUtil`	acronymExp.tag.data.tag.new.yesNo.WcCui.Dist	Update Distilled information (true\|false) from nGramSet.2014.30.core
13	Tag SpVar for tagged candidates `UpdateSpVarToTaggedCandidates.java`	acronymExp.tag.data.tag.new.yesNo.Wc.Cui.Dist TBD ./08.Matcher/nGrams/distilledNGram.2014.core => TBD. Must run spVar `6.NGramUtil`	acronymExp.tag.data.tag.new.yesNo.WcCui.Dist.SpVar	Update SpVar information (true\|false) from TBD
Further Analysis on newly tagged candidates: process
15	Analyze precision, recall, and f1 on candidates	`CandidateUtil.AnalyzeCandidatePRF`	acronymExp.tag.data.tag.new.yesNo.rpt	Developed, not used for analysis yet
16	Analyze WC histogram on candidates	`CandidateUtil.AnalyzeCandidateHistogram`	acronymExp.tag.data.tag.new.yesNo.his	Developed, not used for analysis yet
17	Analyze WC histogram details (smaller range on lower frequency) on candidates	`CandidateUtil.AnalyzeCandidateHistogram`	acronymExp.tag.data.tag.new.yesNo.his.min-max.sec.csv	Developed, not used for analysis yet
Precision and Recall Analysis for AMIA full paper
30	Tag Matcher-(ACR): baseline to be used as gold standard Must get the latest inflVars.data from Lexicon Must run Step 7-9 first (for invalidMwForParAcr.data.final) `TagCandidateFile.java` auto-tag: [y]: if it is in Lexicon (inflVars.data) [n]: invalidMwForParAcr.data.final [tbd]: otherwise	inFile: acronymExp.subterm.raw.core validFile: inflVars.data.current invalidFile: invalidMwForParAcr.data.current	acronymExp.subterm.raw.core.tag.${YEAR} acronymExp.subterm.raw.core.tag.${YEAR}.no acronymExp.subterm.raw.core.tag.${YEAR}.tbd acronymExp.subterm.raw.core.tag.${YEAR}.yes acronymExp.subterm.raw.core.tag.${YEAR}.yesNo => Used as the gold standard for precision and recall	Must run step 7-9 first (to update invalidMwForParAcr.data.current) Must update the inflVars.data.current from Lexicon (approve all submit records)
31	Get precision, recall, F1 for Baseline (acronym expansion) `GetPRF.java` Must run step 30 first	goldStd: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo test: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1	Output (PRF) is on the screen	not goldStd No: must be 0; err tag No: must be 0. Must finish step 30
32	Tag (ACR) + Distilled set, PRF `CandidateUtil.ApplyDistToFile` `CandidateUtil.GetPRF`	in: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1 dist: nGrams/distilledNGram.${YEAR}.core goldStd: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo	acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.dist	Must finish steps 30, 31
33	Tag (ACR) + SpVar, PRF `CandidateUtil.ApplySpVarToFile` `CandidateUtil.GetPRF`	in: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1 spVar: Candidates/distilledNGram.2014.core.150.sort.term.spVars.latest goldStd: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo	acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.spVar	Must finish steps 30, 31
34	Tag (ACR) + CUI, PRF `CandidateUtil.ApplyCuiToFile` `CandidateUtil.GetPRF`	in: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1 smt: data/Config/smt.properties goldStd: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo	acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.cui	Must finish steps 30, 31
35	Tag (ACR) + EndWord, PRF `CandidateUtil.ApplyEndWordToFile` `CandidateUtil.GetPRF`	in: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1 endWord: inFilterEndWord.data.used goldStd: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo	acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.endWord	Must finish steps 30, 31
36	Tag (ACR) + CUI + SpVar, PRF `CandidateUtil.ApplySpVarToFile` `CandidateUtil.GetPRF`	in: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.cui spVar: Candidates/distilledNGram.2014.core.150.sort.term.spVars.latest goldStd: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo	acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.cui.spVar	Must finish steps 30, 31, 34
37	Tag (ACR) + CUI + SpVar + EndWord, PRF `CandidateUtil.ApplyEndWordToFile` `CandidateUtil.GetPRF`	in: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.cui.spVar endWord: inFilterEndWord.data.used goldStd: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo	acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.cui.spVar.endWord	Must finish steps 30, 31, 34, 36
Frequency (WC) Analysis on (ACR) for AMIA poster paper
40	Add WC to GoldStd `CandidateUtil.AddWcToTermTagFile`	in: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo ngram_wc: nGrams/nGramSet.${YEAR}.30.core	acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.wc	Must finish the n-gram core term step. Must finish step 30
41	Get Histogram of GoldStd `CandidateUtil.GetPRFHistogram`	acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.wc	acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.wc.minWc-maxWc.increment.prfHis.csv	Should run this one time to get the Max. WC, then use it as input
Frequency (WC) Analysis on LEXICON for AMIA poster paper
45	Add WC to LMWs and LSWs `CandidateUtil.GetSwMwFromLexicon` => get LMWs and LSWs from Lexicon (inflVars.data) `CandidateUtil.AddWcToTermFile` => Add WC to LSWs `CandidateUtil.AddWcToTermFile` => Add WC to LMWs	inflVars.data nGrams/nGramSet.${YEAR}.30.core	./10.LexWords/inflVars.data.lsw ./10.LexWords/inflVars.data.lmw ./10.LexWords/inflVars.data.lsw.wc ./10.LexWords/inflVars.data.lmw.wc	This data is used in Figure-1 WC spectrum: no. of terms vs. WC class
46	Get Histogram of LSWs `CandidateUtil.GetHistogram`	./10.LexWords/inflVars.data.lsw.wc	./10.LexWords/inflVars.data.lsw.wc/minWc-maxWc.incWc.his.csv	Should run this one time to get the Max. WC, then use it as input
47	Get Histogram of LMWs	./10.LexWords/inflVars.data.lmw.wc	./10.LexWords/inflVars.data.lmw.wc/minWc-maxWc.incWc.his.csv	Should run this one time to get the Max. WC, then use it as input
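
The auto-tagging rule of Step 30 ([y] if the term is in the Lexicon, [n] if it is in the invalid MW file, [tbd] otherwise) and the precision, recall, and F1 computation of Step 31 can be sketched in Java as follows. This is an illustrative sketch only and not the code of `TagCandidateFile.java` or `GetPRF.java`. The class and method names are hypothetical, and the scoring definition used here (a test term counts as a true positive when the gold standard tags it yes, and as a false positive when the gold standard tags it no) is an assumption.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; names and the PRF definition are assumptions.
class TagPrfSketch {
    public enum Tag { YES, NO, TBD }

    /** Step 30 rule: known valid MW => YES, known invalid MW => NO, else TBD. */
    public static Tag autoTag(String coreTerm, Set<String> validMw, Set<String> invalidMw) {
        if (validMw.contains(coreTerm)) {
            return Tag.YES;
        }
        if (invalidMw.contains(coreTerm)) {
            return Tag.NO;
        }
        return Tag.TBD;
    }

    /** Returns {precision, recall, f1} of a test set against a yes/no gold standard. */
    public static double[] prf(Map<String, Tag> goldStd, Set<String> test) {
        int goldYes = 0;
        for (Tag tag : goldStd.values()) {
            if (tag == Tag.YES) {
                goldYes++;
            }
        }
        int tp = 0;
        int fp = 0;
        for (String term : test) {
            Tag tag = goldStd.get(term);
            if (tag == Tag.YES) {
                tp++;
            } else if (tag == Tag.NO) {
                fp++;
            }
            // Terms missing from the gold standard are skipped in this sketch;
            // the notes for Step 31 require that count to be 0.
        }
        double precision = (tp + fp) == 0 ? 0.0 : (double) tp / (tp + fp);
        double recall = goldYes == 0 ? 0.0 : (double) tp / goldYes;
        double f1 = (precision + recall) == 0.0
            ? 0.0
            : 2 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f1};
    }

    public static void main(String[] args) {
        Set<String> valid = Set.of("wolf motor function test");
        Set<String> invalid = Set.of("good laboratory practice");
        System.out.println(autoTag("wolf motor function test", valid, invalid));
        System.out.println(autoTag("good laboratory practice", valid, invalid));
        System.out.println(autoTag("sense of humor", valid, invalid));
    }
}
```

In this reading, the baseline of Step 31 uses every gold-standard term as the test set, so its recall is 1 by construction, and the later filter steps trade recall for precision by shrinking the test set.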
