SPECIALIST Lexicon

Multiword Candidates Routine Generation Procedures

In addition to user-requested terms and sources, LSG also follows routine procedures to generate multiword candidate lists for building the Lexicon:

Sources for Candidate Lists
Programs
Results

I. Expansions of abbreviations and acronyms in Lexicon Annual Releases

During the Lexicon annual release process, some expansions of abbreviations and acronyms are not in the Lexicon (as base forms). These are potential multiword candidates.
This process (before 2019) is detailed at Lexicon Validation - 2.13.newEui.abb/acr.tagged.txt.y.${YEAR}
For the 2020+ releases, this process was enhanced and is done before the Lexicon release. The resulting list is sent to linguists to tag [C|Y|EUI]. For details, please see Generate LMW candidates from Lexicon - Abb/Acr expansion ${YEAR}

The general guidelines for deciding whether these expansions are LMWs are as follows. A valid LMW must have:

- a POS
- morphology rules
- a specific meaning of its own
- word order

Examples that should be excluded from LMWs:

Invalid LMWs	Notes
cause of death|COD|E0453760; condition on discharge|COD|E0453760	Not a single POS, per the "law(s) of articulation" restriction: a noun with a postmodifying prepositional phrase is not a single NP, so it cannot be a LexBuild base
dead on arrival; acquired immune deficiency	"law(s) of articulation" restriction; no morphology rules
1-oleoyl-2-acetyl-sn-glycerol|OAG|E0698010; 4-phenyl-2,3-dioxo-4-butenolide|PDOB|noun	No morphology rules; chemical names that are more like formulas than like words
acquired immunodeficiency syndrome test|AIDS test|E0776477	We have also declined to make LexBuild records for names of studies, considering them to be too ephemeral as terms. If those studies have acronyms or abbreviations, the study names can appear as expansions in those records.
portacaval shunted|PCS|E0700541	Not a valid base form

Note that the restriction on complex NPs (law of articulation) can be overridden if a term is a true compound, with its own meaning apart from the constituent NP + PP. For instance, “tug of war” is considered a valid LMW, since it has a definition that could not be inferred from the combination of meanings tug + of + war alone. It also undergoes pluralization as a unit – [tug of war]s and not [tug]s of war[s].

Example files:

Lexicon Release	Candidate Files	Status
2015	2.2.4.newEuis.abb.tagged.txt.y.2015 2.2.4.newEuis.acr.tagged.txt.y.2015	Done
2016	2.4.13.newEuis.abb.tagged.txt.y.2016 2.2.4.newEuis.acr.tagged.txt.y.2016	Done
2017	2.13.newEui.abb.tagged.txt.y.2017 2.13.newEui.acr.tagged.txt.y.2017	Done
2018	2.13.newEui.abb.tagged.txt.y.2018 2.13.newEui.acr.tagged.txt.y.2018	Done
2019	2.13.newEui.abb.tagged.txt.y.2019 2.13.newEui.acr.tagged.txt.y.2019	Done
2020	abbAcrExpansions.data.cand.2020	Done
2021	abbAcrExpansions.data.cand.2021	Done
2022	abbAcrExpansions.data.cand.2022	Done

II. (ACR) Pattern Matcher from the MNS

The expansions of acronyms in the latest Medline N-gram Set (MNS) are another good source of multiword candidates.

Many acronym expansions follow a common pattern in Medline: acronym expansion (ACRONYM)

Examples:

Example	Notes
zona pellucida (ZP)	E0216465
major hydrophilic region (MHR)	E0760361
diabetic foot syndrome (DFS)	E0564279
diabetic foot ulcer (DFU)	E0715662
years lived with disability (YLD)	Invalid LMW: no specific meaning, no POS
persons who stutter (PWS)	Invalid LMW: no POS, no morphology rules
violence against women (VAM)	Invalid LMW: no POS, no morphology rules
zero-point energy (ZPE)	Invalid LMW: low frequency
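
The pattern above can be sketched as a regular expression plus an initial-letter check. This is only an illustrative sketch, not the actual ACR matcher: the function names, the regular expression, and the rule that acronym letters must appear in order among the word initials are all assumptions made for this example.

```python
import re

# Hypothetical pattern: some expansion text followed by "(ACRONYM)" in uppercase.
ACR_PATTERN = re.compile(r"^(?P<expansion>.+?)\s+\((?P<acronym>[A-Z]{2,})\)$")


def initials_cover(expansion, acronym):
    """True if the acronym letters appear, in order, among the word initials."""
    initials = [w[0].lower() for w in re.split(r"[\s-]+", expansion) if w]
    remaining = iter(initials)
    # "ch in remaining" consumes the iterator, so letter order is enforced.
    return all(ch in remaining for ch in acronym.lower())


def match_acr_expansion(ngram):
    """Return (expansion, acronym) for 'expansion (ACRONYM)' n-grams, else None."""
    m = ACR_PATTERN.match(ngram.strip())
    if m is None:
        return None
    expansion, acronym = m.group("expansion"), m.group("acronym")
    if not initials_cover(expansion, acronym):
        return None
    return expansion, acronym
```

A match only means the n-gram fits the surface pattern. Whether the expansion is a valid LMW (POS, morphology rules, specific meaning) still has to be decided by tagging, as the invalid examples in the table show.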

Typically, the release of the Medline N-gram Set (MNS) is 3 to 9 months behind the release of the Lexicon.

Matcher (ACR): Steps 1-3 (pre_process), Steps 4-5 (Process)
Tag [y|n] for [valid|invalid] LMWs
- [y]: includes base forms, spVars, and inflectional variants, ignoring case
  => Add to Lexicon
- No need to tag valid expansions
Use 00.CandidateList to:
- Remove candidates that are already in the Lexicon
- Tag invalid LMWs (AUTO_N) based on previous tagging
- Monitor AUTO_N

Example files:

Distilled MEDLINE nGram Set	Candidate Files	Status	Notes
2015	acronymExp.tag.data.tag.final.tbd.2015	Done	Tag [y|n]
2016	acronymExp.tag.data.tag.final.tbd.2016	Done	Tag [y|n]
2017	acronymExp.tag.data.tag.final.tbd.2017	Done	Tag [y|n]
2018	acronymExp.tag.data.tag.final.tbd.2018	Done	Tag [y|n]; monitor AUTO_N
2019	acronymExp.tag.data.tag.final.tbd.2019.Used.rmYesNo	Done	No tag; auto-monitor AUTO_N
2020	acronymExp.tag.data.tag.final.tbd.2020.used.rmYesNo	Done	No tag; auto-monitor AUTO_N
2021	acronymExp.tag.data.tag.final.tbd.2021.used.rmYesNo	Sent-tagging	No tag; auto-monitor AUTO_N

III. CUI-Endword Matcher from the DMNS

N-grams from the latest Distilled Medline N-gram Set that are:

core-term, lowercase
Filter: exclude terms that are in the Lexicon
Matcher: pass terms that directly match a UMLS string (UMLS-Str, field 15), because such strings have a CUI
Filter: exclude single words
Matcher: pass terms that match the top (33+) endWords from the Lexicon
- top endWords are: syndrome, protein, disease, proteins, cell, etc.
Pre-process: 06.NGramUtil: Steps 20-21 (core.lc)
Pre-process: Matcher EndWord: Step 1 (10.MatcherEndWord)
- flds 1 EndWord.1.analysis.stats > EndWord.1.analysis.stats.1
- Get the top N end words (endWords.top.data.${YEAR})
Process: Matcher CUI: Steps 30-33 (09.MatcherCui)
Process: Matcher CUI: Step 34 (09.MatcherCui)
Use Step 35/36 to rearrange the candidate list by grouping singulars and plurals together
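
The filter and matcher chain above can be sketched as follows. This is a minimal illustration under stated assumptions, not the 09.MatcherCui or 10.MatcherEndWord programs: the function names and the in-memory sets are hypothetical, the input n-grams are assumed to be already normalized to core terms, and the UMLS strings are assumed to have been loaded into a set beforehand.

```python
from collections import Counter


def top_end_words(lexicon_multiwords, n):
    """Count the last word of each Lexicon multiword and keep the n most frequent."""
    counts = Counter(term.split()[-1] for term in lexicon_multiwords if " " in term)
    return {word for word, _ in counts.most_common(n)}


def cui_endword_candidates(ngrams, lexicon_terms, umls_strings, end_words):
    """Apply the filters and matchers in the order listed above."""
    candidates = []
    for term in ngrams:
        if term != term.lower():
            continue  # keep lowercase n-grams only
        if term in lexicon_terms:
            continue  # filter: already in the Lexicon
        if term not in umls_strings:
            continue  # matcher: must directly match a UMLS string (has a CUI)
        words = term.split()
        if len(words) < 2:
            continue  # filter: exclude single words
        if words[-1] not in end_words:
            continue  # matcher: last word must be a top endWord
        candidates.append(term)
    return candidates
```

Raising `n` widens the endWord matcher, which corresponds to the growth from the top 33 to the top 85 endWords recorded in the table of example files.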

Example files:

Distilled MEDLINE nGram Set	Candidate Files	Status	Notes
2016	35.disNGram.Core.endword.out.gsp.2016	Done	Top 33 endWords
2017	36.disNGram.Core.endword.out.rmYesTagNo.gsp.2017	Done	Top 43 endWords
2018	36.disNGram.Core.endword.out.rmYesTagNo.gsp.2018	Done	Top 51 endWords
2019	36.disNGram.Core.endword.out.rmYesTagNo.gsp.2019	Done	Top 57 endWords
2020	36.disNGram.Core.endword.out.rmYesTagNo.gsp.2020	Done	Top 80 endWords
2021	36.disNGram.Core.endword.out.rmYesTagNo.gsp.2021	TBD	Top 85 endWords

TBD: In the future, use all high-frequency n-grams without the endWord Matcher.
=> That is, use 33.disNGram.Core.multiword.out. However, this does not seem to have high precision. It might be used together with the SpVars model or a deep learning model.

IV. SpVar Pattern Matcher With Frequency from the DMNS

N-grams that match SpVar patterns are another good source of multiword candidates. More than 10 SpVar types were developed to identify spVars in a given corpus.
For example, if the terms
bloodpressure
blood pressure
blood-pressure
tradeoff
trade off
trade-off
are in a corpus and match the spVar types (SVT_SPACE|SVT_PUNC_DASH) in the spVar model, they are good candidates for LMWs.
A frequency filter (WC) is added to this list for frequency analysis:
Matcher SpVar: Steps 60-61A (08.MatcherSpVar)
Some candidates are automatically tagged [AUTO_YES|AUTO_NO]
The highest-frequency strategy should be applied
Not as productive as expected; not used from 2016 onward.
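
The idea of grouping spelling variants by the SVT_SPACE and SVT_PUNC_DASH types, then applying a frequency filter and a highest-frequency choice, can be sketched as below. This is an assumption-laden illustration rather than the 08.MatcherSpVar program: the function names and the `min_wc` parameter are hypothetical, and only two of the spVar types are modeled.

```python
from collections import defaultdict


def spvar_key(term):
    """Normalize away the space and dash differences between variants."""
    return term.lower().replace("-", "").replace(" ", "")


def spvar_groups(ngram_counts, min_wc=1):
    """Group n-grams that differ only by spaces or dashes.

    Returns {key: (highest-frequency variant, sorted list of variants)}
    for keys with more than one variant at or above the word count filter.
    """
    groups = defaultdict(dict)
    for term, wc in ngram_counts.items():
        if wc >= min_wc:  # frequency filter (WC)
            groups[spvar_key(term)][term] = wc
    result = {}
    for key, variants in groups.items():
        if len(variants) > 1:
            best = max(variants, key=variants.get)  # highest-frequency strategy
            result[key] = (best, sorted(variants))
    return result
```

With made-up counts such as `{"tradeoff": 300, "trade off": 120, "trade-off": 800}`, the three forms fall into one group and `trade-off` is selected as the highest-frequency variant.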

Example files:

Distilled MEDLINE nGram Set	Candidate Files	Status	Notes
2015	medline.2.byM2CES.2.out.30.spVars.cui.raw.100.0.500.tag.2015.can medline.2.byM2CES.2.out.30.spVars.cui.raw.1000.0.500.tag.2015.can medline.2.byM2CES.2.out.30.spVars.cui.raw.10000.0.500.tag.2015.can medline.2.byM2CES.2.out.30.spVars.cui.raw.100000.0.500.tag.2015.can medline.2.byM2CES.2.out.30.spVars.cui.raw.1000000.0.500.tag.2015.can	Done	Tag [Y|N]
2016+	N/A	Postponed due to limited resources

V. Other Patterns: Words from Lexicon + LexSynonym + DMNS
- LMWs: Multiwords from Lexicon
- Substitute subterms by LexSynonyms (1 or 2 substitutions)
- If also in DMNS (maybe plus matchers ...)
- Not in Lexicon
- To be implemented
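
Since this pattern is not yet implemented, the following is only one possible sketch of the steps listed above. Everything in it is an assumption: the function name, the treatment of subterms as single words, and the representation of LexSynonyms as a dictionary from a word to its synonyms.

```python
from itertools import combinations, product


def synonym_candidates(lmws, synonyms, dmns_terms, lexicon_terms, max_subs=2):
    """Substitute 1..max_subs words of each LMW by synonyms and keep
    results that occur in the DMNS but are not yet in the Lexicon."""
    found = set()
    for lmw in lmws:
        words = lmw.split()
        positions = [i for i, w in enumerate(words) if w in synonyms]
        for k in range(1, max_subs + 1):
            for combo in combinations(positions, k):
                # Try every combination of synonyms at the chosen positions.
                for repl in product(*(synonyms[words[i]] for i in combo)):
                    new_words = list(words)
                    for i, r in zip(combo, repl):
                        new_words[i] = r
                    cand = " ".join(new_words)
                    if cand in dmns_terms and cand not in lexicon_terms:
                        found.add(cand)
    return sorted(found)
```

The DMNS membership check keeps the substitution from producing strings that never occur in the literature. Additional matchers could be applied to the returned list in the same way as for the other candidate sources.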

Post-processes:

Before 2018
- Tagged Invalid LMW Candidates
  All invalid terms from the tagged candidate lists should be retrieved and saved to "invalidLmwList.out". These invalid LMW terms, as well as words already in the Lexicon, should be filtered out of the candidate list.
- Calculate precision
  - Total precision: (newYes + autoYes)/total candidates
  - New precision: (newYes)/new candidates
  - total candidates = autoYes + autoNo + newYes + newNo
  - autoYes (terms in the Lexicon) and autoNo (terms in the invalid LMW candidates) are tagged automatically
  - newYes and newNo are manually tagged by linguists
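
The two precision figures can be computed as in the short sketch below. The function name is hypothetical, and the sketch assumes that "new candidates" means newYes + newNo, which the list above implies but does not state outright.

```python
def precision_stats(auto_yes, auto_no, new_yes, new_no):
    """Return (total precision, new precision) for a tagged candidate list."""
    total = auto_yes + auto_no + new_yes + new_no
    new_total = new_yes + new_no  # assumption: new candidates = newYes + newNo
    total_precision = (new_yes + auto_yes) / total if total else 0.0
    new_precision = new_yes / new_total if new_total else 0.0
    return total_precision, new_precision
```

For example, with the made-up counts autoYes = 40, autoNo = 10, newYes = 30, and newNo = 20, the total precision is 70/100 and the new precision is 30/50.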

2018 and later

A systematic post-process was implemented to filter valid and invalid LMWs from the candidate list:

I. Logic: use the latest Lexicon to tag valid|invalid LMWs
- Valid LMWs:
  - Collect all terms from the latest Lexicon (inflVars). These are valid LMWs.
  - Generate inflVars from LexBuild - postProcess
- Invalid LMWs:
  - Get all terms from the previous candidate lists (without the |AUTO_N tag)
  - Remove valid LMWs from the above
  - The rest are invalid LMWs
  - Tag invalid LMWs in the new candidate list as [AUTO_N]
    => These can be removed if all of them are still [n] after running through several candidate lists.
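
The logic above can be sketched in a few lines. This is not the 00.CandidateList program: the function names are hypothetical, previous candidate lines are assumed to have the form `term|tag`, "without the |AUTO_N tag" is read here as stripping the tag and keeping the term, and the valid LMW set is assumed to be lowercased inflVars.

```python
def build_invalid_lmws(previous_candidates, valid_lmws):
    """Terms seen in previous candidate lists that are not valid LMWs."""
    invalid = set()
    for line in previous_candidates:
        term = line.split("|")[0].strip().lower()  # drop any tag such as |AUTO_N
        if term and term not in valid_lmws:
            invalid.add(term)
    return invalid


def tag_candidates(new_candidates, valid_lmws, invalid_lmws):
    """Remove candidates already in the Lexicon and tag known invalid ones."""
    tagged = []
    for term in new_candidates:
        key = term.lower()
        if key in valid_lmws:
            continue  # already a valid LMW in the latest Lexicon
        if key in invalid_lmws:
            tagged.append(term + "|AUTO_N")  # previously judged invalid
        else:
            tagged.append(term)  # still needs manual tagging
    return tagged
```

Because the invalid set is rebuilt against the latest Lexicon each time, a term that was once rejected but has since been added to the Lexicon drops out of the invalid set automatically.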
II. Root Directory:
- Root directory: ${MULTIWORDS}/data/Candidates

III. Program:

Run: ${MULTIWORDS}/bin/00.CandidateList
This program needs to be run:
- once a candidate list is generated
- to remove candidates that are already in the Lexicon (inflVars.data)
- to tag (|AUTO_N) or remove candidates that were previously tagged as invalid LMWs (notBaseForm.data and notLmw.data)
- after the candidate list is tagged, to calculate the stats

Step	Description	Input	Output	Notes
1	Aggregate and analyze all previous LMW candidate files => This program analyzes the precision of the candidate lists (the share of candidates that are valid LMWs)	0.LexiconInflVars/inflVars.data.current; 1.LexiconAbbAcrExpansion/newEuis.a[bc][br].tagged.txt.y.20NN; 2.MNSMatcherParAcr/acronymExp.tag.data.tag.final.tbd.20NN; 3.DMNSMatcherCuiEndWor/disNGram.Core.endword.new.out.gsp.20NN; 4.DMNSMatcherSpVarWc/*	prevCand.data; prevCand.data.no (invalid LMWs); prevCand.data.yes (valid LMWs); prevCand.data.rpt (detailed stats report)	Must update: candidate lists whose tagging is completed; inflVars (link to the latest inflVars from LexBuild). Check the latest valid vs. invalid ratio.
2	Aggregate and analyze not baseForm/LMW from LexCheck/candidate files => This program analyzes the precision of invalid LMWs from notBaseForm.data and notLmw.data from the annual Lexicon tagging	5.LexCheckNotBaseFor/notBaseForm.data.${YEAR}; 6.LexCheckNotLmw/notLmw.data.${YEAR}	notBaseLmw.data; notBaseLmw.data.no (invalid LMWs); notBaseLmw.data.yes (valid LMWs); notBaseLmw.data.rpt (detailed stats report)	Must update: notBaseForm.data.${YEAR}; notLmw.data.${YEAR}; inflVars (link to the latest inflVars from LexBuild). Check the latest valid vs. invalid ratio.
3	Combine output files from steps 1 and 2 to get the total data set.	./prevCand.data; ./notBaseLmw.data	./totalData.data; ./totalData.data.yes; ./totalData.data.no	Must run steps 1 and 2. Check the latest valid vs. invalid ratio. Can be used as tagged data for a machine learning model.

10	Filter and tag valid/invalid LMWs for a candidate file	./0.LexiconInflVars/inflVars.data.current (valid LMW file); ./totalData.data.no (invalid LMW file); specify inFile.data and outFile.data	outFile.data	Must complete/update steps 1-3. Input the new candidate file (or link it to ./inFile.data).

20	Generate DL TtSet from valid/invalid LMWs candidate files
21	Generate DL TtSet from inflVars (valid) and invalid LMWs in n-grams ..

IV. Results: please see previous candidate lists
