SPECIALIST Lexicon

Inclusive Filter: N-Gram with CUIs

I. Introduction

A LexMultiWord (LMW) must have a meaning (concept). If an n-gram has a concept (CUI) in the Metathesaurus, it is a good LMW candidate.

II. Procedure
The following procedure finds valid multiwords among n-grams that have a Metathesaurus CUI:

Dir: ${MEDLINE_WORDS}/bin

Program:

shell> 09.MatcherCui ${YEAR}

Step	Description	Inputs	Outputs	Notes
Get CUI stats on Lexicon
1	Add CUI to Lexicon (InflVars) `shell>cd ${STMT_DIR}/bin` `smt.AA` Time: 1 hr. 10 min.	${IN_DIR}inflVars.data.f1	inflVars.data.cui	Use smt to get CUIs for all terms from inflVars.data
2	Analyze and get stats of CUI in Lexicon `AnalyzeCuiMapping.java`	inflVars.data.cui	inflVars.data.cui.rpt	Get stats for single words and multiwords with CUIs
Get CUI stats on MEDLINE n-gram set
10	Add CUI to nGram `shell>cd ${STMT_DIR}/bin` `smt.AA` Time: 16 hr.	distilledNGram.${YEAR}.core.lc =>run `06.NGramUtil ${YEAR}` `11`	distilledNGram.${YEAR}.core.f2 distilledNGram.${YEAR}.core.cui	Get terms from distilled nGrams (field 2); use smt to get CUIs for all terms from nGrams
11	Filter out nGrams without CUI `FilterCuiFromFile.java`	distilledNGram.${YEAR}.core distilledNGram.${YEAR}.core.cui	distilledNGram.${YEAR}.core.cui.out	Filter out nGrams without CUIs (including 1,2,3 substitutions)
12	Tag results of step-11 `TagCuiTerm.java`	distilledNGram.${YEAR}.core.cui.out Max. WC (2000000) Min. WC (0) inflVars.data.${YEAR} inflVars.data.current notMwFromCuiTerm.data.${YEAR} notMwFromCuiTerm.data.current	distilledNGram.${YEAR}.core.lc.cui.out.stats (the stats between init year and current) distilledNGram.${YEAR}.core.lc.cui.out.${YEAR}.tag.${MIN_WC}-${MAX_WC} distilledNGram.${YEAR}.core.lc.cui.out.${YEAR}.tbd.${MIN_WC}-${MAX_WC} distilledNGram.${YEAR}.core.lc.cui.out.current.tag.${MIN_WC}-${MAX_WC} distilledNGram.${YEAR}.core.lc.cui.out.current.tbd.${MIN_WC}-${MAX_WC}	Tag and calculate precision: send distilledNGram.${YEAR}.core.cui.out.current.tbd.${MIN_WC}-${MAX_WC} to a linguist to tag yes, no, or exp; add valid MWs to the Lexicon; update files from the "yes" and "no" tag results; update inflVars.data.current from the Lexicon; update notMwFromCuiTerm.data.current from the no-tags; rerun this step until current.tbd is 0; check precision
20	Apply Matcher-Cui on nGram	TBD	distilledNGram.${YEAR}.core
Process: Generate multiword candidates from the distilled n-gram set
30	Proc: Apply filter of Lexicon on Distilled nGram (core)	${N_GRAM}/distilledNGram.${YEAR}.core ${IN_DIR}/inflVars.data	30.disNGram.Core.lexicon.out	Use core-term of n-gram
31	Proc: Apply matcher of Multiword on nGram (core) from results of Step 30	30.disNGram.Core.lexicon.out	31.disNGram.Core.lexicon.multiword.out => Core multiwords from n-grams, not in the Lexicon	Remove single words; this is used for ML models
32-0	PreProc: Get unique English String from UMLS - MRCONSO.RRF.ENG	MRCONSO.RRF.ENG => link to MRCONSO.RRF.ENG.${YEAR}AA/AB	umlsStr.data	Preprocess to get English UMLS String for step 32
32	Proc: Apply matcher of UMLS-Str on distilled nGram (must run 31)	30.disNGram.Core.lexicon.out umlsStr.data	32.disNGram.Core.umlsStr.out =>n-grams with CUIs (UMLS-String)	A simple hashTable lookup to match n-gram to UMLS String
33	Proc: Apply matcher of Multiword on nGram (core) from results of Step 32	32.disNGram.Core.umlsStr.out	33.disNGram.Core.multiword.out => Core multiwords from n-grams with CUIs	Remove single words; this is used for ML models
34	Proc: Apply matcher of EndWord (top 33) on nGram (core)	33.disNGram.Core.multiword.out endWords.top33.data => Must run 10.MatchEndWord first option 1, to get the top endword list => Manually create endWords.top${NN}.data => link endWords.top.data (used for top endWords)	34.disNGram.Core.endword.out => Candidate with CUI and top endwords	Use the top endWord for matcher
Post-Process: Auto remove, tag, and re-sort to final format
35	Post-Proc: Auto remove candidates already in the Lexicon, and remove/tag candidates that are invalid LMWs based on previous tags	34.disNGram.Core.endword.out Use 00.CandidateList => File updates are required; proceed with Steps 1-3 validFile: ${CAND_DIR}/0.LexiconInflVars/inflVars.data.current.1.uSort invalidFile: ${CAND_DIR}/totalTerms.all.lmw.no	35.disNGram.Core.endword.out.autoTag 35.disNGram.Core.endword.out.rmYesNo 35.disNGram.Core.endword.out.rmYesTagNo	Filter and tag candidates that are in the previous years' lists; make sure to update the data and re-run 00.CandidateList for the latest updates
36	Post-Proc: Rearrange (re-sort) canList by grouping singulars/plurals	35.disNGram.Core.endword.out.rmYesNo 35.disNGram.Core.endword.out.rmYesTagNo	36.disNGram.Core.endword.out.rmYesNo.gsp => cp to 36.disNGram.Core.endword.out.rmYesNo.gsp.${YEAR} 36.disNGram.Core.endword.out.rmYesTagNo.gsp => cp to 36.disNGram.Core.endword.out.rmYesTagNo.gsp.${YEAR} => This file is used for annual candidate list	Re-sort and group to put singulars and plurals together
Future Usage
40	PreProc: Get nGram spVar from result of 8.MatcherSpVar	medline.2.byM2CES.2.out.30.spVars.2016	nGramSpVars.data	Get the n-grams that match spVar patterns
41	Proc: Apply matcher of nGram-SpVar on nGram	34.disNGram.Core.endword.out nGramSpVars.data	36.disNGram.Core.spVar	Loses recall; not used for now
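
Steps 32 and 33 in the table above describe a hash-table lookup against UMLS strings followed by removal of single words. The following is a minimal Java sketch of that idea, not the project's actual code: the class `UmlsStrMatcherSketch` and its methods are hypothetical names, in-memory lists stand in for `umlsStr.data` and the n-gram file, and it assumes that matching ignores case and that a multiword is any term containing whitespace.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of Steps 32-33; names and file handling are assumptions.
class UmlsStrMatcherSketch {
    // Assumption: matching ignores case and surrounding whitespace.
    static String normalize(String term) {
        return term.trim().toLowerCase(Locale.ROOT);
    }

    // Build the lookup set from UMLS strings (stand-in for umlsStr.data).
    static Set<String> buildUmlsSet(List<String> umlsStrings) {
        Set<String> set = new HashSet<>();
        for (String s : umlsStrings) {
            set.add(normalize(s));
        }
        return set;
    }

    // Step 32 analogue: keep only n-grams found in the UMLS string set.
    static List<String> matchUmlsStr(List<String> nGrams, Set<String> umlsSet) {
        List<String> out = new ArrayList<>();
        for (String g : nGrams) {
            if (umlsSet.contains(normalize(g))) {
                out.add(g);
            }
        }
        return out;
    }

    // Step 33 analogue: drop single words.
    // Assumption: a multiword is a term with more than one whitespace-separated token.
    static List<String> keepMultiwords(List<String> terms) {
        List<String> out = new ArrayList<>();
        for (String t : terms) {
            if (t.trim().split("\\s+").length > 1) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> umls = buildUmlsSet(List.of("Heart Attack", "aspirin", "blood pressure"));
        List<String> nGrams = List.of("heart attack", "aspirin", "of the", "blood pressure");
        List<String> withCui = matchUmlsStr(nGrams, umls);
        System.out.println(keepMultiwords(withCui)); // prints [heart attack, blood pressure]
    }
}
```

A hash set keeps each lookup at roughly constant time, which matters when the n-gram set is large.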

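Step 35 removes candidates that are already in the Lexicon and removes or tags candidates previously judged invalid. The sketch below illustrates one plausible reading of that step in Java. It is an assumption-laden illustration rather than the actual tool: the class `AutoTagSketch`, the tag values `yes`, `no`, and `tbd`, the `term|no` output format, and the interpretation of the `rmYesNo` and `rmYesTagNo` outputs are all inferred from the file names, and in-memory sets stand in for the valid and invalid term files.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of Step 35; tags and output formats are assumptions.
class AutoTagSketch {
    // Tag each candidate: "yes" if already valid (in the Lexicon),
    // "no" if previously tagged invalid, "tbd" otherwise.
    static Map<String, String> autoTag(List<String> candidates,
                                       Set<String> validTerms,
                                       Set<String> invalidTerms) {
        Map<String, String> tagged = new LinkedHashMap<>(); // keeps input order
        for (String c : candidates) {
            String tag = validTerms.contains(c) ? "yes"
                       : invalidTerms.contains(c) ? "no"
                       : "tbd";
            tagged.put(c, tag);
        }
        return tagged;
    }

    // Assumed meaning of "rmYesNo": remove both yes- and no-tagged candidates.
    static List<String> rmYesNo(Map<String, String> tagged) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : tagged.entrySet()) {
            if (e.getValue().equals("tbd")) {
                out.add(e.getKey());
            }
        }
        return out;
    }

    // Assumed meaning of "rmYesTagNo": remove yes-tagged candidates and
    // keep no-tagged ones with a visible tag (format "term|no" is made up).
    static List<String> rmYesTagNo(Map<String, String> tagged) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : tagged.entrySet()) {
            if (e.getValue().equals("no")) {
                out.add(e.getKey() + "|no");
            } else if (e.getValue().equals("tbd")) {
                out.add(e.getKey());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> tagged = autoTag(
            List.of("heart attack", "in the", "new term"),
            Set.of("heart attack"),
            Set.of("in the"));
        System.out.println(rmYesNo(tagged));
        System.out.println(rmYesTagNo(tagged));
    }
}
```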