Step | Description | Inputs | Outputs | Notes
|
---|
Get CUI stats on Lexicon
|
---|
1 | Add CUI to Lexicon (InflVars)
shell>cd ${STMT_DIR}/bin
smt.AA
- Time: 1 hr. 10 min.
|
- ${IN_DIR}inflVars.data.f1
|
|
- Use smt to get CUIs for all terms from inflVars.data
|
2 | Analyze and get stats of CUI in Lexicon
|
|
|
- Get stats for single words|multiwords with CUIs
|
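As a rough illustration of Step 2, the sketch below counts how many single-word and multiword terms have at least one CUI. The input layout (one `term|CUI` pair per line, with an empty CUI field when no CUI was found), the function name `cui_stats`, and the placeholder CUIs are all assumptions made for illustration; they are not the actual smt output format.

```python
from collections import Counter


def cui_stats(lines):
    """Count single-word and multiword terms, with and without a CUI.

    Assumes each line is 'term|CUI' and that a term may repeat once per CUI;
    an empty CUI field means no CUI was found (hypothetical layout).
    """
    has_cui = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        term, _, cui = line.partition("|")
        term = term.strip()
        # A term counts as "with CUI" if any of its lines carries one.
        has_cui[term] = has_cui.get(term, False) or bool(cui.strip())

    stats = Counter()
    for term, found in has_cui.items():
        kind = "multiword" if len(term.split()) > 1 else "single"
        stats[(kind, "cui" if found else "no_cui")] += 1
    return stats


# Placeholder records; the CUIs are made up for the example.
sample = [
    "heart attack|C0000001",
    "aspirin|C0000002",
    "aspirin|C0000003",
    "in vitro|",
    "xyzzy|",
]
stats = cui_stats(sample)
```

Keying the counter on `(kind, status)` keeps the single-word and multiword totals separate, which is the split Step 2 asks for.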
Get CUI stats on MEDLINE n-gram set
|
---|
10 | Add CUI to nGram
shell>cd ${STMT_DIR}/bin
smt.AA
- Time: 16 hr.
|
- distilledNGram.${YEAR}.core.lc
=>run 06.NGramUtil ${YEAR}
11
|
- distilledNGram.${YEAR}.core.f2
- distilledNGram.${YEAR}.core.cui
|
- get term from distilled nGrams (field 2)
- Use smt to get CUIs for all terms from nGrams
|
11 | Filter out nGrams without CUI
|
- distilledNGram.${YEAR}.core
- distilledNGram.${YEAR}.core.cui
|
- distilledNGram.${YEAR}.core.cui.out
|
- Filter out nGrams without CUIs (including 1,2,3 substitutions)
|
12 | Tag results of step-11
|
- distilledNGram.${YEAR}.core.cui.out
- Max. WC (2000000)
- Min. WC (0)
- inflVars.data.${YEAR}
- inflVars.data.current
- notMwFromCuiTerm.data.${YEAR}
- notMwFromCuiTerm.data.current
|
- distilledNGram.${YEAR}.core.lc.cui.out.stats (the stats between init year and current)
- distilledNGram.${YEAR}.core.lc.cui.out.${YEAR}.tag.${MIN_WC}-${MAX_WC}
- distilledNGram.${YEAR}.core.lc.cui.out.${YEAR}.tbd.${MIN_WC}-${MAX_WC}
- distilledNGram.${YEAR}.core.lc.cui.out.current.tag.${MIN_WC}-${MAX_WC}
- distilledNGram.${YEAR}.core.lc.cui.out.current.tbd.${MIN_WC}-${MAX_WC}
| Tag and calculate precision:
- Send distilledNGram.${YEAR}.core.cui.out.current.tbd.${MIN_WC}-${MAX_WC} to linguist:
- tag yes|no|exp
- Add valid MW to Lexicon
- Update files from tag result of "yes|no"
- Update inflVars.data.current from Lexicon
- Update notMwFromCuiTerm.data.current from no-tag
- rerun this step until current.tbd is 0
- Check precision
|
20 | Apply Matcher-Cui on nGram
| TBD
| - distilledNGram.${YEAR}.core
|
|
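The filter in Step 11 can be pictured as a set-membership test: collect the terms that received a CUI, then keep only the n-gram records whose term is in that set. The sketch below is a minimal version under stated assumptions: the `.cui` file is taken to hold `term|CUI` lines, the n-gram file is taken to be pipe-delimited with the term in field 2 (as the Step 10 note suggests), and the function names are hypothetical. It does not cover the "1,2,3 substitutions" mentioned in Step 11.

```python
def load_terms_with_cui(cui_lines):
    """Return the set of lowercased terms that have a non-empty CUI."""
    terms = set()
    for line in cui_lines:
        term, _, cui = line.strip().partition("|")
        if term and cui.strip():
            terms.add(term.strip().lower())
    return terms


def filter_ngrams(ngram_lines, terms_with_cui, term_field=1):
    """Keep n-gram records whose term (field 2, index 1) has a CUI."""
    kept = []
    for line in ngram_lines:
        record = line.rstrip("\n")
        fields = record.split("|")
        if len(fields) <= term_field:
            continue  # skip malformed records
        if fields[term_field].strip().lower() in terms_with_cui:
            kept.append(record)
    return kept


# Placeholder data; counts and CUIs are made up for the example.
cui_lines = ["heart attack|C0000001", "aspirin|C0000002", "of the|"]
ngram_lines = ["120|heart attack", "900|of the", "45|aspirin", "7|foo bar"]

kept = filter_ngrams(ngram_lines, load_terms_with_cui(cui_lines))
```

Here `kept` holds only the records for "heart attack" and "aspirin", in their original order.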
Process: Generate multiword candidates from the distilled n-gram set
|
---|
30 | Proc: Apply filter of Lexicon on Distilled nGram (core)
| - ${N_GRAM}/distilledNGram.${YEAR}.core
- ${IN_DIR}/inflVars.data
| - 30.disNGram.Core.lexicon.out
|
|
31 | Proc: Apply matcher of Multiword on nGram (core) from results of Step 30
| - 30.disNGram.Core.lexicon.out
- 31.disNGram.Core.lexicon.multiword.out
=> Core multiwords from n-grams, not in the Lexicon
- Remove single word
- This is used for ML models
| 32-0 | PreProc: Get unique English String from UMLS - MRCONSO.RRF.ENG
| - MRCONSO.RRF.ENG
=> link to MRCONSO.RRF.ENG.${YEAR}AA/AB
|
| - Preprocess to get English UMLS String for step 32
32 | Proc: Apply matcher of UMLS-Str on distilled nGram (must run 31)
| - 30.disNGram.Core.lexicon.out
- umlsStr.data
- 32.disNGram.Core.umlsStr.out
=>n-grams with CUIs (UMLS-String)
| - A simple hashTable lookup to match n-gram to UMLS String
33 | Proc: Apply matcher of Multiword on nGram (core) from results of Step 32
| - 32.disNGram.Core.umlsStr.out
- 33.disNGram.Core.multiword.out
=> Core multiwords from n-grams with CUIs
- Remove single word
- This is used for ML models
| 34 | Proc: Apply matcher of EndWord (top 33) on nGram (core)
| - 33.disNGram.Core.multiword.out
- endWords.top33.data
=> Must run 10.MatchEndWord first
option 1, to get the top endword list
=> Manually create endWords.top${NN}.data
=> link endWords.top.data (used for top endWords)
- 34.disNGram.Core.endword.out
=> Candidate with CUI and top endwords
- Use the top endWord for matcher
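
Steps 32 to 34 chain three simple matchers. A minimal sketch, assuming plain term lists as input: a hash-based lookup against UMLS strings (a Python `set` stands in for the hashTable), removal of single words, and a match on the last word of each term. The function names, the case-insensitive comparison, and the reading of "EndWord" as the final word of a term are assumptions for illustration.

```python
def match_umls_strings(ngrams, umls_strings):
    """Step 32 sketch: keep n-grams found in the UMLS string table."""
    table = {s.lower() for s in umls_strings}  # hash-based lookup
    return [ng for ng in ngrams if ng.lower() in table]


def keep_multiwords(terms):
    """Step 33 sketch: remove single words, keep multiword terms."""
    return [t for t in terms if len(t.split()) > 1]


def match_end_words(terms, end_words):
    """Step 34 sketch: keep terms whose last word is a top end word."""
    top = {w.lower() for w in end_words}
    return [t for t in terms if t.split()[-1].lower() in top]


# Placeholder data for the example.
ngrams = ["heart attack", "aspirin", "cell line", "foo bar"]
umls_strings = ["Heart Attack", "Aspirin", "Cell Line"]

matched = match_umls_strings(ngrams, umls_strings)
multiwords = keep_multiwords(matched)
candidates = match_end_words(multiwords, ["attack"])
```

Each stage only narrows the previous stage's output, so the order of the surviving terms is preserved from the n-gram input.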
| Post-Process: Auto remove, tag, and resort to final format
|
---|
35 | Post-Proc: Auto-remove candidates already in the Lexicon, and remove/tag candidates that are invalid LMWs based on the previous tags
| - 34.disNGram.Core.endword.out
- Use 00.CandidateList
=> File updates are required; proceed with Steps 1-3
- validFile: ${CAND_DIR}/0.LexiconInflVars/inflVars.data.current.1.uSort
- invalidFile: ${CAND_DIR}/totalTerms.all.lmw.no
| - 35.disNGram.Core.endword.out.autoTag
- 35.disNGram.Core.endword.out.rmYesNo
- 35.disNGram.Core.endword.out.rmYesTagNo
| - Filter and tag candidates that are in the previous years' lists
- Make sure to update the data and re-run 00.CandidateList for the latest updates
| 36 | Post-Proc: Rearrange (resort) canList by grouping singulars/plurals
| - 35.disNGram.Core.endword.out.rmYesNo
- 35.disNGram.Core.endword.out.rmYesTagNo
| - 36.disNGram.Core.endword.out.rmYesNo.gsp
=> cp to 36.disNGram.Core.endword.out.rmYesNo.gsp.${YEAR}
- 36.disNGram.Core.endword.out.rmYesTagNo.gsp
=> cp to 36.disNGram.Core.endword.out.rmYesTagNo.gsp.${YEAR}
=> This file is used for annual candidate list
| - Re-sort and group the list so that singular and plural forms sit together
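
The grouping in Step 36 can be sketched as a sort on a normalized key, so that a singular and its plural end up adjacent. The version below uses a deliberately naive suffix rule on the last word; this is a stand-in chosen for illustration, not the method the actual tool uses, and the function names are hypothetical.

```python
def naive_singular(term):
    """Very rough singular form of the last word (illustrative only)."""
    words = term.lower().split()
    if not words:
        return ""
    last = words[-1]
    if last.endswith("ies") and len(last) > 4:
        last = last[:-3] + "y"
    elif last.endswith("s") and not last.endswith("ss") and len(last) > 3:
        last = last[:-1]
    words[-1] = last
    return " ".join(words)


def group_singular_plural(candidates):
    """Sort by (singular key, term) so related forms are adjacent."""
    return sorted(candidates, key=lambda t: (naive_singular(t), t.lower()))


# Placeholder candidate list for the example.
candidates = [
    "stem cells",
    "blood test",
    "stem cell",
    "blood tests",
    "case studies",
    "case study",
]
grouped = group_singular_plural(candidates)
```

A suffix rule like this will mishandle irregular plurals; a lookup of known inflectional variants would be the more reliable key if one is available.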
| Future Usage
|
---|
40 | PreProc: Get nGram spVar from result of 8.MatcherSpVar
| - medline.2.byM2CES.2.out.30.spVars.2016
|
| - get the n-grams that match spVar patterns
| 41 | Proc: Apply matcher of nGram-SpVar on nGram
| - 34.disNGram.Core.endword.out
- nGramSpVars.data
|
| - Loses recall; not used for now
|