Multiword Candidates Routine Generation Procedures
In addition to user-requested terms and sources, LSG also follows routine procedures to generate multiword candidate lists for building the Lexicon:
Invalid LMWs | Notes |
---|---|
Note that the restriction on complex NPs (law of articulation) can be overridden if a term is a true compound, with its own meaning apart from the constituent NP + PP. For instance, “tug of war” is considered a valid LMW, since it has a definition that could not be inferred from the combination of meanings tug + of + war alone. It also undergoes pluralization as a unit – [tug of war]s and not [tug]s of war[s].
Lexicon Release | Candidate Files | Status | Notes |
---|---|---|---|
2015 |
Example | Notes |
---|---|
zona pellucida (ZP) | E0216465 |
major hydrophilic region (MHR) | E0760361 |
diabetic foot syndrome (DFS) | E0564279 |
diabetic foot ulcer (DFU) | E0715662 |
years lived with disability (YLD) | Invalid LMWs |
persons who stutter (PWS) | Invalid LMWs |
violence against women (VAM) | Invalid LMWs |
zero-point energy (ZPE) | Invalid LMWs |
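Each example above pairs a multiword expansion with its acronym in parentheses. The following is a minimal sketch of how such entries might be split into an (expansion, acronym) pair. It assumes one entry per string in the `expansion (ACRONYM)` form shown in the table; the function name `split_acronym_entry` and the pattern are hypothetical and are not part of the LSG tooling.

```python
import re

# Matches "<expansion> (<ACRONYM>)". The pattern is an assumption based on
# the example entries, not the format of the actual candidate files.
_ENTRY = re.compile(r"^(?P<expansion>.+?)\s*\((?P<acronym>[^()]+)\)\s*$")


def split_acronym_entry(entry):
    """Return (expansion, acronym), or None if the entry has no acronym."""
    match = _ENTRY.match(entry.strip())
    if match is None:
        return None
    return match.group("expansion"), match.group("acronym")


if __name__ == "__main__":
    for text in ["zona pellucida (ZP)", "zero-point energy (ZPE)"]:
        print(split_acronym_entry(text))
```

The sketch only separates the two parts; deciding whether the expansion is a valid LMW still requires tagging.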
Typically, the MEDLINE N-gram Set (MNS) is released 3 to 9 months after the Lexicon.
Distilled MEDLINE nGram Set | Candidate Files | Status | Notes |
---|---|---|---|
2015 | acronymExp.tag.data.tag.final.tbd.2015 | Done | |
2016 | acronymExp.tag.data.tag.final.tbd.2016 | Done | |
2017 | acronymExp.tag.data.tag.final.tbd.2017 | Done | |
2018 | acronymExp.tag.data.tag.final.tbd.2018 | Done | |
2019 | acronymExp.tag.data.tag.final.tbd.2019.Used.rmYesNo | Done | |
2020 | acronymExp.tag.data.tag.final.tbd.2020.used.rmYesNo | Done | |
2021 | acronymExp.tag.data.tag.final.tbd.2021.used.rmYesNo | Sent-tagging | |
flds 1 EndWord.1.analysis.stats > EndWord.1.analysis.stats.1
Distilled MEDLINE nGram Set | Candidate Files | Status | Notes |
---|---|---|---|
2016 | 35.disNGram.Core.endword.out.gsp.2016 | Done | |
2017 | 36.disNGram.Core.endword.out.rmYesTagNo.gsp.2017 | Done | |
2018 | 36.disNGram.Core.endword.out.rmYesTagNo.gsp.2018 | Done | |
2019 | 36.disNGram.Core.endword.out.rmYesTagNo.gsp.2019 | Done | |
2020 | 36.disNGram.Core.endword.out.rmYesTagNo.gsp.2020 | Done | |
2021 | 36.disNGram.Core.endword.out.rmYesTagNo.gsp.2021 | TBD | |
Distilled MEDLINE nGram Set | Candidate Files | Status | Notes |
---|---|---|---|
2015 | | Done | Tag [Y|N] |
2016+ | N/A | Postponed due to limited resources |
Post-processes:
All invalid terms from the tagged candidate lists should be retrieved and saved to "invalidLmwList.out". These invalid LMW terms, and words already in the Lexicon, should be filtered out of the candidate list.
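The filtering described above can be sketched as follows. This is a minimal illustration that assumes one term per line and case-insensitive matching. The function names and the in-memory lists are hypothetical; only the file name invalidLmwList.out comes from the text, and the actual LSG programs may work differently.

```python
def normalize(term):
    """Lowercase and collapse whitespace so equivalent terms compare equal."""
    return " ".join(term.lower().split())


def filter_candidates(candidates, invalid_lmws, lexicon_terms):
    """Drop candidates that are known invalid LMWs or already in the Lexicon.

    Input order is preserved and duplicate candidates are removed.
    """
    excluded = {normalize(t) for t in invalid_lmws}
    excluded |= {normalize(t) for t in lexicon_terms}
    seen = set()
    kept = []
    for candidate in candidates:
        key = normalize(candidate)
        if key and key not in excluded and key not in seen:
            seen.add(key)
            kept.append(candidate)
    return kept


def read_terms(path):
    """Read one term per line (assumed format), skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]
```

For example, `filter_candidates(candidates, read_terms("invalidLmwList.out"), lexicon_terms)` would return only the candidates that still need tagging, where `candidates` and `lexicon_terms` are lists loaded by the caller.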
A systematic post-process was implemented to filter valid and invalid LMWs from the candidate list:
${MULTIWORDS}/bin/00.CandidateList
Step | Description | Input | Output | Notes |
---|---|---|---|---|
1 | Aggregate and analyze all previous LMW candidate files => This program analyzes the precision of the candidate list (candidates that are valid LMWs) | | | Must update: |
2 | Aggregate and analyze terms tagged as not baseForm or not LMW from LexCheck/candidate files => This program analyzes the precision of invalid LMWs from notBaseForm.data and notLmw.data from the annual Lexicon tagging | | | Must update: |
3 | Combine the output files from steps 1 and 2 to get the total data set | | | Must run steps 1 and 2 |
10 | Filter and tag valid/invalid LMWs for a candidate file | | | |
20 | Generate DL TtSet from valid/invalid LMW candidate files | | | |
21 | Generate DL TtSet from inflVars (valid) and invalid LMWs in n-grams | | | |
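Step 1 analyzes the precision of a candidate list, that is, the share of tagged candidates that turn out to be valid LMWs. The following is a minimal sketch of that calculation, assuming each candidate carries a Y or N tag as noted for the tagged files. The function name `candidate_precision` and the (term, tag) pair representation are hypothetical and do not reflect the actual program under ${MULTIWORDS}/bin/00.CandidateList.

```python
def candidate_precision(tagged):
    """Return the fraction of tagged candidates marked valid ("Y").

    `tagged` is an iterable of (term, tag) pairs. Tags other than "Y" or
    "N" (for example, untagged entries) are ignored.
    """
    valid = invalid = 0
    for _term, tag in tagged:
        tag = tag.strip().upper()
        if tag == "Y":
            valid += 1
        elif tag == "N":
            invalid += 1
    total = valid + invalid
    # Avoid division by zero when nothing has been tagged yet.
    return valid / total if total else 0.0


if __name__ == "__main__":
    sample = [
        ("diabetic foot ulcer", "Y"),
        ("years lived with disability", "N"),
        ("zona pellucida", "Y"),
    ]
    print(f"precision = {candidate_precision(sample):.3f}")
```

Running the same calculation per candidate source would allow sources to be compared by how many valid LMWs they yield.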