Total Data Set Files and Usage
The procedure for adding LMWs is:
- Generate an n-gram set from MEDLINE (or any other corpus)
- Generate candidate lists from the n-gram set
- Linguists manually tag/add terms from the candidate lists
It is very useful to collect all manually tagged terms. The collection can be used as:
- A final filter to exclude previously tagged terms
  - Use all inflVars to filter out valid LMWs (terms already in the Lexicon)
  - Use the collected invalid LMWs to tag invalid terms as a reference for linguists (some invalid terms might become valid).
  => Please note that the above two data sets change whenever the Lexicon is updated, a new candidate list is completed, or a not base/LMW file is updated in LexCheck
- A training/test data set for deep learning models
The manually tagged data are collected from two sources (previous candidate lists and the not base/LMW files from LexCheck), as described below:
- Program: ${MULTIWORDS}/bin/00.CandidateList
3
- Data directory: ${MULTIWORDS}/data/Candidate/
- Algorithm:
  - Get total data from
    - All previous candidate lists
    - All not base/LMW files from LexCheck
  - Use the latest Lexicon (inflVars.data) to tag valid/invalid LMWs
    - valid LMWs: totalData.data.yes
    - invalid LMWs: totalData.data.no
  - Use the tagged results to tag new candidate lists:
    - valid LMWs: inflVars.data
    - invalid LMWs: totalData.data.no
    - TBD (new candidate list): others
- Out Files:
  - totalData.data.*
| Date | Notes | Total Candidates (totalData.data) | Valid LMWs (totalData.data.yes) | Invalid LMWs (totalData.data.no) |
|---|---|---|---|---|
| 2018-11-15 | 2.MNSMatcherParAcr, 2017 | 31004 | 16331 (52.67%) | 14673 (47.32%) |
| 2019-01-03 | 2.MNSMatcherParAcr, 2018 | 31806 | 16924 (53.21%) | 14882 (46.79%) |
| 2019-05-20 | 3.DMNSMatcherCuiEndWord, 2017 | 33751 | 16924 (50.14%) | 16827 (49.86%) |
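
The tagging steps in the algorithm above can be sketched roughly as follows. This is only an illustration, not the actual 00.CandidateList program: the function names, the in-memory sets, and the lowercase matching are all assumptions, and the real file formats of inflVars.data and totalData.data.* are not described here.

```python
def normalize(term):
    # Assumption: terms are compared case-insensitively after trimming.
    return term.strip().lower()


def split_total_data(total_terms, infl_vars):
    """Split collected terms into valid LMWs (found in the Lexicon's
    inflectional variants) and invalid LMWs (everything else)."""
    known = {normalize(v) for v in infl_vars}
    valid = sorted(t for t in total_terms if normalize(t) in known)
    invalid = sorted(t for t in total_terms if normalize(t) not in known)
    return valid, invalid


def tag_candidates(candidates, infl_vars, invalid_lmws):
    """Tag each new candidate as 'valid', 'invalid', or 'TBD'."""
    known = {normalize(v) for v in infl_vars}
    rejected = {normalize(t) for t in invalid_lmws}
    tags = {}
    for term in candidates:
        key = normalize(term)
        if key in known:
            tags[term] = "valid"      # already in the Lexicon
        elif key in rejected:
            tags[term] = "invalid"    # previously tagged as invalid
        else:
            tags[term] = "TBD"        # left for linguists to review
    return tags


def percent(part, total):
    """Share of the total as a percentage, rounded to two decimals."""
    return round(100.0 * part / total, 2)


# Hypothetical toy data, not taken from the real files.
infl_vars = {"heart attack", "blood pressure"}
total_terms = ["heart attack", "in the", "blood pressure", "of a"]

valid, invalid = split_total_data(total_terms, infl_vars)
tags = tag_candidates(["Heart attack", "of a", "stem cell"], infl_vars, invalid)
```

In this sketch only the candidates tagged "TBD" would go to linguists for review. Because the split depends on inflVars.data, it has to be rerun whenever the Lexicon is updated.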