Distilled MEDLINE N-Gram Set
I. Introduction
The MEDLINE n-gram set includes many invalid LMWs that are not needed for most NLP research. LSG developed a set of exclusive filters to remove these invalid LMWs. Applying the filters removes about two-thirds of the n-grams in the MEDLINE n-gram set release. The resulting filtered set is called the distilled MEDLINE n-gram set.
II. Precision and Recall
The distilled MEDLINE n-gram set has higher precision than the MEDLINE n-gram set and a similar recall rate in terms of valid multiwords. LSG tests the accuracy of every exclusive filter by applying it to the Lexicon (valid LMWs). The minimum passing rate is 99.99%. In other words, these filters remove invalid LMWs while removing almost no valid LMWs. A simple calculation is described below:
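As an illustrative sketch (not LSG's published calculation), the effect can be estimated from the two figures given in the text. The symbols here are assumptions introduced for the example: $N$ is the number of n-grams in the MEDLINE n-gram set and $V$ is the number of valid multiwords among them. The sketch also assumes that the filters keep about one third of the n-grams and that the 99.99% passing rate measured on the Lexicon carries over to the valid multiwords in the n-gram set.

```latex
\begin{align*}
P_{\text{orig}} &= \frac{V}{N}, \\
P_{\text{distilled}} &\approx \frac{0.9999\,V}{N/3}
  = 2.9997\,\frac{V}{N} \approx 3\,P_{\text{orig}}, \\
R_{\text{distilled}} &\approx 0.9999\,R_{\text{orig}}.
\end{align*}
```

Under these assumptions precision roughly triples, while recall drops by at most about 0.01%.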
III. Conclusion
The distilled MEDLINE n-gram Set vs. MEDLINE n-gram Set
IV. Release Processes
shell>cd ${MULTIWORDS}/data/${YEAR}/outData/02.NGram/nGrams
shell>wc -l nGramSet.${YEAR}.30
Year | nGram Number |
---|---|
2014 | 17,023,819 |
2015 | 18,148,692 |
2016 | 19,325,338 |
2017 | 21,963,037 |
2018 | 23,171,133 |
2019 | 24,666,816 |
2020 | 26,310,808 |
2021 | 28,103,252 |
2022 | 30,090,771 |
2023 | 32,107,061 |
2024 | 34,160,908 |
2025 | 36,370,468 |
shell>mkdir ${MULTIWORD_DIR}/data/${YEAR}/outData/05.ApplyFilters
shell>cd ${MULTIWORD_DIR}/data/${YEAR}/outData/05.ApplyFilters
shell>ln -sf ../02.NGram/nGrams/nGramSet.${YEAR}.30 nGram.${YEAR}
shell>cd ${MULTIWORD_DIR}/data/${YEAR}/inData
shell>ln -sf nfsvol/lex/Lu/Backup/Releases/UMLS/${YEAR}_AA_release/LEX/NUMBERS/NRVAR NRVAR
shell>cd ${MULTIWORD_DIR}/data/${YEAR}/inData
shell>cp -p ../../${PREV_YEAR}/inData/stopWords.data.${PREV_YEAR} stopWords.data.${YEAR}
shell>ln -sf ./stopWords.data.${YEAR} stopWords.data
shell>cp -p ../../${PREV_YEAR}/inData/unit.data.${PREV_YEAR} unit.data.${YEAR}
shell>ln -sf ./unit.data.${YEAR} unit.data
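The copy-then-link pattern used for `stopWords.data` and `unit.data` recurs in several later steps. A minimal sketch of it as a shell function follows; `carry_forward` and its argument order are hypothetical and not part of the LSG tools.

```shell
# carry_forward BASE PREV_DIR PREV_YEAR YEAR
# Hypothetical helper: copies PREV_DIR/BASE.PREV_YEAR to ./BASE.YEAR
# (preserving timestamps) and points the unversioned name BASE at the copy.
carry_forward() {
    cf_base=$1
    cf_prev_dir=$2
    cf_prev_year=$3
    cf_year=$4
    cf_src="${cf_prev_dir}/${cf_base}.${cf_prev_year}"

    # Stop early if last year's file is missing instead of creating a dangling link.
    if [ ! -f "$cf_src" ]; then
        echo "missing source: $cf_src" >&2
        return 1
    fi

    cp -p "$cf_src" "./${cf_base}.${cf_year}" || return 1
    ln -sf "./${cf_base}.${cf_year}" "${cf_base}"
}
```

Run from `${MULTIWORD_DIR}/data/${YEAR}/inData`, a call such as `carry_forward stopWords.data ../../${PREV_YEAR}/inData ${PREV_YEAR} ${YEAR}` would be equivalent to the two `stopWords.data` commands above.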
shell>cd ${MULTIWORD_DIR}/data/${YEAR}/outData/03.LeadEndTerm
shell>cat invalidLeadTerms.data invalidLeadTerms.data.append > invalidLeadTerms.data.${YEAR}
shell>cp -p ../../../${PREV_YEAR}/outData/03.LeadEndTerm/invalidLeadTerms.data.${PREV_YEAR} invalidLeadTerms.data.${YEAR}
shell>ln -sf ./invalidLeadTerms.data.${YEAR} invalidLeadTerms.data.abs
shell>cd ${MULTIWORD_DIR}/data/${YEAR}/outData/03.LeadEndTerm
shell>mv invalidEndTerms.data invalidEndTerms.data.${YEAR}
shell>cp -p ../../../${PREV_YEAR}/outData/03.LeadEndTerm/invalidEndTerms.data.${PREV_YEAR} invalidEndTerms.data.${YEAR}
shell>ln -sf ./invalidEndTerms.data.${YEAR} invalidEndTerms.data.abs
shell>cd ${MULTIWORD_DIR}/data/${YEAR}/outData/03.LeadEndTerm
shell>cp -p ../../../${PREV_YEAR}/outData/03.LeadEndTerm/invalidLeadEndTermCandidates.data .
03.LeadEndTerm
shell>cd ${MULTIWORD_DIR}/data/${YEAR}/outData/03.LeadEndTerm
shell>cp -p ../../../${PREV_YEAR}/outData/03.LeadEndTerm/validLeadTerms.data.pat.${PREV_YEAR} validLeadTerms.data.pat.${YEAR}
shell>ln -sf ./validLeadTerms.data.pat.${YEAR} validLeadTerms.data.pat
shell>cd ${MULTIWORD_DIR}/data/${YEAR}/outData/03.LeadEndTerm
shell>cp -p ../../../${PREV_YEAR}/outData/03.LeadEndTerm/validEndTerms.data.pat.${PREV_YEAR} validEndTerms.data.pat.${YEAR}
shell>ln -sf ./validEndTerms.data.pat.${YEAR} validEndTerms.data.pat
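Because the steps above create many symbolic links, it may be worth confirming that each one resolves before running the filters. The following is a sketch only; `broken_links` is a hypothetical helper, not an LSG script.

```shell
# broken_links DIR
# Hypothetical helper: prints every symbolic link directly under DIR
# whose target does not exist. Prints nothing if all links resolve.
broken_links() {
    for bl_path in "$1"/*; do
        # -L is true for a symlink; -e follows the link and fails if it dangles.
        if [ -L "$bl_path" ] && [ ! -e "$bl_path" ]; then
            echo "$bl_path"
        fi
    done
}
```

For example, `broken_links ${MULTIWORD_DIR}/data/${YEAR}/inData` and `broken_links ${MULTIWORD_DIR}/data/${YEAR}/outData/03.LeadEndTerm` should both print nothing when the links are set up correctly.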
shell>cd ${MULTIWORDS}/bin
shell>05.ApplyFilters ${YEAR}
1
10-14
20-25
30-34
40
or
shell>cd 05.ApplyFiltersAll
shell>runApplyFilersAll ${YEAR}
shell> cp -p ApplyFilters.rpt ApplyFilters.rpt.${YEAR}
shell> cp -p nGram.${YEAR}.34.invEndTermPat ../02.NGram/nGrams/distilledNGram.${YEAR}
shell> gtar -czvf distilledNGram.${YEAR}.tgz distilledNGram.${YEAR}
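Before releasing the archive, a quick sanity check is to compare the size of the distilled set with the size of the input set, since the filters are expected to remove about two-thirds of the n-grams. The sketch below assumes one n-gram per line (as the `wc -l` count earlier suggests); `line_count` and `removed_pct` are hypothetical helper names.

```shell
# line_count FILE
# Number of lines in FILE, without the padding some wc implementations add.
line_count() {
    wc -l < "$1" | tr -d ' '
}

# removed_pct TOTAL KEPT
# Percentage of entries removed, to one decimal place.
removed_pct() {
    awk -v total="$1" -v kept="$2" 'BEGIN {
        if (total <= 0) {
            print "total must be positive" > "/dev/stderr"
            exit 1
        }
        printf "%.1f\n", 100 * (total - kept) / total
    }'
}
```

For instance, `removed_pct "$(line_count nGram.${YEAR})" "$(line_count ../02.NGram/nGrams/distilledNGram.${YEAR})"` should print a value near 66 if the year's run behaves like previous releases; a very different value would suggest checking `ApplyFilters.rpt`.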
shell> cd ${MULTIWORDS}/bin
shell> 06.NGramUtil ${YEAR}
20 (for the MEDLINE nGram set)
21 (for the distilled nGram set)
22 (for nGrams, N = 3 ~ 5)
3
4
5
V. Release Logs
VI. Run the Test data on the Lexicon