Consumer Health Corpus - The Distilled N-gram Set
I. Introduction
The raw n-gram set includes many invalid LMWs. These invalid LMWs are filtered out through a series of exclusive filters. The filters are designed to increase precision while keeping recall the same (or similar) in terms of LMWs. In other words, they remove only invalid LMWs and leave valid LMWs in place. The filtering process removes about 2/3 of the n-grams and results in the distilled n-gram set. This distilled n-gram set is used by further processes (matchers) to generate LMW candidates.
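As an illustration of how a chain of exclusive filters behaves, here is a minimal Python sketch. The two filters shown (`is_pure_number`, `has_url_like_token`) and the `distill` helper are hypothetical and are not the filters used in this project, whose details are still listed as TBD. The sketch only shows the general pattern: an n-gram is dropped as soon as any one filter flags it, and is kept otherwise.

```python
from typing import Callable, Iterable, List

# An exclusive filter returns True when the n-gram should be removed.
ExclusiveFilter = Callable[[str], bool]


def is_pure_number(term: str) -> bool:
    """Hypothetical filter: flag terms that contain no alphabetic character."""
    return not any(ch.isalpha() for ch in term)


def has_url_like_token(term: str) -> bool:
    """Hypothetical filter: flag terms with a token that looks like a URL."""
    prefixes = ("http://", "https://", "www.")
    return any(tok.startswith(prefixes) for tok in term.split())


def distill(ngrams: Iterable[str], filters: List[ExclusiveFilter]) -> List[str]:
    """Keep only the n-grams that no exclusive filter flags."""
    return [t for t in ngrams if not any(f(t) for f in filters)]


raw = ["heart attack", "12 34", "www.example.com page", "blood pressure"]
distilled = distill(raw, [is_pure_number, has_url_like_token])
print(distilled)
```

Because each filter only removes terms it positively identifies as invalid, adding a filter to the list can raise precision without touching terms the other filters already accepted.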
II. Process
${LMW_DIR}/bin/21.CSpellHealthCorpus
2017
Option | Description | Inputs - ${IN_DIR}/${OUT_DIR} | Outputs - ${OUT_DIR} | Notes |
---|---|---|---|---|
Generate the raw n-gram set | | | | |
10 | Sort nGrams by DC\|WC\|Term | 21.CSpellHealthCorpus/nGrams/nGramSet.${YEAR}.${MIN_WC} | 21.CSpellHealthCorpus/nGrams/nGramSet.${YEAR}.${MIN_WC}.dwt | Sort by DC, WC, alphabetic order of n-gram |
11 | Generate the distilled n-gram set | 21.CSpellHealthCorpus/nGrams/nGramSet.${YEAR}.${MIN_WC}.dwt | 21.CSpellHealthCorpus/nGrams/ | Go through exclusive filters |
12 | TBD ... Sort all raw n-gram files | 21.CSpellHealthCorpus/nGrams/nGram.${N}.data | | |
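The sort in option 10 can be sketched in Python as follows. This is only an illustration under stated assumptions: each record is assumed to be a `DC|WC|Term` line (taken here to mean document count, word count, and the n-gram), and the order is assumed to be descending DC, then descending WC, then ascending alphabetical order of the term. The source does not state the sort direction, and `parse_line` and `sort_dwt` are hypothetical names.

```python
from typing import Iterable, List, Tuple

# (DC, WC, term); field meanings are an assumption, see the text above.
Record = Tuple[int, int, str]


def parse_line(line: str) -> Record:
    """Parse one 'DC|WC|Term' line into a typed record."""
    dc, wc, term = line.rstrip("\n").split("|", 2)
    return int(dc), int(wc), term


def sort_dwt(records: Iterable[Record]) -> List[Record]:
    """Sort by DC (descending), WC (descending), then term (ascending).

    The descending direction for the counts is an assumption.
    """
    return sorted(records, key=lambda r: (-r[0], -r[1], r[2]))


lines = ["2|5|blood pressure", "3|4|heart attack", "2|5|asthma"]
sorted_records = sort_dwt(parse_line(line) for line in lines)
for dc, wc, term in sorted_records:
    print(f"{dc}|{wc}|{term}")
```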
III. Filter Details
TBD...
IV. Results
N-grams | File | Zip Size | Actual Size | No. of n-grams |
---|---|---|---|---|
Unigrams | nGram.1.2017.tgz | 0.985 MB | 2.8 MB | 194,407 |
Bigrams | nGram.2.2017.tgz | 6.6 MB | 23 MB | 1,233,365 |
Trigrams | nGram.3.2017.tgz | 18 MB | 65 MB | 2,806,783 |
Four-grams | nGram.4.2017.tgz | 29 MB | 111 MB | 3,906,380 |
Five-grams | nGram.5.2017.tgz | 39 MB | 149 MB | 4,396,030 |
N-gram Set | nGramSet.2017.1.tgz | 92 MB | 350 MB | 12,536,965 |
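As a quick consistency check on the table, the per-length counts should add up to the count in the N-gram Set row. The short Python sketch below uses only the numbers from the table; the variable names are illustrative.

```python
# Number of n-grams per length, copied from the results table.
counts = {
    "Unigrams": 194_407,
    "Bigrams": 1_233_365,
    "Trigrams": 2_806_783,
    "Four-grams": 3_906_380,
    "Five-grams": 4_396_030,
}

# Count reported in the "N-gram Set" row of the table.
reported_total = 12_536_965

total = sum(counts.values())
print(total, total == reported_total)
```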