Consumer Health Corpus - The Distilled N-gram Set
I. Introduction
The raw n-gram set includes many invalid LMWs. These invalid LMWs are removed by a series of exclusive filters. The filters are designed to increase precision while keeping the same (or similar) recall in terms of LMWs. In other words, they remove only invalid LMWs and leave valid LMWs in place. The filtering process removes about 2/3 of the n-grams and results in the distilled n-gram set. This distilled n-gram set is used by further processes (matchers) to generate LMW candidates.
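The idea of a chain of exclusive filters can be illustrated with a minimal Python sketch. The two filter predicates below (`has_no_letters`, `ends_with_punctuation`) and the function `distill` are hypothetical placeholders chosen for illustration only; they are not the filters used by the actual process, whose details are not given here. An n-gram is kept only if no filter rejects it.

```python
from typing import Callable, Iterable, List

# An exclusive filter returns True when the n-gram should be dropped.
ExclusiveFilter = Callable[[str], bool]


def has_no_letters(term: str) -> bool:
    """Hypothetical filter: the term contains no alphabetic character."""
    return not any(ch.isalpha() for ch in term)


def ends_with_punctuation(term: str) -> bool:
    """Hypothetical filter: the term ends with a punctuation mark."""
    return term[-1:] in {",", ";", ":", ".", "(", "-"}


def distill(terms: Iterable[str], filters: List[ExclusiveFilter]) -> List[str]:
    """Keep only the terms that are not rejected by any filter."""
    return [t for t in terms if not any(f(t) for f in filters)]


raw = ["heart attack", "123 456", "blood pressure,", "type 2 diabetes"]
distilled = distill(raw, [has_no_letters, ends_with_punctuation])
print(distilled)
```

Because each filter only removes terms, adding a filter to the list can raise precision but never adds n-grams; recall is preserved only as long as every filter rejects invalid terms exclusively.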
II. Process
${LMW_DIR}/bin/21.CSpellHealthCorpus
2017
Option | Description | Inputs - ${IN_DIR}/${OUT_DIR} | Outputs - ${OUT_DIR} | Notes |
---|---|---|---|---|
Generate the raw n-gram set | | | | |
10 | Sort nGrams by DC\|WC\|Term | 21.CSpellHealthCorpus/nGrams/nGramSet.${YEAR}.${MIN_WC} | 21.CSpellHealthCorpus/nGrams/nGramSet.${YEAR}.${MIN_WC}.dwt | Sort by DC, WC, alphabetic order of n-gram |
11 | Generate the distilled n-gram set | 21.CSpellHealthCorpus/nGrams/nGramSet.${YEAR}.${MIN_WC}.dwt | 21.CSpellHealthCorpus/nGrams/ | Go through exclusive filters |
12 | TBD... Sort all raw n-gram files | 21.CSpellHealthCorpus/nGrams/nGram.${N}.data | | |
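The sort in option 10 can be sketched in Python as follows. This is a sketch under stated assumptions, not the actual implementation: it assumes each record is a pipe-delimited line of the form `DC|WC|Term`, that DC and WC are integer counts, and that both counts sort in descending order before the term sorts alphabetically. The function name `sort_dwt` is hypothetical.

```python
from typing import Iterable, List


def sort_dwt(lines: Iterable[str]) -> List[str]:
    """Sort 'DC|WC|Term' records by DC, then WC, then term.

    Assumptions (not confirmed by the source): counts are integers,
    counts sort in descending order, terms sort in ascending order.
    """
    records = []
    for line in lines:
        # Split on the first two pipes only, so the term is kept whole.
        dc, wc, term = line.rstrip("\n").split("|", 2)
        records.append((int(dc), int(wc), term))
    records.sort(key=lambda r: (-r[0], -r[1], r[2]))
    return [f"{dc}|{wc}|{term}" for dc, wc, term in records]


sample = [
    "2|5|heart attack",
    "3|4|blood pressure",
    "2|5|blood test",
    "2|7|chest pain",
]
for record in sort_dwt(sample):
    print(record)
```

If the real files sort counts in ascending order, only the signs in the sort key need to change.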
III. Filter Details
TBD...
IV. Results
N-grams | File | Zip Size | Actual Size | No. of n-grams |
---|---|---|---|---|
Unigrams | nGram.1.2017.tgz | 0.985 Mb | 2.8 Mb | 194,407 |
Bigrams | nGram.2.2017.tgz | 6.6 Mb | 23 Mb | 1,233,365 |
Trigrams | nGram.3.2017.tgz | 18 Mb | 65 Mb | 2,806,783 |
Four-grams | nGram.4.2017.tgz | 29 Mb | 111 Mb | 3,906,380 |
Five-grams | nGram.5.2017.tgz | 39 Mb | 149 Mb | 4,396,030 |
N-gram Set | nGramSet.2017.1.tgz | 92 Mb | 350 Mb | 12,536,965 |
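As a quick consistency check on the table, the count for the N-gram Set row equals the sum of the unigram through five-gram counts. The short Python snippet below reproduces that arithmetic using only the numbers listed above; the variable names are illustrative.

```python
# Number of n-grams per n, copied from the results table.
counts = {
    1: 194_407,
    2: 1_233_365,
    3: 2_806_783,
    4: 3_906_380,
    5: 4_396_030,
}

total = sum(counts.values())
print(f"{total:,}")  # 12,536,965, matching the N-gram Set row
```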