Consumer Health Corpus - The Raw N-gram Set
I. Introduction
The Consumer Health Corpus (the corpus used in CSpell) is the source of the n-gram set used for LMW candidate generation.
II. N-gram Set Specifications
Document count | Word Count | N-gram |
III. Process
${LMW_DIR}/bin/21.CSpellHealthCorpus
2017
Option | Description | Inputs - ${IN_DIR}/${OUT_DIR} | Outputs - ${OUT_DIR} | Notes |
---|---|---|---|---|
Generate the raw n-gram set | ||||
2 | Convert Xml files to Raw Corpus Text files | CSpellHealthCorpus/Crawl/*/*.html | 21.CSpellHealthCorpus/RawCorpus/*.data | |
4 | Generate all raw n-gram files (N = 1-5) | 21.CSpellHealthCorpus/RawCorpus/*.data | 21.CSpellHealthCorpus/nGrams/nGram.${N}.data | |
6 | Sort all raw n-gram files | 21.CSpellHealthCorpus/nGrams/nGram.${N}.data | | |
7 | Generate the raw n-gram set | 21.CSpellHealthCorpus/nGrams/nGram.${N}.data.dwt | 21.CSpellHealthCorpus/nGrams/nGramSet.${YEAR}.1 | min wC = 1 |
8 | Zip n-gram set | | | |
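The counting, sorting, and merging steps in the table above (options 4, 6, and 7) can be illustrated with a minimal Python sketch. This is not the LMW program itself: the function names `ngrams`, `count_ngrams`, and `build_ngram_set`, the lowercase whitespace tokenization, the rule that n-grams never cross a line boundary, and the in-memory output are all assumptions made for illustration. The actual tokenization rules and file formats of the `nGram.${N}.data` files are not described here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield each run of n consecutive tokens as a space-joined string."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])


def count_ngrams(lines, max_n=5):
    """Count n-grams for N = 1..max_n, one line of corpus text at a time."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        # Assumption: lowercase, whitespace-delimited tokens.
        tokens = line.lower().split()
        for n in counts:
            counts[n].update(ngrams(tokens, n))
    return counts


def build_ngram_set(counts, min_wc=1):
    """Merge the per-N counts into one list sorted by N, then by n-gram,
    keeping only entries whose word count is at least min_wc."""
    merged = []
    for n in sorted(counts):
        for gram, wc in sorted(counts[n].items()):
            if wc >= min_wc:
                merged.append((gram, wc, n))
    return merged


# Hypothetical two-line corpus, for illustration only.
corpus = ["Take one tablet daily", "take one tablet with food"]
counts = count_ngrams(corpus)
ngram_set = build_ngram_set(counts, min_wc=1)
```

With `min_wc=1` every observed n-gram is kept, so the merged set is simply the union of the five per-N lists. Raising the threshold would drop the rare n-grams.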
IV. Results
N-grams | File | Zip Size | Actual Size | No. of n-grams |
---|---|---|---|---|
Unigrams | nGram.1.2017.tgz | 0.985 Mb | 2.8 Mb | 194,407 |
Bigrams | nGram.2.2017.tgz | 6.6 Mb | 23 Mb | 1,233,365 |
Trigrams | nGram.3.2017.tgz | 18 Mb | 65 Mb | 2,806,783 |
Four-grams | nGram.4.2017.tgz | 29 Mb | 111 Mb | 3,906,380 |
Five-grams | nGram.5.2017.tgz | 39 Mb | 149 Mb | 4,396,030 |
N-gram Set | nGramSet.2017.1.tgz | 92 Mb | 350 Mb | 12,536,965 |
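As a quick consistency check on the table above, the n-gram count reported for the combined set should equal the sum of the five per-N counts, which is what one would expect if the min wC = 1 threshold removes nothing. The short sketch below redoes that arithmetic; the variable names are illustrative only, and the numbers are copied from the table.

```python
# Per-N n-gram counts copied from the Results table.
per_n = {
    1: 194_407,
    2: 1_233_365,
    3: 2_806_783,
    4: 3_906_380,
    5: 4_396_030,
}

# Count reported for nGramSet.2017.1 in the same table.
reported_total = 12_536_965

total = sum(per_n.values())
print(total, total == reported_total)  # prints "12536965 True"
```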