The SPECIALIST Lexicon

Consumer Health Corpus - The Raw N-gram Set

I. Introduction

The Consumer Health Corpus (the corpus used in CSpell) is the source of the n-gram set for LMW candidate generation.

II. N-gram set Specifications

  • Corpus: Consumer Health Corpus (2017)
  • Method:
  • Max. Character size: 50
  • Min. word count: 1
  • Min. document count: 1

  • Total websites: 16
  • Total articles/pages (XML files): 17,136

  • Total document count: 17,136
  • Total sentence count: 555,205
  • Total token count: 10,197,915

  • N-gram files
    • File Format - 3 fields
      (1) Document count  (2) Word count  (3) N-gram
  • Each n-gram file is sorted by document count, then word count, then alphabetical order of the n-grams.
  • The n-gram set is the concatenation of the n-gram files (N = 1 to 5). It is not sorted.
  • The lowercased core terms of the n-gram set are sorted and used for further processing in LMW candidate generation.
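
The three-field records and the sort order described above can be illustrated with a short Python sketch. It is not the LMW implementation. The "|" field separator is an assumption (borrowed from the ID|contents format of the raw corpus files), the descending order for the two counts is also an assumption because the sort direction is not stated here, and the names parse_record and sort_records are hypothetical.

```python
from typing import List, Tuple

# One n-gram record: (document count, word count, n-gram)
Record = Tuple[int, int, str]


def parse_record(line: str, sep: str = "|") -> Record:
    """Split one line into its 3 fields. The separator is an assumption."""
    doc_count, word_count, ngram = line.rstrip("\n").split(sep, 2)
    return int(doc_count), int(word_count), ngram


def sort_records(records: List[Record]) -> List[Record]:
    """Sort by document count, then word count, then n-gram.

    Assumption: counts are sorted in descending order and ties are
    broken by ascending alphabetical order of the n-gram.
    """
    return sorted(records, key=lambda r: (-r[0], -r[1], r[2]))


if __name__ == "__main__":
    lines = ["2|5|heart attack", "3|4|blood pressure", "2|5|blood sugar"]
    for record in sort_records([parse_record(line) for line in lines]):
        print(record)
```

Under these assumptions, "blood pressure" (3 documents) is listed first, and the two n-grams that tie on both counts fall back to alphabetical order.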

III. Process

  • program:
    ${LMW_DIR}/bin/21.CSpellHealthCorpus
    2017

    Options for generating the raw n-gram set
    (Inputs - ${IN_DIR}/${OUT_DIR}; Outputs - ${OUT_DIR})

    • Option 2: Convert XML files to Raw Corpus Text files
      • Inputs: CSpellHealthCorpus/Crawl/*/*.html
      • Outputs: 21.CSpellHealthCorpus/RawCorpus/*.data
      • Notes:
        • File: each file includes all pages from one website
        • Format: ID|contents
        • Enhanced sentence tokenizer on Unicode
        • Enhanced String trim on Unicode space
    • Option 4: Generate all raw n-gram files (N = 1-5)
      • Inputs: 21.CSpellHealthCorpus/RawCorpus/*.data
      • Outputs: 21.CSpellHealthCorpus/nGrams/nGram.${N}.data
      • Notes:
        • Data is relatively small, no need to use split-group-filter model
        • Process time: ~3 min.
    • Option 6: Sort all raw n-gram files
      • Inputs: 21.CSpellHealthCorpus/nGrams/nGram.${N}.data
      • Outputs:
        • 21.CSpellHealthCorpus/nGrams/nGram.${N}.data.dwt
        • 21.CSpellHealthCorpus/nGrams/nGram.${N}.data.tdw
    • Option 7: Generate the raw n-gram set
      • Inputs: 21.CSpellHealthCorpus/nGrams/nGram.${N}.data.dwt
      • Outputs: 21.CSpellHealthCorpus/nGrams/nGramSet.${YEAR}.1
      • Notes: min wC = 1
    • Option 8: Zip n-gram set
      • Inputs:
        • 21.CSpellHealthCorpus/nGrams/nGram.${N}.data.dwt
        • 21.CSpellHealthCorpus/nGrams/nGramSet.${YEAR}.1
      • Outputs:
        • 21.CSpellHealthCorpus/nGrams/nGram.${N}.${YEAR}.tgz
        • 21.CSpellHealthCorpus/nGrams/nGramSet.${YEAR}.1.tgz
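
The n-gram generation step can be sketched in Python as follows. This is a simplified illustration, not the 21.CSpellHealthCorpus program. It assumes one document per line in the ID|contents format, uses plain whitespace tokenization in place of the enhanced Unicode sentence tokenizer (so n-grams here may cross sentence boundaries), and applies the 50-character limit from the specifications. The names MAX_CHARS, ngrams and count_ngrams are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

MAX_CHARS = 50  # "Max. Character size" from the n-gram set specifications


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return every run of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(docs: Iterable[str], n: int) -> Dict[str, Tuple[int, int]]:
    """Map each n-gram to (document count, word count).

    Assumption: each input line is one document in "ID|contents" form.
    """
    word_count: Counter = Counter()  # total occurrences over all documents
    doc_count: Counter = Counter()   # number of documents containing the n-gram
    for line in docs:
        _doc_id, _sep, contents = line.partition("|")
        tokens = contents.split()  # whitespace tokenization (assumption)
        grams = [g for g in ngrams(tokens, n) if len(g) <= MAX_CHARS]
        word_count.update(grams)
        doc_count.update(set(grams))  # count each document at most once
    return {g: (doc_count[g], word_count[g]) for g in word_count}


if __name__ == "__main__":
    docs = ["d1|take aspirin daily", "d2|take aspirin now take aspirin"]
    for gram, counts in sorted(count_ngrams(docs, 2).items()):
        print(counts, gram)
```

With a minimum word count and a minimum document count of 1, every n-gram that occurs at least once is kept, so the sketch applies no frequency filter.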

IV. Results

N-grams      File                  Zip Size   Actual Size   No. of n-grams
Unigrams     nGram.1.2017.tgz      0.985 MB   2.8 MB        194,407
Bigrams      nGram.2.2017.tgz      6.6 MB     23 MB         1,233,365
Trigrams     nGram.3.2017.tgz      18 MB      65 MB         2,806,783
Four-grams   nGram.4.2017.tgz      29 MB      111 MB        3,906,380
Five-grams   nGram.5.2017.tgz      39 MB      149 MB        4,396,030
N-gram Set   nGramSet.2017.1.tgz   92 MB      350 MB        12,536,965
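
Because the n-gram set is the concatenation of the five n-gram files, its size should equal the sum of the per-file counts. The short Python check below confirms this with the figures reported above. The variable names are illustrative only.

```python
# Number of n-grams per file, copied from the results table
counts = {
    "nGram.1.2017.tgz": 194_407,
    "nGram.2.2017.tgz": 1_233_365,
    "nGram.3.2017.tgz": 2_806_783,
    "nGram.4.2017.tgz": 3_906_380,
    "nGram.5.2017.tgz": 4_396_030,
}
reported_set_size = 12_536_965  # nGramSet.2017.1.tgz

total = sum(counts.values())
print(total == reported_set_size)  # prints True
```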