The SPECIALIST Lexicon

Consumer Health Corpus - The Distilled N-gram Set

I. Introduction

The raw n-gram set includes many invalid LMWs. These invalid LMWs are filtered out through a series of exclusive filters. The filters are designed to increase precision while keeping the same (or similar) recall in terms of LMWs. In other words, the filters remove only invalid LMWs without removing valid ones. The filtering process removes about 2/3 of the n-grams and results in the distilled n-gram set. The distilled n-gram set is then used by further processes (matchers) to generate LMW candidates.
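The idea of a chain of exclusive filters can be sketched in Python as follows. This is only an illustration of the pattern: the two filter predicates and the function name `distill` are hypothetical and are not the filters used by the Lexicon tools, whose details are not given here.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical exclusive filters: each returns True when the n-gram
# should be excluded. These are placeholders, NOT the Lexicon's filters.
def is_all_digits(ngram: str) -> bool:
    return ngram.replace(" ", "").isdigit()


def starts_with_punctuation(ngram: str) -> bool:
    return bool(ngram) and not ngram[0].isalnum()


EXCLUSIVE_FILTERS: List[Callable[[str], bool]] = [
    is_all_digits,
    starts_with_punctuation,
]


def distill(
    ngrams: Iterable[str],
    filters: List[Callable[[str], bool]] = EXCLUSIVE_FILTERS,
) -> Tuple[List[str], List[str]]:
    """Split n-grams into (kept, removed).

    An n-gram is removed as soon as any one filter matches it;
    everything else passes through unchanged.
    """
    kept: List[str] = []
    removed: List[str] = []
    for ngram in ngrams:
        if any(f(ngram) for f in filters):
            removed.append(ngram)
        else:
            kept.append(ngram)
    return kept, removed


kept, removed = distill(["heart attack", "2017", "- the"])
```

Because each filter only removes n-grams, adding a filter can never reduce precision by admitting new terms; the design risk lies entirely in a filter that matches a valid LMW, which is why each predicate should be narrow.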

II. Process

  • program:
    ${LMW_DIR}/bin/21.CSpellHealthCorpus
    2017

    Options (inputs are under ${IN_DIR}/${OUT_DIR}; outputs are under ${OUT_DIR}):

    Generate the raw n-gram set

    10  Sort nGrams by DC|WC|Term
        Inputs:  21.CSpellHealthCorpus/nGrams/nGramSet.${YEAR}.${MIN_WC}
        Outputs: 21.CSpellHealthCorpus/nGrams/nGramSet.${YEAR}.${MIN_WC}.dwt
                 • Format: DC|WC|n-gram
        Notes:   Sort by DC, WC, alphabetic order of n-gram

    11  Generate the distilled n-gram set
        Inputs:  21.CSpellHealthCorpus/nGrams/nGramSet.${YEAR}.${MIN_WC}.dwt
        Outputs: 21.CSpellHealthCorpus/nGrams/
                 • distilledNGramSet.data
                 • distilledNGramSet.rpt
                 • distilledNGramSet.trap
                 • distilledNGramSet.exp
        Notes:   Go through exclusive filters
  • process time: ~3 hr.
  • Option 12 (TBD ...): Sort all raw n-gram files
    Inputs:  21.CSpellHealthCorpus/nGrams/nGram.${N}.data
    Outputs:
    • 21.CSpellHealthCorpus/nGrams/nGram.${N}.data.dwt
    • 21.CSpellHealthCorpus/nGrams/nGram.${N}.data.tdw
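The sort in option 10 can be sketched in Python as follows. This is a minimal sketch, not the program's actual implementation. It assumes that DC and WC are integer counts (presumably document count and word count) and that both are sorted in descending order before the alphabetic tie-break; the sort direction is not stated above, and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def parse_line(line: str) -> Tuple[int, int, str]:
    """Parse one 'DC|WC|n-gram' record into (dc, wc, ngram)."""
    # Split on the first two pipes only, so the n-gram field is kept whole.
    dc, wc, ngram = line.rstrip("\n").split("|", 2)
    return int(dc), int(wc), ngram


def sort_dwt(lines: Iterable[str]) -> List[str]:
    """Sort records by DC, then WC, then n-gram.

    Assumption: higher counts come first; ties fall back to
    alphabetic order of the n-gram.
    """
    records = [parse_line(line) for line in lines if line.strip()]
    records.sort(key=lambda r: (-r[0], -r[1], r[2]))
    return [f"{dc}|{wc}|{ngram}" for dc, wc, ngram in records]


sample = [
    "2|5|heart attack",
    "3|4|blood pressure",
    "2|5|chest pain",
    "2|9|side effects",
]
sorted_sample = sort_dwt(sample)
```

For files with millions of records, an external sort on the `|`-delimited fields may be more practical than loading everything into memory; the key order stays the same.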

III. Filter Details
TBD...

IV. Results

N-grams     File                 Zip Size   Actual Size   No. of n-grams
Unigrams    nGram.1.2017.tgz     0.985 MB   2.8 MB        194,407
Bigrams     nGram.2.2017.tgz     6.6 MB     23 MB         1,233,365
Trigrams    nGram.3.2017.tgz     18 MB      65 MB         2,806,783
Four-grams  nGram.4.2017.tgz     29 MB      111 MB        3,906,380
Five-grams  nGram.5.2017.tgz     39 MB      149 MB        4,396,030
N-gram Set  nGramSet.2017.1.tgz  92 MB      350 MB        12,536,965
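
As a quick consistency check on the table, the per-size counts should add up to the count reported for the combined n-gram set. The short Python sketch below performs that arithmetic; the variable names are illustrative only.

```python
# Number of n-grams per size, copied from the results table.
counts_by_n = {
    1: 194_407,
    2: 1_233_365,
    3: 2_806_783,
    4: 3_906_380,
    5: 4_396_030,
}

# Count reported for the combined n-gram set (nGramSet.2017.1.tgz).
reported_total = 12_536_965

computed_total = sum(counts_by_n.values())
totals_match = computed_total == reported_total
```

The totals agree, so the combined set appears to be the union of the five per-size files with no entries dropped or duplicated.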