The SPECIALIST Lexicon

Consumer Health Corpus - The Distilled N-gram Set

I. Introduction

The raw n-gram set includes many invalid LMWs. These invalid LMWs are filtered out through a series of exclusive filters. The filters are designed to increase precision while keeping the same (or similar) recall in terms of LMWs. In other words, they remove only invalid LMWs without removing valid LMWs. The filtering process removes about 2/3 of the n-grams and results in the distilled n-gram set. This distilled n-gram set is then used by further processes (matchers) to generate LMW candidates.
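The idea of chaining exclusive filters can be illustrated with a minimal Python sketch. The two filters below (`has_no_letters`, `has_repeated_token`) and the `distill` function are hypothetical and are not taken from the actual program, whose filters are not described on this page. The sketch only assumes that an exclusive filter removes an n-gram when it matches and otherwise passes it to the next filter.

```python
from typing import Callable, Iterable, List, Tuple

# An exclusive filter returns True when the n-gram should be removed.
ExclusiveFilter = Callable[[str], bool]


def has_no_letters(term: str) -> bool:
    """Hypothetical filter: remove terms without any alphabetic character."""
    return not any(ch.isalpha() for ch in term)


def has_repeated_token(term: str) -> bool:
    """Hypothetical filter: remove terms with adjacent repeated words."""
    tokens = term.lower().split()
    return any(a == b for a, b in zip(tokens, tokens[1:]))


def distill(
    terms: Iterable[str], filters: List[ExclusiveFilter]
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Run each term through the filters in order.

    Returns the kept terms and a list of (term, filter name) pairs
    recording which filter removed each rejected term.
    """
    kept: List[str] = []
    removed: List[Tuple[str, str]] = []
    for term in terms:
        # The first filter that matches removes the term.
        hit = next((f.__name__ for f in filters if f(term)), None)
        if hit is None:
            kept.append(term)
        else:
            removed.append((term, hit))
    return kept, removed


if __name__ == "__main__":
    sample = ["heart attack", "12 34", "the the", "blood pressure"]
    kept, removed = distill(sample, [has_no_letters, has_repeated_token])
    print(kept)
    print(removed)
```

Recording which filter removed each term makes it possible to review rejected n-grams and confirm that no valid LMW was lost, which is the recall requirement stated above.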

II. Process

  • program:
    ${LMW_DIR}/bin/21.CSpellHealthCorpus
    2017

  • options:
    Generate the raw n-gram set

    Option 10
    Description: Sort nGrams by DC|WC|Term
    Inputs - ${IN_DIR}/${OUT_DIR}:
    • 21.CSpellHealthCorpus/nGrams/nGramSet.${YEAR}.${MIN_WC}
    Outputs - ${OUT_DIR}:
    • 21.CSpellHealthCorpus/nGrams/nGramSet.${YEAR}.${MIN_WC}.dwt
    Notes:
    • Format: DC|WC|n-gram
    • Sort by DC, WC, alphabetic order of n-gram

    Option 11
    Description: Generate the distilled n-gram set
    Inputs - ${IN_DIR}/${OUT_DIR}:
    • 21.CSpellHealthCorpus/nGrams/nGramSet.${YEAR}.${MIN_WC}.dwt
    Outputs - ${OUT_DIR}: 21.CSpellHealthCorpus/nGrams/
    • distilledNGramSet.data
    • distilledNGramSet.rpt
    • distilledNGramSet.trap
    • distilledNGramSet.exp
    Notes:
    • Go through exclusive filters
    • process time: ~3 hr.

    Option 12
    Description: TBD ... Sort all raw n-gram files
    Inputs - ${IN_DIR}/${OUT_DIR}:
    • 21.CSpellHealthCorpus/nGrams/nGram.${N}.data
    Outputs - ${OUT_DIR}:
    • 21.CSpellHealthCorpus/nGrams/nGram.${N}.data.dwt
    • 21.CSpellHealthCorpus/nGrams/nGram.${N}.data.tdw
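
The DC|WC|n-gram sort in option 10 can be sketched in Python as follows. This is an illustration only, not the program's own code. It assumes that DC and WC are integer counts and that both are sorted in descending order before the n-gram is compared alphabetically; the page states the sort keys but not their direction. The function names `parse_line` and `sort_dwt` are hypothetical.

```python
from typing import Iterable, List, Tuple


def parse_line(line: str) -> Tuple[int, int, str]:
    """Split one 'DC|WC|n-gram' line into (DC, WC, n-gram)."""
    # maxsplit=2 keeps any '|' inside the n-gram itself intact.
    dc, wc, term = line.rstrip("\n").split("|", 2)
    return int(dc), int(wc), term


def sort_dwt(lines: Iterable[str]) -> List[str]:
    """Sort by DC, then WC, then n-gram.

    Assumption: counts are sorted high to low, n-grams A to Z.
    """
    records = [parse_line(line) for line in lines if line.strip()]
    records.sort(key=lambda r: (-r[0], -r[1], r[2]))
    return [f"{dc}|{wc}|{term}" for dc, wc, term in records]


if __name__ == "__main__":
    sample = [
        "2|5|heart attack",
        "3|4|blood pressure",
        "3|4|blood clot",
        "3|9|high blood",
    ]
    for row in sort_dwt(sample):
        print(row)
```

If the real files are sorted in ascending order instead, only the sign of the first two key components changes.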

III. Filter Details
TBD...

IV. Results

N-grams      File                 Zip Size    Actual Size   No. of n-grams
Unigrams     nGram.1.2017.tgz     0.985 Mb    2.8 Mb        194,407
Bigrams      nGram.2.2017.tgz     6.6 Mb      23 Mb         1,233,365
Trigrams     nGram.3.2017.tgz     18 Mb       65 Mb         2,806,783
Four-grams   nGram.4.2017.tgz     29 Mb       111 Mb        3,906,380
Five-grams   nGram.5.2017.tgz     39 Mb       149 Mb        4,396,030
N-gram Set   nGramSet.2017.1.tgz  92 Mb       350 Mb        12,536,965
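
As a quick consistency check on the figures above, the N-gram Set count equals the sum of the unigram through five-gram counts. The short Python snippet below reproduces that arithmetic; the dictionary name `counts` is only illustrative.

```python
# Number of n-grams per file, copied from the results above.
counts = {
    "unigrams": 194_407,
    "bigrams": 1_233_365,
    "trigrams": 2_806_783,
    "four-grams": 3_906_380,
    "five-grams": 4_396_030,
}

total = sum(counts.values())
print(f"{total:,}")  # prints 12,536,965
```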