CSpell

Consumer Health Corpus

I. Introduction

A corpus relevant to consumer health data should improve the performance of CSpell. Accordingly, we established a consumer health corpus by collecting health-related articles from 16 web sites that were used for answering consumer health questions:

Consumer Health Corpus (from Ashutosh's Crawler, 10.09.17)

| Source | Abbreviation | Web Site Base URL | Article No. |
|---|---|---|---|
| Genetic and Rare Diseases - Diseases | gard | https://rarediseases.info.nih.gov | 6484 |
| Genetics Home Reference - Conditions | ghr | https://ghr.nlm.nih.gov/condition | 1215 |
| Genetics Home Reference - Genes | ghrgenes | https://ghr.nlm.nih.gov/gene | 1439 |
| MedlinePlus - Drugs | mplusdrugs | https://medlineplus.gov/druginfo/ | 1383 |
| MedlinePlus - Medical Encyclopedia | mplusencyclopedia | https://medlineplus.gov/ency/ | 4425 |
| MedlinePlus - All Health Topics | mplushealthtopics | https://www.nlm.nih.gov/medlineplus/all_healthtopics.html | 1013 |
| MedlinePlus - Herbs and Supplements | mplusherbssupplements | https://www.nlm.nih.gov/medlineplus/druginfo/herb_All.html | 153/177 |
| National Eye Institute | nei | https://nei.nih.gov/health | 36 |
| National Heart, Lung, and Blood Institute | nhlbi | http://www.nhlbi.nih.gov/health/health-topics/by-alpha | 141 |
| National Institute of Allergy and Infectious Diseases | niaid | https://www.niaid.nih.gov/diseases-conditions/all | 53 |
| National Institute of Arthritis and Musculoskeletal and Skin Diseases | niams | https://www.niams.nih.gov/health-topics/all-diseases | 55 |
| National Institute of Child Health and Human Development | nichd | https://www.nichd.nih.gov/health/topics/Pages/index.aspx | 81 |
| National Institute on Deafness and Other Communication Disorders | nidcd | https://www.nidcd.nih.gov/health/hearing-ear-infections-deafness | 13/15 |
| National Institute of Diabetes and Digestive and Kidney Diseases | niddk | https://www.niddk.nih.gov/health-information | 181/185 |
| National Institute of Mental Health | nimh | http://www.nimh.nih.gov/health/topics/index.shtml | 25/26 |
| National Institute of Neurological Disorders and Stroke | ninds | https://www.ninds.nih.gov/Disorders/All-Disorders | 439 |
| Centers for Disease Control and Prevention | cdc | https://www.cdc.gov/ | TBD |
| National Cancer Institute | cancerGov | https://www.cancer.gov/types | TBD |
| National Institute on Aging | nia | https://www.nia.nih.gov | TBD |
| National Institutes of Health - Office of Research on Women's Health | womenhealth | https://orwh.od.nih.gov/ | TBD |

II. Algorithm

  • A crawler was developed to collect articles related to consumer health. Its output is stored in XML format.
  • These articles are converted to text format.
  • An n-gram algorithm is applied to the text.
  • The raw unigrams are grouped by their lowercased core term to compute word counts.
  • The results (WC|unigram) are used as word frequency data for CSpell.
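
The last two steps above can be sketched in Python as follows. This is only an illustrative sketch, not the CSpell implementation: the function names are hypothetical, whitespace tokenization is a simplification, and a "core term" is assumed here to be a token with its leading and trailing punctuation stripped.

```python
import re
from collections import Counter
from typing import List

# Assumption: a core term is the token without leading/trailing punctuation.
# CSpell's actual definition may differ.
_EDGE_PUNCT = re.compile(r"^\W+|\W+$")


def core_term_lc(token: str) -> str:
    """Return the lowercased token with edge punctuation removed."""
    return _EDGE_PUNCT.sub("", token).lower()


def word_counts(text: str) -> Counter:
    """Count unigrams, grouped by their lowercased core term."""
    counts = Counter()
    for token in text.split():  # simplistic whitespace tokenization
        key = core_term_lc(token)
        if key:  # skip tokens that were pure punctuation
            counts[key] += 1
    return counts


def format_wc(counts: Counter) -> List[str]:
    """Render counts as WC|unigram lines, most frequent first."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{wc}|{word}" for word, wc in ordered]
```

For example, `format_wc(word_counts("Diabetes is common. (diabetes) can be managed."))` groups "Diabetes" and "(diabetes)" under the single entry `2|diabetes`.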

III. Consumer Health Corpus

  • Articles: 17,139
  • Sentences: 550,193
  • Tokens: 10,228,699
  • Unique Words: 192,818
  • Unique CoreTerm.Lc: 109,175
  • Dictionary Words in Corpus: 48,690 | 8.5886%
  • Dictionary Words WC: 9,979,195 | 97.6123%
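
Statistics of this kind could be derived from a token list and a dictionary along the following lines. This is a hedged sketch: `corpus_stats` and its keys are hypothetical names, tokens are only lowercased rather than reduced to core terms, and the dictionary word-count percentage is assumed to be relative to the total token count. The denominator of the "Dictionary Words in Corpus" percentage is not stated here, so the sketch does not compute it.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def corpus_stats(tokens: Iterable[str], dictionary: Set[str]) -> Dict[str, float]:
    """Summarize a token stream against a set of lowercased dictionary words."""
    raw = Counter(tokens)  # counts per surface form
    lc = Counter()         # counts per lowercased form
    for word, n in raw.items():
        lc[word.lower()] += n

    total = sum(raw.values())
    dic_words = [w for w in lc if w in dictionary]
    dic_wc = sum(lc[w] for w in dic_words)

    return {
        "tokens": total,
        "unique_words": len(raw),
        "unique_lc": len(lc),
        "dic_words": len(dic_words),
        "dic_wc": dic_wc,
        # Assumption: share of all tokens that are dictionary words.
        "dic_wc_pct": 100.0 * dic_wc / total if total else 0.0,
    }
```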

IV. Notes

  • Special codes such as [NUM], [EMAIL], and [URL] need to be consistent with the input data for CSpell. For example, the development set used [CONTACT] for telephone numbers and email addresses, which results in lower precision in context ranking. The pre-processing step that tags the corpus needs a cleanup.
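
A minimal sketch of such a cleanup is shown below, assuming the three tags named in the note are the expected set. The regular expressions are deliberately simplistic and the function names are hypothetical; none of this is taken from the CSpell pre-processor.

```python
import re
from typing import List, Set

# Assumption: the expected tag set is the one named in the note above.
ALLOWED_TAGS = {"[NUM]", "[EMAIL]", "[URL]"}

# Simplistic, illustrative patterns; a real pre-processor would need more care
# (for example, trailing punctuation after a URL is absorbed by \S+).
_URL = re.compile(r"https?://\S+")
_EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
_NUM = re.compile(r"\b\d+(?:\.\d+)?\b")
_TAG = re.compile(r"\[[A-Z]+\]")


def tag_text(text: str) -> str:
    """Replace URLs, email addresses, and numbers with their tags."""
    text = _URL.sub("[URL]", text)      # URLs first, since they may contain digits
    text = _EMAIL.sub("[EMAIL]", text)
    return _NUM.sub("[NUM]", text)


def unknown_tags(text: str, allowed: Set[str] = ALLOWED_TAGS) -> List[str]:
    """List bracketed tags in the text that are not in the allowed set."""
    return sorted(set(_TAG.findall(text)) - allowed)
```

Running `unknown_tags` over both the corpus and the input data would flag a mismatch such as `[CONTACT]` before the word frequencies are built.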