CSpell

Corpora in CSpell

I. Introduction

The corpus is used to:

II. Corpora Tested in CSpell

Three corpora were tested for comparison:

	Baseline	Consumer Health Corpus	Medline N-gram Set
Resources	7 web sites	20 (16) web sites	The Medline N-gram Set (2017 release)
Articles	8,590	17,139	26,759,399
Sentences	-	550,193	163,021,640
Tokens	5,771,363	10,228,699	3,386,661,350
Unique Word	51,190	192,818	976,872 (WC > 30)
Unique CoreTerm.Lc	-	109,175	496,388
Dic Words in Corpus	35,781\|6.3115%	48,690\|8.5886%	214,581\|37.8507%
Dic Words WC	5,637,436\|98.4708%	9,979,195\|97.6123%	3,224,585,163\|97.1439%

PS.

Total words in CSpell Suggesting Dictionary: 566,914
- lexicon.enEwLc.dic.addRm
- customerDic.data
- NRVAR.1.uSort.data
shell> ${PRE_PROCESS}/bin/RunCorpus
shell> ${PRE_PROCESS}/bin/RunPreProc
Set up the dictionary and corpus in CSpell for testing
4
71
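
The "Dic Words in Corpus" percentages in the table above are numerically consistent with dividing the count of dictionary words found in each corpus by the total size of the suggesting dictionary (566,914). The source does not state this definition, so the sketch below treats it as an assumption; the function and variable names are illustrative only.

```python
# Total words in the CSpell suggesting dictionary, as reported above.
DICTIONARY_SIZE = 566_914

# "Dic Words in Corpus" counts taken from the comparison table.
dic_words_in_corpus = {
    "Baseline": 35_781,
    "Consumer Health Corpus": 48_690,
    "Medline N-gram Set": 214_581,
}


def coverage_percent(found, total=DICTIONARY_SIZE):
    """Percentage of dictionary words seen in a corpus (assumed definition)."""
    return round(100.0 * found / total, 4)


for name, found in dic_words_in_corpus.items():
    print(f"{name}: {coverage_percent(found):.4f}%")
```

Under this reading, the Medline N-gram Set covers far more of the dictionary than either web-site corpus. The "Dic Words WC" percentages are not reproduced here because the source does not make their denominator clear.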

III. Corpora

Consumer Health Corpus
Ensemble Corpus
- MedlinePlus Medical Encyclopedia
- MedlinePlus Drug
- Genetics Home Reference
- Genetic and Rare Disease frequently asked questions
- NHLBI Health Topics
- NINDS Disorders
- NIH Senior Health
Corpus from MEDLINE (2017)
- Use unigrams from the MEDLINE N-gram Set (2017 release)
Daniel Davis's Crawler (TBD)

IV. Development Tests

Compare the three corpora above in the word-frequency test on CSpell:

Setup: Non-word, Revised GoldStd, unigram model for word frequency, new rank sorting algorithm
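
The setup names a unigram model for word frequency. As a rough illustration of the idea only, the sketch below counts tokens in a corpus and ranks candidate corrections by relative frequency. It is not CSpell's implementation or its rank sorting algorithm; the function names, the toy corpus, and the alphabetical tie-break are all assumptions made for the example.

```python
from collections import Counter


def build_unigram_counts(tokens):
    """Count lowercased tokens; a stand-in for a corpus word-frequency table."""
    return Counter(t.lower() for t in tokens)


def rank_by_frequency(candidates, counts):
    """Sort candidates by relative unigram frequency, highest first.

    Ties are broken alphabetically (an arbitrary choice for this sketch).
    """
    total = sum(counts.values())
    scored = [
        (c, counts[c.lower()] / total if total else 0.0) for c in candidates
    ]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical toy corpus and candidate list.
toy_tokens = (
    "the patient has diabetes and the patient takes insulin for diabetes"
).split()
toy_counts = build_unigram_counts(toy_tokens)
ranked = rank_by_frequency(["diabetic", "diabetes", "diabetis"], toy_counts)
print(ranked[0][0])
```

A candidate that never occurs in the corpus scores zero, which is one way to see why corpus size and coverage matter for a frequency-only ranker.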

Model	Ensemble Corpus	Consumer Health Corpus	Medline.2017
Frequency Only
Frequency - Halil	438\|770\|774 0.5688\|0.5659\|0.5674	438\|770\|774 0.5688\|0.5659\|0.5674	404\|770\|774 0.5247\|0.5220\|0.5233
Frequency - cSpell-Dev-1	536\|769\|774 0.6970\|0.6925\|0.6948	534\|770\|774 0.6935\|0.6899\|0.6917	521\|770\|774 0.6766\|0.6731\|0.6749
Frequency - cSpell-Dev-2	536\|769\|774 0.6970\|0.6925\|0.6948	534\|770\|774 0.6935\|0.6899\|0.6917	522\|770\|774 0.6779\|0.6744\|0.6762
Combined method
Noisy Channel	552\|769\|774 0.7178\|0.7132\|0.7155	551\|770\|774 0.7156\|0.7119\|0.7137	523\|770\|774 0.6792\|0.6757\|0.6775
CSpell Combined Orthographic and Frequency	598\|769\|774 0.7776\|0.7726\|0.7751	598\|769\|774 0.7776\|0.7726\|0.7751	597\|769\|774 0.7763\|0.7713\|0.7738
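
Each cell in the table above pairs three counts with three scores. The numbers are consistent with reading the counts as correct corrections, corrections returned, and corrections in the gold standard, and the scores as precision, recall and F1. The source does not label them, so this reading is an inference from the arithmetic. The sketch below reproduces two cells under that assumption; the function and parameter names are illustrative.

```python
def scores(true_positives, returned, relevant):
    """Return (precision, recall, F1), each rounded to 4 decimal places."""
    precision = true_positives / returned
    recall = true_positives / relevant
    f1 = 2 * precision * recall / (precision + recall)
    return tuple(round(v, 4) for v in (precision, recall, f1))


# "Frequency - Halil" on the Ensemble Corpus: 438|770|774
print(scores(438, 770, 774))

# "CSpell Combined Orthographic and Frequency" on the Ensemble Corpus: 598|769|774
print(scores(598, 769, 774))
```

With the gold-standard count fixed at 774 across all rows, differences between models come almost entirely from the number of correct corrections.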

V. Notes

The corpus is used for word2Vec. It needs to be big enough (for recall) to cover the words and their frequencies. The current corpus needs to be enhanced for better context ranking.
