Corpora in CSpell
I. Introduction
The corpus is used to:
II. Corpora Tested in CSpell
Three corpora were tested for comparison:
 | Baseline | Consumer Health Corpus | Medline N-gram Set |
---|---|---|---|
Resources | 7 web sites | 20 (16) web sites | The Medline N-gram Set (2017 release) |
Statistics | | | |
PS.
shell> ${PRE_PROCESS}/bin/RunCorpus
shell> ${PRE_PROCESS}/bin/RunPreProc
4
71
III. Corpora
IV. Development Tests
Model | Ensemble Corpus | Consumer Health Corpus | Medline.2017 |
---|---|---|---|
Frequency Only | | | |
Frequency - Halil | 438/770/774 0.5688/0.5659/0.5674 | 438/770/774 0.5688/0.5659/0.5674 | 404/770/774 0.5247/0.5220/0.5233 |
Frequency - cSpell-Dev-1 | 536/769/774 0.6970/0.6925/0.6948 | 534/770/774 0.6935/0.6899/0.6917 | 521/770/774 0.6766/0.6731/0.6749 |
Frequency - cSpell-Dev-2 | 536/769/774 0.6970/0.6925/0.6948 | 534/770/774 0.6935/0.6899/0.6917 | 522/770/774 0.6779/0.6744/0.6762 |
Combined method | | | |
Noisy Channel | 552/769/774 0.7178/0.7132/0.7155 | 551/770/774 0.7156/0.7119/0.7137 | 523/770/774 0.6792/0.6757/0.6775 |
CSpell Combined Orthographic and Frequency | 598/769/774 0.7776/0.7726/0.7751 | 598/769/774 0.7776/0.7726/0.7751 | 597/769/774 0.7763/0.7713/0.7738 |
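
Each cell in the table above pairs a triple of counts with a triple of scores, and the scores appear to follow from the counts by the usual precision, recall, and F1 arithmetic. The sketch below assumes (the table does not say so) that the three counts are, in order, the number of correct corrections, the number of corrections the system returned, and the number of corrections in the gold standard. The function name `prf` and its parameter names are illustrative only and are not part of CSpell.

```python
def prf(correct, returned, gold):
    """Return (precision, recall, F1), each rounded to 4 decimal places.

    Assumed meaning of the arguments (not confirmed by the source):
      correct  - corrections that match the gold standard
      returned - corrections the system produced
      gold     - corrections listed in the gold standard
    """
    # Guard against empty denominators so the function never divides by zero.
    precision = correct / returned if returned else 0.0
    recall = correct / gold if gold else 0.0
    # F1 is the harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return tuple(round(x, 4) for x in (precision, recall, f1))


# Counts taken from the Ensemble Corpus column of the table.
print(prf(438, 770, 774))  # Frequency - Halil
print(prf(598, 769, 774))  # CSpell Combined Orthographic and Frequency
```

Under this reading, the two calls print `(0.5688, 0.5659, 0.5674)` and `(0.7776, 0.7726, 0.7751)`, which match the scores reported for those rows. The same check can be run on any other cell to compare models or corpora.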
V. Notes