Corpora in CSpell
I. Introduction
The corpus is used to:
II. Corpora Tested in CSpell
Three corpora were tested for comparison:
| | Baseline | Consumer Health Corpus | Medline N-gram Set |
|---|---|---|---|
| Resources | 7 web sites | 20 (16) web sites | The Medline N-gram Set (2017 release) |
| Statistics | | | |
PS.
shell> ${PRE_PROCESS}/bin/RunCorpus
shell> ${PRE_PROCESS}/bin/RunPreProc
4
71
III. Corpora
IV. Development Tests
| Model | Ensemble Corpus | Consumer Health Corpus | Medline.2017 |
|---|---|---|---|
| Frequency Only | | | |
| Frequency - Halil | 438\|770\|774 0.5688\|0.5659\|0.5674 | 438\|770\|774 0.5688\|0.5659\|0.5674 | 404\|770\|774 0.5247\|0.5220\|0.5233 |
| Frequency - cSpell-Dev-1 | 536\|769\|774 0.6970\|0.6925\|0.6948 | 534\|770\|774 0.6935\|0.6899\|0.6917 | 521\|770\|774 0.6766\|0.6731\|0.6749 |
| Frequency - cSpell-Dev-2 | 536\|769\|774 0.6970\|0.6925\|0.6948 | 534\|770\|774 0.6935\|0.6899\|0.6917 | 522\|770\|774 0.6779\|0.6744\|0.6762 |
| Combined method | | | |
| Noisy Channel | 552\|769\|774 0.7178\|0.7132\|0.7155 | 551\|770\|774 0.7156\|0.7119\|0.7137 | 523\|770\|774 0.6792\|0.6757\|0.6775 |
| CSpell Combined Orthographic and Frequency | 598\|769\|774 0.7776\|0.7726\|0.7751 | 598\|769\|774 0.7776\|0.7726\|0.7751 | 597\|769\|774 0.7763\|0.7713\|0.7738 |
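Each cell in the table above holds two triples. The page does not label them, but the arithmetic is consistent with reading the first triple as true positives | retrieved | relevant and the second as precision | recall | F1. Under that assumption, the following minimal Python sketch reproduces the second triple from the first. The function name `scores` is hypothetical and is not part of CSpell.

```python
def scores(true_positives, retrieved, relevant):
    """Return (precision, recall, F1) from raw counts.

    Assumed reading of a table cell "TP|retrieved|relevant":
      precision = TP / retrieved
      recall    = TP / relevant
      F1        = harmonic mean of precision and recall
    """
    precision = true_positives / retrieved if retrieved else 0.0
    recall = true_positives / relevant if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the "CSpell Combined Orthographic and Frequency" row,
# Ensemble Corpus column: 598|769|774
p, r, f1 = scores(598, 769, 774)
print(f"{p:.4f}|{r:.4f}|{f1:.4f}")  # prints 0.7776|0.7726|0.7751
```

The same check holds for the other rows, for example 438|770|774 gives 0.5688|0.5659|0.5674.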
V. Notes