CSpell

Performance Tests on Corpora

I. Test Setup

Data: Training Set
Gold Standard: non-word only
Dictionary: CSpell (Lexicon-based)
Corpora:
- Two corpora were tested as the source for the word frequency score and the noisy channel score
- The consumer health corpus was used to train word2vec
Ranking: CSpell

II. Test Results

Corpus	Size	Precision	Recall	F1
MEDLINE	496,388	0.8085	0.7907	0.7995
Consumer Health Corpus	109,818	0.8407	0.7842	0.8115
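
As a quick consistency check, the F1 column can be recomputed from the reported precision and recall, assuming the standard definition of F1 as the harmonic mean of the two. This is a minimal sketch; the function and variable names are illustrative and are not part of CSpell.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the results table
results = {
    "MEDLINE": (0.8085, 0.7907),
    "Consumer Health Corpus": (0.8407, 0.7842),
}

# Round to 4 decimal places to match the precision used in the table
f1_by_corpus = {
    name: round(f1_score(p, r), 4) for name, (p, r) in results.items()
}

for name, f1 in f1_by_corpus.items():
    print(f"{name}: F1 = {f1:.4f}")
```

Rounded to four decimal places, the recomputed values agree with the F1 column (0.7995 and 0.8115).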

III. Discussion

The F1 score dropped by 1.2 percentage points (from 0.8115 to 0.7995) when the corpus was changed from the consumer health corpus to the MEDLINE corpus.
The MEDLINE corpus is 4.52 times the size of the consumer health corpus.
A smaller, relevant corpus outperforms a large general collection that is not necessarily related to consumer health data.
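
The two figures quoted in this discussion follow from the results table by simple arithmetic. The sketch below reproduces them; the variable names are illustrative, and the relative drop is an additional derived number shown only to distinguish it from the absolute difference in percentage points.

```python
# Values copied from the results table
medline_size, consumer_size = 496_388, 109_818
medline_f1, consumer_f1 = 0.7995, 0.8115

# How many times larger the MEDLINE corpus is
size_ratio = medline_size / consumer_size

# Absolute F1 difference, in percentage points
f1_drop_points = (consumer_f1 - medline_f1) * 100

# Relative F1 difference, as a percentage of the consumer health F1
f1_drop_relative = (consumer_f1 - medline_f1) / consumer_f1 * 100

print(f"Size ratio: {size_ratio:.2f}x")
print(f"F1 drop: {f1_drop_points:.1f} percentage points")
print(f"F1 drop (relative): {f1_drop_relative:.2f}%")
```

The size ratio comes out at 4.52 and the absolute F1 difference at 1.2 percentage points, matching the figures above; the relative drop is about 1.48%.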