Performance Tests on Corpora
I. Test Setup
- Data: Training Set
- Gold Standard: non-word only
- Dictionary: CSpell (Lexicon-based)
- Corpora:
  - Tested two corpora (MEDLINE and the consumer health corpus) as the source for the word frequency score and the noisy channel score
  - Used the consumer health corpus to train word2vec
- Ranking: CSpell
II. Test Results
Corpus | Size | Precision | Recall | F1
---|---|---|---|---
MEDLINE | 496,388 | 0.8085 | 0.7907 | 0.7995
Consumer Health Corpus | 109,818 | 0.8407 | 0.7842 | 0.8115
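As a sanity check on the table, the F1 column can be recomputed from the reported precision and recall. The sketch below assumes F1 is the standard harmonic mean of precision and recall; the function name `f1_score` and the `results` dictionary are illustrative and not part of CSpell.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the results table.
results = {
    "MEDLINE": (0.8085, 0.7907),
    "Consumer Health Corpus": (0.8407, 0.7842),
}

# Round to four decimals to match the precision used in the table.
f1_by_corpus = {
    name: round(f1_score(p, r), 4) for name, (p, r) in results.items()
}

for name, f1 in f1_by_corpus.items():
    print(f"{name}: F1 = {f1:.4f}")
```

Under that assumption, the recomputed values agree with the F1 column to four decimal places.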
III. Discussion
- The F1 score dropped by 1.2 percentage points (from 0.8115 to 0.7995) when the corpus was changed from the consumer health corpus to the MEDLINE corpus.
- The MEDLINE corpus is 4.52 times the size of the consumer health corpus (496,388 vs. 109,818).
- A smaller, relevant corpus outperforms a larger general collection that is not necessarily related to consumer health data.
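The figures quoted in the discussion follow directly from the results table. The short sketch below reproduces that arithmetic; the variable names are illustrative, and the relative drop is an additional figure computed here for comparison, not one reported in the original test.

```python
# Size and F1 values copied from the results table.
medline = {"size": 496_388, "f1": 0.7995}
consumer = {"size": 109_818, "f1": 0.8115}

# Absolute F1 difference, expressed in percentage points.
f1_drop_points = (consumer["f1"] - medline["f1"]) * 100

# The same drop relative to the consumer health F1 (an extra figure,
# shown only to distinguish it from the absolute difference).
f1_drop_relative = (consumer["f1"] - medline["f1"]) / consumer["f1"] * 100

# How many times larger the MEDLINE corpus is.
size_ratio = medline["size"] / consumer["size"]

print(f"F1 drop: {f1_drop_points:.2f} percentage points")
print(f"F1 drop (relative): {f1_drop_relative:.2f}%")
print(f"Size ratio: {size_ratio:.2f}x")
```

The absolute difference is the 1.2 figure cited above, and the size ratio is the 4.52 figure.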