CSpell

Performance Tests on Corpora

I. Test Setup

  • Data: Training Set
  • Gold Standard: non-word only
  • Dictionary: CSpell (Lexicon-based)
  • Corpora:
    • Tested 2 different corpora for word frequency score and noisy channel score
    • Used the consumer health corpus to train word2vec
  • Ranking: CSpell
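
To illustrate why the choice of corpus matters for a word frequency score, here is a minimal sketch. It is not CSpell's actual scoring formula, which this page does not describe. The function names, the toy corpus, and the candidate list are all hypothetical, and the noisy channel score and word2vec are not modeled.

```python
from collections import Counter


def word_frequency_scores(corpus_tokens):
    """Relative frequency of each word in a corpus (illustrative only)."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def rank_candidates(candidates, scores):
    """Order correction candidates by corpus frequency, highest first.

    Words missing from the corpus get a score of 0.0.
    """
    return sorted(candidates, key=lambda w: scores.get(w, 0.0), reverse=True)


# Hypothetical toy corpus; a real corpus would be far larger.
toy_corpus = (
    "the patient has diabetes and the doctor checks diabetes symptoms daily"
).split()

scores = word_frequency_scores(toy_corpus)
print(rank_candidates(["diabetic", "diabetes", "daily"], scores))
# prints ['diabetes', 'daily', 'diabetic']
```

Because the scores come entirely from corpus counts, a corpus whose vocabulary matches the text being corrected will tend to rank the intended word higher.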

II. Test Results

Corpus                    Size      Precision   Recall   F1
MEDLINE                   496,388   0.8085      0.7907   0.7995
Consumer Health Corpus    109,818   0.8407      0.7842   0.8115
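
As a quick consistency check, the F1 values can be recomputed from the reported precision and recall, assuming F1 here is the standard harmonic mean of the two. The function name `f1_score` is chosen for illustration only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the results in this section.
results = {
    "MEDLINE": (0.8085, 0.7907),
    "Consumer Health Corpus": (0.8407, 0.7842),
}

for corpus, (precision, recall) in results.items():
    print(f"{corpus}: F1 = {f1_score(precision, recall):.4f}")
# prints:
# MEDLINE: F1 = 0.7995
# Consumer Health Corpus: F1 = 0.8115
```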

III. Discussion

  • The F1 score dropped by 1.2 percentage points (from 0.8115 to 0.7995) when the corpus was changed from the consumer health corpus to the MEDLINE corpus.
  • The MEDLINE corpus is 4.52 times the size of the consumer health corpus.
  • In this test, the smaller, domain-relevant corpus outperformed the larger general collection, which is not necessarily related to consumer health data.
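
The figures in this discussion follow directly from the reported results. The short sketch below reproduces the arithmetic; the variable names are illustrative. It also prints the relative drop, to distinguish it from the absolute difference in F1.

```python
# F1 scores and corpus sizes copied from the test results.
f1_consumer = 0.8115
f1_medline = 0.7995
size_consumer = 109_818
size_medline = 496_388

absolute_drop = f1_consumer - f1_medline      # difference in F1
relative_drop = absolute_drop / f1_consumer   # drop as a share of the higher F1
size_ratio = size_medline / size_consumer     # MEDLINE size vs. consumer health size

print(f"Absolute F1 drop: {absolute_drop:.4f}")   # prints 0.0120
print(f"Relative F1 drop: {relative_drop:.2%}")   # prints 1.48%
print(f"Corpus size ratio: {size_ratio:.2f}")     # prints 4.52
```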