CSpell

Performance Tests on Dual Embedding

I. Test Setup

  • Data: Training Set
  • Gold Standard: non-word only
  • Dictionary: CSpell
  • Corpus: consumer health corpus
  • Ranking: context score

II. Test Results

Embedding  Matrices       Precision  Recall  F1
Single     IM             0.5887     0.5917  0.5902
Dual       IM & OM (neg)  0.8035     0.5917  0.6815
Single     OM             0.6289     0.5801  0.6035
Dual       IM & OM        0.6339     0.5749  0.6030
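
As a quick consistency check, the F1 values can be recomputed from the reported precision and recall, assuming F1 is the usual harmonic mean of the two. The sketch below is illustrative only; the function name f1_score and the variable names are hypothetical and not part of CSpell.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# (embedding, matrices, precision, recall) copied from the results above.
results = [
    ("Single", "IM", 0.5887, 0.5917),
    ("Dual", "IM & OM (neg)", 0.8035, 0.5917),
    ("Single", "OM", 0.6289, 0.5801),
    ("Dual", "IM & OM", 0.6339, 0.5749),
]

# Recompute F1 for each row, rounded to four decimals like the report.
f1_by_row = {(emb, mat): round(f1_score(p, r), 4) for emb, mat, p, r in results}

# Gain of dual [IM & OM (neg)] over single [IM], in percentage points.
gain_points = round(
    (f1_by_row[("Dual", "IM & OM (neg)")] - f1_by_row[("Single", "IM")]) * 100, 2
)

for (emb, mat), f1 in f1_by_row.items():
    print(f"{emb:<7}{mat:<15}{f1:.4f}")
print(f"Gain: {gain_points} percentage points")
```

The recomputed values agree with the F1 column as reported.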

III. Discussion

  • Single embedding is the general practice with word2vec. It calculates the context score as the cosine similarity, using word vectors from [IM], between the context and each candidate. The context score therefore measures how similar the candidate is to the context, and the most similar word is not necessarily the best predicted word for that context. In my opinion, it is not a good model for a context score used for prediction.
  • Dual embedding uses both [IM] and [OM] to calculate the context score for a given context. This fits the original word2vec CBOW model, which is a prediction model.
  • In the word2vec C code, [IM] is syn0 and [OM] is syn1neg; syn1 is not used.
  • The F1 improvement from single embedding [IM] to dual embedding [IM & OM (neg)] for non-word correction using the context score is 9.13 percentage points (from 59.02% to 68.15%).
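
The difference between the two scoring schemes can be sketched in a few lines of Python. This is a minimal illustration, not the CSpell implementation: it assumes the context vector is the plain average of the [IM] vectors of the context words (as in CBOW), that the single-embedding score is the cosine against the candidate's [IM] vector, and that the dual-embedding score is the raw dot product against the candidate's [OM] vector. The function names and the two-dimensional toy vectors are invented for the example.

```python
import math

def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

def single_score(context, candidate, im):
    """Single embedding: similarity between context and candidate in [IM]."""
    hidden = mean_vector([im[word] for word in context])
    return cosine(hidden, im[candidate])

def dual_score(context, candidate, im, om):
    """Dual embedding: [IM] builds the context, [OM] scores the prediction."""
    hidden = mean_vector([im[word] for word in context])
    return dot(hidden, om[candidate])

# Toy matrices, invented for illustration only.
IM = {
    "high": [0.8, 0.2],
    "blood": [1.0, 0.0],
    "pressure": [0.0, 1.0],
    "pleasure": [0.9, 0.1],
}
OM = {
    "pressure": [1.0, 0.5],
    "pleasure": [-0.5, 0.2],
}

context = ["high", "blood"]
candidates = ["pressure", "pleasure"]

best_single = max(candidates, key=lambda c: single_score(context, c, IM))
best_dual = max(candidates, key=lambda c: dual_score(context, c, IM, OM))
print("single embedding picks:", best_single)
print("dual embedding picks:", best_dual)
```

With these toy vectors, the single-embedding score favors the candidate whose [IM] vector lies closest to the context, while the dual-embedding score favors the candidate that the [OM] weights predict from the context. Real scores depend on the trained matrices and on any normalization CSpell applies.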