CSpell

Context Score

Introduction

This page describes the ranking algorithm that uses context to choose the correct word from the suggested candidates for a misspelled word. There are two major approaches:

  • n-gram model:
    An n-gram model (bi-gram or tri-gram) seems like a simple and straightforward approach. However, we did not implement this model due to time constraints, because various research shows that word embeddings are the state-of-the-art approach compared to n-gram models. Word embeddings are also simple to use and deliver outstanding performance.
  • word-embedding:

In CSpell, we chose the Continuous Bag of Words (CBOW) model in word2vec to rank candidates because CBOW is designed to predict a word from a surrounding context.

Components

  • Dual embedding in the continuous bag of words model
    • Program:
      ${PRE_PROCESS}/RunCorpus (options 3, 4, and 6; 6 is Best)
      shell> ${DEV}/DL/word2vec/word2vec/word2vec2 -train ${IN_FILE} -outsyn0 ${SYN_0_FILE} -outsyn1 ${SYN_1_FILE} -outsyn1neg ${SYN_1N_FILE} -size 200 -window 5 -cbow 1 -hs 1 -threads 12
    • Input:
      • ./Crawl/word2Vec/CorpusW2V.data
    • Output:
      • ./Crawl/word2Vec/word2VecNew.syn0 (Input Matrix, word-vec)
      • ./Crawl/word2Vec/word2VecNew.syn1 (Output Matrix)
      • ./Crawl/word2Vec/word2VecNew.syn1n (Output Matrix, with negative sampling, better for prediction)
  • Calculate the word vector (word2vec) for the context; a sketch of this computation follows below
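
For illustration, here is a minimal sketch of how a dual-embedding context score can be computed from the trained matrices: the context vector is the average of the input-matrix (syn0) vectors of the surrounding words, and each candidate is scored against it with its output-matrix (syn1neg) vector. The class and method names are illustrative only and assume the vectors have already been loaded into maps keyed by word; they are not the actual CSpell classes.

  // Sketch of a dual-embedding context score, assuming the trained vectors
  // (syn0 = input matrix for context words, syn1neg = output matrix for
  // candidates) have been loaded into maps keyed by word. Class and method
  // names are illustrative, not the actual CSpell classes.
  import java.util.List;
  import java.util.Map;

  public class DualEmbeddingContextScore {

      // Average the input-matrix (syn0) vectors of the context words.
      public static double[] contextVector(List<String> contextWords,
                                           Map<String, double[]> syn0, int dim) {
          double[] avg = new double[dim];
          int found = 0;
          for (String w : contextWords) {
              double[] v = syn0.get(w);
              if (v == null) {
                  continue;  // skip context words without a vector
              }
              for (int i = 0; i < dim; i++) {
                  avg[i] += v[i];
              }
              found++;
          }
          if (found > 0) {
              for (int i = 0; i < dim; i++) {
                  avg[i] /= found;
              }
          }
          return avg;
      }

      // Score a candidate against the context using its output-matrix
      // (syn1neg) vector; returns 0.0 when the candidate has no vector.
      public static double score(String candidate, double[] contextVec,
                                 Map<String, double[]> syn1neg) {
          double[] v = syn1neg.get(candidate);
          if (v == null) {
              return 0.0;  // no information about this candidate
          }
          double dot = 0.0;
          for (int i = 0; i < contextVec.length; i++) {
              dot += contextVec[i] * v[i];
          }
          return dot;
      }
  }

In this sketch, a candidate with no syn1neg vector receives a score of 0.0, which matches the handling of zero context scores described in the algorithm section below.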

Source Code:

  • RankByContext.java: gets the ranked candidate list or the top-ranked candidate by context
  • ContextScore.java: Java object for a context score
  • Word2VecContext.java: Word2Vec context utility to get the context or the context vector
  • Word2VecScore.java: gets the score by cosine similarity or inner (dot) product (see the sketch below)
  • DoubleVecUtil.java: basic vector operations on double values
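
As a reference for the scoring utilities above, here is a minimal sketch of cosine similarity over double vectors; the class and method names are illustrative, not the actual Word2VecScore.java / DoubleVecUtil.java API.

  // Minimal cosine-similarity sketch; names are illustrative only.
  public final class CosineSimilaritySketch {

      // cosine(a, b) = (a . b) / (|a| * |b|); returns 0.0 if either norm is 0
      public static double cosine(double[] a, double[] b) {
          double dot = 0.0, normA = 0.0, normB = 0.0;
          for (int i = 0; i < a.length; i++) {
              dot += a[i] * b[i];
              normA += a[i] * a[i];
              normB += b[i] * b[i];
          }
          if (normA == 0.0 || normB == 0.0) {
              return 0.0;
          }
          return dot / (Math.sqrt(normA) * Math.sqrt(normB));
      }
  }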

Tests:

  • Use the Baseline non-word 1-to-1 and split (development set)
  • Results:

    Each entry lists: test case, software, data (word vectors), score method, performance (counts, then scores), and notes.

    • Baseline
      Software: Baseline; Score method: Cosine
      Performance: 358|807|774, 0.4436|0.4625|0.4529
      Notes: Baseline
    • 2-1.c.cos.b
      Software: CSpell; Data: Baseline; Score method: Cosine: [IM]
      Performance: 484|771|774, 0.6278|0.6253|0.6265
    • 2-2.c.cos.0
      Software: CSpell; Data: Health Corpora; Score method: Cosine: [IM]
      Performance: 443|770|774, 0.5753|0.5724|0.5738
      Notes: baseline of new corpus
    • 2-3.c.cbow.0-1
      Software: CSpell; Data: Health Corpora; Score method: CBOW: [IM] & [OM], syn1 (only use positive scores)
      Performance: 406|678|774, 0.5988|0.5245|0.5592
      Notes: not used, use syn1neg instead
    • 2-4.c.cbow.0-1n.+0-
      Software: CSpell; Data: Health Corpora; Score method: CBOW: [IM] & [OM], syn1neg (use only positive (+) scores)
      Performance: 429|524|774, 0.8187|0.5543|0.6610
    • 2-5.c.cbow.0-1n.+-0!=
      Software: CSpell; Data: Health Corpora; Score method: CBOW: [IM] & [OM], syn1neg (rank by +, -, 0)
      Performance: 505|748|774, 0.6751|0.6525|0.6636
    • 2-6.c.cbow.0-1n.+0-!=
      Software: CSpell; Data: Health Corpora; Score method: CBOW: [IM] & [OM], syn1neg (use +, - (only if no +) scores)
      Performance: 445|554|774, 0.8032|0.5749|0.6702
    • 2-9.c.cbow.0-1n.+0-!=.cos
      Software: CSpell; Data: Health Corpora; Score method: CBOW cos: [IM] & [OM], syn1neg (*use +, - (only if no +) scores)
      Performance: 446|554|774, 0.8051|0.5762|0.6717
      Notes: Best (10% improvement)
    • 2-10.c.cbow.0-1n.+0-!=.cos + fixed LC on W2V
      Software: CSpell; Data: Health Corpora; Score method: CBOW cos: [IM] & [OM], syn1neg (*use +, - (only if no +) scores)
      Performance: 457|562|774, 0.8231|0.5904|0.6841
      Notes: Best (11% improvement)
    • Final
      Software: CSpell; Data: Health Corpora; Score method: CBOW cos: [IM] & [OM], syn1neg (*use +, - (only if no +) scores)
      Performance: 458|564|774, 0.8121|0.5917|0.6846
      Notes: Best (11% improvement)
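
The three performance scores appear to be precision, recall, and F1 derived from the counts on the line above them (correctly fixed errors, total suggested corrections, total gold-standard errors); this mapping is inferred from the baseline row rather than stated on this page. A short sketch of the computation:

  // Sketch: derive precision/recall/F1 from the counts in the results.
  // Assumes the three counts are: correctly fixed errors, total suggested
  // corrections, and total gold-standard errors (inferred, not stated).
  public class PrfSketch {
      public static void main(String[] args) {
          int correct = 358;    // correctly fixed errors (baseline row)
          int suggested = 807;  // corrections proposed by the system
          int gold = 774;       // errors in the gold standard

          double precision = (double) correct / suggested;             // 0.4436
          double recall = (double) correct / gold;                     // 0.4625
          double f1 = 2.0 * precision * recall / (precision + recall); // 0.4529

          System.out.printf("P=%.4f, R=%.4f, F1=%.4f%n", precision, recall, f1);
      }
  }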

* Word2Vec Score Algorithm:

  • Word2VecScore.java: uses the cosine similarity score
  • ContextScoreComparator.java: sorts the context scores
  • RankByContext.java:
    • If the top-ranked score != 0: candidate = topRank
      • If the top-ranked score > 0.0 => use it to correct; a bigger positive score means the word is closer to the prediction
      • If the top-ranked score < 0.0 => use it to correct; a smaller (more negative) score means the word is farther from the prediction
    • If the top-ranked score = 0:
      • If there is only 1 candidate => use it to correct, even though we have no Word2Vec score information for the candidate
      • If there are multiple candidates, do not correct.
        A Word2Vec score of 0.0 means we have no information on the candidate, so we cannot tell whether 0.0 is better than a negative score.

    • Context scores may be positive, zero, or negative. A zero context score means the target word does not have a word vector, so such a candidate is not chosen over one with a negative score.
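
A minimal sketch of this top-rank decision logic, assuming the candidates have already been sorted best-first by a comparator such as ContextScoreComparator.java; the class and method names below are illustrative, not the actual RankByContext.java API.

  import java.util.List;

  // Illustrative sketch of the top-rank decision described above.
  // CandScore is a hypothetical (candidate, context score) pair.
  public class RankByContextSketch {

      public static class CandScore {
          public final String candidate;
          public final double score;
          public CandScore(String candidate, double score) {
              this.candidate = candidate;
              this.score = score;
          }
      }

      // candidates must already be sorted by context score, best first.
      // Returns the chosen correction, or null when no correction is made.
      public static String pickCorrection(List<CandScore> candidates) {
          if (candidates.isEmpty()) {
              return null;
          }
          CandScore top = candidates.get(0);
          if (top.score != 0.0) {
              // Positive or negative: trust the ranking and use the top candidate.
              return top.candidate;
          }
          if (candidates.size() == 1) {
              // Only one candidate: use it even without Word2Vec information.
              return top.candidate;
          }
          // Score of 0.0 with multiple candidates: no information, no correction.
          return null;
      }
  }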