Context Score
Introduction
This page describes the ranking algorithm that uses context to choose the correct word from the candidates suggested for a misspelt word. There are two major approaches:
In CSpell, we chose the Continuous Bag of Words (CBOW) model in word2vec to rank candidates because CBOW is designed to predict a word from its surrounding context.
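To make the idea concrete, here is a minimal Python sketch of a CBOW-style context score. It is not CSpell's Java implementation. It assumes that context words are looked up in the input matrix, that the candidate is looked up in the output matrix, and that the score is the cosine between the averaged context vector and the candidate vector. The function names and the toy 2-dimensional vectors are hypothetical.

```python
import math

def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cbow_context_score(candidate, context, input_matrix, output_matrix):
    """Score one candidate against the words around the misspelling.

    Assumption (not from the source): words without a vector are skipped,
    and the score is 0.0 when nothing can be compared.
    """
    ctx = [input_matrix[w] for w in context if w in input_matrix]
    if not ctx or candidate not in output_matrix:
        return 0.0
    return cosine(mean_vector(ctx), output_matrix[candidate])

# Toy vectors, invented for illustration only.
input_matrix = {"blood": [1.0, 0.0], "high": [0.8, 0.2]}
output_matrix = {"pressure": [0.9, 0.1], "pleasure": [-0.2, 1.0]}

context = ["high", "blood"]
scores = {c: cbow_context_score(c, context, input_matrix, output_matrix)
          for c in ("pressure", "pleasure")}
best = max(scores, key=scores.get)  # candidate with the highest context score
```

With these toy vectors, the candidate whose output vector points in the same direction as the averaged context vector receives the higher score.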
Components
${PRE_PROCESS}/RunCorpus
shell> ${DEV}/DL/word2vec/word2vec/word2vec2 -train ${IN_FILE} -outsyn0 ${SYN_0_FILE} -outsyn1 ${SYN_1_FILE} -outsyn1neg ${SYN_1N_FILE} -size 200 -window 5 -cbow 1 -hs 1 -threads 12
Source Code:
Tests:
Test Case | Software | Data (Word Vec) | Score Methods | Performance (counts; precision/recall/F1) | Notes |
---|---|---|---|---|---|
Baseline | Baseline | | Cosine | 358/807/774; 0.4436/0.4625/0.4529 | Baseline |
2-1.c.cos.b | CSpell | Baseline | Cosine: [IM] | 484/771/774; 0.6278/0.6253/0.6265 | |
2-2.c.cos.0 | CSpell | Health Corpora | Cosine: [IM] | 443/770/774; 0.5753/0.5724/0.5738 | Baseline of new corpus |
2-3.c.cbow.0-1 | CSpell | Health Corpora | CBOW: [IM] & [OM], syn1; use only positive scores | 406/678/774; 0.5988/0.5245/0.5592 | Not used, use syn1neg instead |
2-4.c.cbow.0-1n.+0- | CSpell | Health Corpora | CBOW: [IM] & [OM], syn1neg; use only positive (+) scores | 429/524/774; 0.8187/0.5543/0.6610 | |
2-5.c.cbow.0-1n.+-0!= | CSpell | Health Corpora | CBOW: [IM] & [OM], syn1neg; rank by +, -, 0 | 505/748/774; 0.6751/0.6525/0.6636 | |
2-6.c.cbow.0-1n.+0-!= | CSpell | Health Corpora | CBOW: [IM] & [OM], syn1neg; use +, - (only if no +) scores | 445/554/774; 0.8032/0.5749/0.6702 | |
2-9.c.cbow.0-1n.+0-!=.cos | CSpell | Health Corpora | CBOW cos: [IM] & [OM], syn1neg; *use +, - (only if no +) scores | 446/554/774; 0.8051/0.5762/0.6717 | Best (10% improvement) |
2-10.c.cbow.0-1n.+0-!=.cos + fixed LC on W2V | CSpell | Health Corpora | CBOW cos: [IM] & [OM], syn1neg; *use +, - (only if no +) scores | 457/562/774; 0.8132/0.5904/0.6841 | Best (11% improvement) |
Final | CSpell | Health Corpora | CBOW cos: [IM] & [OM], syn1neg; *use +, - (only if no +) scores | 458/564/774; 0.8121/0.5917/0.6846 | Best (11% improvement) |
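The performance figures can be checked by hand. The sketch below assumes the three counts in each row are, in order, correct corrections, corrections returned, and corrections expected. That reading is an inference from the arithmetic, not something the page states. The function name is hypothetical.

```python
def precision_recall_f1(true_pos, returned, relevant):
    """Return (precision, recall, F1) from three raw counts."""
    precision = true_pos / returned if returned else 0.0
    recall = true_pos / relevant if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Counts taken from the "Final" row of the table.
p, r, f1 = precision_recall_f1(458, 564, 774)
summary = f"{p:.4f}|{r:.4f}|{f1:.4f}"  # "0.8121|0.5917|0.6846"
```

Running the same function on the other rows is a quick way to spot a mistyped figure.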
* Word2Vec Score Algorithm:
Word2VecScore.java
: Computes the score using cosine similarity
ContextScoreComparator.java
: Comparator used to sort context scores
RankByContext.java
:
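The selection rule named in the test table, "use +, - (only if no +) scores", can be sketched as follows. This is a Python illustration under stated assumptions, not the logic of the Java classes listed above. It assumes a score of exactly 0 carries no information and is never selected, and that ties are not treated specially. The function name is hypothetical.

```python
def pick_by_context(scored_candidates):
    """Pick one word from a list of (word, score) pairs.

    Assumed rule: take the highest positive score; fall back to the
    highest negative score only when no candidate scores above zero.
    Returns None when every score is zero or the list is empty.
    """
    positives = [(w, s) for w, s in scored_candidates if s > 0]
    negatives = [(w, s) for w, s in scored_candidates if s < 0]
    pool = positives or negatives  # negatives are used only if no positives
    if not pool:
        return None
    return max(pool, key=lambda pair: pair[1])[0]

# Hypothetical scores for illustration.
with_positive = pick_by_context([("dose", -0.2), ("does", 0.3), ("doze", 0.1)])
only_negative = pick_by_context([("dose", -0.5), ("does", -0.1), ("doze", 0.0)])
```

Returning None when no candidate has a usable score leaves the decision to whatever ranking step comes next, which matches the lower "returned" counts seen in the rows that restrict which scores are used.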