CSpell

Context Score

Introduction

This page describes the ranking algorithm that uses context to choose the correct word from the suggested candidates for a misspelled word. There are two major approaches:

  • n-gram model:
    An n-gram model (bi-gram or tri-gram) seems like a simple and straightforward approach. However, we did not implement this model due to time constraints, because various research shows that word embeddings are the state-of-the-art approach compared to n-gram models. Word embeddings are also simple to use and deliver outstanding performance.
  • word-embedding:

In CSpell, we chose the Continuous Bag of Words (CBOW) model in word2vec to rank candidates because CBOW is designed to predict a word from a surrounding context.

Components

  • Dual embedding in the continuous bag of words model
    • Program:
      ${PRE_PROCESS}/RunCorpus (options 3, 4, and 6; 6 is Best)
      shell> ${DEV}/DL/word2vec/word2vec/word2vec2 -train ${IN_FILE} -outsyn0 ${SYN_0_FILE} -outsyn1 ${SYN_1_FILE} -outsyn1neg ${SYN_1N_FILE} -size 200 -window 5 -cbow 1 -hs 1 -threads 12
    • Input:
      • ./Crawl/word2Vec/CorpusW2V.data
    • Output:
      • ./Crawl/word2Vec/word2VecNew.syn0 (Input Matrix, word-vec)
      • ./Crawl/word2Vec/word2VecNew.syn1 (Output Matrix)
      • ./Crawl/word2Vec/word2VecNew.syn1n (Output Matrix, with negative sampling, better for prediction)
  • Calculate the word vector (word2vec) for the context; a sketch of this computation follows below
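
For illustration, here is a minimal sketch of how a dual-embedding context score can be computed from the trained matrices: the context vector is the average of the input-matrix (syn0) vectors of the surrounding words, and each candidate is scored against it with its output-matrix (syn1neg) vector. The class and method names are illustrative only and assume the vectors have already been loaded into maps keyed by word; they are not the actual CSpell classes.

  // Sketch of a dual-embedding context score, assuming the trained vectors
  // (syn0 = input matrix for context words, syn1neg = output matrix for
  // candidates) have been loaded into maps keyed by word. Class and method
  // names are illustrative, not the actual CSpell classes.
  import java.util.List;
  import java.util.Map;

  public class DualEmbeddingContextScore {

      // Average the input-matrix (syn0) vectors of the context words.
      public static double[] contextVector(List<String> contextWords,
                                           Map<String, double[]> syn0, int dim) {
          double[] avg = new double[dim];
          int found = 0;
          for (String w : contextWords) {
              double[] v = syn0.get(w);
              if (v == null) {
                  continue;  // skip context words without a vector
              }
              for (int i = 0; i < dim; i++) {
                  avg[i] += v[i];
              }
              found++;
          }
          if (found > 0) {
              for (int i = 0; i < dim; i++) {
                  avg[i] /= found;
              }
          }
          return avg;
      }

      // Score a candidate against the context using its output-matrix
      // (syn1neg) vector; returns 0.0 when the candidate has no vector.
      public static double score(String candidate, double[] contextVec,
                                 Map<String, double[]> syn1neg) {
          double[] v = syn1neg.get(candidate);
          if (v == null) {
              return 0.0;  // no information about this candidate
          }
          double dot = 0.0;
          for (int i = 0; i < contextVec.length; i++) {
              dot += contextVec[i] * v[i];
          }
          return dot;
      }
  }

In this sketch, a candidate with no syn1neg vector receives a score of 0.0, which matches the handling of zero context scores described in the algorithm section below.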

Source Code:

  • RankByContext.java: gets the ranked candidate list or the top-ranked candidate by context
  • ContextScore.java: Java object for a context score
  • Word2VecContext.java: Word2Vec context utility to get the context or the context vector
  • Word2VecScore.java: gets the score by cosine similarity or inner (dot) product (see the sketch below)
  • DoubleVecUtil.java: basic vector operations on double values
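
As a reference for the scoring utilities above, here is a minimal sketch of cosine similarity over double vectors; the class and method names are illustrative, not the actual Word2VecScore.java / DoubleVecUtil.java API.

  // Minimal cosine-similarity sketch; names are illustrative only.
  public final class CosineSimilaritySketch {

      // cosine(a, b) = (a . b) / (|a| * |b|); returns 0.0 if either norm is 0
      public static double cosine(double[] a, double[] b) {
          double dot = 0.0, normA = 0.0, normB = 0.0;
          for (int i = 0; i < a.length; i++) {
              dot += a[i] * b[i];
              normA += a[i] * a[i];
              normB += b[i] * b[i];
          }
          if (normA == 0.0 || normB == 0.0) {
              return 0.0;
          }
          return dot / (Math.sqrt(normA) * Math.sqrt(normB));
      }
  }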

Tests:

  • Use the Baseline non-word 1-to-1 and split (development set)
  • Results:

    Each entry lists: test case, software, data (word vectors), score method, performance (counts, then scores), and notes.

    • Baseline
      Software: Baseline; Score method: Cosine
      Performance: 358|807|774, 0.4436|0.4625|0.4529
      Notes: Baseline
    • 2-1.c.cos.b
      Software: CSpell; Data: Baseline; Score method: Cosine: [IM]
      Performance: 484|771|774, 0.6278|0.6253|0.6265
    • 2-2.c.cos.0
      Software: CSpell; Data: Health Corpora; Score method: Cosine: [IM]
      Performance: 443|770|774, 0.5753|0.5724|0.5738
      Notes: baseline of new corpus
    • 2-3.c.cbow.0-1
      Software: CSpell; Data: Health Corpora; Score method: CBOW: [IM] & [OM], syn1 (only use positive scores)
      Performance: 406|678|774, 0.5988|0.5245|0.5592
      Notes: not used, use syn1neg instead
    • 2-4.c.cbow.0-1n.+0-
      Software: CSpell; Data: Health Corpora; Score method: CBOW: [IM] & [OM], syn1neg (use only positive (+) scores)
      Performance: 429|524|774, 0.8187|0.5543|0.6610
    • 2-5.c.cbow.0-1n.+-0!=
      Software: CSpell; Data: Health Corpora; Score method: CBOW: [IM] & [OM], syn1neg (rank by +, -, 0)
      Performance: 505|748|774, 0.6751|0.6525|0.6636
    • 2-6.c.cbow.0-1n.+0-!=
      Software: CSpell; Data: Health Corpora; Score method: CBOW: [IM] & [OM], syn1neg (use +, - (only if no +) scores)
      Performance: 445|554|774, 0.8032|0.5749|0.6702
    • 2-9.c.cbow.0-1n.+0-!=.cos
      Software: CSpell; Data: Health Corpora; Score method: CBOW cos: [IM] & [OM], syn1neg (*use +, - (only if no +) scores)
      Performance: 446|554|774, 0.8051|0.5762|0.6717
      Notes: Best (10% improvement)
    • 2-10.c.cbow.0-1n.+0-!=.cos + fixed LC on W2V
      Software: CSpell; Data: Health Corpora; Score method: CBOW cos: [IM] & [OM], syn1neg (*use +, - (only if no +) scores)
      Performance: 457|562|774, 0.8231|0.5904|0.6841
      Notes: Best (11% improvement)
    • Final
      Software: CSpell; Data: Health Corpora; Score method: CBOW cos: [IM] & [OM], syn1neg (*use +, - (only if no +) scores)
      Performance: 458|564|774, 0.8121|0.5917|0.6846
      Notes: Best (11% improvement)
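
The three performance scores appear to be precision, recall, and F1 derived from the counts on the line above them (correctly fixed errors, total suggested corrections, total gold-standard errors); this mapping is inferred from the baseline row rather than stated on this page. A short sketch of the computation:

  // Sketch: derive precision/recall/F1 from the counts in the results.
  // Assumes the three counts are: correctly fixed errors, total suggested
  // corrections, and total gold-standard errors (inferred, not stated).
  public class PrfSketch {
      public static void main(String[] args) {
          int correct = 358;    // correctly fixed errors (baseline row)
          int suggested = 807;  // corrections proposed by the system
          int gold = 774;       // errors in the gold standard

          double precision = (double) correct / suggested;             // 0.4436
          double recall = (double) correct / gold;                     // 0.4625
          double f1 = 2.0 * precision * recall / (precision + recall); // 0.4529

          System.out.printf("P=%.4f, R=%.4f, F1=%.4f%n", precision, recall, f1);
      }
  }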

* Word2Vec Score Algorithm:

  • Word2VecScore.java: uses the cosine similarity score
  • ContextScoreComparator.java: sorts the context scores
  • RankByContext.java:
    • If the top-ranked score != 0: candidate = topRank
      • If the top-ranked score > 0.0 => use it to correct; a bigger positive score means the word is closer to the prediction
      • If the top-ranked score < 0.0 => use it to correct; a smaller (more negative) score means the word is farther from the prediction
    • If the top-ranked score = 0:
      • If there is only 1 candidate => use it to correct, even though we have no Word2Vec score information for the candidate
      • If there are multiple candidates, do not correct.
        A Word2Vec score of 0.0 means we have no information on the candidate, so we cannot tell whether 0.0 is better than a negative score.

    • Context scores may be positive, zero, or negative. A zero context score means the target word does not have a word vector, so such a candidate is not chosen over one with a negative score.
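
A minimal sketch of this top-rank decision logic, assuming the candidates have already been sorted best-first by a comparator such as ContextScoreComparator.java; the class and method names below are illustrative, not the actual RankByContext.java API.

  import java.util.List;

  // Illustrative sketch of the top-rank decision described above.
  // CandScore is a hypothetical (candidate, context score) pair.
  public class RankByContextSketch {

      public static class CandScore {
          public final String candidate;
          public final double score;
          public CandScore(String candidate, double score) {
              this.candidate = candidate;
              this.score = score;
          }
      }

      // candidates must already be sorted by context score, best first.
      // Returns the chosen correction, or null when no correction is made.
      public static String pickCorrection(List<CandScore> candidates) {
          if (candidates.isEmpty()) {
              return null;
          }
          CandScore top = candidates.get(0);
          if (top.score != 0.0) {
              // Positive or negative: trust the ranking and use the top candidate.
              return top.candidate;
          }
          if (candidates.size() == 1) {
              // Only one candidate: use it even without Word2Vec information.
              return top.candidate;
          }
          // Score of 0.0 with multiple candidates: no information, no correction.
          return null;
      }
  }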