CSpell

Ensemble Algorithm

The high level algorithm of ensemble method for spelling correction are described as follows.

I. Source code:
LinearWeightedEnsembleSpellCorrection.java

II. Algorithm

  • Input text: read in text of the whole question
  • Converted to List<Span> processSpans: remove header, such as SUBJECT:, EMAIL:, etc.
  • Converted to fixed: preProcessed text to handle contractions, informational expression, puntuaction, split digits, etc.

  • Converted to List<CoreMap> sentences: use CoreNLP for annotation, treat the whole text as 1 sentence
  • Converted to List<CoreLabel> tokenAnns: Token separated by space and punctuation (NLPCore)
  • ProcessTokens to get:
    • List<String> origTokens: Separated by space and period (end of sentences) only.
    • List<String> modTokens: Tag [MUM] and others
    • List&Integer> begins: the beginning position of modToken in the origTokens list
    • List&Integer> positions: the index of modToken in the origTokens list
    • List&Integer> origPositions: the beginning position of origToken in the origTokens list

  • correct to get corrected text:
    Go through each origToken

    • skip token if it is "clinicaltrials", [no letter token], [leading number], [ProNoun]

    • Get Candidates:
      • LinkedHashSet<String> suggestions: single word suggestions
      • Map<String,String> mergeSuggestions: merge suggestions, key: merge suggestion, value: before merge tokens

      • Real-Word:
        token in the dictionary & selected real-word option
        • token.length( ) > 3
        • mergeSuggestion: higher than frequency and use word Vector
        • candidates with higher frequency threshold and merge suggestion
      • Non-Word:
        token not in the dictionary & selected real-word option
        • token.length( ) > 2
        • mergeSuggestion: no frequency check and not use word Vector
        • candidates with no frequency check and not use word Vector

    • Rank Candidates:
      Find the candidate with highest score
      • Real-Word:
        Score = 0.15 * w2vScore + 0.25 * corpusScore + 0.6
      • Non-Word:
        Score = 0.15 * w2vScore + 0.25 * corpusScore + 0.2 * (overlapScore + phoneticScore + edScore)

      Where:

      ScoreSource CodeNotes
      edScoreDictionaryBasedSpellChecker.getEditSimScore( )
      • Get cost from Jazzy Edit Distance:
        • Delete: 95
        • Insert: 95
        • Substitute: 100
        • Transpose: 90
        • Case: 10
      • Add split penalty: 95
      • Normalized score: 0.0 ~ 1.0 (= (1000-cost)/1000)
      phoneticScoreDictionaryBasedSpellChecker.getPhoneticSimScore( )
      • Get Metaphone codes of src/tar from Jazzy double Metaphone
      • Get cost of codes from Jazzy edit distance
      • Get penalty cost of src/tar
      • Normalized score: 0.0 ~ 1.0 (= (1000-cost)/1000)
      overlapScoreOverLapUtil.leadTrailOverlap( )
      corpusScoreCorpusFrequencyCounts.getUnigramScore( )
      w2vScoreWord2Vector.getSimScore( )