Ensemble Algorithm
The high-level algorithm of the ensemble method for spelling correction is described as follows.
I. Source code:
LinearWeightedEnsembleSpellCorrection.java
II. Algorithm
text
: read in text of the whole question
List<Span> processSpans
: remove header, such as SUBJECT:, EMAIL:, etc.
fixed
: preprocessed text that handles contractions, informational expression, punctuation, split digits, etc.
List<CoreMap> sentences
: use CoreNLP for annotation, treat the whole text as 1 sentence
List<CoreLabel> tokenAnns
: tokens separated by spaces and punctuation (CoreNLP)
ProcessTokens
to get:
List<String> origTokens
: Separated by space and period (end of sentences) only.
List<String> modTokens
: Tag [MUM] and others
List<Integer> begins
: the beginning position of modToken in the origTokens list
List<Integer> positions
: the index of modToken in the origTokens list
List<Integer> origPositions
: the beginning position of origToken in the origTokens list
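
The bookkeeping between a coarse token list and a finer one can be illustrated with a minimal sketch. This is not the actual ProcessTokens code: the class name `TokenAlignmentSketch`, the fields `coarseTokens`, `fineTokens`, and `coarseIndex`, and the regular expression are all hypothetical, and the punctuation split only stands in for whatever CoreNLP produces. It assumes coarse tokens are split on whitespace and that each fine token records the index of the coarse token it came from.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of index bookkeeping; not the ProcessTokens implementation. */
final class TokenAlignmentSketch {
    // A run of letters/digits, or one single non-alphanumeric character (assumed split rule).
    private static final Pattern PIECE = Pattern.compile("[\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]");

    final List<String> coarseTokens = new ArrayList<>();  // split on whitespace only
    final List<String> fineTokens = new ArrayList<>();    // additionally split on punctuation
    final List<Integer> coarseIndex = new ArrayList<>();  // coarse index of each fine token

    TokenAlignmentSketch(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return;
        }
        for (String coarse : trimmed.split("\\s+")) {
            int index = coarseTokens.size();
            coarseTokens.add(coarse);
            Matcher m = PIECE.matcher(coarse);
            while (m.find()) {
                fineTokens.add(m.group());
                coarseIndex.add(index);  // remember which coarse token this piece came from
            }
        }
    }

    public static void main(String[] args) {
        TokenAlignmentSketch s = new TokenAlignmentSketch("I have diabetis, help.");
        System.out.println(s.coarseTokens);
        System.out.println(s.fineTokens);
        System.out.println(s.coarseIndex);
    }
}
```

Keeping an explicit index list like this lets a correction made on a fine token be written back to the right place in the coarser list.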
correct
to get corrected text:
LinkedHashSet<String> suggestions
: single word suggestions
Map<String,String> mergeSuggestions
: merge suggestions, key: merge suggestion, value: before merge tokens
Where:
Score | Source Code | Notes |
---|---|---|
edScore | DictionaryBasedSpellChecker.getEditSimScore( ) | |
phoneticScore | DictionaryBasedSpellChecker.getPhoneticSimScore( ) | |
overlapScore | OverLapUtil.leadTrailOverlap( ) | |
corpusScore | CorpusFrequencyCounts.getUnigramScore( ) | |
w2vScore | Word2Vector.getSimScore( ) | |
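
The source file name suggests that the five scores are combined as a linear weighted sum. The following is a minimal sketch under that assumption, not the code in LinearWeightedEnsembleSpellCorrection.java: the class `EnsembleScoreSketch`, its methods, the equal weights, and the candidate score values are all hypothetical and chosen only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a linear weighted ensemble; weights and scores are made up. */
final class EnsembleScoreSketch {

    /** Weighted sum of the five component scores, in the order listed in the table. */
    static double ensembleScore(double edScore, double phoneticScore, double overlapScore,
                                double corpusScore, double w2vScore, double[] weights) {
        if (weights.length != 5) {
            throw new IllegalArgumentException("expected 5 weights");
        }
        return weights[0] * edScore
             + weights[1] * phoneticScore
             + weights[2] * overlapScore
             + weights[3] * corpusScore
             + weights[4] * w2vScore;
    }

    /** Returns the candidate with the highest ensemble score, or null if there are none. */
    static String best(Map<String, double[]> candidateScores, double[] weights) {
        String bestCandidate = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> e : candidateScores.entrySet()) {
            double[] s = e.getValue();
            double score = ensembleScore(s[0], s[1], s[2], s[3], s[4], weights);
            if (score > bestScore) {
                bestScore = score;
                bestCandidate = e.getKey();
            }
        }
        return bestCandidate;
    }

    public static void main(String[] args) {
        double[] weights = {0.2, 0.2, 0.2, 0.2, 0.2};  // illustrative equal weights
        Map<String, double[]> candidates = new LinkedHashMap<>();
        // Each array holds {edScore, phoneticScore, overlapScore, corpusScore, w2vScore}.
        candidates.put("diabetes", new double[] {0.9, 0.8, 0.9, 0.7, 0.6});
        candidates.put("diabetic", new double[] {0.7, 0.6, 0.8, 0.5, 0.4});
        System.out.println(best(candidates, weights));
    }
}
```

In a sketch like this, the weights control how much each similarity measure contributes to the final ranking; how the real implementation chooses them is not stated here.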