Corrector
This page describes the corrector algorithm, which updates the text by replacing each spelling error with its top-ranked candidate.
I. One-To-One
- Finding: Find the top-ranked candidate (TokenObj)
- Correction: Add the candidate to the outTokenList
- Java:
OneToOneSplitCorrector.AddToFlatMapList
- Example:
Input | ... | dianosed | ...
Top Candidate | ... | diagnosed | ...
Correction | ... | diagnosed | ...
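The one-to-one case above can be sketched as a plain token replacement. This is a minimal illustration, not the actual OneToOneSplitCorrector code: the class name, the `correct` method, and the use of a map from misspelling to top candidate are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: names and signatures are illustrative only.
class OneToOneSketch {
    // Build the output token list: a token with a top-ranked candidate is
    // replaced by that candidate, every other token is copied unchanged.
    static List<String> correct(List<String> inTokens, Map<String, String> topCandidates) {
        List<String> outTokenList = new ArrayList<>();
        for (String token : inTokens) {
            outTokenList.add(topCandidates.getOrDefault(token, token));
        }
        return outTokenList;
    }

    public static void main(String[] args) {
        List<String> in = Arrays.asList("was", " ", "dianosed", " ", "early");
        Map<String, String> top = Collections.singletonMap("dianosed", "diagnosed");
        System.out.println(correct(in, top));
    }
}
```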
II. Split
- Finding: Find the top-ranked candidate (TokenObj)
- Correction: Flat map the candidate into the outTokenList
The top-ranked candidate (the split words) must be flat mapped to a list of TokenObjs, which is then added to the outTokenList.
- Java:
OneToOneSplitCorrector.AddToFlatMapList
- Example:
Input | ... | brokenbonecannotsleep | ...
Top Candidate | ... | broken bone can not sleep | ...
Correction | ... | broken | | bone | | can | | not | | sleep | ...
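The flat-map step above can be sketched as follows. This is a hedged illustration only: tokens are plain strings rather than TokenObjs, the method names are hypothetical, and it assumes (from the Correction row of the example) that the split words are separated by space tokens in the output list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: not the actual AddToFlatMapList implementation.
class SplitFlatMapSketch {
    // Flat map one top candidate into tokens. A one-word candidate yields a
    // single token; a split candidate yields its words with a space token
    // between each pair (assumption based on the example table).
    static List<String> flatMap(String topCandidate) {
        List<String> tokens = new ArrayList<>();
        for (String word : topCandidate.split(" ")) {
            if (!tokens.isEmpty()) {
                tokens.add(" ");
            }
            tokens.add(word);
        }
        return tokens;
    }

    // Append the flat-mapped tokens to the output list.
    static void addToOutTokenList(List<String> outTokenList, String topCandidate) {
        outTokenList.addAll(flatMap(topCandidate));
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        addToOutTokenList(out, "broken bone can not sleep");
        System.out.println(out.size() + " tokens: " + out);
    }
}
```

Because a single-word candidate flat maps to a one-element list, the same routine can serve both the one-to-one and the split case.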
III. Merge
- Finding: Find the top-ranked candidate (TokenObj)
- Correction: Update the tokens for all MergeObjs
  - Go through all MergeObjs:
    - add the tokens before the start of the target merge (Correction-1 in the example)
    - add the merged word at the target (Correction-2)
  - Add the tokens after the last MergeObj (Correction-3)
- Java:
ProcessNonWordMerge.CorrectTokenListByMerge
- Example:
Input | ... | problems | | dur | | ing | | her | | pregnancies.
Correction-1 | ... | problems |
Correction-2 | ... | problems | | during
Correction-3 | ... | problems | | during | | her | | pregnancies.
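The three correction steps in the example above can be sketched as a single copy loop. This is an assumption-laden illustration, not the code of ProcessNonWordMerge.CorrectTokenListByMerge: the `Merge` class is a reduced, hypothetical stand-in for MergeObj, `endIndex` is assumed to be inclusive, and the merges are assumed to be sorted by `startIndex` and non-overlapping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the merge correction loop.
class MergeSketch {
    // Reduced stand-in for MergeObj: only the fields the loop needs.
    static final class Merge {
        final String mergeWord;
        final int startIndex; // index of the first merged token (spaces counted)
        final int endIndex;   // index of the last merged token, assumed inclusive

        Merge(String mergeWord, int startIndex, int endIndex) {
            this.mergeWord = mergeWord;
            this.startIndex = startIndex;
            this.endIndex = endIndex;
        }
    }

    static List<String> applyMerges(List<String> inTokens, List<Merge> merges) {
        List<String> out = new ArrayList<>();
        int cur = 0;
        for (Merge m : merges) {
            // Step 1: copy the tokens before the start of this merge.
            out.addAll(inTokens.subList(cur, m.startIndex));
            // Step 2: the merged word replaces tokens startIndex..endIndex.
            out.add(m.mergeWord);
            cur = m.endIndex + 1;
        }
        // Step 3: copy the tokens after the last merge.
        out.addAll(inTokens.subList(cur, inTokens.size()));
        return out;
    }

    public static void main(String[] args) {
        List<String> in = Arrays.asList(
            "problems", " ", "dur", " ", "ing", " ", "her", " ", "pregnancies.");
        List<Merge> merges = Collections.singletonList(new Merge("during", 2, 4));
        System.out.println(applyMerges(in, merges));
    }
}
```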
* MergeObj:
tarWord | mergeWord | coreMergeWord | mergeNo | tarIndex | startIndex | endIndex | tarPos | startPos | endPos
- xxxIndex is the index in the original text (including space tokens); it is used in the merge operation to correct the input text.
- xxxPos is the index in the non-space token list; it is used to find the context for context scores.
- coreMergeWord is used to take care of ending punctuation, for example when merging "disap point ment." into "disappointment."
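The difference between the xxxIndex and xxxPos fields can be illustrated with a small helper that maps a position in the non-space token list back to an index in the full token list. The second helper shows one possible way to derive a punctuation-free core word. Both methods are hypothetical, and treating coreMergeWord as the merged word with trailing punctuation removed is an assumption, not something the field list above confirms.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helpers illustrating the MergeObj field conventions.
class MergeObjFieldsSketch {
    // Map a xxxPos value (position among non-space tokens) to the matching
    // xxxIndex value (index in the full list, space tokens included).
    // Returns -1 if the list has no such non-space token.
    static int posToIndex(List<String> tokens, int pos) {
        int seen = -1;
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).trim().isEmpty()) {
                seen++;
                if (seen == pos) {
                    return i;
                }
            }
        }
        return -1;
    }

    // Assumed derivation of a core word: drop trailing punctuation.
    static String stripEndingPunctuation(String mergeWord) {
        return mergeWord.replaceAll("\\p{Punct}+$", "");
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("problems", " ", "dur", " ", "ing");
        System.out.println(posToIndex(tokens, 1));
        System.out.println(stripEndingPunctuation("disappointment."));
    }
}
```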