Non-word Spelling (1-To-1)
I. Introduction
This page describes the processes for non-word spelling (1-to-1) detection and correction.
II. Processes
- Detector:
NonWordDetector.java
- non-word: an invalid word, i.e. a word that is not in checkDic (checkDic includes EW, NUM, etc.)
- and not one of the exceptions: digit, punctuation, digit/punctuation, email, url, empty string, upperCase, 1Char, measurement
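The detector rules above can be sketched roughly as follows. This is only an illustration and not the code in NonWordDetector.java: the class name `NonWordDetectorSketch`, the method names, and the regular expressions used for each exception are assumptions made for this example, `checkDic` is modeled as a plain set of lowercase words, and the measurement exception is left out.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

final class NonWordDetectorSketch {
    // Hypothetical patterns; the real exception rules may differ.
    private static final Pattern DIGIT_PUNCT = Pattern.compile("[\\p{Punct}\\d]+");
    private static final Pattern EMAIL = Pattern.compile("[^@\\s]+@[^@\\s]+\\.[^@\\s]+");
    private static final Pattern URL = Pattern.compile("(?i)(https?://|www\\.)\\S+");

    /** True if the token is exempt from non-word detection. */
    static boolean isException(String token) {
        if (token.isEmpty() || token.length() == 1) return true;  // empty string, 1Char
        if (DIGIT_PUNCT.matcher(token).matches()) return true;    // digit, punctuation, or a mix
        if (EMAIL.matcher(token).matches()) return true;          // email
        if (URL.matcher(token).matches()) return true;            // url
        // upperCase: at least one letter and no lowercase letters
        return token.chars().anyMatch(Character::isLetter)
                && token.chars().noneMatch(Character::isLowerCase);
    }

    /** A token is a non-word if it is not an exception and not in checkDic. */
    static boolean isNonWord(String token, Set<String> checkDic) {
        return !isException(token) && !checkDic.contains(token.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Set<String> checkDic = Set.of("knowledge", "truly");  // toy dictionary
        System.out.println(isNonWord("knoledge", checkDic));
        System.out.println(isNonWord("NIH", checkDic));
    }
}
```

In this sketch the exceptions are checked before the dictionary lookup, so tokens such as numbers or all-uppercase abbreviations never reach the candidate generator.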
- Candidates:
OneToOneCandidates.java
- max. length of word <= 25 (configurable: CS_CAN_NW_1TO1_WORD_MAX_LENGTH)
Longer non-words generate too many candidates, which slows processing down. This variable is used to resolve that issue. Recall might decrease if the value is set too small.
- Edit Dist <= 2
- candidate is in the suggDic (valid word)
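The three candidate conditions above can be illustrated with a small sketch. It is not the code in OneToOneCandidates.java: the class and method names are hypothetical, `suggDic` is modeled as a plain set, and for brevity the sketch scans the dictionary and computes a plain Levenshtein distance (insert, delete, substitute) rather than generating edits of the non-word. The edit distance the tool actually uses may be defined differently, for example by also counting transpositions.

```java
import java.util.Set;
import java.util.TreeSet;

final class OneToOneCandidatesSketch {
    static final int MAX_WORD_LENGTH = 25;  // mirrors CS_CAN_NW_1TO1_WORD_MAX_LENGTH
    static final int MAX_EDIT_DIST = 2;

    /** Plain Levenshtein distance, computed with a two-row dynamic program. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;  // swap rows for the next iteration
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    /** Valid words from suggDic within MAX_EDIT_DIST of the non-word. */
    static Set<String> candidates(String nonWord, Set<String> suggDic) {
        Set<String> out = new TreeSet<>();
        if (nonWord.length() > MAX_WORD_LENGTH) return out;  // too long: no candidates
        for (String word : suggDic) {
            if (editDistance(nonWord, word) <= MAX_EDIT_DIST) out.add(word);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> suggDic = Set.of("knowledge", "truly", "traumatologist");  // toy dictionary
        System.out.println(candidates("knoledge", suggDic));      // prints [knowledge]
        System.out.println(candidates("traumatologo", suggDic));  // prints [] (distance is 3)
    }
}
```

Under this definition a non-word such as traumatologo has no candidate at all when the intended word is three edits away, however the ranker is tuned.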
- Ranker:
RankNonWordByMode.java
Uses the top-ranked candidate from the two-stage ranking system for correction:
- Stage-1:
- Orthographic score
- Edit Distance Similarity score
- Phonetic Similarity score (Double Metaphone)
- Overlap Similarity score
- Find the top orthographic score
- Stage-1 range factor for qualifying candidates = 0.08 (configurable: CS_RANKER_NW_S1_RANK_RANGE_FAC)
All candidates within a distance of 0.08 of the top orthographic score are selected as qualified candidates and go to stage-2 for final ranking. That is, candidates whose orthographic score is at least 92% of the highest candidate's score qualify for stage-2 ranking.
- The ranks by orthographic score in this stage are disregarded in stage-2
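The stage-1 qualification step can be sketched as below. This is an illustration under stated assumptions, not the code in RankNonWordByMode.java: the class and method names are hypothetical, the orthographic scores are taken as already computed, and the range factor is applied relative to the top score (threshold = top score x 0.92), following the "92%" wording. If the range is meant as an absolute distance, the threshold would instead be the top score minus 0.08.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class Stage1FilterSketch {
    static final double RANGE_FACTOR = 0.08;  // mirrors CS_RANKER_NW_S1_RANK_RANGE_FAC

    /**
     * Returns the candidates whose orthographic score is within the range
     * factor of the top score. The result keeps the map's iteration order,
     * not a ranking, because stage-1 ranks are disregarded in stage-2.
     */
    static List<String> qualify(Map<String, Double> orthoScores, double rangeFactor) {
        List<String> qualified = new ArrayList<>();
        if (orthoScores.isEmpty()) return qualified;
        double top = Collections.max(orthoScores.values());
        double threshold = top * (1.0 - rangeFactor);  // assumed relative reading
        for (Map.Entry<String, Double> e : orthoScores.entrySet()) {
            if (e.getValue() >= threshold) qualified.add(e.getKey());
        }
        return qualified;
    }

    public static void main(String[] args) {
        // Made-up scores for illustration only; they are not output of the tool.
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("diagnosed", 2.50);
        scores.put("diagnosis", 2.40);
        scores.put("diagnose", 2.00);
        System.out.println(qualify(scores, RANGE_FACTOR));
    }
}
```

With these made-up scores the threshold is about 2.30, so the first two candidates move on to stage-2 and the third is dropped.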
- Stage-2:
Use chain comparators in a sequential order of the following scores:
- Corrector:
OneToONeCorrector.java
- Update the focus token with the top-ranked candidate
- Update process history to non-word-1-to-1
III. Development Test
- True-Positive non-word 1-to-1:
Id | Source | Original Word | Corrected Word
---|---|---|---
TP-1 | 10023 | knoledge | knowledge
TP-2 | 10040 | truely | truly
TP-3 | 10475 | diagnost | diagnosed
TP-4 | 6 | diagnosised | diagnosed
... | ... | ... | ...
- TP-3, 4: the correction changes when the context changes:
- diagnost -> diagnosis
- was diagnost -> was diagnosed
- diagnost with -> diagnosed with
- was diagnost with -> was diagnosed with
- diagnosised -> diagnosis
- was diagnosised with -> was diagnosed with
- False-Positive non-word 1-to-1:
Id | Source | Original Word | Corrected Word | Correct Word
---|---|---|---|---
FP-1 | 10058 | B | be | B
FP-2 | 10084 | i.e. | ice. | i.e.
FP-3 | 11144 | clancy | chancy | clumsy
FP-4 | 11588 | baging | bagging | begging
... | ... | ... | ... | ...
- FP-1, 2: could be improved by taking word length and case into account
- FP-3: the edit distance to the correct word is too large
- False-Negative non-word 1-to-1:
Id | Source | Original Word | Corrected Word | Correct Word
---|---|---|---|---
FN-1 | 10285 | hitiala | hitiala | hiatal
FN-2 | 10714 | havy | have | heavy
FN-3 | 10 | ewings | ewings | ewing's
FN-4 | 11144 | traumatologo | traumatologo | traumatologist
FN-5 | 11186 | segmens | segment | segments
- FN-3: possessive
- FN-4: the edit distance to the correct word is too large
- FN-5: inflectional variants