CSpell Score - Two-stage Ranking System
I. Introduction
CSpell uses the combination of above scoring component for ranking to reach better performance. More investigation and tests are planned to be performed to reach even better results when more resources are available.
II. Comparison on Different Ranking (Scoring) System
Tests on non-word on the development set with different ranking methods for the function mode of 1-to-1 and Split.
Ranking Mode | Raw data | Performance
|
---|
Orthographic | 593|770|774 | 0.7701|0.7661|0.7681
|
Frequency | 534|770|774 | 0.6935|0.6899|0.6917
|
Context | 458|564|774 | 0.8121|0.5917|0.6846
|
|
Noisy Channel | 551|770|774 | 0.7156|0.7119|0.7137
|
Ensemble | 586|769|774 | 0.7620|0.7571|0.7596
|
CSpell | 603|724|774 | 0.8329|0.7791|0.8051
|
III. CSpell Two-stage Ranking System
- Previous used the combination of all scoring to sort (Combo): by Orthographic, Noisy Channel, Frequency, Context
=> This ranking system can be improved if we use two-stage ranking system, as described below.
- The two-stage ranking system is similar to determining a championship in sports through a regular season selection (stage 1) with the best team in the playoffs (stage 2) winning the championship.
- Qualified the candidates using Orthographic, then rank
- Stage-1, qualified round:
The non-word spelling and split candidate generator that relies on edit distance measure alone generates irrelevant candidates. Orthographic similarity scores were used to exclude irrelevant candidates. All candidates were ranked by orthographic similarity score first in the stage 1.
That is to find candidates that have higher orthographic scores to be qualified to go the step-2 for playoff. All qualified candidates go to stage-2 are treat equally and their ranking in stage-1 was ignored.
- Find the top Orthographic score
- Find all candidates within a range of orthographic score
- All these candidates are qualified legit correction because there is very little difference in term of orthographic score. They are qualified for the playoffs (further ranking).
- We use a factor of top orthographic score for the criteria to qualify candidates for playoff. Empirical value is 0.08.
- Stage-2, playoff round:
Find the best candidate using chain comparators to rank the selected candidates in stage 2 by the context score, then noisy channel scores in a sequential order.
- Find the candidate with highest ranked by context score (best precision)
- Find the candidate with highest ranked by noisy channel score (better precision than frequency score)
- Find the candidate with the orthographic score is greater than 2.7 (empirical value) if only 1 candidate qualified in the playoff.
Example: Lactoccocus lactis to Lactococcus lactis
- Results:
Ranking Mode | Raw data | Performance
|
---|
1 stage ranking
|
Combo: O, N, F, C | 599|770|774 | 0.7779|0.7739|0.7759
|
Single method in the playoff round
|
Orthographic | 593|770|774 | 0.7701|0.7661|0.7681
|
Context | 438|479|774 | 0.9144|0.5659|0.6991
|
Frequency | 590|724|774 | 0.8149|0.7623|0.7877
|
Noisy Channel | 591|724|774 | 0.8163|0.7636|0.7891
|
Combo methods in the playoff round
|
CSpell: C, F | 602|724|774 | 0.8315|0.7778|0.8037
|
CSpell: C, N | 603|724|774 | 0.8329|0.7791|0.8051
|
CSpell: C, N, F | 603|724|774 | 0.8329|0.7791|0.8051
|
CSpell: C, N, F, O | 614|770|774 | 0.7974|0.7933|0.7953
|
CSpell: C, F, O | 613|770|774 | 0.7961|0.7920|0.7940
|
- Observation for deriving the algorithm:
- From previous test, context seems have the best precision (worse recall) while Orthographic has the best recall and F1.
- Use 2 stage ranking system:
- Stage-1, qualified round: use orthographic score to qualify and select candidates (within certain range from the top score).
- Stage-2, playoff round: use high precision methods (context score, then Noisy Channel) to find the top rank. Frequency is not needed because it is covered by Noisy Channel.
- Example:
- Misspelling havy
- Stage-1:
- Candidates within edit distance of 2: 24,253
- Above candidates exist in the CSpell dictionary: 411
Order | Candidate | Orthographic Score | Token Similarity | Phonetic Similarity | Overlap Similarity
|
---|
1 | heavy | 2.25000000 | 0.91000000 | 1.00000000 | 0.80000000
|
2 | hav | 2.20400000 | 0.90400000 | 1.00000000 | 0.75000000
|
3 | have | 2.20000000 | 0.90000000 | 1.00000000 | 0.75000000
|
4 | hava | 2.20000000 | 0.90000000 | 1.00000000 | 0.75000000
|
5 | hay | 2.13400000 | 0.90400000 | 0.90000000 | 0.75000000
|
6 | wavy | 2.13000000 | 0.90000000 | 0.90000000 | 0.75000000
|
7 | hazy | 2.13000000 | 0.90000000 | 0.90000000 | 0.75000000
|
8 | navy | 2.13000000 | 0.90000000 | 0.90000000 | 0.75000000
|
... | ... | ... | ... | ... | ...
|
14 | cavy | 2.13000000 | 0.90000000 | 0.90000000 | 0.75000000
|
Orthographic score threshold is: 2.07 (= 2.25 x 0.92)
|
---|
15 | hairy | 1.92000000 | 0.81000000 | 0.90000000 | 0.60000000
|
16 | happy | 1.92000000 | 0.81000000 | 0.90000000 | 0.60000000
|
17 | haven | 1.92000000 | 0.81000000 | 0.90000000 | 0.60000000
|
18 | harry | 1.92000000 | 0.81000000 | 0.90000000 | 0.60000000
|
... | ... | ... | ... | ... | ...
|
71 | hair | 1.83000000 | 0.80000000 | 0.90000000 | 0.50000000
|
... | ... | ... | ... | ... | ...
|
295 | lady | 1.56000000 | 0.80000000 | 0.80000000 | 0.25000000
|
... | ... | ... | ... | ... | ...
|
411 | aavp | 1.36000000 | 0.80000000 | 0.80000000 | 0.00000000
|
- Only 14 out of 411 candidates are above the orthographic score threshold and selected in stage-1. The threshold is 92% (configurable, determined empirically, used as default) of the highest orthographic score.
- Stage-2:
- The 14 selected candidates are re-ranked in stage-2 using context score and noisy channel score. The orthographic scores are disregarded. Accordingly, different candidate are used for correction in different context. For example:
Ex-Id | Input Text | Output Correction | Context (noisy channel) scores:
|
---|
heavy | have | hay | wary
|
---|
1 | havy | have | 0.0000 (0.00198412) | 0.0000 (0.14933089) | 0.0000 (0.00032499) | 0.0000 (0.00004276)
|
2 | havy duty | heavy duty | 0.0597 | -0.0302 | -0.0053 | 0.0074
|
3 | havy diabetes | have diabetes | -0.0667 | 0.0586 | -0.0518 | -0.0813
|
4 | havy fever | hay fever | -0.1331 | 0.2280 | 0.2292 | -0.0391
|
5 | havy lines | wavy lines | -0.0170 | -0.0410 | -0.0702 | 0.1495
|
- Ex-1, havy is corrected to have because there is no context (context scores are 0 for all candidates) and have has the highest noisy channel score (0.1493308).
- Ex-2-5, the candidate with the highest context score is used for correction.
- Ex-4, have and hay have very close context scores because have fever and hay fever are valid terms in the training corpus.
- In out tests, other combined ranking system (such as Ensemble or noisy channel) seems to have a hard time to get to the appropriate corrections with context even context score is used. For example, diagnost is always changed to diagnosis regardless of the context when using other combined ranking systems. With our two-stage ranking system, diagnost are corrected to the diagnosis and was diagnosed.
- A more comprehensive corpus could increase the precision and recall for using context score. Theoretically, if the corpus covers all the corrections in the application, context score will be the best. In such cases, context score could be dominated over other scores for final candidate selection (in stage-2).