CSpell

CSpell Score - Two-stage Ranking System

I. Introduction

CSpell uses the combination of above scoring component for ranking to reach better performance. More investigation and tests are planned to be performed to reach even better results when more resources are available.

II. Comparison on Different Ranking (Scoring) System

Tests on non-word on the development set with different ranking methods for the function mode of 1-to-1 and Split.

Ranking Mode	Raw data	Performance
Orthographic	593\|770\|774	0.7701\|0.7661\|0.7681
Frequency	534\|770\|774	0.6935\|0.6899\|0.6917
Context	458\|564\|774	0.8121\|0.5917\|0.6846

Noisy Channel	551\|770\|774	0.7156\|0.7119\|0.7137
Ensemble	586\|769\|774	0.7620\|0.7571\|0.7596
CSpell	603\|724\|774	0.8329\|0.7791\|0.8051

III. CSpell Two-stage Ranking System

Previous used the combination of all scoring to sort (Combo): by Orthographic, Noisy Channel, Frequency, Context
=> This ranking system can be improved if we use two-stage ranking system, as described below.
The two-stage ranking system is similar to determining a championship in sports through a regular season selection (stage 1) with the best team in the playoffs (stage 2) winning the championship.
Qualified the candidates using Orthographic, then rank
- Stage-1, qualified round:
  The non-word spelling and split candidate generator that relies on edit distance measure alone generates irrelevant candidates. Orthographic similarity scores were used to exclude irrelevant candidates. All candidates were ranked by orthographic similarity score first in the stage 1. That is to find candidates that have higher orthographic scores to be qualified to go the step-2 for playoff. All qualified candidates go to stage-2 are treat equally and their ranking in stage-1 was ignored.
  - Find the top Orthographic score
  - Find all candidates within a range of orthographic score
  - All these candidates are qualified legit correction because there is very little difference in term of orthographic score. They are qualified for the playoffs (further ranking).
  - We use a factor of top orthographic score for the criteria to qualify candidates for playoff. Empirical value is 0.08.
- Stage-2, playoff round:
  Find the best candidate using chain comparators to rank the selected candidates in stage 2 by the context score, then noisy channel scores in a sequential order.
  - Find the candidate with highest ranked by context score (best precision)
  - Find the candidate with highest ranked by noisy channel score (better precision than frequency score)
  - Find the candidate with the orthographic score is greater than 2.7 (empirical value) if only 1 candidate qualified in the playoff.
    Example: Lactoccocus lactis to Lactococcus lactis

Results:

Ranking Mode	Raw data	Performance
1 stage ranking
Combo: O, N, F, C	599\|770\|774	0.7779\|0.7739\|0.7759
Single method in the playoff round
Orthographic	593\|770\|774	0.7701\|0.7661\|0.7681
Context	438\|479\|774	0.9144\|0.5659\|0.6991
Frequency	590\|724\|774	0.8149\|0.7623\|0.7877
Noisy Channel	591\|724\|774	0.8163\|0.7636\|0.7891
Combo methods in the playoff round
CSpell: C, F	602\|724\|774	0.8315\|0.7778\|0.8037
CSpell: C, N	603\|724\|774	0.8329\|0.7791\|0.8051
CSpell: C, N, F	603\|724\|774	0.8329\|0.7791\|0.8051
CSpell: C, N, F, O	614\|770\|774	0.7974\|0.7933\|0.7953
CSpell: C, F, O	613\|770\|774	0.7961\|0.7920\|0.7940

Observation for deriving the algorithm:

From previous test, context seems have the best precision (worse recall) while Orthographic has the best recall and F1.
Use 2 stage ranking system:
- Stage-1, qualified round: use orthographic score to qualify and select candidates (within certain range from the top score).
- Stage-2, playoff round: use high precision methods (context score, then Noisy Channel) to find the top rank. Frequency is not needed because it is covered by Noisy Channel.

Example:

Misspelling havy

Stage-1:

Candidates within edit distance of 2: 24,253

Above candidates exist in the CSpell dictionary: 411

Order	Candidate	Orthographic Score	Token Similarity	Phonetic Similarity	Overlap Similarity
1	heavy	2.25000000	0.91000000	1.00000000	0.80000000
2	hav	2.20400000	0.90400000	1.00000000	0.75000000
3	have	2.20000000	0.90000000	1.00000000	0.75000000
4	hava	2.20000000	0.90000000	1.00000000	0.75000000
5	hay	2.13400000	0.90400000	0.90000000	0.75000000
6	wavy	2.13000000	0.90000000	0.90000000	0.75000000
7	hazy	2.13000000	0.90000000	0.90000000	0.75000000
8	navy	2.13000000	0.90000000	0.90000000	0.75000000
...	...	...	...	...	...
14	cavy	2.13000000	0.90000000	0.90000000	0.75000000
Orthographic score threshold is: 2.07 (= 2.25 x 0.92)
15	hairy	1.92000000	0.81000000	0.90000000	0.60000000
16	happy	1.92000000	0.81000000	0.90000000	0.60000000
17	haven	1.92000000	0.81000000	0.90000000	0.60000000
18	harry	1.92000000	0.81000000	0.90000000	0.60000000
...	...	...	...	...	...
71	hair	1.83000000	0.80000000	0.90000000	0.50000000
...	...	...	...	...	...
295	lady	1.56000000	0.80000000	0.80000000	0.25000000
...	...	...	...	...	...
411	aavp	1.36000000	0.80000000	0.80000000	0.00000000

Only 14 out of 411 candidates are above the orthographic score threshold and selected in stage-1. The threshold is 92% (configurable, determined empirically, used as default) of the highest orthographic score.

Stage-2:

The 14 selected candidates are re-ranked in stage-2 using context score and noisy channel score. The orthographic scores are disregarded. Accordingly, different candidate are used for correction in different context. For example:

Ex-Id	Input Text	Output Correction	Context (noisy channel) scores:
Ex-Id	Input Text	Output Correction	heavy	have	hay	wary
1	havy	have	0.0000 (0.00198412)	0.0000 (0.14933089)	0.0000 (0.00032499)	0.0000 (0.00004276)
2	havy duty	heavy duty	0.0597	-0.0302	-0.0053	0.0074
3	havy diabetes	have diabetes	-0.0667	0.0586	-0.0518	-0.0813
4	havy fever	hay fever	-0.1331	0.2280	0.2292	-0.0391
5	havy lines	wavy lines	-0.0170	-0.0410	-0.0702	0.1495

Ex-1, havy is corrected to have because there is no context (context scores are 0 for all candidates) and have has the highest noisy channel score (0.1493308).
Ex-2-5, the candidate with the highest context score is used for correction.
Ex-4, have and hay have very close context scores because have fever and hay fever are valid terms in the training corpus.
In out tests, other combined ranking system (such as Ensemble or noisy channel) seems to have a hard time to get to the appropriate corrections with context even context score is used. For example, diagnost is always changed to diagnosis regardless of the context when using other combined ranking systems. With our two-stage ranking system, diagnost are corrected to the diagnosis and was diagnosed.

A more comprehensive corpus could increase the precision and recall for using context score. Theoretically, if the corpus covers all the corrections in the application, context score will be the best. In such cases, context score could be dominated over other scores for final candidate selection (in stage-2).

Ranking Mode	Raw data	Performance
Orthographic	593\|770\|774	0.7701\|0.7661\|0.7681
Frequency	534\|770\|774	0.6935\|0.6899\|0.6917
Context	458\|564\|774	0.8121\|0.5917\|0.6846

Noisy Channel	551\|770\|774	0.7156\|0.7119\|0.7137
Ensemble	586\|769\|774	0.7620\|0.7571\|0.7596
CSpell	603\|724\|774	0.8329\|0.7791\|0.8051

Ranking Mode	Raw data	Performance
1 stage ranking
Combo: O, N, F, C	599\|770\|774	0.7779\|0.7739\|0.7759
Single method in the playoff round
Orthographic	593\|770\|774	0.7701\|0.7661\|0.7681
Context	438\|479\|774	0.9144\|0.5659\|0.6991
Frequency	590\|724\|774	0.8149\|0.7623\|0.7877
Noisy Channel	591\|724\|774	0.8163\|0.7636\|0.7891
Combo methods in the playoff round
CSpell: C, F	602\|724\|774	0.8315\|0.7778\|0.8037
CSpell: C, N	603\|724\|774	0.8329\|0.7791\|0.8051
CSpell: C, N, F	603\|724\|774	0.8329\|0.7791\|0.8051
CSpell: C, N, F, O	614\|770\|774	0.7974\|0.7933\|0.7953
CSpell: C, F, O	613\|770\|774	0.7961\|0.7920\|0.7940