CSpell

CSpell Score - Two-stage Ranking System

I. Introduction

CSpell uses the combination of above scoring component for ranking to reach better performance. More investigation and tests are planned to be performed to reach even better results when more resources are available.

II. Comparison on Different Ranking (Scoring) System

Tests on non-word on the development set with different ranking methods for the function mode of 1-to-1 and Split.

Ranking ModeRaw dataPerformance
Orthographic593|770|7740.7701|0.7661|0.7681
Frequency534|770|7740.6935|0.6899|0.6917
Context458|564|7740.8121|0.5917|0.6846
Noisy Channel551|770|7740.7156|0.7119|0.7137
Ensemble586|769|7740.7620|0.7571|0.7596
CSpell603|724|7740.8329|0.7791|0.8051

III. CSpell Two-stage Ranking System

  • Previous used the combination of all scoring to sort (Combo): by Orthographic, Noisy Channel, Frequency, Context
    => This ranking system can be improved if we use two-stage ranking system, as described below.

  • The two-stage ranking system is similar to determining a championship in sports through a regular season selection (stage 1) with the best team in the playoffs (stage 2) winning the championship.
  • Qualified the candidates using Orthographic, then rank
    • Stage-1, qualified round:
      The non-word spelling and split candidate generator that relies on edit distance measure alone generates irrelevant candidates. Orthographic similarity scores were used to exclude irrelevant candidates. All candidates were ranked by orthographic similarity score first in the stage 1. That is to find candidates that have higher orthographic scores to be qualified to go the step-2 for playoff. All qualified candidates go to stage-2 are treat equally and their ranking in stage-1 was ignored.
      • Find the top Orthographic score
      • Find all candidates within a range of orthographic score
      • All these candidates are qualified legit correction because there is very little difference in term of orthographic score. They are qualified for the playoffs (further ranking).
      • We use a factor of top orthographic score for the criteria to qualify candidates for playoff. Empirical value is 0.08.
    • Stage-2, playoff round:
      Find the best candidate using chain comparators to rank the selected candidates in stage 2 by the context score, then noisy channel scores in a sequential order.
      • Find the candidate with highest ranked by context score (best precision)
      • Find the candidate with highest ranked by noisy channel score (better precision than frequency score)
      • Find the candidate with the orthographic score is greater than 2.7 (empirical value) if only 1 candidate qualified in the playoff.
        Example: Lactoccocus lactis to Lactococcus lactis
  • Results:

    Ranking ModeRaw dataPerformance
    1 stage ranking
    Combo: O, N, F, C599|770|7740.7779|0.7739|0.7759
    Single method in the playoff round
    Orthographic593|770|7740.7701|0.7661|0.7681
    Context438|479|7740.9144|0.5659|0.6991
    Frequency590|724|7740.8149|0.7623|0.7877
    Noisy Channel591|724|7740.8163|0.7636|0.7891
    Combo methods in the playoff round
    CSpell: C, F602|724|7740.8315|0.7778|0.8037
    CSpell: C, N603|724|7740.8329|0.7791|0.8051
    CSpell: C, N, F603|724|7740.8329|0.7791|0.8051
    CSpell: C, N, F, O614|770|7740.7974|0.7933|0.7953
    CSpell: C, F, O613|770|7740.7961|0.7920|0.7940

  • Observation for deriving the algorithm:
    • From previous test, context seems have the best precision (worse recall) while Orthographic has the best recall and F1.
    • Use 2 stage ranking system:
      • Stage-1, qualified round: use orthographic score to qualify and select candidates (within certain range from the top score).
      • Stage-2, playoff round: use high precision methods (context score, then Noisy Channel) to find the top rank. Frequency is not needed because it is covered by Noisy Channel.
    • Example:
      • Misspelling havy
      • Stage-1:
        • Candidates within edit distance of 2: 24,253
        • Above candidates exist in the CSpell dictionary: 411
          OrderCandidateOrthographic ScoreToken SimilarityPhonetic SimilarityOverlap Similarity
          1heavy2.250000000.910000001.000000000.80000000
          2hav2.204000000.904000001.000000000.75000000
          3have2.200000000.900000001.000000000.75000000
          4hava2.200000000.900000001.000000000.75000000
          5hay2.134000000.904000000.900000000.75000000
          6wavy2.130000000.900000000.900000000.75000000
          7hazy2.130000000.900000000.900000000.75000000
          8navy2.130000000.900000000.900000000.75000000
          ..................
          14cavy2.130000000.900000000.900000000.75000000
          Orthographic score threshold is: 2.07 (= 2.25 x 0.92)
          15hairy1.920000000.810000000.900000000.60000000
          16happy1.920000000.810000000.900000000.60000000
          17haven1.920000000.810000000.900000000.60000000
          18harry1.920000000.810000000.900000000.60000000
          ..................
          71hair1.830000000.800000000.900000000.50000000
          ..................
          295lady1.560000000.800000000.800000000.25000000
          ..................
          411aavp1.360000000.800000000.800000000.00000000
        • Only 14 out of 411 candidates are above the orthographic score threshold and selected in stage-1. The threshold is 92% (configurable, determined empirically, used as default) of the highest orthographic score.
      • Stage-2:
        • The 14 selected candidates are re-ranked in stage-2 using context score and noisy channel score. The orthographic scores are disregarded. Accordingly, different candidate are used for correction in different context. For example:
          Ex-IdInput TextOutput CorrectionContext (noisy channel) scores:
          heavyhavehaywary
          1havyhave0.0000
          (0.00198412)
          0.0000
          (0.14933089)
          0.0000
          (0.00032499)
          0.0000
          (0.00004276)
          2havy dutyheavy duty0.0597-0.0302-0.00530.0074
          3havy diabeteshave diabetes-0.06670.0586-0.0518-0.0813
          4havy feverhay fever-0.13310.22800.2292-0.0391
          5havy lineswavy lines-0.0170-0.0410-0.07020.1495

        • Ex-1, havy is corrected to have because there is no context (context scores are 0 for all candidates) and have has the highest noisy channel score (0.1493308).
        • Ex-2-5, the candidate with the highest context score is used for correction.
        • Ex-4, have and hay have very close context scores because have fever and hay fever are valid terms in the training corpus.
        • In out tests, other combined ranking system (such as Ensemble or noisy channel) seems to have a hard time to get to the appropriate corrections with context even context score is used. For example, diagnost is always changed to diagnosis regardless of the context when using other combined ranking systems. With our two-stage ranking system, diagnost are corrected to the diagnosis and was diagnosed.
    • A more comprehensive corpus could increase the precision and recall for using context score. Theoretically, if the corpus covers all the corrections in the application, context score will be the best. In such cases, context score could be dominated over other scores for final candidate selection (in stage-2).