CSpell

Orthographic Score

Introduction

This page describes the ranking algorithm that chooses the correct word from the suggested candidates for a misspelled word.

Algorithm

The orthographic score is a weighted sum of three similarity scores: token, phonetic, and leading/trailing overlap.

That is:
Orthographic score = wf1 * Token similarity score + wf2 * Phonetic similarity score + wf3 * Overlap similarity score

where:

  • wf1 = 1.00 (configurable: CS_ORTHO_SCORE_ED_DIST_FAC)
  • wf2 = 0.70 (configurable: CS_ORTHO_SCORE_PHONETIC_FAC)
  • wf3 = 0.80 (configurable: CS_ORTHO_SCORE_OVERLAP_FAC)
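As a minimal sketch of the weighted sum above (not the code in OrthographicScore.java), the following Python function combines three precomputed similarity scores. The function and parameter names are hypothetical; the default weights are the values listed above.

```python
def orthographic_score(token_sim, phonetic_sim, overlap_sim,
                       wf1=1.00, wf2=0.70, wf3=0.80):
    """Weighted sum of the three similarity scores.

    Hypothetical helper: the defaults mirror the documented weights
    (CS_ORTHO_SCORE_ED_DIST_FAC, CS_ORTHO_SCORE_PHONETIC_FAC,
    CS_ORTHO_SCORE_OVERLAP_FAC).
    """
    return wf1 * token_sim + wf2 * phonetic_sim + wf3 * overlap_sim


# Inputs are the three similarity scores from the truely/truly example
# on this page; the candidate with the top total is the one selected.
print(round(orthographic_score(0.904, 1.0, 0.83), 2))  # prints 2.27
```

Passing the weights as keyword arguments with defaults keeps the sketch close to the configurable factors, so a different weighting can be tried without editing the function.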

Source Code:

  • OrthographicScore.java
  • OrthographicScoreComparator.java
  • RankingByOrthographic.java
    => Gets the candidate with the top orthographic score

Example:
Orthographic score between the misspelling truely and the candidate truly:

  • Token similarity score:
    One delete operation (the 'e' is deleted) turns truely into truly. The normalized delete cost is 0.096.
    The token similarity score is calculated as:
    = 1.0 - 0.096
    = 0.904
  • Phonetic similarity score:
    The phonetic representations (Double Metaphone codes) of truely and truly are the same [TRL], so the phonetic similarity score is 1.0.
  • Leading/trailing character overlap similarity score:
    = (leading overlap characters + trailing overlap characters) / length of the longer term
    = (3+2)/6    (leading overlap "tru", trailing overlap "ly"; the longer term, truely, has 6 characters)
    = 0.83

  • Orthographic score:
    = 1.00 * 0.904 + 0.70 * 1.0 + 0.80 * 0.83
    = 2.27
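
The leading/trailing overlap term in this example can be reproduced with a short Python sketch. The function name is hypothetical. Two behaviours are assumptions made here, not something this page specifies: the trailing count is capped so that it never reuses characters already counted as leading overlap, and an empty term scores 0.0.

```python
def overlap_similarity(a, b):
    """(leading overlap + trailing overlap) / length of the longer term."""
    if not a or not b:
        return 0.0  # assumption: an empty term has no overlap
    shorter = min(len(a), len(b))

    # Count matching characters from the start.
    lead = 0
    while lead < shorter and a[lead] == b[lead]:
        lead += 1

    # Count matching characters from the end. Assumption: stop before
    # reusing characters already counted in the leading overlap.
    trail = 0
    while trail < shorter - lead and a[-1 - trail] == b[-1 - trail]:
        trail += 1

    return (lead + trail) / max(len(a), len(b))


overlap = overlap_similarity("truely", "truly")  # (3 + 2) / 6
# Token and phonetic scores are taken from the example; only the
# overlap term is computed here.
total = 1.00 * 0.904 + 0.70 * 1.0 + 0.80 * overlap
print(round(overlap, 2), round(total, 2))  # prints 0.83 2.27
```

With the cap in place, two identical terms score exactly 1.0 rather than counting every character twice.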