CSpell

Orthographic Score

Introduction

This page describes the ranking algorithm used to choose the correct word from the suggested candidates for a misspelled word.

Algorithm

The orthographic score is a weighted sum of three similarity scores: token similarity, phonetic similarity, and leading/trailing character overlap similarity.

That is:
Orthographic score = wf1 * Token similarity score + wf2 * Phonetic similarity score + wf3 * Overlap similarity score

where:

  • wf1 = 1.00 (configurable: CS_ORTHO_SCORE_ED_DIST_FAC)
  • wf2 = 0.70 (configurable: CS_ORTHO_SCORE_PHONETIC_FAC)
  • wf3 = 0.80 (configurable: CS_ORTHO_SCORE_OVERLAP_FAC)
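The weighted sum can be sketched in a few lines of Java. This is a minimal illustration, not the code in OrthographicScore.java: the class name, constant names, and method signature are assumptions made for this example, and the weights are hard-coded to the defaults listed above rather than read from the configuration.

```java
// Illustrative sketch only; names are hypothetical, not taken from CSpell.
class OrthoScoreSketch {
    // Default weights from the list above (configurable in CSpell).
    static final double WF1 = 1.00; // CS_ORTHO_SCORE_ED_DIST_FAC
    static final double WF2 = 0.70; // CS_ORTHO_SCORE_PHONETIC_FAC
    static final double WF3 = 0.80; // CS_ORTHO_SCORE_OVERLAP_FAC

    // Weighted sum of the three similarity scores.
    static double orthographicScore(double token, double phonetic, double overlap) {
        return WF1 * token + WF2 * phonetic + WF3 * overlap;
    }

    public static void main(String[] args) {
        // All three similarities at 1.0 give the sum of the weights.
        System.out.println(orthographicScore(1.0, 1.0, 1.0));
    }
}
```

With the default weights the score is not normalized to 1.0: a candidate that matches perfectly on all three measures scores 1.00 + 0.70 + 0.80 = 2.50.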

Source Code:

  • OrthographicScore.java
  • OrthographicScoreComparator.java
  • RankingByOrthographic.java
    => Gets the candidate with the top orthographic score
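Selecting the top-ranked candidate amounts to taking the maximum over the scored candidates. The sketch below assumes the scores have already been computed and stored in a map from candidate to score. The class and method names are hypothetical, and it does not reproduce the actual logic of RankingByOrthographic.java or OrthographicScoreComparator.java (for example, any tie-breaking rules they may apply).

```java
import java.util.Map;
import java.util.Optional;

// Illustrative sketch only; not the CSpell implementation.
class TopCandidateSketch {
    // Returns the candidate with the highest score, or empty if there are no candidates.
    static Optional<String> topCandidate(Map<String, Double> scores) {
        return scores.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }
}
```

Returning an Optional makes the "no candidates" case explicit to the caller instead of returning null.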

Example:
Orthographic score between the misspelling truely and the candidate truly:

  • Token similarity score:
    There is one delete operation (the 'e' is deleted) from truely to truly. The normalized delete cost is 0.096.
    The token similarity score is calculated as:
    = 1.0 - 0.096
    = 0.904
  • Phonetic similarity score:
    The phonetic representations (Double Metaphone codes) of truely and truly are the same [TRL], so the phonetic similarity score is 1.0.
  • Leading/trailing character overlap similarity score:
    = (leading overlap characters + trailing overlap characters) / length of the longer term
    = (3 + 2) / 6   (leading "tru", trailing "ly"; the longer term truely has 6 characters)
    = 0.83

  • Orthographic score:
    = 1.00 * 0.904 + 0.70 * 1.0 + 0.80 * 0.83
    = 2.27
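The leading/trailing overlap step of this example can be reproduced with a short Java sketch. It is an illustration written for this page, not the CSpell source: the class and method names are hypothetical, and capping the trailing overlap so that it never re-counts characters already matched as leading overlap is an assumption, since the page does not say how that case is handled. The token similarity (0.904) and phonetic similarity (1.0) are taken directly from the example above rather than computed.

```java
// Illustrative sketch only; names and the overlap cap are assumptions.
class OverlapSketch {
    // Number of matching characters at the start of both strings.
    static int leadingOverlap(String a, String b) {
        int limit = Math.min(a.length(), b.length());
        int i = 0;
        while (i < limit && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    // Number of matching characters at the end of both strings,
    // capped so characters already counted as leading overlap are not reused.
    static int trailingOverlap(String a, String b, int lead) {
        int limit = Math.min(a.length(), b.length()) - lead;
        int i = 0;
        while (i < limit
                && a.charAt(a.length() - 1 - i) == b.charAt(b.length() - 1 - i)) {
            i++;
        }
        return i;
    }

    // (leading overlap + trailing overlap) / length of the longer term.
    static double overlapSimilarity(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        if (longer == 0) {
            return 1.0; // assumption: two empty strings are treated as identical
        }
        int lead = leadingOverlap(a, b);
        int trail = trailingOverlap(a, b, lead);
        return (double) (lead + trail) / longer;
    }

    public static void main(String[] args) {
        double overlap = overlapSimilarity("truely", "truly"); // (3 + 2) / 6
        // Token and phonetic scores are the values quoted in the example.
        double score = 1.00 * 0.904 + 0.70 * 1.0 + 0.80 * overlap;
        System.out.println("overlap = " + overlap);
        System.out.println("score   = " + score);
    }
}
```

Because the sketch keeps the unrounded overlap (5/6) instead of 0.83, its total differs slightly from the hand calculation in the third decimal place, but both round to 2.27.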