Orthographic Score
Introduction
This page describes the ranking algorithm used to choose the correct word from the suggested candidates for a misspelled word.
Algorithm
The orthographic score is a weighted sum of three similarity scores: token, phonetic, and leading/trailing overlap.
That is:
Orthographic score = wf1 * Token similarity score + wf2 * Phonetic similarity score + wf3 * Overlap similarity score
where:
- wf1 = 1.00 (configurable: CS_ORTHO_SCORE_ED_DIST_FAC)
- wf2 = 0.70 (configurable: CS_ORTHO_SCORE_PHONETIC_FAC)
- wf3 = 0.80 (configurable: CS_ORTHO_SCORE_OVERLAP_FAC)
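The weighted sum above can be sketched in a few lines of Java. This is a minimal illustration, not the contents of OrthographicScore.java: the class name OrthoScoreSketch and the method orthographicScore are hypothetical, and the weights are hard-coded to the defaults listed above rather than read from configuration. The inputs in main are the component scores from the worked example on this page.

```java
import java.util.Locale;

// Hypothetical sketch; names are illustrative and not taken from the real source.
class OrthoScoreSketch {
    // Default weights from the text. The real values are configurable.
    static final double WF1 = 1.00; // CS_ORTHO_SCORE_ED_DIST_FAC
    static final double WF2 = 0.70; // CS_ORTHO_SCORE_PHONETIC_FAC
    static final double WF3 = 0.80; // CS_ORTHO_SCORE_OVERLAP_FAC

    // Weighted sum of the three similarity scores.
    static double orthographicScore(double token, double phonetic, double overlap) {
        return WF1 * token + WF2 * phonetic + WF3 * overlap;
    }

    public static void main(String[] args) {
        // Component scores for truely -> truly, as given in the example.
        double score = orthographicScore(0.904, 1.0, 0.83);
        System.out.println(String.format(Locale.ROOT, "%.2f", score)); // prints 2.27
    }
}
```

Because the weights sum to 2.50, the score is not confined to the range 0 to 1; it is only meaningful for comparing candidates against each other.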
Source Code:
- OrthographicScore.java
- OrthographicScoreComparator.java
- RankingByOrthographic.java
=> The candidate with the highest orthographic score is selected.
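Selecting the top-scoring candidate might look like the sketch below. It assumes the orthographic scores have already been computed and stored in a map from candidate to score. The class TopCandidateSketch and the method topCandidate are hypothetical and do not reflect the actual OrthographicScoreComparator.java or RankingByOrthographic.java. Apart from the 2.27 for truly, the scores in main are placeholder values chosen only for illustration, and how ties are broken is not specified on this page.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; names are illustrative and not taken from the real source.
class TopCandidateSketch {
    // Returns the candidate with the highest score, or null if there are none.
    static String topCandidate(Map<String, Double> scores) {
        return scores.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("truly", 2.27);  // score from the worked example
        scores.put("truey", 1.90);  // placeholder value, illustration only
        scores.put("trudy", 1.50);  // placeholder value, illustration only
        System.out.println(topCandidate(scores)); // prints truly
    }
}
```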
Example:
Orthographic score between the misspelling truely and the candidate truly:
- Token similarity score:
There is one delete operation ('e' is deleted) from truely to truly. The normalized delete cost is 0.096.
The token similarity score is calculated as:
= 1.0 - 0.096
= 0.904
- Phonetic similarity score:
The phonetic representations (Double Metaphone codes) of truely and truly are the same [TRL], so the phonetic similarity score is 1.0.
- Leading/trailing character overlap similarity score:
= (leading overlap characters + trailing overlap characters) / length of the longer term
= (3 + 2) / 6 (leading overlap "tru" = 3, trailing overlap "ly" = 2, length of truely = 6)
= 0.83
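The leading/trailing overlap calculation above can be sketched as follows. This is an illustration under assumptions, not the code in OrthographicScore.java: the class OverlapSketch and its methods are hypothetical. Two behaviors are assumptions of this sketch, since the page does not state them: the trailing overlap is capped so that no character of the shorter term is counted twice, and two empty strings are scored 1.0.

```java
// Hypothetical sketch; names are illustrative and not taken from the real source.
class OverlapSketch {
    // Number of matching characters counted from the start of both strings.
    static int leadingOverlap(String a, String b) {
        int limit = Math.min(a.length(), b.length());
        int i = 0;
        while (i < limit && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    // Number of matching characters counted from the end, at most maxLen.
    static int trailingOverlap(String a, String b, int maxLen) {
        int i = 0;
        while (i < maxLen
                && a.charAt(a.length() - 1 - i) == b.charAt(b.length() - 1 - i)) {
            i++;
        }
        return i;
    }

    static double overlapScore(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        if (longer == 0) {
            return 1.0; // assumption: two empty strings are treated as identical
        }
        int shorter = Math.min(a.length(), b.length());
        int lead = leadingOverlap(a, b);
        // Assumption: cap the suffix so prefix and suffix never overlap.
        int trail = trailingOverlap(a, b, shorter - lead);
        return (double) (lead + trail) / longer;
    }

    public static void main(String[] args) {
        // truely vs truly: (3 + 2) / 6
        System.out.println(overlapScore("truely", "truly"));
    }
}
```

With the cap in place, identical strings score exactly 1.0 instead of counting every character as both leading and trailing overlap.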
- Orthographic score:
= 1.00 * 0.904 + 0.70 * 1.0 + 0.80 * 0.83
= 2.27