CSpell

Token Similarity Score

Introduction

The token similarity score is calculated by cost of edit distance and the penalty of splits. It is different from edit distance:

  • Edit distance is calculated by deletion, insertion, substitution, and transposition with cost value of 1 for each operation. Thus, the value of edit distance of two strings are counted as the minimum number of operations required to transform from one string into the other one.
  • Token similarity uses different cost values to calculate edit distance cost. It is combined with split penalty, then normalized with ceiling of 1000 (see below).

Source Codes & Algorithm:

  • EditDistance.java
    => Enhanced features with options to specify costs:
    • delete cost
    • insert cost
    • substitute (replace) cost
    • transpose (swap) cost
    • case change cost
    • Enhanced flag for cases of: swap + case changes

  • Examples:

    String-1String-2Edit DistanceNotes
    SpellSpell0same strings
    SpellSpel1delete
    SpellSpeell1insert
    SpellSpall1substitute
    SpellSepll1transpose
    Spellspell1case change
    SpellSp ell1split
    SpellSeplls2transpose, insert

  • TokenSimilarity.java
    • Get total cost of edit distance for two strings:
      • delete cost = 96 (enhanced from 95)
      • insert cost = 90 (enhanced from 95)
      • substitute cost = 100
      • transpose cost = 94 (enhanced from 95)
      • case change cost = 10
    • Get penalty for split
      • split penalty (cost) = insert cost = 90, for each split
    • Similarity Score:
      • total cost = Edit Distance cost + split penalty
      • similarity score = (ceiling - cost)/ceiling, where ceiling = 1000
      • 0.00 <= similarity score <= 1.00

    • Examples:

      String-1String-2Edit Distance CostSplit PenaltyToken Similarity ScoreNotes
      SpellSpell001.000same strings
      SpellSpel9600.904delete
      SpellSpeell9000.910insert
      SpellSpall10000.900substitute
      SpellSepll9400.906transpose
      Spellspell1000.990case change
      SpellSp ell90900.820split
      SpellSeplls18400.816transpose, insert