Because of a lapse in government funding, the information on this website may not be up to date, transactions submitted via the website may not be processed, and the agency may not be able to respond to inquiries until appropriations are enacted. The NIH Clinical Center (the research hospital of NIH) is open. For more details about its operating status, please visit cc.nih.gov. Updates regarding government operating status and resumption of normal operations can be found at OPM.gov

CSpell

Token Similarity Score

Introduction

The token similarity score is calculated by cost of edit distance and the penalty of splits. It is different from edit distance:

  • Edit distance is calculated by deletion, insertion, substitution, and transposition with cost value of 1 for each operation. Thus, the value of edit distance of two strings are counted as the minimum number of operations required to transform from one string into the other one.
  • Token similarity uses different cost values to calculate edit distance cost. It is combined with split penalty, then normalized with ceiling of 1000 (see below).

Source Codes & Algorithm:

  • EditDistance.java
    => Enhanced features with options to specify costs:
    • delete cost
    • insert cost
    • substitute (replace) cost
    • transpose (swap) cost
    • case change cost
    • Enhanced flag for cases of: swap + case changes

  • Examples:

    String-1String-2Edit DistanceNotes
    SpellSpell0same strings
    SpellSpel1delete
    SpellSpeell1insert
    SpellSpall1substitute
    SpellSepll1transpose
    Spellspell1case change
    SpellSp ell1split
    SpellSeplls2transpose, insert

  • TokenSimilarity.java
    • Get total cost of edit distance for two strings:
      • delete cost = 96 (enhanced from 95)
      • insert cost = 90 (enhanced from 95)
      • substitute cost = 100
      • transpose cost = 94 (enhanced from 95)
      • case change cost = 10
    • Get penalty for split
      • split penalty (cost) = insert cost = 90, for each split
    • Similarity Score:
      • total cost = Edit Distance cost + split penalty
      • similarity score = (ceiling - cost)/ceiling, where ceiling = 1000
      • 0.00 <= similarity score <= 1.00

    • Examples:

      String-1String-2Edit Distance CostSplit PenaltyToken Similarity ScoreNotes
      SpellSpell001.000same strings
      SpellSpel9600.904delete
      SpellSpeell9000.910insert
      SpellSpall10000.900substitute
      SpellSepll9400.906transpose
      Spellspell1000.990case change
      SpellSp ell90900.820split
      SpellSeplls18400.816transpose, insert