Token Similarity Score
Introduction
The token similarity score is calculated by cost of edit distance and the penalty of splits. It is different from edit distance:
- Edit distance is calculated by deletion, insertion, substitution, and transposition with cost value of 1 for each operation. Thus, the value of edit distance of two strings are counted as the minimum number of operations required to transform from one string into the other one.
- Token similarity uses different cost values to calculate edit distance cost. It is combined with split penalty, then normalized with ceiling of 1000 (see below).
Source Codes & Algorithm:
- EditDistance.java
=> Enhanced features with options to specify costs:
- delete cost
- insert cost
- substitute (replace) cost
- transpose (swap) cost
- case change cost
- Enhanced flag for cases of: swap + case changes
- Examples:
String-1 | String-2 | Edit Distance | Notes
|
---|
Spell | Spell | 0 | same strings
|
Spell | Spel | 1 | delete
|
Spell | Speell | 1 | insert
|
Spell | Spall | 1 | substitute
|
Spell | Sepll | 1 | transpose
|
Spell | spell | 1 | case change
|
Spell | Sp ell | 1 | split
|
Spell | Seplls | 2 | transpose, insert
|
- TokenSimilarity.java
- Get total cost of edit distance for two strings:
- delete cost = 96 (enhanced from 95)
- insert cost = 90 (enhanced from 95)
- substitute cost = 100
- transpose cost = 94 (enhanced from 95)
- case change cost = 10
- Get penalty for split
- split penalty (cost) = insert cost = 90, for each split
- Similarity Score:
- total cost = Edit Distance cost + split penalty
- similarity score = (ceiling - cost)/ceiling, where ceiling = 1000
- 0.00 <= similarity score <= 1.00
- Examples:
String-1 | String-2 | Edit Distance Cost | Split Penalty | Token Similarity Score | Notes
|
---|
Spell | Spell | 0 | 0 | 1.000 | same strings
|
Spell | Spel | 96 | 0 | 0.904 | delete
|
Spell | Speell | 90 | 0 | 0.910 | insert
|
Spell | Spall | 100 | 0 | 0.900 | substitute
|
Spell | Sepll | 94 | 0 | 0.906 | transpose
|
Spell | spell | 10 | 0 | 0.990 | case change
|
Spell | Sp ell | 90 | 90 | 0.820 | split
|
Spell | Seplls | 184 | 0 | 0.816 | transpose, insert
|