CSpell

Overlap Similarity Score

Introduction:

This page describes the algorithm of calculating similarity score for leading and trailing characters overlapping. This algorithm is enhanced for better ranking from the Ensemble by adding weights on the leading and ending overlapping (see examples at the end) and use it as tiebreaker when the score of edit distance and phonetic are tie.

Algorithm:

  • not case sensitive
  • Similarity score is between 0.00 and 1.00
    • Enhanced Overlap Algorithm:
      conditionformulaexample
      leadOverLap = minLen Score = (1.0*leadOverlap + 0.1*trailOverlap)/(1.0*maxLen)
      • 123 and 123123
      • leadOverLap = 3 (123)
      • trailOverLap = 3 (123)
      • minLength = 3 (123)
      • maxLength = 6 (123123)
      • score = (3 + 0.3)/6 = 0.55
      trailOverLap = minLen Score = (0.1*leadOverlap + 1.0*trailOverlap)/(1.0*maxLen)
      • ends and leadends
      • leadOverLap = 0
      • trailOverLap = 4 (ends)
      • minLength = 4 (ends)
      • maxLength = 8 (leadends)
      • score = (0 + 4)/8 = 0.50
      else Score = (1.0*leadOverLap + 1.0*trailOverLap)/(1.0*maxLen)
      • dianosed and diagnosed
      • leadOverLap = 3 (dia)
      • trailOverLap = 5 (nosed)
      • minLength = 8 (dianosed)
      • maxLength = 9 (diagnosed)
      • score = (3 + 5)/9 = 0.89

      where:

      • minLen is the shorter length of two input strings
      • maxLen is the longer length of two input strings
      • leadOverLap is the length of matching characters at the beginning
      • trailOverLap is the length of matching characters at the end

    • More examples:
      ExampleString 1string 2CalculationScoreNotes
      Ex-1spelspell=(4+0.1)/50.82spel is closer to spell than speil
      spelspeil=(3+1)/50.80
      Ex-2spellsspell=(0.1+5)/60.85spell is closer to sspell than nspell
      spellnspell=(0+5)/60.83

Source code:

  • OverlapSimilarity.java