Because of a lapse in government funding, the information on this website may not be up to date, transactions submitted via the website may not be processed, and the agency may not be able to respond to inquiries until appropriations are enacted. The NIH Clinical Center (the research hospital of NIH) is open. For more details about its operating status, please visit cc.nih.gov. Updates regarding government operating status and resumption of normal operations can be found at OPM.gov

CSpell

Overlap Similarity Score

Introduction:

This page describes the algorithm of calculating similarity score for leading and trailing characters overlapping. This algorithm is enhanced for better ranking from the Ensemble by adding weights on the leading and ending overlapping (see examples at the end) and use it as tiebreaker when the score of edit distance and phonetic are tie.

Algorithm:

  • not case sensitive
  • Similarity score is between 0.00 and 1.00
    • Enhanced Overlap Algorithm:
      conditionformulaexample
      leadOverLap = minLen Score = (1.0*leadOverlap + 0.1*trailOverlap)/(1.0*maxLen)
      • 123 and 123123
      • leadOverLap = 3 (123)
      • trailOverLap = 3 (123)
      • minLength = 3 (123)
      • maxLength = 6 (123123)
      • score = (3 + 0.3)/6 = 0.55
      trailOverLap = minLen Score = (0.1*leadOverlap + 1.0*trailOverlap)/(1.0*maxLen)
      • ends and leadends
      • leadOverLap = 0
      • trailOverLap = 4 (ends)
      • minLength = 4 (ends)
      • maxLength = 8 (leadends)
      • score = (0 + 4)/8 = 0.50
      else Score = (1.0*leadOverLap + 1.0*trailOverLap)/(1.0*maxLen)
      • dianosed and diagnosed
      • leadOverLap = 3 (dia)
      • trailOverLap = 5 (nosed)
      • minLength = 8 (dianosed)
      • maxLength = 9 (diagnosed)
      • score = (3 + 5)/9 = 0.89

      where:

      • minLen is the shorter length of two input strings
      • maxLen is the longer length of two input strings
      • leadOverLap is the length of matching characters at the beginning
      • trailOverLap is the length of matching characters at the end

    • More examples:
      ExampleString 1string 2CalculationScoreNotes
      Ex-1spelspell=(4+0.1)/50.82spel is closer to spell than speil
      spelspeil=(3+1)/50.80
      Ex-2spellsspell=(0.1+5)/60.85spell is closer to sspell than nspell
      spellnspell=(0+5)/60.83

Source code:

  • OverlapSimilarity.java