CSpell

Performance Tests on Edit Distance Similarity Score

I. Test Setup

  • Data: Training Set
  • Gold Standard: non-word only
  • Dictionary: CSpell (Lexicon-based)
  • Corpus: none
  • Ranking: Edit Distance and Orthographic ranking

II. Test Results

  • Tests with various weighting factors (WF) applied to the edit distance costs (delete, insert, substitute, and transpose). The orthographic WFs are 1.0, 0.0, 0.0, so only the edit distance contributes to the orthographic score.

    ID    Delete  Insert  Substitute  Transpose  Precision  Recall  F1      Notes
    1     1.00    1.00    1.00        1.00       0.7465     0.7494  0.7479  Same ratio of WF
    2     0.95    0.95    0.95        0.95       0.7465     0.7494  0.7479
    3     1.00    0.95    0.95        0.95       0.7439     0.7468  0.7453  Increased 1 WF
    4     0.95    1.00    0.95        0.95       0.7014     0.7041  0.7028
    5     0.95    0.95    1.00        0.95       0.7490     0.7519  0.7505
    6     0.95    0.95    0.95        1.00       0.7477     0.7506  0.7492
    7     0.90    0.95    0.95        0.95       0.7130     0.7158  0.7144  Decreased 1 WF
    8     0.95    0.90    0.95        0.95       0.7555     0.7584  0.7569
    9     0.95    0.95    0.90        0.95       0.7040     0.7067  0.7054
    10    0.95    0.95    0.95        0.90       0.7400     0.7429  0.7415
    11    0.95    0.90    1.00        0.95       0.7593     0.7623  0.7608  Trial and error to find the WF for best F1
    12    0.93    0.90    1.00        0.97       0.7580     0.7610  0.7595
    13    0.97    0.90    1.00        0.93       0.7606     0.7636  0.7621
    99-1  0.90    0.90    1.00        0.95       0.7529     0.7558  0.7544
    99-2  0.90    0.90    0.95        1.00       0.7490     0.7519  0.7505
    99-3  0.95    0.95    1.00        0.90       0.7452     0.7481  0.7466
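To make the four weighting factors concrete, here is a minimal Python sketch of a weighted edit distance in which each operation has its own cost. It is an illustration only: the names `EditWeights` and `weighted_edit_cost` are hypothetical, the restricted (optimal string alignment) form of transposition is an assumption, and treating the misspelling as the source string is also an assumption. CSpell's actual implementation may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditWeights:
    """Per-operation costs (hypothetical container, not a CSpell class)."""
    delete: float = 1.0
    insert: float = 1.0
    substitute: float = 1.0
    transpose: float = 1.0


def weighted_edit_cost(source: str, target: str, w: EditWeights) -> float:
    """Minimum weighted cost of turning source into target.

    Uses the optimal string alignment variant: adjacent transpositions
    are allowed, but a transposed pair is not edited again.
    """
    rows, cols = len(source) + 1, len(target) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):          # delete every source character
        d[i][0] = d[i - 1][0] + w.delete
    for j in range(1, cols):          # insert every target character
        d[0][j] = d[0][j - 1] + w.insert
    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            best = min(
                d[i - 1][j] + w.delete,
                d[i][j - 1] + w.insert,
                d[i - 1][j - 1] + (0.0 if same else w.substitute),
            )
            # adjacent characters swapped between source and target
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                best = min(best, d[i - 2][j - 2] + w.transpose)
            d[i][j] = best
    return d[rows - 1][cols - 1]


# Weighting factors of test 13 from the table above
test_13 = EditWeights(delete=0.97, insert=0.90, substitute=1.00, transpose=0.93)
print(weighted_edit_cost("teh", "the", test_13))     # one transposition
print(weighted_edit_cost("helo", "hello", test_13))  # one insertion
```

With these weights a single insertion is cheaper than a single substitution, so candidates reached by insertion are preferred when the operation count is otherwise equal.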

  • Tests with various weighting factors (WF) applied to the edit distance costs (delete, insert, substitute, and transpose). The orthographic WFs are 1.0, 1.0, 1.0.

    ID    Delete  Insert  Substitute  Transpose  Precision  Recall  F1
    99-4  0.97    0.90    1.00        0.93       0.7593     0.7623  0.7608
    99-5  0.95    0.95    1.00        0.90       0.7606     0.7636  0.7621
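The page does not state how F1 is computed. The reported values are consistent with the usual harmonic mean of precision and recall, which the short check below assumes. The function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from tests 99-4 and 99-5
rows = {
    "99-4": (0.7593, 0.7623, 0.7608),
    "99-5": (0.7606, 0.7636, 0.7621),
}
for test_id, (p, r, reported) in rows.items():
    # computed value rounded to four decimals, next to the reported value
    print(test_id, round(f1_score(p, r), 4), reported)
```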

III. Discussion

  • The results of tests 1 and 2 are identical. That is, weighting factors with the same ratio lead to the same results.
  • From the results of tests 3-6, we observed that the higher the weighting factor of the substitute cost, the better the F1 score (test 5: 0.7505 vs. 0.7479 in test 2).
  • From the results of tests 7-10, we observed that the lower the weighting factor of the insert cost, the better the F1 score (test 8: 0.7569 vs. 0.7479 in test 2).
  • Tests 11 through 99-3 search for the best F1 by trial and error. Test 13 gives the highest F1 (0.7621).

  • Use the weighting factors from test 13 (delete 0.97, insert 0.90, substitute 1.00, transpose 0.93) for the costs of delete, insert, substitute, and transpose.
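The first observation (same ratio, same results) follows from how a weighted cost behaves: multiplying all four weighting factors by the same constant multiplies every candidate's cost by that constant, so the order of candidates does not change. The sketch below illustrates this with made-up candidates described only by their operation counts. The candidate names, the counts, and the helpers `weighted_cost` and `rank` are hypothetical and are not CSpell data or API.

```python
from typing import NamedTuple


class Weights(NamedTuple):
    delete: float
    insert: float
    substitute: float
    transpose: float


class OpCounts(NamedTuple):
    delete: int
    insert: int
    substitute: int
    transpose: int


TEST_1 = Weights(1.00, 1.00, 1.00, 1.00)
TEST_2 = Weights(0.95, 0.95, 0.95, 0.95)
TEST_13 = Weights(0.97, 0.90, 1.00, 0.93)

# Hypothetical candidates: number of each operation needed to reach them
CANDIDATES = {
    "one_insert": OpCounts(0, 1, 0, 0),
    "one_substitute": OpCounts(0, 0, 1, 0),
    "delete_plus_transpose": OpCounts(1, 0, 0, 1),
}


def weighted_cost(ops: OpCounts, w: Weights) -> float:
    """Sum of operation counts times their weighting factors."""
    return sum(count * weight for count, weight in zip(ops, w))


def rank(w: Weights) -> list:
    """Candidate names ordered by cost, with the name as tie-breaker."""
    return sorted(CANDIDATES, key=lambda n: (weighted_cost(CANDIDATES[n], w), n))


print(rank(TEST_1) == rank(TEST_2))  # uniform scaling keeps the order
for name in rank(TEST_13):
    print(name, weighted_cost(CANDIDATES[name], TEST_13))
```

Under the uniform weights of tests 1 and 2 the single insertion and the single substitution cost the same, while the test 13 weights separate them, which is one way unequal weighting factors can change which candidate is ranked first.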