CSpell

Dictionary Functions - Check Proper Noun

I. Introduction

Proper nouns should be checked separately for spelling errors to improve spell-checking performance. Proper nouns can appear in several case patterns, including mixed case, as shown in the table below.

Capitalized    Aachen, Beyer, Colgate
Mixed Cases    zur Hausen, ABC Medical Center, al-Tawil
lower case     amicon, coll, dang
upper case     BCDE, BSMMU, CINAHL
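
The four case patterns in the table above can be told apart with a few string checks. The following is a minimal sketch, not CSpell's implementation; the function name `case_pattern` and the exact rules (for example, ignoring spaces and hyphens when looking at letter case) are assumptions made for illustration.

```python
def case_pattern(term: str) -> str:
    """Classify a term's casing into one of the four rows of the table.

    Hypothetical helper: only alphabetic characters are considered, so
    spaces and hyphens (as in "al-Tawil") do not affect the result.
    """
    letters = [ch for ch in term if ch.isalpha()]
    if not letters:
        return "no letters"
    if all(ch.islower() for ch in letters):
        return "lower case"
    if all(ch.isupper() for ch in letters):
        return "upper case"
    # Capitalized: the first letter is uppercase and all others are lowercase.
    if letters[0].isupper() and all(ch.islower() for ch in letters[1:]):
        return "capitalized"
    # Anything else (e.g. "zur Hausen", "ABC Medical Center") is mixed case.
    return "mixed case"


if __name__ == "__main__":
    for term in ["Aachen", "zur Hausen", "al-Tawil", "amicon", "CINAHL"]:
        print(term, "->", case_pattern(term))
```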

II. Approaches

Three approaches are compared as follows:

  • By Algorithm:
    • As implemented in the baseline, proper nouns are detected by an algorithm:
      • Capitalized case
  • By Data - case sensitive:
    • Use proper nouns from the Lexicon
    • Use a case-sensitive dictionary
  • By Data - case insensitive:
    • Use proper nouns from the Lexicon
    • Use a case-insensitive dictionary
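
The three approaches differ only in how a word is tested for being a proper noun. The sketch below illustrates that difference under stated assumptions: `PROPER_NOUNS` is a tiny hypothetical stand-in for the proper nouns taken from the Lexicon, the function names are invented for this example, and the "capitalized case" rule is one plausible reading of the baseline algorithm rather than its actual code.

```python
# Hypothetical stand-in for the proper nouns taken from the Lexicon.
PROPER_NOUNS = {"Aachen", "Colgate", "al-Tawil", "CINAHL"}

# Lowercased copy used for the case-insensitive lookup.
PROPER_NOUNS_LOWER = {term.lower() for term in PROPER_NOUNS}


def by_algorithm(word: str) -> bool:
    """By Algorithm: a word counts as a proper noun if it is capitalized.

    Assumed rule: first character uppercase, the rest lowercase.
    Single-character words are not treated as proper nouns here.
    """
    return word[:1].isupper() and word[1:].islower()


def by_data_case_sensitive(word: str) -> bool:
    """By Data, case sensitive: the exact spelling must be in the dictionary."""
    return word in PROPER_NOUNS


def by_data_case_insensitive(word: str) -> bool:
    """By Data, case insensitive: compare lowercased forms."""
    return word.lower() in PROPER_NOUNS_LOWER


if __name__ == "__main__":
    for word in ["Aachen", "aachen", "al-Tawil", "CINAHL", "Zzyzx"]:
        print(
            word,
            by_algorithm(word),
            by_data_case_sensitive(word),
            by_data_case_insensitive(word),
        )
```

In this sketch the algorithm accepts any capitalized word, whether or not it is a known proper noun, while the two data approaches accept only dictionary entries and differ on inputs such as "aachen".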

III. Results

Test results with Single-Word, English-Word as the dictionary:

Approach        TP|Ret|Rel     Precision   Recall   F1
Algorithm       521|710|814    0.7338      0.6400   0.6837
Data-Case       537|755|814    0.7113      0.6597   0.6845
Data-No Case    537|751|814    0.7150      0.6597   0.6863
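
The precision, recall, and F1 columns can be recomputed from the TP|Ret|Rel counts. The sketch below assumes that TP is the number of true positives, Ret the number of retrieved (flagged) instances, and Rel the number of relevant instances, which is the standard reading of these abbreviations; the function name `prf` is hypothetical.

```python
def prf(tp: int, ret: int, rel: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from true-positive, retrieved, and relevant counts."""
    precision = tp / ret   # fraction of retrieved instances that are correct
    recall = tp / rel      # fraction of relevant instances that were retrieved
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Counts copied from the TP|Ret|Rel column of the table.
COUNTS = {
    "Algorithm": (521, 710, 814),
    "Data-Case": (537, 755, 814),
    "Data-No Case": (537, 751, 814),
}

if __name__ == "__main__":
    for name, (tp, ret, rel) in COUNTS.items():
        p, r, f = prf(tp, ret, rel)
        print(f"{name:<14}{p:.4f}  {r:.4f}  {f:.4f}")
```

Because both data approaches share the same TP and Rel counts, their recall is identical; only the retrieved count, and therefore precision and F1, differs.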

  • With the data approaches, F1 and recall increase and precision decreases compared to the algorithm approach.
  • The [TP] count is the same for the two data approaches; the difference in retrieved instances (755 vs. 751) consists of 4 [FP]:
    • 14276 prego preg => Prego, the no-case approach is not right
    • 16167 thier ther => Thier, the no-case approach is not right
    • 17055 veracruz vera cruz => Veracruz, the no-case approach is good
    • 17991 gujarat gujar at => Gujarat, the no-case approach is good

    => The case-sensitive approach is right in only about 50% of these cases (2 of 4), and it results in worse precision and F1 than the case-insensitive approach, because a 50% rate is below the overall precision (above 70%) and F1 (about 68%). Thus, the case-insensitive data approach is implemented. One of the main reasons for ignoring case is that users (consumers) might type proper nouns in lowercase, uppercase, or mixed case, so the chance that a case difference indicates a real spelling error is about 50/50.
  • Using the case-sensitive data approach could increase recall (by finding more spelling errors), but it would then rely on the ranking algorithm to find the correct word in order to improve precision.