CSpell

Dictionary Functions - Check Abb/Acr

I. Introduction

Abbreviations and acronyms (abb/acr) should be checked separately from other words during spelling error detection to improve performance. They can appear in any casing, including mixed case, as shown in the table below.

Capitalized   Ara Us, Ast, Aza
Mixed case    AChE, ACh-esterase, Ala, araA, araCTP, Armenian S.S.R.
Lower case    cdc kinase subunit 1, eppin, mael
Upper case    AA, AB, ABC

II. Approaches

Three approaches are compared as follows:

  • By Algorithm:
    • As implemented in baseline, abbreviations and acronyms are detected by algorithm:
      • Upper case
      • Length <= 4
  • By Data - case sensitive:
    • Use abb/acr Lexicon
    • Use case sensitive dictionary
  • By Data - case insensitive:
    • Use abb/acr Lexicon
    • Use case-insensitive dictionary
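
The three approaches above can be sketched as follows. This is a minimal illustration, not the cSpell implementation: the function names and SAMPLE_LEXICON are hypothetical, the sample entries are taken from examples in this document rather than from the real abb/acr lexicon, and the sketch assumes that the algorithmic check requires both conditions (upper case and length <= 4) to hold.

```python
# Hypothetical sample lexicon; entries come from examples in this document,
# not from the real abb/acr lexicon.
SAMPLE_LEXICON = frozenset({"AChE", "AJD", "C-PAP", "IHHs", "ING", "RplY", "T1D"})


def is_aa_by_algorithm(token, max_len=4):
    """By algorithm: all cased characters are upper case and length <= max_len."""
    return token.isupper() and len(token) <= max_len


def is_aa_case_sensitive(token, lexicon=SAMPLE_LEXICON):
    """By data, case sensitive: the token must match a lexicon entry exactly."""
    return token in lexicon


def is_aa_case_insensitive(token, lexicon=SAMPLE_LEXICON):
    """By data, case insensitive: compare lowercased forms."""
    lowered = {entry.lower() for entry in lexicon}
    return token.lower() in lowered


if __name__ == "__main__":
    for token in ("ABC", "AChE", "t1d", "c-pap"):
        print(
            token,
            is_aa_by_algorithm(token),
            is_aa_case_sensitive(token),
            is_aa_case_insensitive(token),
        )
```

In this sketch a lowercase input such as "t1d" is treated as an abbreviation only by the case-insensitive check, which is the situation the results section examines.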

III. Results

Test cSpell with Single-Word, English-Word as dictionary:

Approach       TP|Ret|Rel    Precision  Recall  F1
Algorithm      521|710|814   0.7338     0.6400  0.6837
Data-Case      522|710|814   0.7352     0.6413  0.6850
Data-No Case   521|703|814   0.7411     0.6400  0.6869
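
The metric columns can be reproduced from the counts. The sketch below assumes that Ret is the number of retrieved (corrected) instances and Rel is the number of relevant (gold standard) instances, so that precision = TP/Ret and recall = TP/Rel; the function name prf is hypothetical. Under that reading, the computed values match the table.

```python
def prf(tp, ret, rel):
    """Return (precision, recall, F1) from true positives, retrieved, relevant."""
    precision = tp / ret
    recall = tp / rel
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts copied from the results table (TP, Ret, Rel).
RESULTS = {
    "Algorithm": (521, 710, 814),
    "Data-Case": (522, 710, 814),
    "Data-No Case": (521, 703, 814),
}

if __name__ == "__main__":
    for approach, counts in RESULTS.items():
        p, r, f1 = prf(*counts)
        print(f"{approach:<13} {p:.4f} {r:.4f} {f1:.4f}")
```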

  • With the data approaches, F1 and precision increase while recall stays about the same. Using data is therefore a better way to filter out abbreviations and acronyms.
  • The difference between the two data approaches consists of 1 [TP] and 6 [FP]:
    • [TP] 16282 rply reply => RplY, case is good

    • [FP] 10008 c-pap cpap => C-PAP, no case is good
    • [FP] 12374 ajd ad => AJD, no case is bad
    • [FP] 13598 t1d ted => T1D, no case is good
    • [FP] 21 ihhs ihsa => IHHs, no case is good
    • [FP] 21 ihhs iha => IHHs, no case is good
    • [FP] 73 ing wing => ING, no case is bad

    => The case-insensitive approach is better in about 57% of these cases (4 of 7, or 4:3), and it results in better precision and F1 than the case-sensitive approach (precision 0.7411 vs. 0.7352, F1 0.6869 vs. 0.6850). Thus, the case-insensitive data approach is implemented. One of the main reasons for using case-insensitive matching is that users (consumers) might type abb/acr in lowercase, uppercase, or mixed case, so the chance of the input matching the expected case is about 50/50.
  • Using the case-sensitive data approach could increase recall (by finding more spelling errors), but it would rely on the ranking algorithm to find the correct word in order to improve precision.
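
The 4:3 tally over the seven differing cases listed above can be reproduced with a short count. The labels below are one reading of the annotations: "no case is good" is counted for the case-insensitive approach, while "case is good" and "no case is bad" are counted for the case-sensitive approach. The variable names are hypothetical.

```python
# (type, input token, approach favored by the annotation) for each case
# listed above.
CASES = [
    ("TP", "rply", "case"),      # case is good
    ("FP", "c-pap", "no-case"),  # no case is good
    ("FP", "ajd", "case"),       # no case is bad
    ("FP", "t1d", "no-case"),    # no case is good
    ("FP", "ihhs", "no-case"),   # no case is good
    ("FP", "ihhs", "no-case"),   # no case is good
    ("FP", "ing", "case"),       # no case is bad
]

no_case_wins = sum(1 for _, _, favored in CASES if favored == "no-case")
case_wins = len(CASES) - no_case_wins
no_case_share = no_case_wins / len(CASES)

if __name__ == "__main__":
    print(f"{no_case_wins}:{case_wins} ({no_case_share:.0%} favor case-insensitive)")
```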

IV. Results of Combining ProperNoun and AA

Test cSpell with Single-Word, English-Word as dictionary:

Approach       TP|Ret|Rel    Precision  Recall  F1
Algorithm      521|710|814   0.7338     0.6400  0.6837
Data-Case      538|757|814   0.7107     0.6609  0.6849
Data-No Case   537|745|814   0.7208     0.6597  0.6889