CSpell

Non-word Correction

This page describes the algorithm for non-word correction.

I. Functions

II. Results on Training Set

CSpell ranking mode was tested on the development set for non-word correction with different function modes:

Function Mode                          Raw data        Performance
ESpell                                 230|1180|774    0.1949|0.2972|0.2354
Jazzy (ASpell)                         186|393|774     0.4733|0.2403|0.3188
Ensemble                               552|825|774     0.6691|0.7132|0.6904
CSpell, non-dictionary-based
non-dictionary-based                   340|373|774     0.9115|0.4393|0.5929
CSpell, non-word, Single Function
1-to-1                                 588|699|774     0.8412|0.7597|0.7984
Split                                  365|469|774     0.7783|0.4716|0.5873
Merge                                  343|382|774     0.8979|0.4432|0.5934
CSpell, non-word, Combined Functions
1-to-1 + Split                         603|724|774     0.8329|0.7791|0.8051
1-to-1 + Split + Merge                 606|731|774     0.8290|0.7829|0.8053

From the results:

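  • The best CSpell configuration (1-to-1 + Split + Merge, third performance value 0.8053) improves on the Ensemble baseline (0.6904) by about 11.5 percentage points.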
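
The performance columns can be reproduced from the raw data. The sketch below assumes that the three raw numbers are true positives, returned corrections, and relevant corrections, and that the three performance numbers are precision, recall, and F1. The page does not label them, but this reading reproduces the values in the table. The function name prf is made up for this illustration.

```python
def prf(true_positives, returned, relevant):
    """Return (precision, recall, F1), each rounded to 4 decimals.

    Assumption: the raw-data triple in the table is
    true positives | returned | relevant.
    """
    precision = true_positives / returned
    recall = true_positives / relevant
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 4), round(recall, 4), round(f1, 4)


# Rows copied from the table above.
ensemble = prf(552, 825, 774)   # Ensemble
combined = prf(606, 731, 774)   # 1-to-1 + Split + Merge

# Absolute F1 difference between the two rows.
gain = round(combined[2] - ensemble[2], 3)
print(ensemble, combined, gain)
```

Under this reading, the 11.5% figure is an absolute difference in F1 rather than a relative improvement.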

III. Examples

  • ND (non-dictionary-based):

    ID    Input                 Output                 Notes
    ND-1  "Good"                "Good"                 Xml/Html handler
    ND-2  pls                   please                 Informal Expression handler
    ND-3  20years               20 years               Leading Digit Splitter
    ND-4  from2007              from 2007              Ending Digit Splitter
    ND-5  volunteers(healthy)   volunteers (healthy)   Leading Punctuation Splitter
    ND-6  pain.help!            pain. help!            Ending Punctuation Splitter
    ND-7  pain.pls help!        pain. please help!     Combo
    ND-8  visit at pain.com!    visit at pain.com!     No correction!
    • Splitters and handlers are applied in a Java 8 stream operation for non-dictionary-based corrections.
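
The chaining of splitters and handlers can be sketched as a sequence of token-level steps. This is a minimal Python illustration, not CSpell's Java code: the regular expressions, the one-entry informal-expression table, and the domain check that protects "pain.com!" are all assumptions made for this example, and the real exception lists are not shown on this page. The Xml/Html handler (ND-1) is left out.

```python
import re

# Hypothetical rule data, invented for this sketch.
INFORMAL = {"pls": "please"}
DOMAIN = re.compile(r"\.(com|org|net|gov|edu)\b", re.IGNORECASE)


def split_ending_punct(tok):
    # "pain.help!" -> "pain. help!"; domain-like tokens are left alone.
    if DOMAIN.search(tok):
        return tok
    return re.sub(r"([.,;:!?])(?=[A-Za-z])", r"\1 ", tok)


def split_leading_punct(tok):
    # "volunteers(healthy)" -> "volunteers (healthy)"
    return re.sub(r"(?<=\w)([(\[{])", r" \1", tok)


def split_leading_digit(tok):
    # "20years" -> "20 years"
    return re.sub(r"^(\d+)([A-Za-z]+)$", r"\1 \2", tok)


def split_ending_digit(tok):
    # "from2007" -> "from 2007"
    return re.sub(r"^([A-Za-z]+)(\d+)$", r"\1 \2", tok)


def expand_informal(tok):
    # "pls" -> "please"
    return INFORMAL.get(tok.lower(), tok)


STEPS = [split_ending_punct, split_leading_punct,
         split_leading_digit, split_ending_digit, expand_informal]


def correct_nd(text):
    """Apply every step to every token, re-tokenizing after each step."""
    tokens = text.split()
    for step in STEPS:
        tokens = [piece for tok in tokens for piece in step(tok).split()]
    return " ".join(tokens)


for example in ["pls", "20years", "from2007", "volunteers(healthy)",
                "pain.help!", "pain.pls help!", "visit at pain.com!"]:
    print(example, "->", correct_nd(example))
```

Re-tokenizing after each step is what lets the combined case ND-7 work: the punctuation splitter first exposes "pls" as its own token, and the informal-expression step then expands it. Tokens such as "B12" or "1st" would be split incorrectly by these simplified patterns.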

  • NW, Merge:

    ID   Input          Output         Notes
    M-1  dur ing        during         Merge
    M-2  non drug       nondrug        Merge
    M-3  non protein    non-protein    Merge with hyphen
    M-4  non surgical   non surgical   No merge
    • Examples M-2, M-3, and M-4: whether the tokens are left separated by a space, merged into one word, or merged with a hyphen depends on the spVars, the context, and the frequency.
    • "non" is an element-non-word; it is used in the non-word merge operation.
    • Most element words are valid single words. However, a few of them are invalid single words, such as "non", "se", "pre", "vitro", "vivo", and "intra". These are element-non-words and exist only within multiwords:
      multiword               Element-non-word
      non surgical            non
      in vitro                vitro
      in vivo grown           vivo
      intra articular route   intra
      per se                  se
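
The merge decision and the notion of an element-non-word can be sketched as follows. This is a simplified Python illustration under stated assumptions: it ranks the three surface forms by frequency only and ignores spVars and context, and every frequency count and the single-word set are invented for the example. The names TOY_FREQ, best_merge, and element_non_words are hypothetical.

```python
# Invented lexicon entries and frequency counts, for illustration only.
TOY_FREQ = {
    "during": 5000,
    "nondrug": 40,
    "non-protein": 25,
    "non surgical": 60,   # multiword containing the element-non-word "non"
    "nonsurgical": 30,
}


def merge_candidates(left, right):
    """The three surface forms considered for two adjacent tokens."""
    return [f"{left} {right}", f"{left}{right}", f"{left}-{right}"]


def best_merge(left, right, freq=TOY_FREQ):
    """Pick the known candidate with the highest frequency.

    If no candidate is known, the input is kept unchanged.
    """
    known = [c for c in merge_candidates(left, right) if c in freq]
    if not known:
        return f"{left} {right}"
    return max(known, key=lambda c: freq[c])


def element_non_words(multiwords, single_words):
    """Elements of multiwords that are not valid words on their own."""
    return {w for mw in multiwords for w in mw.split()
            if w not in single_words}


print(best_merge("dur", "ing"))        # M-1
print(best_merge("non", "drug"))       # M-2
print(best_merge("non", "protein"))    # M-3
print(best_merge("non", "surgical"))   # M-4

multiwords = ["non surgical", "in vitro", "in vivo grown",
              "intra articular route", "per se"]
# Assumed set of valid single words for this toy example.
single_words = {"surgical", "in", "grown", "articular", "route", "per"}
print(sorted(element_non_words(multiwords, single_words)))
```

With these toy counts, "non surgical" stays unmerged because the multiword form outranks "nonsurgical", which mirrors example M-4.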

  • NW, 1To1:

    ID   Input                  Output
    1-1  good diagnosised       good diagnosis
    1-2  was diagnosised with   was diagnosed with
    • In example 1-1, diagnosised is corrected to diagnosis, which has the best orthographic score. In example 1-2, however, it is corrected to diagnosed because of the context. From these two examples, we observed that the unsupervised context score model captures certain syntactic and semantic regularities.
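
The interplay between context and orthographic scores can be sketched with a toy ranker. This is not the CSpell model: invented bigram counts stand in for the unsupervised context score model, the orthographic scores are made-up numbers that only preserve the ordering stated above (diagnosis scores best), and the rule "context first, orthographic score as tie-breaker" is an assumption of this sketch.

```python
# Invented bigram counts standing in for a trained context model.
TOY_BIGRAMS = {
    ("was", "diagnosed"): 80,
    ("diagnosed", "with"): 120,
    ("diagnosis", "with"): 10,
}

# Invented orthographic scores for the candidates of "diagnosised".
TOY_ORTHO = {"diagnosis": 0.90, "diagnosed": 0.85}


def context_score(left, candidate, right, bigrams=TOY_BIGRAMS):
    """Sum the counts of the bigrams the candidate forms with its neighbors."""
    score = 0
    if left is not None:
        score += bigrams.get((left, candidate), 0)
    if right is not None:
        score += bigrams.get((candidate, right), 0)
    return score


def rank(left, right, ortho=TOY_ORTHO):
    """Best candidate by context score; orthographic score breaks ties."""
    return max(ortho, key=lambda c: (context_score(left, c, right), ortho[c]))


print("good", rank("good", None))          # example 1-1
print("was", rank("was", "with"), "with")  # example 1-2
```

In the toy data, "good" provides no context evidence, so the orthographic score decides example 1-1, while the neighbors "was" and "with" favor diagnosed in example 1-2.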

  • NW, Split:

    ID   Input               Output
    S-1  thankyou            thank you
    S-2  shuntfrom2007.how   shunt from 2007. how
    • Example S-2 shows combined corrections from the ND and NW splitters.
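
The two split examples can be sketched by running a simplified ND split first and a dictionary-based NW split on each resulting piece. This is a Python illustration under stated assumptions: the word list and frequency counts are invented, only two-way splits are tried, candidates are scored by the frequency of their rarer part, and the function names are hypothetical.

```python
import re

# Invented word frequencies, for illustration only.
TOY_FREQ = {"thank": 800, "you": 9000, "than": 3000,
            "shunt": 50, "from": 7000, "how": 4000}


def nd_split(token):
    """Simplified ND step: split after punctuation and before digits."""
    token = re.sub(r"([.,;:!?])(?=[A-Za-z])", r"\1 ", token)  # "2007.how" -> "2007. how"
    token = re.sub(r"([A-Za-z])(\d)", r"\1 \2", token)         # "from2007" -> "from 2007"
    return token


def nw_split(token, freq=TOY_FREQ):
    """Split an unknown token into two known words, if possible."""
    if token in freq:
        return token
    options = [(token[:i], token[i:]) for i in range(1, len(token))
               if token[:i] in freq and token[i:] in freq]
    if not options:
        return token
    # Score a split by its rarer part, so both halves must be plausible.
    left, right = max(options, key=lambda p: min(freq[p[0]], freq[p[1]]))
    return f"{left} {right}"


def correct(text):
    """Run the ND split, then the NW split on each resulting piece."""
    pieces = [p for tok in text.split() for p in nd_split(tok).split()]
    return " ".join(nw_split(p) for p in pieces)


print(correct("thankyou"))            # S-1
print(correct("shuntfrom2007.how"))   # S-2
```

For S-2, the ND step yields the pieces "shuntfrom", "2007.", and "how", and only "shuntfrom" is then split by the NW step.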