CSpell

Non-word Spelling (1-To-1)

I. Introduction

This page describes the processes for non-word spelling (1-to-1) detection and correction.

II. Processes

  • Detector:
    NonWordDetector.java
    • non-word: an invalid word, i.e. a word not in checkDic (checkDic includes EW, NUM, etc.)
    • Not one of the exceptions: digit, punctuation, digit/punctuation, email, url, empty string, upperCase, 1Char, measurement
  • Candidates:
    OneToOneCandidates.java
    • max. length of word <= 25 (configurable: CS_CAN_NW_1TO1_WORD_MAX_LENGTH)
      Longer non-words generate too many candidates, which slows processing. This variable limits that cost. Recall might decrease if the value is set too small.

    • Edit Dist <= 2
    • candidate is in the suggDic (valid word)
  • Ranker:
    RankNonWordByMode.java,
    uses the top ranked candidate in the two-stage ranking system for correction:
    • Stage-1:
      • Orthographic score
        • Edit Distance Similarity score
        • Phonetic Similarity score (Double Metaphone)
        • Overlap Similarity score
      • Find the top orthographic score
      • Stage 1 Range factor for qualifying candidate = 0.08 (configurable: CS_RANKER_NW_S1_RANK_RANGE_FAC)
        All candidates whose orthographic score is within 0.08 of the top orthographic score are selected as qualified candidates for the final ranking in stage-2. That is, candidates whose orthographic score is at least 92% of the top score qualify for stage-2 ranking.
      • The ranks by orthographic score in this stage are disregarded in stage-2
    • Stage-2:
      Use chain comparators in a sequential order of the following scores:
  • Corrector:
    OneToONeCorrector.java
    • Update the focus token with the top rank candidate
    • Update process history to non-word-1-to-1
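
The detector rules above can be sketched as follows. This is a minimal illustration, not the actual NonWordDetector.java code: the class and method names are hypothetical, only some of the listed exceptions are covered (email, url and measurement are omitted), "upperCase" is assumed to mean an all-uppercase token, and the dictionary lookup is assumed to be case-insensitive.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the non-word test; not the NonWordDetector.java source.
class NonWordCheck {
    // Covers only a subset of the exceptions listed in the text.
    static boolean isException(String token) {
        if (token.isEmpty()) return true;                    // empty string
        if (token.length() == 1) return true;                // 1Char
        if (token.matches("[\\p{Punct}\\d]+")) return true;  // digit, punctuation, or a mix of both
        // ASSUMPTION: "upperCase" means every cased letter is uppercase.
        boolean hasLetters = !token.equals(token.toLowerCase());
        if (hasLetters && token.equals(token.toUpperCase())) return true;
        return false;
    }

    // A token is a non-word if it is not an exception and not in checkDic.
    // ASSUMPTION: checkDic holds lowercase entries.
    static boolean isNonWord(String token, Set<String> checkDic) {
        return !isException(token) && !checkDic.contains(token.toLowerCase());
    }

    public static void main(String[] args) {
        Set<String> checkDic = new HashSet<>(Arrays.asList("knowledge", "truly"));
        for (String t : new String[] {"knoledge", "knowledge", "B", "123", "NIH"}) {
            System.out.println(t + " -> " + isNonWord(t, checkDic));
        }
    }
}
```

With this toy dictionary only "knoledge" is reported as a non-word; the other tokens are either valid words or exceptions.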

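The stage-1 qualification step of the ranker can be sketched as follows. This is a minimal illustration, not the RankNonWordByMode.java code: the class and method names are hypothetical, the scores in main are made up, and the range is assumed to be relative to the top score (score >= top * (1 - 0.08)), following the "92%" reading in the text.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of stage-1 qualification; not the RankNonWordByMode.java source.
class Stage1RangeFilter {
    // Mirrors the default value of CS_RANKER_NW_S1_RANK_RANGE_FAC.
    static final double RANGE_FACTOR = 0.08;

    // Returns the candidates whose orthographic score is at least
    // (1 - rangeFactor) * topScore, in their original order.
    // ASSUMPTION: the range is relative to the top score and scores are non-negative.
    static List<String> qualify(Map<String, Double> orthoScores, double rangeFactor) {
        double top = Double.NEGATIVE_INFINITY;
        for (double s : orthoScores.values()) {
            top = Math.max(top, s);
        }
        double cutoff = top * (1.0 - rangeFactor);
        List<String> qualified = new ArrayList<>();
        for (Map.Entry<String, Double> e : orthoScores.entrySet()) {
            if (e.getValue() >= cutoff) {
                qualified.add(e.getKey());
            }
        }
        return qualified;
    }

    public static void main(String[] args) {
        // Made-up orthographic scores for illustration only.
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("diagnosed", 2.50);
        scores.put("diagnosis", 2.40);
        scores.put("diagnose", 2.10);
        // Cutoff is 2.50 * 0.92 = 2.30, so "diagnose" is dropped.
        System.out.println(qualify(scores, RANGE_FACTOR));
    }
}
```

The returned list is unordered with respect to stage-2: as stated above, the stage-1 ranks are discarded and only membership in the qualified set carries over.
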
III. Development Test

    • True-Positive non-word 1-to-1:
      Id    Source  Original Word  Corrected Word
      TP-1  10023   knoledge       knowledge
      TP-2  10040   truely         truly
      TP-3  10475   diagnost       diagnosed
      TP-4  6       diagnosised    diagnosed
      ...   ...     ...            ...
      • TP-3, 4: the correction changes when the context changes:
        • diagnost -> diagnosis
        • was diagnost -> was diagnosed
        • diagnost with -> diagnosed with
        • was diagnost with -> was diagnosed with

        • diagnosised -> diagnosis
        • was diagnosised with -> was diagnosed with
    • False-Positive non-word 1-to-1:
      Id    Source  Original Word  Corrected Word  Correct Word
      FP-1  10058   B              be              B
      FP-2  10084   i.e.           ice.            i.e.
      FP-3  11144   clancy         chancy          clumsy
      FP-4  11588   baging         bagging         begging
      ...   ...     ...            ...             ...
      • FP-1, 2: could be improved by word length and case
      • FP-3: the correct word is too far (in edit distance) from the original word
    • False-Negative non-word 1-to-1:
      Id    Source  Original Word  Corrected Word  Correct Word
      FN-1  10285   hitiala        hitiala         hiatal
      FN-2  10714   havy           have            heavy
      FN-3  10      ewings         ewings          ewing's
      FN-4  11144   traumatologo   traumatologo    traumatologist
      FN-5  11186   segmens        segment         segments
      • FN-3: possessive
      • FN-4: the correct word is too far (in edit distance) from the original word
      • FN-5: inflectional variants
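
The "too far" cases (clancy/clumsy and traumatologo/traumatologist) can be checked against the Edit Dist <= 2 candidate limit with a small sketch. It uses plain Levenshtein distance as an assumption; the edit distance measure in OneToOneCandidates.java may differ (for example, in how transpositions are counted), and the class and method names here are hypothetical.

```java
// Hypothetical sketch; not the OneToOneCandidates.java source.
class EditDistanceCheck {
    // Mirrors the "Edit Dist <= 2" candidate limit from the text.
    static final int MAX_EDIT_DIST = 2;

    // Plain Levenshtein distance (insert, delete, substitute, each cost 1),
    // computed with two rolling rows of the dynamic-programming table.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // True if the candidate is close enough to be generated at all.
    static boolean withinRange(String nonWord, String candidate) {
        return levenshtein(nonWord, candidate) <= MAX_EDIT_DIST;
    }

    public static void main(String[] args) {
        String[][] pairs = {
            {"clancy", "chancy"},
            {"clancy", "clumsy"},
            {"traumatologo", "traumatologist"},
        };
        for (String[] p : pairs) {
            System.out.println(p[0] + " -> " + p[1] + ": distance " + levenshtein(p[0], p[1])
                + ", within range: " + withinRange(p[0], p[1]));
        }
    }
}
```

Under this measure, clancy/chancy has distance 1, while clancy/clumsy and traumatologo/traumatologist both have distance 3, so the correct words would never enter the candidate list regardless of ranking.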