CSpell

Non-word Split

I. Introduction

This page describes the processes for non-word split detection and correction.

II. Processes

  • Detector:
    NonWordDetector.java
    • non-word: an invalid word, i.e. not in checkDic (checkDic includes EW, NUM, etc.)
    • not an exception: digit, punctuation, digit/punctuation, email, url, empty string, upperCase, 1Char, measurement
  • Candidates:
    SplitCandidates.java
    • SplitNo <= 5 (configurable: CS_CAN_NW_MAX_SPLIT_NO)
    • is a multiword (in mwDic)
    • each word (unigram) in the candidate is in splitDic; splitDic does not include pure aA, such as "er"
    • unigram is not digit, unit, etc. (already split in ND splitter)
  • Ranker:
    RankNonWordByMode.java
    Uses the top-ranked candidate from the two-stage ranking system for correction:
    • Stage-1:
      • Orthographic score
        • Edit Distance Similarity
        • Phonetic Similarity (Double Metaphone)
        • Overlap Similarity
      • Find the top orthographic score
      • All candidates whose orthographic score is within 0.08 of the top orthographic score are selected as qualified candidates and go to stage-2 for final ranking
      • The ranking by orthographic score in this stage is disregarded in stage-2
    • Stage-2:
      Use chained comparators, applied sequentially in the order of the following scores:
      • Context Score (Dual embedding Word2Vec)
        • context radius = 2 (configurable, CS_NW_SPLIT_CONTEXT_RADIUS)
          This value is not used/implemented in CSpell because CSpell combines the non-word split and 1-to-1 correction modules.

        • topScore != 0
      • Noisy Channel Score
  • Corrector:
    SplitCorrector.java
    • Update the focus token with the top-ranked split candidate
    • FlatMap the split word to inTokenlist
    • Update process history to non-word-split
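
The two-stage ranking described under Ranker can be sketched roughly as follows. This is a minimal illustration, not the code in RankNonWordByMode.java: the class TwoStageRankSketch, the Candidate holder, and its field names are hypothetical, the three scores are assumed to be precomputed with higher meaning better, and the "topScore != 0" condition is not modeled because its exact semantics are not spelled out on this page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only; not the actual RankNonWordByMode.java.
class TwoStageRankSketch {
    // Hypothetical holder: one split candidate with precomputed scores
    // (assumption: a higher score is better for all three).
    static final class Candidate {
        final String text;
        final double orthoScore;
        final double contextScore;
        final double noisyChannelScore;

        Candidate(String text, double orthoScore,
                  double contextScore, double noisyChannelScore) {
            this.text = text;
            this.orthoScore = orthoScore;
            this.contextScore = contextScore;
            this.noisyChannelScore = noisyChannelScore;
        }
    }

    // Stage-1 window taken from the description above.
    static final double STAGE1_WINDOW = 0.08;

    // Stage 1: keep every candidate within 0.08 of the top orthographic score.
    static List<Candidate> selectQualified(List<Candidate> candidates) {
        double top = Double.NEGATIVE_INFINITY;
        for (Candidate c : candidates) {
            top = Math.max(top, c.orthoScore);
        }
        List<Candidate> qualified = new ArrayList<>();
        for (Candidate c : candidates) {
            if (top - c.orthoScore <= STAGE1_WINDOW) {
                qualified.add(c);
            }
        }
        return qualified;
    }

    // Stage 2: chained comparators, context score first, then noisy channel.
    // Both sort in descending order so the best candidate comes first.
    static final Comparator<Candidate> BY_CONTEXT =
        (a, b) -> Double.compare(b.contextScore, a.contextScore);
    static final Comparator<Candidate> BY_NOISY_CHANNEL =
        (a, b) -> Double.compare(b.noisyChannelScore, a.noisyChannelScore);
    static final Comparator<Candidate> STAGE2 =
        BY_CONTEXT.thenComparing(BY_NOISY_CHANNEL);

    // Returns the top-ranked candidate, or null if there are no candidates.
    static Candidate topCandidate(List<Candidate> candidates) {
        List<Candidate> qualified = selectQualified(candidates);
        if (qualified.isEmpty()) {
            return null;
        }
        // The stage-1 order is discarded; only the stage-2 scores decide.
        qualified.sort(STAGE2);
        return qualified.get(0);
    }
}
```

With made-up scores, a candidate whose orthographic score is 0.85 still qualifies when the top score is 0.90, and it wins stage 2 if its context score is higher, because the stage-1 ranks are not carried over.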

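The candidate rules listed under Candidates can be sketched in a similar way. This is a rough illustration under stated assumptions, not the code in SplitCandidates.java: the class SplitCandidateSketch and its method names are hypothetical, SplitNo is read as the number of spaces inserted, input is assumed to be lowercased already, and a candidate is accepted if the whole phrase is in mwDic or every unigram is in splitDic (the page lists both rules without saying how they combine). The digit and unit checks are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; not the actual SplitCandidates.java.
class SplitCandidateSketch {
    // Returns all valid split candidates that insert at most maxSplitNo spaces.
    static List<String> candidates(String word, int maxSplitNo,
                                   Set<String> splitDic, Set<String> mwDic) {
        List<String> out = new ArrayList<>();
        enumerate(word, 0, maxSplitNo, new ArrayList<>(), splitDic, mwDic, out);
        return out;
    }

    // Recursively enumerates split positions (no pruning, for clarity).
    private static void enumerate(String word, int start, int splitsLeft,
                                  List<String> parts, Set<String> splitDic,
                                  Set<String> mwDic, List<String> out) {
        // Close the candidate: the rest of the word becomes the last unigram.
        parts.add(word.substring(start));
        if (parts.size() > 1 && isValid(parts, splitDic, mwDic)) {
            out.add(String.join(" ", parts));
        }
        parts.remove(parts.size() - 1);

        // Stop once the split budget is used up.
        if (splitsLeft == 0) {
            return;
        }
        // Otherwise insert one more space at every possible position.
        for (int end = start + 1; end < word.length(); end++) {
            parts.add(word.substring(start, end));
            enumerate(word, end, splitsLeft - 1, parts, splitDic, mwDic, out);
            parts.remove(parts.size() - 1);
        }
    }

    // Assumption: accept a known multiword, or a phrase whose unigrams
    // are all in splitDic.
    private static boolean isValid(List<String> parts, Set<String> splitDic,
                                   Set<String> mwDic) {
        if (mwDic.contains(String.join(" ", parts))) {
            return true;
        }
        return splitDic.containsAll(parts);
    }
}
```

For example, with a splitDic containing only "know" and "about" and an empty mwDic, the input "knowabout" yields the single candidate "know about".
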
III. Development Test

  • True-Positive Non-word Split:
    Id    | Source | Original Word     | Split Word
    TP-1  | 10225  | aftercareemail    | aftercare email
    TP-2  | 10225  | facebookshare     | facebook share
    TP-3  | 10225  | friendshare       | friend share
    TP-4  | 12616  | leftside          | left side
    TP-5  | 13090  | viceversa         | vice versa
    TP-6  | 13509  | inthis            | in this
    TP-7  | 14849  | shuntfrom2007.How | shunt from 2007. How
    TP-8  | 14849  | oftendo           | often do
    TP-9  | 14     | knowabout         | know about
    TP-10 | 16928  | thankyou          | thank you
    TP-11 | 17942  | everytime         | every time
    TP-12 | 18175  | ofcourse          | of course
    TP-13 | 18611  | aquestion         | a question
    TP-14 | 18855  | backalso          | back also
    TP-15 | 26     | diseaseany        | disease any
    TP-16 | 7      | saythis           | say this
    TP-17 | 88     | ilost             | i lost
    • TP-7: involves splitter operations from both ND and NW:
      • Input: shuntfrom2007.How
      • ND: shuntfrom 2007. How
      • NW: shunt from 2007. How
  • False-Positive Non-word Split:
    Id    | Source | Original Word     | Split Word         | Correct Words
    FP-1  | 12235  | counterindicative | counter indicative | contraindicated
    FP-2  | 12271  | earthmovers       | earth movers       | earthmovers
    FP-3  | 13014  | orthopaedician    | orthopaedic ian    | orthopaedician
    FP-4  | 13165  | iam               | i am               | iam (error?)
    FP-5  | 13922  | shoudice          | shou dice          | shouldice
    FP-6  | 1      | nonething         | none thing         | nothing
    FP-7  | 4      | disear            | dis ear            | disease
    FP-8  | 61     | metoptic          | met optic          | metopic
    FP-9  | 7      | chromezone        | chrome zone        | chromosome
    FP-10 | 12574  | biletan           | bile tan           | biletan
    • FP-6, 7: too far away
    • FP-4: error in the goldStd set
    • FP-2, 3, 5, 10: need more coverage in the corpus and dictionary
  • False-Negative Non-word Split:
    Id    | Source | Original Word      | Corrected Word      | Correct Word
    FN-1  | 10025  | u-creatinine       | creatinine          | urine creatinine
    FN-2  | 11186  | tbinthe            | tbinthe             | tb in the
    FN-3  | 11243  | menimgtisneef      | menimgtisneef       | meningitis needs
    FN-4  | 12271  | area!unfortionatly | area! unfortionatly | area! unfortunately
    FN-5  | 12616  | camedown           | came down           | came down
    FN-6  | 14514  | ihave              | have                | i have
    FN-7  | 14     | alot               | alot                | a lot
    FN-8  | 16519  | eye-doctor         | eye-doctor          | eye doctor
    FN-9  | 18203  | pthrpeptide        | pthrpeptide         | pthr peptide
    FN-10 | 88     | polipsremoved      | polipsremoved       | polyps removed
    • FN-1, 3, 4, 9, 10: multiple operations involved (not in the design scope)
    • FN-2: TB was not in the split dictionary
    • FN-5, 6, 7: need further investigation. Maybe separate Split and 1-To-1 into two classes in NW.
    • FN-8: spVars