CSpell

Non-word Split

I. Introduction

This page describes the processes for non-word split detection and correction.

II. Processes

Detector:
NonWordDetector.java
- non-word: invalid word, not in checkDic. checkDic includes EW, NUM, etc.
- not an exception: digit, punctuation, digit/punctuation, email, url, empty string, upperCase, 1Char, measurement
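
The detector rule above can be sketched roughly as follows. This is a minimal illustration, not the code in NonWordDetector.java: the class name, the method names, and the regular expression are assumptions, and only some of the listed exceptions (empty string, 1Char, digit, punctuation, digit/punctuation, upperCase) are modeled.

```java
import java.util.Set;

// Hypothetical sketch; not the actual NonWordDetector.java.
class NonWordDetectorSketch {
    // A token is treated as a non-word when it is not an exception
    // and is not found in checkDic (lookup is lowercased here by assumption).
    static boolean isNonWord(String token, Set<String> checkDic) {
        if (isException(token)) {
            return false;
        }
        return !checkDic.contains(token.toLowerCase());
    }

    // Models only a subset of the exceptions; email, url and
    // measurement are left out of this sketch.
    static boolean isException(String token) {
        return token.isEmpty()                         // empty string
            || token.length() == 1                     // 1Char
            || token.matches("[\\p{Punct}\\d]+")       // digit, punctuation, digit/punctuation
            || token.equals(token.toUpperCase());      // upperCase
    }
}
```

With a checkDic holding "thank" and "you", this sketch flags "thankyou" as a non-word and leaves "thank", "TB" and "12.5" alone.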
Candidates:
SplitCandidates.java
- SplitNo <= 5 (configurable: CS_CAN_NW_MAX_SPLIT_NO)
- is a multiword (in mwDic)
- each word (unigram) in the candidate is in splitDic; splitDic does not include pure aA, such as "er"
- unigram is not digit, unit, etc. (already split in ND splitter)
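
As an illustration of the candidate rules above, the following sketch enumerates the splits of a non-word whose pieces are all in splitDic. It is not the code in SplitCandidates.java: the class and method names are hypothetical, the split number is assumed to count the spaces inserted, and the mwDic check and the digit/unit filter are not modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not the actual SplitCandidates.java.
class SplitCandidatesSketch {
    // Returns every split of 'word' that uses 1..maxSplitNo spaces and
    // whose pieces are all found in splitDic.
    static List<String> getCandidates(String word, Set<String> splitDic,
            int maxSplitNo) {
        List<String> out = new ArrayList<String>();
        split(word, 0, new ArrayList<String>(), splitDic, maxSplitNo, out);
        return out;
    }

    private static void split(String word, int start, List<String> pieces,
            Set<String> splitDic, int maxSplitNo, List<String> out) {
        if (start == word.length()) {
            if (pieces.size() > 1) {            // at least one split was made
                out.add(String.join(" ", pieces));
            }
            return;
        }
        if (pieces.size() > maxSplitNo) {       // one more piece would exceed the limit
            return;
        }
        for (int end = start + 1; end <= word.length(); end++) {
            String piece = word.substring(start, end);
            if (splitDic.contains(piece)) {     // each unigram must be in splitDic
                pieces.add(piece);
                split(word, end, pieces, splitDic, maxSplitNo, out);
                pieces.remove(pieces.size() - 1);
            }
        }
    }
}
```

For example, with a splitDic that contains "thank" and "you", calling getCandidates("thankyou", splitDic, 5) yields the single candidate "thank you".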
Ranker:
RankNonWordByMode.java
The top-ranked candidate from the two-stage ranking system is used for correction:
- Stage-1:
  - Orthographic score
    - Edit Distance Similarity
    - Phonetic Similarity (Double Metaphone)
    - Overlap Similarity
  - Find the top orthographic score
  - All candidates within 0.08 of the top orthographic score are selected as qualified candidates and go to stage-2 for final ranking
  - The ranking by orthographic score in this stage is disregarded in stage-2
- Stage-2:
  Use chained comparators, applied sequentially, on the following scores:
  - Context Score (Dual embedding Word2Vec)
    - context radius = 2 (configurable, CS_NW_SPLIT_CONTEXT_RADIUS)
      This value is not used/implemented in CSpell because CSpell combines the non-word split and 1-to-1 correction modules.
    - topScore != 0
  - Noisy Channel Score
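
The two-stage ranking above might look roughly like the sketch below. It is not the code in RankNonWordByMode.java: the class, field, and method names are hypothetical, all scores are assumed to be precomputed with higher meaning better, the 0.08 range is assumed to be inclusive, and the "topScore != 0" condition is not modeled.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; not the actual RankNonWordByMode.java.
class TwoStageRankSketch {
    // Holds precomputed scores for one split candidate.
    static class Candidate {
        final String text;
        final double orthoScore;
        final double contextScore;
        final double noisyChannelScore;

        Candidate(String text, double orthoScore, double contextScore,
                double noisyChannelScore) {
            this.text = text;
            this.orthoScore = orthoScore;
            this.contextScore = contextScore;
            this.noisyChannelScore = noisyChannelScore;
        }
    }

    static final double ORTHO_RANGE = 0.08;

    // Returns the text of the top-ranked candidate, or null if there is none.
    static String topCandidate(List<Candidate> candidates) {
        if (candidates.isEmpty()) {
            return null;
        }
        // Stage-1: keep candidates within ORTHO_RANGE of the top orthographic score.
        double top = Double.NEGATIVE_INFINITY;
        for (Candidate c : candidates) {
            top = Math.max(top, c.orthoScore);
        }
        List<Candidate> qualified = new ArrayList<Candidate>();
        for (Candidate c : candidates) {
            if (top - c.orthoScore <= ORTHO_RANGE) {
                qualified.add(c);
            }
        }
        // Stage-2: the stage-1 order is ignored; sort by context score,
        // then by noisy channel score, both descending.
        Comparator<Candidate> chain = Comparator
            .comparingDouble((Candidate c) -> c.contextScore)
            .thenComparingDouble(c -> c.noisyChannelScore)
            .reversed();
        qualified.sort(chain);
        return qualified.get(0).text;
    }
}
```

Note that a candidate with the best orthographic score can still lose in stage-2, since only the context and noisy channel scores decide the final order among the qualified candidates.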
Corrector:
SplitCorrector.java
- Update the focus token with the top-ranked split candidate
- FlatMap the split word to inTokenlist
- Update process history to non-word-split
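
The first two corrector steps can be illustrated with the sketch below. It is not the code in SplitCorrector.java: tokens are modeled as plain strings (an assumption), the class and method names are hypothetical, and the process-history update is not modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch; not the actual SplitCorrector.java.
class SplitCorrectorSketch {
    // Replaces the focus token with the split candidate, then flattens
    // the list so that each word of the candidate becomes its own token.
    static List<String> applySplit(List<String> tokens, int focusIndex,
            String candidate) {
        List<String> updated = new ArrayList<String>(tokens);
        updated.set(focusIndex, candidate);
        return updated.stream()
            .flatMap(t -> Arrays.stream(t.split(" ")))
            .collect(Collectors.toList());
    }
}
```

For instance, applying the candidate "thank you" at index 1 of the tokens "of", "thankyou", "all" gives the four tokens "of", "thank", "you", "all".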

III. Development Test

True-Positive Non-word Split:

Id	Source	Original Word	Split Word
TP-1	10225	aftercareemail	aftercare email
TP-2	10225	facebookshare	facebook share
TP-3	10225	friendshare	friend share
TP-4	12616	leftside	left side
TP-5	13090	viceversa	vice versa
TP-6	13509	inthis	in this
TP-7	14849	shuntfrom2007.How	shunt from 2007. How
TP-8	14849	oftendo	often do
TP-9	14	knowabout	know about
TP-10	16928	thankyou	thank you
TP-11	17942	everytime	every time
TP-12	18175	ofcourse	of course
TP-13	18611	aquestion	a question
TP-14	18855	backalso	back also
TP-15	26	diseaseany	disease any
TP-16	7	saythis	say this
TP-17	88	ilost	i lost

TP-7 involves splitter operations from both ND and NW:
- Input: shuntfrom2007.How
- ND: shuntfrom 2007. How
- NW: shunt from 2007. How

False-Positive Non-word Split:

Id	Source	Original Word	Split Word	Correct Words
FP-1	12235	counterindicative	counter indicative	contraindicated
FP-2	12271	earthmovers	earth movers	earthmovers
FP-3	13014	orthopaedician	orthopaedic ian	orthopaedician
FP-4	13165	iam	i am	iam (error?)
FP-5	13922	shoudice	shou dice	shouldice
FP-6	1	nonething	none thing	nothing
FP-7	4	disear	dis ear	disease
FP-8	61	metoptic	met optic	metopic
FP-9	7	chromezone	chrome zone	chromosome
FP-10	12574	biletan	bile tan	biletan

FP-6, 7: too far away
FP-4: error in the goldStd set.
FP-2, 3, 5, 10: need more coverage in the corpus and dictionary

False-Negative Non-word Split:

Id	Source	Original Word	Corrected Word	Correct Word
FN-1	10025	u-creatinine	creatinine	urine creatinine
FN-2	11186	tbinthe	tbinthe	tb in the
FN-3	11243	menimgtisneef	menimgtisneef	meningitis needs
FN-4	12271	area!unfortionatly	area! unfortionatly	area! unfortunately
FN-5	12616	camedown	came down	came down
FN-6	14514	ihave	have	i have
FN-7	14	alot	alot	a lot
FN-8	16519	eye-doctor	eye-doctor	eye doctor
FN-9	18203	pthrpeptide	pthrpeptide	pthr peptide
FN-10	88	polipsremoved	polipsremoved	polyps removed

FN-1, 3, 4, 9, 10: multiple operations involved (not in the design scope)
FN-2: TB was not in the split dictionary
FN-5, 6, 7: need further investigation. Maybe separate Split and 1-To-1 into two classes in NW.
FN-8: spVars