CSpell

Real-word Split

This page describes the processes for real-word split detection and correction.

I. Processes

  • Detector:
    RealWordDetector.java (see the detector sketch after this list)
    • Not corrected previously in the CSpell pipeline.
    • real-word: valid word (in checkDic)
    • Not an exception (digit, punctuation, digit/punctuation, email, URL, empty string, measurement, Aa, proper noun)
    • focus token has word2Vec
    • focus token has length >= 4 (configurable: CS_DETECTOR_RW_SPLIT_WORD_MIN_LENGTH)
    • focus token: WC >= 200 (configurable: CS_DETECTOR_RW_SPLIT_WORD_MIN_WC)
  • Candidates:
    SplitCandidates.java (see the candidate generation and validation sketch after this list)
    • Get the splitSet from all possible splits, as in the non-word split process
      • SplitNo <= 2 (configurable: CS_CAN_RW_MAX_SPLIT_NO)
    • The split candidate is valid if it is a Lexicon multiword
    • If not a multiword, check if it is a valid split candidate:
      • Check short split words in split candidate
        • Number of short split words <= 2 (configurable: CS_CAN_RW_MAX_SHORT_SPLIT_WORD_NO)
          The total number of short split words must not exceed maxShortSplitWordNo (2).
        • Length of a short split word <= 3 (configurable: CS_CAN_RW_SHORT_SPLIT_WORD_LENGTH)
          By default, a short split word is a word with a length of 3 or less.
        • Heuristic rules are used to avoid splitting into many invalid short split words. For example:

          Src         Candidate                               Notes
          17942.txt   "something" -> "so me thing"            Both "so" and "me" are short split words; two of them means it is not a valid split.
          16369.txt   "suggestion" -> "suggest i on"          Both "i" and "on" are short split words; two of them means it is not a valid split.
          60.txt      "upon" -> "up on"
          30.txt      "soon" -> "so on"
          12353.txt   "another" -> "a not her", "an other"    "a not her" is an invalid candidate; "an other" is a valid candidate.
          15721.txt   "anyone" -> "any one"

        • keep: "away" -> "a way", "along" -> "a long", etc.
      • Check all split words (element words) in split candidate
        • in splitDic (Not pure Aa)
        • has context score (word2Vec)
        • WC > min. threshold (200 configurable: CS_CAN_RW_SPLIT_CAND_MIN_WC)
          example: ploytension -> poly tension
        • not unit
          examples:

          Src         Candidate                      Notes
          17536.txt   "inversion" -> "in version"    where "in" is a unit, short for "inch"
          10136.txt   "everyday" -> "every day"      where "day" is a unit
        • not proper noun
          examples:

          Src         Candidate                    Notes
          16661.txt   "human" -> "hu man"          where "Hu" is a proper noun
          16481.txt   "children" -> "child ren"    where "Ren" is a proper noun
  • Ranker:
    RankRealWordSplitByContext.java (see the ranking sketch after this list)
    • Rank split candidates by context scores
      • context radius = 2 (configurable, CS_RW_SPLIT_CONTEXT_RADIUS)
    • Validate the top-ranked candidate
      Compare the top-ranked candidate to the original token for correction:
      • orgScore < 0
        • & topScore > 0
        • & topScore < 0 & topScore * RealWord_Split_Confidence_Factor > orgScore
      • orgScore > 0
        • & topScore * RealWord_Split_Confidence_Factor > orgScore
      • orgScore = 0
        • No real-word split correction, because there is no word2Vec information for the original word; this case is filtered out in the detection

      where:

      • orgScore: the context score of the original token
      • topScore: the context score of the top-ranked candidate
      • RealWord_Split_Confidence_Factor = 0.01 (Configurable: CS_RANKER_RW_SPLIT_C_FAC)
    • TBD: the ranking could be improved if n-gram frequencies were available. Frequency with context would be a better ranking source for split candidates.
  • Corrector:
    ProcRealWordSplit.java (see the corrector sketch after this list)
    • FlatMap the split word (OneToOneSplitCorrector.AddOneToOneSplitCorrection)
    • Update process history to real-word-split
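
The sketches below are illustrative only: they are not the CSpell source code, and the class names, method signatures, and dictionary/word2Vec/word-count lookups are stand-ins. First, a minimal sketch of the detector checks:

  import java.util.Map;
  import java.util.Set;

  public class RealWordSplitDetectorSketch {
      // Defaults of the configuration keys listed above.
      static final int MIN_WORD_LENGTH = 4;  // CS_DETECTOR_RW_SPLIT_WORD_MIN_LENGTH
      static final int MIN_WORD_COUNT = 200; // CS_DETECTOR_RW_SPLIT_WORD_MIN_WC

      // True if the focus token is a real-word split detection case.
      static boolean isRealWordSplitCase(String token, boolean correctedBefore,
              Set<String> checkDic, Set<String> exceptions,
              Set<String> word2VecVocab, Map<String, Integer> wordCounts) {
          String key = token.toLowerCase();
          return !correctedBefore                      // not corrected previously in the pipeline
                  && checkDic.contains(key)            // real word: valid word in checkDic
                  && !exceptions.contains(key)         // not digit/punctuation/email/url/measurement/Aa/proper noun
                  && word2VecVocab.contains(key)       // focus token has word2Vec information
                  && token.length() >= MIN_WORD_LENGTH // length >= 4
                  && wordCounts.getOrDefault(key, 0) >= MIN_WORD_COUNT; // WC >= 200
      }
  }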
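
A minimal sketch of candidate generation and validation. The short-split-word check implements only the stated thresholds; the actual SplitCandidates.java also applies the additional heuristics shown in the examples above:

  import java.util.HashSet;
  import java.util.Map;
  import java.util.Set;

  public class RealWordSplitCandidatesSketch {
      static final int MAX_SPLIT_NO = 2;            // CS_CAN_RW_MAX_SPLIT_NO (maxSplitNo passed to getSplitSet)
      static final int MAX_SHORT_SPLIT_WORD_NO = 2; // CS_CAN_RW_MAX_SHORT_SPLIT_WORD_NO
      static final int SHORT_SPLIT_WORD_LENGTH = 3; // CS_CAN_RW_SHORT_SPLIT_WORD_LENGTH
      static final int SPLIT_CAND_MIN_WC = 200;     // CS_CAN_RW_SPLIT_CAND_MIN_WC

      // Generate all split candidates with up to maxSplitNo inserted spaces,
      // as in the non-word split process.
      static Set<String> getSplitSet(String word, int maxSplitNo) {
          Set<String> splitSet = new HashSet<>();
          if (maxSplitNo <= 0 || word.length() < 2) {
              return splitSet;
          }
          for (int i = 1; i < word.length(); i++) {
              String left = word.substring(0, i);
              String right = word.substring(i);
              splitSet.add(left + " " + right);                    // one split point
              for (String sub : getSplitSet(right, maxSplitNo - 1)) {
                  splitSet.add(left + " " + sub);                  // more split points to the right
              }
          }
          return splitSet;
      }

      // Validate one split candidate against the rules listed above.
      static boolean isValidSplitCandidate(String candidate, Set<String> multiwordDic,
              Set<String> splitDic, Set<String> word2VecVocab, Map<String, Integer> wordCounts,
              Set<String> unitDic, Set<String> properNounDic) {
          if (multiwordDic.contains(candidate)) {
              return true;                              // a Lexicon multiword is accepted directly
          }
          String[] splitWords = candidate.split(" ");
          // Limit the number of short split words (length <= 3),
          // e.g. to avoid "something" -> "so me thing".
          int shortSplitWordNo = 0;
          for (String splitWord : splitWords) {
              if (splitWord.length() <= SHORT_SPLIT_WORD_LENGTH) {
                  shortSplitWordNo++;
              }
          }
          if (shortSplitWordNo > MAX_SHORT_SPLIT_WORD_NO) {
              return false;
          }
          // Every split word must be a known, frequent word with context
          // information, and neither a unit nor a proper noun.
          for (String splitWord : splitWords) {
              String key = splitWord.toLowerCase();
              if (!splitDic.contains(key)                                     // in splitDic (not pure Aa)
                      || !word2VecVocab.contains(key)                         // has context score (word2Vec)
                      || wordCounts.getOrDefault(key, 0) <= SPLIT_CAND_MIN_WC // WC > min. threshold
                      || unitDic.contains(key)                                // not a unit ("in", "day")
                      || properNounDic.contains(key)) {                       // not a proper noun ("Hu", "Ren")
                  return false;
              }
          }
          return true;
      }
  }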
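
A minimal sketch of the ranker's validation rule; the word2Vec context scores (orgScore, topScore) are assumed to be computed elsewhere:

  public class RealWordSplitRankerSketch {
      // RealWord_Split_Confidence_Factor (CS_RANKER_RW_SPLIT_C_FAC)
      static final double SPLIT_CONFIDENCE_FACTOR = 0.01;

      // True if the top-ranked split candidate should replace the original token.
      static boolean useTopCandidate(double orgScore, double topScore) {
          if (orgScore < 0.0) {
              // the candidate scores positive, or both are negative and the
              // scaled candidate score is still better than the original score
              return topScore > 0.0
                      || (topScore < 0.0 && topScore * SPLIT_CONFIDENCE_FACTOR > orgScore);
          } else if (orgScore > 0.0) {
              // the scaled candidate score must beat the original score
              return topScore * SPLIT_CONFIDENCE_FACTOR > orgScore;
          }
          // orgScore = 0: no word2Vec information; filtered out in detection
          return false;
      }
  }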
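
Finally, a minimal sketch of the corrector's flat-map step; TokenObj here is a simplified stand-in, not the CSpell class:

  import java.util.ArrayList;
  import java.util.List;

  public class RealWordSplitCorrectorSketch {
      // Simplified stand-in for a token with its process history.
      static class TokenObj {
          final String tokenStr;
          final List<String> procHist = new ArrayList<>();
          TokenObj(String tokenStr) { this.tokenStr = tokenStr; }
      }

      // Flat-map one focus token (e.g. "along") into its split words ("a", "long"),
      // carrying over and updating the process history.
      static List<TokenObj> addSplitCorrection(TokenObj inToken, String splitStr) {
          List<TokenObj> outTokens = new ArrayList<>();
          for (String word : splitStr.split(" ")) {
              TokenObj outToken = new TokenObj(word);
              outToken.procHist.addAll(inToken.procHist);
              outToken.procHist.add("REAL-WORD-SPLIT"); // update process history to real-word-split
              outTokens.add(outToken);
          }
          return outTokens;
      }
  }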

II. Development Tests

Tested different real-word split confidence factors on the revised gold standard (with real-word errors included) from the training set, with the following setup:

  • CS_DETECTOR_RW_SPLIT_WORD_MIN_LENGTH=4
  • CS_CAN_RW_SHORT_SPLIT_WORD_LENGTH=3
  • CS_CAN_RW_MAX_SHORT_SPLIT_WORD_NO=2

Function                     Confidence Factor   Context Radius   Max. SplitNo   Raw data (TP|TP+FP|TP+FN)   Performance (Precision|Recall|F1)
NW (1-to-1, Split, Merge)    N/A                 N/A              2              604|775|964                 0.7794|0.6266|0.6947
NW + RW_SPLIT                0.00                2                5              605|789|964                 0.7668|0.6276|0.6902
NW + RW_SPLIT                0.01                2                5              605|789|964                 0.7668|0.6276|0.6902
NW + RW_SPLIT                0.02                2                5              605|790|964                 0.7658|0.6276|0.6899
NW + RW_SPLIT                0.03                2                5              605|790|964                 0.7658|0.6276|0.6899
NW + RW_SPLIT                0.05                2                5              605|791|964                 0.7649|0.6276|0.6895
NW + RW_SPLIT                0.10                2                5              605|792|964                 0.7639|0.6276|0.6891
NW + RW_SPLIT                0.20                2                5              605|792|964                 0.7639|0.6276|0.6891
NW + RW_SPLIT                0.40                2                5              605|809|964                 0.7478|0.6276|0.6825
NW + RW_SPLIT                0.60                2                5              607|835|964                 0.7269|0.6297|0.6748
NW + RW_SPLIT                0.80                2                5              608|875|964                 0.6949|0.6307|0.6612
NW + RW_SPLIT                0.01                9                0              604|775|964                 0.7794|0.6266|0.6947
NW + RW_SPLIT                0.01                9                1              606|777|964                 0.7799|0.6286|0.6962
NW + RW_SPLIT                0.01                9                2              606|777|964                 0.7799|0.6286|0.6962
NW + RW_SPLIT                0.01                9                3              606|777|964                 0.7799|0.6286|0.6962
NW + RW_SPLIT                0.01                9                4              606|777|964                 0.7799|0.6286|0.6962
NW + RW_SPLIT                0.01                9                5              606|777|964                 0.7799|0.6286|0.6962
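
The performance columns follow from the raw data columns; a small sketch of the computation for the first row, assuming the raw data are TP | TP+FP | TP+FN:

  public class SplitPerformanceSketch {
      public static void main(String[] args) {
          int tp = 604;      // true positives
          int tpFp = 775;    // true positives + false positives (corrections made)
          int tpFn = 964;    // true positives + false negatives (gold-standard total)
          double precision = (double) tp / tpFp;                        // 0.7794
          double recall = (double) tp / tpFn;                           // 0.6266
          double f1 = 2.0 * precision * recall / (precision + recall);  // 0.6947
          System.out.printf("%.4f|%.4f|%.4f%n", precision, recall, f1);
      }
  }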

  • A bigger confidence factor increases the [TP] and [FP]. A value of 0.01 seems to reach the best F1.
  • A bigger context radius first decreases the [TP] and [FP], then increases them; a value of 9 seems to reach the best F1.
    => Real-word split involves understanding the meaning of the text; the software needs more context for better precision.
  • The value of max. split No. does not seem to have much impact on F1. Use the empirical value of 2 as the default, since it is unlikely that a word merged from more than two words happens to be a real word. Using 2 (instead of a bigger number) saves running time and improves speed.

III. Observations from Training Set

  • [TP] real-word split:
    ID      Source    Original Word    Split Words
    TP-1    10349     along            a long
    TP-2    10349     along            a long
    TP-3    13165     iam              i am
    TP-4    18669     iam              i am
    • 10349.txt: "sounding in my ear every time for along time."
    • TP-3 and TP-4 are done in the ND splitter

  • [FP] real-word split:
    ID      Source    Original Word    Split Words
    FP-1    10349     along            a long
    FP-2    10061     however          how ever
    FP-3    39        without          with out
    FP-4    39        because          be cause
    FP-5    41        anywhere         any where

  • [FN] real-word split:
    ID      Source    Original Word    Split Words
    FN-3    13864     apart            a part
    • FN-3: The original input text is "... I donate my self to be apart of this study." The word2Vec model needs to be improved with a bigger corpus. This split case is very sensitive to context, as shown below:

      Input                    Output                     Notes
      apart                    apart
      apart of                 a part of
      apart of this            apart of this
      apart of this study      apart of this study
      apart of this group      a part of this group       Good
      apart of this process    a part of this process     Good
      apart of this effect     a part of this effect      Good
      be apart                 be apart
      be apart of              be a part of               Good
      to be apart of           to be apart of
      not be apart of          not be a part of           Good
      weeks apart of           weeks apart of             Good
      weeks apart of 160 mg    weeks apart of 160 mg      Good
      distance apart of        distance apart of          Good
      distance apart of the    distance apart of the      Good
      apart from               apart from                 Good