CSpell

Real-word Merge

This page describes the processes for real-word merge detection and correction.

I. Processes

Detector:
RealWordMergeDetector.java
- Not corrected previously in the CSpell pipeline
- real-word: valid word (in splitDic)
- Not exceptions: digit, punctuation, digit/punctuation, email, url, empty string, upperCase, 1Char, measurement
Candidates:
MergeCandidates.java
- mergeNo <= 2 (configurable: CS_CAN_RW_MAX_MERGE_NO)
- merge with hyphen is false (configurable: CS_CAN_RW_MERGE_WITH_HYPHEN)
  only merge with space " ", (no merge with hyphen "-")
- context (adjacent tokens) is not an exception (url, email, ...)
- orgWords (the words before merging) are not a multiword (not in mwDic)
- candidate is a valid word (in suggDic) and not an abbreviation or acronym (not in aaDic)
- candidate has context score (not zero)
- Word count of candidate >= 15 (configurable: CS_CAN_RW_SPLIT_CAND_MIN_WC)
- Not a short word merge
  - a short word has a length of less than 3
  - the total number of short words among the original words must be less than 2
  - Examples:
    Input text	Candidate	Notes
    me at	meat	invalid candidate: 2 short words (me and at); source: 80.txt and 16734.txt
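The short word rule above can be sketched as follows. This is a minimal illustration, not the code in MergeCandidates.java: the class name, method name, and constants are hypothetical, and it assumes the short words are counted over the original (pre-merge) words.

```java
import java.util.List;

// Hypothetical helper illustrating the short word merge rule.
final class ShortWordMergeCheck {
    // Assumed constants mirroring the rule in the text.
    static final int SHORT_WORD_LENGTH = 3; // a short word has length < 3
    static final int MAX_SHORT_WORD_NO = 2; // number of short words must be < 2

    // Returns true when the original words form a short word merge,
    // in which case the merged candidate is rejected.
    static boolean isShortWordMerge(List<String> orgWords) {
        long shortWordNo = orgWords.stream()
            .filter(w -> w.length() < SHORT_WORD_LENGTH)
            .count();
        return shortWordNo >= MAX_SHORT_WORD_NO;
    }
}
```

Under these assumptions, "me at" contains two short words, so "meat" is rejected, while "on set" contains only one short word and remains a valid candidate.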
Ranker:
RankRealWordMergeByContext.java
Rank merge candidates by context scores
- context radius = 2 (configurable: CS_RW_MERGE_CONTEXT_RADIUS)
Validate the top-ranked candidate
Compare the top-ranked candidate with the original token to decide on correction:
- orgScore < 0: correct if either
  - topScore > 0, or
  - topScore < 0 and topScore * RealWord_Merge_Confidence_Factor > orgScore
- orgScore > 0: correct if
  - topScore * RealWord_Merge_Confidence_Factor > orgScore
- orgScore = 0
  - No real-word merge correction, because there is no word2Vec information on the original word
where:
- orgScore: the context score of the original token
- topScore: the context score of the top candidate
- RealWord_Merge_Confidence_Factor = 0.60 (configurable: CS_RANKER_RW_MERGE_C_FAC)
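The validation rule above can be written as a single predicate. The sketch below is only an illustration of the three cases as stated: the class and method names are hypothetical, and it is not taken from RankRealWordMergeByContext.java. The case topScore = 0 is not described in the text (candidates are required to have a non-zero context score), so the sketch simply makes no correction there.

```java
// Hypothetical illustration of the top-candidate validation rule.
final class RealWordMergeValidator {
    // Returns true when the top candidate should replace the original token.
    static boolean shouldCorrect(double orgScore, double topScore,
                                 double confidenceFactor) {
        if (orgScore < 0.0) {
            // A positive candidate always beats a negative original.
            if (topScore > 0.0) {
                return true;
            }
            // Both negative: compare the scaled candidate score.
            return topScore < 0.0 && topScore * confidenceFactor > orgScore;
        }
        if (orgScore > 0.0) {
            // Both must be positive for this to hold with a factor in (0, 1].
            return topScore * confidenceFactor > orgScore;
        }
        // orgScore == 0: no word2Vec information on the original token.
        return false;
    }
}
```

With the default factor of 0.60, an original score of 0.3 is replaced by a candidate scoring 0.6 (0.36 > 0.3) but not by one scoring 0.4 (0.24 < 0.3).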
Corrector:
MergeCorrector.java
- Reconstruct the text by updating the whole inTokenList with all mergeObjs
- Update the process history to real-word-merge
- The corrector needs to handle containment and overlap cases among all mergeObjs before the merge operation. This is a quick fix; the preferred approach is to apply each correction immediately after its merge (TBD). Also, the current merge operation is first come, first served; the sequential order of merges and other spelling corrections might be improved by using frequency or other scoring systems.
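One way to picture the first come, first served handling of containment and overlap is the sketch below. It is an assumption for illustration only and does not reproduce MergeCorrector.java: the Span class is a hypothetical stand-in for a mergeObj, reduced to the inclusive token index range it covers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of first come, first served merge selection.
final class MergeSpanFilter {
    // Stand-in for a merge object: token index range [start, end], inclusive.
    static final class Span {
        final int start;
        final int end;
        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // True when two ranges share at least one token (covers containment too).
    static boolean overlaps(Span a, Span b) {
        return a.start <= b.end && b.start <= a.end;
    }

    // Keeps spans in arrival order, dropping any span that clashes
    // with a span accepted earlier.
    static List<Span> firstComeFirstServed(List<Span> spans) {
        List<Span> accepted = new ArrayList<>();
        for (Span candidate : spans) {
            boolean clash = false;
            for (Span kept : accepted) {
                if (overlaps(candidate, kept)) {
                    clash = true;
                    break;
                }
            }
            if (!clash) {
                accepted.add(candidate);
            }
        }
        return accepted;
    }
}
```

Replacing the arrival order with an ordering by frequency or context score, as suggested above, would only require sorting the input list before filtering.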

II. Development Tests

Tested different real-word merge factors on the revised gold standard (which includes real-word errors) from the training set.

Function	Confidence Factor	Context Radius	Max. MergeNo	Raw data (TP|Retrieved|Relevant)	Performance (Precision|Recall|F1)
NW (1-to-1, Split, Merge)	N/A	N/A	2	604|775|964	0.7794|0.6266|0.6947

NW + RW_MERGE	0.20	2	2	609|783|964	0.7778|0.6317|0.6972*
NW + RW_MERGE	0.25	2	2	610|785|964	0.7771|0.6328|0.6975
NW + RW_MERGE	0.30	2	2	610|783|964	0.7791|0.6328|0.6983
NW + RW_MERGE	0.33	2	2	610|785|964	0.7771|0.6328|0.6975
NW + RW_MERGE	0.40	2	2	610|783|964	0.7791|0.6328|0.6983
NW + RW_MERGE	0.50	2	2	610|786|964	0.7761|0.6328|0.6971
NW + RW_MERGE	0.55	2	2	612|787|964	0.7776|0.6349|0.6990
NW + RW_MERGE	0.60	2	2	613|786|964	0.7799|0.6359|0.7006
NW + RW_MERGE Fixed LC on W2V	0.60	2	2	614|788|964	0.7792|0.6369|0.7009
NW + RW_MERGE	0.70	2	2	613|790|964	0.7759|0.6359|0.6990
NW + RW_MERGE	0.80	2	2	614|791|964	0.7762|0.6369|0.6997
NW + RW_MERGE	0.90	2	2	614|792|964	0.7753|0.6369|0.6993
NW + RW_MERGE	1.00	2	2	615|794|964	0.7746|0.6380|0.6997

NW + RW_MERGE	0.60	1	2	610|783|964	0.7791|0.6328|0.6983
NW + RW_MERGE	0.60	2	2	613|786|964	0.7799|0.6359|0.7006
NW + RW_MERGE	0.60	3	2	611|784|964	0.7793|0.6338|0.6991
NW + RW_MERGE	0.60	4	2	609|783|964	0.7778|0.6317|0.6972
NW + RW_MERGE	0.60	5	2	608|782|964	0.7775|0.6307|0.6964
NW + RW_MERGE	0.60	6	2	610|784|964	0.7781|0.6328|0.6979
NW + RW_MERGE	0.60	7	2	607|779|964	0.7792|0.6297|0.6965
NW + RW_MERGE	0.60	8	2	607|778|964	0.7802|0.6297|0.6969
NW + RW_MERGE	0.60	9	2	607|779|964	0.7792|0.6297|0.6965
NW + RW_MERGE	0.60	10	2	606|778|964	0.7789|0.6286|0.6958

NW + RW_MERGE	0.60	2	1	613|786|964	0.7799|0.6359|0.7006
NW + RW_MERGE	0.60	2	2	613|786|964	0.7799|0.6359|0.7006
NW + RW_MERGE	0.60	2	3	613|786|964	0.7799|0.6359|0.7006
NW + RW_MERGE	0.60	2	4	613|786|964	0.7799|0.6359|0.7006

A larger confidence factor increases both [TP] and [FP]. A value of 0.6 appears to reach the best F1.
A larger context radius generally decreases both [TP] and [FP]. A value of 2 appears to reach the best F1. We trained word2vec with a window size of 5, which corresponds to a context radius of 2 (1 token + 2 adjacent tokens on each side). It is best to use the same specification for training and application.
If the relevance of global context in the article is of interest, we suggest using a larger window size in training and the equivalent window in the application.
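The relation between context radius and window size described above can be illustrated with a short sketch. The class and method names are hypothetical, and the token list is assumed to hold word tokens only (no space tokens), which simplifies what the actual pipeline does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the context radius convention.
final class ContextWindow {
    // Window size as used in the text: the target token plus
    // `radius` tokens on each side.
    static int windowSize(int radius) {
        return 2 * radius + 1;
    }

    // Collects up to `radius` tokens on each side of the target token,
    // excluding the target itself and clipping at the text boundaries.
    static List<String> contextTokens(List<String> tokens, int targetIndex,
                                      int radius) {
        List<String> context = new ArrayList<>();
        int start = Math.max(0, targetIndex - radius);
        int end = Math.min(tokens.size() - 1, targetIndex + radius);
        for (int i = start; i <= end; i++) {
            if (i != targetIndex) {
                context.add(tokens.get(i));
            }
        }
        return context;
    }
}
```

A radius of 2 gives a window of 5, matching the word2vec training window mentioned above.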
The value of max. merge No. does not seem to have much impact on F1. A larger max. merge No. slows processing. Use the empirical value of 2 as the default.
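The performance figures in the table can be reproduced from the raw counts. The sketch below assumes that the three raw numbers are the true positives, the number of corrections returned, and the number of corrections in the gold standard, and that the three performance numbers are precision, recall, and F1; the arithmetic of the table rows is consistent with this reading. The class and method names are hypothetical.

```java
// Hypothetical helper reproducing the table's performance columns.
final class MergeMetrics {
    // Precision: fraction of returned corrections that are correct.
    static double precision(int tp, int retrieved) {
        return retrieved == 0 ? 0.0 : (double) tp / retrieved;
    }

    // Recall: fraction of gold standard corrections that were found.
    static double recall(int tp, int relevant) {
        return relevant == 0 ? 0.0 : (double) tp / relevant;
    }

    // F1: harmonic mean of precision and recall, equal to
    // 2 * TP / (retrieved + relevant).
    static double f1(int tp, int retrieved, int relevant) {
        int denominator = retrieved + relevant;
        return denominator == 0 ? 0.0 : 2.0 * tp / denominator;
    }
}
```

For the default setting (613, 786, 964), these formulas give approximately 0.7799, 0.6359, and 0.7006, the values reported in the table.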

III. Observations from Development test set

[TP] real-word merge:

ID	Source	Original Words	Merged Word
TP-1	1	on set	onset
TP-2	39	under developed	underdeveloped
TP-3	39	some what	somewhat
TP-4	62	life long	lifelong
TP-5	11579	anti psychotic	antipsychotic
TP-6	13645	non prescription	nonprescription
TP-7	13864	my self	myself
TP-8	14296	some one	someone
TP-9	15759	anti depresants	antidepressants
TP-10	16974	non drug	nondrug
TP-11	18766	some times	sometimes
TP-12	12745	extra corporeal	extracorporeal

TP-9: "depresants" is first corrected to "depressants" by nw_1-to-1, then merged into "antidepressants" by rw_merge (the only merge candidate).

[FP] real-word merge:

ID	Source	Original Words	Merged Word
FP-2	12261	a while	awhile
FP-3	16481	me anyt	meant
FP-5	18903	over time	overtime
FP-6	12630	every day	everyday
- FP-1 and FP-4 are caused by annotation differences between brat ([CONTACT]) and the Word2Vec corpus ([EMAIL]).
- TBD: Check the Word2Vec scores; a bigger corpus might improve recall and cover these cases.

[FN] real-word merge:

ID	Source	Original Words	Merged Word
FN-1	24	some thing	something
FN-2	30	there after	thereafter
FN-3	33	web site	website
FN-4	74	great full	grateful
FN-5	74	use full	useful
FN-6	11225	over read	overread
FN-7	11435	some time	sometime
FN-8	11579	with out	without
FN-9	11579	worth while	worthwhile
FN-10	11757	care taker	caretaker
FN-11	12271	in to	into
FN-12	12520	post menopause	postmenopause
FN-13	12646	what ever	whatever
FN-14	12800	through out	throughout
FN-15	13287	grand child	grandchild
FN-16	16823	after noon	afternoon
FN-17	16829	grand father	grandfather
FN-18	19818	boy friend	boyfriend

FN-4 and FN-5 involve more correction than a real-word merge alone.
TBD: Check the Word2Vec scores; a bigger corpus might improve recall and cover these cases.