CSpell

Real-word Merge

This page describes the processes for real-word merge detection and correction.

I. Processes

  • Detector:
    RealWordMergeDetector.java
    • Not corrected previously in the CSpell pipeline
    • real-word: valid word (in splitDic)
    • Not exceptions: digit, punctuation, digit/punctuation, email, url, empty string, upperCase, 1Char, measurement
  • Candidates:
    MergeCandidates.java
    • mergeNo <= 2 (configurable: CS_CAN_RW_MAX_MERGE_NO)
    • merge with hyphen is false (configurable: CS_CAN_RW_MERGE_WITH_HYPHEN)
      only merge with space " ", (no merge with hyphen "-")
    • context (adjacent tokens) is not an exception (url, email, ...)
    • orgWords (the words before merging) are not a multiword (not in mwDic)
    • candidate is a valid word (in suggDic) and not an abbreviation or acronym (not in aaDic)
    • candidate has context score (not zero)
    • Word count of candidate >= 15 (configurable: CS_CAN_RW_SPLIT_CAND_MIN_WC)
    • Not a short word merge
      • a short word is a word with length less than 3
      • the number of short words among the original words must be less than 2
      • Example:
        Input text   Candidate   Notes
        me at        meat        invalid candidate: 2 short words (me and at);
                                 source: 80.txt and 16734.txt
  • Ranker:
    RankRealWordMergeByContext.java
    • Rank merge candidates by context scores
      • context radius = 2 (configurable: CS_RW_MERGE_CONTEXT_RADIUS)
    • Validate the top-ranked candidate
      Compare the top-ranked candidate with the original token; a correction is made only in the following cases:
      • orgScore < 0, and either
        • topScore > 0, or
        • topScore < 0 and topScore * RealWord_Merge_Confidence_Factor > orgScore
      • orgScore > 0, and
        • topScore * RealWord_Merge_Confidence_Factor > orgScore
      • orgScore = 0
        • No real-word merge correction, because there is no word2Vec information on the original word

      where:

      • orgScore: the context score of the original token
      • topScore: the context score of the top candidate
      • RealWord_Merge_Confidence_Factor = 0.60 (configurable: CS_RANKER_RW_MERGE_C_FAC)
  • Corrector:
    MergeCorrector.java
    • reconstruct the text by updating the whole inTokenList with all mergeObjs
    • Update process history to real-word-merge

    • The corrector needs to handle contained and overlapping cases among all mergeObjs before the merge operation. This is a quick fix; a better way is to correct each merge right after it is found (TBD). Also, the current merge operation is first come, first served; the sequential order of merges and other spelling corrections might be improved by using frequency or another scoring system.
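
The short-word filter and the validation rules described in this section can be sketched as follows. This is a minimal illustration under assumptions, not the CSpell implementation: the class and method names (RealWordMergeSketch, isShortWordMerge, shouldCorrect) are hypothetical, and the case orgScore < 0 with topScore = 0, which the rules do not cover, is assumed here to yield no correction.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; names are illustrative and not taken from the CSpell source.
final class RealWordMergeSketch {
    // A short word has a length less than 3.
    static final int SHORT_WORD_LENGTH = 3;
    // A merge is rejected once the number of short words reaches 2.
    static final int MAX_SHORT_WORDS = 2;

    /** Returns true if the original words contain too many short words to be merged. */
    static boolean isShortWordMerge(List<String> orgWords) {
        int shortWords = 0;
        for (String word : orgWords) {
            if (word.length() < SHORT_WORD_LENGTH) {
                shortWords++;
            }
        }
        return shortWords >= MAX_SHORT_WORDS;
    }

    /** Applies the validation rules to the context scores of the original token and the top candidate. */
    static boolean shouldCorrect(double orgScore, double topScore, double confidenceFactor) {
        if (orgScore < 0.0) {
            if (topScore > 0.0) {
                return true;
            }
            if (topScore < 0.0) {
                return topScore * confidenceFactor > orgScore;
            }
            return false; // assumption: topScore == 0 is not covered by the rules
        }
        if (orgScore > 0.0) {
            return topScore * confidenceFactor > orgScore;
        }
        return false; // orgScore == 0: no word2vec information on the original word
    }

    public static void main(String[] args) {
        System.out.println(isShortWordMerge(Arrays.asList("me", "at")));  // true: 2 short words
        System.out.println(isShortWordMerge(Arrays.asList("on", "set"))); // false: 1 short word
        System.out.println(shouldCorrect(-0.5, -0.4, 0.60)); // true: -0.24 > -0.5
        System.out.println(shouldCorrect(0.3, 0.4, 0.60));   // false: 0.24 is not > 0.3
    }
}
```

With the default factor of 0.60, a candidate must score clearly higher than a positively scored original token before the original is replaced, which keeps the corrector conservative.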

II. Development Tests

Different real-word merge factors were tested on the revised gold standard (which includes real-word errors) from the training set.

Function                         Confidence Factor  Context Radius  Max. MergeNo  Raw data (TP|retrieved|relevant)  Performance (precision|recall|F1)
NW (1-to-1, Split, Merge)        N/A                N/A             2             604|775|964                       0.7794|0.6266|0.6947
NW + RW_MERGE                    0.20               2               2             609|783|964                       0.7778|0.6317|0.6972*
NW + RW_MERGE                    0.25               2               2             610|785|964                       0.7771|0.6328|0.6975
NW + RW_MERGE                    0.30               2               2             610|783|964                       0.7791|0.6328|0.6983
NW + RW_MERGE                    0.33               2               2             610|785|964                       0.7771|0.6328|0.6975
NW + RW_MERGE                    0.40               2               2             610|783|964                       0.7791|0.6328|0.6983
NW + RW_MERGE                    0.50               2               2             610|786|964                       0.7761|0.6328|0.6971
NW + RW_MERGE                    0.55               2               2             612|787|964                       0.7776|0.6349|0.6990
NW + RW_MERGE                    0.60               2               2             613|786|964                       0.7799|0.6359|0.7006
NW + RW_MERGE (Fixed LC on W2V)  0.60               2               2             614|788|964                       0.7792|0.6369|0.7009
NW + RW_MERGE                    0.70               2               2             613|790|964                       0.7759|0.6359|0.6990
NW + RW_MERGE                    0.80               2               2             614|791|964                       0.7762|0.6369|0.6997
NW + RW_MERGE                    0.90               2               2             614|792|964                       0.7753|0.6369|0.6993
NW + RW_MERGE                    1.00               2               2             615|794|964                       0.7746|0.6384|0.6997
NW + RW_MERGE                    0.60               1               2             610|783|964                       0.7791|0.6328|0.6983
NW + RW_MERGE                    0.60               2               2             613|786|964                       0.7799|0.6359|0.7006
NW + RW_MERGE                    0.60               3               2             611|784|964                       0.7793|0.6338|0.6991
NW + RW_MERGE                    0.60               4               2             609|783|964                       0.7778|0.6317|0.6972
NW + RW_MERGE                    0.60               5               2             608|782|964                       0.7775|0.6307|0.6964
NW + RW_MERGE                    0.60               6               2             610|784|964                       0.7781|0.6328|0.6979
NW + RW_MERGE                    0.60               7               2             607|779|964                       0.7792|0.6297|0.6965
NW + RW_MERGE                    0.60               8               2             607|778|964                       0.7802|0.6297|0.6969
NW + RW_MERGE                    0.60               9               2             607|779|964                       0.7792|0.6297|0.6965
NW + RW_MERGE                    0.60               10              2             606|778|964                       0.7789|0.6286|0.6958
NW + RW_MERGE                    0.60               2               1             613|786|964                       0.7799|0.6359|0.7006
NW + RW_MERGE                    0.60               2               2             613|786|964                       0.7779|0.6359|0.7006
NW + RW_MERGE                    0.60               2               3             613|786|964                       0.7799|0.6359|0.7006
NW + RW_MERGE                    0.60               2               4             613|786|964                       0.7799|0.6359|0.7006

  • A bigger confidence factor increases both [TP] and [FP]. A value of 0.6 seems to reach the best F1.
  • A bigger context radius decreases both [TP] and [FP]. A value of 2 seems to reach the best F1. We trained word2vec with a window size of 5, which matches a context radius of 2 (1 token + 2 adjacent tokens on each side). It is best to use the same specification for training and application.
  • If the relevance of the global context of the article is of interest, we suggest using a larger window size in training and the equivalent window in the application.
  • The value of max. merge No. does not seem to have much impact on F1. A bigger max. merge No. results in slower speed. The empirical value of 2 is used as the default.
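
The figures in the test table can be reproduced with the short sketch below. It assumes the Raw data column reads TP | retrieved | relevant and the Performance column reads precision | recall | F1; this interpretation is consistent with the reported numbers but is not stated on this page. The class and method names (DevTestSketch, performance, windowSize) are hypothetical, and the counts are assumed to be non-zero.

```java
import java.util.Locale;

// Hypothetical sketch; names are illustrative and not taken from the CSpell source.
final class DevTestSketch {
    /** Fraction of retrieved corrections that are true positives. */
    static double precision(int tp, int retrieved) {
        return (double) tp / retrieved;
    }

    /** Fraction of relevant (gold standard) corrections that were found. */
    static double recall(int tp, int relevant) {
        return (double) tp / relevant;
    }

    /** Harmonic mean of precision and recall. */
    static double f1(double p, double r) {
        return 2.0 * p * r / (p + r);
    }

    /** Formats precision|recall|F1 with four decimals, like the Performance column. */
    static String performance(int tp, int retrieved, int relevant) {
        double p = precision(tp, retrieved);
        double r = recall(tp, relevant);
        return String.format(Locale.ROOT, "%.4f|%.4f|%.4f", p, r, f1(p, r));
    }

    /** Window size covered by a context radius: the token plus radius tokens on each side. */
    static int windowSize(int contextRadius) {
        return 2 * contextRadius + 1;
    }

    public static void main(String[] args) {
        System.out.println(performance(604, 775, 964)); // 0.7794|0.6266|0.6947
        System.out.println(performance(613, 786, 964)); // 0.7799|0.6359|0.7006
        System.out.println(windowSize(2));              // 5
    }
}
```

The last call shows why a context radius of 2 corresponds to the word2vec training window of 5.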

III. Observations from Development test set

  • [TP] real-word merge:
    ID     Source  Original Words     Merged Word
    TP-1   1       on set             onset
    TP-2   39      under developed    underdeveloped
    TP-3   39      some what          somewhat
    TP-4   62      life long          lifelong
    TP-5   11579   anti psychotic     antipsychotic
    TP-6   13645   non prescription   nonprescription
    TP-7   13864   my self            myself
    TP-8   14296   some one           someone
    TP-9   15759   anti depresants    antidepressants
    TP-10  16974   non drug           nondrug
    TP-11  18766   some times         sometimes
    TP-12  12745   extra corporeal    extracorporeal
    • TP-9: "depresants" is first corrected to "depressants" by nw_1-to-1, then merged to "antidepressants" in rw_merge (the only merge candidate).

  • [FP] real-word merge:
    ID     Source  Original Words     Merged Word
    FP-2   12261   a while            awhile
    FP-3   16481   me anyt            meant
    FP-5   18903   over time          overtime
    FP-6   12630   every day          everyday
    • FP-1 & 4 are caused by different annotations between brat ([CONTACT]) and corpus Word2Vec ([EMAIL]).
    • TBD: Check on the Word2Vec scores, a bigger corpus might have better recall to cover these cases.

  • [FN] real-word merge:
    ID     Source  Original Words     Merged Word
    FN-1   24      some thing         something
    FN-2   30      there after        thereafter
    FN-3   33      web site           website
    FN-4   74      great full         grateful
    FN-5   74      use full           useful
    FN-6   11225   over read          overread
    FN-7   11435   some time          sometime
    FN-8   11579   with out           without
    FN-9   11579   worth while        worthwhile
    FN-10  11757   care taker         caretaker
    FN-11  12271   in to              into
    FN-12  12520   post menopause     postmenopause
    FN-13  12646   what ever          whatever
    FN-14  12800   through out        throughout
    FN-15  13287   grand child        grandchild
    FN-16  16823   after noon         afternoon
    FN-17  16829   grand father       grandfather
    FN-18  19818   boy friend         boyfriend
    • FN-4 and FN-5 involve more correction than just a real-word merge.
    • TBD: Check on the Word2Vec scores, a bigger corpus might have better recall to cover these cases.