CSpell

Merge Candidates

I. Introduction

Merge correction is used to correct split errors. A merge correction merges a series of tokens to a single word as the correct word. Merge candidates are all possible merged word retrieved from the target word. This page use the non-word merge as example to illustrate the merge process.

II. Algorithm (Non-Word Merge)

  • Detector:
    • Dictionary/MergeSpellChecker.IsValidWord
    • detect the token is a non-word (OOV)
    • Check both tokenStr and rmEndPuncStr (which remove the ending punctuation, such as ?|...), such as merge happen at the end of sentence.
    • Not in the Lexicon.noAa.Dic
    • Also check exceptions: digit, punctuation, url, email, ... (no merge)
    • Pure abbreviations or acronyms are considered as non-word errors.
      Example: dur ing, where "dur" matches "DUR|E0446524" for "drug use review", while "ing" matches "ING|E0439350" for "isotope nephrogram". However, they are considered as non-word (the dictionary does not include Aa for merge), dur is a OOV and starts the merge process.
  • Candidate Generator:
    • Candidates/MergeCandidates.java
    • Find the merged word by merging the target word and neighbor word within the specified number of spaces in both directions (before and after).
    • Use the merged word as candidate if it is in the dictionary
    • MergeObj is used for the merge operation
    • Example: If the input is "A big disap point ment comes", "disap" is detected as an OOV for merge case. A merge size of 2 generates 5 possible candidates:
      • abigdisap
      • bigdisappoint
      • disappointment
      • bigdisap
      • disappoint

      Only "disappointment" and "disappoint" are in the suggestDic and used as candidates
    • Also merge for hyphen ("-"), in addition to space (" ").
      • non prescription -> nonprescription
      • non prescription -> non-prescription
    • The candidates need to be:
      • in the suggestion Dic
      • not a known multiword (such as "non clinical")
      • not a known abbreviation or acronym (so "c d" does not merge to "cd")
  • Ranker:
    • Rankers/RankMerge*.java
    • frequency (better recall)
    • word embedding (better precision)
      => Use word embedding for ranking merge candidates is a much more complicated application than one-to-one or split because different merge candidate might have different context.
    • combined (one-stage: context score, then frequency)
    • TBD: noisy channel for merge cases is not implemented and tested due to the limited resources and not too many merge cases.

    • Not too many merge cases. Most case have only 1 merge candidate (in such case, no ranking is needed).
  • Corrector:
    • Corrector/ProcessNonWordMerge.java
    • Go through all merge objects to perform merge operation
    • Correct the merge and reconstruct the whole text.
    • The merge operation handles:
      • contain: Use the longer candidates
        • Example: imple ment ation
        • choose implementation over implement because implementation contains implement
      • overlap: use the first one in the queue
        • Example: proto col or
        • chose protocol or over proto color (not a very good example)

III. Merge Examples

Input Tokenscorrected wordEdit DisNotes
non-word merge correction
anyt imeanytime16481.txt
non prescriptionnonprescription13645.txt
dur ingduring73.txt
stiff n essstiffness1-119980475.txt
ver yvery1-136586815.txt
e mailemail1-135588237.txt
a mam1-135787225.txt
real-word merge correction
tricho rhino phalangealtrichorhinophalangeal12.txt
some timessometimesuse frequency, or word embedding
use fullusefuluse frequency, or word embedding, then word correction
cryo surgerycryosurgery
post menopausepostmenopause
through outthroughout
my selfmyself
ploy ureapolyurea
boy friendboyfriend
there afterthereafter
on setonset1.txt
some thingsomething24.txt
ultra soundultrasound1-123152135.txt
up dateupdate1-122785307.txt