CSpell

Merge Candidates

I. Introduction

Merge correction fixes split errors: it merges a series of tokens into a single, correct word. Merge candidates are all the possible merged words that can be generated from the target word and its neighboring tokens. This page uses non-word merge as the example to illustrate the merge process.

II. Algorithm (Non-Word Merge)

Detector:
- Dictionary/MergeSpellChecker.IsValidWord
- Detects whether the token is a non-word (OOV).
- Checks both tokenStr and rmEndPuncStr (tokenStr with the ending punctuation removed, such as ?|...), to cover merges that happen at the end of a sentence.
- A token is a non-word if it is not in Lexicon.noAa.Dic.
- Also checks exceptions: digit, punctuation, url, email, ... (no merge)
- Pure abbreviations and acronyms are treated as non-word errors.
  Example: dur ing, where "dur" matches "DUR|E0446524" for "drug use review", while "ing" matches "ING|E0439350" for "isotope nephrogram". Both are still treated as non-words, because the dictionary used for merge does not include abbreviations and acronyms (Aa). So "dur" is an OOV and starts the merge process.
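
The detector checks above can be sketched in Java as follows. This is a minimal illustration, not the actual Dictionary/MergeSpellChecker code: the class name, the method names, the exception patterns, and the rule that a token is valid when either tokenStr or rmEndPuncStr is in the dictionary are all assumptions made for the example. The dicNoAa set stands in for a dictionary without abbreviations and acronyms.

```java
import java.util.Set;

// Illustrative sketch only; names and patterns are hypothetical.
class NonWordMergeDetectorSketch {

    // Remove trailing punctuation, e.g. "ment?" -> "ment"
    // (assumed behavior of rmEndPuncStr).
    static String removeEndPunctuation(String token) {
        return token.replaceAll("\\p{Punct}+$", "");
    }

    // Assumed exception patterns: tokens that never start a merge.
    static boolean isException(String token) {
        return token.matches("\\d+")                          // digits
            || token.matches("\\p{Punct}+")                   // pure punctuation
            || token.matches("(?i)(https?://|www\\.)\\S+")    // URL-like
            || token.matches("[^@\\s]+@[^@\\s]+\\.[^@\\s]+"); // e-mail-like
    }

    // True when the token should start the merge process.
    static boolean isNonWordForMerge(String token, Set<String> dicNoAa) {
        String tokenStr = token.toLowerCase();
        String rmEndPuncStr = removeEndPunctuation(tokenStr);
        if (rmEndPuncStr.isEmpty() || isException(tokenStr) || isException(rmEndPuncStr)) {
            return false; // exceptions are never merged
        }
        // Assumption: the token is valid if either form is a dictionary word.
        return !dicNoAa.contains(tokenStr) && !dicNoAa.contains(rmEndPuncStr);
    }
}
```

With a dictionary that contains "during" but no abbreviations, "dur" is reported as a non-word and starts a merge, while "?" and "2019" are skipped as exceptions.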
Candidate Generator:
- Candidates/MergeCandidates.java
- Finds merged words by merging the target word with its neighboring words, within the specified number of spaces in both directions (before and after).
- Uses a merged word as a candidate if it is in the dictionary.
- MergeObj is used for the merge operation.
- Example: If the input is "A big disap point ment comes", "disap" is detected as an OOV and starts the merge process. A merge size of 2 generates 5 possible candidates:
  - abigdisap
  - bigdisappoint
  - disappointment
  - bigdisap
  - disappoint
  Only "disappointment" and "disappoint" are in the suggestDic and are used as candidates.
- Tokens are also merged with a hyphen ("-"), in addition to being merged by removing the space (" ").
  - non prescription -> nonprescription
  - non prescription -> non-prescription
- The candidates need to be:
  - in the suggestion Dic
  - not a known multiword (such as "non clinical")
  - not a known abbreviation or acronym (so "c d" does not merge to "cd")
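
The candidate generation above can be sketched in Java as follows. This is a hedged illustration, not the code in Candidates/MergeCandidates.java: the class and method names are hypothetical, the merge size is interpreted as the maximum number of spaces removed, and the multiword check is assumed to apply to the space-separated original tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; names and interpretation are hypothetical.
class MergeCandidateSketch {

    // All contiguous token windows that include the target and use
    // 1..maxMerges merges (that is, 2..maxMerges+1 tokens).
    static List<List<String>> windows(List<String> tokens, int target, int maxMerges) {
        List<List<String>> out = new ArrayList<>();
        for (int merges = 1; merges <= maxMerges; merges++) {
            for (int start = target - merges; start <= target; start++) {
                int end = start + merges; // inclusive index of the last token
                if (start >= 0 && end < tokens.size()) {
                    out.add(tokens.subList(start, end + 1));
                }
            }
        }
        return out;
    }

    // Keep merged strings that are in the suggestion dictionary and are
    // neither a known multiword nor a known abbreviation/acronym.
    static List<String> candidates(List<String> tokens, int target, int maxMerges,
            Set<String> suggestDic, Set<String> multiwords, Set<String> abbreviations) {
        List<String> out = new ArrayList<>();
        for (List<String> window : windows(tokens, target, maxMerges)) {
            // Assumption: the multiword check uses the space-separated original.
            if (multiwords.contains(String.join(" ", window).toLowerCase())) {
                continue;
            }
            // Try both merge styles: remove the space, or replace it with a hyphen.
            for (String joiner : new String[] {"", "-"}) {
                String merged = String.join(joiner, window).toLowerCase();
                if (suggestDic.contains(merged) && !abbreviations.contains(merged)) {
                    out.add(merged);
                }
            }
        }
        return out;
    }
}
```

For the tokens of "A big disap point ment comes" with target index 2 and a merge size of 2, windows produces the five windows listed in the example, and candidates keeps only the merged strings found in the supplied suggestion dictionary.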
Ranker:
- Rankers/RankMerge*.java
- frequency (better recall)
- word embedding (better precision)
  => Using word embeddings to rank merge candidates is much more complicated than for one-to-one or split corrections, because different merge candidates may have different contexts.
- combined (one-stage: context score, then frequency)
- TBD: a noisy channel model for merge cases has not been implemented or tested, because resources are limited and merge cases are rare.
- Most cases have only 1 merge candidate, in which case no ranking is needed.
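
One way to read the combined ranker is "order by context score, break ties by frequency". The Java sketch below shows that reading only; it is not the code in Rankers/RankMerge*.java, and the Scored class, its fields, and the tie-break rule are assumptions made for the example.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Illustrative sketch only; not the actual CSpell ranker.
class MergeRankSketch {

    // Hypothetical holder for one merge candidate and its scores.
    static final class Scored {
        final String word;
        final double contextScore; // e.g. from word embeddings
        final long frequency;      // e.g. corpus word count

        Scored(String word, double contextScore, long frequency) {
            this.word = word;
            this.contextScore = contextScore;
            this.frequency = frequency;
        }
    }

    // Higher context score first; equal scores fall back to higher frequency.
    static final Comparator<Scored> COMBINED =
        Comparator.comparingDouble((Scored s) -> s.contextScore).reversed()
            .thenComparing(Comparator.comparingLong((Scored s) -> s.frequency).reversed());

    // Returns the best candidate, or empty if there are none.
    // A single candidate is returned as-is, so no real ranking happens.
    static Optional<Scored> top(List<Scored> candidates) {
        return candidates.stream().min(COMBINED);
    }
}
```

Because most merge cases have a single candidate, top usually just returns that candidate; the comparator only matters when several merged words survive the candidate filter.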
Corrector:
- Corrector/ProcessNonWordMerge.java
- Goes through all merge objects and performs the merge operations.
- Applies the merges and reconstructs the whole text.
- The merge operation handles:
  - contain: use the longer candidate
    - Example: imple ment ation
    - choose "implementation" over "implement" because "implementation" contains "implement"
  - overlap: use the first one in the queue
    - Example: proto col or
    - choose "protocol or" over "proto color" (not a very good example)
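
The contain and overlap rules above can be sketched in Java as follows. This is an illustration under stated assumptions, not the code in Corrector/ProcessNonWordMerge.java: the Merge class here is a hypothetical stand-in for MergeObj that records only the token range and the merged word, and the queue is assumed to be ordered by target position.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; Merge is a hypothetical stand-in for MergeObj.
class MergeCorrectorSketch {

    static final class Merge {
        final int start;   // index of the first merged token
        final int end;     // index of the last merged token (inclusive)
        final String word; // the merged word

        Merge(int start, int end, String word) {
            this.start = start;
            this.end = end;
            this.word = word;
        }

        boolean overlaps(Merge o) {
            return start <= o.end && o.start <= end;
        }

        boolean strictlyContains(Merge o) {
            return start <= o.start && o.end <= end && (end - start) > (o.end - o.start);
        }
    }

    // contain: a longer merge replaces the shorter ones it covers.
    // overlap: the merge that came first in the queue is kept.
    static List<Merge> resolve(List<Merge> queue) {
        List<Merge> kept = new ArrayList<>();
        for (Merge m : queue) {
            boolean blocked = kept.stream()
                .anyMatch(k -> k.overlaps(m) && !m.strictlyContains(k));
            if (blocked) {
                continue; // an earlier merge wins
            }
            kept.removeIf(m::strictlyContains);
            kept.add(m);
        }
        return kept;
    }

    // Rebuild the text, replacing each merged token range with its word.
    static String apply(List<String> tokens, List<Merge> merges) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            Merge hit = null;
            for (Merge m : merges) {
                if (m.start == i) {
                    hit = m;
                }
            }
            if (hit != null) {
                out.add(hit.word);
                i = hit.end + 1;
            } else {
                out.add(tokens.get(i));
                i++;
            }
        }
        return String.join(" ", out);
    }
}
```

For "imple ment ation", the merge covering all three tokens replaces the shorter "implement" merge. For "proto col or", the "protocol" merge is first in the queue, so the overlapping "color" merge is dropped and "or" is left unchanged.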

III. Merge Examples

Input Tokens	Corrected Word	Edit Dist.	Notes
non-word merge correction
anyt ime	anytime		16481.txt
non prescription	nonprescription		13645.txt
dur ing	during		73.txt
stiff n ess	stiffness		1-119980475.txt
ver y	very		1-136586815.txt
e mail	email		1-135588237.txt
a m	am		1-135787225.txt
real-word merge correction
tricho rhino phalangeal	trichorhinophalangeal		12.txt
some times	sometimes		use frequency, or word embedding
use full	useful		use frequency, or word embedding, then word correction
cryo surgery	cryosurgery
post menopause	postmenopause
through out	throughout
my self	myself
ploy urea	polyurea
boy friend	boyfriend
there after	thereafter
on set	onset		1.txt
some thing	something		24.txt
ultra sound	ultrasound		1-123152135.txt
up date	update		1-122785307.txt