Merge Candidates
I. Introduction
Merge correction is used to correct split errors.
A merge correction merges a series of tokens to a single word as the correct word. Merge candidates are all possible merged word retrieved from the target word. This page use the non-word merge as example to illustrate the merge process.
II. Algorithm (Non-Word Merge)
- Detector:
Dictionary/MergeSpellChecker.IsValidWord
- detect the token is a non-word (OOV)
- Check both tokenStr and rmEndPuncStr (which remove the ending punctuation, such as ?|...), such as merge happen at the end of sentence.
- Not in the Lexicon.noAa.Dic
- Also check exceptions: digit, punctuation, url, email, ... (no merge)
- Pure abbreviations or acronyms are considered as non-word errors.
Example: dur ing, where "dur" matches "DUR|E0446524" for "drug use review", while "ing" matches "ING|E0439350" for "isotope nephrogram". However, they are considered as non-word (the dictionary does not include Aa for merge), dur is a OOV and starts the merge process.
- Candidate Generator:
Candidates/MergeCandidates.java
- Find the merged word by merging the target word and neighbor word within the specified number of spaces in both directions (before and after).
- Use the merged word as candidate if it is in the dictionary
- MergeObj is used for the merge operation
- Example: If the input is "A big disap point ment comes", "disap" is detected as an OOV for merge case. A merge size of 2 generates 5 possible candidates:
- abigdisap
- bigdisappoint
- disappointment
- bigdisap
- disappoint
Only "disappointment" and "disappoint" are in the suggestDic and used as candidates
- Also merge for hyphen ("-"), in addition to space (" ").
- non prescription -> nonprescription
- non prescription -> non-prescription
- The candidates need to be:
- in the suggestion Dic
- not a known multiword (such as "non clinical")
- not a known abbreviation or acronym (so "c d" does not merge to "cd")
- Ranker:
Rankers/RankMerge*.java
- frequency (better recall)
- word embedding (better precision)
=> Use word embedding for ranking merge candidates is a much more complicated application than one-to-one or split because different merge candidate might have different context.
- combined (one-stage: context score, then frequency)
- TBD: noisy channel for merge cases is not implemented and tested due to the limited resources and not too many merge cases.
- Not too many merge cases. Most case have only 1 merge candidate (in such case, no ranking is needed).
- Corrector:
Corrector/ProcessNonWordMerge.java
- Go through all merge objects to perform merge operation
- Correct the merge and reconstruct the whole text.
- The merge operation handles:
- contain: Use the longer candidates
- Example: imple ment ation
- choose implementation over implement because implementation contains implement
- overlap: use the first one in the queue
- Example: proto col or
- chose protocol or over proto color (not a very good example)
III. Merge Examples
Input Tokens | corrected word | Edit Dis | Notes
|
---|
non-word merge correction
|
anyt ime | anytime | | 16481.txt
|
non prescription | nonprescription | | 13645.txt
|
dur ing | during | | 73.txt
|
stiff n ess | stiffness | | 1-119980475.txt
|
ver y | very | | 1-136586815.txt
|
e mail | email | | 1-135588237.txt
|
a m | am | | 1-135787225.txt
|
real-word merge correction
|
tricho rhino phalangeal | trichorhinophalangeal | | 12.txt
|
some times | sometimes | | use frequency, or word embedding
|
use full | useful | | use frequency, or word embedding, then word correction
|
cryo surgery | cryosurgery | |
|
post menopause | postmenopause | |
|
through out | throughout | |
|
my self | myself | |
|
ploy urea | polyurea | |
|
boy friend | boyfriend | |
|
there after | thereafter | |
|
on set | onset | | 1.txt
|
some thing | something | | 24.txt
|
ultra sound | ultrasound | | 1-123152135.txt
|
up date | update | | 1-122785307.txt
|