Non-word Merge
I. Introduction
This page describes the processes for non-word merge detection and correction.
II. Processes
NonWordMergeDetector.java
MergeCandidates.java (CS_CAN_NW_MAX_MERGE_NO, CS_CAN_NW_MERGE_WITH_HYPHEN)
RankNonWordMergeByContext.java (CS_NW_MERGE_CONTEXT_RADIUS)
MergeCorrector.java
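
The candidate-generation step can be illustrated with a minimal sketch. It is not the code in MergeCandidates.java; the class name, method signature, and windowing rule are hypothetical. It also assumes, from the option names alone, that CS_CAN_NW_MAX_MERGE_NO caps how many separators may be removed in one merge and that CS_CAN_NW_MERGE_WITH_HYPHEN toggles hyphenated candidates.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the CSpell MergeCandidates.java implementation.
class MergeCandidateSketch {

    // Returns merged strings for every window of adjacent tokens that
    // contains tokens[target] and removes at most maxMergeNo separators.
    // maxMergeNo and withHyphen mirror the ASSUMED meaning of
    // CS_CAN_NW_MAX_MERGE_NO and CS_CAN_NW_MERGE_WITH_HYPHEN.
    static List<String> candidates(List<String> tokens, int target,
                                   int maxMergeNo, boolean withHyphen) {
        List<String> out = new ArrayList<>();
        for (int start = Math.max(0, target - maxMergeNo); start <= target; start++) {
            int lastEnd = Math.min(tokens.size() - 1, start + maxMergeNo);
            // A window needs at least two tokens and must reach the target.
            for (int end = Math.max(target, start + 1); end <= lastEnd; end++) {
                List<String> window = tokens.subList(start, end + 1);
                out.add(String.join("", window));      // no-space merge
                if (withHyphen) {
                    out.add(String.join("-", window)); // hyphen merge
                }
            }
        }
        return out;
    }
}
```

For the tokens "dur", "ing", "the" with target index 0, a maximum of one merge, and hyphens disabled, the only candidate is "during". A later step would still have to validate each candidate against the lexicon and rank the survivors.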
III. Development Test
True positives (TP):
Id | Source | Original Words | Merged Word |
---|---|---|---|
TP-1 | 11 | neuro transmissions | neurotransmissions |
TP-2 | 12 | tricho rhino phalangeal | trichorhinophalangeal |
TP-3 | 73 | dur ing | during |
TP-4 (RW) | 11579 | meth amphetamines | methamphetamines |
TP-5 (RW) | 13645 | non prescription | nonprescription |
TP-6 (RW) | 16974 | non drug | nondrug |

False positives (FP):
Id | Source | Original Words | Merged Word |
---|---|---|---|
FP-1 | 42 | senior loken | senior-loken |
FP-2 | 80 | pallido ponto nigral | pallidopontonigral |
FP-3 | 13423 | 3rd stage | 3rd-stage |

False negatives (FN):
Id | Source | Original Words | Merged Word |
---|---|---|---|
FN-1 | 53 | rs 12934922 | rs12934922 |
FN-2 | 53 | rs 4889294 | rs4889294 |
FN-3 | 13082 | as nd | and |
FN-4 | 16247 | long gevity | longevity |
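
As a rough illustration of how such cases are usually summarized, the sketch below computes precision, recall, and F1 from counts of TP, FP, and FN. The counts used (6, 3, 4) are simply the number of rows listed above; treating those rows as the complete sample is an assumption, so the numbers are not reported results of the system. The class and method names are hypothetical.

```java
import java.util.Locale;

// Illustrative helper; not part of the documented code base.
class MergeEvalSketch {
    static double precision(int tp, int fp) { return (double) tp / (tp + fp); }

    static double recall(int tp, int fn) { return (double) tp / (tp + fn); }

    // F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN).
    static double f1(int tp, int fp, int fn) { return 2.0 * tp / (2.0 * tp + fp + fn); }

    public static void main(String[] args) {
        // ASSUMPTION: counts are the listed example rows, not a full test set.
        int tp = 6, fp = 3, fn = 4;
        System.out.println(String.format(Locale.ROOT,
            "precision=%.3f recall=%.3f f1=%.3f",
            precision(tp, fp), recall(tp, fn), f1(tp, fp, fn)));
        // prints: precision=0.667 recall=0.600 f1=0.632
    }
}
```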
IV. Spelling Variants and Merge
Spelling variants that differ only in their separator (a space, a hyphen, or no space) are difficult to handle in merge cases. The following table shows the results of our merge algorithm:
Input | Output | Lexicon, Medline WC | Notes |
---|---|---|---|
Merge spaces | | | |
non prescription | nonprescription | | Space merge due to the ranking |
non profit | nonprofit | | Space merge due to the ranking |
Merge hyphens | | | |
non protein | non-protein | | Hyphen merge due to the ranking |
non self | non-self | | Hyphen merge due to the ranking |
non small | non-small | | Hyphen merge due to the ranking |
No merge | | | |
non diabetic | non diabetic | | No merge if the original term is in Lexicon (Dic - valid multiword) |
non surgical | non surgical | | No merge if the original term is in Lexicon (Dic - valid multiword) |
non competitive | non competitive | | No merge if the original term is in Lexicon (Dic - valid multiword) |
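
The three outcomes in the table can be sketched as a small decision routine. This is a simplified illustration, not the documented algorithm: the class and method names are hypothetical, and a plain word-count lookup stands in for the ranking the system actually applies. The lexicon entries and counts in the usage example are invented for the demonstration.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the space / hyphen / no-merge decision.
class MergeDecisionSketch {

    static String merge(String first, String second,
                        Set<String> lexicon, Map<String, Integer> wordCount) {
        String original = first + " " + second;
        // No merge if the original term is a valid multiword in the lexicon.
        if (lexicon.contains(original)) {
            return original;
        }
        String closed = first + second;           // space merge
        String hyphenated = first + "-" + second; // hyphen merge
        // ASSUMPTION: only lexicon entries qualify, and a word count is the
        // ranking score; the real ranker may use a different signal.
        int closedScore = lexicon.contains(closed) ? wordCount.getOrDefault(closed, 0) : -1;
        int hyphenScore = lexicon.contains(hyphenated) ? wordCount.getOrDefault(hyphenated, 0) : -1;
        if (closedScore < 0 && hyphenScore < 0) {
            return original; // no valid merged form
        }
        return closedScore >= hyphenScore ? closed : hyphenated;
    }

    public static void main(String[] args) {
        // Invented toy data, for illustration only.
        Set<String> lexicon = Set.of("non diabetic", "nonprescription",
                                     "nonprotein", "non-protein");
        Map<String, Integer> wc = Map.of("nonprescription", 40,
                                         "nonprotein", 5, "non-protein", 50);
        System.out.println(merge("non", "diabetic", lexicon, wc));     // non diabetic
        System.out.println(merge("non", "prescription", lexicon, wc)); // nonprescription
        System.out.println(merge("non", "protein", lexicon, wc));      // non-protein
    }
}
```

Checking the original term first keeps valid multiwords untouched regardless of how the merged forms would rank, which matches the "No merge" rows above.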