CSpell

Non-word Merge

I. Introduction

This page describes the processes for non-word merge detection and correction.

II. Processes

  • Detector:
    NonWordMergeDetector.java
    • non-word: invalid word (not in splitDic)
      • Check both tokenStr and rmEndPuncStr (tokenStr with the ending punctuation, such as ? or ..., removed), so that a merge does not happen at the end of a sentence.
      • Pure abbreviations and acronyms are treated as non-words: they are stored in Lexicon.noAa.Dic, which is excluded from splitDic.
      • Example: dur ing, where dur matches "DUR|E0446524" for "drug use review", while ing matches "ING|E0439350" for "isotope nephrogram". When we exclude abbreviations and acronyms, dur is a non-word and starts the merge process to during.
    • Not Exceptions:
      • Exceptions include digit, punctuation, digit/punctuation, email, url, empty string, upperCase, 1Char, measurement
      • No merge on exceptions
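
The detection rule above can be sketched as follows. This is a minimal illustration, not the code in NonWordMergeDetector.java: the class name, method names, and the simplified exception test (measurements are not covered) are assumptions, and splitDic is modelled as a plain set of lowercase words.

```java
import java.util.Set;

/** Hypothetical sketch of the non-word merge detection rule. */
class NonWordMergeDetectorSketch {

    /** Strips trailing punctuation such as "?" or "..." (simplified rule). */
    static String removeEndPunctuation(String token) {
        return token.replaceAll("\\p{Punct}+$", "");
    }

    /** Simplified exception test; measurements are not covered here. */
    static boolean isException(String token) {
        boolean allUpper = token.equals(token.toUpperCase())
            && token.chars().anyMatch(Character::isLetter);
        return token.length() <= 1                    // empty string, 1Char
            || token.matches("[\\d\\p{Punct}]+")      // digit, punctuation, digit/punctuation
            || token.contains("@")                    // email (rough test)
            || token.startsWith("http")               // url (rough test)
            || allUpper;                              // upperCase
    }

    /** True if the token is a non-word that should start the merge process. */
    static boolean isNonWordForMerge(String token, Set<String> splitDic) {
        if (isException(token)) {
            return false;                             // no merge on exceptions
        }
        String stripped = removeEndPunctuation(token);
        // Both forms must be invalid, so "during." at a sentence end is not flagged.
        return !splitDic.contains(token.toLowerCase())
            && !splitDic.contains(stripped.toLowerCase());
    }

    public static void main(String[] args) {
        Set<String> splitDic = Set.of("during", "big", "a");
        System.out.println(isNonWordForMerge("dur", splitDic));     // true
        System.out.println(isNonWordForMerge("during.", splitDic)); // false
    }
}
```

Because abbreviations and acronyms are left out of the set passed as splitDic, a token such as dur is flagged even though DUR exists as an acronym.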
  • Candidate Generator:
    MergeCandidates.java
    • Find the merged candidates by merging the target word and adjacent words within the specified number of spaces in both directions (before and after).
    • MergeNo <= 2 (configurable: CS_CAN_NW_MAX_MERGE_NO)
    • merge with hyphen = true (configurable: CS_CAN_NW_MERGE_WITH_HYPHEN)
      merge with space (" ") and hyphen ("-")
      Example: non prescription is merged to nonprescription and non-prescription
    • exceptions (url, email, ...) are not merged (target word or adjacent words)
    • orgWords (the words before merging) must not be a multiword (in mwDic)
    • candidate is in suggDic (valid word) and not in aaDic (not an abbreviation or acronym)
      => Use the merged word as candidate if it is in the dictionary
    • MergeObj is used for the merge operation
      Example: If the input is A big disap point ment comes, disap is detected as a non-word for the merge case. A merge size of 2 generates 5 possible candidates:
      • abigdisap
      • bigdisappoint
      • disappointment
      • bigdisap
      • disappoint

      Only disappointment and disappoint are in the suggestDic and used as candidates
      • The candidates must be in the suggestion Dic
      • the original term is not a known multiword (such as "non clinical")
      • words in the original term are not known abbreviations or acronyms (so "c d" does not merge to "cd")
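
The candidate enumeration and filtering above can be sketched as follows. This is a hedged illustration rather than the code in MergeCandidates.java: the class and method names are hypothetical, the dictionaries are modelled as plain sets, a single separator is used for the whole span, and the exception check on adjacent words is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of merge candidate generation. */
class MergeCandidatesSketch {

    /**
     * Returns {originalTerm, mergedCandidate} pairs for every contiguous span
     * that contains the target word and uses 1..maxMergeNo merges.
     */
    static List<String[]> rawCandidates(List<String> words, int target,
                                        int maxMergeNo, boolean withHyphen) {
        List<String[]> out = new ArrayList<>();
        for (int mergeNo = 1; mergeNo <= maxMergeNo; mergeNo++) {
            for (int start = target - mergeNo; start <= target; start++) {
                int end = start + mergeNo;
                if (start < 0 || end >= words.size()) {
                    continue;                         // span falls outside the text
                }
                List<String> span = words.subList(start, end + 1);
                String original = String.join(" ", span).toLowerCase();
                out.add(new String[] {original, String.join("", span).toLowerCase()});
                if (withHyphen) {
                    out.add(new String[] {original, String.join("-", span).toLowerCase()});
                }
            }
        }
        return out;
    }

    /** Keeps candidates in suggDic, not in aaDic, whose original term is not in mwDic. */
    static List<String> validCandidates(List<String[]> raw, Set<String> suggDic,
                                        Set<String> aaDic, Set<String> mwDic) {
        List<String> out = new ArrayList<>();
        for (String[] pair : raw) {
            boolean originalIsMultiword = mwDic.contains(pair[0]);
            boolean candidateIsWord = suggDic.contains(pair[1]) && !aaDic.contains(pair[1]);
            if (!originalIsMultiword && candidateIsWord) {
                out.add(pair[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> words = List.of("A", "big", "disap", "point", "ment", "comes");
        List<String[]> raw = rawCandidates(words, 2, 2, false);
        Set<String> suggDic = Set.of("disappoint", "disappointment");
        // prints [disappoint, disappointment]
        System.out.println(validCandidates(raw, suggDic, Set.of(), Set.of()));
    }
}
```

With the hyphen option disabled, the example sentence yields exactly the five raw candidates listed above, and only the two dictionary words survive the filter.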
  • Ranker:
    RankNonWordMergeByContext.java,
    uses the top ranked candidate for correction for the following cases:
    • frequency (better recall)
    • word embedding (better precision)
      • context radius = 2 (configurable, CS_NW_MERGE_CONTEXT_RADIUS)

      • topScore != 0

      Please note that, in the implementation, using word embedding to rank merge candidates is much more complicated than for 1-to-1 or split corrections, because different merge candidates have different contexts.
    • CSpell ranking: word embedding, then frequency

    • We observed that there were few merge cases in the training set, and most of them have only 1 merge candidate, so no ranking is needed for them.
    • TBD: the noisy channel model is not implemented or tested for merge cases, because the current ranking method is good enough and there are too few merge cases to justify more effort.
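
One way to read "word embedding, then frequency" is sketched below. This is an assumption about the control flow, not the code in RankNonWordMergeByContext.java: the context scorer is passed in as a function (the real scorer and its context radius are not modelled), and the fallback to frequency when the top context score is 0 is an interpretation of the topScore != 0 condition. The word counts in the usage example are the Medline WC values quoted on this page.

```java
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

/** Hypothetical sketch: rank by context score, fall back to frequency. */
class RankMergeSketch {

    static String topCandidate(List<String> candidates,
                               ToDoubleFunction<String> contextScore,
                               Map<String, Integer> wordCount) {
        if (candidates.isEmpty()) {
            return null;                              // nothing to rank
        }
        // 1. Word embedding: take the candidate with the highest context score.
        String best = candidates.get(0);
        double topScore = contextScore.applyAsDouble(best);
        for (String c : candidates) {
            double score = contextScore.applyAsDouble(c);
            if (score > topScore) {
                best = c;
                topScore = score;
            }
        }
        if (topScore != 0.0) {
            return best;
        }
        // 2. Frequency: used only when the context score gives no signal.
        String mostFrequent = candidates.get(0);
        for (String c : candidates) {
            if (wordCount.getOrDefault(c, 0) > wordCount.getOrDefault(mostFrequent, 0)) {
                mostFrequent = c;
            }
        }
        return mostFrequent;
    }

    public static void main(String[] args) {
        Map<String, Integer> wc = Map.of("nonprescription", 1803, "non-prescription", 1083);
        List<String> candidates = List.of("nonprescription", "non-prescription");
        // No context signal, so frequency decides: prints nonprescription
        System.out.println(topCandidate(candidates, c -> 0.0, wc));
    }
}
```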
  • Corrector:
    MergeCorrector.java
    • Reconstruct the whole text (inTokenList) by going through all mergeObjs
    • Update process history to non-word-merge
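
The reconstruction step can be sketched as follows. The MergeObj fields shown here (start index, end index, merged word) are assumptions made for illustration and may not match the real MergeObj class; merges are assumed not to overlap, and the process history update is not modelled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of rebuilding the token list from merge objects. */
class MergeCorrectorSketch {

    /** Assumed shape of a merge result: tokens start..end become mergedWord. */
    static final class MergeObj {
        final int start;
        final int end;
        final String mergedWord;

        MergeObj(int start, int end, String mergedWord) {
            this.start = start;
            this.end = end;
            this.mergedWord = mergedWord;
        }
    }

    static List<String> apply(List<String> inTokenList, List<MergeObj> mergeObjs) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < inTokenList.size()) {
            MergeObj hit = null;
            for (MergeObj m : mergeObjs) {
                if (m.start == i) {
                    hit = m;                          // a merge begins at this token
                }
            }
            if (hit != null) {
                out.add(hit.mergedWord);              // emit the merged word once
                i = hit.end + 1;                      // skip the tokens it replaced
            } else {
                out.add(inTokenList.get(i));          // copy untouched token
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("A", "big", "disap", "point", "ment", "comes");
        List<MergeObj> merges = List.of(new MergeObj(2, 4, "disappointment"));
        // prints A big disappointment comes
        System.out.println(String.join(" ", apply(tokens, merges)));
    }
}
```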

III. Development Test

  • True-Positive Non-word Merge:
    Id         | Source | Original Words          | Merged Word
    TP-1       | 11     | neuro transmissions     | neurotransmissions
    TP-2       | 12     | tricho rhino phalangeal | trichorhinophalangeal
    TP-3       | 73     | dur ing                 | during
    TP-4 (RW)  | 11579  | meth amphetamines       | methamphetamines
    TP-5 (RW)  | 13645  | non prescription        | nonprescription
    TP-6 (RW)  | 16974  | non drug                | nondrug
    • Cases 4, 5, and 6 are corrected as non-word merges, while they were annotated as real-word merges (RW) in the gold standard.
  • False-Positive Non-word Merge:
    Id    | Source | Original Words       | Merged Word
    FP-1  | 42     | senior loken         | senior-loken
    FP-2  | 80     | pallido ponto nigral | pallidopontonigral
    FP-3  | 13423  | 3rd stage            | 3rd-stage
    • Case 1: "Senior Loken Syndrome" is not in the Lexicon.
    • Case 2: "pallido ponto nigral" is not in the Lexicon (multiword dictionary).
    • Case 3 is fixed by adding ordinal numbers (such as 3rd) to the splitDictionary.
  • False-Negative Non-word Merge:
    Id    | Source | Original Words | Merged Word
    FN-1  | 53     | rs 12934922    | rs12934922
    FN-2  | 53     | rs 4889294     | rs4889294
    FN-3  | 13082  | as nd          | and
    FN-4  | 16247  | long gevity    | longevity
    • Cases 1 & 2 are genome terms, which are not in the current scope of CSpell
    • Cases 3 & 4 involve more than a pure merge operation (spelling + merge)

IV. Spelling Variants and Merge
Spelling variants that differ by a space, a hyphen, or no space are very difficult to handle in merge cases. The following table shows the results from our merge algorithm:

Each entry is listed as Input => Output, followed by the spelling variants as term|in Lexicon|Medline WC, and a note.

Merge spaces:
  • non prescription => nonprescription
    • nonprescription|Yes|1803
    • non-prescription|Yes|1083
    • non prescription|No|
    Note: Space merge due to the ranking
  • non profit => nonprofit
    • nonprofit|Yes|3682
    • non-profit|Yes|1762
    • non profit|No|
    Note: Space merge due to the ranking

Merge hyphens:
  • non protein => non-protein
    • nonprotein|Yes|2727
    • non-protein|Yes|3852
    • non protein|No|
    Note: Hyphen merge due to the ranking
  • non self => non-self
    • nonself|Yes|860
    • non-self|Yes|1826
    • non self|No|
    Note: Hyphen merge due to the ranking
  • non small => non-small
    • nonsmall|Yes|3167
    • non-small|Yes|54482
    • non small|No|
    Note: Hyphen merge due to the ranking

No merge:
  • non diabetic => non diabetic
    • nondiabetic|Yes|
    • non-diabetic|Yes|
    • non diabetic|Yes|
    Note: No merge if the original term is in Lexicon (Dic - valid multiword)
  • non surgical => non surgical
    • nonsurgical|Yes|
    • non surgical|Yes|
    • non-surgical|Yes|
    Note: No merge if the original term is in Lexicon (Dic - valid multiword)
  • non competitive => non competitive
    • noncompetitive|Yes|
    • non-competitive|Yes|
    • non competitive|Yes|
    Note: No merge if the original term is in Lexicon (Dic - valid multiword)