CSpell

Split Candidates

I. Introduction

A token is a single word (a term without spaces). When a token is corrected to multiple tokens (a multiword, that is, a term with spaces), the correction is called a split correction. Split candidates are generated by inserting one or more spaces into the input token. Split correction is used to fix agglutination errors, in which adjacent words are run together.

II. Algorithm

  • Get all possible splits of the misspelled word within an edit distance of 1 by:
    • Inserting a space: abc -> {{a bc}, {ab c}}
  • Get all possible split combinations of the misspelled word within N splits by:
    • Recursively applying the same algorithm to the result of the step above (1 split)
    • Collecting all resulting combinations
    • Optionally replacing '-' with a space. For example, "neck-lesion" is converted to "neck lesion". However, this feature increases recall but decreases precision and F1, so it is currently disabled. More study of the data is needed to find a better algorithm for this feature. One option is to check against spelling variants as a filter first.
  • Check all N splits:
    • Check whether the whole split term is a multiword (using the multiword dictionary):
      • If yes, it is a suggested candidate
        => such as "perse" to "per se"
      • If not, check whether every split word is a valid split word in the split word dictionary. The split word dictionary includes English words and proper nouns, and excludes abbreviations/acronyms of small (specified) length
        => A dictionary without acronyms and abbreviations is used to reduce noise (such as 'a', 'ab', ...)
        => Exceptions may be added (such as digits, units, ...)
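
The steps above can be sketched in Python as follows. This is a minimal illustration, not the CSpell implementation: the function names (one_splits, all_splits, valid_candidates), the toy dictionaries MULTIWORDS and SPLIT_WORDS, and the case-insensitive lookup are all assumptions made for the example.

```python
from typing import List, Set

# Toy dictionaries (assumed for illustration; not the real CSpell data).
MULTIWORDS: Set[str] = {"per se", "vice versa"}
SPLIT_WORDS: Set[str] = {"left", "side", "know", "about", "test", "split"}


def one_splits(token: str) -> List[str]:
    """Insert one space at every interior position: 'abc' -> ['a bc', 'ab c']."""
    return [token[:i] + " " + token[i:] for i in range(1, len(token))]


def all_splits(token: str, max_splits: int) -> Set[str]:
    """Collect every candidate that has between 1 and max_splits inserted spaces."""
    results: Set[str] = set()
    frontier: Set[str] = {token}
    for _ in range(max_splits):
        next_frontier: Set[str] = set()
        for term in frontier:
            words = term.split(" ")
            # Re-apply the single-split step to each word of the current term.
            for idx, word in enumerate(words):
                for split in one_splits(word):
                    pieces = words[:idx] + [split] + words[idx + 1:]
                    next_frontier.add(" ".join(pieces))
        results |= next_frontier
        frontier = next_frontier
    return results


def valid_candidates(token: str, max_splits: int,
                     multiwords: Set[str], split_words: Set[str]) -> List[str]:
    """Keep splits that are known multiwords or consist only of valid split words."""
    kept: List[str] = []
    for cand in sorted(all_splits(token, max_splits)):
        if cand.lower() in multiwords:
            kept.append(cand)  # whole term found in the multiword dictionary
        elif all(w.lower() in split_words for w in cand.split(" ")):
            kept.append(cand)  # every piece found in the split word dictionary
    return kept


if __name__ == "__main__":
    print(sorted(all_splits("abc", 2)))  # ['a b c', 'a bc', 'ab c']
    print(valid_candidates("perse", 1, MULTIWORDS, SPLIT_WORDS))     # ['per se']
    print(valid_candidates("leftside", 1, MULTIWORDS, SPLIT_WORDS))  # ['left side']
```

Using a set for the results removes duplicates that arise when the same multi-split candidate is reached through different split orders (for example, "a b c" from both "a bc" and "ab c").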

III. Split Test Examples

Error token        | Corrected words        | Notes
-------------------+------------------------+------------------------------------------
perse              | per se                 |
PTHrPeptide        | PTHr Peptide           |
viceversa          | vice versa             |
knowabout          | know about             | 14.txt
testsplit          | test split             |
123testsplitok123? | 123 test split ok 123? |
Amlodipine5mgs     | Amlodipine 5 mgs       | word + measurement
aftercareEmail     | aftercare Email        |
friendShare        | friend Share           | 10225.txt
facebookShare      | facebook Share         | 10225.txt
carbohydrates      | carbo hydrates         |
leftside           | left side              |
camedown           | came down              |
OftenDo            | Often Do               |
alot               | a lot                  | 14.txt
Thankyou           | Thank you              |
thank-you          | Thank you              |
diseaseAny         | disease Any            |
Ilost              | I lost                 |
along              | a long                 | 1034.txt
apart              | a part                 | 13864.txt
anyway             | any way                | 1-134591345.txt
everyday           | every day              | 1-136441717.txt
noone              | no one                 | 1-135085045.txt
hotflashes         | hot flashes            | 1-120029095.txt
infoon             | info on                | 11757.txt
icecreamarebad     | ice cream are bad      | same score, use frequency to increase precision
manytestsaredone   | many tests are done    | wrong ranking score among 3 candidates, use frequency to increase precision
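
The notes on the last two rows suggest using word frequency to separate candidates that otherwise tie or rank incorrectly. The sketch below shows one possible way to do that. It is purely illustrative: the scoring formula (mean log count with add-one smoothing), the function names, and the toy counts in WORD_FREQ are assumptions, not the ranking CSpell uses.

```python
import math
from typing import Dict, List

# Toy word counts (assumed for illustration only).
WORD_FREQ: Dict[str, int] = {
    "ice": 500, "cream": 400, "are": 9000, "bad": 800, "a": 10000, "re": 5,
}


def frequency_score(candidate: str, word_freq: Dict[str, int]) -> float:
    """Mean log count of the candidate's words; unseen words count as 0 (+1 smoothing)."""
    words = candidate.split(" ")
    return sum(math.log(word_freq.get(w.lower(), 0) + 1) for w in words) / len(words)


def rank_by_frequency(candidates: List[str], word_freq: Dict[str, int]) -> List[str]:
    """Order candidates from the highest to the lowest frequency score."""
    return sorted(candidates, key=lambda c: frequency_score(c, word_freq), reverse=True)


if __name__ == "__main__":
    tied = ["ice cream a re bad", "ice cream are bad"]
    print(rank_by_frequency(tied, WORD_FREQ)[0])  # ice cream are bad
```

Averaging over the number of words keeps a candidate from scoring higher simply because it contains more pieces; a rare piece such as "re" pulls the mean down.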

IV. TBD

  • split + 1-to-1 correction (TBD)
  • Examples:

    Input              | Split                | 1-to-1 Correction   | Notes
    -------------------+----------------------+---------------------+------------------------------
    menimgtisneef      | menimgtis neef       | meningitis needs    |
    anorexia?Thank-you | anorexia? Thank-you  | anorexia? Thank you | 2.txt
    Shuntfrom2007.How  | Shunt from 2007. How | Shut from 2007. How |
    polipsremoved      | polips removed       | polyps removed      |
    problem,I-am       | problem, I-am        | problem, I am       |
    gooddog            | good dog             | good dog            | split
    goooddog           | goood dog            | good dog            | split to non-word correction
    goddog             | god dog              | good dog            | split to real-word correction