Split Candidates
I. Introduction
A token is a single word (a term without spaces). When a token is corrected to multiple tokens (a multiword, that is, a term with spaces), the correction is called a split correction. Split candidates are generated by inserting one or more spaces into the input token. Split correction is used to fix agglutination errors, where the space between adjacent words was omitted.
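As an illustration of the idea of inserting spaces to generate candidates, here is a minimal Python sketch. It is not the algorithm used by this project; the function names `split_candidates` and `valid_splits`, the `max_splits` limit, and the dictionary lookup are all assumptions made for the example.

```python
from itertools import combinations


def split_candidates(token, max_splits=2):
    """Return every way to insert 1..max_splits spaces into token.

    max_splits is a hypothetical limit, not a value taken from this document.
    """
    candidates = []
    positions = range(1, len(token))  # gaps between adjacent characters
    for n in range(1, max_splits + 1):
        for cuts in combinations(positions, n):
            parts = []
            start = 0
            for cut in cuts:
                parts.append(token[start:cut])
                start = cut
            parts.append(token[start:])
            candidates.append(" ".join(parts))
    return candidates


def valid_splits(token, dictionary, max_splits=2):
    """Keep only candidates whose every word is in the (assumed) dictionary."""
    return [
        candidate
        for candidate in split_candidates(token, max_splits)
        if all(word.lower() in dictionary for word in candidate.split(" "))
    ]
```

For example, `split_candidates("alot", 1)` yields `a lot`, `al ot`, and `alo t`; filtering against a dictionary that contains `a` and `lot` keeps only `a lot`. The number of raw candidates grows quickly with token length, which is why a limit on the number of inserted spaces is assumed here.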
II. Algorithm
III. Split Test Examples
Error token | Corrected words | Notes |
---|---|---|
perse | per se | |
PTHrPeptide | PTHr Peptide | |
viceversa | vice versa | |
knowabout | know about | 14.txt |
testsplit | test split | |
123testsplitok123? | 123 test split ok 123? | |
Amlodipine5mgs | Amlodipine 5 mgs | word + measurement |
aftercareEmail | aftercare Email | |
friendShare | friend Share | 10225.txt |
facebookShare | facebook Share | 10225.txt |
carbohydrates | carbo hydrates | |
leftside | left side | |
camedown | came down | |
OftenDo | Often Do | |
alot | a lot | 14.txt |
Thankyou | Thank you | |
thank-you | Thank you | |
diseaseAny | disease Any | |
Ilost | I lost | |
along | a long | 1034.txt |
apart | a part | 13864.txt |
anyway | any way | 1-134591345.txt |
everyday | every day | 1-136441717.txt |
noone | no one | 1-135085045.txt |
hotflashes | hot flashes | 1-120029095.txt |
infoon | info on | 11757.txt |
icecreamarebad | ice cream are bad | candidates tie on score; use word frequency to increase precision |
manytestsaredone | many tests are done | wrong ranking among 3 candidates; use word frequency to increase precision |
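
The notes on the last two rows suggest using word frequency to choose between competing split candidates. One possible way to do that is sketched below in Python. The scoring rule (the count of the rarest word in the candidate) and the names `frequency_score` and `rank_candidates` are assumptions for illustration only, and any word counts passed in would have to come from a real corpus.

```python
def frequency_score(candidate, word_counts):
    """Score a candidate by its rarest word (an assumed heuristic).

    A candidate containing an unknown or very rare fragment scores low.
    """
    words = candidate.lower().split()
    return min(word_counts.get(word, 0) for word in words)


def rank_candidates(candidates, word_counts):
    """Sort candidates from highest to lowest frequency score.

    sorted() is stable, so candidates with equal scores keep their input order.
    """
    return sorted(
        candidates,
        key=lambda candidate: frequency_score(candidate, word_counts),
        reverse=True,
    )
```

With this rule, a candidate such as `man y tests are done` ranks below `many tests are done` whenever the fragment `y` is rarer than every word of the intended split.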
IV. TBD
Input | Split | 1-to-1 Correction | Notes |
---|---|---|---|
menimgtisneef | menimgtis neef | meningitis needs | |
anorexia?Thank-you | anorexia? Thank-you | anorexia? Thank you | 2.txt |
Shuntfrom2007.How | Shunt from 2007. How | Shut from 2007. How | |
polipsremoved | polips removed | polyps removed | |
problem,I-am | problem, I-am | problem, I am | |
gooddog | good dog | good dog | split |
goooddog | goood dog | good dog | split to non-word correction |
goddog | god dog | good dog | split to real-word correction |
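
The rows above combine a split with a follow-up 1-to-1 correction of each resulting word. A minimal Python sketch of the second step for the non-word case follows. It uses `difflib.get_close_matches` from the standard library as a stand-in similarity measure, which is an assumption and not the method described in this document; `correct_words`, `vocabulary`, and the `cutoff` value are hypothetical names and settings.

```python
import difflib


def correct_words(split_text, vocabulary, cutoff=0.6):
    """Apply a 1-to-1 correction to each word of an already split string.

    Words found in the vocabulary are kept. Other words are replaced by the
    closest vocabulary entry, or kept unchanged if nothing is close enough.
    """
    known = sorted(vocabulary)  # sorted for deterministic tie-breaking
    corrected = []
    for word in split_text.split(" "):
        if word.lower() in vocabulary:
            corrected.append(word)
            continue
        matches = difflib.get_close_matches(word.lower(), known, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)
```

Under these assumptions, `correct_words("goood dog", {"good", "dog", "god"})` gives `good dog`. The real-word case (`god dog` to `good dog`) cannot be handled this way, because `god` is itself a valid word and a dictionary lookup alone does not flag it.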