CSpell

Leading Digit Splitter

  • Description:
    This splitter inserts a space after the leading digits of a token that begins with digits.

  • Features:
    Split a token at the end of leading digits.

  • Examples:

    File Name    Input            Output
    73.txt       4miscarriages    4 miscarriages
    10349.txt    20years          20 years
    11579.txt    29yrs            29 yrs
    10349.txt    1.5years         1.5 years
    13082.txt    3weeks           3 weeks
    13175.txt    50mg             50 mg

  • Implementation Logic:
    • Convert the input word to a coreTerm by stripping leading and trailing punctuation and spaces.
    • Check whether the coreTerm begins with digits. If so:
      • Check whether the coreTerm matches any of the exceptions. If not:
        • Add a space after the leading digits.
    • Convert the updated coreTerm back to the output term.
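
    The steps above can be sketched in Java as follows. This is a minimal illustration, not the code in LeadingDigitSplitter.java: the class name, the CORE_TERM pattern used to peel off surrounding punctuation, and the reduced exception check (ordinal numbers only) are assumptions made for the example. Only the leading-digit pattern is taken from this page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the implementation logic; names are illustrative only.
class LeadingDigitSplitSketch {
    // Assumed pattern: leading punctuation/spaces, core term, trailing punctuation/spaces.
    private static final Pattern CORE_TERM =
        Pattern.compile("^([\\p{Punct}\\s]*)(.*?)([\\p{Punct}\\s]*)$", Pattern.DOTALL);

    // Matcher documented on this page: leading digits followed by 2+ letters.
    private static final Pattern LEADING_DIGITS =
        Pattern.compile("^(\\d*\\.?\\d+)([a-zA-Z]{2,})(.*)$");

    // Simplified stand-in for the exception filters: ordinal numbers only.
    private static final Pattern ORDINAL =
        Pattern.compile("\\d*(1st|2nd|3rd)|\\d+th");

    static String split(String inWord) {
        // Step 1: strip leading and trailing punctuation and spaces.
        Matcher core = CORE_TERM.matcher(inWord);
        if (!core.matches()) {
            return inWord;
        }
        String prefix = core.group(1);
        String coreTerm = core.group(2);
        String suffix = core.group(3);

        // Steps 2-3: split only if the core term leads with digits
        // and does not match an exception.
        Matcher m = LEADING_DIGITS.matcher(coreTerm);
        if (m.matches() && !ORDINAL.matcher(coreTerm).matches()) {
            coreTerm = m.group(1) + " " + m.group(2) + m.group(3);
        }

        // Step 4: restore the stripped punctuation around the updated core term.
        return prefix + coreTerm + suffix;
    }

    public static void main(String[] args) {
        System.out.println(split("20years"));   // 20 years
        System.out.println(split("(50mg),"));   // (50 mg),
        System.out.println(split("42nd"));      // 42nd (ordinal, left unchanged)
    }
}
```

    Keeping the stripped prefix and suffix separate means the regular expressions only ever see the core term, so surrounding punctuation cannot interfere with the anchored patterns.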

  • Notes:
    • Baseline source code: PreProcSplit.java
    • Enhancement:
      • No dictionary is used.
      • In addition to handling ordinal numbers (e.g. 1st, 2nd, 3rd, 4th), more exception patterns were extracted from the Lexicon and consumer data to increase precision (see details below).
    • Action: Redesigned and implemented
    • The splitter applies the non-dictionary splitter model, with matchers and filters implemented as regular expressions. They are described in the following tables:

      Matchers
      Matcher              Regular Expression                  Examples
      Leads with digit(s)  ^(\\d*\\.?\\d+)([a-zA-Z]{2,})(.*)$  21year, 1.5months, 5mg, 5and

      Filters (Exceptions)
      Filter (Exception)                                     Regular Expression                    Examples
      1. Ordinal number                                      ^((\\d*)(1st|2nd|3rd))|((\\d+)(th))$  1st, 42nd, 3rd, 435th
      2. [single char] after the leading digit               ^(\\d+)([a-zA-Z])$                    31D, 9L, 5q
      3. [Upper], [Upper or digit]* after leading digit      ^(\\d+)([A-Z]+)([A-Z0-9]*)$           67LR, 3Y1, 7PA2, 5FU
      4. [Upper, lower]+, [-], [word]* after leading digit   ^(\\d+)([a-zA-Z]+)-(\\w*)$            111In-Cl, 5q-syndrome, 38C-13
      5. [Upper, lower], [punc, digit]* after leading digit  ^(\\d+)([a-zA-Z])([\\p{Punct}\\d]*)$  16P-13.11, 16P-13, 1q21.1.
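
      As a rough illustration of how the matcher and filters combine, the sketch below compiles the regular expressions listed on this page and reports whether a core term would be split. The class and method names are hypothetical, and the use of Matcher.matches() (which requires the whole term to match, so the unanchored branches of the ordinal pattern cannot match a substring) is an assumption rather than a statement about LeadingDigitSplitter.java.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: one matcher plus five exception filters.
class LeadingDigitFilterSketch {
    // Matcher: leading digits followed by two or more letters.
    static final Pattern MATCHER =
        Pattern.compile("^(\\d*\\.?\\d+)([a-zA-Z]{2,})(.*)$");

    // Filters (exceptions), in the order listed on this page.
    static final Pattern[] FILTERS = {
        Pattern.compile("^((\\d*)(1st|2nd|3rd))|((\\d+)(th))$"),   // 1. ordinal number
        Pattern.compile("^(\\d+)([a-zA-Z])$"),                     // 2. single char
        Pattern.compile("^(\\d+)([A-Z]+)([A-Z0-9]*)$"),            // 3. upper, upper/digit
        Pattern.compile("^(\\d+)([a-zA-Z]+)-(\\w*)$"),             // 4. letters, hyphen, word
        Pattern.compile("^(\\d+)([a-zA-Z])([\\p{Punct}\\d]*)$")    // 5. letter, punct/digit
    };

    // True if the term matches the matcher and none of the filters.
    static boolean isSplittable(String coreTerm) {
        if (!MATCHER.matcher(coreTerm).matches()) {
            return false;
        }
        for (Pattern filter : FILTERS) {
            if (filter.matcher(coreTerm).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] terms = {"21year", "5mg", "42nd", "67LR", "111In-Cl"};
        for (String term : terms) {
            System.out.println(term + " -> " + isSplittable(term));
        }
    }
}
```

      Under these assumptions, 21year and 5mg are reported as splittable, while 42nd, 67LR, and 111In-Cl are held back by filters 1, 3, and 4 respectively.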

  • Source Code: LeadingDigitSplitter.java