CSpell

Ending Digit Splitter

  • Description:
    This splitter inserts a space before the ending digits of a token that ends with digits.

  • Features:
    Splits a token in front of its ending digits.

  • Examples:

    File Name  Input           Output
    26.txt     questions.1)    questions. 1)
    26.txt     hereditary2)    hereditary 2)
    26.txt     disease3)       disease 3)
    14849.txt  shuntfrom2007.  shuntfrom 2007.
    73.txt     jk5             jk 5

  • Implementation Logic:
    • Convert the input word to a coreTerm by stripping off leading and ending punctuation and spaces.
    • Check whether the coreTerm ends with digit(s). If yes:
      • Check whether the coreTerm matches any of the exceptions. If not:
        • Add a space before the ending digit(s).
    • Convert the updated coreTerm back to the output term.
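
    The steps above can be sketched in Java as follows. This is a minimal illustration, not the code in EndingDigitSplitter.java: the class name CoreTermSketch, the methods process and isException, and the use of \p{Punct} to define "punctuation" are all assumptions made for the example. The exception check is a stub that never matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the implementation logic; names are illustrative only.
class CoreTermSketch {
    // Assumed definition of "punctuation and spaces": ASCII punctuation plus whitespace.
    // Groups: (1) leading punctuation, (2) coreTerm, (3) ending punctuation.
    private static final Pattern PARTS =
        Pattern.compile("^([\\p{Punct}\\s]*)(.*?)([\\p{Punct}\\s]*)$");

    // Simplified check: any non-empty text followed by a trailing run of digits.
    private static final Pattern ENDING_DIGITS = Pattern.compile("^(.*?)(\\d+)$");

    static String process(String token) {
        Matcher parts = PARTS.matcher(token);
        if (!parts.matches()) {
            return token;
        }
        String prefix = parts.group(1);
        String core = parts.group(2);
        String suffix = parts.group(3);

        Matcher m = ENDING_DIGITS.matcher(core);
        // Split only if something precedes the digits and no exception applies.
        if (m.matches() && !m.group(1).isEmpty() && !isException(core)) {
            core = m.group(1) + " " + m.group(2);
        }
        // Convert the updated coreTerm back to the output term.
        return prefix + core + suffix;
    }

    // Stub (assumption): the exception filters are not modeled in this sketch.
    static boolean isException(String coreTerm) {
        return false;
    }
}
```

    Under these assumptions, CoreTermSketch.process("questions.1)") strips the ending ")", splits the coreTerm "questions.1" into "questions. 1", and reattaches the ")" to the result.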

  • Notes:
    • Baseline source code: PreProcSplit.java
    • Enhancement: does not use a dictionary
    • Action: Redesigned and implemented
    • Applies the non-dictionary splitter model, with matchers and filters implemented as regular expressions. They are described in the following tables:

      Matchers
      Matcher             Regular Expression          Examples
      Ends with digit(s)  ^(.*)([a-zA-Z\\.]+)(\\d+)$  disease3, 100.1, Co-Q10

      Filters (Exceptions)
      Filter (Exception)                            Regular Expression                           Examples
      1. [Upper]+ before ending digit               ^([A-Z]+)(\\d+)$                             A1, UPD14, CAD106, A2780
      2. [char]+, [-], [char]+ before ending digit  ^([a-zA-Z]+)-([a-zA-Z]+)(\\d+)$              NCI-H460, CCRF-HSB2, Co-Q10, saframycin-Yd2
      3. [Greek alphabet] before ending digit       ^(.*)(alpha|beta|gamma|delta|epsilon)(\\d)$  alpha1, beta2, gamma2, epsilon4
      4. [char] before ending digit                 ^([a-zA-Z])(\\d+)$                           c7, A1
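
    As an illustration of how the matcher and the four filters could be combined, here is a small Java sketch. The regular expressions are copied from the tables (the doubled backslashes are Java string escapes), but the class name EndingDigitRules and the methods isException and split are hypothetical and do not come from EndingDigitSplitter.java. The sketch assumes the input is already a coreTerm, with leading and ending punctuation removed.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: applies the documented matcher and filters to a coreTerm.
class EndingDigitRules {
    // Matcher: a letter or period directly before the ending digit(s).
    private static final Pattern ENDS_WITH_DIGITS =
        Pattern.compile("^(.*)([a-zA-Z\\.]+)(\\d+)$");

    // Filters (exceptions) 1 to 4, in the order listed in the table.
    private static final List<Pattern> EXCEPTIONS = List.of(
        Pattern.compile("^([A-Z]+)(\\d+)$"),
        Pattern.compile("^([a-zA-Z]+)-([a-zA-Z]+)(\\d+)$"),
        Pattern.compile("^(.*)(alpha|beta|gamma|delta|epsilon)(\\d)$"),
        Pattern.compile("^([a-zA-Z])(\\d+)$"));

    // True if the whole coreTerm matches any exception pattern.
    static boolean isException(String coreTerm) {
        for (Pattern p : EXCEPTIONS) {
            if (p.matcher(coreTerm).matches()) {
                return true;
            }
        }
        return false;
    }

    // Adds a space before the ending digits unless an exception applies.
    static String split(String coreTerm) {
        Matcher m = ENDS_WITH_DIGITS.matcher(coreTerm);
        if (!m.matches() || isException(coreTerm)) {
            return coreTerm;
        }
        return m.group(1) + m.group(2) + " " + m.group(3);
    }
}
```

    With these rules, split("disease3") yields "disease 3", while terms such as "Co-Q10", "UPD14", "alpha1", and "c7" are caught by a filter and returned unchanged.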

  • Source Code: EndingDigitSplitter.java