CSpell

Leading Punctuation Splitter

  • Description:
    This splitter processes a token by adding a space before leading punctuation when the token contains it. Leading punctuation includes: & ( [ {

  • Features:
    Split a token in front of leading punctuation.

  • Examples:

    File Name    Input          Output
    12134.txt    doppler(       doppler (
    12271.txt    1-plug&        1-plug &
    12353.txt    epilepsy(      epilepsy (
    12353.txt    volunteers(    volunteers (
    12706.txt    dr.[           dr. [
    18186.txt    test(          test (
    18341.txt    vain(          vain (
    2.txt        one(           one (
    30.txt       folitrax(      folitrax (
    50.txt       ,[             , [
    78.txt       genes[         genes [

  • Implementation Logic:
    • Recursively perform the following process until no further split occurs:
      • Convert the input word to a coreTerm by stripping off leading punctuation, spaces, and digits. This separates the word into prefix, coreTerm, and suffix.
      • Check if the coreTerm contains leading punctuation; if yes:
        • Find the first leading punctuation
        • Check if the coreTerm matches the exceptions of that leading punctuation; if not:
          • Add a space before the leading punctuation
      • Check if the prefix contains leading punctuation; if yes:
        • Find the first leading punctuation
        • Check if the prefix matches the exceptions of that leading punctuation; if not:
          • Add a space before the leading punctuation
      • Check if the suffix begins with leading punctuation; if yes:
        • Add a space before the leading punctuation
      • Convert the updated coreTerm back to the output term if a split happened in the coreTerm, prefix, or suffix.
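
    The recursive process above can be sketched roughly as follows. This is a simplified illustration, not the CSpell source: the class name, the method names, and the isException predicate are hypothetical, the prefix/coreTerm/suffix separation is skipped, and only the abbreviation exception (AT&T, R&D) is wired in as a stand-in for the full exception filters.

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical sketch; names are illustrative and not taken from CSpell.
final class LeadingPuncSplitSketch {
    // Leading punctuation characters named in the description: & ( [ {
    private static final String LEADING_PUNCTUATION = "&([{";

    // Stand-in for the per-punctuation exception filters (assumption:
    // only the abbreviation pattern is checked in this sketch).
    private static final Pattern ABBREVIATION = Pattern.compile("^[A-Z]+&[A-Z]+$");

    static String split(String token) {
        return split(token, t -> ABBREVIATION.matcher(t).matches());
    }

    // Adds a space before each leading punctuation character that follows
    // other text, recursing on the remainder until nothing is left to split.
    static String split(String token, Predicate<String> isException) {
        int pos = firstLeadingPunctuation(token);
        if (pos < 0 || isException.test(token)) {
            return token; // nothing to split, or the token is an exception
        }
        String head = token.substring(0, pos);
        String tail = token.substring(pos);
        return head + " " + split(tail, isException);
    }

    // Index of the first leading punctuation after position 0, or -1.
    // Position 0 is skipped because there is no text to separate it from.
    private static int firstLeadingPunctuation(String token) {
        for (int i = 1; i < token.length(); i++) {
            if (LEADING_PUNCTUATION.indexOf(token.charAt(i)) >= 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        for (String t : new String[] {"doppler(", "1-plug&", "dr.[", "AT&T"}) {
            System.out.println(t + " -> " + split(t));
        }
    }
}
```

    With inputs from the Examples table, split("doppler(") yields "doppler (" and split("dr.[") yields "dr. [", while "AT&T" is returned unchanged because it matches the abbreviation exception.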

  • Notes:
    • Baseline source code: PreProcSplit.java
    • Enhancement:
      • Does not use a dictionary
      • Adds [&] as leading punctuation
      • Removes [/] and [-] from leading punctuation to increase precision
      • Implements exceptions separately for each leading punctuation
      • Uses coreTermObj to split a token into prefix, coreTerm, and suffix
      • Splits recursively until no further split occurs
    • The characters @ and * might qualify as leading punctuation; this needs further analysis.
    • Action: Redesigned and implemented
    • Applies the non-dictionary splitter model, using regular-expression matchers and filters for each leading punctuation. They are described in the following tables:
      Broader Generic Matchers (Qualifiers)

      Matcher                         Regular Expression      Examples
      Contains Leading Punctuation    ^.*[&\\(\\[\\{].*$
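
    As a small illustration of how the qualifier might be applied, the sketch below compiles the regular expression from the table as a Java string literal and tests a few tokens. The class and method names are hypothetical, and using a whole-token match is an assumption.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; names are illustrative and not taken from CSpell.
final class LeadingPuncMatcherSketch {
    // The "Contains Leading Punctuation" expression, copied from the table.
    private static final Pattern CONTAINS_LEADING_PUNC =
            Pattern.compile("^.*[&\\(\\[\\{].*$");

    // True if the token contains &, (, [ or { anywhere, which makes it a
    // candidate for splitting before the exception filters are consulted.
    static boolean qualifies(String token) {
        return CONTAINS_LEADING_PUNC.matcher(token).matches();
    }

    public static void main(String[] args) {
        for (String t : new String[] {"doppler(", "genes[", "1-plug&", "doppler"}) {
            System.out.println(t + " qualifies: " + qualifies(t));
        }
    }
}
```

    The matcher is deliberately broad: it only decides whether a token is worth examining, and tokens such as AT&T still qualify here before being screened out by the filters.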

      Filters (Specific Exceptions for Each Leading Punctuation)

      Each entry lists: Leading Punctuation, Filter (Exception), Regular Expression, Examples.

      Ampersand [&]
        1. Abbreviations: [A-Z]+&[A-Z]+
           Regular Expression: ^[A-Z]+&[A-Z]+$
           Examples:
           • AT&T
           • R&D

      Left Parenthesis [(]
        1. contains digits or plus sign: [non-space]*([digit]+\+?)[non-space]*
           Regular Expression: ((\\S)*\\([\\d]+(\\+)?\\)(\\S)*)
           Examples:
           • RS(3)PE
           • δ(18)O
           • Ca(2+)
           • Ca(2+)-ATPase
        2. max or min: [non-space]*(max|min)[non-space]*
           Regular Expression: ((\\S)*\\((max|min)\\))
           Examples:
           • V(max)
           • C(min)
        3. contains a single char or plus: [non-space]*(+char)[non-space]*
           Regular Expression: ((\\S)*\\([+\\w]\\)(\\S)*)
           Examples:
           • D(+)HUS
           • GABA(A)
           • apolipoprotein(a)
           • beta(1)s
           • homocyst(e)ine
        4. parenthetic plural forms: [word]+((s|es)|(y(ies)))
           Regular Expression: ([\\w]+((s\\(es\\))|(y\\(ies\\))))
           Examples:
           • finger(s)
           • fetus(es)
           • extremity(ies)
        5. after a hyphen: [non-space]*-([non-space]*)
           Regular Expression: ((\\S)*-\\((\\S)*)
           Examples:
           • poly-(ethylene
           • poly-(ADP-ribose)
           • C-(17:0)
           • I-(alpha)

      Left Square Bracket [[]
        1. [ [lower] ]: [non-space]*[[lower]][non-space]*
           Regular Expression: (\\S*\\[[a-z]\\]\\S*)
           Examples:
           • benzo[a]pyrene
           • B[e]P
        2. leads with tilde or hyphen: (tilde|hyphen)[
           Regular Expression: ([~\\-]\\[\\S*)
           Examples:
           • -[NAME]
           • ~[NAME]

      Left Curly Brace [{]
        1. No exceptions found
           Regular Expression: $^
           Examples: None
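
    To show how the exception filters could be grouped per leading punctuation, the sketch below stores the regular expressions from the table in a map keyed by the punctuation character. The class name, the map layout, and the choice to require a whole-token match are assumptions for illustration; the patterns themselves are copied from the table. It relies on Map.of and List.of, so it needs Java 9 or later.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch; names and structure are not taken from CSpell.
final class LeadingPuncFilterSketch {
    // Exception patterns per leading punctuation, copied from the table.
    // '{' has no exceptions, so it has no entry in the map.
    private static final Map<Character, List<Pattern>> EXCEPTIONS = Map.of(
        '&', List.of(
            Pattern.compile("^[A-Z]+&[A-Z]+$")),                     // abbreviations
        '(', List.of(
            Pattern.compile("((\\S)*\\([\\d]+(\\+)?\\)(\\S)*)"),     // digits or plus sign
            Pattern.compile("((\\S)*\\((max|min)\\))"),              // max or min
            Pattern.compile("((\\S)*\\([+\\w]\\)(\\S)*)"),           // single char or plus
            Pattern.compile("([\\w]+((s\\(es\\))|(y\\(ies\\))))"),   // parenthetic plurals
            Pattern.compile("((\\S)*-\\((\\S)*)")),                  // after a hyphen
        '[', List.of(
            Pattern.compile("(\\S*\\[[a-z]\\]\\S*)"),                // [lowercase letter]
            Pattern.compile("([~\\-]\\[\\S*)")));                    // leads with ~ or -

    // True if the whole token matches any exception registered for the
    // given leading punctuation, meaning no space should be inserted.
    static boolean isException(String token, char punctuation) {
        return EXCEPTIONS.getOrDefault(punctuation, List.of()).stream()
                .anyMatch(p -> p.matcher(token).matches());
    }

    public static void main(String[] args) {
        System.out.println(isException("Ca(2+)-ATPase", '('));
        System.out.println(isException("benzo[a]pyrene", '['));
        System.out.println(isException("doppler(", '('));
    }
}
```

    Keeping one list per punctuation mirrors the note that exceptions are implemented separately for each leading punctuation, so a new exception for one character can be added without affecting the others.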

  • Source Code: