
CSpell

Ending Punctuation Splitter

  • Description:
    This splitter splits a token by adding a space after ending punctuation when the token contains ending punctuation. Ending punctuation includes: . ? ! , : ; & ) ] }

  • Features:
    Split a token in front of ending punctuation.

  • Examples:

    File Name    Input                 Output
    10023.txt    down.please           down. please
    10286.txt    ...my                 ... my
    10004.txt    cancer?if             cancer? if
    11186.txt    ?pls                  ? please
    97.txt       suggestions?thanks    suggestions? thanks
    53.txt       hello!can             hello! can
    11186.txt    ,she                  , she
    16823.txt    :by                   : by
    22.txt       ;syrinx               ; syrinx
    2.txt        )why                  ) why

  • Implementation Logic:
    • Recursively perform the following process:
    • Convert the input word to a coreTerm by stripping off leading and ending punctuation, spaces, and digits.
    • Check if the coreTerm contains ending punctuation, if yes
      • Find the last ending punctuation
      • Check if the coreTerm matches the exceptions of the ending punctuation, if not:
        • Add a space after the ending punctuation (as in the examples, e.g. down.please becomes down. please)
    • Check if the prefix ends with ending punctuation, if yes
      • Add space after the ending punctuation
    • Check if the suffix contains ending punctuation, if yes
      • Find the last ending punctuation
      • Check if the suffix matches the exceptions of the ending punctuation, if not:
        • Add space before the ending punctuation
    • Convert the updated coreTerm back to the output term if a split happened in the coreTerm, prefix, or suffix.
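
The recursion described above can be illustrated with a minimal Java sketch. It is a simplification under stated assumptions, not the CSpell implementation: it skips the prefix/coreTerm/suffix decomposition, inserts the space after the punctuation as in the Examples table, and takes the exception check as a caller-supplied predicate. The class and method names (EndingPuncSplitSketch, lastSplitIndex, split) are hypothetical.

```java
import java.util.function.Predicate;

// Simplified sketch of the recursive ending-punctuation split.
// Assumption: no prefix/coreTerm/suffix handling; exceptions are supplied by the caller.
class EndingPuncSplitSketch {
    // Ending punctuation listed in the Description: . ? ! , : ; & ) ] }
    static final String ENDING_PUNC = ".?!,:;&)]}";

    // Index of the last ending punctuation that is followed by a character
    // which is not itself ending punctuation; -1 if there is none.
    static int lastSplitIndex(String token) {
        for (int i = token.length() - 2; i >= 0; i--) {
            boolean isPunc = ENDING_PUNC.indexOf(token.charAt(i)) >= 0;
            boolean nextIsPunc = ENDING_PUNC.indexOf(token.charAt(i + 1)) >= 0;
            if (isPunc && !nextIsPunc) {
                return i;
            }
        }
        return -1;
    }

    // Insert a space after the last splittable ending punctuation, then
    // recurse on both halves until no further split is possible.
    static String split(String token, Predicate<String> isException) {
        int i = lastSplitIndex(token);
        if (i < 0 || isException.test(token)) {
            return token;
        }
        String left = token.substring(0, i + 1);
        String right = token.substring(i + 1);
        return split(left, isException) + " " + split(right, isException);
    }

    public static void main(String[] args) {
        Predicate<String> noExceptions = t -> false;
        // Digit group separator exception, regex taken from the Comma filter.
        Predicate<String> digitGroups = t -> t.matches("(\\d+(,[\\d]{3})+)");

        System.out.println(split("down.please", noExceptions)); // down. please
        System.out.println(split("...my", noExceptions));       // ... my
        System.out.println(split("a.b,c", noExceptions));       // a. b, c
        System.out.println(split("12,345", digitGroups));       // 12,345
    }
}
```

The call on a.b,c shows why the process is recursive: the first pass splits after the comma, and the second pass splits the remaining left half after the period.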

  • Notes:
    • Baseline source code: PreProcSentence.java
    • Enhancement:
      • Use no dictionary
      • Add [:] as ending punctuation
      • Remove hard-coded patterns of [NUM], [EMAIL], [URL]
      • Remove leading punctuation of [/] and [-] to increase precision
      • Implement exceptions separately for each ending punctuation
      • Use coreTermObj to split into prefix, coreTerm, and suffix
      • Recursively split until no further split is possible
    • The punctuation marks @ and * might qualify as ending punctuation; this needs further analysis.
    • Action: Redesigned and implemented
    • Apply the non-dictionary splitter model with matchers and filters, using regular expressions for each ending punctuation. They are described in the following tables:
      Broader Generic Matchers

      Matcher: Contains Ending Punctuation
      Regular Expression: ^.*[\\.\\?!,;:&\\)\\]\\}].*$

      Matcher: Email (false)
      Regular Expression: ^[\\w!#$%&'*+-/=?^_`{|}~]+@(\\w+(\\.\\w+)*(\\.(gov|com|org|edu|mil|net)))$
      Examples:
      • abc@gmail.com
      • !!@gamil.com
      • abc@123.net

      Matcher: Url (false)
      Regular Expression: ^((ftp|http|https|file)://)?(\\w+(\\.\\w+)*(\\.(gov|com|org|edu|mil|net|uk)).*)$
      Examples:
      • http://www.yahoo.com
      • yahoo.com
      • yahoo.com?test=1%20try%20abc

      Matcher: Pure digit or punctuation (false)
      Regular Expression: ^([\\W_\\d&&\\S]+)$
      Examples:
      • 123.500
      • 12-35-00
      • 12.35.00
      • !@#123$%^
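
The generic matchers can be tried out with a short Java sketch. The regular expressions are copied from the table above. How they are combined is an assumption: this sketch reads "(false)" as meaning that a match on that matcher rules the token out as a split candidate. The class name GenericMatcherSketch and the method isSplitCandidate are hypothetical.

```java
import java.util.regex.Pattern;

// Sketch of the broader generic matchers; regexes are taken from the table.
// Assumption: "(false)" means a match vetoes the split.
class GenericMatcherSketch {
    static final Pattern CONTAINS_ENDING_PUNC =
        Pattern.compile("^.*[\\.\\?!,;:&\\)\\]\\}].*$");
    static final Pattern EMAIL = Pattern.compile(
        "^[\\w!#$%&'*+-/=?^_`{|}~]+@(\\w+(\\.\\w+)*(\\.(gov|com|org|edu|mil|net)))$");
    static final Pattern URL = Pattern.compile(
        "^((ftp|http|https|file)://)?(\\w+(\\.\\w+)*(\\.(gov|com|org|edu|mil|net|uk)).*)$");
    static final Pattern PURE_DIGIT_OR_PUNC =
        Pattern.compile("^([\\W_\\d&&\\S]+)$");

    // A token is a split candidate if it contains ending punctuation and is
    // not an e-mail address, a URL, or made up purely of digits and punctuation.
    static boolean isSplitCandidate(String token) {
        return CONTAINS_ENDING_PUNC.matcher(token).matches()
            && !EMAIL.matcher(token).matches()
            && !URL.matcher(token).matches()
            && !PURE_DIGIT_OR_PUNC.matcher(token).matches();
    }

    public static void main(String[] args) {
        String[] tokens = {
            "down.please", "abc@gmail.com", "yahoo.com", "12.35.00", "hello"
        };
        for (String token : tokens) {
            System.out.println(token + " -> " + isSplitCandidate(token));
        }
    }
}
```

Of the five tokens in main, only down.please is reported as a split candidate; hello contains no ending punctuation and the other three are caught by the e-mail, URL, and digit/punctuation matchers.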

      Filters (Specific Exceptions for Each Ending Punctuation)

      Period [.]
      1. Plural form
         Regular Expression: (.*\\.s)
         Examples:
         • Dr.s
         • Mr.s
      2. Surrounded by digits
         Pattern: [char]*[digit].[digit][char]*
         Regular Expression: ((\\w*\\d\\.\\d\\w*)+)
         Examples:
         • 16q22.1
         • 123.2
         • 123.234.4567
         • 1c3.2d4.4e6
      3. Surrounded by single characters
         Pattern: [single non-digit].[single non-digit]?
         Regular Expression: ((\\D\\.)+\\D?)
         Examples:
         • D.C.A.B.
         • D.C.A.B
         • d.c.a.
         • d.c.a
         • D.c
      4. Followed by a hyphen
         Pattern: [word]*.-[word]*
         Regular Expression: (\\w*\\.-\\w*)
         Examples:
         • St.-John
         • 123.-John
      5. Followed by a quote
         Pattern: [char]*.['"]
         Regular Expression: (.*\\.['\"])
         Examples:
         • Mucinosis."

      Question Mark [?]
      1. Followed by a quote
         Pattern: [char]*?['"]
         Regular Expression: (.*\\?['\"])
         Examples:
         • ulcers?'
         • ulcers?"

      Exclamation Mark [!]
      1. Followed by a quote
         Pattern: [char]*!['"]
         Regular Expression: (.*!['\"])
         Examples:
         • ulcers!'
         • ulcers!"

      Comma [,]
      1. Digit group separator
         Pattern: [digit]+,[digit]{3}
         Regular Expression: (\\d+(,[\\d]{3})+)
         Examples:
         • 12,345
         • 1,234,567

      Colon [:]
      1. Ratio
         Pattern: [digit]+:[digit]+
         Regular Expression: (\\d+:\\d+)
         Examples:
         • 1:2

      Semicolon [;]
      1. No exceptions found
         Regular Expression: $^
         Examples: None

      Ampersand [&]
      1. Abbreviations
         Pattern: [A-Z]+&[A-Z]+
         Regular Expression: [A-Z]+&[A-Z]+
         Examples:
         • AT&T
         • R&D

      Right Parenthesis [)]
      1. Single char surrounded by parentheses
         Pattern: [non-space]*([+char])[non-space]*
         Regular Expression: ((\\S)*\\([+\\w]\\)(\\S)*)
         Examples:
         • homocyst(e)ine
         • NAD(P)H
         • RS(3)PE
         • D(+)HUS
      2. Chars surrounded by parentheses and followed by a hyphen
         Pattern: [non-space]*(char+)-[non-space]*
         Regular Expression: ((\\S)*\\([+\\w]+\\)-(\\S)*)
         Examples:
         • Ca(2+)-ATPase
         • beta(2)-microglobulin
         • (Si)-synthase
         • (ADP)-ribose
      3. Digits surrounded by parentheses
         Pattern: [non-space]*(digit+)[non-space]*
         Regular Expression: ((\\S)*\\(\\d+\\)(\\S)*)
         Examples:
         • VO(2)max
         • δ(18)O
         • (123)I-mIBG
         • (131)I

      Right Square Bracket []]
      1. [digit]+[Upper] surrounded by []
         Pattern: [non-space]*[[digit]+[Upper]][non-space]*
         Regular Expression: (\\S*\\[\\d+[A-Z]\\]\\S*)
         Examples:
         • [11C]MeG
         • [3H]-thymidine
         • [3H]tyrosine
      2. [lower] surrounded by []
         Pattern: [non-space]*[[lower]][non-space]*
         Regular Expression: (\\S*\\[[a-z]\\]\\S*)
         Examples:
         • benzo[a]pyrene
         • B[e]P

      Right Curly Brace [}]
      1. No exceptions found
         Regular Expression: $^
         Examples: None
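
The per-punctuation filters can be checked against their examples with a small Java sketch. The regular expressions are copied from the table above. Storing them in a map keyed by the punctuation character, and joining the filters for one punctuation into a single alternation, are assumptions made for brevity rather than the CSpell design. The names EndingPuncFilterSketch, register, and isException are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Sketch of the exception filters; regexes are taken from the filter table.
// Assumption: all filters for one punctuation are OR-ed into one pattern.
class EndingPuncFilterSketch {
    static final Map<Character, Pattern> EXCEPTIONS = new HashMap<>();

    static {
        register('.', "(.*\\.s)", "((\\w*\\d\\.\\d\\w*)+)", "((\\D\\.)+\\D?)",
                 "(\\w*\\.-\\w*)", "(.*\\.['\"])");
        register('?', "(.*\\?['\"])");
        register('!', "(.*!['\"])");
        register(',', "(\\d+(,[\\d]{3})+)");
        register(':', "(\\d+:\\d+)");
        register('&', "[A-Z]+&[A-Z]+");
        register(')', "((\\S)*\\([+\\w]\\)(\\S)*)",
                 "((\\S)*\\([+\\w]+\\)-(\\S)*)",
                 "((\\S)*\\(\\d+\\)(\\S)*)");
        register(']', "(\\S*\\[\\d+[A-Z]\\]\\S*)", "(\\S*\\[[a-z]\\]\\S*)");
        // ';' and '}' have no exceptions, so they are not registered.
    }

    // Combine the filters of one ending punctuation into a single alternation.
    static void register(char punc, String... regexes) {
        EXCEPTIONS.put(punc, Pattern.compile(String.join("|", regexes)));
    }

    // True if the whole term matches an exception for the given punctuation,
    // meaning the term should not be split at that punctuation.
    static boolean isException(char punc, String term) {
        Pattern pattern = EXCEPTIONS.get(punc);
        return pattern != null && pattern.matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(isException('.', "16q22.1"));        // true
        System.out.println(isException('.', "down.please"));    // false
        System.out.println(isException(',', "1,234,567"));      // true
        System.out.println(isException(')', "Ca(2+)-ATPase"));  // true
        System.out.println(isException(';', ";syrinx"));        // false
    }
}
```

A predicate built from isException could be passed to a splitter so that terms such as 16q22.1 or AT&T are left intact while down.please is still split.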

  • Source Code: