CSpell

Ending Punctuation Splitter

Description:
This splitter splits a token by adding a space after ending punctuation when the token contains ending punctuation. Ending punctuation includes: .?!,:;&)]}

Features:
Splits a token after ending punctuation.

Examples:

File Name	Input	Output
10023.txt	down.please	down. please
10286.txt	...my	... my
10004.txt	cancer?if	cancer? if
11186.txt	?pls	? please
97.txt	suggestions?thanks	suggestions? thanks
53.txt	hello!can	hello! can
11186.txt	,she	, she
16823.txt	:by	: by
22.txt	;syrinx	; syrinx
2.txt	)why	) why

Implementation Logic:
- Recursively perform the following process:
- Convert the input word to a coreTerm by stripping off leading and ending punctuation, spaces, and digits.
- Check if the coreTerm contains ending punctuation. If yes:
  - Find the last ending punctuation.
  - Check if the coreTerm matches the exceptions of that ending punctuation. If not:
    - Add a space after the ending punctuation.
- Check if the prefix ends with ending punctuation. If yes:
  - Add a space after the ending punctuation.
- Check if the suffix contains ending punctuation. If yes:
  - Find the last ending punctuation.
  - Check if the suffix matches the exceptions of that ending punctuation. If not:
    - Add a space after the ending punctuation.
- Convert the updated coreTerm back to the output term if a split happened in the coreTerm, prefix, or suffix.
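
The steps above can be sketched in Java as follows. This is a minimal illustration, not the code in EndingPuncSplitter.java: the class name `EndingPuncSplitSketch` and all method names are hypothetical, the prefix and suffix are approximated as the leading and trailing runs of non-letter characters, the exception check is a stub that never fires, and a space is only added when something follows the punctuation.

```java
class EndingPuncSplitSketch {
    private static final String ENDING_PUNC = ".?!,:;&)]}";

    static boolean isEndingPunc(char c) {
        return ENDING_PUNC.indexOf(c) >= 0;
    }

    // Index of the last ending punctuation in s, or -1 if there is none.
    static int lastEndingPunc(String s) {
        for (int i = s.length() - 1; i >= 0; i--) {
            if (isEndingPunc(s.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    // Stub (assumption): the real exceptions are defined per ending punctuation.
    static boolean isException(String term, char punc) {
        return false;
    }

    static String addSpaceAfter(String s, int index) {
        return s.substring(0, index + 1) + " " + s.substring(index + 1);
    }

    // One pass: split the token into prefix, coreTerm, and suffix, then
    // add at most one space in each part.
    static String splitOnce(String token) {
        int start = 0;
        int end = token.length();
        while (start < end && !Character.isLetter(token.charAt(start))) {
            start++;
        }
        while (end > start && !Character.isLetter(token.charAt(end - 1))) {
            end--;
        }
        String prefix = token.substring(0, start);
        String core = token.substring(start, end);
        String suffix = token.substring(end);

        // coreTerm: split after the last ending punctuation unless it is an exception.
        int coreIdx = lastEndingPunc(core);
        if (coreIdx >= 0 && !isException(core, core.charAt(coreIdx))) {
            core = addSpaceAfter(core, coreIdx);
        }
        // prefix: split only if a coreTerm follows the ending punctuation.
        if (!prefix.isEmpty() && !core.isEmpty()
                && isEndingPunc(prefix.charAt(prefix.length() - 1))) {
            prefix = prefix + " ";
        }
        // suffix: split only if something follows the last ending punctuation.
        int sufIdx = lastEndingPunc(suffix);
        if (sufIdx >= 0 && sufIdx < suffix.length() - 1
                && !isException(suffix, suffix.charAt(sufIdx))) {
            suffix = addSpaceAfter(suffix, sufIdx);
        }
        return prefix + core + suffix;
    }

    // Recursively split each resulting piece until no more splits happen.
    static String split(String token) {
        String once = splitOnce(token);
        if (once.equals(token)) {
            return token;
        }
        StringBuilder out = new StringBuilder();
        for (String piece : once.split(" ")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(split(piece));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(split("down.please")); // down. please
        System.out.println(split("...my"));       // ... my
        System.out.println(split(")why"));        // ) why
    }
}
```

Recursion terminates because every piece produced by a split is shorter than the token it came from, and a token that does not change is returned as is.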

Notes:

Baseline source code: PreProcSentence.java
Enhancement:
- Does not use a dictionary
- Add ending punctuation of [:]
- Remove hard coded patterns of [NUM], [EMAIL], [URL]
- Remove leading punctuation of [/] and [-] to increase precision
- Implement exceptions separately for each ending punctuation
- Use coreTermObj to split to prefix, coreTerm, suffix
- Recursively split until there is no more split
The punctuation marks @ and * might qualify as ending punctuation; this needs further analysis.
Action: Redesigned and implemented

Apply the non-dictionary splitter model with matchers and filters, using regular expressions for each ending punctuation. They are described in the following tables:

Broader Generic Matchers
Matcher	Regular Expression	Examples
Contains Ending Punctuation	`^.*[\\.\\?!,;:&\\)\\]\\}].*$`
Email (false)	^[\\w!#$%&'*+-/=?^_`{|}~]+@(\\w+(\\.\\w+)*(\\.(gov|com|org|edu|mil|net)))$	abc@gmail.com !!@gamil.com abc@123.net
Url (false)	`^((ftp|http|https|file)://)?(\\w+(\\.\\w+)*(\\.(gov|com|org|edu|mil|net|uk)).*)$`	http://www.yahoo.com yahoo.com yahoo.com?test=1%20try%20abc
Pure digit or punctuation (false)	`^([\\W_\\d&&\\S]+)$`	123.500 12-35-00 12.35.00 !@#123$%^
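
One way to read the generic matchers is as a gate: a token is a candidate for splitting only if it contains ending punctuation and does not match any of the matchers marked (false). The Java sketch below illustrates that reading. The class and method names are hypothetical, and the patterns are written with `*` quantifiers so that every example listed in the table matches; both the gating logic and the quantifiers are assumptions, not confirmed details of EndingPunc.java.

```java
import java.util.regex.Pattern;

class GenericMatcherSketch {
    static final Pattern CONTAINS_ENDING_PUNC =
        Pattern.compile("^.*[\\.\\?!,;:&\\)\\]\\}].*$");
    static final Pattern EMAIL = Pattern.compile(
        "^[\\w!#$%&'*+-/=?^_`{|}~]+@(\\w+(\\.\\w+)*(\\.(gov|com|org|edu|mil|net)))$");
    static final Pattern URL = Pattern.compile(
        "^((ftp|http|https|file)://)?(\\w+(\\.\\w+)*(\\.(gov|com|org|edu|mil|net|uk)).*)$");
    // Java character class intersection: (non-word, underscore, or digit) and non-space.
    static final Pattern DIGIT_PUNC = Pattern.compile("^([\\W_\\d&&\\S]+)$");

    // True if the token contains ending punctuation and is not an
    // email address, a URL, or a pure digit/punctuation string.
    static boolean isSplitCandidate(String token) {
        return CONTAINS_ENDING_PUNC.matcher(token).matches()
            && !EMAIL.matcher(token).matches()
            && !URL.matcher(token).matches()
            && !DIGIT_PUNC.matcher(token).matches();
    }

    public static void main(String[] args) {
        String[] tokens = {"down.please", "abc@gmail.com", "yahoo.com", "123.500"};
        for (String token : tokens) {
            System.out.println(token + " -> " + isSplitCandidate(token));
        }
    }
}
```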

Filters (Specific Exceptions for Each Ending Punctuation)
Ending Punctuation	Filter (Exception)	Regular Expression	Examples
Period [.]	1. Plural form	`(.*\\.s)`	Dr.s Mr.s
	2. surrounded by digit [char][digit].[digit][char]	`((\\w\\d\\.\\d\\w)+)`	16q22.1 123.2 123.234.4567 1c3.2d4.4e6
	3. surrounded by single characters [single non-digit].[single non-digit]?	`((\\D\\.)+\\D?)`	D.C.A.B. D.C.A.B d.c.a. d.c.a D.c
	4. followed by a hyphen [word].-[word]	`(\\w\\.-\\w)`	St.-John 123.-John
	5. followed by a quote [char]*.['"]	`(.*\\.['\"])`	Mucinosis."
Question Mark [?]	1. followed by a quote [char]*?['"]	`(.*\\?['\"])`	ulcers?' ulcers?"
Exclamation Mark [!]	1. followed by a quote [char]*!['"]	`(.*!['\"])`	ulcers!' ulcers!"
Comma [,]	1. digit group separator [digit]+,[digit]{3}	`(\\d+(,[\\d]{3})+)`	12,345 1,234,567
Colon [:]	1. ratio [digit]+:[digit]+	`(\\d+:\\d+)`	1:2
Semicolon [;]	1. No exceptions found	`$^`	None
Ampersand [&]	1. Abbreviations [A-Z]+&[A-Z]+	`[A-Z]+&[A-Z]+`	AT&T R&D
Right Parenthesis [)]	1. single char surrounded by parenthesis [non-space]([+char])[non-space]	`((\\S)\\([+\\w]\\)(\\S))`	homocyst(e)ine NAD(P)H RS(3)PE D(+)HUS
	2. chars surrounded by parenthesis and followed by a hyphen [non-space](char+)-[non-space]	`((\\S)\\([+\\w]+\\)-(\\S))`	Ca(2+)-ATPase beta(2)-microglobulin (Si)-synthase (ADP)-ribose
	3. digit surrounded by parenthesis [non-space](digit+)[non-space]	`((\\S)\\(\\d+\\)(\\S))`	VO(2)max δ(18)O (123)I-mIBG (131)I
Right Square Bracket []]	1. [digit]+[Upper] surrounded by [] [non-space][[digit]+[Upper]][non-space]	`(\\S\\[\\d+[A-Z]\\]\\S)`	[11C]MeG [3H]-thymidine [3H]tyrosine
	2. [lower] surrounded by [] [Upper]+	`(\\S\\[[a-z]\\]\\S)`	benzo[a]pyrene B[e]P
Right Curly Brace [}]	1. No exceptions found	`$^`	None
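
The per-punctuation filters can be organized as a lookup from each ending punctuation to its list of exception patterns. The Java sketch below shows that idea for a subset of the table (period rules 1, 3, and 5, plus the question mark, exclamation mark, comma, colon, and ampersand rules). The class name, the map layout, and the use of a full match with `matches()` are assumptions for illustration; the source does not state how EndingPunc.java applies the patterns.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class EndingPuncFilterSketch {
    // Exception patterns keyed by ending punctuation (subset of the table above).
    static final Map<Character, List<Pattern>> EXCEPTIONS = Map.of(
        '.', List.of(
            Pattern.compile("(.*\\.s)"),          // plural form: Dr.s
            Pattern.compile("((\\D\\.)+\\D?)"),   // single characters: D.C.A.B.
            Pattern.compile("(.*\\.['\"])")),     // followed by a quote
        '?', List.of(Pattern.compile("(.*\\?['\"])")),
        '!', List.of(Pattern.compile("(.*!['\"])")),
        ',', List.of(Pattern.compile("(\\d+(,[\\d]{3})+)")), // digit group separator
        ':', List.of(Pattern.compile("(\\d+:\\d+)")),        // ratio
        '&', List.of(Pattern.compile("[A-Z]+&[A-Z]+")));     // abbreviations

    // True if the term fully matches any exception for the given punctuation.
    // Punctuation without exceptions (for example ';' and '}') returns false.
    static boolean isException(String term, char punc) {
        for (Pattern pattern : EXCEPTIONS.getOrDefault(punc, List.of())) {
            if (pattern.matcher(term).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isException("Dr.s", '.'));        // true
        System.out.println(isException("12,345", ','));      // true
        System.out.println(isException("down.please", '.')); // false
    }
}
```

Keeping the patterns in one map per punctuation mirrors the table layout, so adding an exception means adding one entry rather than changing the splitting logic.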

Source Code:
- EndingPunc.java
- EndingPuncSplitter.java