CSpell

Leading Punctuation Splitter

Description:
This splitter inserts a space before a leading punctuation mark when a token contains one. Leading punctuation includes: & ( [ {
Features:
Split a token in front of leading punctuation.

Examples:

File Name	Input	Output
12134.txt	doppler(	doppler (
12271.txt	1-plug&	1-plug &
12353.txt	epilepsy(	epilepsy (
12353.txt	volunteers(	volunteers (
12706.txt	dr.[	dr. [
18186.txt	test(	test (
18341.txt	vain(	vain (
2.txt	one(	one (
30.txt	folitrax(	folitrax (
50.txt	,[	, [
78.txt	genes[	genes [

Implementation Logic:
- Recursively perform the following process:
- Convert the input word to a coreTerm by stripping off leading punctuation, spaces, and digits.
- Check if the coreTerm contains leading punctuation. If yes:
  - Find the first leading punctuation
  - Check if the coreTerm matches the exceptions of the leading punctuation. If not:
    - Add a space before the leading punctuation
- Check if the prefix contains leading punctuation. If yes:
  - Find the first leading punctuation
  - Check if the prefix matches the exceptions of the leading punctuation. If not:
    - Add a space before the leading punctuation
- Check if the suffix starts with leading punctuation. If yes:
  - Add a space before the leading punctuation
- Convert the updated coreTerm back to the output term if a split happened in the coreTerm, prefix, or suffix.
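
The steps above can be sketched roughly as follows. This is a simplified illustration, not the code in LeadingPuncSplitter.java: it works on the whole token instead of the prefix/coreTerm/suffix decomposition, and the class name `LeadingPuncSplitSketch`, the method `split`, and the `isException` hook are hypothetical names introduced for this example.

```java
import java.util.function.BiPredicate;

public class LeadingPuncSplitSketch {
    // The four leading punctuation characters named in the description.
    private static final String LEADING_PUNC = "&([{";

    /**
     * Inserts a space before each leading punctuation that is glued to
     * preceding text. isException is a hypothetical hook standing in for
     * the per-punctuation exception filters.
     */
    public static String split(String token,
                               BiPredicate<Character, String> isException) {
        // Start at 1: punctuation at the very start needs no space before it.
        for (int i = 1; i < token.length(); i++) {
            char c = token.charAt(i);
            if (LEADING_PUNC.indexOf(c) >= 0 && token.charAt(i - 1) != ' ') {
                // First leading punctuation found: leave exceptions untouched.
                if (isException.test(c, token)) {
                    return token;
                }
                String head = token.substring(0, i);
                String tail = token.substring(i);
                // Recurse on the remainder until no more splits occur.
                return head + " " + split(tail, isException);
            }
        }
        return token;
    }

    public static void main(String[] args) {
        // Assumed exception for this demo only: abbreviations such as AT&T.
        BiPredicate<Character, String> abbreviation =
            (c, t) -> c == '&' && t.matches("^[A-Z]+&[A-Z]+$");
        for (String token : new String[] {"doppler(", "1-plug&", "dr.[", "AT&T"}) {
            System.out.println(token + " -> " + split(token, abbreviation));
        }
    }
}
```

With the demo exception above, `doppler(` becomes `doppler (` while `AT&T` is returned unchanged.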

Notes:

Baseline source code: PreProcSplit.java
Enhancement:
- Does not use a dictionary
- Add leading punctuation of [&]
- Remove leading punctuation of [/] and [-] to increase precision
- Implements exceptions separately for each leading punctuation
- Use coreTermObj to split to prefix, coreTerm, suffix
- Recursively split until there is no more split
The punctuation marks @ and * might qualify as leading punctuation; this needs further analysis.
Action: Redesigned and implemented

Apply the non-dictionary splitter model, using regular-expression matchers and filters for each leading punctuation. They are described in the following tables:

Broader Generic Matchers (Qualifiers)
Matcher	Regular Expression	Examples
Contains Leading Punctuation	`^.*[&\\(\\[\\{].*$`
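
A minimal sketch of how such a qualifier might be applied in Java. It assumes the expression has `.*` wildcards on both sides of the character class, so that it accepts any token containing one of & ( [ {. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class LeadingPuncMatcherSketch {
    // Assumed qualifier: any token that contains one of & ( [ {
    private static final Pattern LEADING_PUNC =
        Pattern.compile("^.*[&\\(\\[\\{].*$");

    /** Returns true if the token contains a leading-punctuation character. */
    public static boolean containsLeadingPunc(String token) {
        return LEADING_PUNC.matcher(token).matches();
    }

    public static void main(String[] args) {
        for (String token : new String[] {"doppler(", "genes[", "1-plug&", "plain"}) {
            System.out.println(token + " -> " + containsLeadingPunc(token));
        }
    }
}
```

Only tokens that pass this broad check need to be tested against the narrower exception filters.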

Filters (Specific Exceptions for Each Leading Punctuation)
Leading Punctuation	Filter (Exception)	Regular Expression	Examples
Ampersand [&]	1. Abbreviations [A-Z]+&[A-Z]+	`^[A-Z]+&[A-Z]+$`	AT&T R&D
Left Parenthesis [(]	1. contains digits or plus sign [non-space]([digit]+\+?)[non-space]	`((\\S)*\\([\\d]+(\\+)?\\)(\\S)*)`	RS(3)PE δ(18)O Ca(2+) Ca(2+)-ATPase
	2. max or min [non-space](max|min)[non-space]	`((\\S)*\\((max|min)\\))`	V(max) C(min)
	3. contains a single char or plus [non-space](+char)[non-space]	`((\\S)*\\([+\\w]\\)(\\S)*)`	D(+)HUS GABA(A) apolipoprotein(a) beta(1)s homocyst(e)ine
	4. parenthetic plural forms [word]+((s|es)|(y(ies)))	`([\\w]+((s\\(es\\))|(y\\(ies\\))))`	finger(s) fetus(es) extremity(ies)
	5. after a hyphen [non-space]-([non-space])	`((\\S)*-\\((\\S)*)`	poly-(ethylene poly-(ADP-ribose) C-(17:0) I-(alpha)
Left Square Bracket [[]	1. [ [lower] ] [non-space][[lower]][non-space]	`(\\S*\\[[a-z]\\]\\S*)`	benzo[a]pyrene B[e]P
Left Square Bracket [[]	2. leads with tilde or hyphen (tilde|hyphen)[	`([~\\-]\\[\\S*)`	-[NAME] ~[NAME]
Left Curly Brace [{]	1. No exceptions found	`$^`	None
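
The filters can be pictured as a lookup from each leading punctuation to its exception patterns. The sketch below is an illustration under assumptions, not the code in LeadingPunc.java: the patterns are written with escaped parentheses and `*` quantifiers as they are assumed to read, each pattern must match the whole token, and the names `LeadingPuncFilterSketch`, `EXCEPTIONS`, and `isException` are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class LeadingPuncFilterSketch {
    // Assumed exception patterns, keyed by leading punctuation.
    // '{' has no entry because no exceptions are listed for it.
    private static final Map<Character, List<Pattern>> EXCEPTIONS = Map.of(
        '&', List.of(
            Pattern.compile("^[A-Z]+&[A-Z]+$")),             // AT&T, R&D
        '(', List.of(
            Pattern.compile("(\\S)*\\([\\d]+(\\+)?\\)(\\S)*"),    // Ca(2+)
            Pattern.compile("(\\S)*\\((max|min)\\)"),             // V(max)
            Pattern.compile("(\\S)*\\([+\\w]\\)(\\S)*"),          // GABA(A)
            Pattern.compile("[\\w]+((s\\(es\\))|(y\\(ies\\)))"),  // fetus(es)
            Pattern.compile("(\\S)*-\\((\\S)*")),                 // poly-(ethylene
        '[', List.of(
            Pattern.compile("\\S*\\[[a-z]\\]\\S*"),          // benzo[a]pyrene
            Pattern.compile("[~\\-]\\[\\S*")));              // -[NAME], ~[NAME]

    /** Returns true if the token matches any exception for the punctuation. */
    public static boolean isException(char punc, String token) {
        for (Pattern p : EXCEPTIONS.getOrDefault(punc, List.of())) {
            if (p.matcher(token).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isException('&', "AT&T"));
        System.out.println(isException('(', "doppler("));
    }
}
```

Keeping one pattern list per punctuation mirrors the note that exceptions are implemented separately for each leading punctuation, so a new exception for one character cannot affect the others.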

Source Code:
- LeadingPunc.java
- LeadingPuncSplitter.java