Text Categorization

Word Tokenizer Algorithm (Java)

The Word Tokenizer tokenizes the TI (title) and AB (abstract) fields of citations and filters out unwanted words and characters. The algorithm used in the Java version is slightly different from the Lisp version. Please see TI report and AB report for details.

The procedures and criteria are described as follows:

Remove matched string (case sensitive)

Beginning string	Ending string	References
[correction	]	?
(abstracts were not	included)	0289-10695616
[J. Neuroimmunol. 104,	85-91]	?
(abstracts presented at recent scientific meetings	package inserts)	0306-11261533
(Japanese Association of Intellectual Copyright	#130,591)	0306-11276498
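
The rule above can be sketched as follows. This is a minimal illustration, not the project's actual code: the class name SpanRemover and the method removeSpan are hypothetical, and it assumes that only the first span running from the beginning string through the ending string is removed.

```java
final class SpanRemover {
    // Removes the first span that starts with `begin` and ends with `end`.
    // Matching is case sensitive. Hypothetical helper, not from the original tool.
    static String removeSpan(String text, String begin, String end) {
        int start = text.indexOf(begin);
        if (start < 0) {
            return text; // beginning string not present
        }
        int stop = text.indexOf(end, start + begin.length());
        if (stop < 0) {
            return text; // no ending string after the beginning string
        }
        return text.substring(0, start) + text.substring(stop + end.length());
    }

    public static void main(String[] args) {
        System.out.println(removeSpan("A [correction x] B", "[correction", "]"));
    }
}
```

Because the match is case sensitive, a span that starts with "[Correction" would be left untouched by this sketch.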

Remove matched ending string (case sensitive)

Beginning string	References
CopyrightCopyright	?
Copyright Copyright	?
Copyright	?
.Copyright	?
)Copyright	?
(abstract	?
(ABSTRACT	?
? Copyright	?
) Copyright	?
Copyright 2001 Wiley-Liss, Inc.	0310-11391771
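
One possible reading of this rule is that everything from a listed beginning string to the end of the field is dropped. The sketch below follows that reading. The names TailRemover and removeTail are hypothetical, and cutting at the earliest match is an assumption, since the table does not say which occurrence is used.

```java
import java.util.List;

final class TailRemover {
    // Cuts the text at the earliest occurrence of any marker (case sensitive).
    // Assumption: the marker and everything after it is discarded.
    static String removeTail(String text, List<String> markers) {
        int cut = text.length();
        for (String marker : markers) {
            int i = text.indexOf(marker);
            if (i >= 0 && i < cut) {
                cut = i;
            }
        }
        return text.substring(0, cut);
    }

    public static void main(String[] args) {
        List<String> markers = List.of("Copyright", "(abstract", "(ABSTRACT");
        System.out.println(removeTail("We studied X. Copyright 2001 Wiley-Liss, Inc.", markers));
    }
}
```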

Remove matched ending string (case insensitive)

Beginning string	Ending string	Exceptions	References
[	]	[	?
[	.]	[These syndromes can be a contributory	0408-10199143
[published erratum	]	None	?
[forensic science international	]	None	?
(abstract truncated	)	None	?
(published erratum	)	None	?
(comments	)]	None	?
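
A hedged sketch of this rule follows. The names TrailingSpanRemover and removeTrailingSpan are hypothetical. Two assumptions are made that the table does not confirm: the span is removed only when the field ends with the ending string, and an exception applies when the whole field begins with the exception text.

```java
final class TrailingSpanRemover {
    // Removes a trailing span "begin ... end", ignoring case.
    // Assumption: removal is skipped when the field starts with `exception`.
    static String removeTrailingSpan(String text, String begin, String end, String exception) {
        if (exception != null
                && text.regionMatches(true, 0, exception, 0, exception.length())) {
            return text; // exception applies, keep the field as is
        }
        int endOffset = text.length() - end.length();
        if (endOffset < 0 || !text.regionMatches(true, endOffset, end, 0, end.length())) {
            return text; // field does not end with the ending string
        }
        // Search backwards for the last beginning string before the ending string.
        for (int i = endOffset - begin.length(); i >= 0; i--) {
            if (text.regionMatches(true, i, begin, 0, begin.length())) {
                return text.substring(0, i);
            }
        }
        return text;
    }

    public static void main(String[] args) {
        System.out.println(removeTrailingSpan(
            "Some finding. (ABSTRACT TRUNCATED AT 250 WORDS)", "(abstract truncated", ")", null));
    }
}
```

String.regionMatches with ignoreCase set to true is used here instead of lowercasing both strings, so the offsets always refer to the original text.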

Remove exact matched ending string (case insensitive)

Match string	References
[see comments]	?
(see comments)	?
[seecomments]	?
[ see comments]	?
[in process citation]	?
(in process citation)	?
[corrected]	?
[correction of artistic]	?
(letter)	?
(letter)]	?
(editorial)]	?
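
The exact-match rule is simple enough to sketch directly. The class ExactSuffixRemover and method removeExactSuffix are hypothetical names; the suffix list is copied from the table above, and the sketch assumes at most one suffix is removed per call.

```java
import java.util.List;

final class ExactSuffixRemover {
    // Match strings from the table above.
    static final List<String> SUFFIXES = List.of(
        "[see comments]", "(see comments)", "[seecomments]", "[ see comments]",
        "[in process citation]", "(in process citation)", "[corrected]",
        "[correction of artistic]", "(letter)", "(letter)]", "(editorial)]");

    // Strips the first listed suffix that the text ends with, ignoring case.
    static String removeExactSuffix(String text) {
        for (String suffix : SUFFIXES) {
            int offset = text.length() - suffix.length();
            if (offset >= 0 && text.regionMatches(true, offset, suffix, 0, suffix.length())) {
                return text.substring(0, offset);
            }
        }
        return text;
    }

    public static void main(String[] args) {
        System.out.println(removeExactSuffix("Aspirin and stroke [See Comments]"));
    }
}
```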

Remove [title]
Remove non-alphanumeric characters (leading and trailing) from all words
Expand contractions
Replace punctuation with a space
Remove words with fewer than 3 characters
Remove words that begin with a digit
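
The word-level steps can be sketched as below. This is an illustration under stated assumptions, not the original implementation: the names WordFilter and tokenize are hypothetical, the contraction table is a made-up three-entry sample because the source does not list the real one, and the "[title]" step is left out because it is not defined here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class WordFilter {
    // Hypothetical sample table; the real contraction list is not given.
    static final Map<String, String> CONTRACTIONS = Map.of(
        "can't", "cannot",
        "don't", "do not",
        "it's", "it is");

    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            // Strip non-alphanumeric characters from both ends of the word.
            String word = raw.replaceAll("^[^\\p{Alnum}]+|[^\\p{Alnum}]+$", "");
            // Expand a contraction if the lowercased word is in the table.
            word = CONTRACTIONS.getOrDefault(word.toLowerCase(Locale.ROOT), word);
            // Replace remaining punctuation with spaces, then split again.
            for (String part : word.replaceAll("\\p{Punct}", " ").split("\\s+")) {
                if (part.length() < 3) {
                    continue; // fewer than 3 characters
                }
                if (Character.isDigit(part.charAt(0))) {
                    continue; // begins with a digit
                }
                words.add(part);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("It's a beta-blocker (atenolol); 24h dose, don't use 5mg."));
    }
}
```

In this sketch the length and digit filters run last, so the short pieces produced by expanding a contraction (such as "do" from "don't") are dropped along with other short words.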