Word Tokenizer Algorithm (Java)
The Word Tokenizer tokenizes and filters words and characters in the TI (title) and AB (abstract) fields of citations. The algorithm used in the Java version differs slightly from the Lisp version. Please see TI report and AB report for details.
The procedures and criteria are described as follows:
- Remove matched string (case sensitive)
Beginning string | Ending string | References
[correction | ] | ?
(abstracts were not | included) | 0289-10695616
[J. Neuroimmunol. 104, | 85-91] | ?
(abstracts presented at recent scientific meetings | package inserts) | 0306-11261533
(Japanese Association of Intellectual Copyright | #130,591) | 0306-11276498
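The rule above can be sketched in Java as follows. This is a minimal illustration, not the tokenizer's actual code: the class name `SpanRemover` and the method `removeSpan` are hypothetical, and the sketch assumes that only the first matching span is removed and that the text is left unchanged when either delimiter is missing.

```java
final class SpanRemover {

    /**
     * Removes the first span that starts with {@code begin} and ends with
     * {@code end}. Matching is case sensitive. Hypothetical helper: the
     * first-match-only behavior is an assumption.
     */
    static String removeSpan(String text, String begin, String end) {
        int start = text.indexOf(begin);
        if (start < 0) {
            return text; // beginning string not found
        }
        int stop = text.indexOf(end, start + begin.length());
        if (stop < 0) {
            return text; // ending string not found after the beginning string
        }
        // Keep everything before the span and everything after it.
        return text.substring(0, start) + text.substring(stop + end.length());
    }

    public static void main(String[] args) {
        String title = "Effects of aspirin [correction of asprin] on platelets";
        System.out.println(removeSpan(title, "[correction", "]"));
    }
}
```

The whitespace on either side of the removed span is kept, so the later punctuation and word-filtering steps are assumed to absorb any doubled spaces.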
- remove matched ending string (case sensitive)
Beginning string | References
CopyrightCopyright | ?
Copyright Copyright | ?
Copyright | ?
.Copyright | ?
)Copyright | ?
(abstract | ?
? Copyright | ?
) Copyright | ?
Copyright 2001 Wiley-Liss, Inc. | 0310-11391771
- remove matched ending string (case insensitive)
Beginning string | Ending string | Exceptions | References
[ | ] | [ | ?
[ | .] | [These syndromes can be a contributory | 0408-10199143
[published erratum | ] | None | ?
[forensic science international | ] | None | ?
(abstract truncated | ) | None | ?
(published erratum | ) | None | ?
(comments | )] | None | ?
- remove exact matched ending string (case insensitive)
Match string | References
[see comments] | ?
(see comments) | ?
[seecomments] | ?
[ see comments] | ?
[in process citation] | ?
(in process citation) | ?
[corrected] | ?
[correction of artistic] | ?
(letter) | ?
(letter)] | ?
(editorial)] | ?
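A case-insensitive match at the end of the text might be implemented along these lines. The class name `EndingStripper`, the method `stripEnding`, and the shortened list of endings are assumptions made for illustration; the real tokenizer may trim whitespace or order its checks differently.

```java
import java.util.List;

final class EndingStripper {

    // A subset of the match strings from the table above, for illustration only.
    static final List<String> ENDINGS = List.of(
            "[see comments]",
            "(see comments)",
            "[in process citation]",
            "(letter)");

    /**
     * Removes one of the known ending strings if the text ends with it,
     * ignoring case. Hypothetical helper, not the tokenizer's actual API.
     */
    static String stripEnding(String text) {
        for (String ending : ENDINGS) {
            int offset = text.length() - ending.length();
            // regionMatches returns false for a negative offset, so short
            // texts are handled without an explicit length check.
            if (text.regionMatches(true, offset, ending, 0, ending.length())) {
                return text.substring(0, offset);
            }
        }
        return text;
    }

    public static void main(String[] args) {
        System.out.println(stripEnding("Aspirin and stroke. [See Comments]"));
    }
}
```

Using `regionMatches` with `ignoreCase` set to `true` avoids lowercasing the whole text, which keeps character offsets aligned with the original string.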
- remove [title]
- remove non-alphanumeric characters from the beginning and end of every word
- expand contractions
- replace punctuation with a space
- remove words with fewer than 3 characters
- remove words that begin with a digit
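The word-level steps at the end of the list could be sketched as below. This is only an approximation under stated assumptions: the class `WordFilter` and method `filterWords` are hypothetical, words are assumed to be separated by whitespace, "alphanumeric" and "punctuation" are taken to mean the ASCII classes `\p{Alnum}` and `\p{Punct}`, and the `[title]` removal and contraction expansion steps are omitted because their details are not given here.

```java
import java.util.ArrayList;
import java.util.List;

final class WordFilter {

    /**
     * Applies the trimming, punctuation, length, and leading-digit rules
     * to each whitespace-separated word. Hypothetical sketch only.
     */
    static List<String> filterWords(String text) {
        List<String> kept = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            // Strip non-alphanumeric characters from both ends of the word.
            String word = raw.replaceAll("^[^\\p{Alnum}]+|[^\\p{Alnum}]+$", "");
            // Replace remaining punctuation with a space; this may split the word.
            for (String part : word.replaceAll("\\p{Punct}", " ").split("\\s+")) {
                if (part.length() < 3) {
                    continue; // fewer than 3 characters (also skips empty strings)
                }
                if (Character.isDigit(part.charAt(0))) {
                    continue; // begins with a digit
                }
                kept.add(part);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(filterWords("The 5-HT receptor (in vitro), 3rd study of IL-2."));
    }
}
```

Note that replacing punctuation with a space before the length check means hyphenated terms such as "IL-2" are split into short pieces and dropped entirely in this sketch; whether the actual tokenizer behaves the same way depends on the order in which it applies these steps.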