Text Categorization

Word Tokenizer Algorithm (Java)

The Word Tokenizer tokenizes the TI (title) and AB (abstract) fields of citations and filters out unwanted words and characters. The algorithm used in the Java version is slightly different from the one used in the Lisp version. Please see the TI report and AB report for details.

The procedures and criteria are described as follows:

  • Remove matched string (case sensitive)
    Beginning string | Ending string | References
    [correction] | ?
    (abstracts were not | included) | 0289-10695616
    [J. Neuroimmunol. 104,85-91] | ?
    (abstracts presented at recent scientific meetings | package inserts) | 0306-11261533
    (Japanese Association of Intellectual Copyright | #130,591) | 0306-11276498

  • Remove matched ending string (case sensitive)
    Beginning string | References
    CopyrightCopyright | ?
    Copyright Copyright | ?
    Copyright | ?
    .Copyright | ?
    )Copyright | ?
    (abstract | ?
    (ABSTRACT | ?
    ? Copyright | ?
    ) Copyright | ?
    Copyright 2001 Wiley-Liss, Inc. | 0310-11391771

  • Remove matched ending string (case insensitive)
    Beginning string | Ending string | Exceptions | References
    [ | ] | [ | ?
    [.] | [These syndromes can be a contributory | 0408-10199143
    [published erratum] | None | ?
    [forensic science international] | None | ?
    (abstract truncated) | None | ?
    (published erratum) | None | ?
    (comments )] | None | ?

  • Remove exact matched ending string (case insensitive)
    Match string | References
    [see comments] | ?
    (see comments) | ?
    [seecomments] | ?
    [ see comments] | ?
    [in process citation] | ?
    (in process citation) | ?
    [corrected] | ?
    [correction of artistic] | ?
    (letter) | ?
    (letter)] | ?
    (editorial)] | ?

  • Remove [title]

  • Remove non-alphanumeric characters from the beginning and end of every word

  • Expand contractions

  • Replace punctuation with spaces

  • Remove words with fewer than 3 characters

  • Remove words that begin with a digit
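
The exact-match ending rule and the word-level steps above can be sketched in Java roughly as follows. This is only an illustrative sketch, not the actual Word Tokenizer source: the class name WordTokenizerSketch, its method names, the short EXACT_ENDINGS list (a subset of the match strings listed above), and the CONTRACTIONS table are all assumptions made for the example, since the page does not list the contraction table or show the real code. The begin/end string tables and the [title] step are left out.

```java
import java.util.ArrayList;
import java.util.List;

class WordTokenizerSketch {

    // Assumed subset of the exact-match ending strings listed above.
    private static final String[] EXACT_ENDINGS = {
        "[see comments]", "(see comments)", "[in process citation]", "(letter)"
    };

    // Hypothetical contraction table: the source does not list its entries.
    // Specific forms come first so that "can't" is not handled by "n't".
    private static final String[][] CONTRACTIONS = {
        {"can't", "cannot"}, {"won't", "will not"}, {"n't", " not"}
    };

    /** Case-insensitive suffix test; false when the suffix is longer than s. */
    static boolean endsWithIgnoreCase(String s, String suffix) {
        return s.regionMatches(true, s.length() - suffix.length(),
                               suffix, 0, suffix.length());
    }

    /** Removes one trailing match string, compared case-insensitively. */
    static String stripExactEnding(String text) {
        String trimmed = text.trim();
        for (String ending : EXACT_ENDINGS) {
            if (endsWithIgnoreCase(trimmed, ending)) {
                return trimmed.substring(0, trimmed.length() - ending.length()).trim();
            }
        }
        return trimmed;
    }

    /** Expands a trailing contraction using the hypothetical table above. */
    static String expandContraction(String word) {
        for (String[] pair : CONTRACTIONS) {
            if (endsWithIgnoreCase(word, pair[0])) {
                return word.substring(0, word.length() - pair[0].length()) + pair[1];
            }
        }
        return word;
    }

    /** Applies the word-level steps in the order given in the list above. */
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        for (String raw : stripExactEnding(text).split("\\s+")) {
            // Strip non-alphanumeric characters from both ends of the word.
            String word = raw.replaceAll("^[^\\p{Alnum}]+|[^\\p{Alnum}]+$", "");
            word = expandContraction(word);
            // Replace remaining punctuation with spaces; this may split the word.
            for (String part : word.replaceAll("\\p{Punct}", " ").split("\\s+")) {
                if (part.length() < 3) {
                    continue; // drop words with fewer than 3 characters (and empties)
                }
                if (Character.isDigit(part.charAt(0))) {
                    continue; // drop words that begin with a digit
                }
                words.add(part);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String title = "Aspirin doesn't cure 2nd-degree burns, (in vivo). [see comments]";
        // Prints: [Aspirin, does, not, cure, degree, burns, vivo]
        System.out.println(tokenize(title));
    }
}
```

In this sketch the filters run after punctuation is replaced, so a hyphenated token such as "2nd-degree" is split first and only the part that begins with a digit is dropped. Whether the real tokenizer behaves the same way on such tokens is not stated on this page.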