Text Categorization

Word Tokenizer Rules (requirements)

The Word Tokenizer tokenizes the TI (title) and AB (abstract) fields of citations and filters out unwanted words and characters. The following rules and requirements were derived from the training set of 2004 data.

  • "Copyright" related patterns
    Pattern | Case-sensitive | Action | Examples | Exceptions
    ..."CopyrightCopyright.." | Yes | remove "CopyrightCopyright.." | 0284-10559555 | None
    ..."Copyright Copyright.." | Yes | remove "Copyright Copyright.." | 0285-10567770 | None
    ..."Copyright Crown copyright..." | Yes | remove "Copyright Crown copyright..." | 0294-10878556 | None
    ..." Copyright.." | Yes | remove " Copyright.." | 0296-10927060, 0319-11751007 | listed below
      Exceptions:
        ..."( Copyright)"
        ..."with Copyright Clostridium perfringens type D epsilon toxin."... 0409-10208737
        ..." Copyright Demodex canis mites and Leishmania spp." 0409-10213670
        ..." Copyright Clostridium perfringens." 0415-10486158
        ..." Copyright Act, may significantly affect" 0416-10518204
    ...".Copyright.." | Yes | remove ".Copyright.." | | None
    ...")Copyright.." | Yes | remove ")Copyright.." | | None
    ..."? Copyright.." | Yes | remove "? Copyright.." | | None
    ...") Copyright.." | Yes | remove ") Copyright.." | | None
    ..."(Japanese Association of Intellectual Copyright #130,591)"... | Yes | remove "(Japanese Association of Intellectual Copyright #130,591)" | 0306-11276498 | None
    ..."Copyright 2001 Wiley-Liss, Inc." | Yes | remove "Copyright 2001 Wiley-Liss, Inc." | 0310-11391771 | None
    ..."Copyright -Copyright 2000 John Wiley & Sons, Ltd." | Yes | remove "Copyright -Copyright 2000 John Wiley & Sons, Ltd." | 0301-11114061 | None
    ..."GRASPCopyright workload system"... | Yes | do nothing | 0299-11049704 | None
    ..."PortfolioCopyright, a tool"... | Yes | do nothing | 0298-10995616 | None
    ..."(Copyright P<0.001)"... | Yes | do nothing | 0411-10373290 | None

  • Remove "Crown Copyright Copyright.."
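The copyright rules above could be approximated as follows. This is only a sketch under stated assumptions: it treats "remove ..Copyright.." as dropping everything from the notice to the end of the field, it reads the "Case" column as case-sensitive matching, and it covers the general rows only, not the citation-specific ones such as the "(Japanese Association ...)" parenthetical. The names `CANDIDATE`, `EXCEPTIONS`, `strip_copyright`, and `_is_exception` are hypothetical.

```python
import re

# Candidate notice: "Copyright" that is not glued to a preceding letter, digit
# or "(" (so "GRASPCopyright" and "(Copyright P<0.001)" are left alone) and is
# not followed by a lowercase letter (so "Copyrighted" is left alone).
CANDIDATE = re.compile(r"(?<![A-Za-z0-9(])Copyright(?![a-z])")

# Literal contexts listed in the table as exceptions to the " Copyright.." rule.
EXCEPTIONS = (
    "( Copyright)",
    " Copyright Clostridium perfringens",
    " Copyright Demodex canis",
    " Copyright Act",
)


def strip_copyright(text: str) -> str:
    """Truncate text at the first copyright notice that is not an exception."""
    for match in CANDIDATE.finditer(text):
        start = match.start()
        if any(_is_exception(text, start, exc) for exc in EXCEPTIONS):
            continue
        # Assumption: a leading "Crown " is removed together with the notice.
        if text[:start].endswith("Crown "):
            start -= len("Crown ")
        return text[:start].rstrip()
    return text


def _is_exception(text: str, start: int, exc: str) -> bool:
    # Offset of the word "Copyright" inside the exception string.
    offset = exc.index("Copyright")
    return start >= offset and text.startswith(exc, start - offset)
```

For instance, `strip_copyright("Results were significant. Copyright 2001 Wiley-Liss, Inc.")` keeps only the sentence before the notice, while a text containing "the Copyright Act" is returned unchanged.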

  • "ABSTRACT" related patterns
    Pattern | Case-sensitive | Action | Examples | Exceptions
    ..."(abstract.." | Yes | remove "(abstract.." | 0314-11530280 | listed below
      Exceptions:
        ..."(abstract)"... 0291-10775709
        ..."(abstract 483)"... 0304-11167089
        ..."(abstract proverbs)"... 0294-10877433
        ..."(abstract or embedded principle method)"... 0313-11504010
        ..."(abstract geometric figures vs. photographs of children)"... 0315-11575901
        ..."(abstract/concrete judgement)"... 0315-11580903
        ..."(abstracts)"... 0408-10200860
    ..."(ABSTRACT.." | Yes | remove "(ABSTRACT.." | 0287-10654982, 0300-11074393, 0303-11062606, 0304-11158495, 0310-11389298, 0411-10368776, 0412-10416202 | None
    ..."(abstract truncated..)" | No | remove "(abstract truncated..)" | | None
    ..."(abstracts were not included)"... | No | remove "(abstracts were not included)" | 0289-10695616 | None
    ..."(abstracts presented at recent scientific meetings, manufacturers' package inserts)"... | Yes | remove "(abstracts presented at recent scientific meetings, manufacturers' package inserts)" | 0306-11261533 | None

  • Remove "[abstract corrected]"
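One way to encode the "ABSTRACT" rows is as data rather than as hand-written conditions. The sketch below assumes that "(abstract.." means a parenthetical running from that prefix to the next ")", that the "Case" column means case-sensitive matching, and that the parentheticals listed under the "(abstract.." row are exceptions to keep. `RemovalRule`, `ABSTRACT_RULES`, and `apply_rules` are hypothetical names.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RemovalRule:
    """One table row; all field names here are hypothetical."""
    prefix: str               # literal text the parenthetical starts with
    case_sensitive: bool      # assumed meaning of the "Case" column
    exceptions: tuple = ()    # exact parentheticals that must be kept

    def apply(self, text: str) -> str:
        flags = 0 if self.case_sensitive else re.IGNORECASE
        # Match from the prefix up to the next closing parenthesis.
        pattern = re.compile(re.escape(self.prefix) + r"[^)]*\)", flags)
        keep = set(self.exceptions)
        return pattern.sub(
            lambda m: m.group(0) if m.group(0) in keep else "", text
        )


ABSTRACT_RULES = (
    RemovalRule("(ABSTRACT", True),
    RemovalRule("(abstract", True, exceptions=(
        "(abstract)",
        "(abstract 483)",
        "(abstract proverbs)",
        "(abstract or embedded principle method)",
        "(abstract geometric figures vs. photographs of children)",
        "(abstract/concrete judgement)",
        "(abstracts)",
    )),
)


def apply_rules(text: str, rules=ABSTRACT_RULES) -> str:
    """Apply each rule in order and return the filtered text."""
    for rule in rules:
        text = rule.apply(text)
    return text
```

Keeping the exceptions next to the rule they belong to mirrors the table layout, so a new row can be added without touching the matching logic.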

  • "comments" related patterns
    Pattern | Case-sensitive | Action | Example | Exceptions
    ..."[see comments]" | No | remove "[see comments]" | | None
    ..."(see comments)" | No | remove "(see comments)" | | None
    ..."[seecomments]" | No | remove "[seecomments]" | | None
    ..."[ see comments]" | No | remove "[ see comments]" | | None
    ..."(comments..)]" | No | remove "(comments..)]" | | None
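The "comments" variants differ only in bracket type and spacing, so a single whitespace-tolerant, case-insensitive expression can cover them. This is a rough sketch, not the tokenizer's actual implementation; `SEE_COMMENTS`, `TRAILING_COMMENTS`, and `strip_comment_markers` are hypothetical names, and the expression is slightly more permissive than the table (it also accepts mismatched brackets such as "[see comments)").

```python
import re

# "[see comments]", "(see comments)", "[seecomments]", "[ see comments]"
SEE_COMMENTS = re.compile(r"[\[(]\s*see\s*comments\s*[\])]", re.IGNORECASE)

# "(comments..)]": a parenthetical starting with "comments", closed by ")]"
TRAILING_COMMENTS = re.compile(r"\(comments[^)]*\)\]", re.IGNORECASE)


def strip_comment_markers(text: str) -> str:
    """Remove the comment markers listed in the table, ignoring case."""
    text = SEE_COMMENTS.sub("", text)
    return TRAILING_COMMENTS.sub("", text)
```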

  • Other patterns
    Pattern | Case-sensitive | Action | Examples | Exceptions
    ..."[correction..]"... | No | remove "[correction..]" | 0284-10535671, 10550444; 0285-10583726, 10593612 | None
    ..."[Key words:"... | No | remove "[key words:" | 0290-10751293 | None
    ..."[..]" | No | remove "[..]" | listed below | listed below
      Examples:
        remove ..."[Diabetologia..]" 0284-10550410
        remove ..."[See editorial..]" 0285-10570444
        remove ..."[Editorial comment..]" 0306-11283830, 11283831
        remove ..."[Originally published in..]" 0306-11276057, 11276063
        remove ..."[see text]" 0290-10753871
        remove ..."[formula: see text]" 0287-10629949
        remove ..."[figure: see text]" 0292-10800657
        remove ..."[not readable: see text]" 0302-11142883, 11142885
        remove ..."[reaction: see text]" 0314-11556387
        remove ..."[structures: see text]" 0295-10907885
        remove ..."[structure: see text]" 0298-10998489
        remove ..."[table: see text]" 0303-11190348
        remove ..."[Table: see text]" 0319-11753520
        remove ..."[The sequence data described in this paper have been submitted to..]" 0288-10673275
        remove ..."[This abstract has been prepared centrally.]" 0292-10796470, 10796496, 10796629
        remove ..."[Translations are provided in the International Abstracts..]" 0407-10052387, 10052388
      Exceptions:
        [..."[..]"
        ..."[l: atlantoaxial fixation, biomechanics, cervical spine, instability, spinal instrumentation, transarticular screws]" 0303-11074673
    ..."[.. .]" | No | remove "[.. .]" | | listed below
      Exceptions:
        ..."[The OTN PCR used in conjunction with CB18-CH-PK or IMS could be effectively used as a diagnostic and/or screening test for the detection of M. bovis in milk from herds with bovine tuberculosis.]" 0307-11289205
        ..."[These syndromes can be a contributory causes of insulin resistance in a subpopulation with NIDDM.]" 0408-10199143
    "["..."]" | No | remove "[" and "]" | | None
    ..."[published erratum..]" | No | remove "[published erratum..]" | | None
    ..."[forensic science international..]" | No | remove "[forensic science international..]" | | None
    ..."(published erratum..)" | No | remove "(published erratum..)" | | None
    ..."[in process citation]" | No | remove "[in process citation]" | | None
    ..."(in process citation)" | No | remove "(in process citation)" | | None
    ..."[corrected]" | No | remove "[corrected]" | | None
    ..."[correction of artistic]" | No | remove "[correction of artistic]" | | None
    ..."(letter)" | No | remove "(letter)" | | None
    ..."(letter)]" | No | remove "(letter)]" | | None
    ..."(editorial)]" | No | remove "(editorial)]" | | None
    ..."2-[substituted acetyl]-amino-5-alkyl-1,3,4-thiadiazoles" | No | do nothing | 0404-9868551 | None
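The bracket rules combine three behaviours: drop whole editorial notes, strip only the "[" and "]" characters from other bracketed text, and leave chemical names alone. The following sketch shows one possible way to separate them. It covers only a subset of the rows, and the heuristic that a bracket enclosed by hyphens belongs to a chemical name is an assumption, not something stated in the table. `NOTE`, `PLAIN_BRACKETS`, and `clean_brackets` are hypothetical names.

```python
import re

# Bracketed editorial notes that are removed as a whole (subset of the table).
# The optional "[a-z ]+: " part covers "[table: see text]" and similar rows.
NOTE = re.compile(
    r"\[\s*(?:[a-z ]+: )?(?:see text|see editorial|editorial comment|"
    r"originally published in|published erratum|in process citation|"
    r"corrected|correction)[^\]]*\]",
    re.IGNORECASE,
)

# Any other bracket pair, unless it has a hyphen directly before "[" or
# directly after "]" (assumed here to indicate a chemical name).
PLAIN_BRACKETS = re.compile(r"(?<!-)\[([^\]]*)\](?!-)")


def clean_brackets(text: str) -> str:
    """Drop editorial notes, then strip the remaining plain brackets."""
    text = NOTE.sub("", text)
    return PLAIN_BRACKETS.sub(r"\1", text)
```

With this ordering, "[table: see text]" disappears entirely, "[in vitro]" becomes "in vitro", and "2-[substituted acetyl]-amino-5-alkyl-1,3,4-thiadiazoles" is left untouched.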

  • Individual cases
    Pattern | Case-sensitive | Action | Examples
    ..."[J. Neuroimmunol. 104, 85-91]"... | Yes | remove "[J. Neuroimmunol. 104, 85-91]" | 0295-10900360

  • Trim the string

  • Convert to lowercase

  • Remove non-alphanumeric characters from the beginning and end of every word

  • Expand contractions
    Pattern | Case-sensitive | Action | Examples
    ..."didn't"... | No | expand to "did not" | 0404-9875250
    ..."don't"... | No | expand to "do not" | 0290-10738817, 0294-10859841, 0298-11019399, 0312-11467217
    ..."who'd"... | No | expand to "who would" | 0311-11425141
    ..."can't"... | No | expand to "cannot" | 0316-11673724
    ..."won't"... | No | expand to "will not" | 0314-11519969
    ..."wouldn't"... | No | expand to "would not" | 0405-9892548
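The contraction table maps directly onto a lookup dictionary. The sketch below is one possible implementation, assuming plain ASCII apostrophes and case-insensitive matching (the "No" in the "Case" column); `CONTRACTIONS` and `expand_contractions` are hypothetical names. The expansion is always emitted in lowercase, which is harmless if lowercasing is applied anyway.

```python
import re

# Expansions taken from the table above.
CONTRACTIONS = {
    "didn't": "did not",
    "don't": "do not",
    "who'd": "who would",
    "can't": "cannot",
    "won't": "will not",
    "wouldn't": "would not",
}

# One alternation over all contractions, matched on word boundaries.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def expand_contractions(text: str) -> str:
    """Replace each known contraction with its expanded form."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(1).lower()], text)
```

This step has to run before punctuation is replaced with spaces, because the apostrophe is what identifies a contraction.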

  • Replace punctuation with a space

  • Remove words with fewer than 3 characters

  • Remove words that begin with a digit
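The normalization steps from "Trim string" onward can be chained in the order listed. The following is a minimal sketch of that chain, not the tokenizer's actual code. It assumes ASCII text, whitespace-delimited words, and Python's `string.punctuation` as the definition of punctuation; the function name `tokenize` is hypothetical, and the pattern-removal rules from the earlier tables are left out.

```python
import re
import string

# Expansions from the contraction table.
CONTRACTIONS = {
    "didn't": "did not", "don't": "do not", "who'd": "who would",
    "can't": "cannot", "won't": "will not", "wouldn't": "would not",
}

# Translation table mapping every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(
    string.punctuation, " " * len(string.punctuation)
)


def tokenize(text: str) -> list[str]:
    """Apply the listed normalization steps and return the surviving words."""
    text = text.strip().lower()  # trim the string, convert to lowercase
    words = []
    for raw in text.split():
        # Remove non-alphanumeric characters from both ends of the word.
        word = re.sub(r"^[^a-z0-9]+|[^a-z0-9]+$", "", raw)
        # Expand contractions while the apostrophe is still present.
        word = CONTRACTIONS.get(word, word)
        # Replace remaining punctuation with spaces; this may split the word.
        words.extend(word.translate(_PUNCT_TO_SPACE).split())
    # Remove words with fewer than 3 characters or a leading digit.
    return [w for w in words if len(w) >= 3 and not w[0].isdigit()]
```

Under these assumptions, `tokenize("  The 2nd trial didn't show beta-blocker effects (p<0.05).  ")` yields `["the", "trial", "did", "not", "show", "beta", "blocker", "effects"]`: "2nd" is dropped for its leading digit, and the fragments of "p<0.05" are dropped for length or a leading digit.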