
Text Categorization

Word Tokenizer Rules (requirements)

The Word Tokenizer tokenizes the TI (title) and AB (abstract) fields of citations and filters out unwanted words and characters. The following rules and requirements were derived from the training set of 2004 data.

  • "Copyright" related patterns
    | Pattern | Case | Action | Examples | Exceptions |
    |---|---|---|---|---|
    | ..."CopyrightCopyright.." | Yes | remove "CopyrightCopyright.." | 0284-10559555 | None |
    | ..."Copyright Copyright.." | Yes | remove "Copyright Copyright.." | 0285-10567770 | None |
    | ..."Copyright Crown copyright..." | Yes | remove "Copyright Crown copyright..." | 0294-10878556 | None |
    | ..." Copyright.." | Yes | remove " Copyright.." | 0296-10927060 | ..."( Copyright)" |
    | | | | 0319-11751007 | ..."with Copyright Clostridium perfringens type D epsilon toxin."... 0409-10208737 |
    | | | | | ..." Copyright Demodex canis mites and Leishmania spp." 0409-10213670 |
    | | | | | ..." Copyright Clostridium perfringens." 0415-10486158 |
    | | | | | ..." Copyright Act, may significantly affect" 0416-10518204 |
    | ...".Copyright.." | Yes | remove ".Copyright.." | | None |
    | ...")Copyright.." | Yes | remove ")Copyright.." | | None |
    | ..."? Copyright.." | Yes | remove "? Copyright.." | | None |
    | ...") Copyright.." | Yes | remove ") Copyright.." | | None |
    | ..."(Japanese Association of Intellectual Copyright #130,591)"... | Yes | remove "(Japanese Association of Intellectual Copyright #130,591)" | 0306-11276498 | None |
    | ..."Copyright 2001 Wiley-Liss, Inc." | Yes | remove "Copyright 2001 Wiley-Liss, Inc." | 0310-11391771 | None |
    | ..."Copyright -Copyright 2000 John Wiley & Sons, Ltd." | Yes | remove "Copyright -Copyright 2000 John Wiley & Sons, Ltd." | 0301-11114061 | None |
    | ..."GRASPCopyright workload system"... | Yes | do nothing | 0299-11049704 | None |
    | ..."PortfolioCopyright, a tool"... | Yes | do nothing | 0298-10995616 | None |
    | ..."(Copyright P<0.001)"... | Yes | do nothing | 0411-10373290 | None |

  • Remove "Crown Copyright Copyright.."

  • "ABSTRACT" related patterns
    | Pattern | Case | Action | Examples | Exceptions |
    |---|---|---|---|---|
    | ..."(abstract.." | Yes | remove "(abstract.." | 0314-11530280 | ..."(abstract)"... 0291-10775709 |
    | | | | | ..."(abstract 483)"... 0304-11167089 |
    | | | | | ..."(abstract proverbs)"... 0294-10877433 |
    | | | | | ..."(abstract or embedded principle method)"... 0313-11504010 |
    | | | | | ..."(abstract geometric figures vs. photographs of children)"... 0315-11575901 |
    | | | | | ..."(abstract/concrete judgement)"... 0315-11580903 |
    | | | | | ..."(abstracts)"... 0408-10200860 |
    | ..."(ABSTRACT.." | Yes | remove "(ABSTRACT.." | 0287-10654982 | None |
    | | | | 0300-11074393 | |
    | | | | 0303-11062606 | |
    | | | | 0304-11158495 | |
    | | | | 0310-11389298 | |
    | | | | 0411-10368776 | |
    | | | | 0412-10416202 | |
    | ..."(abstract truncated..)" | No | remove "(abstract truncated..)" | | None |
    | ..."(abstracts were not included)"... | No | remove "(abstracts were not included)" | 0289-10695616 | None |
    | ..."(abstracts presented at recent scientific meetings, manufacturers' package inserts)"... | Yes | remove "(abstracts presented at recent scientific meetings, manufacturers' package inserts)" | 0306-11261533 | None |

  • Remove "[abstract corrected]"

  • "comments" related patterns
    | Pattern | Case | Action | Examples | Exceptions |
    |---|---|---|---|---|
    | ..."[see comments]" | No | remove "[see comments]" | | None |
    | ..."(see comments)" | No | remove "(see comments)" | | None |
    | ..."[seecomments]" | No | remove "[seecomments]" | | None |
    | ..."[ see comments]" | No | remove "[ see comments]" | | None |
    | ..."(comments..)]" | No | remove "(comments..)]" | | None |
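
The rows above can be illustrated with a short Python sketch. It rests on two assumptions that the page does not state outright: that "No" in the Case column means the pattern is matched without regard to case, and that a pattern with no ".." is matched literally. The elided pattern "(comments..)]" is left out because its full form is not given. The name `remove_comment_markers` is hypothetical.

```python
import re

# Fully spelled-out patterns from the "comments" table.
# Assumption: Case = No means case-insensitive matching.
COMMENT_MARKERS = [
    "[see comments]",
    "(see comments)",
    "[seecomments]",
    "[ see comments]",
]

# re.escape keeps the brackets and parentheses literal.
_COMMENT_RE = re.compile(
    "|".join(re.escape(marker) for marker in COMMENT_MARKERS),
    re.IGNORECASE,
)


def remove_comment_markers(text: str) -> str:
    """Delete every listed marker from the text, ignoring case."""
    return _COMMENT_RE.sub("", text)


cleaned = remove_comment_markers("Aspirin and stroke [See Comments]")
```

Any whitespace left behind by a removed marker is handled by the later trim step.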

  • Other patterns
    | Pattern | Case | Action | Examples | Exceptions |
    |---|---|---|---|---|
    | ..."[correction..]"... | No | remove "[correction..]" | 0284-10535671, 10550444 | None |
    | | | | 0285-10583726, 10593612 | |
    | ..."[Key words:"... | No | remove "[key words:" | 0290-10751293 | None |
    | ..."[..]" | No | remove "[..]" | remove ..."[Diabetologia..]" 0284-10550410 | [..."[..]" |
    | | | | remove ..."[See editorial..]" 0285-10570444 | ..."[l: atlantoaxial fixation, biomechanics, cervical spine, instability, spinal instrumentation, transarticular screws]" 0303-11074673 |
    | | | | remove ..."[Editorial comment..]" 0306-11283830, 11283831 | |
    | | | | remove ..."[Originally published in..]" 0306-11276057, 11276063 | |
    | | | | remove ..."[see text]" 0290-10753871 | |
    | | | | remove ..."[formula: see text]" 0287-10629949 | |
    | | | | remove ..."[figure: see text]" 0292-10800657 | |
    | | | | remove ..."[not readable: see text]" 0302-11142883, 11142885 | |
    | | | | remove ..."[reaction: see text]" 0314-11556387 | |
    | | | | remove ..."[structures: see text]" 0295-10907885 | |
    | | | | remove ..."[structure: see text]" 0298-10998489 | |
    | | | | remove ..."[table: see text]" 0303-11190348 | |
    | | | | remove ..."[Table: see text]" 0319-11753520 | |
    | | | | remove ..."[The sequence data described in this paper have been submitted to..]" 0288-10673275 | |
    | | | | remove ..."[This abstract has been prepared centrally.]" 0292-10796470, 10796496, 10796629 | |
    | | | | remove ..."[Translations are provided in the International Abstracts..]" 0407-10052387, 10052388 | |
    | ..."[.. .]" | No | remove "[.. .]" | | ..."[The OTN PCR used in conjunction with CB18-CH-PK or IMS could be effectively used as a diagnostic and/or screening test for the detection of M. bovis in milk from herds with bovine tuberculosis.]" 0307-11289205 |
    | | | | | ..."[These syndromes can be a contributory causes of insulin resistance in a subpopulation with NIDDM.]" 0408-10199143 |
    | "["..."]" | No | remove "[" and "]" | | None |
    | ..."[published erratum..]" | No | remove "[published erratum..]" | | None |
    | ..."[forensic science international..]" | No | remove "[forensic science international..]" | | None |
    | ..."(published erratum..)" | No | remove "(published erratum..)" | | None |
    | ..."[in process citation]" | No | remove "[in process citation]" | | None |
    | ..."(in process citation)" | No | remove "(in process citation)" | | None |
    | ..."[corrected]" | No | remove "[corrected]" | | None |
    | ..."[correction of artistic]" | No | remove "[correction of artistic]" | | None |
    | ..."(letter)" | No | remove "(letter)" | | None |
    | ..."(letter)]" | No | remove "(letter)]" | | None |
    | ..."(editorial)]" | No | remove "(editorial)]" | | None |
    | ..."2-[substituted acetyl]-amino-5-alkyl-1,3,4-thiadiazoles" | No | do nothing | 0404-9868551 | None |

  • Individual cases
    | Pattern | Case | Action | Examples |
    |---|---|---|---|
    | ..."[J. Neuroimmunol. 104, 85-91]"... | Yes | remove "[J. Neuroimmunol. 104, 85-91]" | 0295-10900360 |

  • Trim the string

  • Convert to lowercase

  • Remove non-alphanumeric characters from the beginning and end of every word

  • Expand contractions
    | Pattern | Case | Action | Examples |
    |---|---|---|---|
    | ..."didn't"... | No | expand to "did not" | 0404-9875250 |
    | ..."don't"... | No | expand to "do not" | 0290-10738817 |
    | | | | 0294-10859841 |
    | | | | 0298-11019399 |
    | | | | 0312-11467217 |
    | ..."who'd"... | No | expand to "who would" | 0311-11425141 |
    | ..."can't"... | No | expand to "cannot" | 0316-11673724 |
    | ..."won't"... | No | expand to "will not" | 0314-11519969 |
    | ..."wouldn't"... | No | expand to "would not" | 0405-9892548 |
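
A minimal Python sketch of the contraction step, using only the six contractions listed above. The lookup table, the whole-word matching, and the function name `expand_contractions` are illustrative assumptions, not the tokenizer's actual implementation. Expansions are emitted in lowercase, which fits the earlier lowercase step.

```python
import re

# The six contractions listed in the table above and their expansions.
CONTRACTIONS = {
    "didn't": "did not",
    "don't": "do not",
    "who'd": "who would",
    "can't": "cannot",
    "won't": "will not",
    "wouldn't": "would not",
}

# Match whole words only; Case = No is assumed to mean case-insensitive.
_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def expand_contractions(text: str) -> str:
    """Replace each listed contraction with its expansion."""
    return _CONTRACTION_RE.sub(
        lambda match: CONTRACTIONS[match.group(1).lower()], text
    )


expanded = expand_contractions("patients who didn't respond can't continue")
# → "patients who did not respond cannot continue"
```

Expanding before punctuation is replaced matters: otherwise the apostrophe would be turned into a space and "didn't" would split into "didn" and "t".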

  • Replace punctuation with spaces

  • Remove words with fewer than 3 characters

  • Remove words that begin with a digit
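
The general normalization steps (trim, lowercase, strip word edges, replace punctuation, drop short words, drop words starting with a digit) can be sketched in Python as follows. This is a sketch under stated assumptions: "alphanumeric" is taken to mean ASCII letters and digits, "punctuation" is taken to be Python's `string.punctuation`, words are split on whitespace, and the pattern-removal and contraction steps are omitted. The name `normalize_tokens` is hypothetical.

```python
import re
import string
from typing import List

# Assumption: "non-alphanumeric" means anything outside 0-9 and a-z
# (the text is already lowercased when this is applied).
_EDGE_RE = re.compile(r"^[^0-9a-z]+|[^0-9a-z]+$")

# Assumption: "punctuation" is the ASCII set in string.punctuation.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize_tokens(text: str) -> List[str]:
    """Apply the trim, lowercase, edge-strip, punctuation, and filter steps."""
    text = text.strip().lower()
    # Strip non-alphanumeric characters from both ends of every word.
    words = [_EDGE_RE.sub("", word) for word in text.split()]
    # Replace remaining (word-internal) punctuation with spaces, then re-split.
    text = " ".join(words).translate(_PUNCT_TO_SPACE)
    # Keep words of 3+ characters that do not start with a digit.
    return [w for w in text.split() if len(w) >= 3 and not w[0].isdigit()]


tokens = normalize_tokens("  Beta-blockers (5 mg) reduced IL-6 in 24h.  ")
# → ['beta', 'blockers', 'reduced']
```

In this example "mg", "il", and "in" are dropped by the length filter, and "24h" is dropped because it begins with a digit.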