Text Categorization

Input Filter

The Text Categorization tools (JDI, STI, STRI) take free text as input, tokenize it into words (separated by spaces or tabs), and calculate the JD or ST scores. The text of interest, such as abstracts and titles, may contain punctuation and irrelevant characters and words, so an input filter is needed to remove them before the TC tools are applied. This preprocessing should be tailored to the application, because the nature of the text of interest differs from one application to another.

The TC tools provide a basic filter, the word extraction filter, to perform this input filtering. It is designed to remove irrelevant characters from the abstracts and titles of MEDLINE articles. The filter is on by default in the command-line tools. It can be turned off if users apply their own filter before applying the TC tools.
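
The filtering and tokenization described above can be sketched as follows. This is a minimal illustration, not the TC tools' actual implementation: the function names word_extraction_filter and tokenize, the use_filter switch, and the rule that every character other than a letter, digit, space, or tab counts as irrelevant are all assumptions made for the example.

```python
from __future__ import annotations

import re

# Assumption: any character that is not a letter, digit, space, or tab
# is treated as irrelevant. The real filter may apply different rules.
_IRRELEVANT = re.compile(r"[^A-Za-z0-9 \t]+")


def word_extraction_filter(text: str) -> str:
    """Replace each run of irrelevant characters with a single space."""
    return _IRRELEVANT.sub(" ", text)


def tokenize(text: str, use_filter: bool = True) -> list[str]:
    """Split text into words on spaces or tabs, optionally filtering first."""
    if use_filter:
        text = word_extraction_filter(text)
    # Drop the empty strings that leading or trailing separators produce.
    return [word for word in re.split(r"[ \t]+", text) if word]


if __name__ == "__main__":
    title = "Heart attack: risk factors (review)."
    print(tokenize(title))                    # filter on (the default)
    print(tokenize(title, use_filter=False))  # filter off
```

With the filter off, punctuation stays attached to the words, which is why callers who disable it would be expected to clean the text themselves first.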

  • JDI:
    • Use "|" as the delimiter to separate text and MeSH
    • Use TextInputFilter( ) to process text
    • Use MeshInputFilter( ) to process MeSH
      • Tokenize MeSH terms using "|" as the separator
      • Check if a MeSH term is a MeSH abbreviation
      • Check if a MeSH term is in the database tables

  • STI:
    • Use TextInputFilter( ) to process text

  • STRI:
    • Use TextInputFilter( ) to process text
    • Use MeshInputFilter( ) to process MeSH
      • Tokenize MeSH terms using "|" as the separator
      • Check if a MeSH term is a MeSH abbreviation
      • Check if a MeSH term is in the database tables
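
The JDI and STRI steps listed above can be sketched as follows. This is a rough illustration under several assumptions, not the tools' actual code: it assumes the text is the first "|"-separated field and the MeSH terms follow it, that an abbreviation is expanded to its full term, and that terms missing from the tables are dropped. The names parse_line and mesh_input_filter and the two in-memory tables are hypothetical stand-ins for MeshInputFilter( ) and the database tables, and the table entries are placeholders rather than real MeSH data.

```python
# Hypothetical stand-ins for the database tables (placeholder entries only).
MESH_ABBREVIATIONS = {"ABBR": "Example Heading"}
MESH_TERMS = {"Example Heading", "Another Heading"}


def mesh_input_filter(fields):
    """Return the MeSH terms from fields that pass both checks."""
    accepted = []
    for raw in fields:
        term = raw.strip()
        if not term:
            continue
        # Assumption: an abbreviation is replaced by its full term.
        term = MESH_ABBREVIATIONS.get(term, term)
        # Assumption: terms not found in the tables are discarded.
        if term in MESH_TERMS:
            accepted.append(term)
    return accepted


def parse_line(line):
    """Split one input line into its text and its accepted MeSH terms."""
    # Assumption: the first field is the text, the rest are MeSH terms.
    text, *mesh_fields = line.split("|")
    return text.strip(), mesh_input_filter(mesh_fields)


if __name__ == "__main__":
    print(parse_line("Some title text|ABBR|Another Heading|Unknown Term"))
```

In this sketch the text field is returned untouched; in the tools it would additionally pass through TextInputFilter( ).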