Text Categorization

Input Filter

The Text Categorization tools (JDI, STI, STRI) take free text as input, tokenize it into words (separated by spaces or tabs), and calculate the JD or ST scores. The text of interest, such as abstracts and titles, may contain punctuation, irrelevant characters, and irrelevant words, so an input filter is needed to remove them before the TC tools are applied. This preprocessing should be tailored to the application, because the nature of the text of interest differs from one application to another.
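The tokenization step described above can be illustrated with a minimal Python sketch. It is not the tools' own code; the function name `tokenize` is hypothetical, and the only behavior taken from the text is that words are separated by spaces or tabs.

```python
import re

def tokenize(text: str) -> list[str]:
    """Split free text into words on runs of spaces or tabs (illustrative only)."""
    # Leading or trailing separators would produce empty strings, so drop them.
    return [tok for tok in re.split(r"[ \t]+", text) if tok]

print(tokenize("Heart attack\tin  elderly patients"))
# ['Heart', 'attack', 'in', 'elderly', 'patients']
```

Note that punctuation stays attached to the tokens at this stage, which is why an input filter is needed.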

The TC tools provide a basic filter, the word extraction filter, to perform this input filtering. It is designed to remove irrelevant characters from the abstracts and titles of MEDLINE articles. The filter is on by default in the command-line tools. It can be turned off if users apply their own filter before running the TC tools.
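As a rough illustration of what a word extraction filter might do, the sketch below strips punctuation from the edges of each token and drops tokens that become empty. The actual filtering rules are not specified here, so this rule, the function name `extract_words`, and the `use_filter` parameter (standing in for the on/off switch) are all assumptions.

```python
import string

def extract_words(text: str, use_filter: bool = True) -> list[str]:
    """Hypothetical word extraction filter; not the TC tools' actual rules."""
    tokens = text.split()  # split on any whitespace
    if not use_filter:
        # Filter turned off: the caller has already cleaned the text.
        return tokens
    words = []
    for tok in tokens:
        # Assumed rule: remove punctuation at the start and end of a token.
        word = tok.strip(string.punctuation)
        if word:  # tokens made only of punctuation are dropped
            words.append(word)
    return words

print(extract_words("Aspirin (low-dose), for stroke prevention."))
# ['Aspirin', 'low-dose', 'for', 'stroke', 'prevention']
```

Internal punctuation such as the hyphen in "low-dose" is kept in this sketch; a real filter may treat it differently.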

JDI:
- Use "|" as the delimiter to separate text and MeSH
- Use TextInputFilter( ) to process text
  - Apply Word Extraction filter
  - Apply Acronym filter (TBD)
  - Filter out non-legal words to get legal words
  - Filter out duplicated words to get unique words
- Use MeshInputFilter( ) to process MeSH
  - Tokenize MeSH terms using "|" as separator
  - Check if MeSH is a MeSH abbreviation
  - Check if MeSH is in the database tables
STI:
- Use TextInputFilter( ) to process text
  - Apply Word Extraction filter
  - Apply Acronym filter (TBD)
  - Filter out non-legal words to get legal words
  - Filter out duplicated words to get unique words
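
The TextInputFilter( ) steps listed above can be sketched in Python as follows. This is an illustration under assumptions, not the tools' implementation: "legal" is taken to mean "present in a known vocabulary", the `LEGAL_WORDS` set is a made-up stand-in for that vocabulary, lowercasing is an added assumption, and the Acronym filter is omitted because it is marked TBD.

```python
import string

# Hypothetical vocabulary; the real tools presumably consult their own word list.
LEGAL_WORDS = {"aspirin", "stroke", "prevention", "therapy"}

def text_input_filter(text: str, legal_words: set[str] = LEGAL_WORDS) -> list[str]:
    """Sketch of the three text-filter steps: extract, keep legal, deduplicate."""
    # Step 1: word extraction (assumed: strip edge punctuation, lowercase).
    words = [tok.strip(string.punctuation).lower() for tok in text.split()]
    # Step 2: filter out non-legal words to get legal words.
    legal = [w for w in words if w in legal_words]
    # Step 3: filter out duplicates, keeping the first occurrence of each word.
    return list(dict.fromkeys(legal))

print(text_input_filter("Aspirin therapy: aspirin for stroke prevention."))
# ['aspirin', 'therapy', 'stroke', 'prevention']
```

`dict.fromkeys` is used for deduplication because it preserves first-seen order, which keeps the output deterministic.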
STRI:
- Use TextInputFilter( ) to process text
  - Apply Word Extraction filter
  - Apply Acronym filter (TBD)
  - Filter out non-legal words to get legal words
  - Filter out duplicated words to get unique words
- Use MeshInputFilter( ) to process MeSH
  - Tokenize MeSH terms using "|" as separator
  - Check if MeSH is a MeSH abbreviation
  - Check if MeSH is in the database tables
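
The MeshInputFilter( ) steps can be sketched in the same spirit. The two lookup tables below are tiny, made-up stand-ins for the database tables, and expanding an abbreviation to its full term is an assumption: the list above only says that the abbreviation check is performed, not what follows from it.

```python
# Hypothetical stand-ins for the database tables (illustrative entries only).
MESH_ABBREVIATIONS = {"MI": "Myocardial Infarction"}
KNOWN_MESH_TERMS = {"Aspirin", "Stroke", "Myocardial Infarction"}

def mesh_input_filter(
    mesh_field: str,
    abbreviations: dict[str, str] = MESH_ABBREVIATIONS,
    known_terms: set[str] = KNOWN_MESH_TERMS,
) -> list[str]:
    """Sketch: tokenize on '|', resolve abbreviations, keep known terms."""
    accepted = []
    for raw in mesh_field.split("|"):
        term = raw.strip()
        if not term:
            continue  # skip empty fields such as a trailing "|"
        # Assumed behavior: an abbreviation is replaced by its full term.
        term = abbreviations.get(term, term)
        # Keep only terms found in the (hypothetical) database table.
        if term in known_terms:
            accepted.append(term)
    return accepted

print(mesh_input_filter("Aspirin|MI|Unknown Term"))
# ['Aspirin', 'Myocardial Infarction']
```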