Text Categorization

Word Extraction Filter

The input text must be filtered to remove irrelevant words and keep only the words relevant to JDI, STI, and STRI. The word extraction filter performs this task. It is designed specifically for MEDLINE citations. Users should develop their own filters to remove irrelevant words and characters from the input text before applying JDI or STI in their applications.

MEDLINE titles and abstracts include some irrelevant text that should be filtered out to increase indexing accuracy. For example, punctuation and stopwords should be deleted, and contractions need to be expanded. Please see the algorithm below for details. This MEDLINE filter provides a method for this process.

  • Description:

    This Java API filters out irrelevant characters and words from MEDLINE titles and abstracts before JD indexing. It also expands contractions.

  • API Usage:
    • Word(Contractions contractions)
    • Word(String contractionFile)
    • Word(String contractionFile, boolean verbose)
    • GetFilteredStr(String inStr)

  • Inputs:
    • contractions.txt
      Columns: Contraction, Expansion 1, Expansion 2

  • Algorithm:
    • remove "(ABSTRACT TRUNCATED AT 250 WORDS)" if it is at the end
    • remove "(ABSTRACT TRUNCATED AT 400 WORDS)" if it is at the end
    • remove "(ABSTRACT TRUNCATED)" if it is at the end
    • remove "[published erratum ...]" if it is at the end
    • remove "[see comments]" if it is at the end
    • remove "[In Process Citation]" if it is at the end

    • for a foreign title (enclosed in brackets, [title]), remove "(published erratum ...)" if it is at the end
    • for a foreign title (enclosed in brackets, [title]), remove "(see comments)" if it is at the end
    • for a foreign title (enclosed in brackets, [title]), remove "(In Process Citation)" if it is at the end

    • remove the word if it contains no alphabetic character

    • decompose contractions (contractions.txt)

    • decompose words that contain punctuation (replace punctuation with a space)

    • remove non-alphanumeric characters at the beginning of words
    • remove non-alphanumeric characters at the end of words

    • remove words that contain no alphabetic characters
    • remove words that begin with a digit
    • remove words shorter than 3 characters
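
The algorithm steps above can be sketched in Java as follows. This is a minimal illustration, not the implementation behind the Word class: the class name WordFilterSketch and its methods are hypothetical, contractions are passed in as a simple map with one expansion each instead of being read from contractions.txt, the contraction lookup is assumed to be case-insensitive, and "punctuation" is assumed to mean ASCII punctuation. For foreign titles, the sketch assumes the parenthesized note sits either just before the closing bracket or at the very end.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch of the filtering steps; not the actual Word class.
final class WordFilterSketch {
    // Trailing markers such as "(ABSTRACT TRUNCATED AT 250 WORDS)" or "[see comments]".
    private static final Pattern TRAILING_NOTE = Pattern.compile(
        "\\s*(?:\\(ABSTRACT TRUNCATED(?: AT (?:250|400) WORDS)?\\)"
        + "|\\[(?:published erratum[^\\]]*|see comments|In Process Citation)\\])\\s*$");
    // Parenthesized variants used with bracketed (foreign) titles; group 1 keeps a closing "]".
    private static final Pattern FOREIGN_TRAILING_NOTE = Pattern.compile(
        "\\s*\\((?:published erratum[^)]*|see comments|In Process Citation)\\)(\\]?)\\s*$");
    private static final Pattern HAS_ALPHA = Pattern.compile("\\p{Alpha}");
    private static final Pattern PUNCT = Pattern.compile("\\p{Punct}");
    private static final Pattern EDGE_NON_ALNUM =
        Pattern.compile("^[^\\p{Alnum}]+|[^\\p{Alnum}]+$");

    // Repeatedly removes the known markers from the end of the text.
    static String stripTrailingNotes(String text) {
        String current = text.trim();
        String previous;
        do {
            previous = current;
            current = TRAILING_NOTE.matcher(current).replaceFirst("");
            if (current.startsWith("[")) { // assumption: foreign titles start with "["
                current = FOREIGN_TRAILING_NOTE.matcher(current).replaceFirst("$1");
            }
        } while (!current.equals(previous));
        return current;
    }

    // Applies the word-level steps in the order listed in the algorithm.
    static String filter(String text, Map<String, String> contractions) {
        List<String> kept = new ArrayList<>();
        for (String token : stripTrailingNotes(text).split("\\s+")) {
            if (!HAS_ALPHA.matcher(token).find()) {
                continue; // no alphabetic character at all
            }
            // Assumption: contractions are keyed in lower case, one expansion each.
            String expanded =
                contractions.getOrDefault(token.toLowerCase(Locale.ROOT), token);
            // Replace punctuation with spaces, which may split one token into several words.
            String spaced = PUNCT.matcher(expanded).replaceAll(" ");
            for (String part : spaced.trim().split("\\s+")) {
                String word = EDGE_NON_ALNUM.matcher(part).replaceAll("");
                if (word.length() < 3) {
                    continue; // too short (also covers empty strings)
                }
                if (Character.isDigit(word.charAt(0))) {
                    continue; // begins with a digit
                }
                if (!HAS_ALPHA.matcher(word).find()) {
                    continue; // no alphabetic character left
                }
                kept.add(word);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        Map<String, String> contractions =
            Collections.singletonMap("weren't", "were not");
        String input = "The patient's results weren't significant in 25 of 3rd-line cases."
            + " (ABSTRACT TRUNCATED AT 250 WORDS)";
        // Prints: The patient results were not significant line cases
        System.out.println(filter(input, contractions));
    }
}
```

Note that this sketch only applies the steps listed in the algorithm, so a word such as "The" survives because it has three characters. Stopword removal, if required, would be a separate step.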