Text Categorization

stWsd Tool Introduction

I. Background

STI and STRI are used for word sense disambiguation (WSD) in NLP projects and are considered among the best unsupervised methods in terms of precision. However, they require a lot of programming work from users:

  • First, users need to write their own tokenization program to retrieve the sentences. There are three types of input sentences in the NLM WSD test collection:
    • Target sentence: the sentence in which the ambiguous word appears
    • Entire citation: the entire title and abstract of the article
    • Ambiguous sentences: all sentences in the title and abstract that contain the ambiguous word or its morphological variants
  • Second, users have the extra work of finding the morphological variants of the ambiguous word in order to retrieve the ambiguous sentences.
  • Third, users need to force the ambiguous word to be a legal word in STI or STRI to avoid empty results.
  • Fourth, users need to decide which program (STI or STRI) and which score (DC or WC) works best for their application.
  • Fifth, users need to process the results from the TC package and its Java APIs to find the best sense when integrating STI/STRI into their WSD applications.
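
As a rough illustration of the first two items, the sketch below shows the kind of code a user currently has to write to collect the "ambiguous sentences" from a citation. It is not part of the TC package: the class and method names (WsdContextSketch, splitSentences, mentions, ambiguousSentences) are hypothetical, the regex-based sentence splitter is deliberately naive, and the morphological variants are assumed to be supplied by hand rather than generated by a lexical tool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not a TC package class.
class WsdContextSketch {

    // Naive sentence splitter: breaks after '.', '!' or '?' followed by whitespace.
    // Real citations (abbreviations, decimals) need a more careful tokenizer.
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        for (String s : text.split("(?<=[.!?])\\s+")) {
            String trimmed = s.trim();
            if (!trimmed.isEmpty()) {
                sentences.add(trimmed);
            }
        }
        return sentences;
    }

    // True if any token of the sentence equals one of the lowercase variants.
    static boolean mentions(String sentence, Set<String> variants) {
        for (String token : sentence.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (variants.contains(token)) {
                return true;
            }
        }
        return false;
    }

    // "Ambiguous sentences": every sentence of the citation (title + abstract)
    // that contains the ambiguous word or one of its variants.
    static List<String> ambiguousSentences(String citation, Set<String> variants) {
        List<String> hits = new ArrayList<>();
        for (String s : splitSentences(citation)) {
            if (mentions(s, variants)) {
                hits.add(s);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Variants are listed by hand here (assumption for the example).
        Set<String> variants = new HashSet<>(Arrays.asList("cold", "colds"));
        String citation = "Cold exposure in mice. Body weight was recorded. Colds were rare.";
        for (String s : ambiguousSentences(citation, variants)) {
            System.out.println(s);
        }
    }
}
```

The "entire citation" input is simply the unsplit text, and the "target sentence" is the single sentence in which the ambiguous instance occurs. The sketch covers only the middle case.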

To make WSD easier for users, the Lexical System Group plans to add a new tool, stWsd, to the TC package (after 2009). stWsd (Word Sense Disambiguation, based on Semantic Type) will take care of all the issues above and return the best sense. The only requirement is that users understand and use the stWsd Java APIs.

II. Design & Algorithm

III. Results