Text Categorization

Command Line Tools

  • mlt:
    The Mlt tool tokenizes fields from MEDLINE citations. The Title, Abstract, and MH (starred MHs and SHs only) fields, and combinations of these, are routinely tokenized and extracted from a citation. Other fields may be specified for tokenization as well (more).

  • jdi:
    The Jdi tool uses statistical associations between words and JDs, between MHs and JDs, and between SHs and JDs, derived from a training set of MEDLINE citations. The word-JD, MH-JD, and SH-JD scores are pre-calculated and loaded into a database. Jdi takes inputs that may be text phrases, MeSH terms, or a combination of both. Text input is filtered by word extraction algorithms, stopwords, minimum word length, etc. Jdi then calculates the average score over all inputs and sends the ranked JDs with their scores to the output.

    Jdi is the core methodology of the TC tools and is used by the Sti and Stri programs. Its applications include text categorization, content indexing, record retrieval, and word sense disambiguation (more).

  • sti:
    The Sti tool uses the Jdi methodology to calculate word-ST scores. The calculation uses words in the MEDLINE training set and ST (semantic type) documents; an ST document is a set of one-word UMLS Metathesaurus strings belonging to an ST. The word-ST scores for a word are calculated by comparing the JDI of the word with the JDI of each ST document. The pre-calculated word-ST scores are loaded into a database. Sti takes text phrases as input and applies filters such as word extraction algorithms, stopwords, minimum word length, etc. Sti then calculates the average word-ST scores over all inputs and sends the ranked STs with their scores to the output (more).

  • stri:
    The Stri tool uses the Jdi methodology as its basis. It uses ST (semantic type) documents; an ST document is a set of one-word UMLS Metathesaurus strings belonging to an ST. Stri takes inputs that may be text phrases or MeSH terms. Text input is filtered by word extraction algorithms, stopwords, minimum word length, etc. Stri then ranks the STs for an input by the similarity between the JDI of the input (the result of running the Jdi tool on the input in real time) and the pre-calculated JDI of each ST document, and sends the ranked STs with their scores to the output (more).

  • stWsd:
    The StWsd tool uses STI as its basis. If the senses of an ambiguous word are expressed as STs, STI is run on the context surrounding the word (phrase, sentence, paragraph, etc.), in the expectation that, in the ST indexing of the context, the correct ST for the word will rank or score higher than the other candidate STs for the word (more).
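
The filter, average, and rank steps described for Jdi and Sti, and the candidate selection described for StWsd, can be illustrated with a minimal Python sketch. This is not the TC tools' code or API. The function names, the stopword list, the minimum word length, and every score in the two tables are hypothetical values made up for illustration; the real tools load pre-calculated scores from a database. Skipping words that have no score entry before averaging is also an assumption.

```python
import re
from collections import defaultdict

# Hypothetical filter settings (the real tools' defaults are not shown here).
STOPWORDS = {"a", "and", "in", "of", "the", "with"}
MIN_WORD_LENGTH = 3

# Hypothetical toy scores; the real word-JD and word-ST scores are
# pre-calculated from a MEDLINE training set and stored in a database.
WORD_JD_SCORES = {
    "heart": {"Cardiology": 0.8, "Neurology": 0.1},
    "attack": {"Cardiology": 0.4, "Neurology": 0.2},
    "brain": {"Cardiology": 0.1, "Neurology": 0.9},
}
WORD_ST_SCORES = {
    "fever": {"Disease or Syndrome": 0.8, "Natural Phenomenon or Process": 0.1},
    "cough": {"Disease or Syndrome": 0.9, "Natural Phenomenon or Process": 0.05},
    "winter": {"Disease or Syndrome": 0.1, "Natural Phenomenon or Process": 0.7},
}


def extract_words(text):
    """Lowercase the text, split it into words, and apply the filters."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS and len(w) >= MIN_WORD_LENGTH]


def rank_categories(text, score_table):
    """Average the per-word scores and return (category, score) pairs, best first."""
    # Assumption: words without an entry in the table are ignored.
    words = [w for w in extract_words(text) if w in score_table]
    if not words:
        return []
    totals = defaultdict(float)
    for word in words:
        for category, score in score_table[word].items():
            totals[category] += score
    averaged = {category: total / len(words) for category, total in totals.items()}
    # Sort by descending score; break ties alphabetically for a stable order.
    return sorted(averaged.items(), key=lambda item: (-item[1], item[0]))


def disambiguate(context, candidate_sts, word_st_scores):
    """Pick the candidate ST that scores highest in the ST ranking of the context."""
    scores = dict(rank_categories(context, word_st_scores))
    # If no context word has a score, the first candidate is returned.
    return max(candidate_sts, key=lambda st: scores.get(st, 0.0))


if __name__ == "__main__":
    print(rank_categories("The heart attack", WORD_JD_SCORES))
    print(disambiguate(
        "fever and cough",
        ["Disease or Syndrome", "Natural Phenomenon or Process"],
        WORD_ST_SCORES,
    ))
```

With these toy tables, "The heart attack" ranks Cardiology above Neurology, and the context "fever and cough" selects "Disease or Syndrome" over "Natural Phenomenon or Process". Stri is not covered by this sketch, since it compares the JDI of the input against the JDI of each ST document rather than averaging word-ST scores.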