STMT

STMT Tutorial

STMT (Sub-Term Mapping Tools) provides sub-term related functions for NLP projects. This page describes that functionality by walking through an example. Please refer to the design documents for how the algorithm works.

Corpus
A subterm is a part of a term that is itself a term known to the corpus. The corpus is a collection of known terms. For example, if we want to find all subterms in the Lexicon, the corpus will be all terms in the Lexicon (LexItem). In addition, if we want to know the EUI (Entry Unique Identifier) of each subterm, the corpus can include a term-to-EUI mapping (term|EUI). Accordingly, STMT provides two ways to specify the corpus in the configuration file:
- CORPUS_FILE: term only
- SYNONYM_FILE: term|mapping, where the mapping is a synonym (such as an EUI)
The CORPUS_FILE is ignored if the SYNONYM_FILE is specified. In that case, the key (1st field) of each SYNONYM_FILE entry is used as a term in the corpus.
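As a rough illustration of the second option, the sketch below parses term|mapping lines into a dictionary whose keys form the corpus. It assumes one pipe-delimited entry per line; the function name load_synonym_corpus and the sample lines are hypothetical and are not part of STMT.

```python
from collections import defaultdict


def load_synonym_corpus(lines):
    """Parse 'term|mapping' lines into {term: [mappings]}.

    Assumption: one pipe-delimited entry per line. The dictionary keys
    (1st field) play the role of the corpus terms.
    """
    synonyms = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or "|" not in line:
            continue  # skip blank or malformed lines
        term, mapping = line.split("|", 1)
        synonyms[term].append(mapping)
    return dict(synonyms)


# Hypothetical sample entries in term|mapping form.
sample = ["dog|canine", "dog|puppy", "canine|K9"]
corpus = load_synonym_corpus(sample)
```

Here the keys of corpus ("dog" and "canine") are the known terms, and each key keeps every mapping listed for it.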
Normalization
In NLP practice, normalization is used to match terms aggressively in order to increase recall. For example, case and punctuation are often abstracted away because they contribute little to meaning. STMT uses the Lexical Tools APIs for normalization and provides the three most commonly used normalizations.
In addition, users can create their own normalization by implementing the abstract method in the Java StmtApi class: public abstract Vector Norm(String inStr);
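To make the idea concrete, here is a minimal sketch of a normalization that ignores case and punctuation. It is not the Lexical Tools implementation and does not reproduce LexItemNorm exactly; the name simple_norm is hypothetical, and the function returns a list only to mirror the Vector return type of Norm.

```python
import string


def simple_norm(in_str):
    """Lowercase the input, replace ASCII punctuation with spaces,
    and collapse runs of whitespace.

    Returns a one-element list (a one-to-one normalization).
    """
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    cleaned = in_str.lower().translate(table)
    return [" ".join(cleaned.split())]


normalized = simple_norm("Dog, and CAT!")
```

With this sketch, inputs that differ only in case or punctuation map to the same normalized string, which is what allows them to match the same corpus key.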

Example 1:
Let's say we use LexItemNorm to ignore case and punctuation and have a simple (synonym) corpus as follows:

Norm key	Synonym
dog	canine
dog	puppy
canine	K9
cat	feline
feline	kitty
dog and cat	pets

The following examples illustrate basic functions of subterms:

Input: Dog and cat

Functions	Results
In Corpus	true
The Longest Prefix	dog and cat
Prefixes	dog; dog and cat
Subterms	cat; dog; dog and cat
Subterm Synonym Substitutions	canine and cat; canine and feline; dog and feline; pets; puppy and cat; puppy and feline

Please note that prefix-related functions require a one-to-one normalization, such as LexItemNorm, to work properly.
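The basic functions in Example 1 can be sketched as follows. This is an illustration only, assuming the input is already normalized and that terms are split on whitespace; the function names are hypothetical and this brute-force search is not how STMT is implemented (see the design documents for the actual algorithm).

```python
# Corpus keys taken from the "Norm key" column of Example 1.
CORPUS = {"dog", "canine", "cat", "feline", "dog and cat"}


def in_corpus(norm_term):
    """True if the whole normalized term is a corpus key."""
    return norm_term in CORPUS


def prefixes(norm_term):
    """All leading word sequences that are corpus keys, shortest first."""
    words = norm_term.split()
    candidates = (" ".join(words[:end]) for end in range(1, len(words) + 1))
    return [c for c in candidates if c in CORPUS]


def longest_prefix(norm_term):
    """The longest leading word sequence in the corpus, or None."""
    found = prefixes(norm_term)
    return found[-1] if found else None


def subterms(norm_term):
    """All contiguous word sequences that are corpus keys, sorted."""
    words = norm_term.split()
    found = set()
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            candidate = " ".join(words[start:end])
            if candidate in CORPUS:
                found.add(candidate)
    return sorted(found)
```

Calling these on the normalized input "dog and cat" gives the same values as the first four rows of the results table.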

Example 2: Subterm Synonym Substitutions

Subterm synonym substitution is the most complicated operation in STMT. It consists of five steps, described below using the example above.

Step	Results
normTerm	dog and cat
subterms	subterm[0]: dog|0|1; subterm[1]: dog and cat|0|3; subterm[2]: cat|2|3
subterm patterns	Pattern[0]: dog|0|1|true, and|1|2|false, cat|2|3|false; Pattern[1]: dog|0|1|false, and|1|2|false, cat|2|3|true; Pattern[2]: dog|0|1|true, and|1|2|false, cat|2|3|true; Pattern[3]: dog and cat|0|3|true
synonym patterns	Pattern[0]: dog|0|1|canine|puppy, and|1|2, cat|2|3; Pattern[1]: dog|0|1, and|1|2, cat|2|3|feline; Pattern[2]: dog|0|1|canine|puppy, and|1|2, cat|2|3|feline; Pattern[3]: dog and cat|0|3|pets
synonym substitution permutations	canine and cat (1); canine and feline (2); dog and feline (1); pets (1); puppy and cat (1); puppy and feline (2)
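The five steps can be approximated with the sketch below. It is a simplified illustration, not the STMT implementation. It assumes whitespace tokenization, reads the numbers after each subterm as start and end word positions, and takes the number in parentheses to be the count of substituted subterms. All function names are hypothetical.

```python
from itertools import product

# Synonym corpus from Example 1 (norm key -> synonyms).
SYNONYMS = {
    "dog": ["canine", "puppy"],
    "canine": ["K9"],
    "cat": ["feline"],
    "feline": ["kitty"],
    "dog and cat": ["pets"],
}


def subterm_spans(words):
    """(start, end) word spans whose text is a corpus key, ordered by start."""
    return [
        (s, e)
        for s in range(len(words))
        for e in range(s + 1, len(words) + 1)
        if " ".join(words[s:e]) in SYNONYMS
    ]


def patterns(spans, pos=0):
    """Yield every set of non-overlapping spans at or after pos (incl. empty)."""
    yield []
    for i, (s, e) in enumerate(spans):
        if s >= pos:
            for rest in patterns(spans[i + 1:], e):
                yield [(s, e)] + rest


def substitutions(norm_term):
    """Map each substituted term to the number of subterms replaced."""
    words = norm_term.split()
    results = {}
    for pattern in patterns(subterm_spans(words)):
        if not pattern:
            continue  # no subterm selected, so nothing to substitute
        choices = [SYNONYMS[" ".join(words[s:e])] for s, e in pattern]
        for combo in product(*choices):
            out, pos = [], 0
            for (s, e), synonym in zip(pattern, combo):
                out.extend(words[pos:s])  # untouched words before the span
                out.append(synonym)       # replacement for the span
                pos = e
            out.extend(words[pos:])       # untouched words after the last span
            results[" ".join(out)] = len(pattern)
    return dict(sorted(results.items()))


result = substitutions("dog and cat")
```

For "dog and cat", the sketch finds the same three subterms and four non-empty patterns as the table, and result contains the same six substituted terms with the same counts.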

Sub-Term Mapping Tools