Rule-Types
I. Introduction
The software assigns all input words (from n-grams) to a set of predefined rule-types. The format is: RT_CAT_PATTERN
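As a minimal sketch of how a name in this format could be taken apart, the Java snippet below splits a rule-type name into its parts. It assumes that CAT is the first token after the `RT_` prefix and that PATTERN is everything after it; that reading is an interpretation of the format string, not something confirmed by RuleType.java. The class `RuleTypeName` and its methods are hypothetical helpers introduced only for illustration.

```java
// Illustrative sketch only: RuleTypeName is a hypothetical helper,
// not part of the actual RuleType.java program.
final class RuleTypeName {
    private static final String PREFIX = "RT_";

    /** Returns the assumed category part, e.g. "LEX" for "RT_LEX_LC_ALL_PUNC". */
    static String category(String ruleType) {
        String body = stripPrefix(ruleType);
        int sep = body.indexOf('_');
        return sep < 0 ? body : body.substring(0, sep);
    }

    /** Returns the assumed pattern part, e.g. "LC_ALL_PUNC"; empty when there is none (e.g. "RT_TBD"). */
    static String pattern(String ruleType) {
        String body = stripPrefix(ruleType);
        int sep = body.indexOf('_');
        return sep < 0 ? "" : body.substring(sep + 1);
    }

    /** Removes the leading "RT_" and rejects anything that does not look like a rule-type name. */
    private static String stripPrefix(String ruleType) {
        if (ruleType == null || !ruleType.startsWith(PREFIX)
                || ruleType.length() == PREFIX.length()) {
            throw new IllegalArgumentException("Not a rule-type name: " + ruleType);
        }
        return ruleType.substring(PREFIX.length());
    }

    public static void main(String[] args) {
        String[] names = {"RT_TBD", "RT_LEX_EM", "RT_LEX_LC_ALL_PUNC", "RT_INV_SINGLE_WORD"};
        for (String name : names) {
            System.out.println(name + " -> category=" + category(name)
                    + ", pattern=" + pattern(name));
        }
    }
}
```

Under this reading, a name such as RT_TBD has a category but no pattern, so the sketch returns an empty pattern for it rather than failing.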
II. RuleTypes - Examples
The following rule types are used to filter out invalid multiwords from MEDLINE n-grams (in the same sequential order as in the program RuleType.java):
Type | Description | Examples (with element word "mellitus")
---|---|---
Default: valid candidate multiword | |
RT_TBD | Candidate multiwords |
If the whole n-gram (input term) is in the Lexicon | |
RT_LEX_EM | Exact match |
RT_LEX_LC | Match after lowercasing |
RT_LEX_LE_PUNC | Match after removing punctuation from the leading and/or ending words |
RT_LEX_LC_LE_PUNC | Match after lowercasing and removing punctuation from the leading and/or ending words |
RT_LEX_ALL_PUNC | Match after removing all punctuation |
RT_LEX_LC_ALL_PUNC | Match after lowercasing and removing all punctuation |
RT_LEX_NUMBER | Number (after lowercasing and removing all punctuation) |
RT_LEX_DIGIT | All digits (after lowercasing and removing all punctuation) |
If the n-gram (input term) is not a valid multiword | |
RT_INV_SINGLE_WORD | A single word (unigram) |