SPECIALIST Lexicon

Rule-Types

I. Introduction
The software assigns each input term (from the n-grams) to one of a set of predefined rule types. The format is: RT_CAT_PATTERN

RT: RuleType
CAT:
- LEX: in Lexicon
- CAN: candidate
- INV: invalid multiwords
- TBD: to be determined
PATTERN:
- EM, LC, LE_PUNC, LC_LE_PUNC, ALL_PUNC, LC_ALL_PUNC, NUMBER, DIGIT
- SPVAR
- SINGLE_WORD
- LEAD_WORD
- END_ABB
- END_WORD
- LEAD_END_WORD
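
As a rough illustration of the RT_CAT_PATTERN format above, the Java sketch below splits a rule-type name into its CAT and PATTERN parts. The class RuleTypeName and its members are hypothetical names chosen for this example and are not taken from RuleType.java. The sketch also assumes that a name such as RT_TBD carries no PATTERN part.

```java
import java.util.Optional;

// Hypothetical helper (not part of the Lexicon tools): splits a rule-type
// name of the form RT_CAT_PATTERN into its category and pattern parts.
final class RuleTypeName {
    enum Category { LEX, CAN, INV, TBD }

    final Category category;
    final Optional<String> pattern; // empty for names such as RT_TBD

    private RuleTypeName(Category category, Optional<String> pattern) {
        this.category = category;
        this.pattern = pattern;
    }

    static RuleTypeName parse(String name) {
        // Limit of 3 keeps multi-part patterns such as LC_LE_PUNC in one piece.
        String[] parts = name.split("_", 3);
        if (parts.length < 2 || !parts[0].equals("RT")) {
            throw new IllegalArgumentException("Not a rule-type name: " + name);
        }
        // valueOf throws IllegalArgumentException for an unknown category.
        Category category = Category.valueOf(parts[1]);
        Optional<String> pattern = (parts.length == 3 && !parts[2].isEmpty())
                ? Optional.of(parts[2])
                : Optional.empty();
        return new RuleTypeName(category, pattern);
    }

    @Override
    public String toString() {
        return "RT_" + category + pattern.map(p -> "_" + p).orElse("");
    }

    public static void main(String[] args) {
        RuleTypeName name = RuleTypeName.parse("RT_LEX_LC_LE_PUNC");
        System.out.println(name.category + " / " + name.pattern.orElse("(none)"));
    }
}
```

Splitting with a limit, rather than on every underscore, is a design choice of this sketch: it lets PATTERN values that themselves contain underscores pass through unchanged.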

II. RuleTypes - Examples

The following rule types are used to filter out invalid multiwords from MEDLINE n-grams, in the same sequential order as in the program RuleType.java:

Type	Description	Examples (with element word "mellitus"; separated by " | ")
Default: valid candidate multiword
RT_TBD	candidate multiwords	mellitus diabetes mellitus
If the whole n-gram (input term) is in the Lexicon
RT_LEX_EM	exact match	diabetes mellitus | insulin-dependent diabetes mellitus
RT_LEX_LC	match (after lowercasing)	DIABETES MELLITUS | Insulin-dependent diabetes mellitus
RT_LEX_LE_PUNC	match (after removing punctuation at the lead and/or end words)	diabetes mellitus, | (diabetes mellitus | diabetes mellitus),
RT_LEX_LC_LE_PUNC	match (after lowercasing and removing punctuation at the lead and/or end words)	[Diabetes mellitus | DIABETES MELLITUS] | [Diabetes mellitus]
RT_LEX_ALL_PUNC	match (after removing all punctuation)	diabetes mellitus -
RT_LEX_LC_ALL_PUNC	match (after lowercasing and removing all punctuation)	DIABETES MELLITUS -
RT_LEX_NUMBER	number (after lowercasing and removing all punctuation)	fifty , Fifty
RT_LEX_DIGIT	all digits (after lowercasing and removing all punctuation)	50 , 50
If the n-gram (input term) is not a valid multiword
RT_INV_SINGLE_WORD	a single word (uni-gram)	mellitus | Mellitus
RT_INV_LEAD_WORD	beginning with a nonLead word: Auxiliary (3) - be, do, have, are, don't, has, etc.; Complementizer (1) - that; Conjunction (67) - and, or, but, as if, as well as, and/or, etc.; Determiner (38) - a, all, the, some, each, which, etc.; Modal (8) - can, dare, may, must, ought, shall, will, need, might, etc.; Preposition (216) - about, across from, to, on, in, at, by, as far as, etc.	have diabetes mellitus | that diabetes mellitus | or diabetes mellitus | which diabetes mellitus | may diabetes mellitus | of diabetes mellitus
RT_INV_END_ABB	ending word is an acronym in parentheses	mellitus (DM) | mellitus (DM),
RT_INV_END_WORD	ending with a nonEnd word: Auxiliary (3) - be, do, have, are, don't, has, etc.; Complementizer (1) - that; Conjunction (67) - and, or, but, as if, as well as, and/or, etc.; Determiner (38) - a, all, the, some, each, which, etc.; Modal (8) - can, dare, may, must, ought, shall, will, need, might, etc.; Preposition (216) - about, across from, to, on, in, at, by, as far as, etc.	diabetes mellitus have | diabetes mellitus that | diabetes mellitus or | diabetes mellitus which | diabetes mellitus may | diabetes mellitus in
RT_INV_LEAD_END_WORD	bi-gram composed of nonLead and nonEnd words	as if | across from
If the n-gram has spelling variants
RT_CAN_SPVAR	a group of n-grams that match a spelling-variant pattern; considered multiword candidates	noninsulin dependent diabetes mellitus | non-insulin dependent diabetes mellitus | non insulin-dependent diabetes mellitus | non insulin dependent diabetes mellitus
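
The ordered checks in the table can be sketched in Java roughly as follows. This is a simplified illustration and not the code of RuleType.java. The class RuleTypeSketch, the toy LEXICON, NON_LEAD and NON_END sets, and the punctuation-stripping regular expression are all assumptions made for this example. Only a subset of the rule types is covered: the ALL_PUNC, NUMBER, DIGIT, LEAD_END_WORD and SPVAR rules are omitted.

```java
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only: names, word lists and regexes are assumptions,
// not the contents of RuleType.java.
final class RuleTypeSketch {
    enum RuleType {
        RT_TBD, RT_LEX_EM, RT_LEX_LC, RT_LEX_LE_PUNC, RT_LEX_LC_LE_PUNC,
        RT_INV_SINGLE_WORD, RT_INV_LEAD_WORD, RT_INV_END_ABB, RT_INV_END_WORD
    }

    // Toy stand-ins for the Lexicon and the nonLead/nonEnd word lists.
    static final Set<String> LEXICON = Set.of("diabetes mellitus");
    static final Set<String> NON_LEAD = Set.of("have", "that", "or", "which", "may", "of");
    static final Set<String> NON_END = Set.of("have", "that", "or", "which", "may", "in");

    // Assumed reading of "removing punctuation at the lead and/or end words":
    // drop punctuation runs at the very start and very end of the term.
    static String stripEdgePunct(String s) {
        return s.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
    }

    static RuleType classify(String term) {
        String lower = term.toLowerCase(Locale.ROOT);

        // 1. Whole-term Lexicon matches, strictest first.
        if (LEXICON.contains(term)) return RuleType.RT_LEX_EM;
        if (LEXICON.contains(lower)) return RuleType.RT_LEX_LC;
        if (LEXICON.contains(stripEdgePunct(term))) return RuleType.RT_LEX_LE_PUNC;
        if (LEXICON.contains(stripEdgePunct(lower))) return RuleType.RT_LEX_LC_LE_PUNC;

        // 2. Invalid-multiword checks, in the order of the table.
        String[] words = term.trim().split("\\s+");
        if (words.length == 1) return RuleType.RT_INV_SINGLE_WORD;
        String first = words[0].toLowerCase(Locale.ROOT);
        String last = words[words.length - 1];
        if (NON_LEAD.contains(first)) return RuleType.RT_INV_LEAD_WORD;
        // Acronym in parentheses, optionally followed by punctuation: "(DM),"
        if (last.matches("\\([A-Z]+\\)\\p{Punct}*")) return RuleType.RT_INV_END_ABB;
        if (NON_END.contains(last.toLowerCase(Locale.ROOT))) return RuleType.RT_INV_END_WORD;

        // 3. Default: still a candidate multiword.
        return RuleType.RT_TBD;
    }

    public static void main(String[] args) {
        String[] samples = {
            "diabetes mellitus", "DIABETES MELLITUS", "[Diabetes mellitus]",
            "mellitus", "of diabetes mellitus", "mellitus (DM),", "diabetes mellitus in"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

Because the checks return on the first hit, their order matters: a term found in the Lexicon is tagged RT_LEX_* before any of the RT_INV_* tests run, and RT_TBD is reached only when every check fails.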
