The SPECIALIST Lexicon

Exclusive Filter: Lead-End-Units Model

I. Invalid lead-end-units

Multiwords do not start or end with words from seven closed-class parts of speech (POS): auxiliaries (be, do, etc.), complementizers (that), conjunctions (and, or, but, etc.), determiners (a, the, some, etc.), modals (may, must, can, etc.), pronouns (it, he, they, etc.), and prepositions (to, on, by, etc.). Such units are called invalid lead units or invalid end units. The exclusive filter uses them to exclude invalid multiwords from the n-Grams.

Invalid lead units can themselves consist of multiple words, such as "as if|conj", "as far as|prep", and "across from|prep". "as|prep|conj" is the lead unit in "as if" and "as far as"; that is, "as" is the parent unit of "as if" and "as far as". The behavior of a child unit should be treated as an exception to its parent unit and is not taken into consideration when deciding whether the parent unit is an invalid lead unit. For example, the valid unit "as if personality" is in the Lexicon. Thus:

  • "as if" is a valid lead unit
  • "as" is still an invalid lead unit (in this case)
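
The lead and end checks described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the Lexicon's actual implementation: the unit sets are small samples taken from the examples in this section, and the names INVALID_UNITS, VALID_LEAD_EXCEPTIONS, and passes_exclusive_filter are hypothetical.

```python
# Hypothetical sample of invalid lead/end units, stored as token tuples so
# that multiword units such as "as far as" can be matched as well.
INVALID_UNITS = {
    ("be",), ("do",), ("that",), ("and",), ("or",), ("but",),
    ("a",), ("the",), ("some",), ("may",), ("must",), ("can",),
    ("it",), ("he",), ("they",), ("to",), ("on",), ("by",),
    ("as",), ("as", "far", "as"), ("across", "from"),
}

# Child units that are valid lead units even though their parent unit ("as")
# is invalid. Assumed here from the "as if personality" example.
VALID_LEAD_EXCEPTIONS = {("as", "if")}


def starts_with_any(tokens, units):
    """Return True if the token list begins with any of the given units."""
    return any(tuple(tokens[:len(u)]) == u for u in units)


def ends_with_any(tokens, units):
    """Return True if the token list ends with any of the given units."""
    return any(tuple(tokens[-len(u):]) == u for u in units)


def passes_exclusive_filter(ngram):
    """Return True if the n-gram has no invalid lead unit and no invalid end unit."""
    tokens = ngram.lower().split()
    if starts_with_any(tokens, VALID_LEAD_EXCEPTIONS):
        lead_ok = True  # the child unit overrides its invalid parent unit
    else:
        lead_ok = not starts_with_any(tokens, INVALID_UNITS)
    return lead_ok and not ends_with_any(tokens, INVALID_UNITS)
```

In this sketch, "as if personality" passes because the exception for "as if" is checked before the parent unit "as", while "as a result" is still rejected by its lead unit "as".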

II. Candidate lead-end-units

On the other hand, a term that ends with certain units is likely to be a valid multiword. Examples of such units include index, test, assay, protein, factor, disease, syndrome, and procedure. These are called candidate end units. Candidate lead units and end units are used in the inclusive filter, after the exclusive filters have been applied to the n-Grams.
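
A minimal sketch of the end-unit half of such an inclusive filter, assuming the n-Grams have already passed the exclusive filter. The set contains only the example end units listed above, and the names CANDIDATE_END_UNITS, has_candidate_end, and inclusive_filter are hypothetical. Candidate lead units are not shown because this section gives no examples of them.

```python
# Example candidate end units taken from the text; the real list is
# presumably longer.
CANDIDATE_END_UNITS = {
    "index", "test", "assay", "protein",
    "factor", "disease", "syndrome", "procedure",
}


def has_candidate_end(ngram):
    """Return True if a multiword n-gram ends with a candidate end unit."""
    tokens = ngram.lower().split()
    # Require at least two tokens: a single word is not a multiword.
    return len(tokens) > 1 and tokens[-1] in CANDIDATE_END_UNITS


def inclusive_filter(ngrams):
    """Keep only the n-grams that end with a candidate end unit."""
    return [g for g in ngrams if has_candidate_end(g)]
```

For example, inclusive_filter(["body mass index", "Down syndrome", "index"]) keeps the first two terms and drops the single word "index".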