Lead-Unit Types
The results of the steps above (invalid lead-end-unit candidates: 444) are categorized into the following lead-unit types:
- Absolute Invalid Lead-Unit (382)
Terms from the invalid lead-end-unit candidates that are not lead units in the Lexicon. They are stored in the file invalidLeadUnits.data.abs.
They are used in the exclusive filter "absolute invalid lead-unit" to filter out any n-gram that starts with one of these absolute invalid lead-units.
- Valid Lead-Unit Pattern - Without Spelling Variants (52)
Units from the invalid lead-end-unit candidates that are lead units in the Lexicon. They are stored in the file validLeadUnits.data.pat.
They are used in the exclusive filter "valid lead-unit pattern, without spelling variants" to filter out any n-gram that starts with one of these lead-units and has none of the following spelling variant (spVar) patterns:
- non-spaced:
under floor|underfloor
in plane|inplane
- hyphened:
under floor|under-floor
below the knee|below-the-knee
in vitro grown|in vitro-grown|in-vitro grown|in-vitro-grown
- capitalized:
In some cases, capitalization could fit into a spVar pattern, such as:
a stage resin|A stage resin|A-stage resin
may apple|May apple|Mayapple|mayapple
However, capitalization is not considered a spVar pattern, so that more invalid MWEs are excluded: the spVars must include a non-spaced or hyphened pattern. Nevertheless, all-capitalized units are matched against their own spVars in the program, such as UNDER FLOOR|UNDER-FLOOR.
In other words, if an n-gram starts with one of these valid lead-units and no spelling variant (with space, hyphen, or capital) co-exists in the n-gram set, it is invalid.
Please note that:
Lead-Unit | Actions
---|---
the | Two LexRecords were found in 2014. Both of them ("the Netherlands", "the Staatliche Frauenklinik und Hebammenschule") are errors and were deleted. "The" should be added to the absolute invalid lead-unit type.
- Lead-Unit TBD - not used (10)
LexRecords that lead with these units do not have spVars. They are removed from the valid lead-unit pattern and need further observation.
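
The two exclusive filters described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the program's actual implementation. The function names and the two sample sets are hypothetical. The formats of invalidLeadUnits.data.abs and validLeadUnits.data.pat are not shown here, so the sets are filled with units inferred from the examples in this section. The sketch also assumes that the lead unit is the first space-separated word, compared in lower case, and that a spelling variant is any re-joining of the words with a space, a hyphen, or nothing, with case preserved.

```python
from itertools import product

# Hypothetical sample data (assumption): stand-ins for the contents of
# invalidLeadUnits.data.abs and validLeadUnits.data.pat.
ABS_INVALID_LEAD_UNITS = {"the"}
PATTERN_LEAD_UNITS = {"under", "in", "below"}


def spelling_variants(ngram):
    """Return the non-spaced/hyphened variants of an n-gram (case preserved)."""
    words = ngram.split(" ")
    variants = set()
    # Try every combination of separators between adjacent words.
    for seps in product((" ", "-", ""), repeat=len(words) - 1):
        candidate = words[0] + "".join(s + w for s, w in zip(seps, words[1:]))
        if candidate != ngram:  # the n-gram is not a variant of itself
            variants.add(candidate)
    return variants


def passes_lead_unit_filters(ngram, ngram_set):
    """Return False if either exclusive lead-unit filter rejects the n-gram."""
    lead = ngram.split(" ")[0].lower()  # assumption: case-insensitive lead unit
    # Filter 1: absolute invalid lead-unit.
    if lead in ABS_INVALID_LEAD_UNITS:
        return False
    # Filter 2: valid lead-unit pattern, without spelling variants.
    if lead in PATTERN_LEAD_UNITS:
        return any(v in ngram_set for v in spelling_variants(ngram))
    return True


ngrams = {
    "under floor", "under-floor", "UNDER FLOOR", "UNDER-FLOOR",
    "in the study", "the Netherlands", "heart attack",
}
kept = {g for g in ngrams if passes_lead_unit_filters(g, ngrams)}
```

With this sample set, "the Netherlands" is rejected by the first filter and "in the study" by the second, because no non-spaced or hyphened variant of it co-exists in the set. "under floor" and "UNDER FLOOR" are kept because "under-floor" and "UNDER-FLOOR" are present.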