The SPECIALIST Lexicon

Lead-Unit Types

The results from the steps above (444 invalid lead-end-unit candidates) are categorized into the following lead-unit types:

  • Absolute Invalid Lead-Unit (382)
    Terms from the invalid lead-end-unit candidates that are not lead units in the Lexicon. They are stored in the file invalidLeadUnits.data.abs and are used by the exclusive filter "absolute invalid lead-unit" to filter out any n-gram that starts with one of these absolute invalid lead-units.

  • Valid Lead-Unit Pattern - Without Spelling Variants (52)
    Units from the lead-end-unit candidates that are lead units in the Lexicon. They are stored in the file validLeadUnits.data.pat and are used by the exclusive filter "valid lead-unit pattern, without spelling variants" to filter out any n-gram that starts with one of these lead-units and has no spelling variant (spVar) of the following patterns:
    • non-spaced:
      under floor|underfloor
      in plane|inplane
    • hyphened:
      under floor|under-floor
      below the knee|below-the-knee
      in vitro grown|in vitro-grown|in-vitro grown|in-vitro-grown

    • capitalized:
      In some cases, capitalization could fit into the spVar pattern, such as:
      a stage resin|A stage resin|A-stage resin
      may apple|May apple|Mayapple|mayapple
      However, capitalization is not treated as a spVar pattern and is not used in normalization, so that more invalid MWEs are excluded: a spVar must involve the non-spaced or hyphened pattern. Nevertheless, each capitalized unit is checked against its own spVars in the program, such as UNDER FLOOR|UNDER-FLOOR.

    In other words, if an n-gram starts with one of these valid lead-units and no spelling variant of it (with space, hyphen, or capital) co-exists in the n-gram set, it is invalid.

    Please note that:

    Lead-Unit and Actions:
    • the
      Two LexRecords were found in 2014. Both of them ("the Netherlands", "the Staatliche Frauenklinik und Hebammenschule") are errors and were deleted. "The" should be added to the Absolute Invalid type.

  • Lead-Unit TBD - not used (10)
    LexRecords that lead with these units do not have spVars. These units are removed from the valid-lead-unit-pattern list and need further observation.
    • ex
    • inside
    • last
    • most
    • only
    • per
    • round
    • sensu
    • v.
    • w

    The following units are in the valid-lead-unit list for the spVar pattern. However, they might need more observation:

    • may
    • mine
    • minus
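
The two exclusive filters described above (absolute invalid lead-unit, and valid lead-unit pattern without spelling variants) can be sketched roughly as follows. This is only an illustration, not the actual program: the function names (first_word, spvar_key, filter_ngrams) are hypothetical, the lead-unit sets are passed in directly rather than read from invalidLeadUnits.data.abs and validLeadUnits.data.pat (whose formats are not described here), and the sketch assumes that a lead unit is the first space-delimited word of an n-gram, matched case-insensitively. Following the note on capitalization, two n-grams are treated as spVars only when they differ in spaces or hyphens, not in case.

```python
import re
from collections import defaultdict


def first_word(ngram):
    """Return the first space-delimited word, lowercased.

    Assumption: the lead unit of an n-gram is its first word and is
    matched case-insensitively.
    """
    words = ngram.split()
    return words[0].lower() if words else ""


def spvar_key(ngram):
    """Collapse spaces and hyphens so spaced, hyphened, and non-spaced
    forms share one key. Case is kept, so "UNDER FLOOR" only pairs with
    "UNDER-FLOOR" or "UNDERFLOOR", never with "under floor".
    """
    return re.sub(r"[\s-]+", "", ngram)


def filter_ngrams(ngrams, abs_invalid, valid_pattern):
    """Apply both exclusive filters and return the surviving n-grams."""
    # Group the surface forms in the n-gram set by their spVar key.
    groups = defaultdict(set)
    for ngram in ngrams:
        groups[spvar_key(ngram)].add(ngram)

    kept = []
    for ngram in ngrams:
        lead = first_word(ngram)
        # Filter 1: drop n-grams that start with an absolute invalid lead-unit.
        if lead in abs_invalid:
            continue
        # Filter 2: drop n-grams that start with a valid lead-unit but have
        # no co-existing spelling variant in the n-gram set.
        if lead in valid_pattern and len(groups[spvar_key(ngram)]) < 2:
            continue
        kept.append(ngram)
    return kept


# Illustrative sets only; the real lists come from the data files.
ABS_INVALID = {"the"}
VALID_PATTERN = {"under", "in"}

SAMPLE = [
    "the heart", "under floor", "under-floor", "under water",
    "in vitro grown", "in-vitro-grown", "UNDER FLOOR", "heart attack",
]
RESULT = filter_ngrams(SAMPLE, ABS_INVALID, VALID_PATTERN)
```

With this sample, "the heart" is removed by the first filter, while "under water" and "UNDER FLOOR" are removed by the second because no spaced, hyphened, or non-spaced variant of the same case co-exists in the set. Grouping by a normalized key avoids enumerating every possible variant of a long n-gram such as "in vitro grown".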