The SPECIALIST Lexicon

Exclusive Filter Rules: Derive invalid lead units and invalid end units from Lexicon

I. Introduction

By definition, no known multiword in the Lexicon should begin with an invalid lead unit. The only exception is when the lead unit of the multiword is a child unit of the invalid unit. For example, "above" is a valid lead unit because 13 multiwords in the Lexicon begin with "above":

  • above cited
  • above the knee amputations
  • above knee amputations
  • above knee amputee
  • above board
  • above knee
  • above reported
  • above listed
  • above named
  • above mentioned
  • above the knee amputation
  • above knee amputation
  • above knee amputees
On the other hand, "above" is an invalid end unit because the only multiword that ends with "above" is "over and above", which is a related unit in its own right and is also an invalid end unit.

II. Source of invalid lead-end-units

  • Get the invalid lead units and invalid end units from the latest Lexicon
    • Get all multiwords from Lexicon
    • Get candidate invalid units from aux, compl, conj, det, modal, pron, prep.
      Category        Examples              Lexicon.2014  Lexicon.2015
      auxiliary       be, do, etc.          3 (30)        3 (30)
      complementizer  that                  1 (1)         1 (1)
      conjunction     and, or, but, etc.    71 (71)       71 (71)
      determiner      a, the, some, etc.    38 (38)       38 (38)
      modal           may, must, can, etc.  8 (27)        8 (27)
      pronoun         it, he, they, etc.    87 (87)       87 (87)
      preposition     to, on, by, etc.      233 (233)     233 (233)
    • Get child units of the candidate lead-end units
    • Get invalid lead units (candidates that no multiword leads with)
    • Get invalid end units (candidates that no multiword ends with)
  • Apply the results to the exclusive filters.

III. Algorithm

  • Get candidates of invalid lead-end-units by combining aux, compl, conj, det, modal, pron, and prep.
  • Find valid/invalid lead-end-units by excluding lead-end-unit candidates that have multiwords in Lexicon:
    • Check each LeadEndUnit candidate against the Lexicon multiwords to calculate stats as follows (and save them in LeadEndUnitCandidateStatsObj):
      • if LeadEndUnit candidate matches (is) a multiword in Lexicon
        => matchNo++
      • else
        • if a multiword leads with the LeadEndUnit candidate, but does not lead with one of its child units
          => leadNo++
        • if a multiword ends with the LeadEndUnit candidate, but does not end with one of its child units
          => endNo++
    • LeadEndUnitCandidateStatsObj:
      • leadNo > 0: it is a valid lead-unit
      • leadNo = 0: it is an invalid lead-unit
      • endNo > 0: it is a valid end-unit
      • endNo = 0: it is an invalid end-unit
      • matchNo must be 0 or 1; it has no effect on whether the candidate is a valid lead-unit or end-unit
        • 0: if the lead-end-unit candidate is not a multiword in Lexicon (it is a single word)
        • 1: if the lead-end-unit candidate is a multiword in Lexicon
    • Results: invalid Lead-Units and invalid End-Units. Please note that these are units (not words), so they include multiword units such as "as far as", "across from", etc.

  • Examples:
    Each entry lists the LeadEndUnit candidate with its Matches No., its LeadUnit No. and LeadUnit status, and its EndUnit No. and EndUnit status. LT Examples and ET Examples follow where present.

    • across: Matches No. 0; LeadUnit No. 0 (Invalid); EndUnit No. 0 (Invalid)
    • across from: Matches No. 1; LeadUnit No. 0 (Invalid); EndUnit No. 0 (Invalid)
    • around: Matches No. 0; LeadUnit No. 0 (Invalid); EndUnit No. 1 (Valid)
      • ET Examples:
        • non-wrap around
    • as: Matches No. 0; LeadUnit No. 1 (Valid); EndUnit No. 0 (Invalid)
      • LT Examples:
        • as yet unidentified
    • as far as: Matches No. 1; LeadUnit No. 0 (Invalid); EndUnit No. 0 (Invalid)
    • as if: Matches No. 1; LeadUnit No. 2 (Valid); EndUnit No. 0 (Invalid)
      • LT Examples:
        • as if personality
        • as if personalities
    • down: Matches No. 0; LeadUnit No. 35 (Valid); EndUnit No. 12 (Valid)
      • LT Examples:
        • down flow
        • down flows
        • down fracture
        • down fractures
        • down gaze
        • down gradient
        • down growth
        • down growths
        • down modulator
        • down modulators
        • down payment
        • down payments
        • down regulate
        • down regulated
        • down regulates
        • down regulating
        • down regulation
        • down regulations
        • down regulatory
        • down river
        • down side
        • down sides
        • down slope
        • down sloped
        • down slopes
        • down sloping
        • down stage
        • down staged
        • down stages
        • down staging
        • down stroke
        • down strokes
        • down time
        • down times
        • down town
      • ET Examples:
        • bearing down
        • broken down
        • cone down
        • face down
        • knock down
        • let down
        • run down
        • step down
        • take down
        • touch down
        • up and down
        • upside down
    • above: Matches No. 0; LeadUnit No. 13 (Valid); EndUnit No. 0 (Invalid)
      • LT Examples:
        • above cited
        • above the knee amputations
        • above knee amputations
        • above knee amputee
        • above board
        • above knee
        • above reported
        • above listed
        • above named
        • above mentioned
        • above the knee amputation
        • above knee amputation
        • above knee amputees
    • on: Matches No. 0; LeadUnit No. 5 (Valid); EndUnit No. 3 (Valid)
      • LT Examples:
        • on again, off again
        • on and off
        • on duty
        • on call
        • on hand
      • ET Examples:
        • end on
        • head on
        • side on
    • on board: Matches No. 1; LeadUnit No. 1 (Valid); EndUnit No. 0 (Invalid)
      • LT Examples:
        • on board imaging
    • out: Matches No. 0; LeadUnit No. 12 (Valid); EndUnit No. 43 (Valid)
      • LT Examples:
        • out growths
        • out numbered
        • out performs
        • out numbers
        • out group
        • out number
        • out numbering
        • out performing
        • out perform
        • out groups
        • out growth
        • out performed
      • ET Examples:
        • read out
        • acting out
        • knock out
        • first in/first out
        • ...
    • out of: Matches No. 1; LeadUnit No. 10 (Valid); EndUnit No. 0 (Invalid)
      • LT Examples:
        • out of position
        • out of reach
        • out of hospital cardiac arrests
        • out of office
        • out of hospital cardiac arrest
        • out of date
        • out of phase
        • out of hospital
        • out of kilter
        • out of doors
    • up: Matches No. 0; LeadUnit No. 7 (Valid); EndUnit No. 29 (Valid)
      • LT Examples:
        • up front
        • up trend
        • up trends
        • up state
        • up and down
        • up gaze
        • up states
      • ET Examples:
        • beatings up
        • top up
        • step up
        • geared up
        • ...
    • up to: Matches No. 1; LeadUnit No. 1 (Valid); EndUnit No. 0 (Invalid)
      • LT Examples:
        • up to date
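
The counting rules in this section can be sketched in Java as follows. This is only a minimal illustration, not the code of the actual programs: the class and method names are hypothetical, units are assumed to be separated by single spaces, matching is assumed to be case-sensitive, and a multiword that is itself a child unit is assumed to count as covered by that child (which is what the "across" and "above" examples suggest).

```java
import java.util.List;

final class LeadEndUnitStats {
    // Counters for one candidate, loosely mirroring LeadEndUnitCandidateStatsObj.
    static final class Stats {
        int matchNo;
        int leadNo;
        int endNo;

        boolean isValidLeadUnit() { return leadNo > 0; }
        boolean isValidEndUnit() { return endNo > 0; }
    }

    // True if the multiword begins with the unit followed by a space.
    static boolean leads(String multiword, String unit) {
        return multiword.startsWith(unit + " ");
    }

    // True if the multiword ends with a space followed by the unit.
    static boolean ends(String multiword, String unit) {
        return multiword.endsWith(" " + unit);
    }

    // True if the multiword is a child unit, or leads/ends with one (assumption).
    static boolean coveredByChild(String multiword, List<String> children, boolean lead) {
        for (String child : children) {
            boolean hit = lead ? leads(multiword, child) : ends(multiword, child);
            if (multiword.equals(child) || hit) {
                return true;
            }
        }
        return false;
    }

    // Computes matchNo, leadNo, and endNo for one candidate.
    static Stats count(String candidate, List<String> childLeadUnits,
                       List<String> childEndUnits, List<String> multiwords) {
        Stats stats = new Stats();
        for (String mw : multiwords) {
            if (mw.equals(candidate)) {
                stats.matchNo++;
                continue;
            }
            if (leads(mw, candidate) && !coveredByChild(mw, childLeadUnits, true)) {
                stats.leadNo++;
            }
            if (ends(mw, candidate) && !coveredByChild(mw, childEndUnits, false)) {
                stats.endNo++;
            }
        }
        return stats;
    }

    public static void main(String[] args) {
        // A tiny hand-picked subset of multiwords, for illustration only.
        List<String> multiwords = List.of(
            "as if", "as if personality", "as if personalities",
            "as yet unidentified", "over and above", "above board");
        Stats as = count("as", List.of("as if"), List.of(), multiwords);
        System.out.println("as: matchNo=" + as.matchNo
            + " leadNo=" + as.leadNo + " endNo=" + as.endNo);
    }
}
```

In this sketch, "as if personality" is not counted for "as" because it leads with the child unit "as if", in line with the example rows above.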

IV. Processes/Programs

  • directory: ${MULTIWORDS_DIR}/bin
  • program: 3.InvalidLeadEndTerm
  • Run program: shell> ./3.InvalidLeadEndTerm ${YEAR}
  • Processes:

    Each step lists its Description, IO, and Notes - Examples.

    Step 1: Get all words and multiwords from Lexicon
      • Description:
        • GetWordsFromLex.java
        • Retrieves all words and multiwords from inflVars
      • Inputs:
        • ./inData/inflVars.data
      • Outputs:
        • ./outData/3.InvalidLeadEndTerm/lexMultiwords.data
        • ./outData/3.InvalidLeadEndTerm/lexWords.data
      • Notes - Examples:
        • 30 sec.
        • Link inflVars to the release file

    Step 2: Get invalid Lead-End-Unit candidates from Lexicon
      • Description:
        • GetInvalidLeadEndTermCandidates.java
        • Composed of units from aux, compl, conj, det, modal, pron, and prep
      • Inputs:
        • ./inData/inflVars.data
      • Outputs:
        • ./inData/allCats.data
        • ./inData/aux.data
        • ./inData/compl.data
        • ./inData/conj.data
        • ./inData/det.data
        • ./inData/modal.data
        • ./inData/pron.data
        • ./inData/prep.data
        • ./outData/3.InvalidLeadEndTerm/invalidLeadEndTermCandidates.data
      • Notes - Examples:
        • 5 sec.
        • Link aux, compl, conj, det, modal, pron, and prep from the release
        • Invalid Lead-End-Unit candidates are composed of aux, compl, conj, det, modal, pron, and prep
        • Invalid Lead-End-Unit candidates are used to get invalid lead-units and invalid end-units in the next steps

    Step 3: Get child lead-units and end-units of invalid Lead-End-Unit candidates
      • Description:
        • GetChildLeadEndTermsFromAFile.java
        • Goes through all invalid lead-end-unit candidates and finds child lead-units and child end-units
      • Inputs:
        • ./outData/3.InvalidLeadEndTerm/invalidLeadEndTermCandidates.data
      • Outputs:
        • ./outData/3.InvalidLeadEndTerm/invalidLeadTermsCandidatesChild.data
        • ./outData/3.InvalidLeadEndTerm/invalidEndTermsCandidatesChild.data
      • Notes - Examples:
        • 5 sec.
        • A child lead-unit of an invalid lead-end-unit candidate is itself a member of the invalid lead-end-unit candidates and leads with another invalid lead-end-unit candidate
        • For example, "up to" is a child lead-unit of "up" and a child end-unit of "to"
        • This data is used in step 4 to get the real invalid lead-units and invalid end-units

    Step 4: Get invalid LeadTerms and invalid EndTerms from LeadEndTerm candidates
      • Description:
        • GetAbsoluteInvalidLeadTermsEndTerms.java
        • Goes through all LeadEndTerm candidates and, by stats with details and child lists, gets:
          • invalid LeadTerms
          • invalid EndTerms
          • valid leadTerms
          • valid endTerms
      • Inputs:
        • ./outData/3.InvalidLeadEndTerm/invalidLeadEndTermCandidates.data
        • ./outData/3.InvalidLeadEndTerm/lexMultiwords.data
        • ./outData/3.InvalidLeadEndTerm/invalidLeadTermsCandidatesChild.data
        • ./outData/3.InvalidLeadEndTerm/invalidEndTermsCandidatesChild.data
      • Outputs:
        • ./outData/3.InvalidLeadEndTerm/leadEndTermCandidates.Stats.rpt
          => stats for each lead-end-unit: numbers of matching, leading, and ending multiwords
        • ./outData/3.InvalidLeadEndTerm/leadTerms.detail.rpt
          => multiwords that lead with a LeadEndTerm candidate
        • ./outData/3.InvalidLeadEndTerm/endTerms.detail.rpt
          => multiwords that end with a LeadEndTerm candidate
        • ./outData/3.InvalidLeadEndTerm/leadTerms.child.rpt
          => multiwords that lead with child lead-end-units
        • ./outData/3.InvalidLeadEndTerm/endTerms.child.rpt
          => multiwords that end with child lead-end-units
        • ./outData/3.InvalidLeadEndTerm/validLeadTerms.data
        • ./outData/3.InvalidLeadEndTerm/validEndTerms.data
        • ./outData/3.InvalidLeadEndTerm/invalidLeadTerms.data
        • ./outData/3.InvalidLeadEndTerm/invalidEndTerms.data
      • Notes - Examples:
        • 1 min.
        • leadEndTermCandidates.Stats.rpt is the total stats
        • leadTerms.detail.rpt and endTerms.detail.rpt are examples of leadTerms and endTerms
        • leadTerms.child.rpt and endTerms.child.rpt are examples of leadTerms and endTerms that lead and end with child units. These should not be counted under the parent unit.
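
The child-unit relation used in step 3 can be sketched as below. This is a hedged illustration only and is not taken from GetChildLeadEndTermsFromAFile.java: the class and method names are hypothetical, and it assumes that candidates are plain strings whose units are separated by single spaces.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

final class ChildUnitFinder {
    // Child lead-units of `parent`: other candidates that begin with `parent` plus a space.
    static List<String> childLeadUnits(String parent, Collection<String> candidates) {
        List<String> children = new ArrayList<>();
        for (String candidate : candidates) {
            if (candidate.startsWith(parent + " ")) {
                children.add(candidate);
            }
        }
        return children;
    }

    // Child end-units of `parent`: other candidates that end with a space plus `parent`.
    static List<String> childEndUnits(String parent, Collection<String> candidates) {
        List<String> children = new ArrayList<>();
        for (String candidate : candidates) {
            if (candidate.endsWith(" " + parent)) {
                children.add(candidate);
            }
        }
        return children;
    }

    public static void main(String[] args) {
        // A small hand-picked candidate list, for illustration only.
        List<String> candidates = List.of("up", "to", "up to", "as", "as far as", "as if");
        System.out.println("child lead-units of up: " + childLeadUnits("up", candidates));
        System.out.println("child end-units of to: " + childEndUnits("to", candidates));
    }
}
```

With this small candidate list, "up to" comes out as a child lead-unit of "up" and a child end-unit of "to", matching the example in the step 3 notes.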