The SPECIALIST Lexicon

Exclusive Filter: A Term That Leads with a Valid Lead-Unit and Matches the Pattern of No SpVar (VLU)

  • Description:
    If a term leads with a valid lead-unit (VLU) and has no spelling variant (spVar) co-existing in the n-gram set, it is not a valid multiword. Such terms are filtered out of the MEDLINE n-gram set. The spelling-variant patterns include hyphen (under floor|under-floor), non-space (under floor|underfloor), case (a stage resin|A stage resin), and combinations of these (a stage resin|A-stage resin).
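
The spelling-variant patterns above can be detected by reducing each term to a normalized key and comparing keys. The following is a minimal sketch, not the Lexicon implementation: the class name `SpVarPatternSketch` and the helpers `normKey` and `isSpVarPair` are hypothetical, and lowercasing every term is a simplification.

```java
import java.util.Locale;

class SpVarPatternSketch {
    // Hypothetical helper: collapse case, hyphen, and space differences into one key.
    static String normKey(String term) {
        return term.toLowerCase(Locale.ROOT).replaceAll("[-\\s]", "");
    }

    // Two distinct strings are treated as spelling variants when their keys agree.
    static boolean isSpVarPair(String a, String b) {
        return !a.equals(b) && normKey(a).equals(normKey(b));
    }

    public static void main(String[] args) {
        // The four pattern examples from the description.
        System.out.println(isSpVarPair("under floor", "under-floor"));     // hyphen
        System.out.println(isSpVarPair("under floor", "underfloor"));      // non-space
        System.out.println(isSpVarPair("a stage resin", "A stage resin")); // case
        System.out.println(isSpVarPair("a stage resin", "A-stage resin")); // combination
    }
}
```

A term such as "after a surgery" has no partner with the same key in the n-gram set, which is the condition this filter looks for.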

  • Examples:
    • within type I
    • after a surgery
    • for a policy

    The valid-lead-units are derived from the Lexicon. Some lead-units from the invalid lead-end-unit candidate list are valid-lead-units and are checked against the spVar pattern, such as "W", "many", "out of", etc. N-grams that start with any of these valid-lead-units and have no spelling variant co-existing in the n-gram set are most likely not valid multiwords. In 2014, the program found 63 valid-lead-units. Of these, 11 were removed, so only 52 valid-lead-units are used for the pattern of no spVar. Two wrong lexRecords were associated with "the", so "the" was moved to the absolute invalid lead-unit list. The terms "ex", "inside", "last", "most", "only", "per", "round", "sersu", "v.", and "w" have valid MWEs in the Lexicon without spVar, so they were removed as well. Please refer to the design documents of Lead-Unit Types for details.

  • Input Term: core-term
  • Filter Algorithm:
    • Logics:

      Description: Get invalid terms
      FilterType: FT_TBD
      Notes:
      • LeadTermPatObj.java
        • HashSet<String> leadTermSet_
        • HashSet<String> natureTermSet_
        • HashSet<String> orgTermSet_
      • Collect all terms that start with a valid-lead-unit
      • Save all terms with the same normSpVar (hyphenated or non-spaced)
        • lowercase the term if the valid-lead-unit is not upper case
        • get the nature-unit (strip lead-end-punc, for example: "- in details," => "in details")
        • save to a hashMap(normSpVar, LeadTermPatObj) if the nature-unit starts with a valid-lead-unit
      • Convert hashMap(normSpVar, LeadTermPatObj) to the invalid term list
        • find each LeadTermPatObj that matches the valid-lead-unit pattern
        • no spVar exists (checks LeadTermPatObj.natureTermSet_.size() <= 1)
        • not an invalid-lead-unit candidate

      Description: Check if the term is an invalid-unit
      FilterType: FT_LEAD_TERM_INV_PAT
      Notes: Use the invalid-unit-list from the above step
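
The logic above can be sketched in Java as follows. This is an illustrative sketch under stated assumptions, not the code in FilterLeadTermPat.java: the class `LeadTermPatSketch`, the simplified `PatObj` stand-in for LeadTermPatObj, and every method name are hypothetical. The rule that a lead-unit "matches the pattern" when some nature-unit starts with the lead-unit followed by a space is inferred from the walk-through on this page, and the lead-unit and candidate sets in `main` are made up for the demonstration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class LeadTermPatSketch {
    /** Simplified stand-in for LeadTermPatObj; the three sets mirror its fields. */
    static class PatObj {
        final Set<String> leadTermSet = new HashSet<>();
        final Set<String> natureTermSet = new HashSet<>();
        final Set<String> orgTermSet = new HashSet<>();
    }

    // Assumption: a unit is "upper case" when it has letters and none are lowercase, e.g. "I".
    static boolean isUpperCase(String unit) {
        return unit.equals(unit.toUpperCase(Locale.ROOT))
            && !unit.equals(unit.toLowerCase(Locale.ROOT));
    }

    // Strip leading and trailing punctuation and spaces: "- in details," -> "in details".
    static String natureUnit(String term) {
        return term.replaceAll("^[\\p{Punct}\\s]+|[\\p{Punct}\\s]+$", "");
    }

    // Collapse hyphenated and non-spaced variants into one key.
    static String normSpVar(String nature) {
        return nature.replaceAll("[-\\s]", "");
    }

    // Group terms by normSpVar when the nature-unit starts with a valid-lead-unit.
    static Map<String, PatObj> buildMap(List<String> terms, Set<String> validLeadUnits) {
        Map<String, PatObj> map = new HashMap<>();
        for (String org : terms) {
            for (String unit : validLeadUnits) {
                String cased = isUpperCase(unit) ? org : org.toLowerCase(Locale.ROOT);
                String nature = natureUnit(cased);
                if (nature.startsWith(unit)) {
                    PatObj obj = map.computeIfAbsent(normSpVar(nature), k -> new PatObj());
                    obj.leadTermSet.add(unit);
                    obj.natureTermSet.add(nature);
                    obj.orgTermSet.add(org);
                }
            }
        }
        return map;
    }

    // Assumed pattern check: some nature-unit has the form "<lead-unit> <rest>".
    static boolean matchesLeadUnit(PatObj obj) {
        for (String unit : obj.leadTermSet) {
            for (String nature : obj.natureTermSet) {
                if (nature.startsWith(unit + " ")) return true;
            }
        }
        return false;
    }

    // Collect the original terms of every group that matches, has no spVar, and is not a candidate.
    static Set<String> invalidTerms(List<String> terms, Set<String> validLeadUnits,
                                    Set<String> iletCandidates) {
        Set<String> invalid = new TreeSet<>();
        for (PatObj obj : buildMap(terms, validLeadUnits).values()) {
            boolean noSpVar = obj.natureTermSet.size() <= 1;
            boolean isIlet = obj.natureTermSet.stream().anyMatch(iletCandidates::contains);
            if (matchesLeadUnit(obj) && noSpVar && !isIlet) {
                invalid.addAll(obj.orgTermSet);
            }
        }
        return invalid;
    }

    public static void main(String[] args) {
        // Illustrative inputs only; the real lists come from the Lexicon data files.
        List<String> terms = List.of("in particular", "in-particular", "In conclusion",
                                     "in conclusion,", "internal", "in to", "after pressure");
        Set<String> leadUnits = Set.of("in", "I", "a", "after");
        Set<String> iletCandidates = Set.of("in to");
        System.out.println(invalidTerms(terms, leadUnits, iletCandidates));
    }
}
```

Keeping the original terms in a separate set lets the filter report punctuated or differently cased inputs (for example "in conclusion,") even though they share one nature-unit.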

    • source code: FilterLeadTermPat.java
    • FilterType: FilterType.FT_LEAD_TERM_INV_PAT

  • Accuracy Test on Lexicon:
    • InFile:
      • ${OUT_DATA}/03.LeadEndTerm/lexWords.data
      • ${OUT_DATA}/03.LeadEndTerm/validLeadTerms.data.pat
      • ${OUT_DATA}/03.LeadEndTerm/invalidLeadEndTermCandidates.data
    • Result:

      Lexicon | Filter | Sample No | Pass No | Trap No | Exp No | Pass-Rate
      2023 | FT_LEAD_TERM_INV_PAT | 1001867 | 1001762 | 105 | 0 | 99.9895%
      2022 | FT_LEAD_TERM_INV_PAT | 998845 | 998740 | 105 | 0 | 99.9895%
      2021 | FT_LEAD_TERM_INV_PAT | 992545 | 992443 | 102 | 0 | 99.9897%
      2020 | FT_LEAD_TERM_INV_PAT | 983420 | 983324 | 96 | 0 | 99.9902%
      2019 | FT_LEAD_TERM_INV_PAT | 972721 | 972633 | 88 | 0 | 99.9910%
      2018 | FT_LEAD_TERM_INV_PAT | 955564 | 955476 | 88 | 0 | 99.9908%
      2017 | FT_LEAD_TERM_INV_PAT | 935276 | 935192 | 84 | 0 | 99.9910%
      2016 | FT_LEAD_TERM_INV_PAT | 915583 | 915503 | 80 | 0 | 99.9913%
      2015 | FT_LEAD_TERM_INV_PAT | 896213 | 896120 | 93 | 0 | 99.9896%
      2014 | FT_LEAD_TERM_INV_PAT | 875090 | 874991 | 99 | 0 | 99.9887%
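
The figures in the result table are consistent with Pass-Rate = Pass No / Sample No and Trap No = Sample No - Pass No. The short sketch below recomputes two rows under that reading; the class and method names are hypothetical and the formula is inferred from the numbers rather than stated by the source.

```java
import java.util.Locale;

class PassRateSketch {
    // Pass rate in percent, assuming Pass-Rate = Pass No / Sample No.
    static double passRate(long sampleNo, long passNo) {
        return 100.0 * passNo / sampleNo;
    }

    // Trapped terms, assuming every sample is either passed or trapped.
    static long trapNo(long sampleNo, long passNo) {
        return sampleNo - passNo;
    }

    public static void main(String[] args) {
        // 2023 and 2014 rows from the table.
        long[][] rows = { {1001867L, 1001762L}, {875090L, 874991L} };
        for (long[] row : rows) {
            System.out.println(String.format(Locale.ROOT, "trap=%d rate=%.4f%%",
                trapNo(row[0], row[1]), passRate(row[0], row[1])));
        }
    }
}
```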

  • Example Walk Through (invalid terms):

    Operations | Contents
    Inputs (21):
    • Test

    • in particular
    • in-particular
    • inparticular
    • In particular
    • IN PARTICULAR
    • -in particular
    • in particular),
    • - in particular,

    • in conclusion
    • In conclusion
    • IN CONCLUSION
    • in conclusion,
    • -in conclusion,

    • in to
    • internal
    • all in all
    • after pressure
    • on-board imager
    • out of kilter
    • one gene-one enzyme hypothesis
    1. Form HashMap
    => Please note:
    "Test" is not in the HashMap because it does not match any valid-lead-unit.
    Some terms match multiple lead-units, such as "on-board imager".

    2. Check:
    2.1 Match Lead-Unit
    2.2 Has SpVar
    2.3 Is ILET

    3. Invalid Term:

    • Matches a lead-unit (2.1: true)
    • Has no SpVar (2.2: false)
    • Is not an ILET (2.3: false)
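    HashMap entries (key = normSpVar, value = LeadTermPatObj), with check results:

    key: internal
    • lead-units: in
    • nature-units: internal
    • org-units: internal
    2.1 Match Lead-Unit: false | 2.2 Has SpVar: n/a | 2.3 Is ILET: n/a | Valid?: valid

    key: allinall
    • lead-units: all, a
    • nature-units: all in all
    • org-units: all in all
    2.1 Match Lead-Unit: true | 2.2 Has SpVar: false | 2.3 Is ILET: false | Valid?: invalid

    key: innonspace
    • lead-units: in
    • nature-units: innonspace, in nonspace
    • org-units: innonspace, in nonspace
    2.1 Match Lead-Unit: true | 2.2 Has SpVar: true | 2.3 Is ILET: n/a | Valid?: valid

    key: INPARTICULAR
    • lead-units: I
    • nature-units: IN PARTICULAR
    • org-units: IN PARTICULAR
    2.1 Match Lead-Unit: false | 2.2 Has SpVar: n/a | 2.3 Is ILET: n/a | Valid?: valid

    key: inhyphen
    • lead-units: in
    • nature-units: in hyphen, in-hyphen
    • org-units: in hyphen, in-hyphen
    2.1 Match Lead-Unit: true | 2.2 Has SpVar: true | 2.3 Is ILET: n/a | Valid?: valid

    key: INCONCLUSION
    • lead-units: I
    • nature-units: IN CONCLUSION
    • org-units: IN CONCLUSION
    2.1 Match Lead-Unit: false | 2.2 Has SpVar: n/a | 2.3 Is ILET: n/a | Valid?: valid

    key: onegeneoneenzymehypothesis
    • lead-units: one, on
    • nature-units: one gene-one enzyme hypothesis
    • org-units: one gene-one enzyme hypothesis
    2.1 Match Lead-Unit: true | 2.2 Has SpVar: false | 2.3 Is ILET: false | Valid?: invalid

    key: afterpressure
    • lead-units: a, after
    • nature-units: after pressure
    • org-units: after pressure
    2.1 Match Lead-Unit: true | 2.2 Has SpVar: false | 2.3 Is ILET: false | Valid?: invalid

    key: into
    • lead-units: in
    • nature-units: in to
    • org-units: in to
    2.1 Match Lead-Unit: true | 2.2 Has SpVar: false | 2.3 Is ILET: true | Valid?: valid

    key: outofkilter
    • lead-units: out of, out
    • nature-units: out of kilter
    • org-units: out of kilter
    2.1 Match Lead-Unit: true | 2.2 Has SpVar: false | 2.3 Is ILET: false | Valid?: invalid

    key: inconclusion
    • lead-units: in
    • nature-units: in conclusion
    • org-units: in conclusion, IN CONCLUSION, In conclusion
    2.1 Match Lead-Unit: true | 2.2 Has SpVar: false | 2.3 Is ILET: false | Valid?: invalid

    key: Inparticular
    • lead-units: I
    • nature-units: In particular
    • org-units: In particular
    2.1 Match Lead-Unit: false | 2.2 Has SpVar: n/a | 2.3 Is ILET: n/a | Valid?: valid

    key: Inconclusion
    • lead-units: I
    • nature-units: In conclusion
    • org-units: In conclusion
    2.1 Match Lead-Unit: false | 2.2 Has SpVar: n/a | 2.3 Is ILET: n/a | Valid?: valid

    key: onboardimager
    • lead-units: on-board, on
    • nature-units: on-board imager
    • org-units: on-board imager
    2.1 Match Lead-Unit: true | 2.2 Has SpVar: false | 2.3 Is ILET: false | Valid?: invalid

    key: inparticular
    • lead-units: in
    • nature-units: in particular, in-particular, inparticular
    • org-units: In particular, in particular, in-particular, inparticular, IN PARTICULAR
    2.1 Match Lead-Unit: true | 2.2 Has SpVar: true | 2.3 Is ILET: n/a | Valid?: valid

    key: inother
    • lead-units: in
    • nature-units: in-other, inother
    • org-units: in-other, inother
    2.1 Match Lead-Unit: false | 2.2 Has SpVar: n/a | 2.3 Is ILET: n/a | Valid?: valid
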
    Invalid-Units (10):

    • in conclusion
    • In conclusion
    • IN CONCLUSION
    • in conclusion,
    • -in conclusion,

    • after pressure
    • all in all
    • on-board imager
    • one gene-one enzyme hypothesis
    • out of kilter
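
The three checks in the walk-through reduce to a single boolean rule: an entry is invalid only when it matches a lead-unit, has no SpVar, and is not an ILET. The sketch below encodes that rule and replays a few rows from the walk-through. The class and method names are hypothetical, and "n/a" cells are passed as `false` purely for illustration, since they are never consulted once an earlier check has decided the outcome.

```java
class LeadTermDecisionSketch {
    // Invalid only when 2.1 is true, 2.2 is false, and 2.3 is false.
    static boolean isInvalid(boolean matchesLeadUnit, boolean hasSpVar, boolean isIlet) {
        return matchesLeadUnit && !hasSpVar && !isIlet;
    }

    static String label(boolean invalid) {
        return invalid ? "invalid" : "valid";
    }

    public static void main(String[] args) {
        // Rows copied from the walk-through (key, then checks 2.1, 2.2, 2.3).
        System.out.println("internal     -> " + label(isInvalid(false, false, false)));
        System.out.println("inparticular -> " + label(isInvalid(true, true, false)));
        System.out.println("into         -> " + label(isInvalid(true, false, true)));
        System.out.println("inconclusion -> " + label(isInvalid(true, false, false)));
    }
}
```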