The SPECIALIST Lexicon

Exclusive Filter: A Term Leads with Absolute Invalid Lead-Unit (AILU)

  • Description:
    If a term leads with an absolute invalid lead-unit (AILU), it is not a valid multiword. Such terms are filtered out of the MEDLINE n-gram set.

  • Examples:
    • away from that
    • as to support
    • but a simple

    The absolute invalid lead-units (AILUs) are derived from the Lexicon. Some lead-units in the invalid lead-end-unit candidate list are absolute invalid lead-units, such as "about", "across", "across from", etc. N-grams that start with any of these absolute invalid lead-units are not valid multiwords. In 2014, 381 absolute invalid lead-units were derived by a computer program. "the" was then moved manually from the valid lead-units to the invalid lead-units because its presence among the valid lead-units resulted from an error in the Lexicon. The final file used for this model therefore contains 382 absolute invalid lead-units. Please refer to the design documents of Lead-Terms Types and the Lead-End-Terms Model for details.

    From 2020 onward, invalid Unicode lead terms are also included:

    Type          Terms
    Math Symbols  + - = ⁺ ⁻ × ÷ ⁼ ⅀ ∀ ∑ − ≅ ≠ ≤ ≥ ±
    Fractions     ¼ ½ ¾ ⅓ ⅔ ⅕ ⅖ ⅗ ⅘ ⅙ ⅚ ⅛ ⅜ ⅝ ⅞
    Signs         ® © °c °f ™

  • Input Term: core-term
  • Filter Algorithm:
    • Logic:

      • Norm: strip punctuation except for '/.-
        FilterType: FT_TBD
        Notes:
          • Optional
          • (ABB) of: => abb of

      • Norm case: go through all absolute invalid lead-units (AILUs)
        • Case-1.1: if AILT is not upper case and inTerm is upper case
        • Case-1.2: if AILT is not upper case and inTerm is mixed case and lead-word is not upper case
          => lowercase, use inTerm.lc
        • Case-2.1: if AILT is upper case
        • Case-2.2: if AILT is not upper case and inTerm is lower case
        • Case-2.3: if AILT is not upper case and inTerm is mixed case and lead-word is upper case
          => use inTerm (no change in case)
        Notes:
          Case  AILT  inTerm        inTerm converted
          Lowercase
          1.1   his   HIS PROBLEM   his problem
          1.2   his   His problem   his problem
          Keep case
          2.1   W/O   W/O problems  W/O problems
          2.2   his   his problem   his problem
          2.3   nor   NOR mice      NOR mice

      • Check if inTerm is an absolute invalid lead-unit (AILU)
        FilterType: FT_LEAD_TERM_ILT_ITSELF
        Notes:
          • Exceptions: AILTs are valid terms

      • Check if inTerm leads with AILT + " "
        FilterType: FT_LEAD_TERM_INV_ABS
        Notes:
          • his problem => invalid

    • source code: FilterLeadTermAbs.java
    • FilterType: FilterType.FT_LEAD_TERM_INV_ABS
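
The case normalization and the two checks described above can be sketched in Java as follows. This is a minimal illustration, not the code in FilterLeadTermAbs.java: the class name AiluFilterSketch, the method names, and the six-entry AILT list are hypothetical, the optional punctuation-stripping step is omitted, and the treatment of a term that is itself an AILT (it passes this filter) is an assumption based on the "Exceptions" note.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of the norm-case and lead-unit checks; all names are illustrative.
class AiluFilterSketch {
    // Illustrative subset only; the list described in the text has 382 entries.
    static final List<String> AILTS =
        List.of("about", "across", "across from", "his", "nor", "W/O");

    // True if s has at least one cased letter and no lowercase letters.
    static boolean isUpper(String s) {
        return s.equals(s.toUpperCase(Locale.ROOT))
            && !s.equals(s.toLowerCase(Locale.ROOT));
    }

    // True if s has no uppercase letters.
    static boolean isLower(String s) {
        return s.equals(s.toLowerCase(Locale.ROOT));
    }

    // Norm case: choose the form of inTerm that is compared against one AILT.
    static String normCase(String ailt, String inTerm) {
        if (isUpper(ailt) || isLower(inTerm)) {
            return inTerm;                       // Case-2.1 and Case-2.2: keep case
        }
        String leadWord = inTerm.split(" ", 2)[0];
        if (!isUpper(inTerm) && isUpper(leadWord)) {
            return inTerm;                       // Case-2.3: mixed case, upper-case lead-word
        }
        return inTerm.toLowerCase(Locale.ROOT);  // Case-1.1 and Case-1.2: lowercase
    }

    // Returns true if the term should be filtered out by this sketch.
    static boolean leadsWithAilt(String inTerm) {
        // Assumption: a term that is itself an AILT is the stated exception and passes.
        for (String ailt : AILTS) {
            if (normCase(ailt, inTerm).equals(ailt)) {
                return false;
            }
        }
        // Check whether the term leads with AILT + " ".
        for (String ailt : AILTS) {
            if (normCase(ailt, inTerm).startsWith(ailt + " ")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        for (String term : List.of("HIS PROBLEM", "His problem", "W/O problems", "NOR mice", "his")) {
            System.out.println(term + " -> filtered: " + leadsWithAilt(term));
        }
    }
}
```

With this subset, "HIS PROBLEM", "His problem", "his problem", and "W/O problems" are filtered, while "NOR mice" and the bare term "his" pass, which matches the case table above.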

  • Accuracy Test on Lexicon:
    • InFile:
      • ${OUT_DATA}/03.LeadEndTerm/lexWords.data
      • ${OUT_DATA}/03.LeadEndTerm/invalidLeadTerms.data.abs
    • Result:

      Lexicon  Filter                Sample No  Pass No  Trap No  Exp No  Pass-Rate
      2023     FT_LEAD_TERM_INV_ABS  1001867    1001823  44       461     99.9956%
      2022     FT_LEAD_TERM_INV_ABS  998845     998801   44       460     99.9956%
      2021     FT_LEAD_TERM_INV_ABS  992545     992501   44       460     99.9956%
      2020     FT_LEAD_TERM_INV_ABS  983420     983377   43       446     99.9956%
      2019     FT_LEAD_TERM_INV_ABS  972721     972665   56       442     99.9942%
      2018     FT_LEAD_TERM_INV_ABS  955564     955508   56       442     99.9941%
      2017     FT_LEAD_TERM_INV_ABS  935276     935222   54       422     99.9942%
      2016     FT_LEAD_TERM_INV_ABS  915583     915531   52       431     99.9943%
      2015     FT_LEAD_TERM_INV_ABS  896213     896167   46       427     99.9949%
      2014     FT_LEAD_TERM_INV_ABS  875090     875044   46       427     99.9947%
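
The figures in the result table are consistent with Trap No = Sample No - Pass No and Pass-Rate = Pass No / Sample No, shown as a percentage with four decimal places. The source does not state these formulas, so the short Java sketch below is an assumption inferred from the numbers; the class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper that recomputes the derived columns of the result table.
class PassRateSketch {
    // Number of Lexicon words trapped by the filter.
    static long trapNo(long sampleNo, long passNo) {
        return sampleNo - passNo;
    }

    // Pass rate as a percentage string with four decimal places.
    static String passRate(long sampleNo, long passNo) {
        return String.format(Locale.ROOT, "%.4f%%", 100.0 * passNo / sampleNo);
    }

    public static void main(String[] args) {
        // 2023 row of the table: Sample No 1001867, Pass No 1001823.
        System.out.println(trapNo(1001867L, 1001823L));
        System.out.println(passRate(1001867L, 1001823L));
    }
}
```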

Please note that two types of valid words are filtered out by mistake:
  • Init case type:
    His bundle: a valid word for a collection of heart muscle fibers, named after the Swiss cardiologist Wilhelm His Jr., who discovered them in 1893.
  • Upper case type:
    US EPA: United States Environmental Protection Agency

However, valid words of these types are very few. Also, these two trapped words are not multiwords and have been removed from the Lexicon after 2015. In other words, "the" should belong to the absolute invalid lead-unit list.

  • the Netherlands
  • the Staatliche Frauenklinik und Hebammenschule