The SPECIALIST Lexicon

Exclusive Filter: A Term Ends with Absolute Invalid End-Unit (AIEU)

  • Description:
    If a term ends with an absolute invalid end-unit (AIEU), it is not a valid multiword. Such terms are filtered out of the MEDLINE n-gram set.

  • Examples:
    • away from that
    • the source, and
    • the tumors, but

    The absolute invalid end-units (AIEUs) are derived from the Lexicon. Some end-units from the invalid lead-end-unit candidate list are absolute invalid end-units, such as "about", "across", "across from", etc. N-grams that end with any of these absolute invalid end-units are not valid multiwords. In 2014, there were 407 absolute invalid end-units. Please refer to the design documents of the Lead-End-Unit Model for details.
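The rule above can be sketched in a few lines of Python. This is only an illustration, not the Lexicon's Java implementation: the names `SAMPLE_AIEUS`, `ends_with_aieu`, and `filter_ngrams` are hypothetical, the AIEU sample is a small hand-picked set (the real list is derived from the Lexicon; "that", "and", and "but" are inferred from the examples above), and "walk across from" and "heart attack" are made-up n-grams.

```python
# Hypothetical sample only; the real AIEU list is derived from the Lexicon.
SAMPLE_AIEUS = {"about", "across", "across from", "that", "and", "but"}


def ends_with_aieu(term, aieus=SAMPLE_AIEUS):
    """Return True if the term ends with a space followed by any AIEU."""
    return any(term.endswith(" " + unit) for unit in aieus)


def filter_ngrams(ngrams, aieus=SAMPLE_AIEUS):
    """Keep only the n-grams that do not end with an AIEU."""
    return [term for term in ngrams if not ends_with_aieu(term, aieus)]


ngrams = [
    "away from that",
    "the source, and",
    "the tumors, but",
    "walk across from",   # made-up n-gram ending with a two-word AIEU
    "heart attack",       # made-up n-gram that should be kept
]
kept = filter_ngrams(ngrams)
print(kept)
```

Requiring the leading space means a term that is exactly an AIEU (for example "that" on its own) is not caught by this check.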

  • Input Term: core-term
  • Filter Algorithm:
    • Logics:

      Each step is listed with its description, FilterType, and notes.

      • Norm: strip punctuation except for '/.-
        • FilterType: FT_TBD
        • Notes:
          • Optional
          • in addition, the => in addition the
          • tissues (such => tissues such
      • Go through all abs-inv-end-units (AIEUs)
        • Case-1: if AIET is not upper case and inTerm is upper case
          => lowercase, use inTerm.lc
        • Case-2.1: if AIET is upper case
        • Case-2.2: if AIET is not upper case and inTerm is lower case
        • Case-2.3: if AIET is not upper case and inTerm is mixed case
          => for Case-2.1 to Case-2.3: use inTerm (no change in case)
        • Notes:

          Case     AIET   inTerm       inTerm converted
          Lowercase:
          1        the    FROM THE     from the
          Keep case:
          2.1      W/O    issue W/O    issue W/O
          2.2      the    from the     from the
          2.3.1    the    From the     From the
          2.3.2    nor    in the US    in the US
          2.3.3    as     Hex As       Hex As

      • Check if inTerm is an abs-inv-end-unit (AIEU)
        • FilterType: FT_END_TERM_IET_ITSELF
        • Notes: Exceptions: AIETs are valid terms
      • Check if inTerm ends with " " + AIET
        • FilterType: FT_END_TERM_INV_ABS
        • Notes: 3D US => invalid

    • source code: FilterEndTermAbs.java
    • FilterType: FilterType.FT_END_TERM_INV_ABS
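The filter logic described above can be approximated with the following Python sketch. It is not a port of FilterEndTermAbs.java, and it rests on several assumptions:

• the function names (`normalize`, `convert_case`, `check`) and the `SAMPLE_AIETS` list are hypothetical;
• "upper case" is taken to mean Python's `str.isupper()`;
• the lowercase AIET "us" is assumed for illustration and is not listed in the source;
• the "AIET itself" exception is checked before the ends-with test.

```python
import re

# Hypothetical sample; the real AIET list is derived from the Lexicon.
# The lowercase "us" entry is an assumption made for illustration.
SAMPLE_AIETS = ["the", "as", "W/O", "us"]


def normalize(term):
    """Optional norm step: strip punctuation except ' / . - (assumed regex)."""
    stripped = re.sub(r"[^\w\s'/.\-]", "", term)
    return " ".join(stripped.split())  # collapse any leftover whitespace


def convert_case(aiet, in_term):
    """Case-1: lowercase an all-uppercase inTerm when the AIET is not upper case.
    Case-2.1 to 2.3: keep inTerm unchanged."""
    if not aiet.isupper() and in_term.isupper():
        return in_term.lower()
    return in_term


def check(in_term, aiets=SAMPLE_AIETS, use_norm=True):
    """Return the name of the matching filter type, or None if no rule matches."""
    term = normalize(in_term) if use_norm else in_term
    # Exception: a term that is itself an AIET is treated as a valid term.
    for aiet in aiets:
        if convert_case(aiet, term) == aiet:
            return "FT_END_TERM_IET_ITSELF"
    # Invalid: the term ends with " " + AIET.
    for aiet in aiets:
        if convert_case(aiet, term).endswith(" " + aiet):
            return "FT_END_TERM_INV_ABS"
    return None


for sample in ["FROM THE", "issue W/O", "Hex As", "in the US", "3D US", "the"]:
    print(repr(sample), "->", check(sample))
```

Under these assumptions, the mixed-case terms "Hex As" and "in the US" keep their case and are not matched by the lowercase AIETs "as" and "us", while the all-uppercase "3D US" is lowercased first and is therefore flagged.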

  • Accuracy Test on Lexicon:
    • InFile:
      • ${OUT_DATA}/03.LeadEndTerm/lexWords.data
      • ${OUT_DATA}/03.LeadEndTerm/invalidEndTerms.data.abs
    • Result:

      Lexicon  Filter               Sample No  Pass No  Trap No  Exp No  Pass-Rate
      2023     FT_END_TERM_INV_ABS  1001867    1001861  6        471     99.9994%
      2022     FT_END_TERM_INV_ABS  998845     998839   6        471     99.9994%
      2021     FT_END_TERM_INV_ABS  992545     992539   6        471     99.9994%
      2020     FT_END_TERM_INV_ABS  983420     983414   6        456     99.9994%
      2019     FT_END_TERM_INV_ABS  972721     972715   6        442     99.9994%
      2018     FT_END_TERM_INV_ABS  955564     955558   6        442     99.9994%
      2017     FT_END_TERM_INV_ABS  935276     935273   3        442     99.9997%
      2016     FT_END_TERM_INV_ABS  915583     915580   3        438     99.9997%
      2015     FT_END_TERM_INV_ABS  896213     896210   3        435     99.9997%
      2014     FT_END_TERM_INV_ABS  875090     875087   3        436     99.9997%

      Please note that three valid terms are filtered out by mistake:
      • 3-D US: three-dimensional ultrasound
      • 3D US: three-dimensional ultrasound
      • PD US: power Doppler ultrasound
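The pass rates in the result table can be reproduced with simple arithmetic. The sketch below assumes that Pass No = Sample No - Trap No and that Pass-Rate = Pass No / Sample No, which is consistent with every row of the table. The function name `pass_rate` is hypothetical, and the Exp No column is not used.

```python
def pass_rate(sample_no, trap_no):
    """Return (pass_no, pass rate in percent), assuming pass_no = sample_no - trap_no."""
    pass_no = sample_no - trap_no
    return pass_no, 100.0 * pass_no / sample_no


# (Lexicon year, Sample No, Trap No) taken from the first and last table rows.
for year, sample_no, trap_no in [(2023, 1001867, 6), (2014, 875090, 3)]:
    pass_no, rate = pass_rate(sample_no, trap_no)
    print(year, pass_no, f"{rate:.4f}%")
```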