The SPECIALIST Lexicon

Exclusive Filter: A Term That Ends with a Valid End-Unit (VEU) and Matches the Pattern of No SpVar

  • Description:
    If a term ends with a valid end-unit (VEU) and has no spelling variant (spVar) co-existing in the n-gram set, it is not a valid multiword. Such terms are filtered out of the MEDLINE n-gram set. The spelling-variant patterns include hyphen (built in|built-in), non-space (built in|builtin), case (Built in|built in), and combinations of the above (Built-In|built in).
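The spelling-variant patterns above can be illustrated with a small normalization sketch. This is not the Lexicon source code: the class name `SpVarKeySketch` and the method `normSpVar` are hypothetical, and the sketch assumes that removing hyphens and spaces and lowercasing is enough to map all variants of a term to one key.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of the Lexicon code base.
final class SpVarKeySketch {
    private SpVarKeySketch() {}

    // Collapse hyphen, space, and case differences into a single key,
    // so that "built in", "built-in", "builtin", and "Built-In" all match.
    static String normSpVar(String term) {
        return term.replaceAll("[-\\s]+", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        for (String t : new String[] {"built in", "built-in", "builtin", "Built-In"}) {
            System.out.println(t + " -> " + normSpVar(t));
        }
    }
}
```

Under this assumption, two n-grams are spelling variants of each other when they differ as strings but share the same key.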

  • Examples:
    • analysis framework for
    • channel blockers should be
    • lotion in the treatment of
    • design was used to
    • lymph nodes up

    The valid end-units are derived from the Lexicon. Some end-units on the invalid lead-end-unit candidate list are valid end-units and are checked against the spVar pattern, such as "after", "for", and "worth". N-grams that end with any of these valid end-units and have no spelling variant co-existing in the n-gram set are most likely not valid multiwords. In 2014, the program found 37 valid end-units. 10 of them were removed, so only 27 valid end-units are used for the pattern of no spVar. The terms "I", "W", "all", "bar", "may", "mine", "minus", "need", "one", and "other" have valid multiwords (MWE) in the Lexicon without spVar, and thus they are removed as well. Please refer to the design documents on End-Unit Types for details.

  • Input Term: core-term
  • Filter Algorithm:
    • Logics:

      Description | FilterType | Notes
      Get invalid terms | FT_TBD | (steps listed below)
      • EndTermPatObj.java
        • HashSet<String> endTermSet_
        • HashSet<String> natureTermSet_
        • HashSet<String> orgTermSet_
      • Collect all terms that end with a valid end-unit
      • Save all terms with the same normSpVar (hyphenated or non-spaced)
        • lowercase terms if the valid end-unit is not upper case
        • lowercase terms if the terms are not all upper case
        • save to a hashMap(normSpVar, EndTermPatObj) if the nature-unit ends with a valid end-unit
      • Convert hashMap(normSpVar, EndTermPatObj) to the invalid term list
        • Find each EndTermPatObj that matches the valid end-unit pattern
        • no spVar (checks EndTermPatObj.natureTermSet_.size() <= 1)
        • not an invalid-end-unit candidate
      Check if a term is invalid | FT_END_TERM_INV_PAT | Use the invalid term list from the step above

    • source code: FilterEndTermPat.java
    • FilterType: FilterType.FT_END_TERM_INV_PAT
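The filter logic can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in FilterEndTermPat.java. The class `EndTermPatFilterSketch` and its methods are hypothetical. The sketch counts distinct surface forms per normalized key instead of using EndTermPatObj, and it leaves out the upper-case handling and the invalid-end-unit candidate check. The valid end-unit set is passed in by the caller.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch for illustration; names are not from the Lexicon source.
final class EndTermPatFilterSketch {
    private EndTermPatFilterSketch() {}

    // Key shared by hyphen, non-space, and case variants (assumed normalization).
    static String normSpVar(String term) {
        return term.replaceAll("[-\\s]+", "").toLowerCase(Locale.ROOT);
    }

    // Last space-delimited unit of the term, lowercased.
    static String endUnit(String term) {
        String t = term.trim().toLowerCase(Locale.ROOT);
        return t.substring(t.lastIndexOf(' ') + 1);
    }

    // Returns the n-grams that end with a valid end-unit and have no
    // spelling variant co-existing in the n-gram list.
    static Set<String> invalidTerms(List<String> nGrams, Set<String> validEndUnits) {
        // Step 1: group all surface forms by their normalized key.
        Map<String, Set<String>> formsByKey = new HashMap<>();
        for (String term : nGrams) {
            formsByKey.computeIfAbsent(normSpVar(term), k -> new HashSet<>()).add(term);
        }
        // Step 2: flag terms that end with a valid end-unit and have no variant.
        Set<String> invalid = new LinkedHashSet<>();
        for (String term : nGrams) {
            boolean endsWithVeu = validEndUnits.contains(endUnit(term));
            boolean noSpVar = formsByKey.get(normSpVar(term)).size() <= 1;
            if (endsWithVeu && noSpVar) {
                invalid.add(term);
            }
        }
        return invalid;
    }
}
```

For example, given the n-grams "lymph nodes up", "follow up", and "follow-up" with "up" in the valid end-unit set (an illustrative choice, not taken from the actual list), the sketch flags only "lymph nodes up", because "follow up" has the co-existing variant "follow-up".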

  • Accuracy Test on Lexicon:
    • InFile:
      • ${OUT_DATA}/03.LeadEndTerm/lexWords.data
      • ${OUT_DATA}/03.LeadEndTerm/validEndTerms.data.pat
      • ${OUT_DATA}/03.LeadEndTerm/invalidLeadEndTermCandidates.data
    • Result:

      Lexicon | Filter | Sample No | Pass No | Trap No | Exp No | Pass-Rate
      2023 | FT_END_TERM_INV_PATTERN | 1001867 | 1001838 | 29 | 0 | 99.9971%
      2022 | FT_END_TERM_INV_PATTERN | 998845 | 998816 | 29 | 0 | 99.9971%
      2021 | FT_END_TERM_INV_PATTERN | 992545 | 992516 | 29 | 0 | 99.9971%
      2020 | FT_END_TERM_INV_PATTERN | 983420 | 983391 | 29 | 0 | 99.9971%
      2019 | FT_END_TERM_INV_PATTERN | 972721 | 972692 | 29 | 0 | 99.9970%
      2018 | FT_END_TERM_INV_PATTERN | 955564 | 955535 | 29 | 0 | 99.9970%
      2017 | FT_END_TERM_INV_PATTERN | 935276 | 935247 | 29 | 0 | 99.9969%
      2016 | FT_END_TERM_INV_PATTERN | 915583 | 915554 | 29 | 0 | 99.9968%
      2015 | FT_END_TERM_INV_PATTERN | 896213 | 896190 | 23 | 0 | 99.9974%
      2014 | FT_END_TERM_INV_PATTERN | 875090 | 875068 | 22 | 0 | 99.9975%
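The pass rates in the table are consistent with the pass count divided by the sample count, and the trap count with the difference between the two. The short sketch below reproduces that arithmetic; the class and method names are hypothetical, and the formula is inferred from the figures rather than stated in the source.

```java
import java.util.Locale;

// Hypothetical helper for illustration; the formula is inferred from the table.
final class PassRateSketch {
    private PassRateSketch() {}

    // Pass rate as a percentage of the sample size.
    static double passRatePercent(long sampleNo, long passNo) {
        return 100.0 * passNo / sampleNo;
    }

    // Format with four decimal places, as in the table.
    static String format(double percent) {
        return String.format(Locale.ROOT, "%.4f%%", percent);
    }

    public static void main(String[] args) {
        // 2023 row: sample 1001867, pass 1001838 (29 trapped terms)
        System.out.println(format(passRatePercent(1001867L, 1001838L)));
    }
}
```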