The SPECIALIST Lexicon

Exclusive Filter: A Term with min. Document Count and Word Count

  • Description:
    If a term's document count (DC) or word count (WC) is less than the specified minimum DC or WC, it is not a valid (or commonly used) multiword. Such terms are filtered out of the MEDLINE n-gram set. For example, the following terms are invalid multiwords:
    • n2oniliforme
    • m2per
    • protocol, recruitment
    • embryos (61.9%)

    The MEDLINE n-gram set is used to retrieve the DC and WC. It uses 30 as the minimum WC and 1 as the minimum DC. Many multiwords in the Lexicon are not in the n-gram set, for the following reasons:

    • They are spelling variants: their spVars exist in the n-gram set, but they themselves do not reach the minimum WC (30).
    • The n-grams might appear in different forms, such as different case and punctuation, so no single form reaches the minimum WC (30).
    • The Lexicon records some multiwords that have a low occurrence (WC).
    Thus, we do not think this is the right (frequency) filter to use when generating multiwords. Instead, we should use the normalized word count (NWC) along with DC|WC to get a better result from the frequency test. Please note that the normalized document count (NDC) can't be calculated.
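
    To illustrate the idea of a normalized word count, here is a minimal Java sketch. It assumes that normalization means lowercasing and replacing punctuation with spaces, and that NWC is the sum of the WC of all n-grams sharing the same normalized form. The source does not define NWC, so the class name, the method names, and the counts below are all hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of a normalized word count (NWC); not the project's own code. */
final class NormalizedWordCountSketch {

    /** Assumed normalization: lowercase, punctuation to spaces, collapse whitespace. */
    static String normalize(String nGram) {
        return nGram.toLowerCase(Locale.ROOT)
                .replaceAll("\\p{Punct}", " ")
                .trim()
                .replaceAll("\\s+", " ");
    }

    /** Sums the raw WC of every n-gram that maps to the same normalized form. */
    static Map<String, Long> normalizedWordCounts(Map<String, Long> rawWc) {
        Map<String, Long> nwc = new HashMap<>();
        for (Map.Entry<String, Long> e : rawWc.entrySet()) {
            nwc.merge(normalize(e.getKey()), e.getValue(), Long::sum);
        }
        return nwc;
    }

    public static void main(String[] args) {
        // Made-up counts for illustration only: each form alone is below WC 30.
        Map<String, Long> rawWc = new HashMap<>();
        rawWc.put("Heart Attack", 20L);
        rawWc.put("heart attack", 15L);
        rawWc.put("heart-attack", 5L);

        Map<String, Long> nwc = normalizedWordCounts(rawWc);
        System.out.println(nwc.get("heart attack")); // prints 40
    }
}
```

    In this sketch only WC is summed across forms; DC is left untouched, in line with the note that NDC can't be calculated.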

  • Filter Algorithm:
    • Logic:

      Description                            FilterType          Notes
      get DC|WC from n-gram                  FT_TBD
      if not in the MEDLINE n-gram set       FT_WC_DC_NOT_FOUND  Exceptions: valid terms whose DC|WC is not found in the n-gram set
      Check if (dc < minDc) or (wc < minWc)  FT_WC_DC_INV_LESS   Filtered: invalid terms with DC|WC less than the minimum

    • source code: FilterDcWc.java
    • FilterType: FilterType.FT_WC_DC_INV_LESS
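
    The filter logic above can be sketched in Java as follows. This is an illustration under stated assumptions, not the contents of FilterDcWc.java: the class DcWcFilterSketch, the Counts holder, and the classify method are hypothetical, and the sketch assumes that FT_TBD means the term was not trapped by this filter. Only the three FilterType names come from the logic table.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the DC|WC frequency filter; not the actual FilterDcWc.java. */
final class DcWcFilterSketch {

    /** Outcomes named after the FilterType values in the logic table. */
    enum FilterType { FT_TBD, FT_WC_DC_NOT_FOUND, FT_WC_DC_INV_LESS }

    /** Document count and word count of one n-gram (hypothetical holder type). */
    static final class Counts {
        final long dc;
        final long wc;

        Counts(long dc, long wc) {
            this.dc = dc;
            this.wc = wc;
        }
    }

    /** Classifies a term against the n-gram counts and the minimum DC and WC. */
    static FilterType classify(String term, Map<String, Counts> nGrams,
                               long minDc, long minWc) {
        Counts counts = nGrams.get(term);
        if (counts == null) {
            // Exception case: the term has no DC|WC in the n-gram set.
            return FilterType.FT_WC_DC_NOT_FOUND;
        }
        if (counts.dc < minDc || counts.wc < minWc) {
            // Trapped: DC or WC is below the minimum.
            return FilterType.FT_WC_DC_INV_LESS;
        }
        // Assumption: FT_TBD marks a term that this filter did not trap.
        return FilterType.FT_TBD;
    }

    public static void main(String[] args) {
        // Made-up n-grams and counts for illustration only.
        Map<String, Counts> nGrams = new HashMap<>();
        nGrams.put("frequent term", new Counts(12, 45));
        nGrams.put("rare term", new Counts(2, 7));

        long minDc = 1;   // minimum DC stated in the text
        long minWc = 30;  // minimum WC stated in the text

        System.out.println(classify("frequent term", nGrams, minDc, minWc)); // prints FT_TBD
        System.out.println(classify("rare term", nGrams, minDc, minWc));     // prints FT_WC_DC_INV_LESS
        System.out.println(classify("missing term", nGrams, minDc, minWc));  // prints FT_WC_DC_NOT_FOUND
    }
}
```

    The lookup is checked before the thresholds so that terms missing from the n-gram set are reported as exceptions rather than as terms with too few counts.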

  • Accuracy Test on Lexicon (DC >= 1; WC >= 30):
    • InFile:
      • ${OUT_DATA}/03.LeadEndTerm/lexWords.data
      • ${OUT_DATA}/02.NGram/nGrams/n-gram.${YEAR}
    • Result:

      Lexicon  Filter             Sample No  Pass No  Trap No  Exp No  Pass-Rate
      2023     FT_WC_DC_INV_LESS  1001867    1001867  0        621582  100.0000%
      2022     FT_WC_DC_INV_LESS  998845     998845   0        623591  100.0000%
      2021     FT_WC_DC_INV_LESS  992545     992545   0        626830  100.0000%
      2020     FT_WC_DC_INV_LESS  983420     983420   0        629088  100.0000%
      2019     FT_WC_DC_INV_LESS  972721     972721   0        630890  100.0000%
      2018     FT_WC_DC_INV_LESS  955564     955564   0        625175  100.0000%
      2017     FT_WC_DC_INV_LESS  935276     935276   0        618346  100.0000%
      2016     FT_WC_DC_INV_LESS  915583     915583   0        618966  100.0000%
      2015     FT_WC_DC_INV_LESS  896213     896213   0        612316  100.0000%
      2014     FT_WC_DC_INV_LESS  875090     875090   0        603592  100.0000%