The SPECIALIST Lexicon

Exclusive Filter: A Term Contains an Incomplete Pattern

  • Description:
    If a term contains an incomplete pattern, it is an invalid MWE. Incomplete patterns are terms in which the numbers of left and right parentheses or square brackets are not equal, or in which the parentheses or brackets are not closed properly. This filter also takes the core-term into consideration, because the input is an nGram. The patterns covered are shown in the table below:

    Description                     Incomplete Pattern
    uneven number of parentheses    • xx (yy
                                    • xx ((yy) zz
                                    • xx) yy
                                    • (xx) yy) zz
    uneven number of brackets       • xx [yy
                                    • xx [[yy] zz
                                    • xx] yy
                                    • [xx] yy] zz
    unclosed parentheses            • xx) yy (zz
    unclosed brackets               • xx] yy [zz
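
The two categories in the table above can be illustrated with a minimal Java sketch. This is not the code in FilterIncomplete.java; the class and method names (IncompletePatternSketch, hasUnevenCount, hasUnclosedOrder, classify) are hypothetical. It also assumes that parentheses and square brackets are checked independently of each other, since the table lists no mixed patterns.

```java
public class IncompletePatternSketch {
    // Category 1: the left and right counts of one delimiter pair differ.
    static boolean hasUnevenCount(String term, char open, char close) {
        long opens = term.chars().filter(c -> c == open).count();
        long closes = term.chars().filter(c -> c == close).count();
        return opens != closes;
    }

    // Category 2: counts match, but a closing delimiter appears before its opener.
    static boolean hasUnclosedOrder(String term, char open, char close) {
        if (hasUnevenCount(term, open, close)) {
            return false; // already covered by category 1
        }
        int depth = 0;
        for (char c : term.toCharArray()) {
            if (c == open) {
                depth++;
            } else if (c == close && --depth < 0) {
                return true;
            }
        }
        return false;
    }

    // Maps a term to the first matching row of the table, or "complete".
    static String classify(String term) {
        if (hasUnevenCount(term, '(', ')')) return "uneven parentheses";
        if (hasUnevenCount(term, '[', ']')) return "uneven brackets";
        if (hasUnclosedOrder(term, '(', ')')) return "unclosed parentheses";
        if (hasUnclosedOrder(term, '[', ']')) return "unclosed brackets";
        return "complete";
    }

    public static void main(String[] args) {
        String[] samples = {
            "xx (yy", "(xx) yy) zz", "xx [[yy] zz",
            "xx) yy (zz", "xx] yy [zz", "xx (yy) zz"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

Under these assumptions, "xx) yy (zz" is reported as unclosed rather than uneven, because its counts match and only the order is wrong.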

  • Examples:
    • II (Hunter syndrome
    • 0.05) higher
    • bond]C-C[triple
    • (chi(2)

    This filter takes the core-term into consideration. For example, "XXX)" is converted to "XXX", which is then checked against the incomplete patterns.

  • Input Term: original term
  • Filter Algorithm:
    • Logics:

      Description                           FilterType      Notes
      Assign default filter type            FT_INCOMPLETE
      Check if the term is complete         FT_TBD          • check that the numbers of left and right parentheses and square brackets are equal
      (not incomplete)                                      • check that parentheses and square brackets are closed correctly
      Check if the core-term is complete    FT_TBD          • check that the numbers of left and right parentheses and square brackets are equal
      (not incomplete)                                      • check that parentheses and square brackets are closed correctly

      For example, "test" is the core-term of "(test" and is complete, so "(test" passes this filter, because core-terms are used as LMW candidates at the end of these processes (this feature was updated in 2016).
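
The logic above could be sketched in Java roughly as follows. This is an illustration under stated assumptions, not the contents of FilterIncomplete.java: the enum, class, and method names are hypothetical, and coreTerm() assumes that a core-term is the term with leading and trailing non-alphanumeric characters stripped, which may differ from the actual core-term rules.

```java
public class FilterIncompleteSketch {
    // Hypothetical stand-in for the two filter types named in the table.
    enum FilterType { FT_INCOMPLETE, FT_TBD }

    // True when, for each delimiter pair, the counts match and no closing
    // delimiter appears before its opener (pairs are tracked independently).
    static boolean isComplete(String term) {
        int paren = 0;
        int bracket = 0;
        for (char c : term.toCharArray()) {
            switch (c) {
                case '(': paren++; break;
                case ')': paren--; break;
                case '[': bracket++; break;
                case ']': bracket--; break;
                default: break;
            }
            if (paren < 0 || bracket < 0) {
                return false; // closed before it was opened
            }
        }
        return paren == 0 && bracket == 0;
    }

    // Assumption: strip leading and trailing non-alphanumeric characters.
    static String coreTerm(String term) {
        int start = 0;
        int end = term.length();
        while (start < end && !Character.isLetterOrDigit(term.charAt(start))) {
            start++;
        }
        while (end > start && !Character.isLetterOrDigit(term.charAt(end - 1))) {
            end--;
        }
        return term.substring(start, end);
    }

    static FilterType filter(String term) {
        FilterType type = FilterType.FT_INCOMPLETE;   // assign the default type
        if (isComplete(term)) {
            type = FilterType.FT_TBD;                  // the term itself is complete
        } else if (isComplete(coreTerm(term))) {
            type = FilterType.FT_TBD;                  // the core-term is complete
        }
        return type;
    }

    public static void main(String[] args) {
        System.out.println(filter("(test"));                // FT_TBD
        System.out.println(filter("II (Hunter syndrome"));  // FT_INCOMPLETE
        System.out.println(filter("bond]C-C[triple"));      // FT_INCOMPLETE
    }
}
```

With this stripping rule, "(test" passes because its core-term "test" is complete, while "0.05) higher" stays trapped because the unmatched parenthesis sits inside the term and is not removed.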

    • source code: FilterIncomplete.java
    • FilterType: FilterType.FT_INCOMPLETE

  • Accuracy Test on Lexicon:
    • InFile:
      • ${OUT_DATA}/03.LeadEndTerm/lexWords.data
    • Result:

      Lexicon  Filter         Sample No  Pass No  Trap No  Exp No  Pass-Rate
      2023     FT_INCOMPLETE  1001867    1001867  0        0       100.0000%
      2022     FT_INCOMPLETE  998845     998845   0        0       100.0000%
      2021     FT_INCOMPLETE  992545     992545   0        0       100.0000%
      2020     FT_INCOMPLETE  983420     983420   0        0       100.0000%
      2019     FT_INCOMPLETE  972721     972721   0        0       100.0000%
      2018     FT_INCOMPLETE  955564     955564   0        0       100.0000%
      2017     FT_INCOMPLETE  935276     935276   0        0       100.0000%
      2016     FT_INCOMPLETE  915583     915583   0        0       100.0000%
      2015     FT_INCOMPLETE  896213     896212   1        0       99.9999%
      2014     FT_INCOMPLETE  875090     875089   1        0       99.9999%

      Please note that the only trapped term from the Lexicon, "antipoly ADP-ribose) polymerase", is an error in Lexicon.2014. It should be "antipoly (ADP-ribose) polymerase" and is expected to be fixed in a later version. In other words, the passing rate is 100%.
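
As a quick check of the figures above, the Pass-Rate column appears to be Pass No divided by Sample No, shown to four decimal places. The sketch below works under that assumption; the class and method names are hypothetical and not part of the Lexicon tools.

```java
import java.util.Locale;

public class PassRateSketch {
    // Assumption: Pass-Rate = Pass No / Sample No, formatted as a percentage
    // with four decimal places.
    static String passRate(long passNo, long sampleNo) {
        return String.format(Locale.ROOT, "%.4f%%", 100.0 * passNo / sampleNo);
    }

    public static void main(String[] args) {
        System.out.println(passRate(875089, 875090));   // 2014 row: 99.9999%
        System.out.println(passRate(1001867, 1001867)); // 2023 row: 100.0000%
    }
}
```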