The SPECIALIST Lexicon

Exclusive Filter

I. Introduction

Exclusive filters are also called invalid filters. They are designed to filter out invalid n-grams. The terms they filter out are close to 100% invalid. Because of this characteristic, applying these filters to an n-gram set should increase precision without reducing recall.

Exclusive filters are used to retrieve multiwords from n-grams by filtering out invalid multiwords. The following rule-based filters have been developed and tested. The objective of these filters is to exclude invalid terms from the n-grams. We focus on the accuracy of these filters. In other words, ideally, no (or only a very low percentage of) valid multiwords should be filtered out by these filters. On the other hand, invalid multiwords can pass these filters and be considered multiword candidates. By applying a series of exclusive filters, we hope to reach the desired precision on the MWE candidates.
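The idea of chaining exclusive filters can be sketched as follows. This is only a minimal illustration: the two rules shown (`ends_with_determiner`, `has_digit_only_token`) and the helper `apply_filters` are hypothetical names invented for this example and are not the filters used by the Lexicon tools.

```python
from typing import Callable, Iterable, List

# An exclusive filter returns True when it traps a term (judges it invalid).
ExclusiveFilter = Callable[[str], bool]


def ends_with_determiner(term: str) -> bool:
    """Hypothetical rule: trap n-grams whose last word is a determiner."""
    words = term.lower().split()
    return bool(words) and words[-1] in {"a", "an", "the"}


def has_digit_only_token(term: str) -> bool:
    """Hypothetical rule: trap n-grams that contain a purely numeric token."""
    return any(token.isdigit() for token in term.split())


def apply_filters(terms: Iterable[str],
                  filters: Iterable[ExclusiveFilter]) -> List[str]:
    """Keep only the terms that no filter traps (the multiword candidates)."""
    filter_list = list(filters)
    return [t for t in terms if not any(f(t) for f in filter_list)]


ngrams = ["ice cream", "because of the", "in 2014", "heart attack"]
candidates = apply_filters(ngrams, [ends_with_determiner, has_digit_only_token])
print(candidates)
```

Each filter only removes terms, so adding another filter to the series can raise precision on the candidate set, while recall is preserved only as long as every filter traps almost no valid multiwords.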

II. Accuracy Test on Exclusive Filters

Terms that are filtered out (trapped, removed) by exclusive filters are treated as invalid multiwords, so the filters must be highly accurate. We can test this by applying the exclusive filters to the Lexicon, in which all multiwords are valid. The passing rate should be close to 100%.

  • Filter passing rate = pass-through terms / total terms
  • Filter efficiency = filtered out terms / total terms
  • Accuracy
    = (TP + TN) / (TP + TN + FP + FN)
    = [(Retrieved, relevant) + (Not retrieved, not relevant)] / [(Retrieved, relevant) + (Not retrieved, not relevant) + (Retrieved, not relevant) + (Not retrieved, relevant)]
    = (Retrieved, relevant) / [(Retrieved, relevant) + (Not retrieved, relevant)]
    • When the Lexicon is applied to the filter, all inputs are valid MWEs (relevant), so:
    • TN (Not retrieved, not relevant) is 0
    • FP (Retrieved, not relevant) is 0
    • TP (Retrieved, relevant) is the Pass No.
    • FN (Not retrieved, relevant) is the Trap No.
    • Filter accuracy = TP / (TP + FN) = pass terms / total terms = passing rate
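The measures above can be computed with a short sketch like the one below. It assumes the test input contains only valid terms, so FP and TN are both 0. The function `evaluate_filter`, the rule `contains_digit`, and the four sample terms are hypothetical and chosen only for illustration; they are not results from the Lexicon test.

```python
from typing import Callable, Dict, Iterable


def evaluate_filter(is_trapped: Callable[[str], bool],
                    valid_terms: Iterable[str]) -> Dict[str, float]:
    """Run one exclusive filter over known-valid terms and compute its rates."""
    pass_no = 0  # valid terms that pass through the filter (TP)
    trap_no = 0  # valid terms wrongly trapped by the filter (FN)
    for term in valid_terms:
        if is_trapped(term):
            trap_no += 1
        else:
            pass_no += 1

    total = pass_no + trap_no
    if total == 0:
        raise ValueError("at least one term is required")

    # All inputs are valid (relevant), so FP and TN are both 0.
    tp, fn, fp, tn = pass_no, trap_no, 0, 0
    return {
        "passing_rate": pass_no / total,
        "efficiency": trap_no / total,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


def contains_digit(term: str) -> bool:
    """Hypothetical exclusive filter: trap any term that contains a digit."""
    return any(ch.isdigit() for ch in term)


sample_terms = ["ice cream", "heart attack", "vitamin B12", "as well as"]
metrics = evaluate_filter(contains_digit, sample_terms)
print(metrics)
```

In this toy run one valid term ("vitamin B12") is trapped, so the accuracy equals the passing rate, as the derivation above states.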

In 2014, there were 875,090 unique inflectional variants in the Lexicon. We tested these exclusive filters on the Lexicon (all valid words) to see how many valid multiwords are filtered out by these filters. The format of this report is:

Filter | Word No | Pass No | Trap No | Exp No | Pass-Rate

where:

  • Filter: filter type (name)
  • Sample No: total number of sample terms (the Word No column)
  • Pass No: number of valid terms that pass the filter
  • Trap No: number of terms trapped by the filter as invalid
  • Exp No: number of terms that match the filter pattern but are valid (exceptions)
  • Pass-Rate: Pass No / Sample No

III. Accuracy Test Details

  • Dir: ${MULTIWORDS}/bin
  • Program:
    shell> 04.TestFilters ${YEAR}
    1
    10
    
    or 
    
    2
    11-42