Lexical Tools

SD-Rules Optimization - Analysis

I. An SD-rule:
In general, a good rule must have:

  • high accuracy rate
    so most of the matched pairs are valid SD-pairs (not exceptions)
  • high coverage rate
    so there are enough instances of matched pairs in the working corpus
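The two per-rule measures can be illustrated with a minimal sketch. The page does not give exact formulas, so the definitions here are assumptions: accuracy rate is taken as valid pairs divided by all matched pairs, and coverage rate as the rule's matched pairs divided by all matched pairs in the working corpus. The `RuleStats` class, the `coverage_rate` function, and the counts are hypothetical and are not part of Lexical Tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleStats:
    """Counts for one SD-rule (hypothetical structure for illustration)."""
    name: str
    valid: int    # matched pairs that are valid SD-pairs
    invalid: int  # matched pairs that are exceptions

    @property
    def matched(self) -> int:
        return self.valid + self.invalid

    @property
    def accuracy(self) -> float:
        # Assumed definition: valid pairs / all matched pairs.
        return self.valid / self.matched if self.matched else 0.0


def coverage_rate(rule: RuleStats, corpus_matched_total: int) -> float:
    # Assumed definition: this rule's matched pairs / all matched pairs in the corpus.
    return rule.matched / corpus_matched_total if corpus_matched_total else 0.0


rule = RuleStats("example-rule", valid=90, invalid=10)
print(rule.accuracy)              # 0.9
print(coverage_rate(rule, 1000))  # 0.1
```

Under these assumed definitions, a rule with few exceptions but only a handful of matched pairs scores high on accuracy and low on coverage, which is why both properties are listed.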

II. A set of SD-rules:
A good set of rules is measured by overall system performance. In our model, system performance is defined as the accumulated accuracy rate plus the accumulated coverage rate. Once the optimization goal is specified, the proposed model yields the optimized set. In general, a good set of rules should:

  • contain only unique rules
    • duplicated rules should be removed
    • related rules, such as parent-child rules, sibling rules and uncle rules, should be decomposed and evaluated. Only the one with the better system performance should be kept, to avoid duplication.
  • have more rules
    • more unique rules give better coverage
    • more candidate rules make the system more comprehensive
  • include only rules with a high accuracy rate, so that the specified system performance is reached
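One way to read the criteria above is as a selection procedure: drop duplicates, rank the remaining rules by accuracy, and keep adding rules while the accumulated accuracy stays at or above a specified target. The sketch below is an assumption about how that could be done, not the model used by Lexical Tools. The function `select_rules`, the tuple layout, and the counts are hypothetical, and the decomposition of related rules (parent-child, sibling, uncle) is not handled.

```python
def select_rules(rules, target_accuracy):
    """Pick rule names whose accumulated accuracy meets the target.

    rules: iterable of (name, valid, invalid) tuples (hypothetical layout).
    """
    unique = {}
    for name, valid, invalid in rules:
        if valid + invalid > 0:
            # Keep the first occurrence only, so duplicated rules are removed.
            unique.setdefault(name, (valid, invalid))

    # Rank by per-rule accuracy, best first.
    ranked = sorted(
        unique.items(),
        key=lambda item: item[1][0] / (item[1][0] + item[1][1]),
        reverse=True,
    )

    selected, acc_valid, acc_total = [], 0, 0
    for name, (valid, invalid) in ranked:
        new_valid = acc_valid + valid
        new_total = acc_total + valid + invalid
        # Accumulated accuracy never rises once rules are ranked best first,
        # so the first failure ends the selection.
        if new_valid / new_total < target_accuracy:
            break
        selected.append(name)
        acc_valid, acc_total = new_valid, new_total
    return selected


candidates = [("A", 95, 5), ("B", 80, 20), ("A", 95, 5), ("C", 50, 50)]
print(select_rules(candidates, 0.85))  # ['A', 'B']
```

In this example, rule C is left out because adding it would pull the accumulated accuracy down to 0.75, below the 0.85 target.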

III. A rule & a set of rules:

We observed the following results in our practice on the 2013 release:

  • System Accuracy Rate
    If a rule has a better accuracy rate than the cutoff rule of an existing set, the system accuracy rate curve shifts toward the upper right when the rule is added to the system, and vice versa. Thus, for the same specified accuracy rate, the set covers more rules with a higher coverage rate once such a rule is added. Accordingly, such rules should be added to the set to reach better system performance. The diagram below shows the accuracy rate curves for orgRules, add nomD, add factD, and add suggested rule.

  • System Coverage Rate
    The system coverage rate curve shifts toward the lower right and ends at the same place (100%) when new rules are added to the system. Thus, the intersection of the system accuracy (SA) and system coverage (SC) curves shifts to the right. In other words, the system optimized point (the intersection) covers more rules. The diagram below shows the coverage rate curves for orgRules, add nomD, add factD, and add suggested rule.

  • System Optimization
    The system optimization point is at the intersection of the two curves. The diagram shows that the intersection point shifts toward the upper right from orgRules through add nomD and add factD, giving better system performance and including more rules in the set.
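The behaviour of the two curves can be reproduced with a small sketch. It assumes that rules are ranked by accuracy, that system accuracy is the accumulated valid pairs divided by the accumulated matched pairs, and that system coverage is the accumulated matched pairs divided by the matched pairs of all candidate rules. The intersection is approximated as the first rule count at which coverage reaches accuracy. The function names and counts are hypothetical and do not come from the 2013 release data.

```python
def system_curves(rules):
    """Return (n_rules, system_accuracy, system_coverage) for each cutoff.

    rules: list of (valid, invalid) counts, each with at least one matched pair.
    """
    ranked = sorted(rules, key=lambda r: r[0] / (r[0] + r[1]), reverse=True)
    total = sum(valid + invalid for valid, invalid in ranked)
    points, acc_valid, acc_total = [], 0, 0
    for n, (valid, invalid) in enumerate(ranked, start=1):
        acc_valid += valid
        acc_total += valid + invalid
        points.append((n, acc_valid / acc_total, acc_total / total))
    return points


def optimization_point(points):
    """Approximate the intersection: first cutoff where coverage >= accuracy."""
    for n, accuracy, coverage in points:
        if coverage >= accuracy:
            return n
    return None  # only reached for an empty rule list


org_rules = [(95, 5), (80, 20), (50, 50)]
with_new_rule = org_rules + [(90, 10)]  # new rule beats the cutoff rule's accuracy

print(optimization_point(system_curves(org_rules)))      # 3
print(optimization_point(system_curves(with_new_rule)))  # 4
```

With these made-up counts, adding a rule that is more accurate than the cutoff rule moves the crossing point from three rules to four, matching the rightward shift of the intersection described above.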