Lexical Tools

Conclusions

Here are the conclusions of this study:

I. An SD-rule:
In general, a good rule must have:

  • high accuracy rate (precision)
    so most of the matched pairs are valid SD-pairs (not exceptions)
  • high coverage rate (recall)
    so there are enough instances of matched pairs in the working corpus
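
The two measures above can be sketched as simple ratios. This is only an illustration: the function names are hypothetical, and the definition of coverage as the share of all known valid SD-pairs that a rule matches is an assumption, since the exact formula is not given here.

```python
def rule_precision(valid_matches: int, total_matches: int) -> float:
    """Share of the pairs matched by a rule that are valid SD-pairs."""
    if total_matches == 0:
        return 0.0
    return valid_matches / total_matches


def rule_coverage(valid_matches: int, total_valid_pairs: int) -> float:
    """Share of all known valid SD-pairs that the rule matches (assumed definition)."""
    if total_valid_pairs == 0:
        return 0.0
    return valid_matches / total_valid_pairs


# Hypothetical counts: the rule matches 100 pairs, 90 of them valid,
# out of 1000 valid SD-pairs in the working corpus.
precision = rule_precision(90, 100)
coverage = rule_coverage(90, 1000)
```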

II. A set of SD-rules:
A good set of rules is measured by overall system performance, which our model defines as the accumulated accuracy rate (precision) plus the accumulated coverage rate (recall). Once the optimization goal is specified, the optimized set can be obtained from our proposed model. In general, a good set of rules should:

  • only have unique rules
    • duplicated rules should be removed
    • related rules, such as parent-child rules, sibling rules and uncle rules, should be decomposed and evaluated. Only the one with better system performance should be used to avoid duplication.
      • In general, a parent rule has higher recall while a child rule has higher precision
  • have more rules
    • include more unique rules for better coverage
    • include more candidate rules for a more comprehensive system
  • include only rules with a high accuracy rate, so that the specified system performance is reached
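
One possible reading of these guidelines is sketched below; it is not the proposed model itself. It assumes that the rules match disjoint sets of pairs, that accumulated precision and recall are computed from summed counts, and that the goal is to maximize their sum. The `Rule` class, the rule names, and `best_rule_set` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str     # hypothetical label for an SD-rule
    valid: int    # matched pairs that are valid SD-pairs
    matched: int  # all pairs matched by the rule


def best_rule_set(rules, total_valid_pairs):
    """Pick the precision-ranked prefix with the highest precision + recall."""
    # Drop exact duplicates (keeping order) and rules that match nothing.
    unique = [r for r in dict.fromkeys(rules) if r.matched > 0]
    # Rank by individual precision, highest first.
    ranked = sorted(unique, key=lambda r: r.valid / r.matched, reverse=True)

    best_score, best_set = 0.0, []
    valid = matched = 0
    for i, rule in enumerate(ranked, start=1):
        valid += rule.valid
        matched += rule.matched
        # Assumed system performance: accumulated precision + accumulated recall.
        score = valid / matched + valid / total_valid_pairs
        if score > best_score:
            best_score, best_set = score, ranked[:i]
    return best_set, best_score


# Hypothetical counts, including one duplicated rule.
candidates = [
    Rule("rule_a", 90, 100),
    Rule("rule_a", 90, 100),
    Rule("rule_c", 40, 50),
    Rule("rule_d", 10, 100),
]
selected, score = best_rule_set(candidates, total_valid_pairs=200)
```

In this sketch, adding a low-precision rule raises accumulated recall but can lower the combined score, which is why the last candidate is left out.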

III. Parent and Child rules:

  • Root parent rules are used for the baseline
  • Child rules of known parent rules (rules with child rules in the set) are derived programmatically.
    • Only child rules with better precision than their parent and above the minimum coverage rate (25% for 2015) are evaluated.
    • Parent rules are not necessarily better (although most of the time they are).
    • Only parent-child rules that exist in the set are evaluated in this study.
    • In future studies, all rules should be evaluated by their root parent rules.
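
The screening of child rules described above can be sketched as a simple filter. The function name and the tuple layout are hypothetical, and treating the 25% threshold as exclusive is an assumption.

```python
MIN_COVERAGE_RATE = 0.25  # minimum coverage rate quoted above for 2015


def children_to_evaluate(parent_precision, children, min_coverage=MIN_COVERAGE_RATE):
    """Return the names of child rules that pass both screening conditions.

    children: iterable of (name, precision, coverage_rate) tuples.
    """
    return [
        name
        for name, precision, coverage in children
        if precision > parent_precision and coverage > min_coverage
    ]


# Hypothetical child rules of a parent rule with precision 0.90.
kept = children_to_evaluate(
    0.90,
    [
        ("child_1", 0.95, 0.40),  # better precision, enough coverage
        ("child_2", 0.95, 0.10),  # better precision, too little coverage
        ("child_3", 0.80, 0.60),  # enough coverage, worse precision
    ],
)
```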

IV. More NomD and OrgD rules:

  • It seems that all rules derived from nomD and orgD are good rules. More rules derived by this method (those with lower frequency) should be evaluated and added to the rule set.