Lexical Tools

Summary and Conclusion

Here are the conclusions of this study:

I. An SD-rule:
In general, a good rule must have:

  • high accuracy rate (precision)
    so most of the matched pairs are valid SD-pairs (not exceptions)
  • high coverage rate (recall)
    so there are enough instances of matched pairs in the working corpus
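
The two measures above can be sketched as follows. This is a minimal illustration, not the study's implementation: it assumes each pair matched by a rule has already been tagged as a valid SD-pair or an exception, and it reads coverage as the share of gold-standard SD-pairs that the rule retrieves. The function and parameter names are hypothetical.

```python
def rule_precision(valid_matches: int, total_matches: int) -> float:
    """Accuracy rate: share of matched pairs that are valid SD-pairs."""
    if total_matches == 0:
        return 0.0  # a rule that matches nothing has no measurable precision
    return valid_matches / total_matches


def rule_recall(valid_matches: int, gold_pairs: int) -> float:
    """Coverage rate: share of gold-standard SD-pairs found by the rule.

    Assumption: coverage is measured against a gold-standard pair count.
    """
    if gold_pairs == 0:
        return 0.0
    return valid_matches / gold_pairs


# Hypothetical counts: 100 matched pairs, 90 of them valid, 300 gold pairs.
precision = rule_precision(90, 100)
recall = rule_recall(90, 300)
```

A rule with many exceptions lowers the first number; a rule that rarely fires in the working corpus lowers the second.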

II. A set of SD-rules:
A good set of rules is measured by overall system performance (system precision + system recall). In our model, system performance is defined as the accumulated accuracy rate (precision) plus the accumulated coverage rate (recall). All candidate SD-pairs generated by the SD-candidates rules and the Lexicon are used as the gold standard. Once the optimization goal is specified, the optimized set can be obtained from our proposed model. In general, a good set of rules should:

  • only have unique rules
    • duplicated rules should be removed
    • related rules, such as parent-child rules, sibling rules and uncle rules, should be decomposed and evaluated. To avoid duplication, only the one with the better system performance should be used.
      • If a parent rule and its associated child rule have the same system performance, a linguist's knowledge is used to choose the better rule.
      • If a linguist's knowledge cannot be applied, the parent rule is chosen because it has better coverage.
      • In general, a parent rule has higher recall, while a child rule has higher precision.
  • have more rules
    • have more unique rules to have better coverage
    • have more candidate rules to have a more comprehensive system
  • include only rules with a high accuracy rate, so that the specified system performance is reached
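
The parent-versus-child choice described in this list can be sketched as below. This is an illustrative sketch under stated assumptions: system performance is taken as precision plus recall, as defined above, and the linguist's judgement is modeled as an optional rule name passed in by the caller. `RuleScore`, `choose_rule`, and all scores shown are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RuleScore:
    name: str
    system_precision: float
    system_recall: float

    @property
    def performance(self) -> float:
        # System performance as defined in the text: precision + recall.
        return self.system_precision + self.system_recall


def choose_rule(parent: RuleScore, child: RuleScore,
                linguist_pick: Optional[str] = None) -> RuleScore:
    """Keep exactly one of two related rules to avoid duplication."""
    if not math.isclose(parent.performance, child.performance):
        # Different performance: keep the better-performing rule.
        return parent if parent.performance > child.performance else child
    # Tie: defer to the linguist's choice when one is available.
    if linguist_pick == child.name:
        return child
    # No applicable linguist judgement (or the linguist picked the parent):
    # keep the parent rule, which has the better coverage.
    return parent


parent = RuleScore("parent", 0.90, 0.50)
child = RuleScore("child", 0.95, 0.45)   # same total performance as parent
kept_default = choose_rule(parent, child)
kept_by_linguist = choose_rule(parent, child, linguist_pick="child")
```

Comparing the totals with `math.isclose` rather than `==` avoids treating a floating-point rounding difference as a real difference in performance.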

III. Parent and Child rules:

  • For related rules, only root-parent rules are used as the baseline.
  • Child rules of known parent rules (rules with child rules in the set) are derived by computer programs.
    • A child rule must meet the following criteria for evaluation:
      • better precision (than its parent)
      • coverage rate = child-occurrence / parent-occurrence
      • coverage rate >= 40%: decompose further for the next generation of child rules
      • coverage rate >= 25%: candidate child rule
    • Parent rules are not necessarily better (although most of the time they are).
    • Only parent-child rules that exist in the set are evaluated in this study.
    • Suggestion: in a future study, all rules should be evaluated by their root-parent rules (extremely expensive).
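
The child-rule criteria above can be expressed as a small check. This is a sketch, not the study's program: it assumes the two thresholds are nested (a child at or above 40% is a candidate and is also decomposed further) and that a child must beat its parent's precision to qualify at all. The names `ChildVerdict` and `evaluate_child` and the sample counts are hypothetical.

```python
from typing import NamedTuple

CANDIDATE_MIN = 0.25   # coverage rate >= 25%: candidate child rule
DECOMPOSE_MIN = 0.40   # coverage rate >= 40%: decompose for the next generation


class ChildVerdict(NamedTuple):
    coverage: float
    is_candidate: bool
    decompose_further: bool


def evaluate_child(parent_occurrence: int, child_occurrence: int,
                   parent_precision: float,
                   child_precision: float) -> ChildVerdict:
    """Apply the precision and coverage criteria to one child rule."""
    if parent_occurrence <= 0:
        raise ValueError("parent_occurrence must be positive")
    # coverage rate = child-occurrence / parent-occurrence
    coverage = child_occurrence / parent_occurrence
    # The child must have better precision than its parent.
    better_precision = child_precision > parent_precision
    return ChildVerdict(
        coverage=coverage,
        is_candidate=better_precision and coverage >= CANDIDATE_MIN,
        decompose_further=better_precision and coverage >= DECOMPOSE_MIN,
    )


# Hypothetical counts: the parent matches 200 pairs, the child 90 of them.
verdict = evaluate_child(200, 90, parent_precision=0.80, child_precision=0.92)
```

Here the coverage rate is 90/200 = 45%, so the child qualifies as a candidate and is also decomposed further.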

IV. More NomD and OrgD rules:

  • Most rules derived from nomD and orgD appear to be good rules. If resources are available, more rules from this method (those with lower frequency) should be evaluated and added to the rule set in future releases.

    Release  nomD  orgD  ES   Notes
    2014     N/A   N/A   N/A  First try, all rules are new
    2015     6/6   2/2   3/5  Evaluate more rules from nomD and orgD
    2016     3/4   4/5   1/2  Evaluate more rules from nomD and orgD
    2017     4/4   2/4   0/2  Evaluate more rules from nomD and orgD
    2020     5/5   2/4   0/2  Evaluate more rules from nomD and orgD
    2021     8/11  1/4   3/3  Evaluate more rules from nomD and orgD