Lexical Tools

Summary and Conclusion

Here are the conclusions of this study:

I. An SD-rule:
In general, a good rule must have:

high accuracy rate (precision)
so most of the matched pairs are valid SD-pairs (not exceptions)
high coverage rate (recall)
so there are enough instances of matched pairs in the working corpus
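
The two measures above can be sketched in a few lines of Python. This is only an illustration. The names `RuleStats`, `precision`, and `coverage` are hypothetical, and the coverage formula (valid pairs matched by the rule divided by all valid pairs in the gold standard) is one plausible reading of "coverage rate", not a definition taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleStats:
    """Counts for one SD-rule over a working corpus (hypothetical structure)."""
    matched: int  # pairs in the corpus that match the rule pattern
    valid: int    # matched pairs that are valid SD-pairs (not exceptions)


def precision(stats: RuleStats) -> float:
    """Accuracy rate: share of matched pairs that are valid SD-pairs."""
    return stats.valid / stats.matched if stats.matched else 0.0


def coverage(stats: RuleStats, total_valid_pairs: int) -> float:
    """Coverage rate (assumed definition): share of all valid SD-pairs
    in the gold standard that this rule accounts for."""
    return stats.valid / total_valid_pairs if total_valid_pairs else 0.0


# Made-up numbers, for illustration only.
example = RuleStats(matched=200, valid=190)
print(precision(example))       # accuracy rate of the rule
print(coverage(example, 1000))  # coverage rate against 1000 valid pairs
```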

II. A set of SD-rules:
A good set of rules is measured by overall system performance, defined in our model as the accumulated accuracy rate (system precision) plus the accumulated coverage rate (system recall). All candidate SD-pairs generated by the SD-candidate rules and the Lexicon are used as the gold standard. Once the optimization goal is specified, the optimized set can be obtained from our proposed model. In general, a good set of rules should:

only have unique rules
- duplicate rules should be removed
- related rules, such as parent-child rules, sibling rules and uncle rules, should be decomposed and evaluated. Only the one with the better system performance should be kept, to avoid duplication.
  - If the system performance of a parent rule and its associated child rule is the same, the linguist's knowledge is used to choose the better rule.
  - If the linguist's knowledge can't be applied, the parent rule is chosen, because it has better coverage.
  - In general, a parent rule has higher recall, while a child rule has higher precision.
have more rules
- more unique rules give better coverage
- more candidate rules give a more comprehensive system
only include rules with a high accuracy rate, so that the specified system performance is reached
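
The choice between a parent rule and its child rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: `Candidate`, `choose_rule`, and the `linguist_choice` parameter are hypothetical names, the rule names are placeholders, and the tolerance used to decide that two performance values are "the same" is an assumption, not part of the study.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    """System-level scores obtained when this rule is used (hypothetical)."""
    name: str
    system_precision: float
    system_recall: float

    @property
    def performance(self) -> float:
        # System performance = system precision + system recall.
        return self.system_precision + self.system_recall


def choose_rule(parent: Candidate, child: Candidate,
                linguist_choice: Optional[Candidate] = None,
                tol: float = 1e-9) -> Candidate:
    """Keep only one of two related rules, following the tie-break order."""
    diff = parent.performance - child.performance
    if abs(diff) > tol:
        # One rule gives clearly better system performance: keep it.
        return parent if diff > 0 else child
    if linguist_choice is not None:
        # Same performance: defer to the linguist's knowledge.
        return linguist_choice
    # No linguistic preference: the parent wins for its better coverage.
    return parent


# Made-up scores, for illustration only.
parent = Candidate("parent-rule", system_precision=0.90, system_recall=0.60)
child = Candidate("child-rule", system_precision=0.95, system_recall=0.55)
print(choose_rule(parent, child).name)
```

Passing the linguist's decision in as an optional argument keeps the manual step explicit, so the automatic fallback to the parent rule applies only when no such decision exists.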

III. Parent and Child rules:

For related rules, only root-parent rules are used as the baseline.
Child rules of known parent rules (rules with child rules in the set) are derived by computer programs.
- Child-rule must meet the following criteria for evaluation:
  - better precision (than parent)
  - coverage rate = child-occurrence/parent-occurrence
  - coverage rate >= 40%: further decompose for next generation child
  - coverage rate >= 25%: candidate child-rule
- Parent rules are not necessarily better (although most of the time they are).
- Only parent-child rules that exist in the set are evaluated in this study.
- Suggestion: in a future study, all rules should be evaluated against their root parent rules (extremely expensive).
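
The child-rule criteria above can be expressed as a small classification function. This is a sketch only. The function name, parameters, and status labels are hypothetical, and it assumes that a child rule with a coverage rate of at least 40% is both kept as a candidate and decomposed further.

```python
def classify_child(child_occurrence: int, parent_occurrence: int,
                   child_precision: float, parent_precision: float) -> str:
    """Classify a derived child rule against its parent rule."""
    if parent_occurrence <= 0:
        raise ValueError("parent_occurrence must be positive")

    # Criterion 1: the child must have better precision than the parent.
    if child_precision <= parent_precision:
        return "rejected"

    # Criterion 2: coverage rate = child-occurrence / parent-occurrence.
    coverage_rate = child_occurrence / parent_occurrence
    if coverage_rate >= 0.40:
        return "candidate, decompose further"
    if coverage_rate >= 0.25:
        return "candidate"
    return "rejected"


# Made-up counts, for illustration only.
print(classify_child(50, 100, child_precision=0.97, parent_precision=0.90))
print(classify_child(30, 100, child_precision=0.97, parent_precision=0.90))
print(classify_child(10, 100, child_precision=0.97, parent_precision=0.90))
```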

IV. More NomD and OrgD rules:

It seems most rules derived from nomD and orgD are good rules. More rules from this method (with lower frequency) should be evaluated and added to the rule set in a future release if resources are available.

Release	nomD	orgD	ES	Notes
2014	N/A	N/A	N/A	First try, all rules are new
2015	6/6	2/2	3/5	Evaluate more rules from nomD and orgD
2016	3/4	4/5	1/2	Evaluate more rules from nomD and orgD
2017	4/4	2/4	0/2	Evaluate more rules from nomD and orgD
2020	5/5	2/4	0/2	Evaluate more rules from nomD and orgD
2021	8/11	1/4	3/3	Evaluate more rules from nomD and orgD