Results of optimized set - 2016
I. The optimized set
As a result, we concluded that case 14.3 is the final optimized set of SD-Rules for the Lexicon 2016 corpus. It includes 82 (out of 111) SD-rules and reaches:
- system accuracy rate: 95.00%
- system coverage rate: 95.26%
- system performance: 1.9026
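The three figures above are numerically consistent with system performance being the sum of the accuracy and coverage rates. That reading is an inference from the numbers, not a definition stated in this section, so the following is only a quick arithmetic check under that assumption:

```python
# Assumption (inferred from the reported figures, not stated in the text):
# system performance = accuracy rate + coverage rate.
accuracy = 0.9500   # reported system accuracy rate
coverage = 0.9526   # reported system coverage rate

performance = accuracy + coverage
print(f"{performance:.4f}")  # 1.9026
```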
This set of SD-rules is used in the Lexical Tools SD-Rule Trie because it is expected to reach the same system performance when applied to other English corpora, under the assumption that:
- the characteristics of derivations are consistent between the Lexicon and the working general English domain.
The Lexicon is considered a representative subset (in terms of derivations) of general English. Please refer to future work for this assumption.
II. The methodology
This approach finds the best set of SD-rules from a set of known candidate SD-rules.
Theoretically, a complete set of SD-Rules can be obtained when more SD-rules are evaluated and added. This methodology provides a systematic approach to:
- measure system performance
- evaluate new SD-rules
- obtain the set of SD-rules according to the user's specified target minimum accuracy rate (system performance)
- choose among parent-child SD-Rules to reach the maximum system precision and recall rates.
- In general, a parent rule has higher recall while a child rule has higher precision
- This method finds a local optimum, which provides a good way to choose between a parent rule and its child rule(s). The results show that the computational method is consistent with linguists' knowledge. For example, the new rule
esis$|noun|ic$|adj
is evaluated. Its 3rd-generation child rule genesis$|noun|genic$|adj
is selected by the method. This matches what the linguists suggest:
- Nouns ending in -esis do have related adjectives, but those almost invariably end in -etic, not -ic (e.g. osmesis|noun|E0575508|osmic|adj|E0044296|no)
- Of the handful of non-genesis nouns in this list, most ended in -emesis, which relates to -emetic, not -ic (e.g. pyemesis|noun|E0051349|pyemic|adj|E0051351|no)
- Only 1(!) pair got a yes-tag (cataphoresis|noun|E0430522|cataphoric|adj|E0015539|yes)
- The rule genesis|genic should be used, because Dorland's does provide justification for pairing -genesis nouns with -genic adjectives (p.763):
- genesis: a word termination used to denote the production, formation, or development of the object or state indicated by the word stem to which it is affixed, as biogenesis, gametogenesis and pathogenesis.
- genic: a word termination meaning producing, or productive of
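The parent/child comparison above can be made concrete with a small sketch. It assumes the rule notation means "input suffix$|input category|output suffix$|output category", with `$` marking the end of the word. The names `SDRule`, `parse_rule` and `apply_rule` are hypothetical helpers written for this illustration and are not part of Lexical Tools. The tagged pairs are the three quoted in the list above, with the third noun taken to be "cataphoresis". Three pairs are far too few to estimate a real precision, so the number computed here only illustrates the idea.

```python
from typing import NamedTuple, Optional


class SDRule(NamedTuple):
    in_suffix: str
    in_cat: str
    out_suffix: str
    out_cat: str


def parse_rule(text: str) -> SDRule:
    """Parse e.g. 'esis$|noun|ic$|adj', dropping the '$' end-of-word anchors."""
    in_suf, in_cat, out_suf, out_cat = text.split("|")
    return SDRule(in_suf.rstrip("$"), in_cat, out_suf.rstrip("$"), out_cat)


def apply_rule(rule: SDRule, word: str, cat: str) -> Optional[str]:
    """Return the candidate derivative, or None if the rule does not match."""
    if cat != rule.in_cat or not word.endswith(rule.in_suffix):
        return None
    return word[: len(word) - len(rule.in_suffix)] + rule.out_suffix


parent = parse_rule("esis$|noun|ic$|adj")
child = parse_rule("genesis$|noun|genic$|adj")

# (noun, adjective, linguist tag) as quoted in the list above.
tagged = [
    ("osmesis", "osmic", False),
    ("pyemesis", "pyemic", False),
    ("cataphoresis", "cataphoric", True),
]

# The parent rule generates all three pairs; the child rule matches none of them.
for noun, adj, valid in tagged:
    print(noun, apply_rule(parent, noun, "noun"), apply_rule(child, noun, "noun"), valid)

# Share of parent-generated pairs tagged "yes" (illustrative only).
hits = [valid for noun, adj, valid in tagged if apply_rule(parent, noun, "noun") == adj]
parent_precision = sum(hits) / len(hits)
print(round(parent_precision, 3))

# The child rule fires only on -genesis nouns such as Dorland's examples.
print(apply_rule(child, "biogenesis", "noun"))
```

The parent rule matches every -esis noun and so over-generates, while the child rule trades that recall for precision by restricting itself to -genesis nouns.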
III. The target precision and recall rate (95%)
The curves of system precision rate and system recall rate for the final set intersect (the optimization point) at 95%. For noise reduction, we also smoothed both curves with a simple moving average over window sizes of 3, 5, and 7 rules, and found that the intersections are around 95% in all cases (see diagram below). Smoothing the data set lets us capture its characteristics while leaving out the noise. Accordingly, our target minimum accuracy rate (95%) is a good choice for obtaining the optimized set of SD-rules (close to the optimum).
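The smoothing and intersection step can be sketched as follows. This is a minimal illustration, not the procedure used to produce the diagram. The function names are hypothetical, and the precision and recall values are synthetic: two straight lines built to cross near 0.95, with alternating noise added. They are not the Lexicon 2016 measurements, so the printed values say nothing about the real data.

```python
def moving_average(values, window):
    """Simple moving average over each full window of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def crossing(precision, recall):
    """Return the linearly interpolated rate at which the curves first cross, or None."""
    for i in range(1, len(precision)):
        d0 = precision[i - 1] - recall[i - 1]
        d1 = precision[i] - recall[i]
        if d0 == 0:
            return precision[i - 1]
        if d1 == 0:
            return precision[i]
        if d0 * d1 < 0:
            # Fraction of the step at which the difference reaches zero.
            t = d0 / (d0 - d1)
            return precision[i - 1] + t * (precision[i] - precision[i - 1])
    return None


# Synthetic curves (assumption, for illustration only): precision falls and
# recall rises as rules are added, each with small alternating noise.
n = 21
precision = [0.99 - 0.004 * i + 0.003 * (-1) ** i for i in range(n)]
recall = [0.91 + 0.004 * i - 0.003 * (-1) ** i for i in range(n)]

# Window 1 is the raw data; 3, 5 and 7 are the window sizes named in the text.
for window in (1, 3, 5, 7):
    p = moving_average(precision, window)
    r = moving_average(recall, window)
    print(window, crossing(p, r))
```

Both curves are smoothed with the same window so that they stay aligned point for point before the crossing is located.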
Procedure to generate diagram for optimal set