Results of optimized set - 2021
I. The optimized set
As a result, we concluded that case 35.3 is the final optimized set of SD-rules for the Lexicon 2021 corpus. It includes 104 (out of 148) SD-rules to reach:
This set of SD-rules is used in the Lexical Tools SD-Rule Trie because it is expected to reach the same system performance when applied to other English corpora, under the assumption that:
II. The methodology
This approach finds the best set of SD-rules from a set of known candidate SD-rules.
Theoretically, a complete set of SD-Rules can be obtained when more SD-rules are evaluated and added. This methodology provides a systematic approach to:
The rule esis$|noun|ic$|adj is evaluated. Its 3rd-generation child rule, genesis$|noun|genic$|adj, is selected through this method. This agrees with the linguists' suggestion:
III. The target precision and recall rate (95%)
The system precision and system recall curves of the final set intersect (the optimization point) at 95%. For noise reduction, we also smoothed both curves with a simple moving average over window sizes of 3, 5, and 7 rules, and found that the intersections are all around 95% in every case (see diagrams below). Smoothing the data set lets us capture its characteristics and leave out the noise. Accordingly, our target minimum accuracy rate (95%) is a good choice for obtaining an optimized set of SD-rules (close to the optimum).
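The smoothing and intersection steps described above can be sketched as follows. This is only a minimal illustration, not the project's actual script: the function names moving_average and first_crossing are hypothetical, and the precision and recall values are synthetic placeholders, not Lexicon 2021 results. The sketch assumes that each curve holds one value per step as rules are added, and that the crossing point is located by linear interpolation between neighboring points.

```python
def moving_average(values, window):
    """Simple moving average; returns len(values) - window + 1 points."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def first_crossing(precision, recall):
    """Return (index, value) where the two curves first meet or cross.

    Linear interpolation is used between neighboring points.
    Returns None if the curves never cross.
    """
    diffs = [p - r for p, r in zip(precision, recall)]
    for i, d in enumerate(diffs):
        if d == 0:
            return float(i), precision[i]
        if i + 1 < len(diffs) and d * diffs[i + 1] < 0:
            t = d / (d - diffs[i + 1])  # fraction of the way to point i + 1
            value = precision[i] + t * (precision[i + 1] - precision[i])
            return i + t, value
    return None


# Synthetic placeholder curves (NOT real data): precision falls and
# recall rises as more rules are added to the set.
precision = [(100 - i) / 100 for i in range(11)]
recall = [(90 + i) / 100 for i in range(11)]

for window in (1, 3, 5, 7):  # window 1 is the unsmoothed curve
    smoothed_p = moving_average(precision, window)
    smoothed_r = moving_average(recall, window)
    # Note: indices of a smoothed series are relative to that series,
    # which is shorter than the raw curve by window - 1 points.
    print(window, first_crossing(smoothed_p, smoothed_r))
```

With real data the curves are noisy, so the raw curves may cross several times; comparing the crossing value across the window sizes is one way to check that the intersection is stable.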
Please refer to the document on generating diagrams for the optimal set to reproduce the following diagrams.
System Precision vs. Recall Rate
System Precision vs. Recall Rate, 3-point Avg.
System Precision vs. Recall Rate, 5-point Avg.
System Precision vs. Recall Rate, 7-point Avg.