Lexical Tools

Results of optimized set - 2021

I. The optimized set
As the result, we concluded case 35.3 is the final optimized set of SD-Rules in the corpus of Lexicon 2021 to include 104 (out of 148) SD-rules to reach:

  • system precision rate: 95.12%
  • system recall rate: 93.45%
  • system performance F1: 1.8857

This set of SD-rules is used in Lexical Tools SD-Rule Trie because it is expected to reach the same system performance when it is applied to other English corpora under the assumption that:

  • the characteristics of derivations are consistent between from Lexicon and the working general English domain.
    Lexicon is considered as a representable subset (in terms of derivations) for general English. Please refer to future work for this assumption.

II. The methodology
This approach is to find the best set of SD-rules from a set of known candidate SD-rules. Theoretically, a complete set of SD-Rules can be obtained when more SD-rules are evaluated and added. This methodology provides a systematic approach to:

  • measure system performance
  • to evaluate new SD-rules
  • obtain the set of SD-rules according to user's specified target minimum accuracy rate (system performance)
  • choose among parent-child SD-Rules to reach Max. system precision and recall rate.
    • In general, a parent rule has higher recall while a child rule has higher precision
    • This method is used to get the local optimization to provide a good way to choose between a parent rule and child rule(s). The results shows the computation method is concoin with linguist's knowledge.
    • Example:
      new rules esis$|noun|ic$|adj is evaluated. It's 3rd generation child rule genesis$|noun|genic$|adj is selected through method. This is same as linguist's suggests:
      • nouns ending in -esis do have related adjectives, but those almost invariably end in -etic, not -ic (e.g.osmesis|noun|E0575508|osmic|adj|E0044296|no)
      • Of the handful of non-genesis nouns in this list, most ended in -emesis, which relates to -emetic, not -ic (e.g. pyemesis|noun|E0051349|pyemic|adj|E0051351|no)
      • only 1(!) pair that got a yes-tag (ataphoresis|noun|E0430522|cataphoric|adj|E0015539|yes)

      • Should use genesis|genic, because Dorland's does provide justification for pairing -genesis Ns with -genic adj's (p.763):
        • genesis: a word termination used to denote the production, formation, or development of the object or state indicated by the word stem to which it is affixes, as biogenesis, gametogenesis and pathogenesis.
        • genic: a word termination meaning producing, or productive of

III. The target precision and recall rate (95%)

The intersection of curves (optimization) of system precision rate and system recall rate of the final set are at 95%. We also used average values for the window size of 3, 5, 7 rules for these two curves for noise reduction (smoothing algorithm - simple moving average) and find the intersections are all around 95% for all cases (see diagram below). Smoothing this data set allows us to capture the characteristics of this set and leave out noise. Accordingly, our target minimum accuracy rate (95%) is a good choice to obtain the optimized set of SD-rules (close to optimization).
Please refer to the document of generating diagram for optimal set to generate the following diagrams.

System Precision vs. Recall Rate
System precision vs. recall rate

System Precision vs. Recall Rate, 3-point Avg.
System precision vs. recall rate, 3-point avg.

System Precision vs. Recall Rate, 5-point Avg.
System precision vs. recall rate, 5-point avg.

System Precision vs. Recall Rate, 7-point Avg.
System precision vs. recall rate, 7-point avg.