Lexical Tools

SD-Rules Optimization Goal

To reach the best system performance (accuracy rate and coverage rate), a systematic approach needs to be developed to:

  • Evaluate the existing SD-Rule set
  • Filter out bad rules and keep good rules
  • Evaluate new SD-Rules for better system performance

First, the collected SD-Rule set needs to be refined and optimized for the best accuracy and coverage rates in the following cases:

  • Duplicated Rules:
    An optimum set should not contain duplicate rules; every rule should be unique. Rules should first be normalized (sorted alphabetically) so that duplicates can be detected and removed. For example, able$|adj|ability$|noun is normalized to ability$|noun|able$|adj.
  • Accuracy and coverage rates:
    SD-Rules with a low accuracy rate and coverage rate need to be removed. For example, a$|noun|an$|noun has only 1 valid SD-Pair out of 273 raw SD-Pairs from the Lexicon. The accuracy rate is 0.37% (1/273), with only 1 valid instance (coverage). Such an SD-Rule is expected to generate more invalid SD-Pairs than valid ones if it is applied to general English (outside the Lexicon), and thus should be removed from the optimized set.
  • Parent and child SD-Rules:
    SD-Rules with a parent-child relationship should be refined. For example, a$|noun|an$|noun is the parent-rule of ia$|noun|ian$|noun. In other words, a parent-rule covers all SD-Pairs generated by its child-rules. There are two ways of refinement:
    • Include the parent-rule (exclude all associated child-rules):
      If the parent-rule is included in the SD-Rule set, there is no need to include any child-rule, because all SD-Pairs derived from a child-rule are also derived from the parent-rule. Accordingly, all child-rules are considered duplicates and should be removed.
    • Exclude the parent-rule (keep some child-rules):
      On the other hand, if the parent-rule is excluded, all child-rules associated with this parent-rule should be evaluated. For example, sis$|noun|tic$|adj has two child-rules, osis$|noun|otic$|adj and esis$|noun|etic$|adj, from the original SD-Rule set. Both child-rules need to be evaluated by comparing their system performance to that of the parent-rule. Child-rules should be included only if they are good rules (that is, they have good accuracy and coverage rates; see the next section for details on the evaluation procedure).
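
The three cases above can be sketched in a few lines of Python. This is only an illustration, not the Lexical Tools implementation: the function names (normalize, deduplicate, accuracy_rate, is_parent) are hypothetical, and two points are assumptions read off the examples rather than stated rules. First, normalization is taken to mean sorting the two "suffix|category" halves of a rule alphabetically. Second, a child-rule is taken to be a parent-rule whose two suffixes are both extended by the same leading characters, with unchanged categories.

```python
def normalize(rule: str) -> str:
    """Put the two 'suffix|category' halves of a rule in alphabetical order."""
    suffix1, cat1, suffix2, cat2 = rule.split("|")
    halves = sorted([f"{suffix1}|{cat1}", f"{suffix2}|{cat2}"])
    return "|".join(halves)


def deduplicate(rules):
    """Normalize every rule and keep one copy of each (sorted for stable output)."""
    return sorted({normalize(rule) for rule in rules})


def accuracy_rate(valid_pairs: int, raw_pairs: int) -> float:
    """Accuracy rate = valid SD-Pairs / raw SD-Pairs."""
    return valid_pairs / raw_pairs


def is_parent(parent: str, child: str) -> bool:
    """True if 'child' extends both suffixes of 'parent' by the same characters.

    Both rules are expected to be normalized already, so their halves line up.
    """
    p_suf1, p_cat1, p_suf2, p_cat2 = parent.split("|")
    c_suf1, c_cat1, c_suf2, c_cat2 = child.split("|")
    if (p_cat1, p_cat2) != (c_cat1, c_cat2):
        return False
    if not (c_suf1.endswith(p_suf1) and c_suf2.endswith(p_suf2)):
        return False
    extra1 = c_suf1[: len(c_suf1) - len(p_suf1)]
    extra2 = c_suf2[: len(c_suf2) - len(p_suf2)]
    # A rule is not its own child, and both halves must grow by the same text.
    return extra1 != "" and extra1 == extra2


print(normalize("able$|adj|ability$|noun"))
print(f"{accuracy_rate(1, 273):.2%}")
print(is_parent("sis$|noun|tic$|adj", "osis$|noun|otic$|adj"))
```

With the examples from the text, the first call returns ability$|noun|able$|adj, the accuracy of a$|noun|an$|noun comes out at 0.37%, and osis$|noun|otic$|adj is recognized as a child of sis$|noun|tic$|adj.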

Optimization Goal

The goal is to find a good set of rules from the known SD-Rules (by removing bad rules) that gives the best system performance and meets the following criteria:

  • Min. system accuracy rate >= 95%
  • Max. system coverage rate
  • Includes more SD-Rules
  • Better system performance (= system accuracy rate + system coverage rate)

System performance

All SD-Rules are first arranged in the following order:

  • Descending accuracy rate (= number of valid SD-Pairs / number of raw SD-Pairs) of the associated SD-Rule
  • Descending raw SD-Pair instance count (occurrence)
  • Alphabetical order of the SD-Rule

With the rules in this order, the system accuracy and coverage rates can be defined as:

  • System accuracy rate = accumulated valid SD-pairs count/accumulated raw SD-pairs count
    That is, the accuracy rate accumulated over the selected SD-Rules, starting from those with the highest accuracy and coverage rates
  • System coverage rate = accumulated valid SD-pairs count/total valid SD-pairs count of the whole set
    That is, the share of all valid SD-Pairs (from all rules) that the selected rules cover
    Please note that the parent-rule's valid SD-Pair count should be used as the total valid SD-Pair count when evaluating a child-rule.
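
The ordering and the two accumulated rates can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual evaluation procedure: the names RuleStats, cumulative_rates and select_rules are hypothetical, and all counts in SAMPLE are invented for the example except the 1 valid out of 273 raw SD-Pairs for a$|noun|an$|noun, which comes from the text. The total_valid parameter is there so that a parent-rule's valid SD-Pair count can be passed in when a child-rule is evaluated.

```python
from typing import NamedTuple


class RuleStats(NamedTuple):
    rule: str
    valid: int  # number of valid SD-Pairs
    raw: int    # number of raw SD-Pairs


def sort_key(stats: RuleStats):
    # Descending accuracy, then descending raw count, then alphabetical rule.
    return (-stats.valid / stats.raw, -stats.raw, stats.rule)


def cumulative_rates(rules, total_valid=None):
    """Return (rule, system accuracy, system coverage) for each sorted prefix."""
    ordered = sorted(rules, key=sort_key)
    if total_valid is None:
        total_valid = sum(r.valid for r in ordered)
    rows = []
    acc_valid = acc_raw = 0
    for r in ordered:
        acc_valid += r.valid
        acc_raw += r.raw
        rows.append((r.rule, acc_valid / acc_raw, acc_valid / total_valid))
    return rows


def select_rules(rules, min_accuracy=0.95):
    """Keep the longest sorted prefix whose system accuracy meets the minimum."""
    selected = []
    for rule, accuracy, _coverage in cumulative_rates(rules):
        if accuracy < min_accuracy:
            break
        selected.append(rule)
    return selected


# Invented counts, except a$|noun|an$|noun (1 valid / 273 raw, from the text).
SAMPLE = [
    RuleStats("a$|noun|an$|noun", 1, 273),
    RuleStats("ability$|noun|able$|adj", 190, 200),
    RuleStats("esis$|noun|etic$|adj", 8, 10),
    RuleStats("osis$|noun|otic$|adj", 48, 50),
]

for rule, accuracy, coverage in cumulative_rates(SAMPLE):
    # System performance = system accuracy rate + system coverage rate.
    print(f"{rule:28} {accuracy:7.2%} {coverage:7.2%} {accuracy + coverage:.3f}")
print(select_rules(SAMPLE))
```

Because the rules are sorted by descending accuracy, the system accuracy can only stay level or fall as more rules are added, while the coverage rises. Stopping at the last prefix that still meets the 95% minimum therefore keeps as many rules, and as much coverage, as the accuracy criterion allows. In this sample the first two rules are kept.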