Lexical Tools

Example - Add SD-Rules Derived from nomD

As discussed in the nominalization derivations section, valid nomD pairs from the Lexicon can be generated automatically by a computer program. Most of them are valid suffixD pairs: in the 2013 release, 14,368 suffixD pairs were generated from 14,668 valid nomD pairs. A set of programs was developed to derive possible SD-rules from these valid SD-pairs and add them to the SD-rule set (from the previous section) to increase coverage:

  • A program identifies possible SD-rules by stripping the common leading characters from the two words of each valid SD-pair from nomD.
    For example, the nomD SD-pair location|noun|locate|verb yields the SD-rule ion|noun|e|verb once the shared prefix "locat" is stripped.
    shell> GetSdRule 2013
    2
    nomD
    ...
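
The prefix-stripping step described above can be sketched in Python as follows. This is a minimal illustration, not the GetSdRule program itself: the function name derive_sd_rule and the second sample pair are assumptions made for the example, and the input simply mirrors the word|category|word|category layout of the SD-pair shown above.

```python
from collections import Counter
from os.path import commonprefix


def derive_sd_rule(sd_pair: str) -> str:
    """Strip the shared leading characters from both words of an SD-pair.

    Hypothetical helper: the input layout word1|cat1|word2|cat2 is taken
    from the example in the text, not from the actual tool.
    """
    word1, cat1, word2, cat2 = sd_pair.split("|")
    prefix = commonprefix([word1, word2])  # character-wise common prefix
    return "|".join([word1[len(prefix):], cat1, word2[len(prefix):], cat2])


# The first pair is the example from the text; the second is illustrative only.
pairs = ["location|noun|locate|verb", "creation|noun|create|verb"]

# Counting how often each derived rule occurs gives its occurrence (coverage).
rule_counts = Counter(derive_sd_rule(pair) for pair in pairs)
print(rule_counts.most_common())
```

Sorting the derived rules by occurrence corresponds to the "high coverage" criterion used when selecting candidate rules.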
  • Select SD-rules to add to the SD-rule set:
    • All possible SD-rules are listed in sdRulesFromSdPairs.rpt.
    • Select rules with high coverage (occurrence), because a good rule must have a good coverage rate.
    • Check whether the rule is a root parent-rule. Ideally, we work on root parent-rules so that the decomposition is comprehensive. The decomposed child-rules can then be evaluated with the same procedure as in the baseline optimization.
    • Check how the SD-rule relates to the current set:
      • duplicated: not selected, because a set cannot contain two identical rules
      • child-rule: not selected, because its parent-rule is already in the set
      • parent-rule: not selected, to simplify the analysis. However, these rules should be evaluated in future releases, when more resources are available, to obtain a better set.
    • As a result, four SD-rules are selected for evaluation in 2013 as additions to the SD-rule set, as shown in the following table. More SD-rules should be evaluated in the future: the more SD-rules are evaluated, the more complete the SD-rule set and the larger its coverage.
      shell> GetSdRule 2013
      5
      2013
      $|adj|ness$|noun
      ...

      Possible SD-rule from nomD  Occurrence  Root  Related                  Notes
      $|adj|ness$|noun            2489        Yes   Duplicated               Done - not selected
      e$|verb|ion$|noun           1740        Yes   parent-rule of           To be evaluated next
                                                    ate$|verb|ation$|noun
                                                    se$|verb|sion$|noun
      $|adj|ity$|noun             1635        Yes   Duplicated               Done - not selected
      ility$|noun|le$|adj         1295        Yes   parent-rule of           To be evaluated next
                                                    ability$|noun|able$|adj
      ation$|noun|e$|verb         1164        Yes   Duplicated               Done - not selected
      e$|adj|ity$|noun            604         Yes   Duplicated               Done - not selected
      ce$|noun|t$|adj             522         Yes   parent-rule of           To be evaluated next
                                                    ance$|noun|ant$|adj
                                                    ence$|noun|ent$|adj
                                                    iance$|noun|iant$|adj
      iness$|noun|y$|adj          501         Yes   None                     Selected
      $|verb|ment$|noun           467         Yes   Duplicated               Done - not selected
      $|verb|ion$|noun            381         Yes   None                     Selected
      cy$|noun|t$|adj             292         Yes   parent-rule of           To be evaluated next
                                                    ency$|noun|ent$|adj
                                                    iency$|noun|ient$|adj
      ication$|noun|y$|verb       232         Yes   Duplicated               Done - not selected
      $|verb|ation$|noun          214         Yes   Duplicated               Done - not selected
      ed$|adj|ion$|noun           200         Yes   None                     Selected
      $|verb|ing$|noun            194         Yes   None                     Selected
      e$|adj|ion$|noun            103         Yes   None                     Not selected due to low frequency (coverage)
      ...                         ...         ...   ...                      Not selected due to low frequency (coverage)
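
The "Related" check in the selection step can be sketched as below. This is an assumption-laden illustration rather than the tool's actual logic: a child-rule is taken to be a parent-rule with the same extra leading string on both suffixes and the same categories, which is inferred from the parent/child examples in the table (for instance ate$|verb|ation$|noun under e$|verb|ion$|noun). The names parse_rule, is_child_of and classify, and the small current_set, are hypothetical.

```python
def parse_rule(rule: str):
    """Split an SD-rule into (suffix1, cat1, suffix2, cat2), dropping the '$' end marker."""
    suffix1, cat1, suffix2, cat2 = rule.split("|")
    return suffix1.rstrip("$"), cat1, suffix2.rstrip("$"), cat2


def is_child_of(child: str, parent: str) -> bool:
    """Assumed definition: child = parent with the same string prepended to both suffixes."""
    c_s1, c_c1, c_s2, c_c2 = parse_rule(child)
    p_s1, p_c1, p_s2, p_c2 = parse_rule(parent)
    if (c_c1, c_c2) != (p_c1, p_c2):
        return False
    if not (c_s1.endswith(p_s1) and c_s2.endswith(p_s2)):
        return False
    extra1 = c_s1[: len(c_s1) - len(p_s1)]
    extra2 = c_s2[: len(c_s2) - len(p_s2)]
    return extra1 != "" and extra1 == extra2


def classify(candidate: str, rule_set: set) -> str:
    """Relate a candidate rule to the current set: duplicated, child-rule, parent-rule or none."""
    if candidate in rule_set:
        return "duplicated"
    if any(is_child_of(candidate, rule) for rule in rule_set):
        return "child-rule"
    if any(is_child_of(rule, candidate) for rule in rule_set):
        return "parent-rule"
    return "none"


# Hypothetical excerpt of a current SD-rule set, used only for illustration.
current_set = {"$|adj|ness$|noun", "ate$|verb|ation$|noun", "se$|verb|sion$|noun"}

for candidate in ["$|adj|ness$|noun", "e$|verb|ion$|noun", "iness$|noun|y$|adj"]:
    print(candidate, "->", classify(candidate, current_set))
```

Only candidates classified as "none" would go on to the evaluation step under the selection policy described above.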

  • Apply the same procedure used to optimize the baseline set to obtain the optimized set, using the optimized set of 2.3 as the new baseline. This task involves:
    • Retrieve all raw SD-pairs for the four selected SD-rules above from the Lexicon (2013)
    • Tag the raw SD-pairs
    • Get the stats of the SD-pairs for these four SD-rules
    • Add them to the SD-rule set and find the optimized set
    • The total number of valid SD-pairs (TotalYes) must be calculated as the total number of valid SD-pairs from all parent-rules.
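
The TotalYes bookkeeping can be illustrated with the candidate-rule records reported in this section's iterative results. The sketch below assumes that the fields after the leading number in each record are accuracy rate, occurrence, Yes, No and Tbd, followed by the rule itself; that reading is inferred from the column headers and is not stated by the source. The function names are hypothetical.

```python
def parse_candidate(record: str) -> dict:
    """Parse a record such as '13|99.81%|536|535|1|0|iness$|noun|y$|adj|2013|NOM_D|SELF'.

    The field meanings are an assumption inferred from the table headers.
    """
    fields = record.split("|")
    return {
        "occurrence": int(fields[2]),
        "yes": int(fields[3]),
        "no": int(fields[4]),
        "tbd": int(fields[5]),
        "rule": "|".join(fields[6:10]),
    }


def accuracy_rate(yes: int, occurrence: int) -> float:
    """Accuracy rate in percent: valid (Yes) SD-pairs over all pairs matched by the rule."""
    return round(100 * yes / occurrence, 2)


# (added rule, child-rule removed because the added rule is its parent, or None)
steps = [
    ("13|99.81%|536|535|1|0|iness$|noun|y$|adj|2013|NOM_D|SELF", None),
    ("32|97.70%|651|636|15|0|ed$|adj|ion$|noun|2013|NOM_D|SELF", None),
    ("46|93.31%|553|516|37|0|$|verb|ion$|noun|2013|NOM_D|PARENT",
     "35|95.88%|97|93|4|0|ss$|verb|ssion$|noun|2013|ORG_RULE|SELF"),
    ("50|91.57%|510|467|43|0|$|verb|ing$|noun|2013|NOM_D|SELF", None),
]

total_yes = 37136  # TotalYes of the previous optimized set (2.3)
totals = []
for added, removed in steps:
    total_yes += parse_candidate(added)["yes"]
    if removed is not None:
        # Subtract the child-rule's pairs, which the tables treat as already
        # covered by the newly added parent-rule.
        total_yes -= parse_candidate(removed)["yes"]
    totals.append(total_yes)

print(totals)
```

Subtracting the removed child-rule's Yes count is what keeps TotalYes equal to the sum over parent-rules only.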

    The iterative results are shown as follows:

    ID       Total Yes  Total Rule No.  Rule No.  A. Rate  Occr.  Yes  No  Tbd  SD-Rule          Status  Source    Notes  Sys A. Rate  Sys C. Rate  Sys. Perf  Notes
    2.3      37,136     87              65        60.66%   183    111  72  0    ar$|adj|e$|noun  2013    ORG_RULE  SELF   95.01%       94.30%       1.8931     Baseline
    2.3.1    37,671     88              66        60.66%   183    111  72  0    ar$|adj|e$|noun  2013    ORG_RULE  SELF   95.08%       94.38%       1.8946     Better
    2.3.2    38,307     89              67        60.66%   183    111  72  0    ar$|adj|e$|noun  2013    ORG_RULE  SELF   95.13%       94.47%       1.8960     Better
    2.3.3.0  38,730     89              67        60.66%   183    111  72  0    ar$|adj|e$|noun  2013    ORG_RULE  SELF   95.10%       94.53%       1.8963     Better
    2.3.3.1  38,730     89              67        60.66%   183    111  72  0    ar$|adj|e$|noun  2013    ORG_RULE  SELF   95.14%       94.27%       1.8941     Worse
    2.3.4    39,197     90              68        60.66%   183    111  72  0    ar$|adj|e$|noun  2013    ORG_RULE  SELF   95.05%       94.60%       1.8965     Better

    New Candidate Rule and Total Yes derivation for each ID:

    2.3      (prev. optimized set)
    2.3.1    13|99.81%|536|535|1|0|iness$|noun|y$|adj|2013|NOM_D|SELF
             Total Yes: 37,671 = 37,136 + 535
    2.3.2    32|97.70%|651|636|15|0|ed$|adj|ion$|noun|2013|NOM_D|SELF
             Total Yes: 38,307 = 37,671 + 636
    2.3.3.0  46|93.31%|553|516|37|0|$|verb|ion$|noun|2013|NOM_D|PARENT
             remove child-rule: 35|95.88%|97|93|4|0|ss$|verb|ssion$|noun|2013|ORG_RULE|SELF
             Total Yes: 38,730 = 38,307 + 516 - 93
    2.3.3.1  1|429|414|15|t$|verb|tion$|noun|96.50%|77.58%
             Decomposed from parent-rule: 46|93.31%|553|516|37|0|$|verb|ion$|noun|2013|NOM_D|PARENT
             Total Yes: 38,730
    2.3.4    50|91.57%|510|467|43|0|$|verb|ing$|noun|2013|NOM_D|SELF
             Total Yes: 39,197 = 38,730 + 467
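
The Better/Worse notes in the iterative results can be reproduced with a short check. Two things here are observations and assumptions rather than statements from the source: in every reported row the system performance equals the sum of the system accuracy and coverage rates (expressed as fractions), and each case is assumed to be compared against the best performance accepted so far. The function names are hypothetical.

```python
from decimal import Decimal

# (case ID, Sys A. Rate %, Sys C. Rate %, Sys. Perf) as reported in this section.
rows = [
    ("2.3", "95.01", "94.30", "1.8931"),
    ("2.3.1", "95.08", "94.38", "1.8946"),
    ("2.3.2", "95.13", "94.47", "1.8960"),
    ("2.3.3.0", "95.10", "94.53", "1.8963"),
    ("2.3.3.1", "95.14", "94.27", "1.8941"),
    ("2.3.4", "95.05", "94.60", "1.8965"),
]


def system_performance(accuracy_pct: str, coverage_pct: str) -> Decimal:
    """Observed relation in the reported figures: performance = accuracy + coverage."""
    return (Decimal(accuracy_pct) + Decimal(coverage_pct)) / 100


def judge(cases):
    """Label each case after the baseline as Better or Worse than the best accepted so far."""
    best = system_performance(cases[0][1], cases[0][2])
    verdicts = {}
    for case_id, accuracy, coverage, _ in cases[1:]:
        performance = system_performance(accuracy, coverage)
        if performance > best:
            verdicts[case_id] = "Better"
            best = performance  # accept the change and move the reference point
        else:
            verdicts[case_id] = "Worse"  # reject; keep the previous set
    return verdicts


print(judge(rows))
```

Under these assumptions, only the decomposition case 2.3.3.1 falls below the running best, which matches the notes reported for each case.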

The table above shows the iterative results of adding the new rules derived from nomD one at a time. Note that the SD-rule ss$|verb|ssion$|noun is removed in case 2.3.3 because it is a child-rule of the newly added SD-rule $|verb|ion$|noun. The results show that all four selected SD-rules (those with the highest frequency from nomD) improve the system performance. All four are therefore added to the SD-rule set, which reaches a better coverage rate (94.60%) and system performance (1.8965) with an accuracy rate of 95.05%; the optimized set includes 68 (out of 90) SD-rules. The diagram below shows the system accuracy and coverage curves of this optimized set.