Lexical Tools

Local Optimization - Evaluate PARENT rules and their CHILD rules

The SD-Rule set includes the latest SD rules that are used to generate SD pairs in the Lexicon (data - dm.data). This set includes some PARENT and CHILD SD-rules that need to be evaluated so that the one(s) with the best performance (F1) can be chosen for the optimized SD-Rule set (dm.rul) used in the Lexical Tools. This page describes the evaluation and optimization procedures as follows:

Identify all rules for evaluation
Decompose SD rules - CHILD rules
Optimization on SD rules
Results - Optimized SD Set
Generate SD rule Trie
References

I. Identify all rules for evaluation - PARENT, NEW, and previous better Rules

DIR: ${SUFFIXD_DIR}
Inputs:
- Prepare directory:
  shell> cd ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/
  shell> mkdir decompose.40.25 (40: min. local occurrence rate, 25: min. local cover-recall rate)
  shell> ln -sf ./decompose.40.25 decompose
- Get all tagged SD-pairs as the evaluation source data:
  ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/decompose/sdPairs.data
  shell> sort -u ../../../data/suffixD.yesNo.data > ./suffixD.yesNo.data.uSort
  shell> flds 1,2,4,5,7 ./suffixD.yesNo.data.uSort > suffixD.yesNo.data.uSort.1.2.4.5.7
  shell> ln -sf ./suffixD.yesNo.data.uSort.1.2.4.5.7 sdPairs.data
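
The unique-sort and field-selection steps above can be pictured with the following minimal Python sketch. It assumes the data file is pipe-delimited and that `flds 1,2,4,5,7` keeps those 1-based fields. The helper name `select_fields` and the seven-field sample lines are hypothetical and only illustrate the idea.

```python
def select_fields(lines, fields=(1, 2, 4, 5, 7), sep="|"):
    """Unique-sort the lines, then keep only the given 1-based fields."""
    # Mirrors `sort -u`: drop duplicates and sort (by code point here).
    unique = sorted(set(line.rstrip("\n") for line in lines if line.strip()))
    selected = []
    for line in unique:
        parts = line.split(sep)
        # Mirrors `flds 1,2,4,5,7`: assumed to pick 1-based columns.
        selected.append(sep.join(parts[i - 1] for i in fields))
    return selected


# Hypothetical sample lines; the real field layout may differ.
sample = [
    "happy|adj|ID0001|happily|adv|ID0002|yes",
    "basic|adj|ID0003|basically|adv|ID0004|yes",
    "happy|adj|ID0001|happily|adv|ID0002|yes",
]
pairs = select_fields(sample)
for pair in pairs:
    print(pair)
```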
- Identify PARENT rules for decomposition and evaluation:
  ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/decompose/sdRule.data
  - Go through the PARENT rules to decompose and evaluate (CHILD rules):
    - Start from all lines in ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/${YEAR}/sdRules.data.${YEAR}.relation.children.only.rpt .
    - Copy this file to sdRules.data.${YEAR}.relation.children.only.rpt.${YEAR}.working
    - Remove a line if its rule is both a child and a parent (|CHILD => appears on the left side of the relationship =>), because these rules duplicate their parents' rules
    - Manually change the format to suffix-1|pos-1|suffix-2|pos-2 and remove the rest of the line
    - Add esis$|noun|ic$|adj if this rule is not there. (In previous evaluations this rule had a better F1 with its CHILD rule.)
    - As a result, a total of 19 candidate PARENT SD-Rules are to be evaluated. These rules were evaluated previously (and may not need to be re-evaluated).
    - Add new rules for evaluation
  - Run the program (as described below) on all parent SD-Rules one by one, commenting out (#) the rest
  - The results might differ slightly between years, but the principle is the same.
- Evaluating new SD-Rules and their child rules is optional, if time allows
- Add rules from the previous evaluation that have the same or a better result.
- Add all new rules to ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/decompose/sdRule.data
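
Commenting out every rule except the one under evaluation is done by hand in this procedure. As an illustration only, the sketch below shows the same edit in Python, assuming sdRule.data holds one suffix-1|pos-1|suffix-2|pos-2 rule per line and that a leading # marks a commented-out rule. The function name `isolate_rule` is hypothetical.

```python
def isolate_rule(lines, target):
    """Return the rule lines with every rule except `target` commented out."""
    result = []
    for line in lines:
        body = line.lstrip("#").strip()
        if not body:
            # Keep blank lines and bare "#" lines unchanged.
            result.append(line)
        elif body == target:
            # Activate the rule to be evaluated.
            result.append(body)
        else:
            # Comment out everything else.
            result.append("#" + body)
    return result


rules = ["$|adj|ally$|adv", "#esis$|noun|ic$|adj", "c$|adj|cally$|adv"]
for line in isolate_rule(rules, "esis$|noun|ic$|adj"):
    print(line)
```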

II. Decompose CHILD rules on identified rules

We could use the PARENT rules from the previous year (./dataR/SdRuleCheck/sdRule.data) as the baseline (plus any newly added PARENT rules). This file can be obtained from the steps above (data.${YEAR}.relation.children.only.rpt.${YEAR}.working).
Manually comment out all rules in sdRule.data except the one to be evaluated, and evaluate one rule at a time.
Run GetSdRule ${YEAR} option 7 to retrieve all good candidate CHILD rules.
- Program to decompose PARENT rule and retrieve CHILD rules:
  shell> cd ${SUFFIXD_DIR}/bin
  shell> GetSdRule ${YEAR}
  7
  40 (min. occurrence rate - for decompose)
  => Need to have enough coverage for further decomposition on child rules
  25 (35) (min. coverage rate - for candidate child)
  => Need to have enough coverage to be a qualified child rule
- Outputs:
  - sdRules.decompose.out
    A child rule must have a higher accuracy rate (precision) than the root parent rule and must meet the min. coverage rate (recall, default is 25%). Manually look through the output file sdRules.decompose.out and search for "<= Candidate". These candidates are child rules that match the following criteria:
    - the accuracy rate (precision) is higher than parent-rule
    - the coverage rate (recall) is higher than 25% (or the specified number)
  - shell> mv sdRules.decompose.out sdRules.decompose.out.${NO}.${RULE}
  - such as shell> mv sdRules.decompose.out sdRules.decompose.out.01.X-ally
    => use X to represent no suffix ($)
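
The candidate criteria and the file-naming convention above can be sketched in Python as follows. This is an illustration under stated assumptions, not the GetSdRule implementation. It assumes precision is Yes/Occurrence and that the child's coverage is its Yes count relative to the parent's Yes count. It also treats the 25% threshold as inclusive. The function names are hypothetical.

```python
def is_candidate(child_yes, child_total, parent_yes, parent_total, min_cover=25.0):
    """Check the two candidate criteria for a decomposed child rule."""
    child_precision = 100.0 * child_yes / child_total
    parent_precision = 100.0 * parent_yes / parent_total
    # Assumption: coverage = child's valid pairs / parent's valid pairs.
    coverage = 100.0 * child_yes / parent_yes
    return child_precision > parent_precision and coverage >= min_cover


def rule_label(rule, number):
    """Build a name such as 01.X-ally from a suffix-1|pos-1|suffix-2|pos-2 rule."""
    suffix_in, _pos_in, suffix_out, _pos_out = rule.split("|")
    # Drop the trailing "$"; an empty suffix is written as X.
    left = suffix_in.rstrip("$") or "X"
    right = suffix_out.rstrip("$") or "X"
    return f"{number:02d}.{left}-{right}"


# Counts taken from the X-ally example on this page.
print(is_candidate(1965, 1966, 2065, 2086))
print(rule_label("$|adj|ally$|adv", 1))
```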
Go through all (19) PARENT rules and new rules in this step.
Add 16 new rules for evaluation.
Add 2 rules from the previous evaluation that have the same or a better result.
Continue to the next step to evaluate parent-child rules (by replacing the parent rule with a child rule), and repeat the whole process for all parent rules.
Update the optimal log while going through this process.

III. Optimize SD rule set by evaluating and selecting the best PARENT and CHILD rules
Go through all decomposed CHILD rules from the steps above (./dataR/SdRulesCheck/decompose/).

DIR: ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesOptimum/
- Create a new directory
  shell>mkdir ${NO}.${RULE}
  shell>mkdir 01.X-ally
Inputs:
- Update sdRules.stats.in by replacing the first parent rule with its candidate child rules
  shell> cd 01.X-ally
  shell> cp -p ../00.baseline/sdRules.stats.in sdRules.stats.in.testing
  shell> ln -sf ./sdRules.stats.in.testing sdRules.stats.in
  Manually update this file:
  - Copy all computer-marked candidate CHILD rules (<= Candidate) from ../../SdRulesCheck/decompose/sdRules.decompose.out.01.X-ally to this file, under the associated PARENT rule
  - Update the following fields in sdRules.stats.in:
    - Locate the rank (25) of the associated PARENT rule in the baseline, along with its precision (98.99%)
    - 1st field - rank: change to 251 (original rank, 25, followed by child level, 1)
    - Update the 2nd-10th fields from the decomposed file (sdRules.decompose.out.01.X-ally)
    - Change the 11th-13th fields to ${YEAR}|DECOMPOSE|CHILD
    - Run the evaluation program on the CHILD rules whose F1 is within 5% of the PARENT rule's, or that had the same or a better result in the past evaluation.
    - Comment out (#) those PARENT/CHILD rules that are not in the current evaluation
    - The newly edited rule in the file looks like:
```
#======================================================================
# Rank|Precision|Occurrence|Yes|No|Tbd|SD-Rule|YEAR|SOURCE|RELATIONSHIP
#======================================================================
#25|98.99%|2086|2065|21|0|$|adj|ally$|adv|2015|ORG_FACT|PARENT
251|99.95%|1966|1965|1|0|c$|adj|cally$|adv|2024|DECOMPOSE|CHILD
#252|99.95%|1961|1960|1|0|ic$|adj|ically$|adv|2024|DECOMPOSE|CHILD
```
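
As a cross-check of the fields in the example above, the following Python sketch parses one line, recomputes the precision as Yes/Occurrence, and builds the child rank by appending the child level to the parent rank. The field names and function names are hypothetical labels for the 13 columns; only the column order comes from the example.

```python
FIELDS = [
    "rank", "precision", "occurrence", "yes", "no", "tbd",
    "suffix_in", "pos_in", "suffix_out", "pos_out",
    "year", "source", "relationship",
]


def parse_stats_line(line):
    """Split one sdRules.stats.in line (commented out or not) into named fields."""
    parts = line.lstrip("#").strip().split("|")
    record = dict(zip(FIELDS, parts))
    for key in ("occurrence", "yes", "no", "tbd"):
        record[key] = int(record[key])
    return record


def precision_text(record):
    """Recompute the precision column as Yes / Occurrence."""
    return f"{100.0 * record['yes'] / record['occurrence']:.2f}%"


def child_rank(parent_rank, child_level):
    """Rank of a child rule: parent rank followed by the child level."""
    return f"{parent_rank}{child_level}"


parent = parse_stats_line("#25|98.99%|2086|2065|21|0|$|adj|ally$|adv|2015|ORG_FACT|PARENT")
child = parse_stats_line("251|99.95%|1966|1965|1|0|c$|adj|cally$|adv|2024|DECOMPOSE|CHILD")
print(precision_text(parent), precision_text(child), child_rank(parent["rank"], 1))
```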
Program - Get the optimal Set:

shell> cd ${SUFFIXD_DIR}/bin
shell> GetSdRule ${YEAR}
1
others
01.X-ally
59911 <= total Yes from baseline (changes with each annual evaluation)

Outputs directory:

${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesOptimum/01.X-ally

-- Optimum SD-Rules: 102|73.13%|67|49|18|0|$|verb|per$|noun|2024|WORDNET|SELF|95.29%|87.08%|1.8237|52168|54747

Move and copy files
- shell> cp -p sdRules.stats.in.testing sdRules.stats.in.01.1
- shell> mv sdRules.stats.out.html sdRules.stats.out.01.1.html
- shell> cp -p sdRules.stats.out.01.1.html ${WEB_LVG}/docs/designDoc/UDF/derivations/SD-Rules-Opti/Ex-${YEAR}/.
Update the optimal-log file
- Mark notes as Best|Better|Same|Worse by comparing the F1, precision (accuracy), and recall (coverage) rates among the root PARENT and CHILD rules
- Choose the PARENT rule over the CHILD rule if F1, precision, and recall are the same.
- Results of evaluating PARENT-CHILD rules marked Worse have been consistent in the past. So,
  - there is no need to evaluate decomposed CHILD rules whose F1 is more than 5% lower and whose past result was worse.
  - only evaluate those SD-Rules whose F1 was the same or better in the past evaluation and whose F1 is within a 5% difference.
- Repeat this process for all generations of candidate child rules of the same parent rule.
- Repeat this process for all parent rules (using the best sdRules.stats.in)
- Go to the result of the optimization log for optimization details.
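
The selection rules above can be read in more than one way, so the following Python sketch shows only one possible reading. It assumes "within 5%" means a gap of at most 5 percentage points in F1, that a child is worth evaluating if it is within that margin or had a Same or better note in the past, and that the child replaces the parent only when its F1 is strictly higher. The function names and note handling are hypothetical.

```python
def worth_evaluating(child_f1, parent_f1, past_note=None, margin=5.0):
    """Decide whether a decomposed CHILD rule should be run through evaluation."""
    within_margin = (parent_f1 - child_f1) <= margin
    past_ok = past_note in ("Best", "Better", "Same")
    return within_margin or past_ok


def pick_rule(parent_scores, child_scores):
    """Pick PARENT or CHILD from (F1, precision, recall) tuples.

    A full tie goes to the PARENT rule, as stated in the list above.
    Assumption: otherwise the CHILD wins only on strictly higher F1.
    """
    if child_scores == parent_scores:
        return "PARENT"
    return "CHILD" if child_scores[0] > parent_scores[0] else "PARENT"


# Illustrative numbers only, not taken from an actual evaluation.
print(worth_evaluating(88.0, 91.0))
print(worth_evaluating(80.0, 91.0, past_note="Worse"))
print(pick_rule((91.0, 95.3, 87.1), (91.0, 95.3, 87.1)))
```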

IV. Optimize SD rule set Results

Please refer to the result of optimization log for details of each step for these parent-child rules optimization processes.

The final optimized set of SD-Rules includes 162 unique parent/self/child SD-Rules. They are sorted in descending order of precision (= relevant retrieved No./retrieved No.) and then by retrieved No. rate. The top 105 SD-Rules are used as the optimized SD-Rule set; they cover at least 95.00% system (accumulated) precision (95.31% in the line below) and 87.57% system (accumulated) recall, with a system performance of 1.8288. The total number of valid instances is 59911.

- Total line no: 229
-- Total comment no: 67
-- Total Sd-Rule no: 162
---------------------------------------
-- Optimum SD-Rules: 105|73.17%|41|30|11|0|e$|noun|ery$|noun|2013|ORG_RULE|SELF|95.31%|87.57%|1.8288|52464|55048
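
The system-level figures in the Optimum SD-Rules line can be reproduced from its last two fields. The Python sketch below assumes those fields are the accumulated relevant retrieved count and the accumulated retrieved count, and that the performance value is the sum of system precision and system recall. Both assumptions are consistent with the two sample lines on this page. The function name is hypothetical.

```python
def system_metrics(optimum_line, total_valid):
    """Recompute system precision, recall, and performance from one line."""
    parts = optimum_line.split("|")
    relevant_retrieved = int(parts[-2])  # accumulated Yes (assumed meaning)
    retrieved = int(parts[-1])           # accumulated occurrences (assumed meaning)
    precision = relevant_retrieved / retrieved
    recall = relevant_retrieved / total_valid
    return precision, recall, precision + recall


line = ("105|73.17%|41|30|11|0|e$|noun|ery$|noun|2013|ORG_RULE|SELF"
        "|95.31%|87.57%|1.8288|52464|55048")
precision, recall, performance = system_metrics(line, 59911)
print(f"{precision:.2%} {recall:.2%} {performance:.4f}")
# prints: 95.31% 87.57% 1.8288
```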

V. POST-Process: generate SD Rule Trie

Generate SD-Rule trie from this 105/162 optimized set for Lexical tools SD-Rule generation.

cd ./dataR
cp ./37.ity-y/sdRules.stats.out ./37.ity-y/sdRules.stats.out.opti
ln -sf ./SdRulesOptimum/37.ity-y/sdRules.stats.out.opti sdRules.stats.out
shell> cd ./bin
shell> GetSdRule ${YEAR}
8
105 (the good rules)
output: ./dataR/dm.rul.2024.105 (the Trie file for Lvg)
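
This page does not describe the layout of dm.rul or how Lvg walks the trie. Purely as an illustration of why a trie suits suffix rules, the Python sketch below indexes rules by their input suffix read from right to left, so that all rules matching the end of a word are found in one pass. The class and function names are hypothetical and do not reflect the actual Lvg data structure.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # next character (moving leftwards) -> TrieNode
        self.rules = []     # (pos_in, suffix_out, pos_out) ending at this node


def build_suffix_trie(rules):
    """Index suffix-1|pos-1|suffix-2|pos-2 rules by reversed input suffix."""
    root = TrieNode()
    for rule in rules:
        suffix_in, pos_in, suffix_out, pos_out = rule.split("|")
        node = root
        for ch in reversed(suffix_in.rstrip("$")):
            node = node.children.setdefault(ch, TrieNode())
        node.rules.append((pos_in, suffix_out.rstrip("$"), pos_out))
    return root


def derive(root, word, pos):
    """Apply every rule whose input suffix and part of speech match the word."""
    results = []
    node, depth = root, 0
    while True:
        for pos_in, suffix_out, pos_out in node.rules:
            if pos_in == pos:
                stem = word[:len(word) - depth]
                results.append((stem + suffix_out, pos_out))
        if depth == len(word):
            break
        ch = word[len(word) - 1 - depth]
        if ch not in node.children:
            break
        node = node.children[ch]
        depth += 1
    return results


trie = build_suffix_trie(["$|adj|ally$|adv", "c$|adj|cally$|adv", "e$|noun|ery$|noun"])
print(derive(trie, "basic", "adj"))
print(derive(trie, "slave", "noun"))
```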

VI. References

Generating SD-Rules in the SPECIALIST Lexical Tools - Optimization for Suffix Derivation Rule Set