Local Optimization - Evaluate PARENT rules and their CHILD rules
The SD-Rule set includes the latest SD rules that are used to generate SD pairs in the Lexicon (data - dm.data). This set includes some PARENT and CHILD SD-rules that need to be evaluated; the one(s) with the best performance (F1) are chosen as the optimized SD-Rule set (dm.rul) used in the Lexical Tools. This page describes the evaluation and optimization procedures as follows:
I. Identify all rules for evaluation - PARENT, NEW, and previous better Rules
shell> cd ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/
shell> mkdir decompose.40.25
(40: min. local occurrence rate, 25: min. local cover-recall rate)
shell> ln -sf ./decompose.40.25 decompose
shell> sort -u ../../../data/suffixD.yesNo.data > ./suffixD.yesNo.data.uSort
shell> flds 1,2,4,5,7 ./suffixD.yesNo.data.uSort > suffixD.yesNo.data.uSort.1.2.4.5.7
shell> ln -sf ./suffixD.yesNo.data.uSort.1.2.4.5.7 sdPairs.data
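As a rough illustration of what the sort -u and flds steps above produce, the following sketch deduplicates pipe-delimited lines and keeps fields 1, 2, 4, 5, and 7. The "|" delimiter, the sample records, and the helper name unique_fields are assumptions made for illustration; the actual flds tool and the layout of suffixD.yesNo.data may differ.

```python
def unique_fields(lines, fields=(1, 2, 4, 5, 7), sep="|"):
    """Deduplicate and sort lines, then keep the given 1-based fields."""
    unique_lines = sorted({line.rstrip("\n") for line in lines if line.strip()})
    result = []
    for line in unique_lines:
        parts = line.split(sep)
        # Field numbers are 1-based, as in the flds call above
        result.append(sep.join(parts[i - 1] for i in fields))
    return result


# Hypothetical records, for illustration only
sample = [
    "a|b|c|d|e|f|g",
    "a|b|c|d|e|f|g",
    "h|i|j|k|l|m|n",
]
print(unique_fields(sample))
```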
suffix-1|pos-1|suffix-2|pos-2
: remove the rest of the line
esis$|noun|ic$|adj
in if this rule is not there. (This rule was evaluated with better F1 with CHILD from previous experience)
II. Decompose CHILD rules on identified rules
7
to retrieve all good candidate CHILD rules.
shell> cd ${SUFFIXD_DIR}/bin
shell> GetSdRule ${YEAR}
7
40 (min. occurrence rate - for decompose)
25 (35) (min. coverage rate - for candidate child)
Manually look through the output file sdRule.decompose.out and search for "<= Candidate". These candidates are child rules that meet both of the following criteria: a higher accuracy rate (precision) than the root parent rule, and at least the min. coverage rate (recall; default is 25%).
shell> mv sdRules.decompose.out sdRules.decompose.out.${NO}.${RULE}
shell> mv sdRules.decompose.out sdRules.decompose.out.01.X-ally
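The candidate test described above can be sketched as follows. This is a minimal illustration, not the GetSdRule implementation: the Rule type, the is_candidate helper, and the definition of coverage as the child's Yes count divided by the parent's Yes count are all assumptions. The counts are taken from the X-ally rules listed in section III.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    yes: int  # number of valid SD pairs generated by the rule
    no: int   # number of invalid SD pairs generated by the rule

    @property
    def precision(self) -> float:
        return self.yes / (self.yes + self.no)


def is_candidate(child: Rule, parent: Rule, min_coverage: float = 0.25) -> bool:
    """True if the child beats the parent's precision and covers enough of it."""
    # Assumption: coverage is the share of the parent's valid pairs
    # that the child rule still generates.
    coverage = child.yes / parent.yes
    return child.precision > parent.precision and coverage >= min_coverage


parent = Rule("$|adj|ally$|adv", yes=2065, no=21)
child = Rule("c$|adj|cally$|adv", yes=1965, no=1)
print(is_candidate(child, parent))
```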
III. Optimize SD rule set by evaluating and selecting the best PARENT and CHILD rules
Go through all decomposed CHILD rules from the above steps (./dataR/SdRulesCheck/decompose/).
shell> mkdir ${NO}.${RULE}
shell> mkdir 01.X-ally
shell> cd 01.X-ally
shell> cp -p ../00.baseline/sdRules.stats.in sdRules.stats.in.testing
shell> ln -sf ./sdRules.stats.in.testing sdRules.stats.in
Copy the lines marked "<= Candidate" from ../../SdRulesCheck/decompose/sdRule.decompose.out.1.X-ally to this file under the associated PARENT rule
#======================================================================
# Rank|Precision|Occurrence|Yes|No|Tbd|SD-Rule|YEAR|SOURCE|RELATIONSHIP
#======================================================================
#25|98.99%|2086|2065|21|0|$|adj|ally$|adv|2015|ORG_FACT|PARENT
251|99.95%|1966|1965|1|0|c$|adj|cally$|adv|2024|DECOMPOSE|CHILD
#252|99.95%|1961|1960|1|0|ic$|adj|ically$|adv|2024|DECOMPOSE|CHILD
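To make the record layout above concrete, here is a small parsing sketch. It assumes that lines starting with "#" are comments (including rules that are switched off for a test run) and that the SD-Rule column spans four pipe-separated fields (suffix-1, pos-1, suffix-2, pos-2). The function name parse_stats_line and the dictionary keys are hypothetical.

```python
def parse_stats_line(line):
    """Parse one rule line; return None for blank lines and '#' comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    f = line.split("|")
    return {
        "rank": int(f[0]),
        "precision": float(f[1].rstrip("%")),
        "occurrence": int(f[2]),
        "yes": int(f[3]),
        "no": int(f[4]),
        "tbd": int(f[5]),
        "sd_rule": "|".join(f[6:10]),  # suffix-1|pos-1|suffix-2|pos-2
        "year": int(f[10]),
        "source": f[11],
        "relationship": f[12],
    }


lines = [
    "#25|98.99%|2086|2065|21|0|$|adj|ally$|adv|2015|ORG_FACT|PARENT",
    "251|99.95%|1966|1965|1|0|c$|adj|cally$|adv|2024|DECOMPOSE|CHILD",
]
active = [r for r in map(parse_stats_line, lines) if r is not None]
print(active[0]["sd_rule"], active[0]["relationship"])
```

In the rows shown, the Precision column equals Yes divided by Occurrence (for example, 1965/1966 rounds to 99.95%), which gives a quick sanity check on edited lines.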
shell> cd ${SUFFIXD_DIR}/bin
shell> GetSdRule ${YEAR}
1
others
01.X-ally
59911 <= total Yes from baseline (changes with each annual evaluation)
-- Optimum SD-Rules: 102|73.13%|67|49|18|0|$|verb|per$|noun|2024|WORDNET|SELF|95.29%|87.08%|1.8237|52168|54747
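The trailing fields of the summary line above can be cross-checked with a few lines of arithmetic. The sketch below assumes that the last five fields are accumulated precision, accumulated recall, system performance, accumulated Yes, and accumulated retrieved count, and that performance is precision plus recall; this reading is inferred from the numbers on this page rather than from the GetSdRule source. The helper name check_summary is hypothetical.

```python
def check_summary(line, total_yes):
    """Recompute the trailing metrics of an 'Optimum SD-Rules' line."""
    fields = line.split("|")
    # Assumed layout of the last five fields:
    # precision%|recall%|performance|accumulated Yes|accumulated retrieved
    acc_yes, acc_retrieved = int(fields[-2]), int(fields[-1])
    precision = acc_yes / acc_retrieved
    recall = acc_yes / total_yes
    return {
        "reported": (fields[-5], fields[-4], fields[-3]),
        "precision": precision,
        "recall": recall,
        "performance": precision + recall,
    }


line = ("102|73.13%|67|49|18|0|$|verb|per$|noun|2024|WORDNET|SELF"
        "|95.29%|87.08%|1.8237|52168|54747")
result = check_summary(line, total_yes=59911)
print(result["reported"])
print(f"{result['precision']:.2%} {result['recall']:.2%} {result['performance']:.4f}")
```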
shell> cp -p sdRules.stats.in.testing sdRules.stats.in.01.1
shell> mv sdRules.stats.out.html sdRules.stats.out.01.1.html
shell> cp -p sdRules.stats.out.01.1.html ${WEB_LVG}/docs/designDoc/UDF/derivations/SD-Rules-Opti/Ex-${YEAR}/.
IV. Optimize SD rule set Results
Please refer to the optimization log for the details of each step in these parent-child rule optimization processes.
The final optimized set of SD-Rules includes 162 unique parent/self/child SD-Rules. They are sorted in descending order of precision (= relevant, retrieved No./retrieved No.) and then by retrieved No. rate. The top 105 SD-Rules are used as the optimized SD-Rule set to cover 95.00% system (accumulated) precision; the summary line in the log reports 95.31% system (accumulated) precision and 87.57% system (accumulated) recall, with a system performance of 1.8288. The total valid instance number is 59911.
- Total line no: 229
-- Total comment no: 67
-- Total Sd-Rule no: 162
---------------------------------------
-- Optimum SD-Rules: 105|73.17%|41|30|11|0|e$|noun|ery$|noun|2013|ORG_RULE|SELF|95.31%|87.57%|1.8288|52464|55048
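The ordering and accumulation described above can be sketched as follows. This is an illustration under stated assumptions, not the tool's algorithm: ties in precision are broken by the larger retrieved count, performance is taken as accumulated precision plus accumulated recall (consistent with 95.31% + 87.57% = 1.8288 in the log), and the function name accumulate is hypothetical. The three rules use Yes/No counts quoted on this page; a real run would use all 162 rules.

```python
def accumulate(rules, total_yes):
    """rules: iterable of (sd_rule, yes, no). Returns one row per cutoff rank."""
    ordered = sorted(
        rules,
        # Precision descending, then retrieved count descending
        key=lambda r: (-(r[1] / (r[1] + r[2])), -(r[1] + r[2])),
    )
    rows, acc_yes, acc_retrieved = [], 0, 0
    for rank, (name, yes, no) in enumerate(ordered, start=1):
        acc_yes += yes
        acc_retrieved += yes + no
        precision = acc_yes / acc_retrieved  # accumulated (system) precision
        recall = acc_yes / total_yes         # accumulated (system) recall
        rows.append((rank, name, acc_yes, acc_retrieved,
                     precision, recall, precision + recall))
    return rows


rules = [
    ("$|verb|per$|noun", 49, 18),
    ("c$|adj|cally$|adv", 1965, 1),
    ("e$|noun|ery$|noun", 30, 11),
]
for row in accumulate(rules, total_yes=59911):
    print(row[0], row[1], row[2], row[3], f"{row[4]:.2%}")
```

Each row corresponds to one possible cutoff, so the row at the chosen cutoff carries the same kind of accumulated values as the "Optimum SD-Rules" line.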
V. POST-Process: generate SD Rule Trie
Generate SD-Rule trie from this 105/162 optimized set for Lexical tools SD-Rule generation.
shell> cd ./dataR
shell> cp ./37.ity-y/sdRules.stats.out ./37.ity-y/sdRules.stats.out.opti
shell> ln -sf ./SdRulesOptimum/37.ity-y/sdRules.stats.out.opti sdRules.stats.out
shell> cd ./bin
shell> GetSdRule ${YEAR}
8
105
(the good rules)