Local Optimization - Evaluate Parent rules and their Child rules
I. Find all candidate child rules for parent rules
shell> cd ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/
shell> mkdir decompose.40.25
(40: min. local occurrence rate, 25: min. local cover-recall rate)
shell> ln -sf ./decompose.40.25 decompose
shell> sort -u ../../../data/suffixD.yesNo.data > ./suffixD.yesNo.data.uSort
shell> flds 1,2,4,5,7 ./suffixD.yesNo.data.uSort > suffixD.yesNo.data.uSort.1.2.4.5.7
shell> ln -sf ./suffixD.yesNo.data.uSort.1.2.4.5.7 sdPairs.data
suffix-1|pos-1|suffix-2|pos-2
: remove the rest of the line
7
as below to get all good candidate child rules.
1
ysis$|noun|yze$|verb
to reach the optimum SD-Rule set.
shell> cd ${SUFFIXD_DIR}/bin
shell> GetSdRule ${YEAR}
7
40 (min. occurrence rate - for decompose)
25 (35) (min. coverage rate - for candidate child)
Manually look through the output file sdRules.decompose.out and search for "<= Candidate". These candidates are child rules that match the following criteria: a child rule must have a higher accuracy rate (precision) than the root parent rule, and it must meet the min. coverage rate (recall, default is 25%).
shell> mv sdRules.decompose.out sdRules.decompose.out.${NO}.${RULE}
(for example: mv sdRules.decompose.out sdRules.decompose.out.1.X-ally)
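The two candidate criteria can be sketched in a few lines of Python. This is only an illustration, not the GetSdRule implementation. It assumes that the third and fourth fields of a rule record are the retrieved and relevant ("yes") counts, and that the coverage rate is the child's relevant count divided by the parent's relevant count. The names RuleStats and is_candidate are hypothetical. The parent and the ic$ child use the counts from the X-ally records in section II; the last child is made up to show a rejection.

```python
from dataclasses import dataclass


@dataclass
class RuleStats:
    rule: str
    retrieved: int  # assumed meaning of field 3: instances the rule applies to
    relevant: int   # assumed meaning of field 4: instances tagged "yes"

    @property
    def precision(self) -> float:
        """Accuracy rate in percent: relevant / retrieved."""
        return 100.0 * self.relevant / self.retrieved


def is_candidate(parent: RuleStats, child: RuleStats,
                 min_coverage: float = 25.0) -> bool:
    """Apply the two criteria: higher precision than the parent, enough coverage."""
    # Assumption: coverage (local recall) = child relevant / parent relevant.
    coverage = 100.0 * child.relevant / parent.relevant
    return child.precision > parent.precision and coverage >= min_coverage


# Counts taken from the X-ally records in section II.
parent = RuleStats("$|adj|ally$|adv", retrieved=2073, relevant=2054)
child_ic = RuleStats("ic$|adj|ically$|adv", retrieved=1950, relevant=1949)
# Hypothetical child: perfect precision but far too little coverage.
child_rare = RuleStats("hypothetical", retrieved=100, relevant=100)

print(is_candidate(parent, child_ic))    # True
print(is_candidate(parent, child_rare))  # False
```

The ic$ child passes because its precision (99.95%) exceeds the parent's (99.08%) and, under the assumed definition, it covers about 95% of the parent's relevant instances.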
II. Replace parent rules by selected candidate child SD-Rules for optimized set
shell> mkdir ${NO}.${RULE}
(for example: mkdir 01.X-ally)
shell> cd 01.X-ally
shell> cp ../00.baseline/sdRules.stats.in .
#32|99.08%|2073|2054|19|0|$|adj|ally$|adv|2015|ORG_FACT|PARENT
321|99.08%|2073|2054|19|0|c$|adj|cally$|adv|2020|DECOMPOSE|CHILD
#322|99.95%|1950|1949|1|0|ic$|adj|ically$|adv|2020|DECOMPOSE|CHILD
shell> mv sdRules.stats.in sdRules.stats.in.01.1
shell> ln -sf ./sdRules.stats.in.01.1 sdRules.stats.in
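When editing sdRules.stats.in by hand, it can help to check each record before rerunning GetSdRule. The sketch below parses the three X-ally records shown above. It assumes that a leading "#" marks a commented-out record, and that the 13 pipe-delimited fields are, in order: id, precision, retrieved, relevant, non-relevant, an unnamed count, suffix-1, pos-1, suffix-2, pos-2, year, source, and role. These field names and the helper parse_rule_line are inferred for illustration and are not taken from the tool itself.

```python
LINES = [
    "#32|99.08%|2073|2054|19|0|$|adj|ally$|adv|2015|ORG_FACT|PARENT",
    "321|99.08%|2073|2054|19|0|c$|adj|cally$|adv|2020|DECOMPOSE|CHILD",
    "#322|99.95%|1950|1949|1|0|ic$|adj|ically$|adv|2020|DECOMPOSE|CHILD",
]


def parse_rule_line(line: str) -> dict:
    """Split one stats record into named fields (names are assumptions)."""
    text = line.strip()
    commented = text.startswith("#")  # assumption: "#" comments a record out
    fields = text.lstrip("#").split("|")
    return {
        "commented": commented,
        "id": fields[0],
        "precision": float(fields[1].rstrip("%")),
        "retrieved": int(fields[2]),
        "relevant": int(fields[3]),
        "rule": "|".join(fields[6:10]),  # suffix-1|pos-1|suffix-2|pos-2
        "role": fields[12],
    }


rules = [parse_rule_line(line) for line in LINES]
for r in rules:
    # Recompute precision from the counts as a consistency check.
    recomputed = 100.0 * r["relevant"] / r["retrieved"]
    state = "commented" if r["commented"] else "active"
    print(f'{r["id"]:>4} {r["role"]:<6} {state:<9} {r["rule"]:<22} '
          f'stated={r["precision"]:.2f}% recomputed={recomputed:.2f}%')
```

For all three records the recomputed precision agrees with the stated percentage to two decimals, which is a quick way to catch a mistyped count.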
shell> cd ${SUFFIXD_DIR}/bin
shell> GetSdRule ${YEAR}
1
others
01.X-ally
53440 (total Yes from baseline)
-- Optimum SD-Rules: 92|63.14%|331|209|122|0|$|noun|ist$|noun|2013|ORG_RULE|SELF|95.05%|94.26%|1.8931|50371|52993
Move the HTML file:
shell> mv sdRules.stats.out.html sdRules.stats.out.01.1.html
shell> cp -p sdRules.stats.out.01.1.html ${WEB_LVG}/docs/designDoc/UDF/derivations/SD-Rules-Opti/Ex-${YEAR}/.
III. Results
Please refer to the optimization log for the details of each step in these parent-child rule optimization processes.
The final optimized set includes 130 unique parent/self/child SD-Rules. They are sorted in descending order of precision (= relevant retrieved No. / retrieved No.) and then by retrieved No. rate. The top 93 SD-Rules are used as the optimized SD-Rule set, which reaches 95.00% system (accumulated) precision and 94.48% system (accumulated) recall, for a system performance of 1.8948. The total number of valid instances is 53440.
-- Total line no: 168
-- Total comment no: 38
-- Total Sd-Rule no: 130
---------------------------------------
-- Optimum SD-Rules: 93|63.02%|192|121|71|0|ar$|adj|e$|noun|2013|ORG_RULE|SELF|95.00%|94.48%|1.8948|50488|53145
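The system figures on the "Optimum SD-Rules" line can be reproduced from its last two fields. The sketch below assumes those fields are the accumulated relevant-retrieved count and the accumulated retrieved count, that recall is measured against the total of 53440 valid instances, and that the system performance is precision plus recall. That last relation is inferred from the reported numbers rather than stated in the text, and the function name system_scores is hypothetical.

```python
TOTAL_YES = 53440  # total valid ("yes") instances from the baseline


def system_scores(relevant_retrieved: int, retrieved: int,
                  total_relevant: int = TOTAL_YES):
    """Return (precision, recall, performance) as fractions."""
    precision = relevant_retrieved / retrieved
    recall = relevant_retrieved / total_relevant
    # Assumption: performance is the plain sum of precision and recall.
    return precision, recall, precision + recall


# (label, accumulated relevant retrieved, accumulated retrieved)
runs = [
    ("top 93 (final set)", 50488, 53145),
    ("top 92 (01.X-ally run)", 50371, 52993),
]
for label, rel, ret in runs:
    p, r, perf = system_scores(rel, ret)
    print(f"{label}: precision={p:.2%} recall={r:.2%} performance={perf:.4f}")
```

Under these assumptions the computed values match the reported ones: 95.00%, 94.48%, and 1.8948 for the final set, and 95.05%, 94.26%, and 1.8931 for the 01.X-ally run.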
IV. Post-Process
Generate the SD-Rule trie from this optimized set (the top 93 of 130 SD-Rules) for Lexical Tools SD-Rule generation (TBD).