Local Optimization - Evaluate Parent rules and their Child rules
I. Find all candidate child rules for parent rules
shell> cd ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/
shell> mkdir decompose.40.25
(40: min. local occurrence rate, 25: min. local coverage rate)
shell> ln -sf ./decompose.40.25 decompose
shell> sort -u ../../../data/suffixD.yesNo.data > ./suffixD.yesNo.data.uSort
shell> flds 1,2,4,5,7 ./suffixD.yesNo.data.uSort > suffixD.yesNo.data.uSort.1.2.4.5.7
shell> ln -sf ./suffixD.yesNo.data.uSort.1.2.4.5.7 sdPairs.data
(format: suffix-1|pos-1|suffix-2|pos-2; the rest of the line is removed)
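The unique-sort and field-selection steps above can be sketched with standard tools. `flds` appears to be a local utility, so the `cut` call below is only an assumed equivalent that presumes `|`-delimited input; `select_fields` is a hypothetical helper name, and the sample rows are made up for illustration.

```shell
# Hypothetical stand-in for `flds 1,2,4,5,7`, assuming '|'-delimited input.
select_fields() {
  cut -d'|' -f1,2,4,5,7 "$1"
}

# Made-up sample rows (not real suffixD data); the first row is duplicated
# on purpose so that `sort -u` has something to remove.
sample=$(mktemp)
printf '%s\n' \
  'a|b|c|d|e|f|g|h' \
  'a|b|c|d|e|f|g|h' \
  '1|2|3|4|5|6|7|8' > "$sample"

# Unique-sort first, then keep only fields 1, 2, 4, 5, and 7.
sort -u "$sample" > "$sample.uSort"
select_fields "$sample.uSort"
```

If `flds` is not available on a given machine, a substitution along these lines may work, provided the delimiter assumption holds for suffixD.yesNo.data.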
shell> cd ${SUFFIXD_DIR}/bin
shell> GetSdRule ${YEAR}
7
40 (min. occurrence rate - for decompose)
25 (35) (min. coverage rate - for candidate child)
A child rule must have a higher accuracy rate (precision) than its root parent rule and must meet the min. coverage rate (recall, default is 25%). Manually look through the output file sdRules.decompose.out and search for "<= Candidate"; these candidates are the child rules that meet both criteria.
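The manual search for "<= Candidate" can be narrowed down with `grep`. This is a minimal sketch: `list_candidates` is a hypothetical helper, and the sample file content is invented, since the real layout of sdRules.decompose.out is not shown here.

```shell
# Hypothetical helper: print the flagged lines with their line numbers.
# -F treats the pattern as a fixed string, -n prefixes each hit with its line number.
list_candidates() {
  grep -n -F '<= Candidate' "$1"
}

# Made-up sample content; the real sdRules.decompose.out may look different.
out=$(mktemp)
printf '%s\n' \
  'rule line A' \
  'rule line B <= Candidate' \
  'rule line C' \
  'rule line D <= Candidate' > "$out"

list_candidates "$out"
```

Against the real output, the call would be `list_candidates sdRules.decompose.out`; the flagged lines still need to be reviewed by hand before a child rule is selected.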
shell> mv sdRules.decompose.out sdRules.decompose.out.${NO}.${RULE}
(e.g., mv sdRules.decompose.out sdRules.decompose.out.1.X-ally)
II. Replace parent rules by selected candidate child SD-Rules for optimized set
shell> mkdir ${NO}.${RULE}
(e.g., mkdir 01.X-ally)
shell> cd 01.X-ally
shell> cp ../00.baseline/sdRules.stats.in .
#28|99.08%|2072|2053|19|0|$|adj|ally$|adv|2015|ORG_FACT|PARENT
281|99.95%|1954|1953|1|0|c$|adj|cally$|adv|2017|DECOMPOSE|CHILD
#282|99.95%|1949|1948|1|0|ic$|adj|ically$|adv|2017|DECOMPOSE|CHILD
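The precision column of these rule lines can be cross-checked from the counts. The sketch below assumes the field layout is id|precision|retrieved No.|relevant No.|irrelevant No.|...; that layout is inferred from the numbers (e.g., 2053/2072 = 99.08%), not documented here, and `rule_precision` is a hypothetical helper.

```shell
# Assumed field layout: id|precision|retrieved|relevant|irrelevant|...
# Recomputes precision as relevant / retrieved for every rule line.
rule_precision() {
  awk -F'|' '{
    id = $1
    sub(/^#/, "", id)                      # drop the leading comment marker, if any
    printf "%s %.2f%%\n", id, 100 * $4 / $3
  }' "$1"
}

# The three rule lines from the X-ally example.
rules=$(mktemp)
cat > "$rules" <<'EOF'
#28|99.08%|2072|2053|19|0|$|adj|ally$|adv|2015|ORG_FACT|PARENT
281|99.95%|1954|1953|1|0|c$|adj|cally$|adv|2017|DECOMPOSE|CHILD
#282|99.95%|1949|1948|1|0|ic$|adj|ically$|adv|2017|DECOMPOSE|CHILD
EOF

rule_precision "$rules"
# prints:
# 28 99.08%
# 281 99.95%
# 282 99.95%
```

Both child rules show a higher recomputed precision than the parent, which matches the first selection criterion.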
shell> mv sdRules.stats.in sdRules.stats.in.01.1
shell> ln -sf ./sdRules.stats.in.01.1 sdRules.stats.in
shell> cd ${SUFFIXD_DIR}/bin
shell> GetSdRule ${YEAR}
1
others
01.X-ally
51788 (<= from baseline)
-- Optimum SD-Rules: 88|61.29%|93|57|36|0|$|noun|ish$|adj|2017|ORG_FACT|SELF|95.06%|95.00%|1.9006|51056|53709
Move the HTML file:
shell> mv sdRules.stats.out.html sdRules.stats.out.01.1.html
shell> cp -p sdRules.stats.out.01.1.html ${WEB_LVG}/docs/designDoc/UDF/derivations/SD-Rules-Opti/Ex-${YEAR}/.
III. Results
Please refer to the optimization log for the details of each step in the parent-child rule optimization process.
The final optimized set includes 111 unique parent/self/child SD-Rules. They are sorted in descending order of precision (= relevant retrieved No. / retrieved No.) and then by retrieved No. rate. The top 82 SD-Rules are used as the optimized SD-Rule set, which reaches 95.00% system (accumulated) precision and 95.26% system (accumulated) recall, with a system performance of 1.9026. The total valid instance number is 50814.
-- Total line no: 171
-- Total comment no: 60
-- Total Sd-Rule no: 111
---------------------------------------
-- Optimum SD-Rules: 82|63.14%|331|209|122|0|$|noun|ist$|noun|2013|ORG_RULE|SELF|95.00%|95.26%|1.9026|48403|50949
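The system numbers in the summary can be reproduced from the last two fields of the Optimum SD-Rules line (48403|50949) and the total valid instance number (50814). The sketch assumes that 48403 is the accumulated relevant retrieved No., that 50949 is the accumulated retrieved No., and that system performance is precision plus recall. These readings are inferred from the arithmetic and are not stated definitions; `system_stats` is a hypothetical helper.

```shell
# $1 = accumulated relevant retrieved No. (assumed meaning)
# $2 = accumulated retrieved No.          (assumed meaning)
# $3 = total valid instance number
system_stats() {
  awk -v rel="$1" -v ret="$2" -v tot="$3" 'BEGIN {
    p = rel / ret        # system (accumulated) precision
    r = rel / tot        # system (accumulated) recall
    printf "precision=%.2f%% recall=%.2f%% performance=%.4f\n", 100 * p, 100 * r, p + r
  }'
}

system_stats 48403 50949 50814
# prints: precision=95.00% recall=95.26% performance=1.9026
```

The output agrees with the reported 95.00%, 95.26%, and 1.9026, which supports the assumed reading of the fields but does not confirm it.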
IV. Post-Process
Generate SD-Rule trie from this 82/111 optimized set (TBD).