Lexical Tools

Local Optimization - Evaluate Parent rules and their Child rules

I. Find all candidate child rules for parent rules

DIR: ${SUFFIXD_DIR}
Inputs:
- Prepare directory:
  shell> cd ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/
  shell> mkdir decompose.40.25 (40: min. local occurrence rate, 25: min. local cover-recall rate)
  shell> ln -sf ./decompose.40.25 decompose
- Get all SD-pairs (corpus)
  ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/decompose/sdPairs.data
  shell> sort -u ../../../data/suffixD.yesNo.data > ./suffixD.yesNo.data.uSort
  shell> flds 1,2,4,5,7 ./suffixD.yesNo.data.uSort > suffixD.yesNo.data.uSort.1.2.4.5.7
  shell> ln -sf ./suffixD.yesNo.data.uSort.1.2.4.5.7 sdPairs.data
- Decompose parent's rules one-by-one:
  ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/decompose/sdRule.data
  - copy from ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/${YEAR}/sdRules.data.${YEAR}.relation.children.only.rpt .
  - remove the line if the rule is a child and a parent rule at the same time (|CHILD => in the first part of the relationship)
  - change format to suffix-1|pos-1|suffix-2|pos-2: remove the rest of the line
  - Add esis$|noun|ic$|adj if this rule is not there.
  - Total 19 candidate parent SD-Rules to be evaluated. These rules were evaluated previously (may not need to be re-evaluated).
  - run the program (as described below) for each parent SD-Rule, one at a time, by commenting out (#) the rest
  - The results may differ slightly between years, but the principle is the same.
- Also need to evaluate new rules and their child rules.
  - Add all new rules to ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/decompose/sdRule.data
- Simplified method after 2020+
  The parent-child evaluation and optimization can be simplified as follows:
  - Use the parent rules from the previous year (./dataR/SdRuleCheck/sdRule.data)
  - Run GetSdRule ${YEAR} with option 7, as described below, to get all good candidate child rules.
    - Program:
      shell> cd ${SUFFIXD_DIR}/bin
      shell> GetSdRule ${YEAR}
      7
      40 (min. occurrence rate - for decompose)
      => Need to have enough coverage for further decomposition on child rules
      25 (35) (min. coverage rate - for candidate child)
      => Need to have enough coverage to be a qualified child rule
    - Outputs:
      - sdRules.decompose.out
        A child rule must have a higher accuracy rate (precision) than the root parent rule and meet the min. coverage rate (recall, default is 25%). Manually look through the output file sdRules.decompose.out and search for "<= Candidate". These candidates are child rules that match the following criteria:
        - the accuracy rate (precision) is higher than the parent rule's
        - the coverage rate (recall) is higher than 25% (or the specified number)
      - shell>mv sdRules.decompose.out sdRules.decompose.out.${NO}.${RULE}
      - such as shell>mv sdRules.decompose.out sdRules.decompose.out.1.X-ally
  - Compare the precision and recall rates to those of the root parent rule. The sum (F1) of precision and recall of a good child rule must be greater than (or close to, within 5% of) the parent rule's, so that the overall F1 (obtained by running GetSdRule ${YEAR} with option 1) can be better.
  - The evaluation of parent-child rules should be consistent with past years. So, compare to the past results and only evaluate SD-Rules with the same or better F1.
  - Use the above two criteria to evaluate new parent rules and those that need to be evaluated.
  - We used this simplified method in 2021 to evaluate 19 parent rules and 17 new rules to reach the optimum SD-Rule set.
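The two acceptance checks described above can be sketched in shell as follows. This is only an illustration and not part of GetSdRule: the function names `is_candidate` and `score_ok` are hypothetical, all rates are passed as percentages, "meets the min. coverage rate" is read as greater than or equal to, and "within 5%" is assumed to mean within 5% of the parent's precision + recall sum.

```shell
#!/bin/sh
# Hypothetical helpers (not part of GetSdRule); all rates are percentages.

# is_candidate PARENT_PRECISION CHILD_PRECISION CHILD_RECALL [MIN_RECALL]
# Succeeds when the child is more precise than its parent and its
# coverage (recall) reaches the minimum (default 25).
is_candidate() {
  awk -v pp="$1" -v cp="$2" -v cr="$3" -v min="${4:-25}" \
    'BEGIN { exit !((cp + 0 > pp + 0) && (cr + 0 >= min + 0)) }'
}

# score_ok PARENT_PRECISION PARENT_RECALL CHILD_PRECISION CHILD_RECALL
# Succeeds when the child's precision + recall sum is at least 95% of
# the parent's sum (assumed reading of "within 5%").
score_ok() {
  awk -v pp="$1" -v pr="$2" -v cp="$3" -v cr="$4" \
    'BEGIN { exit !((cp + cr) >= 0.95 * (pp + pr)) }'
}

# Illustrative values only: parent precision 99.08, child precision 99.95,
# child recall 94.3.
if is_candidate 99.08 99.95 94.3; then
  echo "candidate"
else
  echo "rejected"
fi
```

To list the lines that GetSdRule itself flagged, `grep -n '<= Candidate' sdRules.decompose.out` is enough; the helpers above only restate the thresholds so they can be checked by hand.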
Continue to the next step to evaluate parent-child rules (by replacing the parent rule with child rules) and repeat the whole process for all parent rules.
Update the optimal log while going through this process.
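Running the program for one parent rule at a time means commenting out (#) every other rule in sdRule.data before each run. A minimal sketch of that edit is shown below. The helper name `only_rule` is hypothetical, and it assumes the file holds one rule per line in suffix-1|pos-1|suffix-2|pos-2 form with no free-text comment lines.

```shell
#!/bin/sh
# only_rule N FILE
# Print FILE with every rule commented out (#) except the N-th one.
# Existing leading '#' marks are dropped first, so the function can be
# re-run for N = 1, 2, ... on the same file. Blank lines pass through.
only_rule() {
  awk -v n="$1" '
    /^[ \t]*$/ { print; next }   # keep blank lines as they are
    {
      sub(/^#+/, "")             # normalize: uncomment the rule
      count++
      if (count == n) print; else print "#" $0
    }' "$2"
}

# Example with two rules taken from this document.
tmp=$(mktemp)
printf '%s\n' '$|adj|ally$|adv' 'esis$|noun|ic$|adj' > "$tmp"
only_rule 2 "$tmp"
rm -f "$tmp"
```

The output can be redirected to a new file and linked as sdRule.data, in the same way the other inputs in this procedure are linked with `ln -sf`.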

II. Replace parent rules by selected candidate child SD-Rules for optimized set

DIR: ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesOptimum/
- Create a new directory
  shell>mkdir ${NO}.${RULE}
  shell>mkdir 01.X-ally
Inputs:
- Update sdRules.stats.in by replacing the 1st parent rule with its candidate child rules
  shell>cd 01.X-ally
  shell>cp ../00.baseline/sdRules.stats.in .
  => Copy all candidate child rules from ../../SdRulesCheck/decompose/sdRules.decompose.out.1.X-ally to this file under the associated parent rule
  Update the following in sdRules.stats.in:
  - Find the rank (31) of the associated parent rule in the baseline, by precision rate (99.08%)
  - Change the rank (1st field) to 311 (original rank 31, followed by child level 1)
  - Move the 9th field, the accuracy rate (precision), to the 2nd field
  - Add 0 as the 6th field (tbd no.)
  - Change fields 11~13 to ${YEAR}|DECOMPOSE|CHILD
  - Comment out (#) the parent/child rules that are not in the test
  - The new edited file looks like:
```
#31|99.08%|2075|2056|19|0|$|adj|ally$|adv|2015|ORG_FACT|PARENT
311|99.95%|1957|1956|1|0|c$|adj|cally$|adv|2021|DECOMPOSE|CHILD
#312|99.95%|1952|1951|1|0|ic$|adj|ically$|adv|2021|DECOMPOSE|CHILD
```
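Because sdRules.stats.in is edited by hand, a quick consistency check before running GetSdRule can catch copy errors. The sketch below is not part of the Lexical Tools programs; `check_stats` is a hypothetical helper. It assumes that fields 3, 4, and 5 are the total, yes, and no instance counts. That layout is inferred from the example lines, where 2056 + 19 + 0 = 2075 and 2056 / 2075 = 99.08%.

```shell
#!/bin/sh
# check_stats FILE
# For every rule line (a leading '#' does not matter), recompute the
# precision from the counts and compare it with field 2.
# Assumed layout: rank|precision|total|yes|no|tbd|suffix-1|pos-1|suffix-2|pos-2|...
check_stats() {
  awk -F'|' '
    NF >= 6 {
      total = $3; yes = $4; no = $5; tbd = $6
      if (total + 0 == 0) { printf "BAD\t%s\n", $0; bad = 1; next }
      calc = sprintf("%.2f%%", 100 * yes / total)
      ok = (calc == $2) && (yes + no + tbd == total)
      printf "%s\t%s\n", (ok ? "OK" : "BAD"), $0
      if (!ok) bad = 1
    }
    END { exit bad + 0 }' "$1"
}

# The three example lines from the edited file above.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
#31|99.08%|2075|2056|19|0|$|adj|ally$|adv|2015|ORG_FACT|PARENT
311|99.95%|1957|1956|1|0|c$|adj|cally$|adv|2021|DECOMPOSE|CHILD
#312|99.95%|1952|1951|1|0|ic$|adj|ically$|adv|2021|DECOMPOSE|CHILD
EOF
check_stats "$tmp"
rm -f "$tmp"
```

All three example lines pass this check. The function returns a non-zero status as soon as one line fails, so it can gate the next step in a script.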
Program - Get the optimal Set:
shell> mv sdRules.stats.in sdRules.stats.in.01.1
shell> ln -sf ./sdRules.stats.in.01.1 sdRules.stats.in

shell> cd ${SUFFIXD_DIR}/bin
shell> GetSdRule ${YEAR}
1
others
01.X-ally
54347 <= total Yes from baseline
Outputs directory:
- ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesOptimum/01.X-ally
```
-- Optimum SD-Rules: 92|63.14%|331|209|122|0|$|noun|ist$|noun|2013|ORG_RULE|SELF|95.05%|94.26%|1.8931|50371|52993
```
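The figures to copy into the optimal log sit in the last five fields of the "-- Optimum SD-Rules:" line. The sketch below pulls them out and recomputes two of them for comparison. `optimum_summary` is a hypothetical helper. Reading the last two fields as accumulated yes and accumulated retrieved counts is an assumption, based on 50371 / 52993 = 95.05% in the line above.

```shell
#!/bin/sh
# optimum_summary: read GetSdRule output on stdin, find the
# "-- Optimum SD-Rules:" line, and print the system figures held in its
# last five fields plus two recomputed values for comparison.
optimum_summary() {
  awk -F'|' '
    /^-- Optimum SD-Rules:/ {
      p = $(NF-4); r = $(NF-3); perf = $(NF-2)
      sub(/%/, "", p); sub(/%/, "", r)
      printf "precision=%s%% recall=%s%% performance=%s\n", p, r, perf
      printf "precision+recall=%.4f\n", (p + r) / 100
      if ($NF + 0 > 0)
        printf "%s/%s=%.2f%%\n", $(NF-1), $NF, 100 * $(NF-1) / $NF
    }'
}

printf '%s\n' '-- Optimum SD-Rules: 92|63.14%|331|209|122|0|$|noun|ist$|noun|2013|ORG_RULE|SELF|95.05%|94.26%|1.8931|50371|52993' | optimum_summary
```

For the line above, the recomputed sum is 1.8931 and 50371/52993 comes to 95.05%, so the reported system performance is the sum of system precision and system recall.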
Move the HTML file:
- shell> mv sdRules.stats.out.html sdRules.stats.out.01.1.html
- shell> cp -p sdRules.stats.out.01.1.html ${WEB_LVG}/docs/designDoc/UDF/derivations/SD-Rules-Opti/Ex-${YEAR}/.
- Update the optimal-log file
Repeat this process for all generations of candidate child rules of the same parent rule.
Repeat this process for all parent rules (using the best sdRules.stats.in)
Go to the result of the optimization log for optimization details.

III. Results

Please refer to the result of the optimization log for details of each step of these parent-child rule optimization processes.

The final optimized set of SD-Rules includes 148 unique parent/self/child SD-Rules. They are sorted in descending order of precision (= relevant retrieved no. / retrieved no.) and then by retrieved no. rate. The top 104 SD-Rules are used as the optimized SD-Rule set, which covers 95.00% system (accumulated) precision (95.12% in the output below) and 93.45% system (accumulated) recall, with a system performance of 1.8857 (the sum of 95.12% precision and 93.45% recall). The total number of valid instances is 54347.

- Total line no: 197
-- Total comment no: 49
-- Total Sd-Rule no: 148
---------------------------------------
-- Optimum SD-Rules: 104|65.85%|41|27|14|0|ctic$|adj|xis$|noun|2021|ORG_FACT|SELF|95.12%|93.45%|1.8857|50857|53465
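
The line totals above (197 = 49 + 148) can be cross-checked against the input file with a few lines of awk. This is a sketch, not the program's own counting logic: `count_rules` is a hypothetical helper, and it assumes that a comment is any line starting with #, that every other non-empty line is an SD-Rule, and that empty lines are not counted.

```shell
#!/bin/sh
# count_rules FILE
# Print the total, comment, and SD-Rule line counts of a stats file.
# Assumption: lines starting with '#' are comments, empty lines are
# skipped, and everything else is one SD-Rule.
count_rules() {
  awk '
    /^[ \t]*$/ { next }
    /^#/ { comments++; next }
    { rules++ }
    END {
      printf "Total line no: %d\n", comments + rules
      printf "Total comment no: %d\n", comments
      printf "Total Sd-Rule no: %d\n", rules
    }' "$1"
}
```

A typical call would be `count_rules sdRules.stats.in`, run in the directory of the final optimization step.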

IV. Post-Process

Generate the SD-Rule trie from this 104/148 optimized set for Lexical Tools SD-Rule generation.

shell> cd ./dataR
shell> cp ./35.ity-y/sdRules.stats.out ./35.ity-y/sdRules.stats.out.opti
shell> ln -sf ./SdRulesOptimum/35.ity-y/sdRules.stats.out.opti sdRules.stats.out
shell> cd ./bin
8
104 (the good rules)
./dataR/dm.rul.2021.104 (the trie file for Lvg)