Lexical Tools

Optimizing 2015 SD-Rule Set - Parent Rules

I. Find all candidate child rules for 14 parent rules

  • DIR: ${SUFFIXD_DIR}
  • Inputs:
    • ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/decompose/sdPairs.data
      shell> sort -u ../../../data/suffixD.yesNo.data > ./suffixD.yesNo.data.uSort
      shell> flds 1,2,4,5,7 ./suffixD.yesNo.data.uSort > suffixD.yesNo.data.uSort.1.2.4.5.7
      shell> ln -sf ./suffixD.yesNo.data.uSort.1.2.4.5.7 sdPairs.data
    • ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/decompose/sdRule.data
      => Add all 14 parent SD-Rules to this file
      => Go through them one at a time by commenting out (#) the other 13
  • Program:
    shell> cd ${SUFFIXD_DIR}/bin
    shell> GetSdRule ${YEAR}
    7
    40 (min. occurrence rate - for decompose)
    => Need to have enough coverage for further decomposition on child rules
    25 (35) (min. coverage rate - for candidate child)
    => Need to have enough coverage to be a qualified child rule
  • Outputs:
    • sdRules.decompose.out

      A child rule must have a higher accuracy rate (precision) than the root parent rule and must meet the min. coverage rate (recall). Manually look through the output file sdRules.decompose.out and search for "<= Candidate". These candidates are child rules that match the following criteria:

      • the accuracy rate (precision) is higher than that of the parent rule
      • the coverage rate (recall) is higher than 35% (or the specified number)
    • Rename the output file by rule number and rule name:
      shell> mv sdRules.decompose.out sdRules.decompose.out.no.rule
    • such as:
      shell> mv sdRules.decompose.out sdRules.decompose.out.1.X-ally
  • Repeat this process for all 14 parent rules.
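
The two candidate criteria above can be expressed as a small check. This is a minimal sketch, not the GetSdRule implementation: the function name is_candidate is hypothetical, and it assumes the coverage rate is the child rule's instance count divided by the parent rule's instance count, which the text does not spell out. The sample figures are taken from the X-ally rule lines shown in section II.

```python
def is_candidate(parent_relevant, parent_total,
                 child_relevant, child_total,
                 min_coverage=0.35):
    """Return True if a child rule qualifies as a candidate.

    Assumption: coverage = child instances / parent instances.
    """
    parent_precision = parent_relevant / parent_total
    child_precision = child_relevant / child_total
    coverage = child_total / parent_total
    # Both criteria must hold: better precision and enough coverage.
    return child_precision > parent_precision and coverage > min_coverage


# Parent $|adj|ally$|adv: 2053 of 2072 instances are relevant.
# Child c$|adj|cally$|adv: 1953 of 1954 instances are relevant.
print(is_candidate(2053, 2072, 1953, 1954))
```

Raising min_coverage makes the filter stricter, which corresponds to entering a higher min. coverage rate when running the program.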

II. Replace the 14 parent rules with selected candidate child SD-Rules for the optimized set

  • DIR: ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesOptimum/
    • Create a new directory
      shell> mkdir 1.X-ally
  • Inputs:
    • Update sdRules.stats.in by replacing the 1st parent rule with its candidate child rules
      shell> cd 1.X-ally
      shell> cp ../0.baseline/sdRules.stats.in .
      => Copy all candidate child rules from ../../SdRulesCheck/decompose/sdRules.decompose.out.1.X-ally to this file
      Update the following:
      • Change the rank (1st field) to 241 (original rank + child level)
      • Move the accuracy rate (precision) to the 2nd field
      • Add 0 to the 6th field (tbd no.)
      • Change fields 11~13 to ${YEAR}|DECOMPOSE|CHILD
      • Comment out (#) the parent/child rules that are not under test
      #24|99.08%|2072|2053|19|0|$|adj|ally$|adv|2015|ORG_FACT|PARENT
      241|99.95%|1954|1953|1|0|c$|adj|cally$|adv|2015|DECOMPOSE|CHILD
      #242|99.95%|1949|1948|1|0|ic$|adj|ically$|adv|2015|DECOMPOSE|CHILD
      
  • Program - Get the optimal Set:
    shell> cd ${SUFFIXD_DIR}/bin
    shell> GetSdRule ${YEAR}
    1
    others
    1.X-ally
    46950 <= from baseline
  • Outputs directory:
    • ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesOptimum/1.X-ally
    -- Optimum SD-Rules: 76|61.70%|188|116|72|0|ar$|adj|e$|noun|2013|ORG_RULE|SELF|95.21%|95.50%|1.9071|44835|47089
    
  • Repeat this process for all candidate child rules.
  • Repeat this process for all parent rules.
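
The field layout of the sdRules.stats.in lines shown above can be checked with a short script. This is a sketch under stated assumptions: the field names for fields 3 to 5 (total, relevant, non-relevant) are inferred from the sample lines, where field 4 divided by field 3 reproduces the listed precision; child_rank assumes that "original rank + child level" means appending the child level digit to the parent rank (24 and 1 give 241). The helper names are hypothetical.

```python
def parse_rule(line):
    """Split one pipe-delimited sdRules.stats.in line into named fields."""
    active = not line.startswith("#")  # '#' marks a rule that is not under test
    fields = line.lstrip("#").split("|")
    return {
        "active": active,
        "rank": fields[0],
        "precision": fields[1],
        "total": int(fields[2]),         # assumed: instances matched
        "relevant": int(fields[3]),      # assumed: valid instances
        "non_relevant": int(fields[4]),  # assumed: invalid instances
        "tbd": int(fields[5]),
        "rule": "|".join(fields[6:10]),
        "tags": fields[10:13],           # year, origin, parent/self/child
    }


def recomputed_precision(rule):
    """Recompute precision from the counts, formatted like field 2."""
    return f"{rule['relevant'] / rule['total']:.2%}"


def child_rank(parent_rank, child_level):
    """Assumed convention: append the child level to the parent rank."""
    return f"{parent_rank}{child_level}"


lines = [
    "#24|99.08%|2072|2053|19|0|$|adj|ally$|adv|2015|ORG_FACT|PARENT",
    "241|99.95%|1954|1953|1|0|c$|adj|cally$|adv|2015|DECOMPOSE|CHILD",
]
for line in lines:
    rule = parse_rule(line)
    print(rule["rank"], rule["active"], rule["precision"], recomputed_precision(rule))
```

A mismatch between the listed and recomputed precision would indicate a field that was moved or edited incorrectly while preparing the file.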

III. Results

Please refer to the optimization log for details of each step of the parent-child rule optimization process.

The final optimized set of SD-Rules includes 101 unique parent/self/child SD-Rules. They are sorted in descending order of precision (= relevant retrieved No. / retrieved No.) and then by retrieved No. rate. The top 76 SD-Rules are used as the optimized SD-Rule set, which reaches 95.22% system (accumulated) precision and 95.70% system (accumulated) recall, with a system performance of 1.9093. The total valid instance number is 46950.

-- Total line no: 147
-- Total comment no: 46
-- Total Sd-Rule no: 101
---------------------------------------
-- Optimum SD-Rules: 76|61.70%|188|116|72|0|ar$|adj|e$|noun|2013|ORG_RULE|SELF|95.22%|95.70%|1.9093|44933|47187
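
The last five fields of the "Optimum SD-Rules" line are consistent with the following reading, which is an inference from the numbers and not a documented definition: accumulated precision = relevant retrieved No. / retrieved No., accumulated recall = relevant retrieved No. / total valid instance number, and system performance = precision + recall. The function name system_stats is hypothetical.

```python
def system_stats(relevant_retrieved, retrieved, total_valid):
    """Return (precision, recall, performance) for an accumulated rule set.

    Assumption: performance is the plain sum of precision and recall.
    """
    precision = relevant_retrieved / retrieved
    recall = relevant_retrieved / total_valid
    return precision, recall, precision + recall


# Last two fields of the line above (44933, 47187) and the total valid
# instance number (46950).
precision, recall, performance = system_stats(44933, 47187, 46950)
print(f"{precision:.2%}|{recall:.2%}|{performance:.4f}")
# prints 95.22%|95.70%|1.9093
```

The same calculation applied to the intermediate 1.X-ally output (44835, 47089, 46950) reproduces its reported 95.21%, 95.50%, and 1.9071.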

IV. Post-Process

Update ${SUFFIXD_DIR}/data/${YEAR}/dataOrg/sdRules.data.${NEXT_YEAR} by:

  • adding new candidate child rules with better system performance