Lexical Tools

Local Optimization - Evaluate Parent rules and their Child rules

I. Find all candidate child rules for parent rules

  • DIR: ${SUFFIXD_DIR}
  • Inputs:
    • Prepare directory:
      shell> cd ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/
      shell> mkdir decompose.40.25 (40: min. local occurrence rate, 25: min. local coverage rate)
      shell> ln -sf ./decompose.40.25 decompose
    • Get all SD-pairs (corpus)
      ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/decompose/sdPairs.data
      shell> sort -u ../../../data/suffixD.yesNo.data > ./suffixD.yesNo.data.uSort
      shell> flds 1,2,4,5,7 ./suffixD.yesNo.data.uSort > suffixD.yesNo.data.uSort.1.2.4.5.7
      shell> ln -sf ./suffixD.yesNo.data.uSort.1.2.4.5.7 sdPairs.data
    • Decompose parent's rules one-by-one:
      ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/decompose/sdRule.data
      • copy from ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/${YEAR}/sdRules.data.2017.relation.children.only.rpt .
      • remove any line whose rule is both a child rule and a parent rule (|CHILD => in the first part of the relationship)
      • change format to suffix-1|pos-1|suffix-2|pos-2: remove the rest of the line
      • Total 16 candidate parent SD-Rules to be evaluated.
      • go through the 16 rules one by one, commenting out (#) the rest
  • Program:
    shell> cd ${SUFFIXD_DIR}/bin
    shell> GetSdRule ${YEAR}
    7
    40 (min. occurrence rate - for decompose)
    => Need to have enough coverage for further decomposition on child rules
    25 (35) (min. coverage rate - for candidate child)
    => Need to have enough coverage to be a qualified child rule
  • Outputs:
    • sdRules.decompose.out

      A child rule must have a higher accuracy rate (precision) than the root parent rule and must meet the min. coverage rate (recall, default is 25%). Manually look through the output file sdRules.decompose.out and search for "<= Candidate"; these candidates are the child rules that match the following criteria:

      • the accuracy rate (precision) is higher than parent-rule
      • the coverage rate (recall) is higher than 25% (or the specified number)
    • shell>mv sdRules.decompose.out sdRules.decompose.out.${NO}.${RULE}
    • such as shell>mv sdRules.decompose.out sdRules.decompose.out.1.X-ally
  • Continue to the next step to evaluate the parent-child rules, then repeat the whole process for all parent rules.
  • Update the optimization log while going through this process
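
The two candidate criteria above can be sketched as a small check. This is only an illustration, not the logic inside GetSdRule: the function name is hypothetical, the coverage threshold is treated as inclusive, and local coverage is assumed to be child instances divided by parent instances. The numbers come from the X-ally parent rule and its c$|adj|cally$|adv child rule.

```python
def is_candidate_child(child_precision, child_coverage,
                       parent_precision, min_coverage=25.0):
    """Return True if a child rule meets both criteria (all rates in percent)."""
    # Criterion 1: the child must be more precise than its parent rule.
    more_precise = child_precision > parent_precision
    # Criterion 2 (assumed inclusive): the child must reach the min. coverage rate.
    enough_coverage = child_coverage >= min_coverage
    return more_precise and enough_coverage


# Parent rule $|adj|ally$|adv: 2053 correct out of 2072 instances.
parent_precision = 100.0 * 2053 / 2072
# Child rule c$|adj|cally$|adv: 1953 correct out of 1954 instances.
child_precision = 100.0 * 1953 / 1954
# Assumption: local coverage = child instances / parent instances.
child_coverage = 100.0 * 1954 / 2072

print(is_candidate_child(child_precision, child_coverage, parent_precision))
```

A child rule that is more precise but covers too few of the parent's instances fails the check, which is why the min. coverage rate (25 or 35) is supplied to the program alongside the occurrence rate.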

II. Replace parent rules by selected candidate child SD-Rules for optimized set

  • DIR: ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesOptimum/
    • Create a new directory
      shell>mkdir ${NO}.${RULE}
      shell>mkdir 01.X-ally
  • Inputs:
    • Update sdRules.stats.in by replacing the 1st parent rule with its candidate child rules
      shell>cd 01.X-ally
      shell>cp ../00.baseline/sdRules.stats.in .
      => Copy all candidate child rules from ../../SdRulesCheck/decompose/sdRules.decompose.out.1.X-ally to this file under the associated parent rule
      Update the following:
      • Find the associated parent rule (28)
      • Change the rank (1st field): to 281 (Original rank + child level)
      • Move 9th field - accuracy rate (precision) to 2nd field
      • Add 0 to 6th field (tbd no.)
      • Change fields 11~13 to ${YEAR}|DECOMPOSE|CHILD
      • Comment out (#) the parent/child rules that are not under test
      • The new edited file looks like:
        #28|99.08%|2072|2053|19|0|$|adj|ally$|adv|2015|ORG_FACT|PARENT
        281|99.95%|1954|1953|1|0|c$|adj|cally$|adv|2017|DECOMPOSE|CHILD
        #282|99.95%|1949|1948|1|0|ic$|adj|ically$|adv|2017|DECOMPOSE|CHILD
        
  • Program - Get the optimal Set:
    shell> mv sdRules.stats.in sdRules.stats.in.01.1
    shell> ln -sf ./sdRules.stats.in.01.1 sdRules.stats.in


    shell> cd ${SUFFIXD_DIR}/bin
    shell> GetSdRule ${YEAR}
    1
    others
    01.X-ally
    51788 <= from baseline

  • Outputs directory:
    • ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesOptimum/01.X-ally
    -- Optimum SD-Rules: 88|61.29%|93|57|36|0|$|noun|ish$|adj|2017|ORG_FACT|SELF|95.06%|95.00%|1.9006|51056|53709
    

    Move the HTML file:

    • shell> mv sdRules.stats.out.html sdRules.stats.out.01.1.html
    • shell> cp -p sdRules.stats.out.01.1.html ${WEB_LVG}/docs/designDoc/UDF/derivations/SD-Rules-Opti/Ex-${YEAR}/.
    • Update the optimization log file
  • Repeat this process for all generations of candidate child rules of the same parent rule.
  • Repeat this process for all parent rules (using the best sdRules.stats.in)
  • See the result of the optimization log for the details of each optimization step.
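
The field edits listed under Inputs above (rank, precision, tbd no., and fields 11~13) can be sketched as follows. This is a hedged illustration, not part of the Lexical Tools code. The function names are hypothetical. The field names total, correct, and wrong are inferred from the sample lines (2072 = 2053 + 19 and 2053/2072 = 99.08%). "Original rank + child level" is assumed to mean appending the child level to the parent rank, as in 28 -> 281, 282.

```python
def parse_rule(line):
    """Split one sdRules.stats.in line into its 13 pipe-delimited fields."""
    commented = line.startswith("#")  # '#' marks a rule that is not under test
    f = line.lstrip("#").split("|")
    return {
        "commented": commented,
        "rank": f[0],                 # 1st field
        "precision": f[1],            # 2nd field, accuracy rate
        "total": int(f[2]),           # assumed meaning: instance count
        "correct": int(f[3]),         # assumed meaning: correct instances
        "wrong": int(f[4]),           # assumed meaning: wrong instances
        "tbd": int(f[5]),             # 6th field, tbd no.
        "rule": "|".join(f[6:10]),    # suffix-1|pos-1|suffix-2|pos-2
        "year": f[10],
        "source": f[11],
        "type": f[12],
    }


def child_rank(parent_rank, child_level):
    """Assumed rule: append the child level to the parent rank (28 -> 281)."""
    return f"{parent_rank}{child_level}"


def precision_pct(rule):
    """Recompute the 2nd field from the counts, formatted like the file."""
    return f"{100.0 * rule['correct'] / rule['total']:.2f}%"


parent = parse_rule("#28|99.08%|2072|2053|19|0|$|adj|ally$|adv|2015|ORG_FACT|PARENT")
child = parse_rule("281|99.95%|1954|1953|1|0|c$|adj|cally$|adv|2017|DECOMPOSE|CHILD")

print(child_rank(parent["rank"], 1), precision_pct(child), child["rule"])
```

Recomputing the precision from the counts is a quick sanity check after editing a line by hand: the value should match the 2nd field.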

III. Results

Please refer to the result of the optimization log for the details of each step in these parent-child rule optimization processes.

The final optimized set of SD-Rules includes 111 unique parent/self/child SD-Rules. They are sorted in descending order of precision (= relevant retrieved No. / retrieved No.) and then by retrieved No. rate. The top 82 SD-Rules are used as the optimized SD-Rule set, which reaches 95.00% system (accumulated) precision and 95.26% system (accumulated) recall, with a system performance of 1.9026. The total valid instance number is 50814.

-- Total line no: 171
-- Total comment no: 60
-- Total Sd-Rule no: 111
---------------------------------------
-- Optimum SD-Rules: 82|63.14%|331|209|122|0|$|noun|ist$|noun|2013|ORG_RULE|SELF|95.00%|95.26%|1.9026|48403|50949
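
The system figures on the "Optimum SD-Rules" line can be reproduced from its last two fields. The sketch below rests on assumptions inferred from the reported numbers rather than from the program source: the last two fields are taken to be the accumulated relevant retrieved No. (48403) and retrieved No. (50949), recall is taken to be relevant retrieved No. divided by the total valid instance number (50814), and system performance is taken to be precision plus recall. The function name is hypothetical.

```python
def system_metrics(relevant_retrieved, retrieved, total_valid):
    """Return accumulated precision, recall, and their sum as fractions."""
    precision = relevant_retrieved / retrieved
    recall = relevant_retrieved / total_valid  # assumed definition of recall
    performance = precision + recall           # assumed definition of performance
    return precision, recall, performance


precision, recall, performance = system_metrics(48403, 50949, 50814)
print(f"{precision:.2%} {recall:.2%} {performance:.4f}")
# prints: 95.00% 95.26% 1.9026
```

Under these assumptions the printed values agree with the 95.00%|95.26%|1.9026 fields reported for rank 82.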

IV. Post-Process

Generate SD-Rule trie from this 82/111 optimized set (TBD).