Lexical Tools

Local Optimization - Evaluate Parent rules and their Child rules

I. Find all candidate child rules for parent rules

  • DIR: ${SUFFIXD_DIR}
  • Inputs:
    • Prepare directory:
      shell> cd ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/
      shell> mkdir decompose.40.25 (40: min. local occurrence rate, 25: min. local cover-recall rate)
      shell> ln -sf ./decompose.40.25 decompose
    • Get all SD-pairs (corpus)
      ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/decompose/sdPairs.data
      shell> sort -u ../../../data/suffixD.yesNo.data > ./suffixD.yesNo.data.uSort
      shell> flds 1,2,4,5,7 ./suffixD.yesNo.data.uSort > suffixD.yesNo.data.uSort.1.2.4.5.7
      shell> ln -sf ./suffixD.yesNo.data.uSort.1.2.4.5.7 sdPairs.data
    • Decompose parent's rules one-by-one:
      ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/decompose/sdRule.data
      • copy from ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/${YEAR}/sdRules.data.${YEAR}.relation.children.only.rpt .
      • Remove a line if its rule is both a child and a parent rule at the same time (|CHILD => in the first part of the relationship)
      • Change the format to suffix-1|pos-1|suffix-2|pos-2 by removing the rest of the line
      • Add esis$|noun|ic$|adj if this rule is not there.
      • Total: 19 candidate parent SD-Rules to be evaluated. These rules were evaluated previously (and may not need to be re-evaluated).
      • Run the program (as described below) for each parent SD-Rule, one at a time, by commenting out (#) the rest.
      • The results might differ slightly between years, but the principle is the same.
    • Also, new rules and their child rules need to be evaluated.
      • Add all new rules to ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/decompose/sdRule.data
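The sdPairs.data preparation above (sort -u followed by flds 1,2,4,5,7) can be sketched in Python as follows. This is only an illustration under two assumptions: that the data lines are "|"-delimited, and that flds selects 1-based fields. The sample lines are hypothetical, not real corpus data.

```python
def unique_sorted(lines):
    """Mimic `sort -u`: drop duplicate lines, then sort.

    Note: Python sorts by code point; `sort` may order differently
    depending on the locale.
    """
    return sorted(set(lines))


def select_fields(line, fields=(1, 2, 4, 5, 7), sep="|"):
    """Keep only the given 1-based fields of a delimited line (assumed flds behavior)."""
    parts = line.split(sep)
    return sep.join(parts[i - 1] for i in fields)


# Hypothetical 7-field input lines (placeholders, not real SD-pair data).
raw = [
    "b1|p1|x|b2|p2|y|yes",
    "a1|p1|x|a2|p2|y|no",
    "b1|p1|x|b2|p2|y|yes",  # duplicate, removed by unique_sorted
]

sd_pairs = [select_fields(line) for line in unique_sorted(raw)]
print(sd_pairs)
```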

    • Simplified method after 2020+
      The parent-child evaluation and optimization can be simplified as follows:
      • Use the parent rule from the previous year (./dataR/SdRuleCheck/sdRule.data)
      • Run GetSdRule ${YEAR} 7 as described below to get all good candidate child rules.
        • Program:
          shell> cd ${SUFFIXD_DIR}/bin
          shell> GetSdRule ${YEAR}
          7
          40 (min. occurrence rate - for decompose)
          => Need to have enough coverage for further decomposition on child rules
          25 (35) (min. coverage rate - for candidate child)
          => Need to have enough coverage to be a qualified child rule
        • Outputs:
          • sdRules.decompose.out

            A child rule must have a higher accuracy rate (precision) than the root parent rule and meet the min. coverage rate (recall, default 25%). Manually search the output file sdRules.decompose.out for "<= Candidate". These candidates are child rules that match the following criteria:

            • the accuracy rate (precision) is higher than parent-rule
            • the coverage rate (recall) is higher than 25% (or the specified number)
          • shell>mv sdRules.decompose.out sdRules.decompose.out.${NO}.${RULE}
          • such as shell>mv sdRules.decompose.out sdRules.decompose.out.1.X-ally
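The two "<= Candidate" criteria above can be written as a small check. This is a minimal sketch, not the GetSdRule implementation: the names RuleStats and is_candidate are hypothetical, the recall values are made up for illustration, and treating the minimum coverage as inclusive (>=) is an assumption. The precision values 99.08% and 99.95% are taken from the X-ally example on this page.

```python
from dataclasses import dataclass


@dataclass
class RuleStats:
    rule: str
    precision: float  # accuracy rate, in percent
    recall: float     # local coverage rate, in percent


def is_candidate(child, parent, min_recall=25.0):
    """A child qualifies if it is more precise than the parent
    and reaches the minimum coverage rate (inclusive bound assumed)."""
    return child.precision > parent.precision and child.recall >= min_recall


parent = RuleStats("$|adj|ally$|adv", 99.08, 100.0)      # recall is hypothetical
child = RuleStats("c$|adj|cally$|adv", 99.95, 90.0)      # recall is hypothetical
rare_child = RuleStats("x$|adj|xally$|adv", 99.99, 10.0)  # fully hypothetical rule

print(is_candidate(child, parent))
print(is_candidate(rare_child, parent))
```

The second rule is rejected even though it is more precise, because it covers too little of the parent rule to be worth a further decomposition.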
      • Compare the precision and recall rates to those of the root parent rule. The sum (F1) of a good child rule must be greater than (or close to, within 5% of) the parent rule's, so that the overall F1 can be better (the overall F1 is obtained by running GetSdRule ${YEAR} 1).
      • The evaluation of parent-child rules should be consistent with the past. So, compare to past results and only evaluate SD-Rules with the same or better F1.
      • Use the above two criteria to evaluate new parent rules and those that need to be evaluated.
      • We used this simplified method in 2021 to evaluate 19 parent rules and 17 new rules to reach the optimum SD-Rule set.
  • Continue to the next step to evaluate parent-child rules (by replacing the parent rule with child rules) and repeat the whole process for all parent rules.
  • Update the optimal log while going through this process.
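The comparison of the precision + recall sum between a child rule and its parent can be sketched as below. The function names and all numbers are hypothetical. "Within 5%" is read here as a relative tolerance on the parent's sum, which is only one possible interpretation. Note that the sum used on this page (called F1 here) is precision plus recall, consistent with 95.12% + 93.45% = 1.8857 in the results, not the harmonic mean.

```python
def performance(precision_pct, recall_pct):
    """Sum of precision and recall as fractions (the page's 'F1' / system performance)."""
    return precision_pct / 100.0 + recall_pct / 100.0


def is_good_child(child, parent, tolerance=0.05):
    """child and parent are (precision %, recall %) pairs.

    Accept the child if its sum is greater than, or within the
    relative tolerance of, the parent's sum (assumed reading of 'within 5%').
    """
    return performance(*child) >= performance(*parent) * (1.0 - tolerance)


parent = (90.0, 80.0)      # hypothetical: sum 1.70
close_child = (95.0, 70.0)  # hypothetical: sum 1.65, within 5% of 1.70
weak_child = (96.0, 40.0)   # hypothetical: sum 1.36, too low

print(is_good_child(close_child, parent))
print(is_good_child(weak_child, parent))
```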

II. Replace parent rules by selected candidate child SD-Rules for optimized set

  • DIR: ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesOptimum/
    • Create a new directory
      shell>mkdir ${NO}.${RULE}
      shell>mkdir 01.X-ally
  • Inputs:
    • Update sdRules.stats.in by replacing the 1st parent rule with its candidate child rules
      shell>cd 01.X-ally
      shell>cp ../00.baseline/sdRules.stats.in .
      => Copy all candidate child rules from ../../SdRulesCheck/decompose/sdRules.decompose.out.1.X-ally to this file under the associated parent rule
      Update the following in sdRules.stats.in:
      • Find the rank (31) of the associated parent rule from the baseline by its precision rate (99.08%)
      • Change the rank (1st field) to the original rank followed by the child level, e.g., 311 for original rank 31 and child level 1
      • Move 9th field - accuracy rate (precision) to 2nd field
      • Add 0 to 6th field (tbd no.)
      • Change fields 11~13 to ${YEAR}|DECOMPOSE|CHILD
      • Comment out (#) the parent/child rules that are not under test
      • The newly edited file looks like:
        #31|99.08%|2075|2056|19|0|$|adj|ally$|adv|2015|ORG_FACT|PARENT
        311|99.95%|1957|1956|1|0|c$|adj|cally$|adv|2021|DECOMPOSE|CHILD
        #312|99.95%|1952|1951|1|0|ic$|adj|ically$|adv|2021|DECOMPOSE|CHILD
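The edited lines above can be sanity-checked with a short script. This is a sketch under assumptions: fields 2, 6, and 11 to 13 are named as described in the steps above, while the names used for fields 3 to 5 (retrieved, relevant, wrong) are inferred from the numbers (for example 2056/2075 = 99.08% and 2075 - 2056 = 19) and are not confirmed by this page. The helper names are hypothetical.

```python
LINES = [
    "#31|99.08%|2075|2056|19|0|$|adj|ally$|adv|2015|ORG_FACT|PARENT",
    "311|99.95%|1957|1956|1|0|c$|adj|cally$|adv|2021|DECOMPOSE|CHILD",
    "#312|99.95%|1952|1951|1|0|ic$|adj|ically$|adv|2021|DECOMPOSE|CHILD",
]


def parse_stats_line(line):
    """Split one sdRules.stats.in line into named fields."""
    active = not line.startswith("#")  # '#' marks a rule that is not under test
    f = line.lstrip("#").split("|")
    return {
        "active": active,
        "rank": f[0],
        "precision": f[1],
        "retrieved": int(f[2]),  # assumed meaning
        "relevant": int(f[3]),   # assumed meaning
        "wrong": int(f[4]),      # assumed meaning
        "tbd": int(f[5]),
        "rule": "|".join(f[6:10]),
        "year": f[10],
        "source": f[11],
        "relation": f[12],
    }


def child_rank(parent_rank, child_level):
    """Parent rank followed by the child level, e.g. 31 and 1 give '311'."""
    return f"{parent_rank}{child_level}"


def is_consistent(rec):
    """Check that the precision and wrong-count fields agree with the counts."""
    pct = f"{100.0 * rec['relevant'] / rec['retrieved']:.2f}%"
    return pct == rec["precision"] and rec["retrieved"] - rec["relevant"] == rec["wrong"]


records = [parse_stats_line(line) for line in LINES]
for rec in records:
    print(rec["rank"], rec["rule"], rec["active"], is_consistent(rec))
```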
        
  • Program - Get the optimal Set:
    shell> mv sdRules.stats.in sdRules.stats.in.01.1
    shell> ln -sf ./sdRules.stats.in.01.1 sdRules.stats.in


    shell> cd ${SUFFIXD_DIR}/bin
    shell> GetSdRule ${YEAR}
    1
    others
    01.X-ally
    54347 <= total Yes from baseline

  • Outputs directory:
    • ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesOptimum/01.X-ally
    -- Optimum SD-Rules: 92|63.14%|331|209|122|0|$|noun|ist$|noun|2013|ORG_RULE|SELF|95.05%|94.26%|1.8931|50371|52993
    

    Move the HTML file:

    • shell> mv sdRules.stats.out.html sdRules.stats.out.01.1.html
    • shell> cp -p sdRules.stats.out.01.1.html ${WEB_LVG}/docs/designDoc/UDF/derivations/SD-Rules-Opti/Ex-${YEAR}/.
    • Update the optimal-log file
  • Repeat this process for all generations of candidate child rules of the same parent rule.
  • Repeat this process for all parent rules (using the best sdRules.stats.in)
  • Refer to the result of optimization log for the optimization details.

III. Results

Please refer to the result of optimization log for details of each step of the parent-child rule optimization process.

The final optimized set of SD-Rules includes 148 unique parent/self/child SD-Rules. They are sorted in descending order of precision (= relevant retrieved No. / retrieved No.) and then by retrieved No. rate. The top 104 SD-Rules are used as the optimized SD-Rule set to cover 95.00% system (accumulated) precision (the summary line below reports 95.12%) and 93.45% system (accumulated) recall, with a system performance of 1.8857. The total number of valid instances is 54347.
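One way to picture how a "top N" cut-off can follow from the sorting described above is sketched here. This is an assumption about the selection logic, not the documented behavior of GetSdRule: rules are sorted by descending precision (ties broken by retrieved count), and rules are added while the accumulated precision stays at or above the target. The function name and the toy rules are hypothetical.

```python
def optimum_cut(rules, min_precision=95.0):
    """rules: list of (name, retrieved, relevant) tuples.

    Returns (number of top rules kept, accumulated relevant, accumulated retrieved).
    """
    # Sort by precision (relevant / retrieved) descending, then retrieved descending.
    ordered = sorted(rules, key=lambda r: (-r[2] / r[1], -r[1]))
    count = 0
    acc_retrieved = 0
    acc_relevant = 0
    for _name, retrieved, relevant in ordered:
        new_retrieved = acc_retrieved + retrieved
        new_relevant = acc_relevant + relevant
        # Accumulated precision can only fall as less precise rules are added,
        # so stop at the first rule that pushes it below the target.
        if 100.0 * new_relevant / new_retrieved < min_precision:
            break
        acc_retrieved, acc_relevant = new_retrieved, new_relevant
        count += 1
    return count, acc_relevant, acc_retrieved


# Toy rules (hypothetical numbers).
toy_rules = [
    ("C", 100, 80),
    ("A", 100, 100),
    ("D", 50, 20),
    ("B", 100, 95),
]

result = optimum_cut(toy_rules)
print(result)
```

With these toy numbers, rules A and B are kept (195/200 = 97.5% accumulated precision); adding C would drop the accumulated precision to about 91.7%.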

- Total line no: 197
-- Total comment no: 49
-- Total Sd-Rule no: 148
---------------------------------------
-- Optimum SD-Rules: 104|65.85%|41|27|14|0|ctic$|adj|xis$|noun|2021|ORG_FACT|SELF|95.12%|93.45%|1.8857|50857|53465
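The figures in the summary above can be cross-checked with a few lines of Python. This is a sketch under assumptions: the last two fields of the "Optimum SD-Rules" line are taken to be the accumulated relevant-retrieved and retrieved counts, and the system performance is taken to be the sum of the accumulated precision and recall. Both readings are inferred from the numbers, and the field names are hypothetical.

```python
SUMMARY = ("104|65.85%|41|27|14|0|ctic$|adj|xis$|noun|2021|ORG_FACT|SELF"
           "|95.12%|93.45%|1.8857|50857|53465")


def parse_summary(line):
    """Extract the accumulated (system) figures from an 'Optimum SD-Rules' line."""
    f = line.split("|")
    return {
        "rank": int(f[0]),
        "acc_precision": float(f[13].rstrip("%")),
        "acc_recall": float(f[14].rstrip("%")),
        "performance": f[15],
        "acc_relevant": int(f[16]),   # assumed meaning
        "acc_retrieved": int(f[17]),  # assumed meaning
    }


s = parse_summary(SUMMARY)

# Accumulated precision recomputed from the two counts.
precision_text = f"{100.0 * s['acc_relevant'] / s['acc_retrieved']:.2f}%"
# System performance recomputed as precision + recall (as fractions).
performance_text = f"{(s['acc_precision'] + s['acc_recall']) / 100.0:.4f}"
# Line counts: total lines minus comment lines gives the number of SD-Rules.
rule_count = 197 - 49

print(precision_text, performance_text, rule_count)
```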

IV. Post-Process

Generate the SD-Rule trie from this optimized set (104 of 148 rules) for Lexical Tools SD-Rule generation.

  • cd ./dataR
  • cp ./35.ity-y/sdRules.stats.out ./35.ity-y/sdRules.stats.out.opti
  • ln -sf ./SdRulesOptimum/35.ity-y/sdRules.stats.out.opti sdRules.stats.out

  • cd ./bin
  • 8
  • 104 (the good rules)

  • ./dataR/dm.rul.2021.104 (the trie file for Lvg)