Lexical Tools

Local Optimization - Evaluate Parent rules and their Child rules

I. Find all candidate child rules for 15 parent rules

  • DIR: ${SUFFIXD_DIR}
  • Inputs:
    • Prepare directory:
      shell> cd ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/
      shell> mkdir decompose.40.25 (40: min. local occurrence rate, 25: min. local coverage rate)
      shell> ln -sf ./decompose.40.25 decompose
    • Get all SD-pairs (corpus)
      ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/decompose/sdPairs.data
      shell> sort -u ../../../data/suffixD.yesNo.data > ./suffixD.yesNo.data.uSort
      shell> flds 1,2,4,5,7 ./suffixD.yesNo.data.uSort > suffixD.yesNo.data.uSort.1.2.4.5.7
      shell> ln -sf ./suffixD.yesNo.data.uSort.1.2.4.5.7 sdPairs.data
    • Decompose the parent rules one by one:
      ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/decompose/sdRule.data
      => Add all 15 parent SD-Rules to this file:
      • copy them from ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/${YEAR}/sdRules.data.2016.relation.children.only.rpt
      • remove a line if its rule is both a child rule and a parent rule (|CHILD)
      • change the format to suffix-1|pos-1|suffix-2|pos-2 and remove the rest of the line
      • process the rules one at a time by commenting out (#) the other 14

      • Add one linguist-suggested parent rule: esis$|noun|ic$|adj
        Test whether the results of this statistical system agree with linguistic knowledge.
        Suggested by Lynn:
        • nouns ending in -esis$ have related adjectives ending in -etic$ (not -ic$)
        • other nouns end in -emesis$, which relates to -emetic$ (not -emic$)
        • the correct SD-Rule should be genesis$|noun|genic$|adj
          • Dorland's does provide justification for pairing -genesis nouns with -genic adjectives (p.763):
          • -genesis: a word termination used to denote the production, formation, or development of the object or state indicated by the word stem to which it is affixed, as biogenesis, gametogenesis and pathogenesis.
          • -genic: a word termination meaning producing, or productive of

        • A total of 16 candidate parent SD-Rules are to be evaluated.
  • Program:
    shell> cd ${SUFFIXD_DIR}/bin
    shell> GetSdRule ${YEAR}
    7
    40 (min. occurrence rate - for decompose)
    => Need to have enough coverage for further decomposition on child rules
    25 (35) (min. coverage rate - for candidate child)
    => Need to have enough coverage to be a qualified child rule
  • Outputs:
    • sdRules.decompose.out

      A child rule must have a higher accuracy rate (precision) than the root parent rule and meet the min. coverage rate (recall). Manually look through the output file sdRules.decompose.out and search for "<= Candidate". These candidates are the child rules that match the following criteria:

      • the accuracy rate (precision) is higher than that of the parent rule
      • the coverage rate (recall) is higher than 25% (or the specified number)
    • Rename the output file with the parent rule number and name (no.rule):
      shell>mv sdRules.decompose.out sdRules.decompose.out.no.rule
      such as shell>mv sdRules.decompose.out sdRules.decompose.out.1.X-ally
  • Repeat this process for all 16 parent rules.
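
The candidate test described under Outputs can be sketched as follows. This is a minimal illustration, not the GetSdRule implementation: the function name `is_candidate_child`, its parameters, and the sample coverage values are hypothetical, and treating a coverage rate exactly at the minimum as qualifying is an assumption.

```python
def is_candidate_child(child_precision, child_coverage,
                       parent_precision, min_coverage=25.0):
    """Return True if a child rule qualifies as a candidate.

    All rates are percentages. Hypothetical helper: it only mirrors the
    two criteria listed above, not the actual GetSdRule program.
    """
    more_precise = child_precision > parent_precision
    # Assumption: a coverage rate equal to the minimum counts as meeting it.
    enough_coverage = child_coverage >= min_coverage
    return more_precise and enough_coverage


# Precision values come from the X-ally example in section II (parent 99.08%,
# child c$|adj|cally$|adv 99.95%); the coverage values are made up.
print(is_candidate_child(99.95, 90.0, parent_precision=99.08))  # True
print(is_candidate_child(99.95, 10.0, parent_precision=99.08))  # False
print(is_candidate_child(98.00, 90.0, parent_precision=99.08))  # False
```

Lines flagged with "<= Candidate" in the output file are the ones expected to pass both checks.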

II. Replace the 16 parent rules with selected candidate child SD-Rules for the optimized set

  • DIR: ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesOptimum/
    • Create a new directory
      shell>mkdir 01.X-ally
  • Inputs:
    • Update sdRules.stats.in by replacing the 1st parent rule with its candidate child rules
      shell>cd 01.X-ally
      shell>cp ../00.baseline/sdRules.stats.in .
      => Copy all candidate child rules from ../../SdRulesCheck/decompose/sdRules.decompose.out.1.X-ally to this file
      Update the following:
      • Change the rank (1st field) to 251 (original rank + child level)
      • Move the 9th field, the accuracy rate (precision), to the 2nd field
      • Add 0 as the 6th field (tbd no.)
      • Change fields 11~13 to ${YEAR}|DECOMPOSE|CHILD
      • Comment out (#) the parent/child rules that are not under test
      • The newly edited file looks like:
        #25|99.08%|2072|2053|19|0|$|adj|ally$|adv|2015|ORG_FACT|PARENT
        251|99.95%|1954|1953|1|0|c$|adj|cally$|adv|2016|DECOMPOSE|CHILD
        #252|99.95%|1949|1948|1|0|ic$|adj|ically$|adv|2016|DECOMPOSE|CHILD
        
  • Program - Get the optimal Set:
    shell> cd ${SUFFIXD_DIR}/bin
    shell> GetSdRule ${YEAR}
    1
    others
    01.X-ally
    50814 <= from baseline
  • Outputs directory:
    • ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesOptimum/01.X-ally
    -- Optimum SD-Rules: 81|68.29%|123|84|39|0|ant$|adj|ate$|verb|2013|ORG_RULE|SELF|95.18%|94.65%|1.8983|48095|50531
    
  • Repeat this process for all generations of candidate child rules of the same parent rule.
    • shell> mv sdRules.stats.in sdRules.stats.in.01.1
    • shell> mv sdRules.stats.out.html sdRules.stats.out.01.1.html
  • Repeat this process for all parent rules.
  • See the optimization log for the details of each optimization step.
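
The sdRules.stats.in lines shown under Inputs can be checked mechanically before GetSdRule is run. The sketch below is an illustration only: the field names are hypothetical labels, and reading fields 3 to 5 as retrieved, relevant, and irrelevant counts is an assumption inferred from the sample lines (for example, 1953/1954 = 99.95%).

```python
from collections import namedtuple

# Hypothetical labels for the 13 pipe-delimited fields of a stats line.
SdRule = namedtuple(
    "SdRule",
    "rank precision retrieved relevant irrelevant tbd "
    "suffix1 pos1 suffix2 pos2 year source rule_type",
)


def parse_stats_line(line):
    """Parse one line into (rule, active); a leading '#' means commented out."""
    text = line.strip()
    active = not text.startswith("#")
    fields = text.lstrip("#").split("|")
    if len(fields) != 13:
        raise ValueError("expected 13 fields, got %d" % len(fields))
    rule = SdRule(
        int(fields[0]), float(fields[1].rstrip("%")),
        int(fields[2]), int(fields[3]), int(fields[4]), int(fields[5]),
        *fields[6:]
    )
    return rule, active


def precision_matches(rule):
    """Check the stated precision against relevant / retrieved (2 decimals)."""
    computed = 100.0 * rule.relevant / rule.retrieved
    return abs(computed - rule.precision) <= 0.005


lines = [
    "#25|99.08%|2072|2053|19|0|$|adj|ally$|adv|2015|ORG_FACT|PARENT",
    "251|99.95%|1954|1953|1|0|c$|adj|cally$|adv|2016|DECOMPOSE|CHILD",
    "#252|99.95%|1949|1948|1|0|ic$|adj|ically$|adv|2016|DECOMPOSE|CHILD",
]
parsed = [parse_stats_line(line) for line in lines]
active_ranks = [rule.rank for rule, active in parsed if active]
print(active_ranks)  # [251]
print(all(precision_matches(rule) for rule, _ in parsed))  # True
```

A check like this catches a misplaced field after the manual edits, since a shifted column either breaks the field count or makes the stated precision disagree with the counts.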

III. Results

Please refer to the optimization log for the details of each step in the parent-child rule optimization process.

The final optimized set contains 111 unique parent/self/child SD-Rules. They are sorted in descending order of precision (= relevant retrieved No. / retrieved No.) and then by retrieved No. rate. The top 82 SD-Rules are used as the optimized SD-Rule set, which reaches 95.00% system (accumulated) precision and 95.26% system (accumulated) recall with a system performance of 1.9026. The total number of valid instances is 50814.

-- Total line no: 171
-- Total comment no: 60
-- Total Sd-Rule no: 111
---------------------------------------
-- Optimum SD-Rules: 82|63.14%|331|209|122|0|$|noun|ist$|noun|2013|ORG_RULE|SELF|95.00%|95.26%|1.9026|48403|50949
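
The accumulated figures on the "Optimum SD-Rules" line can be reproduced from its last two fields and the total valid instance number. The sketch below assumes, based only on the arithmetic of the two summary lines in this document, that the last two fields are the accumulated relevant and accumulated retrieved counts and that system performance is precision plus recall. The function name `system_metrics` is hypothetical.

```python
def system_metrics(acc_relevant, acc_retrieved, total_valid):
    """Return (precision, recall, performance) as fractions.

    Assumption: performance is the sum of accumulated precision and recall,
    which matches the figures reported on the summary lines.
    """
    precision = acc_relevant / acc_retrieved
    recall = acc_relevant / total_valid
    return precision, recall, precision + recall


# Values from the final summary line and the total valid instance number.
p, r, perf = system_metrics(48403, 50949, 50814)
print("%.2f%% %.2f%% %.4f" % (100 * p, 100 * r, perf))  # 95.00% 95.26% 1.9026
```

The same calculation with the 01.X-ally summary line (48095 and 50531) gives 95.18%, 94.65% and 1.8983, matching that line as well.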

IV. Post-Process

Generate SD-Rule trie from this 82/111 optimized set (TBD).