Lexical Tools

Local Optimization - Evaluate Parent rules and their Child rules

I. Find all candidate child rules for 15 parent rules

DIR: ${SUFFIXD_DIR}
Inputs:
- Prepare directory:
  shell> cd ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/
  shell> mkdir decompose.40.25 (40: min. local occurrence rate, 25: min. local coverage rate)
  shell> ln -sf ./decompose.40.25 decompose
- Get all SD-pairs (corpus)
  ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/decompose/sdPairs.data
  shell> sort -u ../../../data/suffixD.yesNo.data > ./suffixD.yesNo.data.uSort
  shell> flds 1,2,4,5,7 ./suffixD.yesNo.data.uSort > suffixD.yesNo.data.uSort.1.2.4.5.7
  shell> ln -sf ./suffixD.yesNo.data.uSort.1.2.4.5.7 sdPairs.data
- Decompose the parent rules one by one:
  ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/decompose/sdRule.data
  => Add all 15 parent SD-Rules to this file:
  - copy from ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesCheck/${YEAR}/sdRules.data.2016.relation.children.only.rpt
  - remove a line if the rule is both a child rule and a parent rule (|CHILD)
  - change the format to suffix-1|pos-1|suffix-2|pos-2 and remove the rest of the line
  - go through the rules one at a time by commenting out (#) the other 14
  - Add one linguist-suggested parent rule: esis$|noun|ic$|adj
    Test whether the results of this statistical system agree with linguistic knowledge.
    Suggested by Lynn:
    - nouns ending in -esis$ have related adjectives ending in -etic$ (not -ic$)
    - other nouns end in -emesis$, which is related to -emetic$ (not -emic$)
    - the correct SD-Rule should be genesis$|noun|genic$|adj
      - Dorland's does provide justification for pairing -genesis Ns with -genic adj's (p.763):
      - -genesis: a word termination used to denote the production, formation, or development of the object or state indicated by the word stem to which it is affixed, as biogenesis, gametogenesis and pathogenesis.
      - -genic: a word termination meaning producing, or productive of
    - Total 16 candidate parent SD-Rules to be evaluated.
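The SD-pair preparation step above (a unique sort followed by keeping fields 1, 2, 4, 5 and 7) can be sketched in Python. This is only an illustration: it assumes that `flds` selects 1-based, '|'-delimited fields, which this document does not state, and the sample records are placeholders because the layout of suffixD.yesNo.data is not shown here.

```python
def unique_sorted(lines):
    """Mimic `sort -u`: drop duplicate lines and return the rest sorted."""
    return sorted(set(lines))


def select_fields(line, indexes, sep="|"):
    """Keep the given 1-based fields of a delimited line (assumed `flds` behavior)."""
    fields = line.split(sep)
    return sep.join(fields[i - 1] for i in indexes)


# Placeholder records: the real field contents are not described in this document.
raw = [
    "f1|f2|f3|f4|f5|f6|f7",
    "a1|a2|a3|a4|a5|a6|a7",
    "f1|f2|f3|f4|f5|f6|f7",  # duplicate, removed by the unique sort
]

sd_pairs = [select_fields(line, (1, 2, 4, 5, 7)) for line in unique_sorted(raw)]
print(sd_pairs)
```

Note that Python sorts by code point, so the order can differ from `sort -u` under a non-C locale; the set of lines is the same.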
Program:
shell> cd ${SUFFIXD_DIR}/bin
shell> GetSdRule ${YEAR}
7
40 (min. occurrence rate - for decompose)
=> Need to have enough coverage for further decomposition on child rules
25 (35) (min. coverage rate - for candidate child)
=> Need to have enough coverage to be a qualified child rule
Outputs:
- sdRules.decompose.out
  A child rule must have a higher accuracy rate (precision) than the root parent rule and meet the min. coverage rate (recall). Manually look through the output file sdRules.decompose.out and search for "<= Candidate"; these candidates are child rules that match the following criteria:
  - the accuracy rate (precision) is higher than that of the parent rule
  - the coverage rate (recall) is higher than 25% (or the specified number)
- Rename the output file with the parent rule's number and name (no.rule):
  shell> mv sdRules.decompose.out sdRules.decompose.out.no.rule
  such as:
  shell> mv sdRules.decompose.out sdRules.decompose.out.1.X-ally
Repeat this process for all 16 parent rules.
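The two candidate criteria used in this section can be expressed as a small predicate. This is a sketch, not the GetSdRule implementation: the function name and arguments are hypothetical, a child at exactly the minimum coverage is assumed to qualify, and the coverage rate in the example is assumed to be the child's instance count divided by the parent's, which this document does not define.

```python
def is_candidate(child_precision, parent_precision, coverage, min_coverage=25.0):
    """Return True when a child rule meets both criteria (all values in percent).

    Assumption: a child that exactly meets the minimum coverage rate qualifies.
    """
    return child_precision > parent_precision and coverage >= min_coverage


# X-ally numbers from section II: parent $|adj|ally$|adv has 2072 instances at
# 99.08% precision; child c$|adj|cally$|adv has 1954 instances at 99.95%.
# Assumption: coverage = child instances / parent instances.
coverage = 100 * 1954 / 2072
print(is_candidate(99.95, 99.08, coverage))
```

Raising `min_coverage` to 35 corresponds to the alternative value shown in parentheses in the program input above.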

II. Replace the 16 parent rules with the selected candidate child SD-Rules for the optimized set

DIR: ${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesOptimum/
- Create a new directory
  shell>mkdir 01.X-ally
Inputs:
- Update sdRules.stats.in by replacing the 1st parent rule with its candidate child rules
  shell>cd 01.X-ally
  shell>cp ../00.baseline/sdRules.stats.in .
  => Copy all candidate child rules from ../../SdRulesCheck/decompose/sdRules.decompose.out.1.X-ally to this file
  Update the following:
  - Change the rank (1st field) to 251 (original rank + child level, i.e. 25 followed by 1)
  - Move the 9th field, the accuracy rate (precision), to the 2nd field
  - Add 0 as the 6th field (tbd no.)
  - Change fields 11~13 to ${YEAR}|DECOMPOSE|CHILD
  - Comment out (#) the parent/child rules that are not under test
  - The new edited file looks like:
```
#25|99.08%|2072|2053|19|0|$|adj|ally$|adv|2015|ORG_FACT|PARENT
251|99.95%|1954|1953|1|0|c$|adj|cally$|adv|2016|DECOMPOSE|CHILD
#252|99.95%|1949|1948|1|0|ic$|adj|ically$|adv|2016|DECOMPOSE|CHILD
```
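A quick way to sanity-check an edited sdRules.stats.in is to parse each line and recompute the precision. The sketch below is illustrative only: the names `total`, `relevant` and `irrelevant` for fields 3 to 5 are inferred from the sample lines (2072 = 2053 + 19, and 2053/2072 rounds to 99.08%), not taken from a format specification, and `parse_stat_line` is a hypothetical helper.

```python
from typing import NamedTuple


class SdRuleStat(NamedTuple):
    active: bool      # False when the line is commented out with '#'
    rank: int
    precision: float  # percent, as written in the file
    total: int        # inferred meaning: all instances matched by the rule
    relevant: int     # inferred meaning: valid instances
    irrelevant: int   # inferred meaning: invalid instances
    tbd: int          # tbd no. (6th field)
    rule: str         # suffix-1|pos-1|suffix-2|pos-2
    tags: tuple       # year, source, relation (fields 11~13)


def parse_stat_line(line):
    """Split one 13-field stats line into a SdRuleStat record."""
    active = not line.startswith("#")
    f = line.lstrip("#").strip().split("|")
    return SdRuleStat(active, int(f[0]), float(f[1].rstrip("%")),
                      int(f[2]), int(f[3]), int(f[4]), int(f[5]),
                      "|".join(f[6:10]), tuple(f[10:13]))


lines = [
    "#25|99.08%|2072|2053|19|0|$|adj|ally$|adv|2015|ORG_FACT|PARENT",
    "251|99.95%|1954|1953|1|0|c$|adj|cally$|adv|2016|DECOMPOSE|CHILD",
    "#252|99.95%|1949|1948|1|0|ic$|adj|ically$|adv|2016|DECOMPOSE|CHILD",
]
rules = [parse_stat_line(line) for line in lines]
for r in rules:
    recomputed = 100 * r.relevant / r.total
    print(r.rank, r.rule, "active" if r.active else "commented out", f"{recomputed:.2f}%")
```

Only the uncommented line (rank 251 here) takes part in the run; the other two stay in the file for reference.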
Program - Get the optimal Set:
shell> cd ${SUFFIXD_DIR}/bin
shell> GetSdRule ${YEAR}
1
others
01.X-ally
50814 <= from baseline

Outputs directory:

${SUFFIXD_DIR}/data/${YEAR}/dataR/SdRulesOptimum/01.X-ally

-- Optimum SD-Rules: 81|68.29%|123|84|39|0|ant$|adj|ate$|verb|2013|ORG_RULE|SELF|95.18%|94.65%|1.8983|48095|50531

Repeat this process for all generations of candidate child rules of the same parent rule.
- shell> mv sdRules.stats.in sdRules.stats.in.01.1
- shell> mv sdRules.stats.out.html sdRules.stats.out.01.1.html
Repeat this process for all parent rules.
See the optimization log for the details of each optimization step.



III. Results
Please refer to the optimization log for the details of each step in the optimization of these parent-child rules.


The final optimized set of SD-Rules includes 111 unique parent/self/child SD-Rules.
They are sorted in descending order of precision (= relevant retrieved no. / retrieved no.) and then by retrieved no. rate. The top 82 SD-Rules are used as the optimized SD-Rule set, which reaches 95.00% system (accumulated) precision and 95.26% system (accumulated) recall, with a system performance of 1.9026. The total number of valid instances is 50814.
-- Total line no: 171
-- Total comment no: 60
-- Total Sd-Rule no: 111
---------------------------------------
-- Optimum SD-Rules: 82|63.14%|331|209|122|0|$|noun|ist$|noun|2013|ORG_RULE|SELF|95.00%|95.26%|1.9026|48403|50949
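The accumulated figures on the optimum line can be reproduced from its last two fields and the total number of valid instances. The sketch below rests on an inference from the numbers rather than a stated definition: the last two fields are read as the accumulated relevant retrieved count and the accumulated retrieved count, and the system performance is read as precision plus recall (0.9500 + 0.9526 = 1.9026). The function name is hypothetical.

```python
def system_metrics(relevant_retrieved, retrieved, total_valid):
    """Return (precision, recall, performance) for accumulated counts.

    Assumption: performance is precision + recall, as the numbers suggest.
    """
    precision = relevant_retrieved / retrieved
    recall = relevant_retrieved / total_valid
    return precision, recall, precision + recall


line = ("82|63.14%|331|209|122|0|$|noun|ist$|noun|2013|ORG_RULE|SELF"
        "|95.00%|95.26%|1.9026|48403|50949")
fields = line.split("|")
relevant_retrieved, retrieved = int(fields[-2]), int(fields[-1])

# 50814 is the total valid instance number reported in this section.
p, r, perf = system_metrics(relevant_retrieved, retrieved, 50814)
print(f"{p:.2%} {r:.2%} {perf:.4f}")
```

The same calculation applied to the optimum line in section II (48095 and 50531) gives 95.18%, 94.65% and 1.8983, matching that line as well.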


IV. Post-Process
Generate SD-Rule trie from this 82/111 optimized set (TBD).