Lexical Tools

Comparison of the Optimized Set between 2014 and 2021

I. New SD-Rules Evaluation Results:

The releases since 2014 that applied this approach to retrieve the optimized SD-Rule set are compared as follows:

Columns: Release, New SD-Rules, Baseline, Results, Notes
2014: First release (based on the 2013 SD-Rules)
  • Total candidate SD-pairs: 43,375
  • Total valid candidate SD-pairs (SD-Facts: relevant): 37,136
  • N/A (all SD-Rules are new in this release)
2015: Added 15 new SD-Rules to the previous release
  • Total candidate SD-pairs: 53,905
  • Total valid candidate SD-pairs (SD-Facts: relevant): 46,950
  • 2 are duplicates (child-rules of existing rules).
  • 11 (84.62%, 11/13) are evaluated as good rules in the optimized set
  • 2 (15.38%, 2/13) are bad rules
2016: Added 12 new SD-Rules to the previous release
  • Total candidate SD-pairs: 58,422
  • Total valid candidate SD-pairs: 50,814
  • 1 is a duplicate (of an existing rule).
  • 8 (72.73%, 8/11) are evaluated as good rules in the optimized set
  • 3 (27.27%, 3/11) are bad rules
2017: Added 11 new SD-Rules to the previous release
  • Total candidate SD-pairs: 59,850
  • Total valid candidate SD-pairs: 51,788
  • 1 is a duplicate (of an existing rule).
  • 6 (60.00%, 6/10) are evaluated as good rules in the optimized set
  • 4 (40.00%, 4/10) are bad rules
2020: Added 18 new SD-Rules to the previous release
  • Total candidate SD-pairs: 61,777
  • Total valid candidate SD-pairs: 53,440
  • 7 are duplicates (of existing rules).
  • 7 (63.64%, 7/11) are evaluated as good rules in the optimized set
  • 4 (36.36%, 4/11) are bad rules
2021: Proposed 21 new SD-Rules to add to the previous release
  • Total candidate SD-pairs: 63,712
  • Total valid candidate SD-pairs: 54,421
  • 3 are duplicates (of existing rules).
  • 12 (66.67%, 12/18) are evaluated as good rules in the optimized set
  • 6 (33.33%, 6/18) are bad rules
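
The percentages in the list above follow from simple counts: the rules evaluated in a release are the new rules minus the duplicates, and the good and bad shares are taken over that remainder. A minimal sketch of that arithmetic, with the counts copied from the list; the function name `evaluate` and the data layout are illustrative assumptions, not part of Lexical Tools:

```python
# Counts copied from the release list above: (new rules, duplicates, good, bad).
releases = {
    2015: (15, 2, 11, 2),
    2016: (12, 1, 8, 3),
    2017: (11, 1, 6, 4),
    2020: (18, 7, 7, 4),
    2021: (21, 3, 12, 6),
}


def evaluate(new_rules, duplicated, good, bad):
    """Return (evaluated, good %, bad %) for one release (hypothetical helper)."""
    evaluated = new_rules - duplicated  # duplicates are not evaluated
    if good + bad != evaluated:
        raise ValueError("good + bad must equal the number of non-duplicated rules")
    good_pct = round(100 * good / evaluated, 2)
    bad_pct = round(100 * bad / evaluated, 2)
    return evaluated, good_pct, bad_pct


for year, counts in releases.items():
    evaluated, good_pct, bad_pct = evaluate(*counts)
    print(f"{year}: {evaluated} evaluated, {good_pct:.2f}% good, {bad_pct:.2f}% bad")
```

The consistency check inside `evaluate` is a design choice for this sketch: it flags any release whose good and bad counts do not add up to the non-duplicated total.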

II. Comparison of SD-Rule set:

Columns: Year, Stats, Optimized Diagram
2014
  • Baseline Set (including parent-child rules): 107
  • Total Unique Rules: 96
  • Total Good Rules: 73
  • Total Valid SD-pairs (SD-Facts: Relevant): 42,552
  • Opti. System Precision: 95.30%
  • Opti. System Recall: 95.01%
  • Opti. System Performance: 1.9031
  • Cutoff Rule: ar$|adj|e$|noun
  • Optimized Set: 2014 Optimized Set
2015
  • Baseline Set (including parent-child rules): 120
  • Total Unique Rules: 101
  • Total Good Rules: 76
  • Total Valid SD-pairs (SD-Facts: Relevant): 46,950
  • Opti. System Precision: 95.22%
  • Opti. System Recall: 95.70%
  • Opti. System Performance: 1.9093
  • Cutoff Rule: ar$|adj|e$|noun
  • Optimized Set: 2015 Optimized Set
2016
  • Baseline Set (including parent-child rules): 132
  • Total Unique Rules: 111
  • Total Good Rules: 82
  • Total Valid SD-pairs (SD-Facts: Relevant): 50,814
  • Opti. System Precision: 95.00%
  • Opti. System Recall: 95.26%
  • Opti. System Performance: 1.9026
  • Cutoff Rule: $|noun|ist$|noun
  • Optimized Set: 2016 Optimized Set
2017
  • Baseline Set (including parent-child rules): 142
  • Total Unique Rules: 119
  • Total Good Rules: 86
  • Total Valid SD-pairs (SD-Facts: Relevant): 51,788
  • Opti. System Precision: 95.09%
  • Opti. System Recall: 94.92%
  • Opti. System Performance: 1.9001
  • Cutoff Rule: $|noun|ist$|noun
  • Optimized Set: 2017 Optimized Set
2020
  • Baseline Set (including parent-child rules): 153
  • Total Unique Rules: 130
  • Total Good Rules: 93
  • Total Valid SD-pairs (SD-Facts: Relevant): 53,440
  • Opti. System Precision: 95.00%
  • Opti. System Recall: 94.48%
  • Opti. System Performance: 1.8948
  • Cutoff Rule: ar$|adj|e$|noun
  • Optimized Set: 2020 Optimized Set
2021
  • Baseline Set (including parent-child rules): 170
  • Total Unique Rules: 148
  • Total Good Rules: 104
  • Total Valid SD-pairs (SD-Facts: Relevant): 54,421
  • Opti. System Precision: 95.12%
  • Opti. System Recall: 93.45%
  • Opti. System Performance: 1.8857
  • Cutoff Rule: ctic$|adj|xis$|noun
  • Optimized Set: 2021 Optimized Set
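
The "Opti. System Performance" values listed above appear to equal precision plus recall, with both expressed as fractions (for 2014, 0.9530 + 0.9501 = 1.9031). The source does not state this definition, so the sketch below treats it as an assumption inferred from the figures; the function name `performance` is hypothetical.

```python
# (precision %, recall %, listed performance) copied from the statistics above.
stats = {
    2014: (95.30, 95.01, 1.9031),
    2015: (95.22, 95.70, 1.9093),
    2016: (95.00, 95.26, 1.9026),
    2017: (95.09, 94.92, 1.9001),
    2020: (95.00, 94.48, 1.8948),
    2021: (95.12, 93.45, 1.8857),
}


def performance(precision_pct, recall_pct):
    """Assumed definition: precision + recall, both as fractions of 1."""
    return (precision_pct + recall_pct) / 100


for year, (precision_pct, recall_pct, listed) in stats.items():
    computed = performance(precision_pct, recall_pct)
    # Small differences are expected because the listed percentages are rounded.
    print(f"{year}: computed {computed:.4f}, listed {listed:.4f}")
```

Under this assumed definition the maximum possible score is 2.0, which would explain why a drop in recall lowers the score even when precision stays near 95%.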

For the optimized set:

  • The optimized sets of the 2014 and 2015 releases are similar; see SD-Rule rank mapping, 2014-15 for details.
  • The optimized set is consistent over the years (good rules stay good):
    • The 2014 optimized set has 96 SD-Rules, 73 of which are good.
    • The 2015 optimized set has 101 SD-Rules, 76 of which are good.
    • The 2016 optimized set has 111 SD-Rules, 82 of which are good.
    • The 2017 optimized set has 119 SD-Rules, 86 of which are good.
    • The 2020 optimized set has 130 SD-Rules, 93 of which are good.

    • All good rules in 2014 are good in 2015.
    • All good rules in 2015 are good in 2016, except for 1 (ar$|adj|e$|noun).
    • All good rules in 2016 are good in 2017.
    • All good rules in 2017 are good in 2020.

III. Transaction History:

Columns:
  • Baseline: collected candidate SD-Rules
  • Unique Rules: child-rules removed from the baseline
  • Good Rules: used in the Lexical Tools SD-Rule set
2014
  • Baseline: 107
  • Unique Rules: 96
    • removed 11 child-rules from the baseline
    • 96 = 107 - 11
  • Good Rules: 73
New Rules: 15
  •                      ES (Expert-Suggest)  NOM_D  ORG_D  Sub-Total
    Total Rules          7                    6      2      15
    Duplicated           2                    0      0      2
    Total non-dup-rules  5                    6      2      13
    Bad Rules            2                    0      0      2
    Good Rules           3                    6      2      11
  • details
2015
  • Baseline: 120
    • 2 of the 15 new rules are child-rules of existing rules and were not added
    • 120 = 107 + 15 - 2
  • Unique Rules: 101
  • Good Rules: 76
    • 4 of the good new rules are parent-rules of 4 existing rules (+0)
    • 2 of the good new rules are parent-rules of 4 existing rules (-2)
    • 5 of the good new rules have no parent-rule relationship with any existing rule (+5)
    • 76 = 73 + 0 - 2 + 5
New Rules: 12
  •                      ES (Expert-Suggest)  NOM_D  ORG_D  Sub-Total
    Total Rules          2                    5      5      12
    Duplicated           0                    1      0      1
    Total non-dup-rules  2                    4      5      11
    Bad Rules            1                    1      1      3
    Good Rules           1                    3      4      8
  • details
2016
  • Baseline: 132
    • 1 child-rule (nce$|noun|nt$|adj) was added to an existing rule in 2015
    • 1 of the 12 new rules is a duplicate and was not added
    • 132 = 120 + 1 + 12 - 1
  • Unique Rules: 111
  • Good Rules: 82
New Rules: 11
  •                      ES (Expert-Suggest)  NOM_D  ORG_D  Sub-Total
    Total Rules          2                    5      4      11
    Duplicated           0                    1      0      1
    Total non-dup-rules  2                    4      4      10
    Bad Rules            2                    0      2      4
    Good Rules           0                    4      2      6
  • details
2017
  • Baseline: 142
    • 1 of the 11 new rules is a duplicate and was not added
    • 142 = 132 + 11 - 1
  • Unique Rules: 119
  • Good Rules: 86
New Rules: 18
  •                      ES (Expert-Suggest)  NOM_D  ORG_D  Sub-Total
    Total Rules          2                    10     6      18
    Duplicated           0                    5      2      7
    Total non-dup-rules  2                    5      4      11
    Bad Rules            2                    0      2      4
    Good Rules           0                    5      2      7
  • details
2020
  • Baseline: 153
    • 7 of the 18 new rules are duplicates and were not added
    • 153 = 142 + 18 - 7
  • Unique Rules: 130
  • Good Rules: 93

The transaction history is not tracked from the 2021 release onward.
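
The baseline figures in the transaction history above follow one pattern: the previous baseline, plus the new rules, minus the duplicates that were not added, plus any child-rule added to an existing rule (only the 2016 entry has one). A minimal sketch that replays this arithmetic with the numbers listed above; the function name `next_baseline` and the tuple layout are illustrative assumptions:

```python
# (year, previous baseline, new rules, duplicates not added,
#  extra child-rules added to existing rules, listed baseline)
history = [
    (2015, 107, 15, 2, 0, 120),
    (2016, 120, 12, 1, 1, 132),
    (2017, 132, 11, 1, 0, 142),
    (2020, 142, 18, 7, 0, 153),
]


def next_baseline(previous, new_rules, duplicated, extra_child_rules=0):
    """Baseline size after one release (hypothetical helper)."""
    return previous + new_rules - duplicated + extra_child_rules


for year, previous, new_rules, duplicated, extra, listed in history:
    computed = next_baseline(previous, new_rules, duplicated, extra)
    status = "matches" if computed == listed else "differs from"
    print(f"{year}: computed baseline {computed} {status} listed {listed}")
```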

Details:

In conclusion, the optimized set of SD-Rules is very steady (consistent) across releases, as expected.