Lexical Tools

Example - Add SD-Rules Derived from nomD

As discussed in the nominalization derivations section, valid nomD pairs from the Lexicon can be generated automatically by a computer program. Most of them are also valid suffixD pairs: in the 2013 release, 14,368 suffixD pairs are generated from 14,668 valid nomD pairs. A set of programs was developed to derive possible SD-rules from these valid SD-pairs and add them to the SD-rule set (from the previous section) to increase coverage:

A program identifies possible SD-rules by stripping the common leading characters from each valid SD-pair derived from nomD.
For example, the nomD SD-pair location|noun|locate|verb generates the SD-rule ion|noun|e|verb by stripping "locat".
shell> GetSdRule 2013 2 nomD ...
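The prefix-stripping step can be illustrated with a minimal Python sketch. It is not the GetSdRule program itself; the function name `derive_sd_rule` is hypothetical, and the sketch assumes an SD-pair is a `word|pos|word|pos` string, as in the example above.

```python
import os


def derive_sd_rule(sd_pair: str) -> str:
    """Turn 'word1|pos1|word2|pos2' into 'suffix1|pos1|suffix2|pos2'."""
    word1, pos1, word2, pos2 = sd_pair.split("|")
    # os.path.commonprefix compares character by character,
    # so it returns the longest shared leading string of the two words.
    stem = os.path.commonprefix([word1, word2])
    # Whatever remains after the shared stem is the suffix of each word.
    return f"{word1[len(stem):]}|{pos1}|{word2[len(stem):]}|{pos2}"


print(derive_sd_rule("location|noun|locate|verb"))  # ion|noun|e|verb
```

When one word is a prefix of the other, its suffix comes out as the empty string, which corresponds to the bare `$` rules listed in the table in this section.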

Select SD-rules to add to the SD-rule set:

The results, in sdRulesFromSdPairs.rpt, include all possible SD-rules.
Select rules with high coverage (occurrence), because a good rule must have a good coverage rate.
Check whether the rule is a root parent-rule. Ideally, we want to work on root parent-rules for a comprehensive decomposition. The decomposed child-rules can be evaluated with the same procedure as in the baseline optimization.
Check how the SD-rule is related to the current set:
- duplicated: not selected, because a set cannot contain two identical rules
- child-rule: not selected, because its parent-rule is already in the set
- parent-rule: not selected, to simplify the analysis. However, these rules should be evaluated in future releases, when more resources are available, to obtain a better set.
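The relation check can be sketched as follows. This is an illustration under an assumption inferred from the examples in this section (for instance, ss$|verb|ssion$|noun being a child of $|verb|ion$|noun): a child-rule is taken to be a parent-rule with the same extra characters prepended to both suffixes and the same parts of speech in the same order. The names `parse_rule`, `is_child_of`, and `classify` are hypothetical.

```python
def parse_rule(rule: str):
    """Split 'suffix1$|pos1|suffix2$|pos2' and drop the '$' end markers."""
    s1, p1, s2, p2 = rule.split("|")
    return s1.rstrip("$"), p1, s2.rstrip("$"), p2


def is_child_of(child: str, parent: str) -> bool:
    """Assumed definition: child = parent with one shared string prepended."""
    cs1, cp1, cs2, cp2 = parse_rule(child)
    ps1, pp1, ps2, pp2 = parse_rule(parent)
    if (cp1, cp2) != (pp1, pp2):
        return False
    if not (cs1.endswith(ps1) and cs2.endswith(ps2)):
        return False
    # Characters the child adds in front of each parent suffix.
    extra1 = cs1[: len(cs1) - len(ps1)]
    extra2 = cs2[: len(cs2) - len(ps2)]
    return extra1 != "" and extra1 == extra2


def classify(candidate: str, current_set) -> str:
    """Return the relation of a candidate rule to the current rule set."""
    if candidate in current_set:
        return "duplicated"
    if any(is_child_of(candidate, rule) for rule in current_set):
        return "child-rule"
    if any(is_child_of(rule, candidate) for rule in current_set):
        return "parent-rule"
    return "none"


current = {"$|adj|ness$|noun", "ate$|verb|ation$|noun", "se$|verb|sion$|noun"}
for cand in ("$|adj|ness$|noun", "e$|verb|ion$|noun", "iness$|noun|y$|adj"):
    print(cand, "->", classify(cand, current))
```

Only candidates classified as "none" would go forward under the criteria above; the rule set used here is a small made-up subset, not the actual 2013 set.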

As a result, four SD-rules were selected for evaluation in 2013 as additions to the SD-rule set, as shown in the following table. More SD-rules should be evaluated in the future for better coverage. The bottom line is that the more SD-rules are evaluated, the more complete the SD-rule set becomes and the greater its coverage.


	
shell> GetSdRule 2013
5
2013
$|adj|ness$|noun
...

Possible SD-rule from nomD	Occurrence	Root	Related	Notes
$\|adj\|ness$\|noun	2489	Yes	Duplicated	Done - not selected
e$\|verb\|ion$\|noun	1740	Yes	parent-rule of `ate$\|verb\|ation$\|noun se$\|verb\|sion$\|noun`	To be evaluated next
$\|adj\|ity$\|noun	1635	Yes	Duplicated	Done - not selected
ility$\|noun\|le$\|adj	1295	Yes	parent-rule of `ability$\|noun\|able$\|adj`	To be evaluated next
ation$\|noun\|e$\|verb	1164	Yes	Duplicated	Done - not selected
e$\|adj\|ity$\|noun	604	Yes	Duplicated	Done - not selected
ce$\|noun\|t$\|adj	522	Yes	parent-rule of `ance$\|noun\|ant$\|adj ence$\|noun\|ent$\|adj iance$\|noun\|iant$\|adj`	To be evaluated next
iness$\|noun\|y$\|adj	501	Yes	None	Selected
$\|verb\|ment$\|noun	467	Yes	Duplicated	Done - not selected
$\|verb\|ion$\|noun	381	Yes	None	Selected
cy$\|noun\|t$\|adj	292	Yes	parent-rule of `ency$\|noun\|ent$\|adj iency$\|noun\|ient$\|adj`	To be evaluated next
ication$\|noun\|y$\|verb	232	Yes	Duplicated	Done - not selected
$\|verb\|ation$\|noun	214	Yes	Duplicated	Done - not selected
ed$\|adj\|ion$\|noun	200	Yes	None	Selected
$\|verb\|ing$\|noun	194	Yes	None	Selected
e$\|adj\|ion$\|noun	103	Yes	None	Not selected due to low frequency (coverage)
...	...	...	...	Not selected due to low frequency (coverage)

Apply the same procedure used to optimize the baseline set to obtain the new optimized set, using the optimized set of 2.3 as the new baseline. This task involves:

Retrieve all raw SD-pairs of the above four selected SD-rules from the Lexicon (2013)
Tag the raw SD-pairs
Get the statistics of the SD-pairs for these four SD-rules
Add the rules to the SD-rule set and find the optimized set
The total number of valid SD-pairs (TotalYes) must be calculated as the total number of valid SD-pairs from all parent-rules.
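The TotalYes bookkeeping can be sketched as a running sum, using the Yes counts reported in the results table of this section. The helper name `update_total_yes` is hypothetical; the sketch assumes that when a parent-rule replaces a child-rule already in the set, the child's Yes count is subtracted so its pairs are not counted twice.

```python
def update_total_yes(total_yes: int, added_yes: int, removed_child_yes=()) -> int:
    """Add a new rule's valid pairs and drop those of any removed child-rules."""
    return total_yes + added_yes - sum(removed_child_yes)


total = 37136                                   # 2.3, previous optimized set
total = update_total_yes(total, 535)            # 2.3.1: iness$|noun|y$|adj
print(total)                                    # 37671
total = update_total_yes(total, 636)            # 2.3.2: ed$|adj|ion$|noun
print(total)                                    # 38307
# 2.3.3.0: $|verb|ion$|noun added, child ss$|verb|ssion$|noun (93 Yes) removed
total = update_total_yes(total, 516, removed_child_yes=[93])
print(total)                                    # 38730
total = update_total_yes(total, 467)            # 2.3.4: $|verb|ing$|noun
print(total)                                    # 39197
```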

The iterative results are shown as follows:

ID	New Candidate Rule	Total Yes	Total Rule No.	Rule No.	A. Rate	Occr.	Yes	No	SD-Rule	Status	Source	Notes	Sys A. Rate	Sys C. Rate	Sys. Perf	Notes
2.3 (prev. optimized set)		37,136	87	65	60.66%	183	111	72	ar$\|adj\|e$\|noun	2013	ORG_RULE	SELF	95.01%	94.30%	1.8931	Baseline
2.3.1	`13\|99.81%\|536\|535\|1\|0\|iness$\|noun\|y$\|adj\|2013\|NOM_D\|SELF`	37,671 = 37,136 + 535	88	66	60.66%	183	111	72	ar$\|adj\|e$\|noun	2013	ORG_RULE	SELF	95.08%	94.38%	1.8946	Better
2.3.2	`32\|97.70%\|651\|636\|15\|0\|ed$\|adj\|ion$\|noun\|2013\|NOM_D\|SELF`	38,307 = 37,671 + 636	89	67	60.66%	183	111	72	ar$\|adj\|e$\|noun	2013	ORG_RULE	SELF	95.13%	94.47%	1.8960	Better
2.3.3.0	`46\|93.31%\|553\|516\|37\|0\|$\|verb\|ion$\|noun\|2013\|NOM_D\|PARENT` remove child-rule: `35\|95.88%\|97\|93\|4\|0\|ss$\|verb\|ssion$\|noun\|2013\|ORG_RULE\|SELF`	38,730 = 38,307 + 516 - 93	89	67	60.66%	183	111	72	ar$\|adj\|e$\|noun	2013	ORG_RULE	SELF	95.10%	94.53%	1.8963	Better
2.3.3.1	`1\|429\|414\|15\|t$\|verb\|tion$\|noun\|96.50%\|77.58%` Decomposed from parent-rule: `46\|93.31%\|553\|516\|37\|0\|$\|verb\|ion$\|noun\|2013\|NOM_D\|PARENT`	38,730	89	67	60.66%	183	111	72	ar$\|adj\|e$\|noun	2013	ORG_RULE	SELF	95.14%	94.27%	1.8941	Worse
2.3.4	`50\|91.57%\|510\|467\|43\|0\|$\|verb\|ing$\|noun\|2013\|NOM_D\|SELF`	39,197 = 38,730 + 467	90	68	60.66%	183	111	72	ar$\|adj\|e$\|noun	2013	ORG_RULE	SELF	95.05%	94.60%	1.8965	Better
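This section does not state how Sys. Perf is computed. In every row of the table above, however, the value equals the sum of the system accuracy rate and the system coverage rate, so the sketch below assumes that relationship. The function name `system_performance` is hypothetical; `Decimal` is used to avoid floating-point rounding noise.

```python
from decimal import Decimal


def system_performance(accuracy_pct: str, coverage_pct: str) -> Decimal:
    """Assumed formula: accuracy rate + coverage rate, both as fractions."""
    return (Decimal(accuracy_pct) + Decimal(coverage_pct)) / Decimal(100)


# (Sys A. Rate, Sys C. Rate) per case, copied from the table above.
rates = {
    "2.3": ("95.01", "94.30"),
    "2.3.1": ("95.08", "94.38"),
    "2.3.2": ("95.13", "94.47"),
    "2.3.3.0": ("95.10", "94.53"),
    "2.3.3.1": ("95.14", "94.27"),
    "2.3.4": ("95.05", "94.60"),
}
performance = {case: system_performance(a, c) for case, (a, c) in rates.items()}
for case, perf in performance.items():
    print(case, perf)
```

Under this assumption, a candidate step counts as "Better" when its value exceeds that of the step it is compared against, which is consistent with 2.3.3.1 (1.8941) being marked "Worse" than 2.3.3.0 (1.8963).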

The table above shows the iterative results of adding the new rules derived from nomD step by step. Note that in case 2.3.3 the SD-rule ss$|verb|ssion$|noun is removed because it is a child-rule of the newly added SD-rule $|verb|ion$|noun. The results show that all four selected SD-rules (with the highest frequency from nomD) improve system performance. Thus, all four SD-rules are added to the SD-rule set, which reaches a better coverage rate (94.60%) and system performance (1.8965) with an accuracy rate of 95.05%; the optimized set includes 68 of the 90 SD-rules. The diagram below shows the system accuracy and coverage curves of this optimized set.