Canonicalize
A core LVG technique is to "uninflect" input terms to their base form. This process occasionally results in two or more legitimate uninflected forms for the same inflected input.
For example, "left" uninflects to both "left" and "leave", reflecting its ambiguity as an adjective or as a verb (the past tense of "leave"). One way to manage this ambiguity is to produce a single "canonical" base form for any given input term. Canonicalization pre-computes all uninflected forms and then arranges them into classes, where each class is composed of terms that could be expanded to the same inflected form. The canonical form is an arbitrarily chosen member of the class and represents all members of that class.
For example, the terms "left", "leave", and "leaf" are all included in one such class, and the canonical form is "leaf": members are ordered by length first (shortest first) and then alphabetically, and the first member is chosen. Additionally, the chosen member is a form from the lexicon and pure ASCII where possible. This is an attempt to limit the number of word fragments that show up as canonical representations of a class of terms.
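The grouping and selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the LVG implementation: the `UNINFLECTED` table and `LEXICON` set are toy data, the function names are hypothetical, and the relative priority of the lexicon and ASCII preferences over length is assumed, since the text does not specify it.

```python
from collections import defaultdict

# Toy data (assumed for illustration): inflected form -> its uninflected forms.
UNINFLECTED = {
    "left": {"left", "leave"},
    "leaves": {"leave", "leaf"},
    "colors": {"color"},
}
# Hypothetical lexicon entries; the real lexicon is far larger.
LEXICON = {"left", "leave", "leaf", "color"}


def build_classes(uninflected):
    """Group base forms that share at least one inflected form (union-find)."""
    parent = {}

    def find(term):
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # path halving
            term = parent[term]
        return term

    for bases in uninflected.values():
        first, *rest = sorted(bases)
        find(first)  # register the term even if it has no siblings
        for other in rest:
            parent[find(other)] = find(first)

    classes = defaultdict(set)
    for term in list(parent):
        classes[find(term)].add(term)
    return list(classes.values())


def canonical(members, lexicon):
    """Pick one representative: lexicon forms first, then pure ASCII,
    then shortest, then alphabetical (priority order is an assumption)."""
    def rank(term):
        return (term not in lexicon, not term.isascii(), len(term), term)
    return min(members, key=rank)


classes = build_classes(UNINFLECTED)
canonical_of = {term: canonical(c, LEXICON) for c in classes for term in c}
print(canonical_of["left"])  # leaf
```

Because "left" and "leave" share the inflected form "left", and "leave" and "leaf" share "leaves", all three end up in one class; "leaf" wins because it ties with "left" on length and sorts first alphabetically.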
In addition, spelling variants are given the same canonical form by using the citation form. For example, "analog" and "analogue" both have the canonical form "analog". This flow component always returns exactly one record for each input term.
A set of numbers is returned in the additional information output field when the -m option is specified. These numbers are the numeric form of the canonical forms. Please refer to the canonical form design documents for details.
shell> lvg -f:C
being
being|i|2047|16777215|C|1|

shell> lvg -f:C -m
being
being|i|2047|16777215|C|1|96180|
color
color|color|2047|16777215|C|1|291700|
colour
colour|color|2047|16777215|C|1|291700|
colored
colored|color|2047|16777215|C|1|291700|
coloured
coloured|color|2047|16777215|C|1|291700|
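Output records like those above are pipe-delimited, so they are straightforward to post-process. The following Python sketch groups input terms by their canonical form. The field labels in `FIELD_NAMES` are assumptions made for this example (the text above names only the additional information field), and `parse_record` and `group_by_canonical` are hypothetical helpers, not part of LVG.

```python
from collections import defaultdict

# Assumed labels for the pipe-delimited fields; "extra" is present only
# when the additional information field is printed (the -m option).
FIELD_NAMES = ["input", "canonical", "categories", "inflections",
               "flow", "flow_number", "extra"]

# Output records copied from the -m session above.
SAMPLE = [
    "being|i|2047|16777215|C|1|96180|",
    "color|color|2047|16777215|C|1|291700|",
    "colour|color|2047|16777215|C|1|291700|",
    "colored|color|2047|16777215|C|1|291700|",
    "coloured|color|2047|16777215|C|1|291700|",
]


def parse_record(line):
    """Split one output record into a dict keyed by the assumed labels."""
    fields = line.strip().split("|")
    if fields and fields[-1] == "":
        fields.pop()  # drop the empty string left by the trailing "|"
    return dict(zip(FIELD_NAMES, fields))


def group_by_canonical(lines):
    """Map each canonical form to the input terms that produced it."""
    groups = defaultdict(list)
    for line in lines:
        record = parse_record(line)
        groups[record["canonical"]].append(record["input"])
    return dict(groups)


groups = group_by_canonical(SAMPLE)
print(groups["color"])  # ['color', 'colour', 'colored', 'coloured']
```

All four spelling and inflection variants of "color" fall into one group, which matches the shared number 291700 in the additional information field of the session above.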