Sorting Order: Pure ASCII First
I. Description:
A base form that contains only ASCII characters is placed at the top of the sorting order (and is thus chosen as the citation form).
This requirement was added in 2012 for NLP projects that deal only with pure ASCII (LEXICON), because the ASCII conversions of some non-ASCII Unicode citation forms either do not exist in the Lexicon or represent different lexical records.
Please refer to ASCII issues for details.
II. Example
Aicardi-Goutieres syndrome is chosen over Aicardi-Goutières syndrome as the citation form because it does not contain any non-ASCII Unicode character.
2011- | 2012+ |
---|---|
{base=Aicardi-Goutières syndrome spelling_variant=Aicardi-Goutieres syndrome entry=E0572939 cat=noun variants=uncount } | {base=Aicardi-Goutieres syndrome spelling_variant=Aicardi-Goutières syndrome entry=E0572939 cat=noun variants=uncount } |
However, a citation form may contain non-ASCII Unicode characters if (and only if) no pure ASCII base form exists. For example, divorcé is the citation form of the following record.
2011- | 2012+ |
---|---|
{base=divorcé entry=E0543077 cat=noun variants=reg } | {base=divorcé entry=E0543077 cat=noun variants=reg } |
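
The ASCII-first rule illustrated by the two records above can be sketched in a few lines of Python. This is only an illustration, not the Lexicon's actual sorting code: the function name `choose_citation_form` is hypothetical, and every other criterion of the real sorting order is ignored (forms that tie on the ASCII test simply keep their input order).

```python
from typing import List


def choose_citation_form(forms: List[str]) -> str:
    """Return the form that sorts first under the pure-ASCII-first rule.

    Hypothetical helper for illustration only. It models a single
    criterion: pure ASCII forms come before forms that contain any
    non-ASCII Unicode character.
    """
    if not forms:
        raise ValueError("at least one base form or spelling variant is required")
    # str.isascii() is True only when every character is ASCII.
    # False sorts before True, so negating it puts pure ASCII forms first.
    # sorted() is stable, so forms that tie keep their original order.
    return sorted(forms, key=lambda form: not form.isascii())[0]


if __name__ == "__main__":
    print(choose_citation_form(
        ["Aicardi-Goutières syndrome", "Aicardi-Goutieres syndrome"]
    ))
    print(choose_citation_form(["divorcé"]))
```

With both spellings available, the pure ASCII one is selected; when the only form is divorcé, it is returned unchanged, matching the second record.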
III. Impacts (on Norm)
The results of NLP programs that use citation forms might change accordingly and contain more pure ASCII terms. In Norm, the Unicode Core Norm flow component (-f:q7) is applied after -f:Ct and thus yields the same normalized pure ASCII terms for the two examples above, as shown below:
Input term | 2011- | 2012+ |
---|---|---|
Aicardi-Goutières syndrome | aicardi goutiere syndrome and aicardi goutieres syndrome | aicardi goutiere syndrome and aicardi goutieres syndrome |
divorcé | divorce | divorce |
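
As a rough approximation of why both spellings end up as the same pure ASCII term, the sketch below strips diacritics using Python's standard `unicodedata` module. It is an assumption-laden stand-in, not the -f:q7 implementation: the real flow component may rely on its own mapping tables, and this sketch performs none of Norm's other steps (lowercasing, punctuation removal, uninflection). The function name `strip_diacritics` is hypothetical.

```python
import unicodedata


def strip_diacritics(term: str) -> str:
    """Remove combining marks so accented letters fall back to base letters.

    Hypothetical helper: an approximation of ASCII folding, not the
    Norm -f:q7 flow component itself.
    """
    # NFKD splits a precomposed character such as "è" into the base
    # letter "e" followed by a combining grave accent (U+0300).
    decomposed = unicodedata.normalize("NFKD", term)
    # unicodedata.combining() returns a non-zero class for combining
    # marks, so filtering on it keeps only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_diacritics("Aicardi-Goutières syndrome"))
    print(strip_diacritics("divorcé"))
```

Characters without a canonical decomposition (for example "ø") pass through this sketch unchanged, which is one reason a dedicated mapping table would be needed in practice.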