SPECIALIST Lexicon

Sorting Order: Pure ASCII First

I. Description
A base form that contains only pure ASCII characters should be at the top of the sorting order (and is thus chosen as the citation form). This requirement was added in 2012 for NLP projects that deal only with pure ASCII (LEXICON), because some ASCII conversions of non-ASCII Unicode citation forms do not exist in the Lexicon or represent different lexical records. Please refer to ASCII issues for details.

II. Example
Aicardi-Goutieres syndrome is chosen over Aicardi-Goutières syndrome as the citation form because it does not contain any non-ASCII Unicode character.

2011-:

{base=Aicardi-Goutières syndrome
spelling_variant=Aicardi-Goutieres syndrome
entry=E0572939
	cat=noun
	variants=uncount
}

2012+:

{base=Aicardi-Goutieres syndrome
spelling_variant=Aicardi-Goutières syndrome
entry=E0572939
	cat=noun
	variants=uncount
}

However, a citation form may contain non-ASCII Unicode characters only if no pure ASCII base form exists. For example, divorcé is the citation form of the following record.

2011- and 2012+ (identical):

{base=divorcé
entry=E0543077
	cat=noun
	variants=reg
}
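
The selection rule in this section can be sketched in Python. This is a minimal illustration and not the Lexicon's actual sorting code: the function names `is_pure_ascii` and `choose_citation_form` are hypothetical, and the only criterion modeled is whether a base form is pure ASCII. Any other sorting criteria the Lexicon applies are assumed to be reflected in the incoming order, which the stable sort preserves.

```python
from typing import Sequence


def is_pure_ascii(term: str) -> bool:
    """Return True when every character is in the 7-bit ASCII range."""
    return term.isascii()


def choose_citation_form(base_forms: Sequence[str]) -> str:
    """Pick a citation form, ranking pure ASCII base forms first.

    Assumption: sorted() is stable, so forms with the same ASCII status
    keep their incoming order. Other Lexicon criteria are not modeled.
    """
    # False sorts before True, so pure ASCII forms move to the top.
    ranked = sorted(base_forms, key=lambda form: not is_pure_ascii(form))
    return ranked[0]


# A pure ASCII spelling exists, so it wins over the accented one.
print(choose_citation_form(
    ["Aicardi-Goutières syndrome", "Aicardi-Goutieres syndrome"]
))

# No pure ASCII base form exists, so the non-ASCII form is kept.
print(choose_citation_form(["divorcé"]))
```

With only one base form available, as for divorcé, the sketch falls back to the non-ASCII form, which matches the exception described above.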

III. Impacts (on Norm)
The results of NLP programs that use citation forms might change accordingly and contain more pure ASCII terms. In Norm, the Unicode Core Norm flow component (-f:q7) is processed after -f:Ct and thus produces the same normalized pure ASCII terms for the two examples above, as shown below:

Input: Aicardi-Goutières syndrome
Results (2013-): aicardi goutiere syndrome and aicardi goutieres syndrome
Results (2014+): aicardi goutiere syndrome and aicardi goutieres syndrome

Input: divorcé
Results (2013-): divorce
Results (2014+): divorce
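
To illustrate why an accented input and its ASCII spelling can end up as the same normalized term, the following Python sketch strips diacritics through Unicode decomposition and lowercases the result. It is an assumption-laden approximation and not the implementation of -f:q7: the function names are hypothetical, and the other steps visible in the Norm results above (such as replacing the hyphen and producing the uninflected variant) are not attempted.

```python
import unicodedata


def strip_diacritics(term: str) -> str:
    """Remove combining marks after compatibility decomposition (NFKD)."""
    decomposed = unicodedata.normalize("NFKD", term)
    # combining() is non-zero for combining marks such as the grave accent.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_ascii_lowercase(term: str) -> str:
    """Approximate only the diacritic-stripping and lowercasing steps."""
    return strip_diacritics(term).lower()


for term in ["Aicardi-Goutières syndrome", "Aicardi-Goutieres syndrome", "divorcé"]:
    print(term, "->", to_ascii_lowercase(term))
```

Under these assumptions both spellings of the syndrome map to the same string, which is consistent with the identical Norm results before and after the change.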
