The SPECIALIST Lexicon

Sorting Order: Pure ASCII First

I. Description:
A base form that contains only pure ASCII characters is placed at the top of the sorting order and is therefore chosen as the citation form. This requirement was added in 2012 for NLP projects that deal only with pure ASCII (LEXICON), because the ASCII conversions of some non-ASCII Unicode citation forms either do not exist in the Lexicon or represent different lexical records. Please refer to ASCII issues for details.

II. Example
Aicardi-Goutieres syndrome is chosen over Aicardi-Goutières syndrome as the citation form because it does not contain any non-ASCII Unicode character.

2011-:
{base=Aicardi-Goutières syndrome
spelling_variant=Aicardi-Goutieres syndrome
entry=E0572939
	cat=noun
	variants=uncount
}

2012+:
{base=Aicardi-Goutieres syndrome
spelling_variant=Aicardi-Goutières syndrome
entry=E0572939
	cat=noun
	variants=uncount
}

A citation form may still contain non-ASCII Unicode characters, but only if no pure ASCII base form exists. For example, divorcé is the citation form of the following record.

2011-:
{base=divorcé
entry=E0543077
    cat=noun
    variants=reg
}

2012+:
{base=divorcé
entry=E0543077
    cat=noun
    variants=reg
}
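
The pure-ASCII-first rule illustrated by the two records above can be sketched in Python as follows. This is only a minimal illustration, not the Lexicon's actual sorting code: the function name choose_citation_form is hypothetical, and the tie-break by plain string order is an assumption standing in for the other sorting criteria, which this page does not describe.

```python
from typing import Iterable


def choose_citation_form(base_forms: Iterable[str]) -> str:
    """Pick a citation form, preferring pure ASCII base forms.

    Hypothetical helper: only the ASCII-first criterion comes from the
    rule described here. The secondary key (plain string order) is an
    assumed placeholder for the remaining sorting criteria.
    """
    # (False, term) sorts before (True, term), so any pure ASCII base
    # form wins over a base form that contains non-ASCII characters.
    return min(base_forms, key=lambda term: (not term.isascii(), term))


if __name__ == "__main__":
    print(choose_citation_form(
        ["Aicardi-Goutières syndrome", "Aicardi-Goutieres syndrome"]))
    print(choose_citation_form(["divorcé"]))
```

With both spellings of the syndrome available, the sketch returns the pure ASCII one. With divorcé as the only base form, it falls back to the non-ASCII form, matching the second record.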

III. Impacts (on Norm)
The results of NLP programs that use citation forms might change accordingly and contain more pure ASCII terms. In Norm, the Unicode Core Norm flow component (-f:q7) is applied after -f:Ct and thus yields the same normalized pure ASCII terms for the two examples above, as shown below:

  • Input: Aicardi-Goutières syndrome
  • Results (2013-): aicardi goutiere syndrome and aicardi goutieres syndrome
  • Results (2014+): aicardi goutiere syndrome and aicardi goutieres syndrome
  • Input: divorcé
  • Results (2013-): divorce
  • Results (2014+): divorce
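
To see why the ASCII and non-ASCII spellings end up as the same normalized term, the following Python sketch strips diacritics and lowercases a term. It is a rough approximation only and is not the Norm implementation: the function name fold_to_ascii is hypothetical, and the sketch does not reproduce the other Norm steps (such as punctuation removal or uninflection) that produce the results listed above.

```python
import unicodedata


def fold_to_ascii(term: str) -> str:
    """Strip combining diacritics and lowercase a term.

    Hypothetical helper for illustration; it approximates only the
    diacritic-stripping effect and none of the other Norm steps.
    """
    # NFKD splits "è" into "e" followed by a combining grave accent.
    decomposed = unicodedata.normalize("NFKD", term)
    # Drop the combining marks and keep the base characters.
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


if __name__ == "__main__":
    print(fold_to_ascii("Aicardi-Goutières syndrome"))
    print(fold_to_ascii("Aicardi-Goutieres syndrome"))
    print(fold_to_ascii("divorcé"))
```

Both spellings of the syndrome fold to the same string, so choosing the ASCII citation form does not change what such a folding step produces.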