The SPECIALIST Lexicon

Spelling Variant Patterns - Normalization

I. Introduction

Normalization can be used to find a group of spelling variants from a list of words (such as N-grams). Java programs include:

II. Development Notes

SpVarNorm was tested on Lexicon.2015. All false positives were retrieved and analyzed to improve the precision of the algorithm. Please see the spVarNorm Development notes for details.

III. Algorithm Details

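Each step below gives a Description, followed by its Rules (written as pattern => replacement) and its Examples (variants separated by |).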
Convert non-ASCII unicode to ASCII
  • Lexical Tools - ToAsciiApi
  • Labbé|Labbe
  • λmax|lamdamax
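As a rough illustration of this step only, the sketch below folds accented letters with the standard java.text.Normalizer class and falls back to a small lookup table for symbols. It does not use or reproduce the Lexical Tools ToAsciiApi; the class name AsciiFoldSketch, the method toAscii, and the single-entry SYMBOLS table (spelled as in the λmax example above) are assumptions made for this example.

```java
import java.text.Normalizer;
import java.util.Map;

final class AsciiFoldSketch {
    // Hypothetical table for symbols that have no Unicode decomposition.
    // The single entry mirrors the example above (U+03BB, Greek small lambda).
    private static final Map<Character, String> SYMBOLS = Map.of('\u03BB', "lamda");

    static String toAscii(String term) {
        // NFD splits an accented letter into its base letter plus a combining mark.
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            if (Character.getType(c) == Character.NON_SPACING_MARK) {
                continue; // drop the diacritic itself
            }
            if (c < 128) {
                out.append(c); // already ASCII
            } else {
                // Unknown non-ASCII characters are kept unchanged in this sketch.
                out.append(SYMBOLS.getOrDefault(c, String.valueOf(c)));
            }
        }
        return out.toString();
    }
}
```

With this sketch, AsciiFoldSketch.toAscii("Labbé") yields "Labbe". A real conversion table would need far more entries than the one shown.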
Synonym substitution
  • ^St. => ^Saint
  • St. Anthony's fire|Saint Anthony's fire
Spelling variant substitution
  • labelled => labeled
  • programme => program
  • tumour => tumor
  • carbon 14 labelled|carbon 14 labeled
  • drug benefit programme|drug benefit program
  • CPA tumour|CPA tumor
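The synonym and spelling variant rules above can be pictured as an ordered list of pattern replacements. The following is a minimal sketch, not the spVarNorm source: the class name SubstitutionSketch and the method substitute are hypothetical, and matching on whole words (\b) is an assumption, since the rules above do not say how word boundaries are handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class SubstitutionSketch {
    // Rules are applied in insertion order; only the rules listed above are included.
    private static final Map<Pattern, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put(Pattern.compile("^St\\."), "Saint");            // only at the start of the term
        RULES.put(Pattern.compile("\\blabelled\\b"), "labeled");  // whole-word match is an assumption
        RULES.put(Pattern.compile("\\bprogramme\\b"), "program");
        RULES.put(Pattern.compile("\\btumour\\b"), "tumor");
    }

    static String substitute(String term) {
        String out = term;
        for (Map.Entry<Pattern, String> rule : RULES.entrySet()) {
            out = rule.getKey().matcher(out).replaceAll(rule.getValue());
        }
        return out;
    }
}
```

Under these assumptions, both members of each example pair above map to the same string, for instance "CPA tumour" and "CPA tumor" both become "CPA tumor".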
Rank substitution
  • 1st => first
  • 2nd => second
  • 3rd => third
  • Vth => 5th
  • 5th => fifth
  • 8th => eighth
  • 9th => ninth
  • 12th => twelfth
  • Vth nerve|5th nerve
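One way to read the rank rules is as an ordered table applied to each token, where Vth is first rewritten to 5th and then picked up by the 5th rule. The sketch below follows that reading; the class name RankSketch, the method substituteRanks, and the choice to split tokens on single spaces are assumptions for illustration, not details taken from spVarNorm.

```java
final class RankSketch {
    // Ordered rules from the list above. "Vth" precedes "5th" so that it chains to "fifth".
    private static final String[][] RULES = {
        {"1st", "first"}, {"2nd", "second"}, {"3rd", "third"},
        {"Vth", "5th"}, {"5th", "fifth"}, {"8th", "eighth"},
        {"9th", "ninth"}, {"12th", "twelfth"}
    };

    static String substituteRanks(String term) {
        // Assumption: tokens are separated by single spaces.
        String[] tokens = term.split(" ");
        for (int i = 0; i < tokens.length; i++) {
            for (String[] rule : RULES) {
                // Compare against the current token value so earlier rewrites can chain.
                if (tokens[i].equals(rule[0])) {
                    tokens[i] = rule[1];
                }
            }
        }
        return String.join(" ", tokens);
    }
}
```

In this sketch, "Vth nerve" and "5th nerve" both normalize to "fifth nerve", which is what makes them a spelling variant pair.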
Number substitution
  • 60 => sixty
  • 50 => fifty
  • 40 => forty
  • 30 => thirty
  • 20 => twenty
  • 19 => nineteen
  • 18 => eighteen
  • 17 => seventeen
  • 16 => sixteen
  • 15 => fifteen
  • 14 => fourteen
  • 13 => thirteen
  • 12 => twelve
  • 11 => eleven
  • 10 => ten
  • 9 => nine
  • 8 => eight
  • 7 => seven
  • 6 => six
  • 5 => five
  • 4 => four
  • 3 => three
  • 2 => two
  • 1 => one
  • 3-membered ring|three membered ring|three-membered ring
  • 12-lead|twelve-lead
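The number rules above can be sketched as a lookup applied to standalone digit runs. This is an illustrative sketch under stated assumptions: the class name NumberSketch and the method substituteNumbers are hypothetical, a number is only replaced when it is not directly attached to a letter or another digit, and numbers missing from the rule list are left unchanged.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class NumberSketch {
    private static final Map<String, String> WORDS = new HashMap<>();
    static {
        String[] oneToTwenty = {
            "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten",
            "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen",
            "eighteen", "nineteen", "twenty"
        };
        for (int i = 0; i < oneToTwenty.length; i++) {
            WORDS.put(String.valueOf(i + 1), oneToTwenty[i]);
        }
        WORDS.put("30", "thirty");
        WORDS.put("40", "forty");
        WORDS.put("50", "fifty");
        WORDS.put("60", "sixty");
    }

    // Assumption: a digit run counts as a number only when no letter or digit touches it.
    private static final Pattern NUMBER =
        Pattern.compile("(?<![A-Za-z0-9])\\d+(?![A-Za-z0-9])");

    static String substituteNumbers(String term) {
        // Numbers without a rule are written back unchanged.
        return NUMBER.matcher(term).replaceAll(m -> WORDS.getOrDefault(m.group(), m.group()));
    }
}
```

Here "3-membered ring" becomes "three-membered ring"; matching it with "three membered ring" still depends on the later punctuation and space steps.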
Roman numeral substitution
  • class-II, type-II, TBD
  • BoHV-I|BoHV-1
  • BoHVI|BoHV1
Punctuation
  • - => space
  • . => space
  • " => space
  • ! => space
  • & => space
  • ( => space
  • ) => space
  • [ => space
  • ] => space
  • / => space
  • lamin-A|lamin A
  • A.A.D.|AAD
  • University of Rome "Tor Vergata"|University of Rome Tor Vergata
  • !Kung|Kung
  • L & A|L A (L and A?)
  • aflatoxin M(1)|aflatoxin M1
  • B(a)PDE|B[a]PDE|BaPDE
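The punctuation rules amount to replacing each listed character with a space. A minimal sketch follows, with the hypothetical names PunctuationSketch and replacePunctuation. Note that this step alone leaves extra spaces behind (for example "A.A.D." becomes "A A D "), so the example pairs above only match once spaces are removed in a later step.

```java
final class PunctuationSketch {
    // The ten characters listed in the rules above.
    private static final String PUNCTUATION = "-.\"!&()[]/";

    static String replacePunctuation(String term) {
        StringBuilder out = new StringBuilder(term.length());
        for (char c : term.toCharArray()) {
            // Each listed punctuation character becomes a single space.
            out.append(PUNCTUATION.indexOf(c) >= 0 ? ' ' : c);
        }
        return out.toString();
    }
}
```

A character scan is used here instead of a regular expression only to avoid escaping issues with the bracket and quote characters.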
Genitive
  • s's => s
  • s' => s
  • 's => space
  • ' => space

Apply this operation only when the matching pattern is not at the end of the term.
  • Addison's disease|Addisons disease
  • bilateral Wilms' tumor|bilateral Wilms tumor

  • Laufe's forceps|Laufe forceps are not spVars because of an extra s-/z- sound. It is referred to as "strict homonymy" (same spelling, same pronunciation, different meaning).
Lower case
  • toLowerCase()
  • Latter-Day Saint|Latter-day Saint
Remove Space
  • space => (empty string)
  • lattice work|latticework
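To show how normalized forms can be used to find groups of spelling variants in a word list, the sketch below applies only the last two steps (lower case, remove space) and groups terms that share the resulting key. It assumes the earlier steps have already been applied to each term; the class name GroupingSketch and the methods finalKey and group are hypothetical and are not part of spVarNorm.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class GroupingSketch {
    // Last two steps only: lower case, then remove every space.
    static String finalKey(String term) {
        return term.toLowerCase(Locale.ROOT).replace(" ", "");
    }

    // Terms that share a key are candidate spelling variants of one another.
    static Map<String, List<String>> group(List<String> terms) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String term : terms) {
            groups.computeIfAbsent(finalKey(term), k -> new ArrayList<>()).add(term);
        }
        return groups;
    }
}
```

For the input list "lattice work", "latticework", "Kung", this sketch puts the first two terms in one group under the key "latticework" and leaves "Kung" in a group of its own.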