The SPECIALIST Lexicon


Lexicon, UTF-8, XML, ASCII, 2026 Release:

The SPECIALIST Lexicon is a large syntactic lexicon of biomedical and general English. It was designed and developed to provide the rich lexical information required for the UMLS SPECIALIST Natural Language Processing (NLP) System, which includes the Metathesaurus, Semantic Network, MetaMap, and the Lexical Tools. The SPECIALIST Lexicon is intended to serve as a general English lexicon with extensive biomedical coverage. It includes commonly occurring English words and biomedical vocabulary drawn from diverse sources. These sources include, but are not limited to, MEDLINE citation records, terms in Dorland's Illustrated Medical Dictionary, the 10,000 most frequent words listed in the American Heritage Word Frequency Book, the 2,000 lexical items used in the controlled definitions of Longman's Dictionary of Contemporary English, and words in WordNet. Each entry in the Lexicon, whether a single word or a multiword term, contains the detailed syntactic, morphological (inflection and derivation), and orthographic (spelling variants) information required by the SPECIALIST NLP System. The latest version contains over one million lexical items, making it one of the largest and most comprehensive publicly available lexical resources.

The table below highlights the major enhancements and release history of the SPECIALIST Lexicon:

Release | Enhancement | Brief Description
1994 | First release | The SPECIALIST Lexicon (unit lexical record formatted file), along with its associated relational files, was released annually as part of the UMLS Knowledge Sources. In addition to its distribution with the UMLS, the Lexicon is also available as an open-source resource, subject to the applicable terms and conditions.
2003 | Numbers | Numbers and number words, including cardinals, ordinals, and fractions, were added to the Lexicon.
2003 | XML format | The XML format of the unit lexical record is available through LexAccess. In addition, XML schemas and JAXB (Java Architecture for XML Binding) APIs are released in LexCheck.
2006 | Unicode (UTF-8) | The Lexicon was migrated to Unicode and has been released in UTF-8 format ever since.
2009 | Pure ASCII version | The pure ASCII file, LEXICON.ascii, was added to the annual release for NLP projects interested only in ASCII.
2013 | Derivational variants | Derivational variants in the Lexicon, including zeroD, suffixD, and prefixD, along with negation information, are systematically generated and added to the annual release (DM.DB). For more details, please see Lexicon derivational variants - DM.DB. The latest version provides over 158K derivation pairs, making it one of the most comprehensive publicly available derivation resources.
2017 | Synonyms | Synonymous terms (synonyms) in the Lexicon are incorporated into the synonym database file (SM.DB). For more information, please see Lexicon Synonyms - SM.DB. The latest version provides over 283K synonym pairs, making it a valuable publicly available synonym resource.
2022 | Antonyms | Antonyms in the Lexicon are included in the antonym database file (AM.DB). For more details, please see Lexicon Antonyms - AM.DB. The latest version provides over 17.7K canonical antonym pairs, making it a comprehensive publicly available antonym resource.
2025 | Verb complementation types | Multiwords from light verb constructions and verb-particle constructions in the Lexicon were added as multiword resources from the verb complementation type files.
2026 | Antonyms | Antonyms extracted from the Semantic Network model have been completed and incorporated into the antonym database file (AM.DB). For more information, please see Lexicon Antonyms - AM.DB. The latest version provides over 17.7K high-quality, domain-specific canonical antonym pairs, making it one of the most comprehensive publicly available antonym resources.

The SPECIALIST Lexical Tools are built on the SPECIALIST Lexicon data and provide a comprehensive toolset, along with Java APIs, to support fundamental NLP functions. These include retrieving syntactic category, inflectional variations, spelling variations, abbreviations, acronyms, derivational variations, synonyms, and antonyms, as well as normalization, Unicode-to-ASCII conversion, tokenization, and stopword removal.
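
To give a rough sense of what three of the listed functions (Unicode-to-ASCII conversion, tokenization, and stopword removal) involve, the following is a minimal sketch using only the Python standard library. It does not call the Lexical Tools or their Java APIs and does not reproduce their actual algorithms. The function names (`to_ascii`, `tokenize`, `simple_normalize`) and the short stopword list are hypothetical and chosen only for illustration.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stopword list for illustration only;
# it is NOT the stopword list used by the SPECIALIST Lexical Tools.
STOPWORDS = {"of", "and", "with", "for", "the", "to", "by", "on", "in"}


def to_ascii(text):
    """Approximate Unicode-to-ASCII conversion.

    Decomposes characters (NFKD) so that diacritics become separate
    combining marks, then drops every code point that is not ASCII.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def tokenize(text):
    """Split text into tokens on runs of non-alphanumeric characters."""
    return [token for token in re.split(r"[^A-Za-z0-9]+", text) if token]


def simple_normalize(text):
    """Lowercase, convert to ASCII, tokenize, drop stopwords, sort tokens.

    This is a simplified stand-in for term normalization, not the
    normalization performed by the Lexical Tools.
    """
    tokens = tokenize(to_ascii(text).lower())
    kept = [token for token in tokens if token not in STOPWORDS]
    return " ".join(sorted(kept))


if __name__ == "__main__":
    print(to_ascii("Sjögren's syndrome"))
    print(simple_normalize("Disease of the Müllerian duct"))
```

Sorting the remaining tokens makes the result independent of word order, so that differently ordered variants of the same term map to one string. The real tools draw on the Lexicon's inflection, spelling-variant, and derivation data, which this sketch does not attempt to model.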