The SPECIALIST Lexicon


Lexicon, UTF-8, XML, ASCII, 2025 Release:

The SPECIALIST Lexicon is a large syntactic lexicon of biomedical and general English, developed to provide the lexical information needed by the UMLS SPECIALIST Natural Language Processing (NLP) System, which includes the Metathesaurus, the Semantic Network, MetaMap, and the Lexical Tools. It is intended to be a general English lexicon that includes many biomedical terms. Coverage includes both commonly occurring English words and biomedical vocabulary from a variety of sources. These sources include, but are not limited to, MEDLINE citation records, terms in Dorland's Illustrated Medical Dictionary, the 10,000 most frequent words listed in the American Heritage Word Frequency Book, the 2,000 lexical items used in the controlled definitions of Longman's Dictionary of Contemporary English, and words in WordNet. The lexicon entry for each lexical item (word or term) records the syntactic, morphological (inflection and derivation), and orthographic (spelling variants) information needed by the SPECIALIST NLP System. The latest version provides over one million words, including single words and multiwords, making it one of the largest publicly available lexicon resources.
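To make the shape of such an entry concrete, here is a minimal Python sketch of a record that holds the three kinds of information described above. The class name, the field names, and the sample values are illustrative assumptions. They do not reproduce the Lexicon's actual unit lexical record format or its XML schema.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class LexicalEntry:
    """Hypothetical container for one lexical item (not the Lexicon schema)."""

    base: str                     # citation form of the word or term
    category: str                 # syntactic information, e.g. "noun"
    # Morphological information: inflected and derived forms.
    inflections: List[str] = field(default_factory=list)
    derivations: List[str] = field(default_factory=list)
    # Orthographic information: alternative spellings of the base form.
    spelling_variants: List[str] = field(default_factory=list)

    def surface_forms(self) -> Set[str]:
        """Return the base form plus its inflections and spelling variants."""
        return {self.base, *self.inflections, *self.spelling_variants}


# Sample values chosen for illustration only.
entry = LexicalEntry(
    base="tumor",
    category="noun",
    inflections=["tumors"],
    derivations=["tumorous"],
    spelling_variants=["tumour"],
)
print(sorted(entry.surface_forms()))
```

Derived forms are kept out of `surface_forms` in this sketch because a derivation is a separate lexical item rather than another spelling or inflection of the same one.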

The following table lists the major enhancements and the release log:

Release | Enhancement | Description
1994 | First release | The SPECIALIST Lexicon (a file in unit lexical record format), along with relational files, is released annually as one of the UMLS Knowledge Sources. In addition to its distribution with the UMLS, it is available as an open source resource subject to these terms and conditions.
2003 | Numbers | Numbers and number words, including cardinals, ordinals, and fractions, were added to the Lexicon release.
2003 | XML format | The XML format of the unit lexical record was made available through LexAccess. In addition, XML schemas and JAXB (Java Architecture for XML Binding) APIs are released in LexCheck.
2006 | Unicode (UTF-8) | The Lexicon migrated to Unicode and has been released in UTF-8 format.
2009 | Pure ASCII version | The pure ASCII file, LEXICON.ascii, is added to the annual release for NLP projects interested only in ASCII.
2013 | Derivational variants | Derivational variants in the Lexicon, including zeroD, suffixD, and prefixD, along with negation information, are added to the annual release (DM.DB) by a systematic methodology. Please see details at Lexicon derivational variants - DM.DB. The latest version provides over 158K derivation pairs, making it one of the largest publicly available derivation resources.
2017 | Synonyms | Synonymous terms in the Lexicon (lexSynonyms) are added to the synonym database file (SM.DB). Please see details at Lexicon Synonyms - SM.DB. The latest version provides over 282K synonym pairs as a publicly available synonym resource.
2022 | Antonyms | Antonyms in the Lexicon are added to the antonym database file (AM.DB). Please see details at Lexicon Antonyms - AM.DB. The latest version provides over 13K (canonical) antonym pairs, making it one of the largest publicly available antonym resources.
2025 | Verb complementation types | Multiwords from light verb constructions and verb-particle constructions in the Lexicon are added as multiword resources from the verb complementation type files.

The SPECIALIST Lexical Tools use the SPECIALIST Lexicon data to provide a comprehensive toolset and Java APIs for fundamental NLP functions. These include retrieving syntactic categories, inflectional variations, spelling variations, abbreviations, acronyms, derivational variations, synonyms, and antonyms, as well as normalization, Unicode-to-ASCII conversion, tokenization, and stopword removal.
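As a rough illustration of what the last three functions do, the sketch below chains Unicode-to-ASCII conversion, tokenization, and stopword removal using only the Python standard library. It is not the Lexical Tools API and makes no claim about how those tools implement these steps. The function names and the tiny stopword list are assumptions made for this example.

```python
import re
import unicodedata
from typing import List

# Hypothetical stopword list for illustration; not the Lexical Tools list.
STOPWORDS = {"of", "the", "and", "in", "to"}


def to_ascii(text: str) -> str:
    """Decompose characters (NFKD), then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens: List[str]) -> List[str]:
    """Keep only the tokens that are not in the stopword list."""
    return [token for token in tokens if token not in STOPWORDS]


tokens = remove_stopwords(tokenize(to_ascii("Ménière disease of the ear")))
print(tokens)  # ['meniere', 'disease', 'ear']
```

Dropping non-ASCII characters after decomposition is a deliberately simple choice: it handles accented Latin letters but silently discards characters that have no ASCII decomposition, such as Greek letters, so a real conversion step would need explicit mappings for those.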