Lexical Systems & Tools (SPECIALIST)

Project information

LHNCBC's Lexical Systems Group develops and maintains the SPECIALIST Lexicon and the tools that support and exploit it. The SPECIALIST Lexicon and NLP Tools are at the center of NLM's natural language research and provide a foundation for all of our natural language processing efforts. More broadly, we investigate how natural language processing techniques can help mediate between the language of users and the language of online biomedical information resources.

The SPECIALIST NLP Tools help application developers with lexical variation and text analysis tasks in the biomedical domain. The tool set includes the following components:

A set of Java programs designed to help users manage lexical variation, indexing, and normalization in biomedical text.

A nested set of Java objects designed to help users analyze free-text documents into words, terms, phrases, sentences, and sections.

A Java version of the JDI tool, which categorizes biomedical text for applications that include word sense disambiguation and retrieval according to discipline.

Two spelling programs: GSpell, a spelling suggestion tool, and BagOwordsPlus, a phrase retrieval tool.

dTagger, a part-of-speech (POS) tagger.

Visual Tagging Tool (VTT), a simple, lightweight, portable, Java Swing-based annotation tool designed to make it easy to mark up text.

Sub-Term Mapping Tools (STMT), a generic tool set that provides comprehensive sub-term features for NLP applications. It is used to find the longest prefix, all prefixes, sub-terms, and synonymous sub-term substitutions for query expansion.

CSpell (a spell checker for consumer language), a distributable spelling tool that detects and corrects various types of spelling errors. CSpell handles non-word errors, real-word errors, word boundary infractions, punctuation errors, informal expressions, and combinations of these, and it achieves a high F1 score with real-time performance.
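The sub-term lookups described for STMT above (sub-terms and longest prefix) can be illustrated with a small sketch. This is not the STMT implementation or its API: the function names, the whitespace tokenization, and the toy `known_terms` set are all assumptions made for illustration, and the sketch treats the whole term as a valid prefix of itself.

```python
def sub_terms(term, known_terms):
    """Return every contiguous word sequence of `term` found in `known_terms`.

    Hypothetical helper: tokenizes on whitespace and lowercases the input.
    """
    words = term.lower().split()
    found = []
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            candidate = " ".join(words[start:end])
            if candidate in known_terms:
                found.append(candidate)
    return found


def longest_prefix(term, known_terms):
    """Return the longest leading word sequence of `term` in `known_terms`.

    Returns None when no prefix is known. Assumption: the full term
    itself counts as a prefix.
    """
    words = term.lower().split()
    for end in range(len(words), 0, -1):
        candidate = " ".join(words[:end])
        if candidate in known_terms:
            return candidate
    return None


# Toy vocabulary standing in for a real term list (illustrative only).
known_terms = {"breast", "cancer", "breast cancer", "screening"}

query = "Breast cancer screening program"
print(sub_terms(query, known_terms))
print(longest_prefix(query, known_terms))
```

A query expansion step could then replace any matched sub-term with its synonyms, which is the kind of synonymous sub-term substitution mentioned above.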

Recently, the Lexical Systems Group began a project to enhance the derivational-variants function of the Lexical Tools. This function uses a set of derivational facts and rules to generate or identify derivational variants of input terms. Derivational variants are words related by a word-formation process such as suffixation, prefixation, or conversion (a change of category). The current derivational-variant system has only suffix rules and facts, which are entered and curated by hand. To add prefixation and conversion functionality to the system, the PDM team has developed a method that automatically extracts candidate pairs of words that may be derivationally related, which helps automate the creation of rules and facts for prefixation and conversion.

In addition, LSG has started to include consumer data in the Lexicon. This task includes: 1) adding terms from a consumer corpus to the Lexicon, where the corpus was built by collecting health-related articles from 16 public consumer-facing NIH websites; and 2) collecting and tagging patient-centered interface terminology and vocabulary through a new classification-type feature in the Lexicon, in order to cover colloquial terms.
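The rule-and-fact approach to derivational variants described above can be sketched as follows. This is a minimal illustration, not the Lexical Tools implementation: the rule format, the three sample suffix rules, the single sample fact, and the tiny `LEXICON` set are all hypothetical. Rules propose a candidate by swapping a suffix and changing the category, facts list pairs that no rule produces, and a candidate is kept only if it is a known word.

```python
from typing import NamedTuple


class SuffixRule(NamedTuple):
    old_suffix: str    # suffix removed from the input word ("" means none)
    old_category: str  # syntactic category the input must have
    new_suffix: str    # suffix appended to the remaining stem
    new_category: str  # category of the generated variant


# Hypothetical suffix rules, for illustration only.
RULES = [
    SuffixRule("ize", "verb", "ization", "noun"),
    SuffixRule("ic", "adj", "ically", "adv"),
    SuffixRule("", "adj", "ness", "noun"),
]

# Hypothetical facts: explicit pairs that the rules above cannot generate.
FACTS = {
    ("deep", "adj"): {("depth", "noun")},
}

# Toy lexicon used to filter out candidates that are not real words.
LEXICON = {
    ("sterilize", "verb"), ("sterilization", "noun"),
    ("chronic", "adj"), ("chronically", "adv"),
    ("deep", "adj"), ("depth", "noun"), ("deepness", "noun"),
}


def derivational_variants(word, category):
    """Return the set of (variant, category) pairs for the given word."""
    variants = set(FACTS.get((word, category), set()))
    for rule in RULES:
        if rule.old_category != category or not word.endswith(rule.old_suffix):
            continue
        stem = word[: len(word) - len(rule.old_suffix)]
        candidate = (stem + rule.new_suffix, rule.new_category)
        if candidate in LEXICON:
            variants.add(candidate)
    return variants


print(derivational_variants("sterilize", "verb"))
print(derivational_variants("deep", "adj"))
```

Filtering candidates against the lexicon keeps an over-general rule from emitting non-words: for example, the empty-suffix rule would otherwise turn "chronic" into "chronicness".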

The SPECIALIST Lexicon and Lexical Tools are open source and freely downloadable. The 2019 release of the SPECIALIST Lexicon contains over 515,000 records, representing over 972,000 forms, an increase of over 9,800 records from the 2018 release. The new terms were derived through a systematic approach that uses the NLM MEDLINE n-gram set and expert-system models for effective Lexicon building.

Lu C, Tormey D, McCreedy L, Browne AC. Enhancing LexSynonym Features in the Lexical Tools. AMIA 2017 Annual Symposium, Washington, DC, November 04-08, 2017, p. 2090.
Bhupatiraju R, Fung K, Bodenreider O. MetaMapLite in Excel: Biomedical named-entity recognition for non-technical users. Stud Health Technol Inform (Proc Medinfo): 1252.
Lu C, Tormey D, McCreedy L, Browne AC. Generating A Distilled N-Gram Set: Effective Lexical Multiword Building in the SPECIALIST Lexicon. The 10th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2017), Vol(5): HEALTHINF, PORTO, Portugal, February 21-23, 2017, p. 77-87.
Lu C, Tormey D, McCreedy L, Browne AC. Enhanced LexSynonym Acquisition for Effective UMLS Concept Mapping. Stud Health Technol Inform. 2017;245:501-505.
Kastrin A, Rindflesch TC, Hristovski D. Link prediction on a network of co-occurring MeSH terms: Towards literature-based discovery. Methods of Information in Medicine 55(4):340-6.
Lu C, Tormey D, McCreedy L, Browne AC. Multiword frequency analysis based on MEDLINE N-gram set [Poster]. AMIA 2016 Annual Symposium, Chicago, IL, November 12-16, 2016.
Lu C, Tormey D, McCreedy L, Browne AC. Generating SD-Rules in the SPECIALIST Lexical Tools - Optimization for suffix derivation rule set [Poster]. BIOSTEC 2016, Vol(5): HEALTHINF, Rome, Italy, February 21-23, 2016, p. 353-358. (9th International Conference on Health Informatics, HEALTHINF/BIOSTEC 2016, Best Poster Award)
Lu C, Tormey D, McCreedy L, Browne AC. Generating the MEDLINE N-Gram Set [Poster]. AMIA 2015 Annual Symposium, San Francisco, CA, November 14-18, 2015, p. 1569.
Cho D, Mork JG, Aronson A, Demner-Fushman D, Schmidt S, Ozga D, Pash J, Kilbourne J. MeSH on demand: an easy way to identify relevant MeSH terms and related articles from text [Poster]. National Cancer Institute (NCI) Symposium "RNA Biology 2015".
Lu C, Tormey D, McCreedy L, Browne AC. Using element words to generate (multi)words for the SPECIALIST Lexicon [Poster]. AMIA 2014 Annual Symposium, Washington, DC, November 15-19, 2014, p. 1499.