Evaluating Word Sense Disambiguation (WSD) methods in the biomedical domain is difficult because the available resources are either too small or too focused on specific types of entities (e.g., diseases or genes). We have developed a method that automatically builds a WSD test collection using the Unified Medical Language System (UMLS) Metathesaurus and the manual MeSH indexing of MEDLINE.
The resulting data set is called MSH WSD and consists of 106 ambiguous abbreviations, 88 ambiguous terms, and 9 that are a combination of both, for a total of 203 ambiguous words. Each instance containing the ambiguous word was assigned a CUI from the 2009AB version of the UMLS. For each ambiguous term or abbreviation, the data set contains a maximum of 100 instances per sense obtained from MEDLINE, for a total of 37,888 ambiguity cases in 37,090 MEDLINE citations.
The "MSH WSD Data Set" contains the benchmark_mesh.txt file, which lists each ambiguous word and its candidate CUIs, and the term_pmid_cui file, which contains one line for each ambiguous word, the PMID, and the disambiguated CUI. The data set also contains a file for each of the 203 ambiguous words, containing the PMID, the citation text (title and abstract only), and the sense, named according to the benchmark file (M1, M2, ...). In the citation text, the instance of the ambiguous word considered for disambiguation is denoted by the e tag (e.g., <e>AA</e>). A README.txt file in the download explains the files in more detail.
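As a rough illustration of how the marked instance might be pulled out of a citation, here is a minimal Python sketch. It assumes the e tag is written as an XML-style pair, `<e>...</e>`. The function name `find_marked_instance` and the sample sentence are hypothetical and are not taken from the data set. Consult README.txt for the actual file layout before relying on this.

```python
import re

# Assumption: the "e tag" is an XML-style pair, <e>...</e>.
E_TAG = re.compile(r"<e>(.*?)</e>", re.DOTALL)


def find_marked_instance(citation_text):
    """Return (marked_word, text_without_tags), or None if no e tag is found."""
    match = E_TAG.search(citation_text)
    if match is None:
        return None
    # Replace each tagged span with its inner text to recover the plain citation.
    plain_text = E_TAG.sub(lambda m: m.group(1), citation_text)
    return match.group(1), plain_text


# Invented sample sentence, used only to exercise the function.
sample = "Levels of <e>AA</e> were measured in plasma."
result = find_marked_instance(sample)
print(result)
```

Keeping both the marked word and the untagged text lets a WSD system build its context window without the markup leaking into the features.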
Please Note: Users are responsible for compliance with the UMLS Metathesaurus License Agreement.
To use this test collection, you must have accepted the terms of the UMLS Metathesaurus License Agreement, which requires you to respect the copyrights of the constituent vocabularies and to file a brief annual report on your use of the UMLS. You also must have activated a UMLS Terminology Services (UTS) account.
The 37,090 MEDLINE citations included in this "MSH WSD Data Set" are for exclusive use with the MSH WSD Data Set and cannot be redistributed. In addition, the citations were retrieved in July 2010 and represent a static view of MEDLINE at that time. The data set has been reformatted so that none of the MEDLINE ASCII element labels (e.g., "PMID-" or "TI -") remain, and only the Title (TI) and Abstract (AB) elements were used.
MSH WSD Data Set (17 MB compressed, 53 MB uncompressed)
Bridget T. McInnes, University of Minnesota Twin Cities, has kindly provided us with matchups between the various WSD Ambiguity choices and their corresponding UMLS CUIs. They are distributed as a gzipped tar file with a directory containing one file for each of the 50 ambiguities, showing the original choices with the UMLS CUI at the end of each entry. Bridget is responsible for the 1999 mappings.
Mark Stevenson, University of Sheffield, has kindly provided us with the 2007AB UMLS matchups between the various WSD Ambiguity choices and their corresponding UMLS CUIs.
M1|Adjustment <1> (Individual Adjustment)|inbe, Individual Behavior|C0376209
M2|Adjustment <3> (Adjustment Action)|ftcn, Functional Concept|C0456081
M3|adjustment <5> (Psychological adjustment)|menp, Mental Process|C0683269
PLEASE NOTE: The UMLS CUIs in these files are based on the 1999 and 2007AB UMLS data! Some changes do occur with every UMLS release and some changes may have occurred to these specific concepts since the releases of the 1999 and 2007AB UMLS data files.
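The sample mapping lines above are pipe-delimited, so they can be split into fields with a few lines of Python. The sketch below assumes every line has exactly four fields (sense label, choice text, semantic type, CUI) and that the semantic type is written as "abbreviation, name", as in the three examples shown. The names `SenseMapping` and `parse_mapping_line` are hypothetical, and lines that deviate from this shape would need extra handling.

```python
from typing import NamedTuple


class SenseMapping(NamedTuple):
    label: str                 # sense label, e.g. "M1"
    choice: str                # original WSD choice text
    semantic_type_abbrev: str  # e.g. "inbe"
    semantic_type_name: str    # e.g. "Individual Behavior"
    cui: str                   # UMLS concept identifier, e.g. "C0376209"


def parse_mapping_line(line):
    """Split one pipe-delimited mapping line into its four assumed fields."""
    # Raises ValueError if the line does not have exactly four fields.
    label, choice, semantic_type, cui = line.rstrip("\n").split("|")
    # The semantic type field is assumed to look like "abbrev, Full Name".
    abbrev, _, name = semantic_type.partition(", ")
    return SenseMapping(label, choice, abbrev, name, cui)


lines = [
    "M1|Adjustment <1> (Individual Adjustment)|inbe, Individual Behavior|C0376209",
    "M2|Adjustment <3> (Adjustment Action)|ftcn, Functional Concept|C0456081",
    "M3|adjustment <5> (Psychological adjustment)|menp, Mental Process|C0683269",
]
mappings = [parse_mapping_line(line) for line in lines]
label_to_cui = {m.label: m.cui for m in mappings}
print(label_to_cui)
```

Because the CUIs reflect the 1999 and 2007AB releases, a lookup table built this way should be checked against the UMLS release actually in use.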
Now Available from Dr. Ted Pedersen at the University of Minnesota, Duluth:
A small utility package called nlm2sval2, which will take the WSD Test Collection and convert it into the Senseval-2 lexical sample format. nlm2sval2 is written in Perl, and is freely available from their data conversion page at the following URL: http://www.d.umn.edu/~tpederse/tools.html