Access to SemRep/SemMedDB/SKR Resources

The SKR project maintains a database of 96.3 million SemRep predications extracted from all MEDLINE citations. This database supports the Semantic MEDLINE web application, which integrates PubMed searching, SemRep predications, automatic summarization, and data visualization. The application is intended to help users manage the results of PubMed searches. Output is visualized as an informative graph with links to the original MEDLINE citations.

To access any of the SemRep/SemMedDB/SKR Data Sets or the SemMedDB Database, users must have accepted the terms of the UMLS Metathesaurus License Agreement, which requires users to respect the copyrights of the constituent vocabularies and to file a brief annual report on their use of the UMLS. Users must also have activated a UMLS Terminology Services (UTS) account. For information on how to use UTS authentication, please click here.

For details of the licenses, please see the UMLS Metathesaurus License Agreement and How to License and Access the Unified Medical Language System (UMLS) Data.

The SemRep source code and associated annotated and test data sets are publicly available at our GitHub site: SemRep GitHub.

Learn more about SemRep here.

The Semantic MEDLINE Database (SemMedDB) is a repository of semantic predications (subject-predicate-object triples) extracted by SemRep, a semantic interpreter of biomedical text. SemMedDB currently contains approximately 96.3 million predications extracted from all PubMed citations (about 29.1 million) and forms the backbone of the Semantic MEDLINE application.
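As a rough illustration of what a subject-predicate-object triple looks like in code, the following Python sketch models a predication as a small record. The class name, field names, and sample rows are hypothetical and chosen for readability; they are not the SemMedDB schema and the rows are not taken from the database.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Predication:
    """One subject-predicate-object triple (hypothetical field names)."""
    subject: str    # concept in the subject position
    predicate: str  # relation linking the two concepts, e.g. TREATS
    obj: str        # concept in the object position
    source_id: str  # placeholder for the citation the triple came from


def with_predicate(preds: Iterable[Predication], predicate: str) -> List[Predication]:
    """Return only the predications that use the given predicate."""
    return [p for p in preds if p.predicate == predicate]


# Made-up rows for illustration only; not drawn from SemMedDB.
examples = [
    Predication("Aspirin", "TREATS", "Headache", "citation-1"),
    Predication("Smoking", "AFFECTS", "Lung function", "citation-2"),
    Predication("Ibuprofen", "TREATS", "Fever", "citation-3"),
]

for p in with_predicate(examples, "TREATS"):
    print(f"{p.subject} {p.predicate} {p.obj}")
```

For the actual table and column names, consult the SemMedDB schema documentation linked below.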

For details about the SemMedDB schema, click here.

To download the SemMedDB database, click here.

To learn more about Semantic MEDLINE, click here.

In early 2011, we conducted a gold standard annotation study in which we annotated a set of 500 sentences, randomly selected from MEDLINE abstracts, with semantic predications. The results are mainly intended to serve as an evaluation testbed for SemRep, but they can also be used by other information extraction systems based on UMLS domain knowledge. The study consisted of three phases: a) the practice phase, b) the main annotation phase, and c) the adjudication phase.

Here, we present two sets of annotations from the main phase as well as the adjudicated gold standard. For further details, please refer to our BMC Bioinformatics paper Constructing A Semantic Predication Gold Standard from the Biomedical Literature.

Available Files:

 Annotator A: Main Phase (main_A.xml) (1.3 MB)

 Annotator B: Main Phase (main_B.xml) (1.4 MB)

 Annotator C: Adjudication (adjudicated.xml) (1.4 MB)

 DTD file (annotations.dtd) (1.8 KB)
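
Because the annotation files are XML, a quick first step is to see which element names a file contains. The Python sketch below counts element names without assuming anything about the annotation format; the sample document and its element names are invented for illustration, and annotations.dtd remains the authoritative description of the real structure.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_counts(xml_text: str) -> Counter:
    """Count how often each element name occurs in an XML document."""
    root = ET.fromstring(xml_text)
    # root.iter() walks the root element and all of its descendants.
    return Counter(element.tag for element in root.iter())


# Hypothetical miniature document; these element names are NOT taken
# from annotations.dtd and serve only to exercise the function.
sample = "<doc><sentence><item/><item/></sentence><sentence/></doc>"

for tag, count in sorted(tag_counts(sample).items()):
    print(tag, count)
```

Assuming a file such as main_A.xml has been downloaded locally, the same function can be applied to its contents after reading the file into a string.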

To develop and evaluate a sortal anaphora resolution module, we annotated a corpus of 320 MEDLINE citations with pairwise sortal anaphora relations, each consisting of an anaphoric expression and its corresponding antecedent. Because we aimed for a general approach that takes all semantic types into account and therefore supports SemRep, we collected MEDLINE abstracts on a wide range of topics, including molecular biology and clinical medicine.

For further details, please refer to our BMC Bioinformatics paper Sortal anaphora resolution to enhance relation extraction from biomedical literature.

Sortal Anaphora dataset:

 Sortal Anaphora Dataset

Biomedical knowledge claims are often expressed with extra-propositional entities such as hypotheses, speculations, or opinions, rather than explicit facts (assertions or propositions). Currently, SemRep extracts propositional content in the form of predications. We studied the feasibility of incorporating extra-propositional information by assessing the factuality level of SemRep predications. To this end, we annotated semantic predications extracted from 500 PubMed abstracts with seven factuality values (FACT, PROBABLE, POSSIBLE, DOUBTFUL, COUNTERFACT, UNCOMMITTED, and CONDITIONAL).
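
The seven factuality values form a closed label set, which maps naturally onto an enumeration. The Python sketch below is one possible way to validate and tally such labels; the class and function names are hypothetical, and the sample label list is made up rather than taken from the dataset.

```python
from collections import Counter
from enum import Enum
from typing import Iterable


class Factuality(Enum):
    """The seven factuality values named in the text above."""
    FACT = "FACT"
    PROBABLE = "PROBABLE"
    POSSIBLE = "POSSIBLE"
    DOUBTFUL = "DOUBTFUL"
    COUNTERFACT = "COUNTERFACT"
    UNCOMMITTED = "UNCOMMITTED"
    CONDITIONAL = "CONDITIONAL"


def tally(labels: Iterable[str]) -> Counter:
    """Count labels per factuality value.

    Factuality(label) raises ValueError for any string outside the
    seven values, so typos in the input are caught early.
    """
    return Counter(Factuality(label) for label in labels)


# Made-up labels for illustration only.
counts = tally(["FACT", "POSSIBLE", "FACT", "DOUBTFUL"])
for value, n in counts.items():
    print(value.name, n)
```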

For further details, please refer to our PLoS ONE paper Assigning factuality values to semantic relations extracted from Biomedical Research Literature.

Factuality dataset:

 Factuality Dataset

This is a database of approximately 80,000 PubMed abstracts on Parkinson disease published since 1950, from which we extracted study characteristics relevant to assessing the translatability of pre-clinical animal research to human subjects. The dataset was generated using a text-mining tool named Menagerie (Zeiss et al., 2019). The extracted characteristics include species, models, interventions, genes, outcome polarity, and functional outcome measures.

For further details, please refer to our PLoS ONE paper Menagerie: A text-mining tool to support animal-human translation in neurodegeneration research.

TranslationResearch70 Database:

 TranslationResearch70 Database

While the UMLS provides the predication arguments and the linking predicates, indicator rules map syntactic elements in the text, such as verbs and nominalizations, to predicates in the UMLS Semantic Network (SN), e.g., TREATS, PREVENTS, and AFFECTS. The indicator file contains the SemRep indicators for the SN predicates that SemRep uses. At this time (version v1.7), SemRep uses an earlier Prolog file. The new Java file was created for use with SemRep v1.8 in a Java implementation. Consequently, some rules in the file have not been implemented yet; they are included for future implementation once SemRep Java is complete. Indicator rules range from simple to multi-phrase, and the format varies by type.
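
To make the idea of an indicator rule concrete, here is a minimal Python sketch of the simplest case: a lookup from a word and its syntactic category to an SN predicate. The table entries, category labels, and function name are assumptions made for illustration. They are not taken from the SemRep indicator file, whose actual format is described in the README, and multi-phrase rules are not covered.

```python
from typing import Dict, Optional, Tuple

# Hypothetical indicator table: (lemma, syntactic category) -> predicate.
# Entries are illustrative and not copied from the SemRep rules file.
INDICATORS: Dict[Tuple[str, str], str] = {
    ("treat", "verb"): "TREATS",
    ("treatment", "nominalization"): "TREATS",
    ("prevent", "verb"): "PREVENTS",
    ("affect", "verb"): "AFFECTS",
}


def predicate_for(lemma: str, category: str) -> Optional[str]:
    """Return the predicate indicated by a word, or None if no rule applies."""
    return INDICATORS.get((lemma.lower(), category))


print(predicate_for("treat", "verb"))
print(predicate_for("walk", "verb"))
```

A flat dictionary keeps the sketch short, but it can only express one-word indicators; anything resembling the multi-phrase rules mentioned above would need a richer rule representation.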


 README File for SemRep Indicator Rules (PDF)

SemRep Indicator Rules File:

 SemRep Indicator Rules File (XML)