SemMedDB Database Details

On this page, we describe the SemMedDB schema in detail: the database tables, their fields, and the relationships between the tables. We recently changed the database schema as shown below and used the new schema to build the latest databases, semmedVER30 and semmedVER30_A. The previous version of the database schema is documented separately. An example row is provided for each table.

TABLES:

Name: CITATIONS table
This table contains relevant metadata for each PubMed citation and has the following data fields:

PMID: PubMed identifier of the citation
ISSN: ISSN identifier of the journal or the proceedings where the article was published
DP: Publication date for the citation
EDAT: The date when the citation was added to PubMed
PYEAR: Completion date for the citation

PMID	ISSN	DP	EDAT	PYEAR
19851774	1432-203X	2009 Dec	2010 01 21	2009
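
As a small illustration of working with the date fields, the sketch below parses an EDAT value into a Python date. It assumes EDAT always follows the "YYYY MM DD" layout seen in the example row; the function name parse_edat is our own and not part of SemMedDB. DP is left as a string because its format varies (the example row has only a year and a month).

```python
from datetime import date, datetime


def parse_edat(edat: str) -> date:
    """Parse an EDAT value such as '2010 01 21' (assumed 'YYYY MM DD')."""
    return datetime.strptime(edat, "%Y %m %d").date()


# Example row from the CITATIONS table above.
row = {
    "PMID": "19851774",
    "ISSN": "1432-203X",
    "DP": "2009 Dec",
    "EDAT": "2010 01 21",
    "PYEAR": "2009",
}

entry_date = parse_edat(row["EDAT"])
print(entry_date.isoformat())  # 2010-01-21
```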

Name: GENERIC_CONCEPT table
This table contains the UMLS Metathesaurus concepts that are considered too generic based upon the 2006AA release. Concepts that are not stored in this table are considered novel. This table is used to populate the SUBJECT_NOVELTY and OBJECT_NOVELTY columns in the PREDICATION table defined below. Data fields in this table are as follows:

CONCEPT_ID: Auto generated primary key for each concept
CUI: The Concept Unique Identifier (CUI)
PREFERRED_NAME: The preferred name of the concept

CONCEPT_ID	CUI	PREFERRED_NAME
1956	C0699748	Pathogenesis
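
The way this table drives the novelty columns can be sketched as a simple set lookup. This is a minimal sketch, not the actual loading code: the function name novelty_flag is hypothetical, and the encoding (1 for novel, 0 for generic) is an assumption based on the OBJECT_NOVELTY value of 1 in the PREDICATION example row.

```python
from typing import AbstractSet


def novelty_flag(cui: str, generic_cuis: AbstractSet[str]) -> int:
    """Return 1 if the CUI is absent from GENERIC_CONCEPT (novel), else 0.

    The 1/0 encoding is an assumption, not taken from the schema description.
    """
    return 0 if cui in generic_cuis else 1


# Toy stand-in for the CUI column of GENERIC_CONCEPT, holding only the
# CUI from the example row (Pathogenesis).
generic_cuis = {"C0699748"}

print(novelty_flag("C0699748", generic_cuis))  # 0: listed as generic
print(novelty_flag("C1306232", generic_cuis))  # 1: not in this toy set
```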

Name: SENTENCE table
This table contains information about individual sentences from PubMed citations and includes the following data fields:

SENTENCE_ID: Auto-generated primary key for each sentence
PMID: The PubMed identifier of the citation to which the sentence belongs
TYPE: 'ti' for the title of the citation, 'ab' for the abstract
NUMBER: The location of the sentence within the title or abstract
SENT_START_INDEX: The character position within the text of the MEDLINE citation of the first character of the sentence
SENT_END_INDEX: The character position within the text of the MEDLINE citation of the last character of the sentence
SECTION_HEADER: Section header name of structured abstract (from Version 3.1)
NORMALIZED_SECTION_HEADER: Normalized section header name of structured abstract (from Version 3.1)
SENTENCE: The actual string or text of the sentence

SENTENCE_ID	PMID	TYPE	NUMBER	SENT_START_INDEX	SENT_END_INDEX	SECTION_HEADER	NORMALIZED_SECTION_HEADER	SENTENCE
226	32335253	ab	1	168	317	INTRODUCTION	BACKGROUND	INTRODUCTION: Long term secondary aortic reinterventions (SARs) can be a sing of (lack of) effectiveness of abdominal aortic aneurysm (AAA) surgery.

Name: PREDICATION table
Each record in this table identifies a unique predication. The data fields are as follows:

PREDICATION_ID: Auto-generated primary key for each unique predication
SENTENCE_ID: Foreign key to the SENTENCE table
PMID: The PubMed identifier of the citation to which the predication belongs
PREDICATE: The string representation of each predicate (for example TREATS, PROCESS_OF)
SUBJECT_CUI: The CUI of the subject of the predication
SUBJECT_NAME: The preferred name of the subject of the predication
SUBJECT_SEMTYPE: The semantic type of the subject of the predication
SUBJECT_NOVELTY: The novelty of the subject of the predication
OBJECT_CUI: The CUI of the object of the predication
OBJECT_NAME: The preferred name of the object of the predication
OBJECT_SEMTYPE: The semantic type of the object of the predication
OBJECT_NOVELTY: The novelty of the object of the predication

PREDICATION_ID	SENTENCE_ID	PMID	PREDICATE	SUBJECT_CUI	...	OBJECT_CUI	...	OBJECT_NOVELTY
1252467	3369924	16655556	AFFECTS	C1306232	...	C1326386	...	1
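
A predication record can be read as a subject-predicate-object triple. The sketch below shows one way to hold the fields visible in the example row and reduce them to a triple; the Predication class and the as_triple helper are illustrative names of our own, and the fields elided in the example row are left out.

```python
from typing import NamedTuple, Tuple


class Predication(NamedTuple):
    """Subset of PREDICATION columns shown in the example row."""
    predication_id: int
    sentence_id: int
    pmid: str
    predicate: str
    subject_cui: str
    object_cui: str


def as_triple(p: Predication) -> Tuple[str, str, str]:
    """Reduce a predication to (subject CUI, predicate, object CUI)."""
    return (p.subject_cui, p.predicate, p.object_cui)


# Values taken from the example row above.
p = Predication(1252467, 3369924, "16655556", "AFFECTS", "C1306232", "C1326386")
print(as_triple(p))  # ('C1306232', 'AFFECTS', 'C1326386')
```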

Name: PREDICATION_AUX table
This table has auxiliary information for the predications recorded in the PREDICATION table. There is a one-to-one relationship between the PREDICATION and PREDICATION_AUX tables. The PREDICATION_AUX table includes the following data fields:

PREDICATION_AUX_ID: Auto-generated primary key for the auxiliary information of each unique predication
PREDICATION_ID: Foreign key to the PREDICATION table

The rest of the fields in the PREDICATION_AUX table provide mention-level information for the elements of the predication.

SUBJECT_TEXT: Text that maps to the subject
SUBJECT_DIST: The distance of the subject mention (counted in noun phrases) from the predicate mention (0 for certain indicator types, such as NOM)
SUBJECT_MAXDIST: The number of potential arguments (in noun phrases) from the predicate mention in the direction of the subject mention (0 for certain indicator types, such as NOM)
SUBJECT_START_INDEX: The first character position (in document) of the text denoting the subject entity
SUBJECT_END_INDEX: The last character position (in document) of the text denoting the subject entity
SUBJECT_SCORE: The confidence score of the mapping between the subject string and the subject concept
INDICATOR_TYPE: The part of speech of the predicate, such as VERB for verbal predicates and NOM for nominalizations and other argument-taking nouns. For a full list of indicator types, see the Appendix in [2]
PREDICATE_START_INDEX: The first character position (in document) of the text denoting the relation
PREDICATE_END_INDEX: The last character position (in document) of the text denoting the relation
OBJECT_*: The fields representing information about the object, in the same way the SUBJECT_* fields do for the subject
CURR_TIMESTAMP: The timestamp for the record

PREDICATION_AUX_ID	PREDICATION_ID	SUBJECT_TEXT	SUBJECT_DIST	SUBJECT_MAXDIST	...	OBJECT_TEXT	...	OBJECT_SCORE
1252473	1252467	severing	1	2	...	transpiration	...	888
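
The one-to-one link between PREDICATION and PREDICATION_AUX can be followed with a join on PREDICATION_ID. The sketch below uses an in-memory SQLite database purely as a stand-in so that it runs anywhere; the cut-down table definitions contain only the columns shown in the example rows, and the column types are assumptions rather than the real schema definitions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Cut-down versions of the two tables (types are assumptions).
conn.executescript("""
CREATE TABLE PREDICATION (
    PREDICATION_ID INTEGER PRIMARY KEY,
    SENTENCE_ID    INTEGER,
    PMID           TEXT,
    PREDICATE      TEXT,
    SUBJECT_CUI    TEXT,
    OBJECT_CUI     TEXT
);
CREATE TABLE PREDICATION_AUX (
    PREDICATION_AUX_ID INTEGER PRIMARY KEY,
    PREDICATION_ID     INTEGER UNIQUE REFERENCES PREDICATION (PREDICATION_ID),
    SUBJECT_TEXT       TEXT,
    SUBJECT_DIST       INTEGER,
    SUBJECT_MAXDIST    INTEGER,
    OBJECT_TEXT        TEXT,
    OBJECT_SCORE       INTEGER
);
""")

# Example rows from the PREDICATION and PREDICATION_AUX tables above.
conn.execute(
    "INSERT INTO PREDICATION VALUES (?, ?, ?, ?, ?, ?)",
    (1252467, 3369924, "16655556", "AFFECTS", "C1306232", "C1326386"),
)
conn.execute(
    "INSERT INTO PREDICATION_AUX VALUES (?, ?, ?, ?, ?, ?, ?)",
    (1252473, 1252467, "severing", 1, 2, "transpiration", 888),
)

# Pair each concept-level argument with the text that mentions it.
row = conn.execute(
    """
    SELECT p.SUBJECT_CUI, a.SUBJECT_TEXT, p.PREDICATE, p.OBJECT_CUI, a.OBJECT_TEXT
    FROM PREDICATION AS p
    JOIN PREDICATION_AUX AS a ON a.PREDICATION_ID = p.PREDICATION_ID
    WHERE p.PREDICATION_ID = ?
    """,
    (1252467,),
).fetchone()

print(row)  # ('C1306232', 'severing', 'AFFECTS', 'C1326386', 'transpiration')
```

The UNIQUE constraint on PREDICATION_AUX.PREDICATION_ID is one way to express the one-to-one relationship in this sketch; whether the actual schema enforces it this way is not stated here.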

Name: COREFERENCE table
This table has coreference information generated by SemRep with Anaphora (option -A). It includes the following data fields:

COREFERENCE_ID: Auto-generated primary key for each unique coreference
PMID: The PubMed identifier of the citation to which the coreference belongs
ANA_CUI: The CUI of the anaphor element of the coreference
ANA_NAME: The preferred name of the anaphor element of the coreference
ANA_SEMTYPE: The semantic type of the anaphor element of the coreference
ANA_TEXT: The text that maps to the anaphor
ANA_SENTENCE_ID: Foreign key to the SENTENCE table for the sentence containing the anaphor element of the coreference
ANA_START_INDEX: The first character position (in document) of the text denoting the anaphor
ANA_END_INDEX: The last character position (in document) of the text denoting the anaphor
ANA_SCORE: The confidence score of the mapping between the anaphor text and the anaphor concept
ANT_CUI: The CUI of the antecedent element of the coreference
ANT_NAME: The preferred name of the antecedent element of the coreference
ANT_SEMTYPE: The semantic type of the antecedent element of the coreference
ANT_TEXT: The text that maps to the antecedent
ANT_SENTENCE_ID: Foreign key to the SENTENCE table for the sentence containing the antecedent element of the coreference
ANT_START_INDEX: The first character position (in document) of the text denoting the antecedent
ANT_END_INDEX: The last character position (in document) of the text denoting the antecedent
ANT_SCORE: The confidence score of the mapping between the antecedent text and the antecedent concept
CURR_TIMESTAMP: The timestamp for the record

COREFERENCE_ID	PMID	ANA_CUI	ANA_NAME	ANA_SEMTYPE	...	ANT_CUI	...	CURR_TIMESTAMP
355391	1000385	C0029235	Organism	orgm	...	C0317850	...	2017-01-26 17:21:42

Name: ENTITY table
This table contains entity information taken from the ENTITY records of the full-fielded output. It includes the following data fields:

ENTITY_ID: Auto-generated primary key for each unique entity
SENTENCE_ID: Foreign key to the SENTENCE table
CUI: The CUI of the entity
NAME: The preferred name of the entity
TYPE: The semantic type of the entity
GENE_ID: The EntrezGene ID of the entity
GENE_NAME: The EntrezGene name of the entity
TEXT: The text in the utterance that maps to the entity
START_INDEX: The first character position (in document) of the text denoting the entity
END_INDEX: The last character position (in document) of the text denoting the entity
SCORE: The confidence score of the mapping between the entity text and the entity concept

ENTITY_ID	SENTENCE_ID	CUI	NAME	TYPE	...	TEXT	START_INDEX	END_INDEX	SCORE
12845406	3369924	C0806140	Flow	orga	...	flow	154	158	790
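
In the example row, END_INDEX minus START_INDEX (158 - 154) equals the length of TEXT ("flow"), which is consistent with reading the two offsets as a half-open range. The sketch below turns that observation into a sanity check; the half-open reading is an assumption drawn from this single row, and span_matches_text is a hypothetical helper name.

```python
def span_matches_text(text: str, start_index: int, end_index: int) -> bool:
    """Check that the offsets span exactly len(text) characters.

    Assumes a half-open [start_index, end_index) range; this convention is
    inferred from the example row, not stated by the schema description.
    """
    return end_index - start_index == len(text)


# TEXT, START_INDEX and END_INDEX from the ENTITY example row above.
print(span_matches_text("flow", 154, 158))  # True
```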

[Figure: entity-relationship diagram of SemMedDB version 4.2 and higher]

[1] Fiszman, M., et al. (2004). Abstraction summarization for managing the biomedical research literature. Proceedings HLT-NAACL Workshop on Computational Lexical Semantics. 76-83.
[2] Kilicoglu, H., et al. (2011). Constructing a semantic predication gold standard from the biomedical literature. BMC Bioinformatics, 12(486).