Medical Article Records System

Project information

The Medical Article Records System (MARS) project develops automated systems to extract bibliographic text from journal articles in both paper and electronic forms. For the approximately 1000 journal titles that arrive at NLM in paper form, a production MARS system combines document scanning, optical character recognition (OCR), and rule-based and machine learning algorithms to yield citation data that NLM’s indexers use to complete bibliographic records for MEDLINE. Our algorithms extract this data in a pipeline: segmenting page images into zones, assigning labels to the zones that signify their contents (title, author names, abstract, etc.), pattern matching to identify these entities, and lexicon-based pattern matching to correct OCR errors and to reduce the number of correct words that are wrongly flagged as errors, which increases operator productivity.
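
The last pipeline stage, lexicon-based correction, can be illustrated with a minimal sketch. This is not the MARS implementation. The tiny lexicon, the function names, and the similarity cutoff are all assumptions made for illustration, and Python's standard difflib stands in for whatever matching method the production system uses. The idea shown is that a word found in the lexicon is accepted without operator review, a near miss is corrected to the closest lexicon entry, and anything else is flagged for the operator.

```python
import difflib

# Hypothetical mini-lexicon; a production system would presumably use a
# large biomedical word list. These entries are illustrative only.
LEXICON = {"cardiac", "hepatitis", "protein", "receptor", "therapy"}


def review_word(word, lexicon=LEXICON, cutoff=0.8):
    """Classify one OCR token as 'ok', 'corrected', or 'flagged'.

    Returns a (status, text) pair. The cutoff value is an assumed
    similarity threshold, not a value taken from MARS.
    """
    token = word.lower()
    if token in lexicon:
        # Known word: no need to show it to the operator as a possible error.
        return ("ok", word)
    # Look for the single closest lexicon entry above the similarity cutoff.
    matches = difflib.get_close_matches(token, sorted(lexicon), n=1, cutoff=cutoff)
    if matches:
        return ("corrected", matches[0])
    # Nothing close enough: leave the word for the operator to check.
    return ("flagged", word)


def review_text(words):
    """Apply review_word to every token in a list of OCR words."""
    return [review_word(w) for w in words]
```

Calling review_text(["Protein", "hepatltis", "xyzzy"]) under these assumptions accepts the first word, corrects the second to "hepatitis", and flags the third, so the operator only has to look at one of the three words.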

A recently developed system, the Publisher Data Review System (PDRS), is designed to provide data missing from the XML citations received from publishers, such as databank accession numbers, NIH grant numbers, grant support categories, investigator names, and commented-on article information. By providing these missing data, PDRS reduces the manual effort needed to complete the citations sent in by publishers and to correct their errors. The automated steps that fill in missing data and correct wrong data substantially reduce the load on the operators, eliminating the need to look through an entire article to find this information and then key it in.

A third system, WebMARS, addresses cases where NLM is missing a journal issue or where citation data from publishers is incomplete. WebMARS is a software tool that operators can use to automatically create missing citations from these problematic issues. This eliminates the manual labor that operators currently spend typing, copying, and pasting data from online articles, a very time-consuming step.

The MARS, PDRS, and WebMARS systems rely on underlying research in image analysis, which also enables the creation of new initiatives in which these techniques find application.

Kim IC, Le DX, Thoma GR. Hybrid approach combining contextual and statistical information for identifying MEDLINE citation terms. Proc. SPIE-IS/T Electronic Imaging. San Jose, CA. January 2008;6815:68150P(1-9)
Zou J, Le DX, Thoma GR. Extracting a Sparsely-Located Named Entity from Online HTML Medical Articles Using Support Vector Machine. Proc SPIE-IS/T Electronic Imaging. San Jose, CA. January 2008;6815:6815OP(1-10)
Zou J, Le DX, Thoma GR. Online Medical Journal Article Layout Analysis. Proc SPIE-IS&T Electronic Imaging 2007, SPIE Vol. 6500: 65000V (1-12)
Chen S, Mao S, Thoma GR. Simultaneous Layout Style and Logical Entity Recognition in a Heterogeneous Collection of Documents. Proc ICDAR2007. Curitiba, Brazil; September 2007, pp. 118-22
Zou J, Le DX, Thoma GR. Structure and Content Analysis for HTML Medical Articles: A Hidden Markov Model Approach. Proc August 2007 ACM Symposium on Document Engineering. pp. 199-201
Kim IC, Le DX, Thoma GR. Identification of "comment-on sentences" in online biomedical documents using support vector machines. Proc. SPIE conference on Document Recognition and Retrieval, 6500:65000O (1-8), San Jose, January 2007.
Zou J, Le DX, Thoma GR. Combining DOM Tree and Geometric Layout Analysis for Online Medical Journal Article Segmentation. Proc JCDL, June 2006, Chapel Hill, NC; 119-28
Kim J, Le DX, Thoma GR. Automatic Extraction of Bibliographic Information from Biomedical Online Journal Articles Using a String Matching Algorithm. Proc IEEE CBMS, June 2006, Salt Lake City, Utah; 905-10
Demner-Fushman D, Few B, Hauser SE, Thoma GR. Automatically Identifying Health Outcome Information in MEDLINE Records. J Am Med Inform Assoc. 2006 Jan-Feb;13(1):52-60. Epub 2005 Oct 12.
Sabir TF, Hauser SE, Thoma GR. Historical Author Affiliations Assist Verification of Automatically Generated MEDLINE Citations. AMIA Annu Symp Proc. 2006:1082