
Medical Article Records System

Project information

The Medical Article Records System (MARS) project develops automated systems to extract bibliographic text from journal articles in both paper and electronic forms. For the approximately 1000 journal titles that arrive at NLM in paper form, a production MARS system combines document scanning, optical character recognition (OCR), and rule-based and machine learning algorithms to yield citation data that NLM’s indexers use to complete bibliographic records for MEDLINE. Our algorithms extract this data in a pipeline: segmenting page images into zones; assigning labels to the zones signifying their contents (title, author names, abstract, etc.); pattern matching to identify these entities; and lexicon-based pattern matching to correct OCR errors and to reduce the number of words incorrectly flagged as errors, which increases operator productivity.
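The lexicon-based correction step can be illustrated with a minimal sketch. This is not the MARS implementation: the tiny `LEXICON`, the `correct_word` helper, and the similarity cutoff are all hypothetical, and Python's standard `difflib` stands in for whatever matching the production system uses. The sketch shows the two effects described above: words found in the lexicon are not flagged as errors, and near misses are replaced by their closest lexicon entry.

```python
import difflib

# Hypothetical mini-lexicon; a real system would load a large biomedical vocabulary.
LEXICON = ["abstract", "author", "clinical", "journal", "medicine", "patient"]


def correct_word(word, lexicon=LEXICON, cutoff=0.75):
    """Return (word_or_correction, needs_review) for a single OCR token."""
    lower = word.lower()
    # A word already in the lexicon is accepted as-is, so it is not flagged as an error.
    if lower in lexicon:
        return word, False
    # Otherwise look for the closest lexicon entry above the similarity cutoff (assumed value).
    matches = difflib.get_close_matches(lower, lexicon, n=1, cutoff=cutoff)
    if matches:
        return matches[0], False
    # No plausible correction: leave the token unchanged and flag it for an operator.
    return word, True


if __name__ == "__main__":
    for token in ["medicine", "rnedicine", "xyzzy"]:
        print(token, "->", correct_word(token))
```

In this sketch only tokens with no close lexicon match reach the operator, which is one plausible way such a step could cut down on manual review.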

A recently developed system, the Publisher Data Review System (PDRS), is designed to provide data missing from the XML citations received from publishers, such as databank accession numbers, NIH grant numbers, grant support categories, investigator names, and commented-on article information. By providing these missing data, PDRS reduces the manual effort needed to complete the citations sent in by publishers and to correct their errors. The automated steps to fill in missing data and to correct wrong data substantially reduce the load on the operators, eliminating the need to look through an entire article to find this information and then key it in.
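The source does not describe how the missing fields are located, so the following is purely an illustrative sketch of one way grant numbers might be pulled from article text. The regular expression is a simplified assumption (a letter plus two digits as the activity code, a two-letter institute code, and a six-digit serial number), and `find_grant_numbers` is a hypothetical helper, not part of any NLM system.

```python
import re

# Assumed, simplified grant-number shape: e.g. "R01 CA123456" or "K23-HL-098765".
# Real grant numbers have more variants than this pattern covers.
GRANT_RE = re.compile(r"\b([A-Z]\d{2})[\s-]?([A-Z]{2})[\s-]?(\d{6})\b")


def find_grant_numbers(text):
    """Return grant-number-like strings found in text, in a normalized form."""
    return ["{} {}{}".format(*m.groups()) for m in GRANT_RE.finditer(text)]


if __name__ == "__main__":
    sample = "Supported by NIH grants R01 CA123456 and K23-HL-098765."
    print(find_grant_numbers(sample))
```

A candidate list like this would still need checking before being added to a citation; the sketch only narrows down where in the article to look.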

A third system, WebMARS, addresses cases where NLM is missing a journal issue or where citation data from publishers is incomplete. WebMARS is a software tool that operators can use to automatically create the missing citations for these problematic issues. This eliminates the manual labor operators currently spend typing, copying, and pasting data from online articles, a very time-consuming step.

The MARS, PDRS, and WebMARS systems rely on underlying research in image analysis, which also enables the creation of new initiatives in which these techniques find application.

Publications/Tools: 
Mao S, Kanungo T. Empirical Performance Evaluation of Page Segmentation Algorithms. SPIE conference on Document Recognition and Retrieval. 2000 Jan:303-314.
Hauser SE, Sabir TF, Thoma GR. OCR Correction Using Historical Relationships from Verified Text in Biomedical Citations. Proc. of 2003 Symposium on Document Image Understanding Technology. College Park MD: Institute for Advanced Computer Studies, University of Maryland. 2003 Apr:171-7.
Le DX, Thoma GR. Page Layout Classification Technique for Biomedical Documents. Proc. World Multiconference on Systems, Cybernetics and Informatics (SCI). 2000 Jul;X:348-52.
Pearson G, Moon CW. Bridging Two Biomedical Journal Databases with XML - A Case Study. Proc. 14th IEEE Symposium on Computer-Based Medical Systems: IEEE Computer Society. 2001 Jul:309-14.
Kim J, Le DX, Thoma GR. Automated Labeling of Bibliographic Data Extracted from Biomedical Online Journals. Proc. SPIE Electronic Imaging. 2003 Jan;5010:47-56.
Tran LQ, Moon CW, Le DX, Thoma GR. Web Page Downloading and Classification. Proc. 14th IEEE Symposium on Computer-Based Medical Systems: IEEE Computer Society. 2001 Jul:321-6.
Mao S, Rosenfeld A, Kanungo T. Document Structure Analysis Algorithms: A Literature Survey. Proc. SPIE Electronic Imaging. 2003 Jan;5010:197-207.
Hauser SE, Schlaifer J, Sabir TF, Demner-Fushman D, Thoma GR. Correcting OCR Text by Association with Historic Datasets. Proc. SPIE Electronic Imaging. 2003 Jan;5010:84-93.
