Medical Article Records System

Project information

The Medical Article Records System (MARS) project develops automated systems to extract bibliographic text from journal articles in both paper and electronic forms. For the approximately 1000 journal titles that arrive at NLM in paper form, a production MARS system combines document scanning, optical character recognition (OCR), and rule-based and machine learning algorithms to yield citation data that NLM’s indexers use to complete bibliographic records for MEDLINE. Our algorithms extract this data in a pipeline process: segmenting page images into zones; assigning labels to the zones to signify their contents (title, author names, abstract, etc.); pattern matching to identify these entities; and lexicon-based pattern matching to correct OCR errors and to reduce the number of words incorrectly flagged as errors, which increases operator productivity.
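
The lexicon-based step at the end of this pipeline can be illustrated with a minimal sketch. It is not the MARS implementation: the tiny word list, the `check_word` function, and the similarity cutoff are all assumptions chosen for illustration, and Python's standard `difflib` stands in for whatever matching method the production system uses.

```python
import difflib

# Hypothetical lexicon; a real system would use a far larger biomedical word list.
LEXICON = {"protein", "kinase", "receptor", "insulin", "patients", "clinical"}


def check_word(word, lexicon=LEXICON, cutoff=0.8):
    """Classify one OCR word against the lexicon.

    Returns a (status, text) pair:
      "ok"        - the word is in the lexicon, so it is not flagged
      "corrected" - a close lexicon entry was found and is suggested
      "suspect"   - nothing close enough; left for the operator to review
    """
    key = word.lower()
    if key in lexicon:
        return "ok", word
    # Find the single closest lexicon entry above the similarity cutoff.
    matches = difflib.get_close_matches(key, sorted(lexicon), n=1, cutoff=cutoff)
    if matches:
        return "corrected", matches[0]
    return "suspect", word


if __name__ == "__main__":
    for token in ["protein", "prote1n", "xyzzy"]:
        print(token, check_word(token))
```

Words found in the lexicon are never shown to the operator as possible errors, which is how a lexicon can cut down on falsely flagged words as well as suggest corrections.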

A recently developed system, the Publisher Data Review System (PDRS), is designed to provide data missing from the XML citations received from publishers, such as databank accession numbers, NIH grant numbers, grant support categories, investigator names, and commented-on article information. By providing these missing data, PDRS reduces the manual effort in completing the citations sent in by publishers and corrects their errors. These automated steps to fill in missing data and to correct wrong data substantially reduce the load on the operators, eliminating the need to look through an entire article to find this information and then key it in.
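
As a rough illustration of how one such field might be located in article text, the sketch below pulls candidate NIH grant numbers out of a sentence with a regular expression. It is a sketch under stated assumptions and does not describe how PDRS works: the pattern covers only a simplified grant-number shape (activity code, two-letter institute code, six-digit serial number), and `GRANT_RE` and `find_grant_numbers` are hypothetical names.

```python
import re

# Assumed, simplified shape: activity code (letter + 2 digits), two-letter
# institute code, six-digit serial number, e.g. "R01 CA123456".
# Real grant numbers have more variants than this pattern covers.
GRANT_RE = re.compile(r"\b([A-Z]\d{2})[ -]?([A-Z]{2})[ -]?(\d{6})\b")


def find_grant_numbers(text):
    """Return candidate grant numbers found in text, in a normalized form."""
    return [
        f"{m.group(1)} {m.group(2)}{m.group(3)}"
        for m in GRANT_RE.finditer(text)
    ]


if __name__ == "__main__":
    sample = "Supported by NIH grants R01 CA123456 and P30-GM654321."
    print(find_grant_numbers(sample))
```

Candidates found this way would still need operator review, since a pattern alone cannot confirm that a matching string is a valid grant.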

A third system, WebMARS, addresses cases where NLM is missing a journal issue or where citation data from publishers is incomplete. WebMARS is a software tool that operators can use to automatically create the missing citations for these problematic issues. This eliminates the manual labor on the part of the operators of typing, copying, and pasting data from online articles, a very time-consuming step.

The MARS, PDRS, and WebMARS systems rely on underlying research in image analysis, which also enables the creation of new initiatives in which these techniques find application.

Thoma GR, Ford G, Le DX, Li Z. Text Verification in an Automated System for the Extraction of Bibliographic Data. Proc. 5th International Workshop on Document Analysis Systems, Springer-Verlag: Berlin. 2002 Aug;:423-32.
Le DX, Straughan SR, Thoma GR. Greek Alphabet Recognition Technique for Biomedical Documents. Proc. 6th World Multiconference on Systemics, Cybernetics and Informatics, eds: Callaos N, et al. 2002 July;III:86-91.
Thoma GR, Ford G. Automated Data Entry System: Performance Issues. Proc. SPIE: Document Recognition and Retrieval IX. 2002 Jan;4670:181-90.
Mao S, Kim J, Le DX, Thoma GR. Generating Robust Features for Style-Independent Labeling of Bibliographic Fields in Medical Journal Articles. Proc. 7th World Multiconference on Systemics, Cybernetics and Informatics. 2003 July;III:53-6.
Mao S, Kanungo T. Automatic Training of Page Segmentation Algorithms: An Optimization Approach. International Conference on Pattern Recognition. 2000 Sept;:531-534.
Le DX, Thoma GR. Automated Document Labeling for Web-Based Online Medical Journals. Proc. 7th World Multiconference on Systemics, Cybernetics and Informatics. 2003 July;II:411-15.
Hauser SE, Le DX, Thoma GR. Automated Zone Correction in Bitmapped Document Images. SPIE: Document Recognition and Retrieval VII. 2000 Jan;3976:248-58.
Ford G, Thoma GR. Ground Truth Data for Document Image Analysis. Proceedings of 2003 Symposium on Document Image Understanding and Technology. 2003 April 9-11;:199-205.
Hauser SE, Schlaifer J, Sabir TF, Demner-Fushman D, Thoma GR. Correcting OCR Text by Association with Historic Datasets. Proc. SPIE Electronic Imaging. 2003 Jan;5010:84-93.