De-Identification Tools

Project information

Narrative clinical reports contain a wealth of clinical knowledge that could be invaluable for research. However, they usually also contain personal identifiers, which are considered protected health information, are subject to use restrictions, and pose risks to privacy. Computational de-identification seeks to remove every identifier from such narrative text, producing de-identified documents that can be used in research while protecting patient privacy. It uses natural language processing (NLP) tools and techniques to recognize individually identifiable patient information in the text (e.g., names, addresses, telephone numbers, and social security numbers) and redact it. In this way, patient privacy is protected and clinical knowledge is preserved.
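As a rough illustration of the recognize-and-redact step described above, the following Python sketch replaces two kinds of identifiers with placeholder labels. It is not the LHNCBC application's method: the two regular expressions, the `PATTERNS` table, and the `redact` function are assumptions made for this example, and they cover only a few common US formats of telephone and social security numbers. Recognizing names and addresses generally requires more than fixed patterns.

```python
import re

# Hypothetical, deliberately narrow patterns; real identifiers take many more forms.
PATTERNS = {
    # Social security number written as 123-45-6789
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # US telephone number written as (301) 555-0142, 301-555-0142, or 301.555.0142
    "PHONE": re.compile(
        r"(?<!\d)(?:\(\d{3}\)\s*|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"
    ),
}


def redact(text):
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    note = "Pt called from (301) 555-0142; SSN 123-45-6789. BP 120/80."
    print(redact(note))
```

Clinical values such as the blood pressure reading are left untouched, which is the point of redacting rather than discarding the text.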

LHNCBC is developing a new software application that can de-identify many kinds of clinical reports with high accuracy. Its design combines a number of deterministic and probabilistic pattern recognition algorithms with various computational linguistic methods. The application accepts narrative reports in plain text or in HL7 format. When the reports are formatted as HL7 messages, the application uses the labeled patient information embedded in various HL7 segments to find the same information in the free-text narrative.
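The idea of using labeled HL7 fields to find identifiers in the narrative can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's implementation: it assumes an HL7 v2 message in which the patient name is in field PID-5 (components separated by `^`) and the narrative is carried in OBX-5, and it ignores HL7 escape sequences, repetitions, and other segments. The function names and the sample message are hypothetical.

```python
import re

# Hypothetical sample message; HL7 v2 segments are separated by carriage returns.
SAMPLE = "\r".join([
    r"MSH|^~\&|LAB|HOSP|||200801011200||ORU^R01|1|P|2.3",
    "PID|1||12345^^^MRN||DOE^JANE^Q||19600101|F",
    "OBX|1|TX|REPORT^Report||Jane Doe was seen today. Ms. Doe reports chest pain.",
])


def segments(message):
    """Split a message into segments, each a list of '|'-separated fields."""
    return [seg.split("|") for seg in re.split(r"[\r\n]+", message) if seg]


def patient_name_parts(message):
    """Return the name components from PID-5, skipping single-letter initials."""
    for fields in segments(message):
        if fields[0] == "PID" and len(fields) > 5:
            return [part for part in fields[5].split("^") if len(part) > 1]
    return []


def narrative_text(message):
    """Concatenate the OBX-5 values, assumed here to hold the free text."""
    return "\n".join(
        fields[5] for fields in segments(message)
        if fields[0] == "OBX" and len(fields) > 5
    )


def redact_known_names(message):
    """Redact occurrences of the labeled patient name in the narrative."""
    text = narrative_text(message)
    for part in patient_name_parts(message):
        text = re.sub(rf"\b{re.escape(part)}\b", "[NAME]", text,
                      flags=re.IGNORECASE)
    return text


if __name__ == "__main__":
    print(redact_known_names(SAMPLE))
```

Because the name comes from a labeled field, the match needs no guessing about what counts as a name. Single-letter initials are skipped here to avoid redacting ordinary one-letter words.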

The application includes a visualization and markup editor, the Visual Tagging Tool (VTT), which we use to produce the gold standards against which the de-identification tool is tested. Although designed specifically for tagging identifiers that constitute protected health information, VTT has been made publicly available to the wider NLP community for broader lexical tagging and text annotation.

Kayaalp M. ICU Outcome Predictions using Physiologic Trends in the First Two Days. Computing in Cardiology (39)977–980.
He Y, Kayaalp M. Biological Entity Recognition with Conditional Random Fields. AMIA Annu Symp Proc. 2008 Nov 6:293-7.
Friedlin FJ, McDonald C. A Software Tool for Removing Patient Identifying Information from Clinical Documents. J Am Med Inform Assoc. 2008 Sep-Oct;15(5):601-10. Epub 2008 Jun 25.
Kayaalp M. Separation of Data, Interpreters and Likelihood. March 2007. Technical Report to the LHNCBC Board of Scientific Counselors.
He Y, Kayaalp M. A Comparison of 13 Tokenizers on MEDLINE. December 2006. Technical Report.