De-Identification Tools

Focus Area: Natural Language Processing

Project link: https://scrubber.nlm.nih.gov/

Narrative clinical reports contain a rich set of clinical knowledge that could be invaluable for clinical research. However, they usually also contain personal identifiers that are considered protected health information and are therefore subject to use restrictions and pose risks to privacy. Computational de-identification seeks to remove all identifiers from such narrative text, producing de-identified documents that can be used in research while protecting patient privacy. It uses natural language processing (NLP) tools and techniques to recognize individually identifiable patient information in the text (e.g., names, addresses, telephone numbers, and social security numbers) and redact it. In this way, patient privacy is protected and clinical knowledge is preserved.
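As a rough illustration of the recognize-and-redact idea described above, the following Python sketch replaces two kinds of identifiers with placeholder labels. The patterns, the labels, and the `redact` function are assumptions made for this example and are not taken from the LHNCBC software; a real de-identification system needs far broader coverage (names, addresses, dates, and so on) than two regular expressions can provide.

```python
import re

# Hypothetical patterns for illustration only (US-style formats).
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    print(redact("Call 301-555-0142; SSN 123-45-6789."))
```

Replacing each identifier with a label, rather than deleting it, keeps the sentence readable and records what kind of information was removed.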

LHNCBC is developing a new software application that is capable of de-identifying many kinds of clinical reports with high accuracy. The software design uses a number of deterministic and probabilistic pattern recognition algorithms and various computational linguistic methods. The application accepts narrative reports in plain text or in HL7 format. When the reports are formatted as HL7 messages, the application leverages the labeled patient-related information embedded in various HL7 segments to find such information in the free text narrative.
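The idea of using labeled HL7 fields to locate identifiers in the narrative could be sketched as follows. This is a simplified illustration, not the application's actual method: it assumes an HL7 v2 message with pipe-delimited fields, reads only the first component of PID-3 (patient identifier) and the family, given, and middle name components of PID-5, and ignores repetitions and escape sequences. The function names `known_identifiers` and `redact_narrative` and the `[REDACTED]` placeholder are hypothetical.

```python
import re


def known_identifiers(hl7_message: str) -> list[str]:
    """Collect patient ID and name components from PID segments."""
    terms: list[str] = []
    # HL7 v2 segments end with a carriage return; tolerate newlines too.
    for segment in hl7_message.replace("\n", "\r").split("\r"):
        fields = segment.split("|")
        if fields[0] != "PID":
            continue
        if len(fields) > 3:
            # PID-3: keep only the ID number, not the assigning authority.
            terms.append(fields[3].split("^")[0])
        if len(fields) > 5:
            # PID-5: family, given, and middle name components.
            terms.extend(fields[5].split("^")[:3])
    # Skip empty values and single characters to limit over-redaction.
    return [t for t in terms if len(t) >= 2]


def redact_narrative(text: str, terms: list[str]) -> str:
    """Redact whole-word, case-insensitive occurrences of each term."""
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        text = re.sub(pattern, "[REDACTED]", text, flags=re.IGNORECASE)
    return text


if __name__ == "__main__":
    message = "MSH|^~\\&|LAB\rPID|1||12345^^^MRN||DOE^JANE^Q\r"
    note = "Jane Doe (12345) reports chest pain."
    print(redact_narrative(note, known_identifiers(message)))
```

Because the structured fields state exactly who the patient is, a lookup like this can catch names that a general-purpose pattern would miss, though it would still need to be combined with other methods for identifiers that are not listed in the message.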

The application software includes an editor for visualization and markup called the Visual Tagging Tool (VTT), which we use to produce the gold standards against which the tool is tested. Although VTT was designed specifically for tagging personal identifiers that constitute protected health information, it has been made publicly available to the broader NLP community for more general lexical tagging and text annotation.