De-Identification Tools

Project information

Narrative clinical reports contain a rich set of clinical knowledge that could be invaluable for clinical research. However, they usually also contain personal identifiers that are considered protected health information and that carry use restrictions and privacy risks. Computational de-identification seeks to remove all of the identifiers in such narrative text, producing de-identified documents that can be used in research while protecting patient privacy. It uses natural language processing (NLP) tools and techniques to recognize patient-related individually identifiable information (e.g., names, addresses, telephone numbers, and Social Security numbers) in the text and to redact it. In this way, patient privacy is protected and clinical knowledge is preserved.
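As a rough illustration of the recognize-and-redact idea, the sketch below replaces two kinds of numeric identifiers with placeholder labels and leaves the surrounding clinical text untouched. It is not the LHNCBC software or its method: the patterns, the labels, and the function name `redact` are assumptions made for this example, and they cover only common US formats.

```python
import re

# Hypothetical patterns for illustration only; real clinical text
# needs far broader coverage than these two US-style formats.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(
        r"(?<!\d)(?:\(\d{3}\)\s*|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"
    ),
}


def redact(text: str) -> str:
    """Replace each matched identifier with a bracketed category label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    note = "Pt reachable at (301) 555-0142. SSN 123-45-6789. BP 120/80."
    print(redact(note))
```

Replacing an identifier with a category label, rather than deleting it, keeps the sentence readable and shows what kind of information was removed.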

LHNCBC is developing a new software application that is capable of de-identifying many kinds of clinical reports with high accuracy. The software design uses a number of deterministic and probabilistic pattern recognition algorithms and various computational linguistic methods. The application accepts narrative reports in plain text or in HL7 format. When the reports are formatted as HL7 messages, the application leverages the labeled patient-related information embedded in various HL7 segments to find such information in the free text narrative.
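The HL7-based approach described above can be sketched in a simplified form. The example assumes an HL7 v2 message in which the patient name is in field PID-5, with components separated by `^`, and uses those name tokens to find matches in free text. The function names, the `[NAME]` label, and the sample message are hypothetical. The actual application may draw on other segments and fields and use more sophisticated matching.

```python
import re
from typing import List


def pid_name_tokens(message: str) -> List[str]:
    """Collect patient-name tokens from PID-5 of an HL7 v2 message."""
    tokens: List[str] = []
    for segment in re.split(r"[\r\n]+", message.strip()):
        fields = segment.split("|")
        if fields[0] == "PID" and len(fields) > 5:
            # PID-5 components are separated by '^'; repetitions by '~'.
            for part in re.split(r"[\^~]", fields[5]):
                # Assumption: skip initials, which would over-redact
                # ordinary one-letter tokens in the narrative.
                if len(part) >= 2:
                    tokens.append(part)
    return tokens


def redact_names(narrative: str, tokens: List[str]) -> str:
    """Redact whole-word, case-insensitive matches of the given tokens."""
    for token in tokens:
        narrative = re.sub(
            rf"\b{re.escape(token)}\b", "[NAME]", narrative, flags=re.IGNORECASE
        )
    return narrative


if __name__ == "__main__":
    # Hypothetical message; the narrative is carried in OBX-5.
    sample = "\r".join([
        "PID|1||12345||DOE^JANE^Q",
        "OBX|1|TX|||Ms. Jane Doe was seen today.",
    ])
    print(redact_names("Ms. Jane Doe was seen today.", pid_name_tokens(sample)))
```

Labeled fields give the matcher known values to search for, which is easier than recognizing an unknown name from context alone. Misspellings and nicknames in the narrative would still need separate handling.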

The application software includes an editor for visualization and markup called the Visual Tagging Tool (VTT) that we use to produce gold standards against which to test the tool. Although designed specifically for tagging identifiers that contain personally identifiable, protected health information, VTT has been made publicly available to the greater NLP community for expanded lexical tagging and text annotation.

Kayaalp M. Patient Privacy in the Era of Big Data. Balkan Med J. 2018 Jan 20;35(1):8-17. doi: 10.4274/balkanmedj.2017.0966. Epub 2017 Sep 13.
Kayaalp M. Modes of De-identification. AMIA Annu Symp Proc. 2018 Apr 16;2017:1044-1050. eCollection 2017.
Kayaalp M, Dodd Z, Browne AC, Sagan P, McDonald CJ. Software: NLM-Scrubber.
Kayaalp M, Browne AC, Dodd Z, Sagan P, McDonald CJ. De-identification of Address, Date, and Alphanumeric Identifiers in Narrative Clinical Reports. AMIA Annu Symp Proc. 2014 Nov 14;2014:767-76. eCollection 2014.
Browne AC, Kayaalp M, Dodd Z, Sagan P, McDonald CJ. The Challenges of Creating a Gold Standard for De-identification Research. AMIA Annu Symp Proc. 2014 Nov 14;2014:353-8. eCollection 2014.
Huser V, Kayaalp M, Dodd Z, Cimino J. Piloting a deceased subject integrated data repository and protecting privacy of relatives. AMIA Annu Symp Proc. 2014 Nov 14;2014:719-28. eCollection 2014.
Ozturk S, Kayaalp M, McDonald CJ. Visualization of patient prescription history data in emergency care. AMIA Annu Symp Proc. 2014 Nov 14;2014:963-8. eCollection 2014.
Kayaalp M, Browne AC, Callaghan FM, Dodd Z, Divita G, Ozturk S, McDonald CJ. The pattern of name tokens in narrative clinical text and a comparison of five systems for redacting them. J Am Med Inform Assoc. 2014 May-Jun;21(3):423-31. doi: 10.1136/amiajnl-2013-001689. Epub 2013 Sep 11.
Kayaalp M, Browne AC, Dodd Z, Sagan P, McDonald CJ. Clinical Text De-Identification Research. September 2013 Technical Report to the LHNCBC Board of Scientific Counselors.
Kang YS, Kayaalp M. Extracting laboratory test information from biomedical text. J Pathol Inform. 2013 Aug 31;4:23. doi: 10.4103/2153-3539.117450. eCollection 2013.