CMU-ISRI-04-115
Institute for Software Research International
 School of Computer Science, Carnegie Mellon University
 
    
     
 
How (Not) to Protect Genomic Data Privacy in a Distributed Network: Using Trail Re-identification to Evaluate and Design Privacy Protection Systems
 
Bradley Malin, Latanya Sweeney 
May 2004  
CMU-ISRI-04-115.ps, CMU-ISRI-04-115.pdf
 Keywords: Privacy, anonymity, re-identification, genomics, DNA databases
 The increasing integration of patient-specific genomic data 
into clinical practice and research raises serious privacy 
concerns. Various systems have been proposed that protect 
privacy by removing explicitly identifying information, such 
as name or social security number, or by encrypting it into 
pseudonyms. Though these systems claim to protect identity 
from being disclosed, they lack formal proofs. In this paper, 
we study the erosion of privacy when genomic data, either 
pseudonymous or believed to be anonymous, is released 
into a distributed healthcare environment. Several algorithms 
are introduced, collectively called RE-Identification of Data 
In Trails (REIDIT), which link genomic data to named individuals 
in publicly available records by leveraging unique features in 
patient-location visit patterns. Algorithmic proofs of 
re-identification are developed and we demonstrate, with 
experiments on real-world data, that susceptibility to 
re-identification is neither trivial nor the result of 
bizarre isolated occurrences. We propose that such 
techniques can be applied as system tests of privacy 
protection capabilities.
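
The idea of linking records through visit patterns can be illustrated with a small sketch. This is not the REIDIT algorithms from the report, only a toy exact-match version under simplifying assumptions: every location releases both a de-identified DNA list and a separate identified list, and both lists are complete. The location names, pseudonyms, patient names, and the functions build_trails and link_unique_trails are all hypothetical.

```python
from collections import defaultdict


def build_trails(releases):
    """Return {key: frozenset of locations at which that key was released}.

    `releases` maps a location to the set of keys (pseudonyms or names)
    that the location made available.
    """
    trails = defaultdict(set)
    for location, keys in releases.items():
        for key in keys:
            trails[key].add(location)
    return {key: frozenset(locs) for key, locs in trails.items()}


def link_unique_trails(dna_trails, identity_trails):
    """Link a DNA pseudonym to a name when they share a trail that is
    unique on both sides (toy exact-match rule, assumed for illustration)."""
    dna_by_trail = defaultdict(list)
    for key, trail in dna_trails.items():
        dna_by_trail[trail].append(key)

    names_by_trail = defaultdict(list)
    for key, trail in identity_trails.items():
        names_by_trail[trail].append(key)

    links = {}
    for trail, dna_keys in dna_by_trail.items():
        names = names_by_trail.get(trail, [])
        if len(dna_keys) == 1 and len(names) == 1:
            links[dna_keys[0]] = names[0]
    return links


# Hypothetical releases from three locations.
dna_releases = {
    "H1": {"d1", "d2"},
    "H2": {"d1", "d3"},
    "H3": {"d2", "d3", "d4", "d5"},
}
identity_releases = {
    "H1": {"Alice", "Bob"},
    "H2": {"Alice", "Carol"},
    "H3": {"Bob", "Carol", "Dave", "Eve"},
}

links = link_unique_trails(
    build_trails(dna_releases), build_trails(identity_releases)
)
print(links)
```

In this made-up data, d1, d2, and d3 each have a visit pattern shared by exactly one named patient, so they are linked to Alice, Bob, and Carol. The pseudonyms d4 and d5 both appear only at H3, as do Dave and Eve, so their trails are ambiguous and the sketch leaves them unlinked.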
 
17 pages 