Computer Science Department
School of Computer Science, Carnegie Mellon University

CMU-CS-02-189

Compromising Privacy in Distributed Population-Based Databases with Trail Matching: A DNA Example
 
Bradley Malin, Latanya Sweeney 
December 2002  
 Keywords: Data privacy, anonymity, security, re-identification
algorithms, databases
 This paper is concerned with the privacy of person-specific data 
collected over multiple institutions. In particular, we focus on an 
example of person-specific DNA sequences collected and stored at 
various hospitals in a defined geographic region. The applications of human genetics and genomic analysis have generated much discussion with respect
to privacy and confidentiality in ethical, legal, and social issues. 
For the most part, previous analysis has concentrated on the direct
application and disclosure of an individual's genetic information.
Much less attention has been devoted to the
computational challenges to privacy in the secondary sharing of
de-identified databases (i.e., databases released in a format devoid of
directly identifying information, such as name, address, or phone number). We
introduce methods for determining the re-identifiability of such DNA
data and, in the process, prove that the removal of
identifying information from DNA does not sufficiently protect the
privacy of the entities from which the data were derived. We
demonstrate, through several novel re-identification algorithms, that
despite a lack of personal demographic information, such database
entries can be re-identified through linkage to other publicly
available databases, such as hospital discharge information. The
linkage relies on hospital visit and data collection patterns, which we
refer to as data trails and which are iteratively discovered from
released data collections. Using real-world data, we are able to
determine when identifiable linkages can occur for a substantial 
number of individuals with particular gene-based disorders. Furthermore,
we provide empirical analysis of the re-identification algorithms with 
respect to population-institution visit distributions and data trails.
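
The idea of linking records through data trails can be illustrated with a minimal sketch. It is not one of the report's algorithms: it assumes that every patient leaves a DNA record at each hospital they visit, so that trails can be compared for exact equality, and it links a DNA record to a name only when exactly one of each shares a given trail. The function names, hospital labels, and records are hypothetical.

```python
from collections import defaultdict


def build_trails(releases):
    """Map each record key to the set of institutions whose release contains it."""
    trails = defaultdict(set)
    for institution, keys in releases.items():
        for key in keys:
            trails[key].add(institution)
    return {key: frozenset(trail) for key, trail in trails.items()}


def group_by_trail(trails):
    """Invert a key -> trail mapping into trail -> list of keys."""
    groups = defaultdict(list)
    for key, trail in trails.items():
        groups[trail].append(key)
    return groups


def match_unique_trails(dna_releases, discharge_releases):
    """Link a DNA record to a name when both are the only holders of a trail.

    Assumption (not from the report): trails are complete, so exact set
    equality is a valid matching criterion.
    """
    dna_groups = group_by_trail(build_trails(dna_releases))
    name_groups = group_by_trail(build_trails(discharge_releases))
    links = {}
    for trail, dna_keys in dna_groups.items():
        names = name_groups.get(trail, [])
        if len(dna_keys) == 1 and len(names) == 1:
            links[dna_keys[0]] = names[0]
    return links


# Hypothetical de-identified DNA releases, one set of sequence keys per hospital.
dna_releases = {
    "H1": {"dna_1", "dna_2"},
    "H2": {"dna_1", "dna_3"},
    "H3": {"dna_3"},
    "H4": {"dna_4", "dna_5"},
}
# Hypothetical identified discharge data, one set of names per hospital.
discharge_releases = {
    "H1": {"Alice", "Bob"},
    "H2": {"Alice", "Carol"},
    "H3": {"Carol"},
    "H4": {"Dave", "Erin"},
}

links = match_unique_trails(dna_releases, discharge_releases)
```

In this toy data, three DNA records have trails shared by exactly one name and are linked, while the two records seen only at H4 stay ambiguous because two patients share that trail.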
 
23 pages 