Thesis Sumathie Sundaresan Advisor: Dr. Huiping Guo.
Thesis
Sumathie Sundaresan
Advisor: Dr. Huiping Guo
My topic
• How to share medical records with third parties without compromising data privacy
Reported by AMRA
"...medical information is routinely shared with and viewed by third parties who are not involved in patient care .... The American Medical Records Association has identified twelve categories of information seekers outside of the health care industry who have access to health care files, including employers, government agencies, credit bureaus, insurers, educational institutions, and the media."
Privacy preserving data publishing
Microdata
• Purposes:
  – Allow researchers to effectively study the correlation between various attributes
  – Protect the privacy of every patient

Name   Age  Sex  Zipcode  Disease
Bob    23   M    11000    pneumonia
Ken    27   M    13000    dyspepsia
Peter  35   M    59000    dyspepsia
Sam    59   M    12000    pneumonia
Jane   61   F    54000    flu
Linda  65   F    25000    gastritis
Alice  65   F    25000    flu
Mandy  70   F    30000    bronchitis
A naïve solution
• Publish the microdata with the Name column removed:

Age  Sex  Zipcode  Disease
23   M    11000    pneumonia
27   M    13000    dyspepsia
35   M    59000    dyspepsia
59   M    12000    pneumonia
61   F    54000    flu
65   F    25000    gastritis
65   F    25000    flu
70   F    30000    bronchitis

• This does not work; see the inference attack that follows.
Inference attack
• An adversary knows that Bob
  – has been hospitalized before
  – is 23 years old
  – lives in an area with zipcode 11000
• Age, Sex and Zipcode are the quasi-identifier (QI) attributes of the published table.
• Only one tuple in the published table has Age 23 and Zipcode 11000, so the adversary learns that Bob has pneumonia.
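The inference attack can be sketched in a few lines of Python. This is only an illustration: the tuple layout and the helper name `link` are assumptions made for this sketch, and the rows are the published table with names removed.

```python
# Published table: the microdata with the Name column removed.
# Each row is (age, sex, zipcode, disease).
published = [
    (23, "M", 11000, "pneumonia"),
    (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"),
    (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"),
    (65, "F", 25000, "gastritis"),
    (65, "F", 25000, "flu"),
    (70, "F", 30000, "bronchitis"),
]


def link(table, age, zipcode):
    """Return the diseases of every published row that matches the known QI values."""
    return [disease for (a, _sex, z, disease) in table if a == age and z == zipcode]


# The adversary knows Bob is 23 and lives in zipcode 11000.
bob_candidates = link(published, age=23, zipcode=11000)
print(bob_candidates)  # ['pneumonia']
```

Because exactly one row matches, removing the Name column alone does not hide Bob's disease.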
Background
• Generalization
• Anatomy
Generalization
• Transform each QI value into a less specific form

A generalized table

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis

What the adversary knows about Bob

Name  Age  Sex  Zipcode
Bob   23   M    11000
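A minimal sketch of generalization, assuming fixed interval buckets. The bucket boundaries below are chosen by hand to reproduce the intervals in the generalized table; the slides do not say how the intervals were actually derived, and the function names are hypothetical.

```python
# Hypothetical interval buckets, picked to match the generalized table.
AGE_BUCKETS = [(21, 60), (61, 70)]
ZIP_BUCKETS = [(10001, 60000)]


def generalize(value, buckets):
    """Replace an exact value with the closed interval that contains it."""
    for low, high in buckets:
        if low <= value <= high:
            return (low, high)
    raise ValueError(f"no bucket covers {value}")


def generalize_tuple(age, sex, zipcode, disease):
    """Generalize the QI values Age and Zipcode; Sex and Disease are kept as-is."""
    return (generalize(age, AGE_BUCKETS), sex, generalize(zipcode, ZIP_BUCKETS), disease)


bob = generalize_tuple(23, "M", 11000, "pneumonia")
print(bob)  # ((21, 60), 'M', (10001, 60000), 'pneumonia')
```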
How much generalization do we need?
l-diversity
• A QI-group with m tuples is l-diverse, iff each sensitive value appears no more than m / l times in the QI-group.
• A table is l-diverse, iff all of its QI-groups are l-diverse.
• The generalized table above is 2-diverse: it has 2 QI-groups of 4 tuples each, and no disease appears more than 4 / 2 = 2 times in either group.
• Age, Sex and Zipcode are the quasi-identifier (QI) attributes; Disease is the sensitive attribute.
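The l-diversity definition translates directly into a check over the sensitive values of each QI-group. The sketch below is illustrative only; the function names are assumptions, and the two groups are the ones from the generalized table.

```python
from collections import Counter


def is_l_diverse_group(sensitive_values, l):
    """A group of m tuples is l-diverse iff every sensitive value appears at most m / l times."""
    m = len(sensitive_values)
    # count <= m / l is written as count * l <= m to stay in integer arithmetic.
    return all(count * l <= m for count in Counter(sensitive_values).values())


def is_l_diverse(groups, l):
    """A table is l-diverse iff all of its QI-groups are l-diverse."""
    return all(is_l_diverse_group(group, l) for group in groups)


groups = [
    ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia"],  # Age [21, 60], Sex M
    ["flu", "gastritis", "flu", "bronchitis"],             # Age [61, 70], Sex F
]

print(is_l_diverse(groups, 2))  # True
print(is_l_diverse(groups, 3))  # False
```

The table fails for l = 3 because pneumonia appears twice in the first group, which exceeds 4 / 3.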
What l-diversity guarantees
• From an l-diverse generalized table, an adversary (without any prior knowledge) can infer the sensitive value of each individual with confidence at most 1/l
• Example (a 2-diverse generalized table): Bob (Age 23, Sex M, Zipcode 11000) falls into the QI-group with Age [21, 60], Sex M and Zipcode [10001, 60000]. That group holds 2 pneumonia and 2 dyspepsia tuples, so the adversary can attribute either disease to Bob with confidence at most 1/2.
A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006.
Defect of generalization
• Query A: SELECT COUNT(*) FROM Unknown-Microdata
    WHERE Disease = 'pneumonia' AND Age in [0, 30]
    AND Zipcode in [10001, 20000]
• The generalized table contains two pneumonia tuples, both with Age [21, 60] and Zipcode [10001, 60000].
• Estimated answer: 2 * p, where p is the probability that each of the two tuples satisfies the query conditions
Defect of generalization (cont.)
• R1 is the region covered by each generalized pneumonia tuple (Age [21, 60], Zipcode [10001, 60000]); Q is the region selected by query A (Age [0, 30], Zipcode [10001, 20000]).
• Assuming a tuple is equally likely to lie anywhere in R1, p = Area( R1 ∩ Q ) / Area( R1 ) = 0.05
• Estimated answer for query A: 2 * p = 0.1

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  pneumonia

[Figure: rectangles R1 and Q plotted on Age (x-axis, 20 to 70) and Zipcode (y-axis, 10k to 60k)]
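The area ratio can be reproduced with a short sketch. It assumes values are uniformly distributed inside R1 and rounds the interval bounds to the grid lines of the figure (Age 20 to 60, Zipcode 10k to 60k). That rounding is an assumption about how the slide arrives at 0.05; with the exact bounds the ratio is slightly smaller.

```python
def overlap(low1, high1, low2, high2):
    """Length of the intersection of two intervals (0 if they do not meet)."""
    return max(0.0, min(high1, high2) - max(low1, low2))


def fraction_inside(region, query):
    """Fraction of the rectangle `region` that lies inside the rectangle `query`."""
    fraction = 1.0
    for (r_low, r_high), (q_low, q_high) in zip(region, query):
        fraction *= overlap(r_low, r_high, q_low, q_high) / (r_high - r_low)
    return fraction


# Each rectangle is [(age_low, age_high), (zip_low, zip_high)], with rounded bounds.
R1 = [(20, 60), (10_000, 60_000)]
Q = [(0, 30), (10_000, 20_000)]

p = fraction_inside(R1, Q)  # (10 / 40) * (10000 / 50000)
estimate = 2 * p            # two generalized pneumonia tuples
```

Here p comes out at about 0.05 and the estimate at about 0.1, matching the figures on the slide.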
Defect of generalization (cont.)
• Estimated answer for query A from the generalized table: 0.1
• The exact answer should be: 1. In the original microdata, only Bob (Age 23, Zipcode 11000, pneumonia) satisfies all three query conditions.
Basic Idea of Anatomy
• For a given microdata table, Anatomy releases a quasi-identifier table (QIT) and a sensitive table (ST)
Quasi-identifier Table (QIT)

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

Sensitive Table (ST)

Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1
Basic Idea of Anatomy (cont.)
1. Select a partition of the tuples
A 2-diverse partition

QI group  Age  Sex  Zipcode  Disease
1         23   M    11000    pneumonia
1         27   M    13000    dyspepsia
1         35   M    59000    dyspepsia
1         59   M    12000    pneumonia
2         61   F    54000    flu
2         65   F    25000    gastritis
2         65   F    25000    flu
2         70   F    30000    bronchitis
Basic Idea of Anatomy (cont.)
2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition

quasi-identifier table (QIT)

         Age  Sex  Zipcode
group 1  23   M    11000
         27   M    13000
         35   M    59000
         59   M    12000
group 2  61   F    54000
         65   F    25000
         65   F    25000
         70   F    30000

sensitive table (ST)

         Disease
group 1  pneumonia
         dyspepsia
         dyspepsia
         pneumonia
group 2  flu
         gastritis
         flu
         bronchitis
Basic Idea of Anatomy (cont.)
2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition

quasi-identifier table (QIT)

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

sensitive table (ST)

Group-ID  Disease
1         pneumonia
1         dyspepsia
1         dyspepsia
1         pneumonia
2         flu
2         gastritis
2         flu
2         bronchitis
Basic Idea of Anatomy (cont.)
2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition

quasi-identifier table (QIT)

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

sensitive table (ST)

Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1
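Step 2 can be sketched as follows, given a partition that has already been selected in step 1. The function name `anatomize` and the tuple layout are assumptions for illustration; choosing the partition itself is not covered here.

```python
from collections import Counter


def anatomize(groups):
    """Build (QIT, ST) from a partition given as a list of QI-groups.

    Each group is a list of (age, sex, zipcode, disease) tuples.
    """
    qit, st = [], []
    for group_id, group in enumerate(groups, start=1):
        # QIT: exact QI values plus the Group-ID, without the disease.
        for age, sex, zipcode, _disease in group:
            qit.append((age, sex, zipcode, group_id))
        # ST: one row per distinct disease in the group, with its count.
        counts = Counter(row[3] for row in group)
        for disease in sorted(counts):
            st.append((group_id, disease, counts[disease]))
    return qit, st


partition = [
    [(23, "M", 11000, "pneumonia"), (27, "M", 13000, "dyspepsia"),
     (35, "M", 59000, "dyspepsia"), (59, "M", 12000, "pneumonia")],
    [(61, "F", 54000, "flu"), (65, "F", 25000, "gastritis"),
     (65, "F", 25000, "flu"), (70, "F", 30000, "bronchitis")],
]

qit, st = anatomize(partition)
```

For this partition, `st` contains the five (Group-ID, Disease, Count) rows of the sensitive table and `qit` the eight tagged QI rows.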
Privacy Preservation
• From a pair of QIT and ST generated from an l-diverse partition, the adversary can infer the sensitive value of each individual with confidence at most 1/l
• Example: the adversary knows Bob's QI values (Age 23, Sex M, Zipcode 11000). The QIT places this tuple in group 1, and the ST shows that group 1 contains 2 dyspepsia and 2 pneumonia tuples, so the adversary can attribute either disease to Bob with confidence at most 2 / 4 = 1/2.
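The adversary's reasoning over a QIT and ST pair can be sketched like this. The helper name `disease_confidence` is hypothetical, and only group 1 of the example tables is included to keep the sketch short.

```python
from fractions import Fraction

# Group 1 of the example QIT: (age, sex, zipcode, group_id).
QIT = [(23, "M", 11000, 1), (27, "M", 13000, 1), (35, "M", 59000, 1), (59, "M", 12000, 1)]
# Group 1 of the example ST: (group_id, disease, count).
ST = [(1, "dyspepsia", 2), (1, "pneumonia", 2)]


def disease_confidence(qit, st, age, sex, zipcode):
    """Return the adversary's confidence in each disease for a person with the given QI values."""
    group_ids = {g for (a, s, z, g) in qit if (a, s, z) == (age, sex, zipcode)}
    if len(group_ids) != 1:
        raise ValueError("QI values must match exactly one group")
    (group_id,) = group_ids
    counts = {d: c for (g, d, c) in st if g == group_id}
    total = sum(counts.values())
    return {disease: Fraction(count, total) for disease, count in counts.items()}


bob = disease_confidence(QIT, ST, 23, "M", 11000)
worst_case = max(bob.values())  # bounded by 1/l for an l-diverse partition
```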
Accuracy of Data Analysis
• Query A: SELECT COUNT(*) FROM Unknown-Microdata
    WHERE Disease = 'pneumonia' AND Age in [0, 30]
    AND Zipcode in [10001, 20000]
• With anatomy, the query is answered from the quasi-identifier table (QIT) and the sensitive table (ST).
Accuracy of Data Analysis (cont.)
• According to the ST, 2 patients in group 1 have contracted pneumonia
• According to the QIT, 2 out of the 4 patients in group 1 satisfy the query conditions on Age and Zipcode
• Estimated answer for query A: 2 * 2 / 4 = 1, which is also the actual result from the original microdata

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1

[Figure: the four group 1 tuples t1 to t4 plotted as points on x (Age, 20 to 70) and y (Zipcode, 10k to 60k), together with the query region Q]
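The estimate for query A can be computed from the QIT and ST as sketched below. The function name `estimate_count` is an assumption; for each group it multiplies the disease count from the ST by the fraction of the group's QIT rows that satisfy the range conditions.

```python
from fractions import Fraction

# (age, sex, zipcode, group_id)
QIT = [
    (23, "M", 11000, 1), (27, "M", 13000, 1), (35, "M", 59000, 1), (59, "M", 12000, 1),
    (61, "F", 54000, 2), (65, "F", 25000, 2), (65, "F", 25000, 2), (70, "F", 30000, 2),
]
# (group_id, disease, count)
ST = [(1, "dyspepsia", 2), (1, "pneumonia", 2), (2, "bronchitis", 1), (2, "flu", 2), (2, "gastritis", 1)]


def estimate_count(qit, st, disease, age_range, zip_range):
    """Estimate COUNT(*) for a disease with closed-range conditions on Age and Zipcode."""
    total = Fraction(0)
    for group_id in sorted({g for (_a, _s, _z, g) in qit}):
        rows = [(a, z) for (a, _s, z, g) in qit if g == group_id]
        hits = sum(
            1 for a, z in rows
            if age_range[0] <= a <= age_range[1] and zip_range[0] <= z <= zip_range[1]
        )
        with_disease = sum(c for (g, d, c) in st if g == group_id and d == disease)
        total += Fraction(with_disease * hits, len(rows))
    return total


answer = estimate_count(QIT, ST, "pneumonia", (0, 30), (10001, 20000))
print(answer)  # 1
```

Group 2 contributes nothing because its ST rows contain no pneumonia, so the whole estimate comes from group 1: 2 * 2 / 4 = 1.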
Anatomy vs. Generalization Revisit
• Sometimes the adversary is not sure whether an individual appears in the microdata or not
• The adversary compares the 2-diverse generalized table with an external source such as a voter registration list:

A Voter Registration List

Name   Age  Sex  Zipcode
Bob    23   M    11000
Ken    27   M    13000
Peter  35   M    59000
Sam    59   M    12000
…      …    …    …
Ric    50   M    40000
Mark   40   M    30000
Anatomy vs. Generalization Revisit
• From the adversary's perspective:
  – Six people on the voter registration list (Bob, Ken, Peter, Sam, Ric and Mark) match Age [21, 60], Sex M and Zipcode [10001, 60000], but the QI-group holds only 4 tuples, so Bob has 4 / 6 probability of being in the microdata
  – If Bob does appear in the microdata, there is 2 / 4 probability that he has contracted pneumonia
  – So Bob has 4/6 * 2/4 = 1/3 probability of having contracted pneumonia

A 2-diverse generalized table

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
…         …    …               …
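The 1/3 figure can be checked with exact fractions. The sketch uses only the six voter rows that are visible on the slide and assumes the elided rows do not match the male QI-group, which is consistent with the 4 / 6 stated above; the variable names are illustrative.

```python
from fractions import Fraction

# Visible rows of the voter registration list: (name, age, sex, zipcode).
voters = [
    ("Bob", 23, "M", 11000), ("Ken", 27, "M", 13000), ("Peter", 35, "M", 59000),
    ("Sam", 59, "M", 12000), ("Ric", 50, "M", 40000), ("Mark", 40, "M", 30000),
]

# The generalized QI-group that Bob could belong to.
AGE, SEX, ZIPCODE = (21, 60), "M", (10001, 60000)
GROUP_SIZE = 4          # tuples in the QI-group
PNEUMONIA_IN_GROUP = 2  # of which this many have pneumonia

candidates = [
    name for (name, age, sex, zipcode) in voters
    if AGE[0] <= age <= AGE[1] and sex == SEX and ZIPCODE[0] <= zipcode <= ZIPCODE[1]
]

p_in_microdata = Fraction(GROUP_SIZE, len(candidates))               # 4 / 6
p_pneumonia_if_present = Fraction(PNEUMONIA_IN_GROUP, GROUP_SIZE)    # 2 / 4
p_pneumonia = p_in_microdata * p_pneumonia_if_present
print(p_pneumonia)  # 1/3
```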
Anatomy vs. Generalization Revisit
• With anatomy, the adversary knows that
  – Bob must appear in the microdata, because the QIT publishes his exact QI values (Age 23, Sex M, Zipcode 11000)
  – There is 1/2 probability that Bob has contracted pneumonia

2-diverse QIT

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
…    …    …        …

2-diverse ST

Group-ID  Disease    Count
1         dyspepsia  2
1         pneumonia  2
…         …          …
Anatomy vs. Generalization Revisit
• For a given value of l, l-diverse generalization may lead to higher privacy protection than l-diverse anatomy does.
• But this is not always the case, since:
  – the external database may not contain any irrelevant individuals
  – the adversary may know that some individuals indeed appear in the microdata