Thesis Sumathie Sundaresan Advisor: Dr. Huiping Guo.


Page 1: Thesis Sumathie Sundaresan Advisor: Dr. Huiping Guo.

Thesis

Sumathie Sundaresan

Advisor: Dr. Huiping Guo

Page 2: Thesis Sumathie Sundaresan Advisor: Dr. Huiping Guo.

My topic

• How to share medical records with third parties without compromising data privacy

Page 3: Thesis Sumathie Sundaresan Advisor: Dr. Huiping Guo.

Reported by AMRA

"...medical information is routinely shared with and viewed by third parties who are not involved in patient care .... The American Medical Records Association has identified twelve categories of information seekers outside of the health care industry who have access to health care files, including employers, government agencies, credit bureaus, insurers, educational institutions, and the media."

Page 4: Thesis Sumathie Sundaresan Advisor: Dr. Huiping Guo.

Privacy preserving data publishing

Microdata

Name   Age  Sex  Zipcode  Disease
Bob    23   M    11000    pneumonia
Ken    27   M    13000    dyspepsia
Peter  35   M    59000    dyspepsia
Sam    59   M    12000    pneumonia
Jane   61   F    54000    flu
Linda  65   F    25000    gastritis
Alice  65   F    25000    flu
Mandy  70   F    30000    bronchitis

• Purposes:
  – Allow researchers to effectively study the correlation between various attributes
  – Protect the privacy of every patient

Page 5: Thesis Sumathie Sundaresan Advisor: Dr. Huiping Guo.

A naïve solution

• It does not work. See next.

Microdata

Name   Age  Sex  Zipcode  Disease
Bob    23   M    11000    pneumonia
Ken    27   M    13000    dyspepsia
Peter  35   M    59000    dyspepsia
Sam    59   M    12000    pneumonia
Jane   61   F    54000    flu
Linda  65   F    25000    gastritis
Alice  65   F    25000    flu
Mandy  70   F    30000    bronchitis

publish (Name column removed)

Age  Sex  Zipcode  Disease
23   M    11000    pneumonia
27   M    13000    dyspepsia
35   M    59000    dyspepsia
59   M    12000    pneumonia
61   F    54000    flu
65   F    25000    gastritis
65   F    25000    flu
70   F    30000    bronchitis

Page 6: Thesis Sumathie Sundaresan Advisor: Dr. Huiping Guo.

Inference attack

• An adversary knows that Bob
  – has been hospitalized before
  – is 23 years old
  – lives in an area with zipcode 11000

Published table

Age  Sex  Zipcode  Disease
23   M    11000    pneumonia
27   M    13000    dyspepsia
35   M    59000    dyspepsia
59   M    12000    pneumonia
61   F    54000    flu
65   F    25000    gastritis
65   F    25000    flu
70   F    30000    bronchitis

Quasi-identifier (QI) attributes: Age, Sex, Zipcode
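
The attack on this slide can be sketched in a few lines of Python. This is only an illustration: the helper name `link` and the dictionary layout are assumptions made here, not part of the slides. It filters the published table by what the adversary knows about Bob and shows that a single row remains.

```python
# Published table from the slide: names removed, QI attributes kept.
published = [
    {"age": 23, "sex": "M", "zipcode": 11000, "disease": "pneumonia"},
    {"age": 27, "sex": "M", "zipcode": 13000, "disease": "dyspepsia"},
    {"age": 35, "sex": "M", "zipcode": 59000, "disease": "dyspepsia"},
    {"age": 59, "sex": "M", "zipcode": 12000, "disease": "pneumonia"},
    {"age": 61, "sex": "F", "zipcode": 54000, "disease": "flu"},
    {"age": 65, "sex": "F", "zipcode": 25000, "disease": "gastritis"},
    {"age": 65, "sex": "F", "zipcode": 25000, "disease": "flu"},
    {"age": 70, "sex": "F", "zipcode": 30000, "disease": "bronchitis"},
]


def link(rows, known):
    """Return the rows that agree with every attribute the adversary knows.

    Hypothetical helper for illustration only.
    """
    return [r for r in rows if all(r[k] == v for k, v in known.items())]


# What the adversary knows about Bob (he is assumed to be in the table
# because he has been hospitalized before).
bob_knowledge = {"age": 23, "zipcode": 11000}

matches = link(published, bob_knowledge)
candidate_diseases = {r["disease"] for r in matches}
print(matches, candidate_diseases)
```

Only one row matches, so removing the names alone does not hide Bob's disease.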

Page 7: Thesis Sumathie Sundaresan Advisor: Dr. Huiping Guo.

Background

• Generalization
• Anatomy

Page 8: Thesis Sumathie Sundaresan Advisor: Dr. Huiping Guo.

Generalization

A generalized table

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis

Name  Age  Sex  Zipcode
Bob   23   M    11000

• Transform each QI value into a less specific form

How much generalization do we need?

Page 9: Thesis Sumathie Sundaresan Advisor: Dr. Huiping Guo.

l-diversity

• A QI-group with m tuples is l-diverse, iff each sensitive value appears no more than m / l times in the QI-group.

• A table is l-diverse, iff all of its QI-groups are l-diverse.

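• The generalized table on this slide is 2-diverse: each of its 2 QI-groups has m = 4 tuples, and no disease appears more than m / 2 = 2 times in a group.

2 QI-groups

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis

Quasi-identifier (QI) attributes: Age, Sex, Zipcode
Sensitive attribute: Disease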
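
The l-diversity definition on this slide can be checked mechanically. The sketch below is a minimal illustration, not code from the thesis: the function name `is_l_diverse` and the tuple layout (QI values first, sensitive value last) are assumptions made here.

```python
from collections import Counter, defaultdict

# Generalized table from the slide: (Age, Sex, Zipcode, Disease).
generalized = [
    ("[21, 60]", "M", "[10001, 60000]", "pneumonia"),
    ("[21, 60]", "M", "[10001, 60000]", "dyspepsia"),
    ("[21, 60]", "M", "[10001, 60000]", "dyspepsia"),
    ("[21, 60]", "M", "[10001, 60000]", "pneumonia"),
    ("[61, 70]", "F", "[10001, 60000]", "flu"),
    ("[61, 70]", "F", "[10001, 60000]", "gastritis"),
    ("[61, 70]", "F", "[10001, 60000]", "flu"),
    ("[61, 70]", "F", "[10001, 60000]", "bronchitis"),
]


def is_l_diverse(rows, l):
    """True iff, in every QI-group of m tuples, each sensitive value
    appears no more than m / l times (the definition on this slide)."""
    groups = defaultdict(list)
    for *qi, sensitive in rows:
        groups[tuple(qi)].append(sensitive)
    for values in groups.values():
        m = len(values)
        most_frequent = max(Counter(values).values())
        if most_frequent * l > m:  # same test as most_frequent > m / l
            return False
    return True


print(is_l_diverse(generalized, 2), is_l_diverse(generalized, 3))
```

The table passes for l = 2 but fails for l = 3, because pneumonia, dyspepsia and flu each appear twice in a group of four.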

Page 10: Thesis Sumathie Sundaresan Advisor: Dr. Huiping Guo.

What l-diversity guarantees

• From an l-diverse generalized table, an adversary (without any prior knowledge) can infer the sensitive value of each individual with confidence at most 1/l

A 2-diverse generalized table

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis

Name  Age  Sex  Zipcode
Bob   23   M    11000

A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006

Page 11: Thesis Sumathie Sundaresan Advisor: Dr. Huiping Guo.

Defect of generalization

• Query A: SELECT COUNT(*) FROM Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age in [0, 30]
  AND Zipcode in [10001, 20000]

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis

• Estimated answer: 2 * p, where p is the probability that each of the two pneumonia tuples satisfies the query conditions on Age and Zipcode

Page 12: Thesis Sumathie Sundaresan Advisor: Dr. Huiping Guo.

Defect of generalization (cont.)

• Query A: SELECT COUNT(*) FROM Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age in [0, 30]
  AND Zipcode in [10001, 20000]

• p = Area( R1 ∩ Q ) / Area( R1 ) = 0.05

• Estimated answer for query A: 2 * p = 0.1

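Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  pneumonia

[Figure: rectangles Q and R1 plotted on Age (20 to 70) and Zipcode (10k to 60k). R1 is the region covered by the generalized tuples, Age [21, 60] and Zipcode [10001, 60000]; Q is the query region, Age [0, 30] and Zipcode [10001, 20000].]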
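
The value p = 0.05 can be reproduced with a short calculation. The sketch below rests on assumptions that the slide does not state: Age and Zipcode take integer values, they are uniformly distributed inside R1, and they are independent of each other. The helper names are hypothetical.

```python
from fractions import Fraction


def overlap(lo1, hi1, lo2, hi2):
    """Number of integers shared by the closed ranges [lo1, hi1] and [lo2, hi2]."""
    return max(0, min(hi1, hi2) - max(lo1, lo2) + 1)


def fraction_inside(group_range, query_range):
    """Share of the group's range that falls inside the query's range."""
    g_lo, g_hi = group_range
    q_lo, q_hi = query_range
    return Fraction(overlap(g_lo, g_hi, q_lo, q_hi), g_hi - g_lo + 1)


# R1: region of the generalized pneumonia tuples; Q: region of query A.
r1 = {"age": (21, 60), "zipcode": (10001, 60000)}
q = {"age": (0, 30), "zipcode": (10001, 20000)}

# p = Area(R1 ∩ Q) / Area(R1), computed one dimension at a time.
p = fraction_inside(r1["age"], q["age"]) * fraction_inside(r1["zipcode"], q["zipcode"])

# Two pneumonia tuples, each satisfying the query with probability p.
estimate = 2 * p
print(p, estimate)
```

Under these assumptions the Age ranges overlap on 10 of 40 values and the Zipcode ranges on 10000 of 50000 values, which gives p = 1/20 and an estimate of 1/10.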

Page 13: Thesis Sumathie Sundaresan Advisor: Dr. Huiping Guo.

Defect of generalization (cont.)

• Query A: SELECT COUNT(*) FROM Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age in [0, 30]
  AND Zipcode in [10001, 20000]

• Estimated answer from the generalized table: 0.1

Name   Age  Sex  Zipcode  Disease
Bob    23   M    11000    pneumonia
Ken    27   M    13000    dyspepsia
Peter  35   M    59000    dyspepsia
Sam    59   M    12000    pneumonia
Jane   61   F    54000    flu
Linda  65   F    25000    gastritis
Alice  65   F    25000    flu
Mandy  70   F    30000    bronchitis

• The exact answer should be 1: Bob is the only pneumonia patient whose Age and Zipcode fall inside the query ranges.

Page 14: Thesis Sumathie Sundaresan Advisor: Dr. Huiping Guo.

Basic Idea of Anatomy

• For a given microdata table, Anatomy releases a quasi-identifier table (QIT) and a sensitive table (ST)

microdata

Age  Sex  Zipcode  Disease
23   M    11000    pneumonia
27   M    13000    dyspepsia
35   M    59000    dyspepsia
59   M    12000    pneumonia
61   F    54000    flu
65   F    25000    gastritis
65   F    25000    flu
70   F    30000    bronchitis

Quasi-identifier Table (QIT)

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

Sensitive Table (ST)

Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1

Page 15: Thesis Sumathie Sundaresan Advisor: Dr. Huiping Guo.

Basic Idea of Anatomy (cont.)

1. Select a partition of the tuples

a 2-diverse partition

Age  Sex  Zipcode  Disease

QI group 1
23   M    11000    pneumonia
27   M    13000    dyspepsia
35   M    59000    dyspepsia
59   M    12000    pneumonia

QI group 2
61   F    54000    flu
65   F    25000    gastritis
65   F    25000    flu
70   F    30000    bronchitis

Page 16: Thesis Sumathie Sundaresan Advisor: Dr. Huiping Guo.

Basic Idea of Anatomy (cont.)

2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition

quasi-identifier table (QIT)

Age  Sex  Zipcode

group 1
23   M    11000
27   M    13000
35   M    59000
59   M    12000

group 2
61   F    54000
65   F    25000
65   F    25000
70   F    30000

sensitive table (ST)

Disease

group 1
pneumonia
dyspepsia
dyspepsia
pneumonia

group 2
flu
gastritis
flu
bronchitis

Page 17: Thesis Sumathie Sundaresan Advisor: Dr. Huiping Guo.

Basic Idea of Anatomy (cont.)

2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition

quasi-identifier table (QIT)

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

sensitive table (ST)

Group-ID  Disease
1         pneumonia
1         dyspepsia
1         dyspepsia
1         pneumonia
2         flu
2         gastritis
2         flu
2         bronchitis

Page 18: Thesis Sumathie Sundaresan Advisor: Dr. Huiping Guo.

Basic Idea of Anatomy (cont.)

2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition

quasi-identifier table (QIT)

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

sensitive table (ST)

Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1
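
Step 2 can be sketched as follows. This is a minimal illustration that takes the partition from step 1 as given; it does not show how Anatomy chooses the partition. The function name `anatomize` and the data layout are assumptions made here.

```python
from collections import Counter

# Microdata without names: (Age, Sex, Zipcode, Disease).
microdata = [
    (23, "M", 11000, "pneumonia"),
    (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"),
    (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"),
    (65, "F", 25000, "gastritis"),
    (65, "F", 25000, "flu"),
    (70, "F", 30000, "bronchitis"),
]

# The 2-diverse partition from step 1: Group-ID -> row indexes.
partition = {1: [0, 1, 2, 3], 2: [4, 5, 6, 7]}


def anatomize(rows, groups):
    """Split the rows into a QIT and an ST that share only the Group-ID."""
    qit, st = [], []
    for gid in sorted(groups):
        members = [rows[i] for i in groups[gid]]
        # QIT: exact QI values plus the Group-ID, no sensitive value.
        for age, sex, zipcode, _disease in members:
            qit.append((age, sex, zipcode, gid))
        # ST: one row per distinct disease in the group, with its count.
        counts = Counter(disease for *_qi, disease in members)
        for disease in sorted(counts):
            st.append((gid, disease, counts[disease]))
    return qit, st


qit, st = anatomize(microdata, partition)
for row in st:
    print(row)
```

The QIT keeps every QI value exactly as it is in the microdata, while the ST only reports how often each disease occurs in a group.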

Page 19: Thesis Sumathie Sundaresan Advisor: Dr. Huiping Guo.

Privacy Preservation

• From a pair of QIT and ST generated from an l-diverse partition, the adversary can infer the sensitive value of each individual with confidence at most 1/l

quasi-identifier table (QIT)

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

sensitive table (ST)

Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1

Name  Age  Sex  Zipcode
Bob   23   M    11000

Page 20: Thesis Sumathie Sundaresan Advisor: Dr. Huiping Guo.

Accuracy of Data Analysis

• Query A: SELECT COUNT(*) FROM Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age in [0, 30]
  AND Zipcode in [10001, 20000]

quasi-identifier table (QIT)

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

sensitive table (ST)

Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1

Page 21: Thesis Sumathie Sundaresan Advisor: Dr. Huiping Guo.

Accuracy of Data Analysis (cont.)

• Query A: SELECT COUNT(*) FROM Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age in [0, 30]
  AND Zipcode in [10001, 20000]

• 2 patients in group 1 have contracted pneumonia

• 2 out of the 4 patients in group 1 satisfy the query condition on Age and Zipcode

• Estimated answer for query A: 2 * 2 / 4 = 1, which is also the actual result from the original microdata

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1

[Figure: the four tuples t1, t2, t3, t4 of group 1 plotted as points on x (Age, 20 to 70) and y (Zipcode, 10k to 60k), together with the query region Q.]
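
The estimate on this slide can be sketched as a small function over the QIT and ST. This is an illustration only: the function name `estimate_count` and its parameters are assumptions made here, and it assumes that within a group each disease is equally likely to belong to any tuple.

```python
from fractions import Fraction

# QIT: (Age, Sex, Zipcode, Group-ID); ST: (Group-ID, Disease, Count).
qit = [
    (23, "M", 11000, 1), (27, "M", 13000, 1),
    (35, "M", 59000, 1), (59, "M", 12000, 1),
    (61, "F", 54000, 2), (65, "F", 25000, 2),
    (65, "F", 25000, 2), (70, "F", 30000, 2),
]
st = [
    (1, "dyspepsia", 2), (1, "pneumonia", 2),
    (2, "bronchitis", 1), (2, "flu", 2), (2, "gastritis", 1),
]


def estimate_count(qit_rows, st_rows, disease, age_range, zip_range):
    """Estimate COUNT(*) for one disease and closed ranges on Age and Zipcode."""
    total = Fraction(0)
    for gid, st_disease, count in st_rows:
        if st_disease != disease:
            continue
        # Exact QI values of the group, read from the QIT.
        group = [(age, zipcode) for age, _sex, zipcode, g in qit_rows if g == gid]
        hits = sum(
            1
            for age, zipcode in group
            if age_range[0] <= age <= age_range[1]
            and zip_range[0] <= zipcode <= zip_range[1]
        )
        # count patients with the disease, each matching with probability hits / size.
        total += count * Fraction(hits, len(group))
    return total


answer = estimate_count(qit, st, "pneumonia", (0, 30), (10001, 20000))
print(answer)
```

For query A only group 1 contains pneumonia, so the sum has a single term, 2 * 2 / 4.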

Page 22: Thesis Sumathie Sundaresan Advisor: Dr. Huiping Guo.

Anatomy vs. Generalization Revisit

• Sometimes the adversary is not sure whether an individual appears in the microdata or not

A 2-diverse generalized table

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis

A Voter Registration List

Name   Age  Sex  Zipcode
Bob    23   M    11000
Ken    27   M    13000
Peter  35   M    59000
Sam    59   M    12000
…      …    …    …
Ric    50   M    40000
Mark   40   M    30000

Page 23: Thesis Sumathie Sundaresan Advisor: Dr. Huiping Guo.

Anatomy vs. Generalization Revisit

• From the adversary's perspective:
  – Bob has 4 / 6 probability of being in the microdata: the QI-group has 4 tuples, and the voter registration list shows 6 people who match its generalized QI values
  – If Bob indeed appears in the microdata, there is 2 / 4 probability that he has contracted pneumonia
  – So Bob has 4/6 * 2/4 = 1/3 probability of having contracted pneumonia

A 2-diverse generalized table

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
…         …    …               …

A Voter Registration List

Name   Age  Sex  Zipcode
Bob    23   M    11000
Ken    27   M    13000
Peter  35   M    59000
Sam    59   M    12000
…      …    …    …
Ric    50   M    40000
Mark   40   M    30000
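
The adversary's reasoning on this slide can be written out as a short calculation. The sketch assumes that the elided rows of the voter registration list do not match the QI-group, which is consistent with the 4 / 6 figure on the slide; the variable names are hypothetical.

```python
from fractions import Fraction

# Named rows of the voter registration list: (Name, Age, Sex, Zipcode).
# Assumption: the elided rows do not fall inside the QI-group below.
voters = [
    ("Bob", 23, "M", 11000),
    ("Ken", 27, "M", 13000),
    ("Peter", 35, "M", 59000),
    ("Sam", 59, "M", 12000),
    ("Ric", 50, "M", 40000),
    ("Mark", 40, "M", 30000),
]

# Bob's QI-group in the 2-diverse generalized table.
group_age = (21, 60)
group_sex = "M"
group_zip = (10001, 60000)
group_diseases = ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia"]

# Voters whose QI values are covered by the generalized group.
candidates = [
    name
    for name, age, sex, zipcode in voters
    if group_age[0] <= age <= group_age[1]
    and sex == group_sex
    and group_zip[0] <= zipcode <= group_zip[1]
]

# 4 tuples in the group, 6 candidates who could own them.
p_in_microdata = Fraction(len(group_diseases), len(candidates))
# 2 of the 4 tuples in the group carry pneumonia.
p_pneumonia_if_present = Fraction(group_diseases.count("pneumonia"), len(group_diseases))
p_pneumonia = p_in_microdata * p_pneumonia_if_present
print(p_in_microdata, p_pneumonia_if_present, p_pneumonia)
```

The product is 1/3, lower than the 1/2 bound that 2-diversity alone guarantees, because the adversary cannot tell which 4 of the 6 candidates are in the microdata.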

Page 24: Thesis Sumathie Sundaresan Advisor: Dr. Huiping Guo.

Anatomy vs. Generalization Revisit

• The adversary knows that
  – Bob must appear in the microdata
  – There is 1/2 probability that Bob has contracted pneumonia

2-diverse QIT

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
…    …    …        …

2-diverse ST

Group-ID  Disease    Count
1         dyspepsia  2
1         pneumonia  2
…         …          …

Name   Age  Sex  Zipcode
Bob    23   M    11000
Ken    27   M    13000
Peter  35   M    59000
Sam    59   M    12000
…      …    …    …
Ric    50   M    40000
Mark   40   M    30000

Page 25: Thesis Sumathie Sundaresan Advisor: Dr. Huiping Guo.

Anatomy vs. Generalization Revisit

• For a given value of l, l-diverse generalization may lead to higher privacy protection than l-diverse anatomy does.

• But this is not always the case, since:
  – the external database may not contain any irrelevant individuals
  – the adversary may know that some individuals indeed appear in the microdata

Name   Age  Sex  Zipcode
Bob    23   M    11000
Ken    27   M    13000
Peter  35   M    59000
Sam    59   M    12000
…      …    …    …
Ric    50   M    40000
Mark   40   M    30000