Survey of Privacy Protection for Medical Data

Sumathie Sundaresan
Advisor: Dr. Huiping Guo


Abstract

Expanded scientific knowledge, combined with the development of the Internet and the widespread use of computers, has increased the need for strong privacy protection for medical records. We have all heard stories of harassment that resulted from inadequate privacy protection of medical records.

"...medical information is routinely shared with and viewed by third parties who are not involved in patient care .... The American Medical Records Association has identified twelve categories of information seekers outside of the health care industry who have access to health care files, including employers, government agencies, credit bureaus, insurers, educational institutions, and the media."

Methods

• Generalization
• k-anonymity
• l-diversity
• t-closeness
• m-invariance
• Personalized Privacy Preservation
• Anatomy

Privacy preserving data publishing

Microdata

Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
Alice  22   14000    bronchitis
Andy   24   18000    flu
David  23   25000    gastritis
Gary   41   20000    flu
Helen  36   27000    gastritis
Jane   37   33000    dyspepsia
Ken    40   35000    flu
Linda  43   26000    gastritis
Paul   52   33000    dyspepsia
Steve  56   34000    gastritis

Classification of Attributes

Key Attribute: Name, Address, Cell Phone
• Can uniquely identify an individual directly
• Always removed before release

Quasi-Identifier: 5-digit ZIP code, birth date, gender
• A set of attributes that can potentially be linked with external information to re-identify entities
• 87% of the U.S. population can be uniquely identified from these attributes, according to 1990 Census summary data
• Suppressed or generalized

Classification of Attributes (Cont'd)

Sensitive Attribute: medical record, wage, etc.
• Always released directly; these attributes are what the researchers need
• What is treated as sensitive depends on the requirement

Inference attack

Published table (Age and Zipcode are the quasi-identifier (QI) attributes)

Age  Zipcode  Disease
21   12000    dyspepsia
22   14000    bronchitis
24   18000    flu
23   25000    gastritis
41   20000    flu
36   27000    gastritis
37   33000    dyspepsia
40   35000    flu
43   26000    gastritis
52   33000    dyspepsia
56   34000    gastritis

An adversary who knows

Name  Age  Zipcode
Bob   21   12000

can match Bob to the first tuple and learn that he has dyspepsia.

Generalization

Transform each QI value into a less specific form. Generalizing the published table above gives the following generalized table:

Age       Zipcode     Disease
[21, 22]  [12k, 14k]  dyspepsia
[21, 22]  [12k, 14k]  bronchitis
[23, 24]  [18k, 25k]  flu
[23, 24]  [18k, 25k]  gastritis
[36, 41]  [20k, 27k]  flu
[36, 41]  [20k, 27k]  gastritis
[37, 43]  [26k, 35k]  dyspepsia
[37, 43]  [26k, 35k]  flu
[37, 43]  [26k, 35k]  gastritis
[52, 56]  [33k, 34k]  dyspepsia
[52, 56]  [33k, 34k]  gastritis

An adversary who knows Bob's QI values (Age 21, Zipcode 12000) can now only tell that Bob's tuple is one of the two tuples in the first group.
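The generalization step can be sketched in Python as follows. This is only a minimal illustration, assuming the tuples have already been partitioned into QI groups; the function name `generalize` and the choice of [min, max] ranges are assumptions for this sketch, not a method stated in the slides.

```python
from typing import List, Tuple

Row = Tuple[int, int, str]  # (age, zipcode, disease)

def generalize(group: List[Row]):
    """Replace each QI value in a group by the [min, max] range of that group."""
    ages = [row[0] for row in group]
    zips = [row[1] for row in group]
    age_range = (min(ages), max(ages))
    zip_range = (min(zips), max(zips))
    # The sensitive value (disease) is released unchanged.
    return [(age_range, zip_range, disease) for _, _, disease in group]

# First two groups of the published table (hypothetical grouping input).
groups = [
    [(21, 12000, "dyspepsia"), (22, 14000, "bronchitis")],
    [(24, 18000, "flu"), (23, 25000, "gastritis")],
]
generalized = [row for group in groups for row in generalize(group)]
for row in generalized:
    print(row)
```

How the groups are chosen is the hard part of generalization and is deliberately left out of this sketch.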

K-Anonymity

Sweeney proposed a formal protection model named k-anonymity.

What is k-anonymity?

A release provides k-anonymity if the information for each person contained in the release cannot be distinguished from that of at least k-1 other individuals whose information also appears in the release.

Example: suppose you try to identify a man in a release, but the only information you have is his birth date and gender. If at least k people in the release match those values, the release is k-anonymous with respect to these attributes.
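The definition can be checked mechanically by counting how often each combination of QI values occurs. The sketch below is a minimal illustration; the helper name `is_k_anonymous` and the tuple layout are assumptions made for this example.

```python
from collections import Counter

def is_k_anonymous(rows, qi_indices, k):
    """True if every combination of QI values occurs in at least k rows."""
    counts = Counter(tuple(row[i] for i in qi_indices) for row in rows)
    return all(count >= k for count in counts.values())

# First four rows of the generalized table: (age range, zipcode range, disease).
release = [
    ("[21, 22]", "[12k, 14k]", "dyspepsia"),
    ("[21, 22]", "[12k, 14k]", "bronchitis"),
    ("[23, 24]", "[18k, 25k]", "flu"),
    ("[23, 24]", "[18k, 25k]", "gastritis"),
]
two_anonymous = is_k_anonymous(release, qi_indices=(0, 1), k=2)
three_anonymous = is_k_anonymous(release, qi_indices=(0, 1), k=3)
print(two_anonymous, three_anonymous)
```

Each QI combination appears exactly twice here, so the excerpt satisfies the check for k = 2 but not for k = 3.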

Attacks Against K-Anonymity

Unsorted Matching Attack
• This attack is based on the order in which tuples appear in the released table.
• Solution: randomly sort the tuples before releasing.
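A minimal sketch of this countermeasure, assuming the release is held as a Python list; the function name `shuffle_before_release` is hypothetical, and `random.Random` is a general-purpose generator rather than a cryptographic one.

```python
import random

def shuffle_before_release(rows, seed=None):
    """Return a copy of rows in random order so that position leaks nothing."""
    rng = random.Random(seed)
    shuffled = list(rows)  # copy, so the original order is not modified
    rng.shuffle(shuffled)
    return shuffled

release = [
    ("[21, 22]", "[12k, 14k]", "dyspepsia"),
    ("[21, 22]", "[12k, 14k]", "bronchitis"),
    ("[23, 24]", "[18k, 25k]", "flu"),
    ("[23, 24]", "[18k, 25k]", "gastritis"),
]
shuffled = shuffle_before_release(release)
print(shuffled)
```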

Attacks Against K-Anonymity (Cont'd)

A 3-anonymous patient table

Zipcode  Age   Disease
476**    2*    Heart Disease
476**    2*    Heart Disease
476**    2*    Heart Disease
4790*    ≥ 40  Flu
4790*    ≥ 40  Heart Disease
4790*    ≥ 40  Cancer
476**    3*    Heart Disease
476**    3*    Cancer
476**    3*    Cancer

k-Anonymity does not provide privacy if:
• Sensitive values in an equivalence class lack diversity (Homogeneity Attack). Bob (Zipcode 47678, Age 27) falls in the first class, where every tuple has Heart Disease.
• The attacker has background knowledge (Background Knowledge Attack). Carl (Zipcode 47673, Age 36) falls in the third class, where background knowledge can rule out one of the two diseases.

A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006

l-Diversity

Distinct l-diversity
• Each equivalence class has at least l well-represented (here: distinct) sensitive values.
• Limitation: it does not bound how often a single value may occur.

Example: one equivalence class contains ten tuples. In the "Disease" column, one of them is "Cancer", one is "Heart Disease" and the remaining eight are "Flu". This satisfies 3-diversity, but the attacker can still conclude that the target person's disease is "Flu" with 80% confidence.
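The example can be reproduced with a few lines of Python. This is a sketch only; the helper names `distinct_l` and `max_confidence` are made up for illustration. With eight of the ten tuples being "Flu", the attacker's best guess is right with probability 8/10.

```python
from collections import Counter

def distinct_l(values):
    """Number of distinct sensitive values in an equivalence class."""
    return len(set(values))

def max_confidence(values):
    """Share of the most frequent sensitive value in the class."""
    counts = Counter(values)
    return max(counts.values()) / len(values)

eq_class = ["Cancer", "Heart Disease"] + ["Flu"] * 8
print(distinct_l(eq_class))      # distinct sensitive values in the class
print(max_confidence(eq_class))  # attacker's confidence when guessing "Flu"
```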


l-Diversity (Cont'd)

Entropy l-diversity
• Each equivalence class must not only have enough different sensitive values; those values must also be distributed evenly enough.
• Sometimes this may be too restrictive. When some values are very common, the entropy of the entire table may be very low. This leads to a less conservative notion of l-diversity.

Recursive (c,l)-diversity
• The most frequent value does not appear too frequently.
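Both notions can be made concrete with a short sketch. The slides do not spell out the formulas, so the code below assumes the usual definitions from the l-diversity paper: a class is entropy l-diverse if the entropy of its sensitive values is at least log(l), and recursive (c,l)-diverse if r1 < c * (r_l + ... + r_m), where r1 ≥ r2 ≥ ... ≥ r_m are the value counts in descending order. Function names are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (natural log) of the sensitive values in one class."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def is_entropy_l_diverse(values, l):
    return entropy(values) >= math.log(l)

def is_recursive_cl_diverse(values, c, l):
    """r1 < c * (r_l + ... + r_m), with counts sorted in descending order."""
    r = sorted(Counter(values).values(), reverse=True)
    if len(r) < l:
        return False
    return r[0] < c * sum(r[l - 1:])

skewed = ["Cancer", "Heart Disease"] + ["Flu"] * 8  # class from the example above
even = ["Cancer", "Heart Disease", "Flu"] * 3

print(is_entropy_l_diverse(skewed, 2), is_entropy_l_diverse(even, 2))
print(is_recursive_cl_diverse(skewed, c=3, l=2), is_recursive_cl_diverse(even, c=2, l=2))
```

The skewed class has three distinct values but fails both checks, because "Flu" dominates it.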


Limitations of l-Diversity

l-diversity may be difficult and unnecessary to achieve.

• A single sensitive attribute
   • Two values: HIV positive (1%) and HIV negative (99%)
   • Very different degrees of sensitivity
• l-diversity is unnecessary to achieve
   • 2-diversity is unnecessary for an equivalence class that contains only negative records
• l-diversity is difficult to achieve
   • Suppose there are 10000 records in total
   • To have distinct 2-diversity, there can be at most 10000 * 1% = 100 equivalence classes

Limitations of l-Diversity (Cont'd)

l-diversity is insufficient to prevent attribute disclosure.

Skewness Attack

• Two sensitive values
   • HIV positive (1%) and HIV negative (99%)
• Serious privacy risk
   • Consider an equivalence class that contains an equal number of positive records and negative records
• l-diversity does not differentiate:
   • Equivalence class 1: 49 positive + 1 negative
   • Equivalence class 2: 1 positive + 49 negative

l-diversity does not consider the overall distribution of sensitive values.

Limitations of l-Diversity (Cont'd)

l-diversity is insufficient to prevent attribute disclosure.

Similarity Attack

A 3-diverse patient table

Zipcode  Age   Salary  Disease
476**    2*    3K      Gastric Ulcer
476**    2*    4K      Gastritis
476**    2*    5K      Stomach Cancer
4790*    ≥ 40  6K      Gastritis
4790*    ≥ 40  11K     Flu
4790*    ≥ 40  8K      Bronchitis
476**    3*    7K      Bronchitis
476**    3*    9K      Pneumonia
476**    3*    10K     Stomach Cancer

An adversary who knows Bob's Zipcode (47678) and Age (27) can conclude:
1. Bob's salary is in [3k,5k], which is relatively low.
2. Bob has some stomach-related disease.

l-diversity does not consider the semantic meanings of sensitive values.

t-Closeness: A New Privacy Measure

Rationale

A completely generalized table

Age  Zipcode  ……  Gender  Disease
*    *        ……  *       Flu
*    *        ……  *       Heart Disease
*    *        ……  *       Cancer
……
*    *        ……  *       Gastritis

• With external knowledge only, the adversary holds belief B0.
• The completely generalized table reveals the overall distribution Q of sensitive values, which changes the belief to B1.


Rationale

A released table

Age   Zipcode  ……  Gender  Disease
2*    479**    ……  Male    Flu
2*    479**    ……  Male    Heart Disease
2*    479**    ……  Male    Cancer
……
≥ 50  4766*    ……  *       Gastritis

• The released table reveals the distribution Pi of sensitive values in each equi-class, which changes the belief from B1 to B2.


Rationale

• Observations
   • Q should be public
   • Knowledge gain comes in two parts:
      • about the whole population (from B0 to B1)
      • about specific individuals (from B1 to B2)
   • We bound the knowledge gain between B1 and B2 instead
• Principle
   • The distance between Q and Pi should be bounded by a threshold t.

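How to calculate EMD

EMD for numerical attributes
• Ordered distance is a metric: it is non-negative and symmetric, and it satisfies the triangle inequality.
• Let r_i = p_i - q_i for i = 1, …, m, where m is the number of values in the ordered domain. Then D[P,Q] is calculated as:

  D[P,Q] = \frac{1}{m-1} \sum_{i=1}^{m} \left| \sum_{j=1}^{i} r_j \right|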

Earth Mover's Distance

Example
• P1 = {3k,4k,5k} and Q = {3k,4k,5k,6k,7k,8k,9k,10k,11k}
• Move 1/9 probability for each of the following pairs:
   • 3k->5k, 3k->4k: cost 1/9*(2+1)/8
   • 4k->8k, 4k->7k, 4k->6k: cost 1/9*(4+3+2)/8
   • 5k->11k, 5k->10k, 5k->9k: cost 1/9*(6+5+4)/8
• Total cost: 1/9*27/8 = 0.375
• With P2 = {6k,8k,11k}, the total cost is 0.167 < 0.375. This makes more sense than the other two distance calculation methods.
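The two totals can be re-derived with a short Python sketch. It assumes the ordered-distance form of EMD, D[P,Q] = (1/(m-1)) * Σ_i |r_1 + … + r_i| with r_i = p_i - q_i, and treats each set as a uniform distribution over its salaries; the function names are illustrative. Exact fractions avoid rounding noise.

```python
from fractions import Fraction

def ordered_emd(p, q):
    """EMD under ordered distance: (1/(m-1)) * sum over i of |r_1 + ... + r_i|."""
    m = len(p)
    total, running = Fraction(0), Fraction(0)
    for pi, qi in zip(p, q):
        running += pi - qi      # cumulative sum r_1 + ... + r_i
        total += abs(running)
    return total / (m - 1)

def uniform_over(values, domain):
    """Uniform distribution on `values`, expressed over the ordered `domain`."""
    return [Fraction(1, len(values)) if v in values else Fraction(0) for v in domain]

domain = [3, 4, 5, 6, 7, 8, 9, 10, 11]  # salaries in thousands
q = uniform_over(domain, domain)
p1 = uniform_over([3, 4, 5], domain)
p2 = uniform_over([6, 8, 11], domain)

d1 = ordered_emd(p1, q)
d2 = ordered_emd(p2, q)
print(float(d1), float(d2))
```

Here `d1` equals 3/8 and `d2` equals 1/6, matching the 0.375 and 0.167 quoted in the example.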

Motivating Example

A hospital keeps track of the medical records collected in the last three months. The microdata table T(1) and its generalization T*(1) were published in Apr. 2007.

Microdata T(1)

Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
Alice  22   14000    bronchitis
Andy   24   18000    flu
David  23   25000    gastritis
Gary   41   20000    flu
Helen  36   27000    gastritis
Jane   37   33000    dyspepsia
Ken    40   35000    flu
Linda  43   26000    gastritis
Paul   52   33000    dyspepsia
Steve  56   34000    gastritis

2-diverse Generalization T*(1)

G. ID  Age       Zipcode     Disease
1      [21, 22]  [12k, 14k]  dyspepsia
1      [21, 22]  [12k, 14k]  bronchitis
2      [23, 24]  [18k, 25k]  flu
2      [23, 24]  [18k, 25k]  gastritis
3      [36, 41]  [20k, 27k]  flu
3      [36, 41]  [20k, 27k]  gastritis
4      [37, 43]  [26k, 35k]  dyspepsia
4      [37, 43]  [26k, 35k]  flu
4      [37, 43]  [26k, 35k]  gastritis
5      [52, 56]  [33k, 34k]  dyspepsia
5      [52, 56]  [33k, 34k]  gastritis

Motivating Example

Bob was hospitalized in Mar. 2007, so his tuple is in T*(1). An adversary who knows his age (21) and Zipcode (12000) can place him in group 1 of T*(1).


Motivating Example

One month later, in May 2007, the microdata has changed:
• Some obsolete tuples are deleted from the microdata.
• Bob's tuple stays.
• Some new records are inserted.

Microdata T(2)

Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Emily  25   21000    flu
Jane   37   33000    dyspepsia
Linda  43   26000    gastritis
Gary   41   20000    flu
Mary   46   30000    gastritis
Ray    54   31000    dyspepsia
Steve  56   34000    gastritis
Tom    60   44000    gastritis
Vince  65   36000    flu

Motivating Example

The hospital published T*(2), a generalization of the microdata T(2).

G. ID  Age       Zipcode     Disease
1      [21, 23]  [12k, 25k]  dyspepsia
1      [21, 23]  [12k, 25k]  gastritis
2      [25, 43]  [21k, 33k]  flu
2      [25, 43]  [21k, 33k]  dyspepsia
2      [25, 43]  [21k, 33k]  gastritis
3      [41, 46]  [20k, 30k]  flu
3      [41, 46]  [20k, 30k]  gastritis
4      [54, 56]  [31k, 34k]  dyspepsia
4      [54, 56]  [31k, 34k]  gastritis
5      [60, 65]  [36k, 44k]  gastritis
5      [60, 65]  [36k, 44k]  flu


Motivating Example

Consider the previous adversary, who now also sees T*(2).


Motivating Example

What the adversary learns from T*(1):

G. ID  Age       Zipcode     Disease
1      [21, 22]  [12k, 14k]  dyspepsia
1      [21, 22]  [12k, 14k]  bronchitis
……

What the adversary learns from T*(2):

G. ID  Age       Zipcode     Disease
1      [21, 23]  [12k, 25k]  dyspepsia
1      [21, 23]  [12k, 25k]  gastritis
……

Dyspepsia is the only disease that appears in both of Bob's groups, so Bob must have contracted dyspepsia!

The critical absence phenomenon

Bronchitis is absent from Bob's group in T*(2). We refer to this as the critical absence phenomenon. A new generalization principle is needed.





Counterfeited generalization T*(2)

The microdata T(2) can instead be published with counterfeit tuples (c1 and c2 below):

Name   Group-ID  Age       Zipcode     Disease
Bob    1         [21, 22]  [12k, 14k]  dyspepsia
c1     1         [21, 22]  [12k, 14k]  bronchitis
David  2         [23, 25]  [21k, 25k]  gastritis
Emily  2         [23, 25]  [21k, 25k]  flu
Jane   3         [37, 43]  [26k, 33k]  dyspepsia
c2     3         [37, 43]  [26k, 33k]  flu
Linda  3         [37, 43]  [26k, 33k]  gastritis
Gary   4         [41, 46]  [20k, 30k]  flu
Mary   4         [41, 46]  [20k, 30k]  gastritis
Ray    5         [54, 56]  [31k, 34k]  dyspepsia
Steve  5         [54, 56]  [31k, 34k]  gastritis
Tom    6         [60, 65]  [36k, 44k]  gastritis
Vince  6         [60, 65]  [36k, 44k]  flu

The auxiliary relation R(2) for T*(2)

Group-ID  Count
1         1
3         1

Bob's group in the counterfeited T*(2) contains the same diseases as his group in T*(1), {dyspepsia, bronchitis}, so the adversary learns nothing new about Bob.

Generalization T*(1)

Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  dyspepsia
Alice  1     [21, 22]  [12k, 14k]  bronchitis
Andy   2     [23, 24]  [18k, 25k]  flu
David  2     [23, 24]  [18k, 25k]  gastritis
Gary   3     [36, 41]  [20k, 27k]  flu
Helen  3     [36, 41]  [20k, 27k]  gastritis
Jane   4     [37, 43]  [26k, 35k]  dyspepsia
Ken    4     [37, 43]  [26k, 35k]  flu
Linda  4     [37, 43]  [26k, 35k]  gastritis
Paul   5     [52, 56]  [33k, 34k]  dyspepsia
Steve  5     [52, 56]  [33k, 34k]  gastritis


m-uniqueness

A generalized table T*(j) is m-unique if and only if
• each QI-group in T*(j) contains at least m tuples, and
• all tuples in the same QI-group have different sensitive values.

For example, the generalization T*(1) (groups 1 to 5) is a 2-unique generalized table.

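Signature

The signature of an individual in a generalized table is the set of sensitive values in that individual's QI-group.
• The signature of Bob in T*(1) is {dyspepsia, bronchitis}.
• The signature of Jane in T*(1) is {dyspepsia, flu, gastritis}.

T*(1)

Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  dyspepsia
Alice  1     [21, 22]  [12k, 14k]  bronchitis
…      …     …         …           …
Jane   4     [37, 43]  [26k, 35k]  dyspepsia
Ken    4     [37, 43]  [26k, 35k]  flu
Linda  4     [37, 43]  [26k, 35k]  gastritis
…      …     …         …           …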

The m-invariance principle

A sequence of generalized tables T*(1), …, T*(n) is m-invariant if and only if
• T*(1), …, T*(n) are all m-unique, and
• each individual has the same signature in every generalized table in which he or she is involved.

For example, T*(1) and the counterfeited generalization T*(2) are both 2-unique, and Bob's signature is {dyspepsia, bronchitis} in each of them.
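The two conditions can be sketched as a small checker. This is an illustrative sketch only: the data layout (a dictionary of QI-groups plus a name-to-group mapping) and all function names are assumptions, and only the first two groups of each table are included to keep the example short.

```python
def is_m_unique(table, m):
    """table maps a group id to the list of sensitive values in that QI-group."""
    return all(len(vals) >= m and len(set(vals)) == len(vals)
               for vals in table.values())

def signature(table, membership, person):
    """Set of sensitive values in the QI-group that contains `person`."""
    return frozenset(table[membership[person]])

def is_m_invariant(releases, m):
    """releases is a list of (table, membership) pairs, one per publication."""
    if not all(is_m_unique(table, m) for table, _ in releases):
        return False
    seen = {}
    for table, membership in releases:
        for person in membership:
            sig = signature(table, membership, person)
            # A person's signature must never change between releases.
            if seen.setdefault(person, sig) != sig:
                return False
    return True

t1 = ({1: ["dyspepsia", "bronchitis"], 2: ["flu", "gastritis"]},
      {"Bob": 1, "Alice": 1, "Andy": 2, "David": 2})
t2_plain = ({1: ["dyspepsia", "gastritis"]}, {"Bob": 1, "David": 1})
t2_counterfeit = ({1: ["dyspepsia", "bronchitis"], 2: ["gastritis", "flu"]},
                  {"Bob": 1, "David": 2, "Emily": 2})  # c1 is not a real person

plain_ok = is_m_invariant([t1, t2_plain], m=2)
counterfeit_ok = is_m_invariant([t1, t2_counterfeit], m=2)
print(plain_ok, counterfeit_ok)
```

The plain republication fails because Bob's signature changes, while the counterfeited one keeps every signature fixed.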

Motivation 1: Personalization

• Andy does not want anyone to know that he had a stomach problem.
• Sarah does not mind at all if others find out that she had flu.

An external database

Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

A 2-diverse table

Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu

Motivation 2: SA generalization

How many female patients are there with age above 30?
• Estimate from the 2-diverse table above, whose group with Age [21, 60] holds 4 tuples: 4 ∙ (60 - 30) / (60 - 20) = 3
• Real answer: 1

Motivation 2: SA generalization (cont.)

Generalization of the sensitive attribute is beneficial in this case.

A better generalized table

Age       Sex  Zipcode         Disease
…         …    …               …
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 30]  F    [30001, 40000]  gastritis
[21, 30]  F    [30001, 40000]  gastritis
[21, 30]  F    [30001, 40000]  flu
56        F    58000           respiratory infection

Personalized anonymity

We propose
• a mechanism to capture personalized privacy requirements
• criteria for measuring the degree of security provided by a generalized table

Guarding node

Taxonomy of the sensitive attribute:

any illness
   respiratory system problem
      respiratory infection
         flu
         pneumonia
         bronchitis
   digestive system problem
      stomach disease
         gastric ulcer
         dyspepsia
         gastritis

• Andy does not want anyone to know that he had a stomach problem. He can specify "stomach disease" as the guarding node for his tuple. The data publisher should prevent an adversary from associating Andy with "stomach disease".
• Sarah is willing to disclose her exact symptom. She can specify Ø as the guarding node for her tuple.
• Bill does not have any special preference. He can set the guarding node for his tuple to be the same as his sensitive value.

Name   Age  Sex  Zipcode  Disease        guarding node
Andy   4    M    12000    gastric ulcer  stomach disease
Sarah  28   F    37000    flu            Ø
Bill   5    M    14000    dyspepsia      dyspepsia

A personalized approach

Name   Age  Sex  Zipcode  Disease        guarding node
Andy   4    M    12000    gastric ulcer  stomach disease
Bill   5    M    14000    dyspepsia      dyspepsia
Ken    6    M    18000    pneumonia      respiratory infection
Nash   9    M    19000    bronchitis     bronchitis
Alice  12   F    22000    flu            flu
Betty  19   F    24000    pneumonia      pneumonia
Linda  21   F    33000    gastritis      gastritis
Jane   25   F    34000    gastritis      Ø
Sarah  28   F    37000    flu            Ø
Mary   56   F    58000    flu            flu

Personalized anonymity

A table satisfies personalized anonymity with a parameter pbreach iff no adversary can breach the privacy requirement of any tuple with a probability above pbreach.

If pbreach = 0.3, then any adversary should have no more than a 30% probability of finding out that:
• Andy had a stomach disease
• Bill had dyspepsia
• etc.

Personalized anonymity

Personalized anonymity with respect to a predefined parameter pbreach: an adversary can breach the privacy requirement of any tuple with a probability of at most pbreach.

Age       Sex  Zipcode         Disease
[1, 10]   M    [10001, 20000]  gastric ulcer
[1, 10]   M    [10001, 20000]  dyspepsia
[1, 10]   M    [10001, 20000]  pneumonia
[1, 10]   M    [10001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
21        F    33000           stomach disease
25        F    34000           gastritis
28        F    37000           flu
56        F    58000           respiratory infection

• We need a method for calculating the breach probabilities. What is the probability that Andy had some stomach problem?

Combinatorial reconstruction

Assumptions
• The adversary has no prior knowledge about each individual.
• Every individual involved in the microdata also appears in the external database.

Andy does not want anyone to know that he had some stomach problem. What is the probability that the adversary can find out that "Andy had a stomach disease"? The adversary works from the external database and the generalized table above.

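Combinatorial reconstruction (cont.)

Can each individual appear more than once?
• No = the primary case
• Yes = the non-primary case

Some possible reconstructions assign the four tuples of the first QI-group (gastric ulcer, dyspepsia, pneumonia, bronchitis) to the five candidates Andy, Bill, Ken, Nash and Mike:
• the primary case: each candidate owns at most one tuple
• the non-primary case: a candidate may own several tuples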

Breach probability (primary)

• There are 120 possible reconstructions in total.
• If Andy is associated with a stomach disease in nb reconstructions, the probability that the adversary associates Andy with some stomach problem is nb / 120.
• Andy is associated with
   • gastric ulcer in 24 reconstructions
   • dyspepsia in 24 reconstructions
   • gastritis in 0 reconstructions
• nb = 48, so the breach probability for Andy's tuple is 48 / 120 = 2 / 5.

Breach probability (non-primary)

• There are 625 possible reconstructions in total.
• Andy is associated with gastric ulcer or dyspepsia or gastritis in 225 reconstructions.
• nb = 225, so the breach probability for Andy's tuple is 225 / 625 = 9 / 25.
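Both counts can be reproduced by brute-force enumeration. The sketch below assumes, as in the slides, that the four tuples of the first QI-group are assigned to the five candidates from the external database; `itertools.permutations` models the primary case (each person owns at most one tuple) and `itertools.product` the non-primary case. The function name is illustrative.

```python
from itertools import permutations, product

people = ["Andy", "Bill", "Ken", "Nash", "Mike"]
tuples = ["gastric ulcer", "dyspepsia", "pneumonia", "bronchitis"]
stomach = {"gastric ulcer", "dyspepsia", "gastritis"}  # leaves under "stomach disease"

def breach_counts(assignments, target="Andy"):
    """Return (nb, total): reconstructions linking target to a stomach disease."""
    assignments = list(assignments)
    nb = sum(
        any(owner == target and disease in stomach
            for owner, disease in zip(assignment, tuples))
        for assignment in assignments
    )
    return nb, len(assignments)

# assignment[i] is the person who owns tuples[i]
primary = breach_counts(permutations(people, len(tuples)))
non_primary = breach_counts(product(people, repeat=len(tuples)))
print(primary, non_primary)
```

The results are (48, 120) and (225, 625), that is, breach probabilities of 2/5 and 9/25.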


Defect of generalization

Query A:
SELECT COUNT(*) from Unknown-Microdata
WHERE Disease = 'pneumonia' AND Age in [0, 30]
AND Zipcode in [10001, 20000]

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis

• Estimated answer: 2 * p, where p is the probability that each of the two pneumonia tuples satisfies the query conditions

Defect of generalization (cont.)

Figure: the generalized region R1 of the two pneumonia tuples (Age [21, 60], Zipcode [10001, 60000]) and the query region Q, plotted with Age on the x-axis and Zipcode on the y-axis.

• p = Area( R1 ∩ Q ) / Area( R1 ) = 0.05
• Estimated answer for query A: 2 * p = 0.1

Defect of generalization (cont.)

The estimated answer for query A from the generalized table is 0.1. The original microdata:

Name   Age  Sex  Zipcode  Disease
Bob    23   M    11000    pneumonia
Ken    27   M    13000    dyspepsia
Peter  35   M    59000    dyspepsia
Sam    59   M    12000    pneumonia
Jane   61   F    54000    flu
Linda  65   F    25000    gastritis
Alice  65   F    25000    flu
Mandy  70   F    30000    bronchitis

• The exact answer should be: 1
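The gap between 0.1 and 1 can be reproduced with a short sketch. It assumes that tuples are uniformly distributed inside their generalized ranges and counts integer values inclusively (10 of 40 ages, 10000 of 50000 Zipcodes), which matches the 0.05 above; only the male QI-group is included because the other group contains no pneumonia. Names are illustrative.

```python
def overlap_fraction(rng, query):
    """Fraction of the integers in rng (inclusive) that also fall in query."""
    lo, hi = max(rng[0], query[0]), min(rng[1], query[1])
    return max(0, hi - lo + 1) / (rng[1] - rng[0] + 1)

generalized = [
    ((21, 60), (10001, 60000), "pneumonia"),
    ((21, 60), (10001, 60000), "dyspepsia"),
    ((21, 60), (10001, 60000), "dyspepsia"),
    ((21, 60), (10001, 60000), "pneumonia"),
]
microdata = [
    (23, 11000, "pneumonia"), (27, 13000, "dyspepsia"),
    (35, 59000, "dyspepsia"), (59, 12000, "pneumonia"),
]
q_age, q_zip = (0, 30), (10001, 20000)

# Estimate from the generalized table: sum of p over the pneumonia tuples.
estimate = sum(
    overlap_fraction(age, q_age) * overlap_fraction(zipcode, q_zip)
    for age, zipcode, disease in generalized if disease == "pneumonia"
)
# Exact answer from the microdata.
exact = sum(
    1 for age, zipcode, disease in microdata
    if disease == "pneumonia"
    and q_age[0] <= age <= q_age[1]
    and q_zip[0] <= zipcode <= q_zip[1]
)
print(estimate, exact)
```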

Basic Idea of Anatomy

For a given microdata table, Anatomy releases a quasi-identifier table (QIT) and a sensitive table (ST).

Microdata

Age  Sex  Zipcode  Disease
23   M    11000    pneumonia
27   M    13000    dyspepsia
35   M    59000    dyspepsia
59   M    12000    pneumonia
61   F    54000    flu
65   F    25000    gastritis
65   F    25000    flu
70   F    30000    bronchitis

Quasi-identifier Table (QIT)

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

Sensitive Table (ST)

Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1

Basic Idea of Anatomy (cont.)

1. Select a partition of the tuples, here a 2-diverse partition:

QI group  Age  Sex  Zipcode  Disease
1         23   M    11000    pneumonia
1         27   M    13000    dyspepsia
1         35   M    59000    dyspepsia
1         59   M    12000    pneumonia
2         61   F    54000    flu
2         65   F    25000    gastritis
2         65   F    25000    flu
2         70   F    30000    bronchitis

Basic Idea of Anatomy (cont.)

2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition. Each group is split into its QI columns and its Disease column, and both parts are tagged with the Group-ID:

quasi-identifier table (QIT)

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

sensitive table (ST), one row per tuple

Group-ID  Disease
1         pneumonia
1         dyspepsia
1         dyspepsia
1         pneumonia
2         flu
2         gastritis
2         flu
2         bronchitis

The ST is then condensed to one row per group and disease with a Count column:

Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1

Privacy Preservation

From a pair of QIT and ST generated from an l-diverse partition, the adversary can infer the sensitive value of each individual with confidence at most 1/l.

For example, an adversary who knows

Name  Age  Sex  Zipcode
Bob   23   M    11000

finds Bob in group 1 of the QIT. The ST only says that group 1 holds 2 dyspepsia and 2 pneumonia tuples, so the confidence for either disease is 1/2.


Accuracy of Data Analysis

Query A:
SELECT COUNT(*) from Unknown-Microdata
WHERE Disease = 'pneumonia' AND Age in [0, 30]
AND Zipcode in [10001, 20000]

• 2 patients have contracted pneumonia (group 1 of the ST).
• 2 out of the 4 patients in group 1 satisfy the query condition on Age and Zipcode (t1 and t2 below).
• Estimated answer for query A: 2 * 2 / 4 = 1, which is also the actual result from the original microdata.

Figure: the four group-1 tuples t1 to t4 and the query region Q, plotted with Age on the x-axis and Zipcode on the y-axis.

     Age  Sex  Zipcode  Group-ID
t1   23   M    11000    1
t2   27   M    13000    1
t3   35   M    59000    1
t4   59   M    12000    1
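The whole Anatomy pipeline for this example fits in a short sketch: build the QIT and ST from the given 2-diverse partition, then answer query A from the two tables. The function names `anatomize` and `estimate_count` are hypothetical, and the partition is taken as given rather than computed.

```python
from collections import Counter

microdata = [
    (23, "M", 11000, "pneumonia"), (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"), (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"), (65, "F", 25000, "gastritis"),
    (65, "F", 25000, "flu"), (70, "F", 30000, "bronchitis"),
]
partition = [microdata[:4], microdata[4:]]  # the 2-diverse partition from the slides

def anatomize(partition):
    """Split each group into QIT rows and per-disease counts in the ST."""
    qit, st = [], []
    for gid, group in enumerate(partition, start=1):
        qit += [(age, sex, zipcode, gid) for age, sex, zipcode, _ in group]
        counts = Counter(disease for *_, disease in group)
        st += [(gid, disease, n) for disease, n in sorted(counts.items())]
    return qit, st

def estimate_count(qit, st, disease, age_rng, zip_rng):
    """For each group: count(disease) * share of members matching the QI condition."""
    total = 0.0
    for gid, d, n in st:
        if d != disease:
            continue
        members = [row for row in qit if row[3] == gid]
        matching = [row for row in members
                    if age_rng[0] <= row[0] <= age_rng[1]
                    and zip_rng[0] <= row[2] <= zip_rng[1]]
        total += n * len(matching) / len(members)
    return total

qit, st = anatomize(partition)
answer = estimate_count(qit, st, "pneumonia", (0, 30), (10001, 20000))
print(st)
print(answer)
```

Because the QIT keeps exact QI values, the share of matching members is computed from real data instead of an assumed uniform spread, which is why the estimate is 2 * 2 / 4 = 1.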

Conclusion

• Limitations of l-diversity
   • l-diversity is difficult and unnecessary to achieve
   • l-diversity is insufficient in preventing attribute disclosure
• t-Closeness as a new privacy measure
   • The overall distribution of sensitive values should be public information
   • The separation of the knowledge gain
• EMD to measure distance
   • EMD captures semantic distance well
   • Simple formulas for three ground distances

Conclusions

• m-invariant tables support republication of dynamic datasets.
• Guarding nodes allow individuals to describe their privacy requirements better.
• Anatomy outperforms generalization by allowing much more accurate data analysis on the published data.

Thank you!

Questions?
