Beyond k-Anonymity
Arik Friedman, November 2008
Seminar in Databases (236826)
Outline
• Recap – privacy and k-anonymity
• l-diversity (beyond k-anonymity)
• t-closeness (beyond k-anonymity and l-diversity)
Privacy?
Recap - k-Anonymity
Using medical data without disclosing patients' identity.
The problem: the ability of an attacker to cross the released data with external data.
Medical data: Zip, Birthdate, Gender, Ethnicity, Visit date, Diagnosis, Procedure, Medication, Total charge
Voter list: Zip, Birthdate, Gender, Name, Address, Date registered, Party affiliation, Date last voted
Quasi-identifier (attributes shared by both sources): Zip, Birthdate, Gender
K-Anonymity – Formal Definition
• RT - Released Table
• $(A_1, A_2, \dots, A_n)$ - Attributes
• $QI_{RT}$ - Quasi Identifier
• $RT[QI_{RT}]$ – Projection of RT on $QI_{RT}$
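The notation above stops short of the condition itself. A minimal Python sketch, assuming the usual reading of k-anonymity (every combination of quasi-identifier values in RT[QI_RT] occurs at least k times). The function name `is_k_anonymous` and the dict layout are illustrative only; the rows mirror the quasi-identifier values of the 4-anonymized example table.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifier, k):
    """True if every quasi-identifier value combination occurs at least k times."""
    counts = Counter(tuple(row[a] for a in quasi_identifier) for row in rows)
    return all(c >= k for c in counts.values())

# Quasi-identifier values of the generalized example table (three blocks of 4).
released = (
    [{"zip": "130**", "age": "<30"}] * 4
    + [{"zip": "1485*", "age": ">=40"}] * 4
    + [{"zip": "130**", "age": "3*"}] * 4
)

print(is_k_anonymous(released, ("zip", "age"), 4))  # 4-anonymous
print(is_k_anonymous(released, ("zip", "age"), 5))  # but not 5-anonymous
```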
Example – original data
Non-sensitive data: ZIP, Age, Nationality. Sensitive data: Condition.
#   ZIP    Age  Nationality  Condition
1   13053  28   Russian      Heart Disease
2   13068  29   American     Heart Disease
3   13068  21   Japanese     Viral Infection
4   13053  23   American     Viral Infection
5   14853  50   Indian       Cancer
6   14853  55   Russian      Heart Disease
7   14850  47   American     Viral Infection
8   14850  49   American     Viral Infection
9   13053  31   American     Cancer
10  13053  37   Indian       Cancer
11  13068  36   Japanese     Cancer
12  13068  35   American     Cancer
Example - 4-anonymized Table (step 1: Nationality suppressed)
Non-sensitive data: ZIP, Age, Nationality. Sensitive data: Condition.
#   ZIP    Age  Nationality  Condition
1   13053  28   *            Heart Disease
2   13068  29   *            Heart Disease
3   13068  21   *            Viral Infection
4   13053  23   *            Viral Infection
5   14853  50   *            Cancer
6   14853  55   *            Heart Disease
7   14850  47   *            Viral Infection
8   14850  49   *            Viral Infection
9   13053  31   *            Cancer
10  13053  37   *            Cancer
11  13068  36   *            Cancer
12  13068  35   *            Cancer
Example - 4-anonymized Table (step 2: Age generalized)
Non-sensitive data: ZIP, Age, Nationality. Sensitive data: Condition.
#   ZIP    Age  Nationality  Condition
1   13053  <30  *            Heart Disease
2   13068  <30  *            Heart Disease
3   13068  <30  *            Viral Infection
4   13053  <30  *            Viral Infection
5   14853  ≥40  *            Cancer
6   14853  ≥40  *            Heart Disease
7   14850  ≥40  *            Viral Infection
8   14850  ≥40  *            Viral Infection
9   13053  3*   *            Cancer
10  13053  3*   *            Cancer
11  13068  3*   *            Cancer
12  13068  3*   *            Cancer
Example - 4-anonymized Table (step 3: ZIP generalized)
Non-sensitive data: ZIP, Age, Nationality. Sensitive data: Condition.
#   ZIP    Age  Nationality  Condition
1   130**  <30  *            Heart Disease
2   130**  <30  *            Heart Disease
3   130**  <30  *            Viral Infection
4   130**  <30  *            Viral Infection
5   1485*  ≥40  *            Cancer
6   1485*  ≥40  *            Heart Disease
7   1485*  ≥40  *            Viral Infection
8   1485*  ≥40  *            Viral Infection
9   130**  3*   *            Cancer
10  130**  3*   *            Cancer
11  130**  3*   *            Cancer
12  130**  3*   *            Cancer
We have 4-anonymity!!!
We have privacy!!!!
Example - 4-anonymized Table
Non-sensitive data: ZIP, Age, Nat. Sensitive data: Condition.
#   ZIP    Age  Nat.  Condition
1   130**  <30  *     Heart Disease
2   130**  <30  *     Heart Disease
3   130**  <30  *     Viral Infection
4   130**  <30  *     Viral Infection
5   1485*  ≥40  *     Cancer
6   1485*  ≥40  *     Heart Disease
7   1485*  ≥40  *     Viral Infection
8   1485*  ≥40  *     Viral Infection
9   130**  3*   *     Cancer
10  130**  3*   *     Cancer
11  130**  3*   *     Cancer
12  130**  3*   *     Cancer
Suppose the attacker knows the non-sensitive attributes of:
Name   Zip    Age  Nationality
Umeko  13068  21   Japanese
Bob    13053  31   American
And the fact that Japanese have a very low incidence of heart disease.
Bob has cancer!
Umeko has viral infection!
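Both inferences can be replayed mechanically. The following Python sketch is only an illustration: the helper names (`zip_matches`, `age_matches`, `candidate_conditions`) and the string encoding of generalized values are assumptions, while the rows are those of the 4-anonymized table.

```python
# Rows of the 4-anonymized table: (generalized ZIP, generalized age, condition).
TABLE = (
    [("130**", "<30", c) for c in ["Heart Disease"] * 2 + ["Viral Infection"] * 2]
    + [("1485*", ">=40", c)
       for c in ["Cancer", "Heart Disease", "Viral Infection", "Viral Infection"]]
    + [("130**", "3*", "Cancer")] * 4
)

def zip_matches(value, pattern):
    """A '*' in the pattern matches any character at that position."""
    return all(p in ("*", v) for v, p in zip(value, pattern))

def age_matches(age, pattern):
    if pattern.startswith("<"):
        return age < int(pattern[1:])
    if pattern.startswith(">="):
        return age >= int(pattern[2:])
    return zip_matches(str(age), pattern)  # digit pattern such as "3*"

def candidate_conditions(zip_code, age, excluded=()):
    """Conditions still possible for a person the attacker knows is in the table."""
    block = {c for z, a, c in TABLE
             if zip_matches(zip_code, z) and age_matches(age, a)}
    return block - set(excluded)

# Homogeneity attack: Bob's block contains a single condition.
print(candidate_conditions("13053", 31))
# Background knowledge attack: heart disease is ruled out for Umeko.
print(candidate_conditions("13068", 21, excluded=["Heart Disease"]))
```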
k-Anonymity Drawbacks
Basic reasons for leak:
Sensitive attributes lack diversity in values
• Homogeneity Attack
Attacker has additional background knowledge
• Background knowledge Attack
Hence a new solution has been proposed in addition to k-anonymity – l-diversity
Adversary's background knowledge
Has access to published table T* and knows that it is a generalization of some base table T
Instance-level background knowledge:
• Some individuals are present in the table.
• Knowledge about sensitive attributes of specific individuals.
Demographic background knowledge:
• Partial knowledge about the distribution of sensitive and non-sensitive attributes in the population.
Diversity in the sensitive attribute values should mitigate both!
Some notation…
• $T = \{t_1, t_2, \dots, t_n\}$: a table with attributes $A_1, A_2, \dots, A_m$; a subset of some population $\Omega$
• $t[C] = (t[C_1], t[C_2], \dots, t[C_p])$: projection of t onto a set of attributes $C \subseteq A$
• $S \subseteq A$ – sensitive attributes
• $QI \subseteq A$ – quasi-identifier attributes
• T*: anonymized table
• q*-block – the set of records that were generalized to the same value q* in T*
Bayes Optimal Privacy
Ideal notion of privacy: models background knowledge as a probability distribution over attributes
Uses Bayesian inference techniques
Simplifying assumptions:
• A single, multi-dimensional quasi-identifier attribute Q
• A single sensitive attribute S
• T is a simple random sample from $\Omega$
• Adversary Alice knows the complete joint distribution f of Q and S (worst case assumption)
Bayes Optimal Privacy
Assume Bob appears in generalized table T*. Alice's prior belief of Bob's sensitive attribute:
$\alpha_{(q,s)} = P_f(t[S] = s \mid t[Q] = q)$
After seeing T*, Alice's belief changes to its posterior value (or observed belief):
$\beta_{(q,s,T^*)} = P_f(t[S] = s \mid t[Q] = q \wedge \exists t^* \in T^*,\ t^* \text{ generalizes } t)$
We wouldn't want Alice to learn "much": $\alpha_{(q,s)} \approx \beta_{(q,s,T^*)}$
Bayes Optimal Privacy - Example
Bob, Alice's neighbor, is a 62-year-old state employee.
Alice's prior belief: 10% of men over 60 have cancer:
$\alpha_{(\text{age} \ge 60 \wedge \text{ZIPcode}=02138,\ \text{cancer})} = \alpha_{(\text{age} \ge 60,\ \text{cancer})} = 0.1$
In k-anonymized GIC data T*, the following lines could relate to Bob:
Age  Zipcode  Diagnosis
≥60  02138    Cancer
≥60  02138    Cancer
≥60  02138    Healthy
≥60  02138    Pneumonia
Alice's belief changes to its posterior value:
$\beta_{(\text{age} \ge 60 \wedge \text{ZIPcode}=02138,\ \text{cancer},\ T^*)} = 0.5$
Bayes Optimal Privacy
Theorem 3.1:
$$\beta_{(q,s,T^*)} = \frac{n_{(q^*,s)} \dfrac{f(s \mid q)}{f(s \mid q^*)}}{\sum_{s' \in S} n_{(q^*,s')} \dfrac{f(s' \mid q)}{f(s' \mid q^*)}}$$
where $n_{(q^*,s')}$ is the number of tuples in T* with t*[Q] = q* and t*[S] = s'
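As a sanity check, the formula of Theorem 3.1 can be evaluated directly. This is a minimal Python sketch: the function name `posterior_belief` and the dict-based inputs are illustrative, and the distribution values for Healthy and Pneumonia are hypothetical placeholders. It assumes f(s|q) = f(s|q*) for every value, so that the posterior reduces to the fraction of the q*-block carrying s, which is what the example with Bob shows.

```python
def posterior_belief(s, counts, f_given_q, f_given_qstar):
    """beta_(q,s,T*) as in Theorem 3.1.

    counts[v]        = n_(q*,v), tuples in the q*-block with sensitive value v
    f_given_q[v]     = f(v | q)
    f_given_qstar[v] = f(v | q*)
    """
    def weight(v):
        return counts[v] * f_given_q[v] / f_given_qstar[v]

    return weight(s) / sum(weight(v) for v in counts)

# The four lines that could relate to Bob.
counts = {"Cancer": 2, "Healthy": 1, "Pneumonia": 1}
# Hypothetical distribution; only the 0.1 for cancer comes from the example.
f = {"Cancer": 0.1, "Healthy": 0.8, "Pneumonia": 0.1}

print(posterior_belief("Cancer", counts, f, f))  # 2 of 4 lines, i.e. 0.5
```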
Privacy principles
Positive disclosure: the adversary can correctly identify the value of a sensitive attribute: $\exists q, s$ such that $\beta_{(q,s,T^*)} > 1 - \delta$ for a given $\delta$
Negative disclosure: the adversary can correctly eliminate the value of a sensitive attribute: $\beta_{(q,s,T^*)} < \epsilon$ for a given $\epsilon$ and $\exists t \in T$ such that $t[Q] = q$ but $t[S] \neq s$
Privacy principles
Note that not all positive and negative disclosures are bad
• If Alice already knew Bob has Cancer, there is nothing much one can do!
Uninformative principle: there should not be a large difference between the prior and posterior beliefs
Bayes Optimal Privacy
Limitations in practice
Insufficient knowledge: data publisher unlikely to know f
Publisher does not know how much the adversary actually knows
• He may have instance level knowledge
• No way to model non-probabilistic knowledge
Multiple adversaries having different levels of knowledge
Hence a practical definition is needed
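l-diversity principle
Revisit:
$$\beta_{(q,s,T^*)} = \frac{n_{(q^*,s)} \dfrac{f(s \mid q)}{f(s \mid q^*)}}{\sum_{s' \in S} n_{(q^*,s')} \dfrac{f(s' \mid q)}{f(s' \mid q^*)}}$$
Positive disclosure can occur when the term of one sensitive value s dominates the sum, i.e. the terms of all other values are negligible:
$$\exists s,\ \forall s' \neq s:\quad n_{(q^*,s')} \frac{f(s' \mid q)}{f(s' \mid q^*)} \ll n_{(q^*,s)} \frac{f(s \mid q)}{f(s \mid q^*)}$$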
l-diversity principle
Could occur due to combination of:
• Lack of diversity: mitigate by requiring "well-represented" sensitive values
• Strong background knowledge: at least l-1 damaging pieces of background knowledge required to succeed
-diversity principle-diversity principle
A A qq*-block is *-block is -diverse if it contains at -diverse if it contains at least least well-represented well-represented values for the values for the sensitive attribute S. sensitive attribute S.
A table is A table is -diverse if every -diverse if every qq*-block is *-block is -diverse.-diverse.
Example – Example – distinct distinct -diversity-diversity: there are at : there are at least l distinct values for the sensitive attribute least l distinct values for the sensitive attribute in each in each qq*-block.*-block.
Example – 3-distinct diverse Table
Non-sensitive data: ZIP, Age, Nationality. Sensitive data: Condition.
#   ZIP    Age    Nationality  Condition
1   1305*  <= 40  *            Heart Disease
2   1305*  <= 40  *            Viral Infection
3   1305*  <= 40  *            Cancer
4   1305*  <= 40  *            Cancer
5   1485*  >= 40  *            Cancer
6   1485*  >= 40  *            Heart Disease
7   1485*  >= 40  *            Viral Infection
8   1485*  >= 40  *            Viral Infection
9   1306*  <= 40  *            Heart Disease
10  1306*  <= 40  *            Viral Infection
11  1306*  <= 40  *            Cancer
12  1306*  <= 40  *            Cancer
We have 3-distinct diversity!!!
We have privacy!!!!
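Distinct l-diversity is simple to verify. A minimal Python sketch, with the illustrative name `is_distinct_l_diverse` and rows given as (q*-block key, sensitive value) pairs taken from the table above:

```python
from collections import defaultdict

def is_distinct_l_diverse(rows, l):
    """True if every q*-block holds at least l distinct sensitive values."""
    blocks = defaultdict(set)
    for q_star, sensitive in rows:
        blocks[q_star].add(sensitive)
    return all(len(values) >= l for values in blocks.values())

HD, VI, CA = "Heart Disease", "Viral Infection", "Cancer"
rows = (
    [(("1305*", "<= 40"), c) for c in [HD, VI, CA, CA]]
    + [(("1485*", ">= 40"), c) for c in [CA, HD, VI, VI]]
    + [(("1306*", "<= 40"), c) for c in [HD, VI, CA, CA]]
)

print(is_distinct_l_diverse(rows, 3))  # three distinct values per block
print(is_distinct_l_diverse(rows, 4))  # no block has a fourth value
```

Counting distinct values ignores how often each one occurs, which is the weakness the next example exploits.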
Example - 3-distinct diverse table
Non-sensitive data: ZIP, Age, Nat. Sensitive data: Condition.
#   ZIP    Age  Nat.  Condition
1   130**  <30  *     Heart Disease
2   130**  <30  *     Heart Disease
3   130**  <30  *     Viral Infection
4   130**  <30  *     Viral Infection
5   130**  <30  *     Viral Infection
6   130**  <30  *     Viral Infection
7   130**  <30  *     Viral Infection
8   130**  <30  *     Viral Infection
9   130**  <30  *     Viral Infection
10  130**  <30  *     Viral Infection
11  130**  <30  *     Viral Infection
12  130**  <30  *     Cancer
Suppose the attacker knows the non-sensitive attributes of:
Name   Zip    Age  Nationality
Umeko  13068  21   Japanese
And the fact that Japanese have a very low incidence of heart disease.
Still very likely that Umeko has viral infection!
Entropy l-diversity
A table is Entropy l-Diverse if for every q*-block:
$$-\sum_{s \in S} p_{(q^*,s)} \log\left(p_{(q^*,s)}\right) \ge \log(l)$$
where
$$p_{(q^*,s)} = \frac{n_{(q^*,s)}}{\sum_{s' \in S} n_{(q^*,s')}}$$
Example with 2 sensitive attribute values (entropy in log base 10; l is the largest value satisfying the condition, i.e. 10 raised to the entropy):
p(S1)  p(S2)  Entropy  l
1      0      0        1
0.9    0.1    0.14     1.38
0.8    0.2    0.22     1.65
0.7    0.3    0.27     1.84
0.6    0.4    0.29     1.96
0.5    0.5    0.3      2
Not feasible when one value is very common
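The table above can be reproduced with a few lines of Python. This is a sketch with illustrative names (`entropy_l`, `is_entropy_l_diverse`); it uses natural logarithms, which gives the same l as any other base because l is the base raised to the entropy.

```python
import math
from collections import Counter

def entropy_l(sensitive_values):
    """Largest l for which a q*-block is entropy l-diverse."""
    counts = Counter(sensitive_values)
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return math.exp(entropy)

def is_entropy_l_diverse(blocks, l):
    """A table is entropy l-diverse if every q*-block is."""
    return all(entropy_l(block) >= l for block in blocks)

block = ["S1"] * 9 + ["S2"]          # p(S1) = 0.9, p(S2) = 0.1
print(round(entropy_l(block), 2))    # 1.38, as in the table
```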
Recursive (c,l)-diversity
None of the sensitive values should occur too frequently.
Let $r_i$ be the number of times the i-th most frequent sensitive value appears in the q*-block
Given a constant c, recursive (c,l)-diversity is satisfied if
$r_1 < c\,(r_l + r_{l+1} + \dots + r_m)$
For example, with 3 sensitive values (m=3):
• (2,2)-diversity: $r_1 < 2(r_2 + r_3)$
• (2,3)-diversity: $r_1 < 2 r_3$
Equivalently: even if we eliminate a sensitive value, we still have (2,2)-diversity
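The inequality translates directly into code. A minimal Python sketch for a single q*-block; the name `is_recursive_cl_diverse` and the sample counts are illustrative.

```python
from collections import Counter

def is_recursive_cl_diverse(sensitive_values, c, l):
    """Check r_1 < c * (r_l + r_{l+1} + ... + r_m) for one q*-block."""
    # r[0] is the count of the most frequent value, r[1] of the second, ...
    r = sorted(Counter(sensitive_values).values(), reverse=True)
    # With fewer than l distinct values the tail sum is 0 and the check fails.
    return r[0] < c * sum(r[l - 1:])

block = ["a"] * 4 + ["b"] * 2 + ["c"]          # r = (4, 2, 1)
print(is_recursive_cl_diverse(block, 2, 2))    # 4 < 2 * (2 + 1)
print(is_recursive_cl_diverse(block, 2, 3))    # 4 < 2 * 1 does not hold
```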
An algorithm for l-diversity?
Monotonicity property: If T* preserves privacy, then so does every generalization of it
• Satisfied by k-anonymity
• Most k-anonymization algorithms work for any privacy measure that satisfies monotonicity - we can re-use previous algorithms directly
Bayes optimal privacy is not monotonic
l-diversity variants are monotonic!
Mondrian(partition)
  if (no allowable multidimensional cut for partition)
    return φ : partition → summary
  else
    dim ← choose_dimension()
    fs ← frequency_set(partition, dim)
    splitVal ← find_median(fs)
    lhs ← {t ∈ partition : t.dim ≤ splitVal}
    rhs ← {t ∈ partition : t.dim > splitVal}
    return Mondrian(rhs) ∪ Mondrian(lhs)
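The pseudocode can be turned into a short runnable Python sketch. Several parts are assumptions rather than part of the slide: the dimension is chosen by number of distinct values, a cut is "allowable" when both halves satisfy a caller-supplied predicate (here k-anonymity with k = 2, but any monotone measure such as an l-diversity check fits), the summary is the value range per dimension, and the Age/Weight records are hypothetical.

```python
def mondrian(partition, dims, is_allowable):
    """Recursively cut a list of dict records at the median of one dimension.

    Returns a list of (summary, records) pairs, one per final partition.
    """
    # Try dimensions with the most distinct values first (an assumed heuristic).
    for dim in sorted(dims, key=lambda d: -len({t[d] for t in partition})):
        values = sorted(t[dim] for t in partition)
        split_val = values[(len(values) - 1) // 2]          # median
        lhs = [t for t in partition if t[dim] <= split_val]
        rhs = [t for t in partition if t[dim] > split_val]
        if lhs and rhs and is_allowable(lhs) and is_allowable(rhs):
            return (mondrian(lhs, dims, is_allowable)
                    + mondrian(rhs, dims, is_allowable))
    # No allowable cut: summarise the partition by the range of each dimension.
    summary = {d: (min(t[d] for t in partition), max(t[d] for t in partition))
               for d in dims}
    return [(summary, partition)]

# Hypothetical records, for illustration only.
records = [{"age": a, "weight": w} for a, w in
           [(35, 50), (40, 55), (45, 60), (50, 65),
            (55, 70), (60, 75), (65, 80), (70, 85)]]

blocks = mondrian(records, ("age", "weight"), lambda p: len(p) >= 2)
for summary, members in blocks:
    print(summary, len(members))
```

Because the privacy test is passed in as a predicate, the same recursion serves k-anonymity and the monotone l-diversity variants alike.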
[Figure: Mondrian partitioning of records plotted by Age (x-axis) and Weight (y-axis)]
Example: Mondrian-entropy diverse, l = 1.89 (for two sensitive values, equivalent to limiting prevalence to up to 2/3. Also equivalent to recursive (2,2)-diversity)
Experiments
• Used Incognito (a popular generalization algorithm)
• Adult dataset (Census data) from the UCI machine learning repository (http://archive.ics.uci.edu/ml/datasets/Adult)
[Table: Adult Database Description, with one attribute marked "Experiment results refer to this sensitive attribute"]
Experiments - Utility
Intuitively: "usefulness" of the l-diverse and k-anonymized tables. Used k, l = 2, 4, 6, 8
[Figure: Number of generalization steps that were performed vs. k, l]
[Figure: Average size of q*-blocks generated (similar to CAVG) vs. k, l]
Example – 3-diverse Table (the same table as in the 3-distinct diverse example earlier)
We have 3-diversity!!!
We have privacy!!!!
Similarity attack
The attacker knows Bob's record: Zip 47678, Age 27
A 3-diverse patient table:
Zipcode  Age   Salary  Disease
476**    2*    20K     Gastric Ulcer
476**    2*    30K     Gastritis
476**    2*    40K     Stomach Cancer
4790*    ≥ 40  50K     Gastritis
4790*    ≥ 40  100K    Flu
4790*    ≥ 40  70K     Bronchitis
476**    3*    60K     Bronchitis
476**    3*    80K     Pneumonia
476**    3*    90K     Stomach Cancer
Conclusion
1. Bob's salary is in [20k,40k], which is relatively low.
2. Bob has some stomach-related disease.
l-diversity does not consider semantic meanings of sensitive values
l-diversity is insufficient to prevent attribute disclosure.
Skewness attack
Non-sensitive data: Age. Sensitive data: Condition.
#   Age  Condition
1   <30  Cancer
2   <30  Cancer
3   <30  Healthy
4   <30  Healthy
5   3*   Cancer
6   3*   Healthy
7   3*   Healthy
8   3*   Healthy
9   3*   Healthy
10  30   Healthy
11  30   Cancer
12  30   Cancer
13  30   Cancer
14  30   Cancer
Two sensitive values in the population: Cancer (1%) and Healthy (99%) (entropy l-diversity: l = 1.0576)
Rows 1-4: entropy l-diversity with l = 2
Rows 5-9: entropy l-diversity with l = 1.65
Rows 10-14: entropy l-diversity with l = 1.65
The last two blocks are equivalent in terms of l-diversity, but very different semantically
Attacker learned a lot!
t-Closeness: the main idea
Rationale
A completely generalized table:
Age  Zipcode  …  Gender  Disease
*    *        …  *       Flu
*    *        …  *       Heart Disease
*    *        …  *       Cancer
…    …        …  …       …
*    *        …  *       Gastritis
Belief  Knowledge
B0      External knowledge
B1      Overall distribution Q of sensitive values
t-Closeness: the main idea
Rationale
A released table:
Age   Zipcode  …  Gender  Disease
2*    479**    …  Male    Flu
2*    479**    …  Male    Heart Disease
2*    479**    …  Male    Cancer
…     …        …  …       …
≥ 50  4766*    …  *       Gastritis
Belief  Knowledge
B0      External knowledge
B1      Overall distribution Q of sensitive values
B2      Distribution Pi of sensitive values in each equivalence class
t-Closeness: the main idea
Rationale
Belief  Knowledge
B0      External knowledge
B1      Overall distribution Q of sensitive values
B2      Distribution Pi of sensitive values in each equivalence class
Observations
• Q should be treated as public
• Knowledge gain in two parts:
  Whole population (from B0 to B1)
  Specific individuals (from B1 to B2)
• We bound knowledge gain between B1 and B2 instead
Principle
• The distance between Q and Pi should be bounded by a threshold t.
t-closeness
An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t.
A table is said to have t-closeness if all equivalence classes have t-closeness.
A distance measure called Earth Mover's Distance is used. It maintains monotonicity!
Example – t-closeness
Non-sensitive data: ZIP, Age. Sensitive data: Salary, Condition.
#  ZIP    Age    Salary  Condition
1  4767*  <= 40  3K      Gastric ulcer
2  4767*  <= 40  5K      Stomach cancer
3  4767*  <= 40  9K      Pneumonia
4  4790*  >= 40  6K      Gastritis
5  4790*  >= 40  11K     Flu
6  4790*  >= 40  8K      Bronchitis
7  4760*  <= 40  4K      Gastritis
8  4760*  <= 40  7K      Bronchitis
9  4760*  <= 40  10K     Stomach cancer
We have 0.167-closeness w.r.t. Salary and 0.278-closeness w.r.t. Disease!!!
We have privacy!!!!
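The Salary figure can be checked with a short Python sketch. It assumes the ordered-distance form of the Earth Mover's Distance for numerical attributes (ground distance |i-j|/(m-1) between the i-th and j-th of m ordered values), under which the EMD equals the normalized sum of absolute cumulative differences. The names `emd_ordered` and `distribution` are illustrative, and the Disease figure, which relies on a distance over a disease hierarchy, is not covered here.

```python
def emd_ordered(p, q):
    """EMD between two distributions over the same m ordered values."""
    cumulative, total = 0.0, 0.0
    for p_i, q_i in zip(p, q):
        cumulative += p_i - q_i      # mass that must flow past this value
        total += abs(cumulative)
    return total / (len(p) - 1)

def distribution(values, domain):
    """Relative frequency of each domain value within `values`."""
    return [values.count(v) / len(values) for v in domain]

# Salaries (in K) per equivalence class, from the table above.
salaries = {"4767*": [3, 5, 9], "4790*": [6, 11, 8], "4760*": [4, 7, 10]}
domain = sorted(v for vs in salaries.values() for v in vs)
overall = distribution(domain, domain)   # distribution in the whole table

t = max(emd_ordered(distribution(vs, domain), overall)
        for vs in salaries.values())
print(round(t, 3))  # 0.167
```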
Netflix privacy breach
(Robust De-anonymization of Large Sparse Datasets, Narayanan and Shmatikov, 2008)
Released for the Netflix Prize contest
• 17,770 movie titles
• 480,189 users with random customer IDs
• Ratings: 1-5
• For each movie we have the ratings: (MovieID, CustomerID, Rating, Date)
Re-arrange by customerID:
Movie               CustomerID  Rank  Date
The Godfather       17236       4     20.5
Quantum of Solace   17236       2     20.11
Hamlet              17236       5     14.10
The Scorpion King   17236       1     12.8
The profit          17236       5     11.8
Netflix privacy breach
(Robust De-anonymization of Large Sparse Datasets, Narayanan and Shmatikov, 2008)
Can be linked, e.g., with IMDB data, to re-identify individuals!
[Figure: the Netflix data for customer 17236, linked to IMDB data]
(This example is made up. Possibly, James Hitchcock has nothing to do with Netflix)
Epilogue
"You have zero privacy anyway. Get over it."
Scott McNealy (Sun CEO, January 1999)
HIPAA excerpt
Health Insurance Portability and Accountability Act of 1996
Thank you!
Bibliography
• "Mondrian Multidimensional k-Anonymity", K. LeFevre, D.J. DeWitt, R. Ramakrishnan, 2006
• "l-diversity: Privacy beyond k-anonymity", A. Machanavajjhala, Johannes Gehrke, Daniel Kifer, 2006
• "t-closeness: Privacy beyond k-anonymity and l-diversity", Ninghui Li, Tiancheng Li, Suresh Venkatasubramanian, 2006
Presentations:
• "Privacy In Databases", B. Aditya Prakash
• "K-Anonymity and Other Cluster-Based Methods", Ge. Ruan