Page 1: Title slide

PPDM Study Group, Session 3: "No Free Lunch in Data Privacy"

(Daniel Kifer, Ashwin Machanavajjhala, SIGMOD 2011)

RIKEN, Hiromi Arai

Page 2: Contributions of this paper

1. Simplify the impossibility result*
2. Show a privacy definition that relies on assumptions about the data-generating mechanism, and compare it with DP
3. Propose a guideline for determining whether DP is suitable for a given application
4. Demonstrate cases where DP does not meet the guideline:
   1. when applied to arbitrary SNS data
   2. when applied to tabular data when the attacker has aggregate-level background knowledge
5. Propose a modification of DP for tabular data with aggregate-level background knowledge

* Briefly: "answering many queries w/ bounded noise does not preserve privacy"

Page 3: Outline

- Brief review of differential privacy (DP)
- Analysis of the attacker
  - Define the discriminant and the non-privacy game; state the no-free-lunch theorem
  - Show the relation between DP and the no-free-lunch theorem
- Privacy risks of DP algorithms against various attackers
  - Unsuitable cases: naive application to correlated data
- New DP definition subject to background knowledge
  - Introducing restrictions that represent previously released exact query answers

Page 4: Differential privacy: problem setting

- Query-answering mechanism over a collection of private data. The DB wants to preserve the privacy of individuals; query answers are statistical information, etc.

Private data:

customer | age | Item A | Item B | ...
---------|-----|--------|--------|----
A        | 25  | 0      | 1      | ...
B        | 42  | 1      | 1      | ...

Example answer to a query:

customer age | Type X | Type Y | Type Z
-------------|--------|--------|-------
20s          | 324    | 1      | 52
30s          | 34     | 34     | 13

Page 5: Differential privacy: motivation

- A privacy guarantee that limits the risk incurred by JOINING encourages participation in the dataset.
- Minimize the increased risk to an individual incurred by joining (or leaving) the database (NOT a comparison of the adversary's prior and posterior views).

[Figure: the query answers "w/ Yuko" and "w/o Yuko" in the DB are not so different.]

Recall Dalenius's problem:
- If the statistical database teaches us anything at all, then it should change our beliefs about individuals.
- The things that statistical databases are designed to teach can, sometimes indirectly, cause damage to an individual, even if this individual is not in the database.

Page 6: Differential privacy: definition

[Dwork06]

[Figure: a randomized query-response algorithm K maps neighboring databases D and D' into Range(K). For any S ⊆ Range(K), the output probabilities are almost the same:]

P(K(D) ∈ S) ≤ e^ε · P(K(D') ∈ S)

The definition of the neighboring DBs is very important in this paper.

Page 7: Differential privacy: mechanism

- Density function of the Laplace distribution with scale 1/ε: f(z) ∝ exp(−ε|z|)
- For any z, z' such that |z − z'| ≤ 1, the density at z is at most e^ε times the density at z', satisfying the condition in [Dwork06]. (See the sketch below.)
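A minimal sketch of this mechanism (my illustration; the function name and parameter values are mine, not the paper's):

```python
import numpy as np

def laplace_mechanism(true_answer: float, sensitivity: float, epsilon: float) -> float:
    """Return an epsilon-DP answer: add Laplace noise with scale sensitivity/epsilon."""
    rng = np.random.default_rng()
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# A count query has sensitivity 1 (one tuple changes the count by at most 1),
# so Lap(1/epsilon) noise suffices for epsilon-DP.
noisy_count = laplace_mechanism(true_answer=324, sensitivity=1.0, epsilon=0.5)
print(noisy_count)
```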

Page 8: Definitions of DP

- Two flavors of DP:
  - Deleting or inserting a tuple: unbounded DP
  - Changing a tuple's value: bounded DP
- Note that the existence of the tuple ≠ participation!

Page 9: The no-free-lunch theorem

- It is not possible to guarantee both privacy and utility without making assumptions about the data-generating mechanism.
- To discuss this problem:
  - Define the discriminant ω as a lower bound on utility
  - Analyze ω of the Laplace mechanism
  - Define the non-privacy game
  - State the no-free-lunch theorem
  - Free-lunch theorem for DP

Page 10: Discriminant (as a utility measure)

- ω: a measure of query accuracy. If ω ≈ 1*, A answers with reasonable accuracy. (See the definition sketch below.)
- A: randomized query-answering algorithm
- Integer k: like an anonymity parameter?
- Constraint c: lower bound on the utility of A with parameter k.

* Note that the discriminant is 1 for a deterministic algorithm, e.g. a k-anonymity algorithm.
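The picture on the next slide corresponds to the following definition; this is my reconstruction from the paper's definition, since the slide only shows the diagram:

```latex
\[
\omega(k, A) \;=\; \sup\Bigl\{\, c \;\Bigm|\; \exists\, D_1, \dots, D_k
\text{ and pairwise disjoint } S_1, \dots, S_k \subseteq \operatorname{Range}(A)
\text{ with } P\bigl(A(D_i) \in S_i\bigr) \ge c \text{ for all } i \Bigr\}
\]
```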

Page 11: Discriminant

[Figure: e.g. k = 2. Two databases D and D' are mapped into disjoint sets S and S' in Range(A), with P(A(D) ∈ S) ≥ c and P(A(D') ∈ S') ≥ c.]

Page 12: Example of discriminant

- Cancer-patient DB
- Number of cancer patients in DB: D1: 0 / D2: 10,000 / D3: 20,000
- S1 = [0, 1000], S2 = [9000, 11000], S3 = [19000, ∞)
- P(A(Di) ∈ Si) ≥ 0.95 for all i, so the discriminant ω(3, A) ≥ 0.95

Page 13: Discriminant of the Laplace mechanism

Intuitive description:
- Laplace mechanism w/ sensitivity 0.5
- Choose n large enough → we can choose {Di} and {Si} so that the distances between the Di's (and between the ranges Si, which grow ∝ n) are large enough → the discriminant approaches 1. (See the simulation sketch below.)
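A small simulation sketch of this claim (my own illustration, not from the paper; for simplicity it uses noise scale 1/ε and intervals of width n): count databases spaced n apart, Laplace noise of fixed scale, and one interval per database. As n grows, each noisy answer lands in its own interval with probability approaching 1.

```python
import numpy as np

def discriminant_lower_bound(n: int, epsilon: float = 1.0, k: int = 3,
                             trials: int = 10_000) -> float:
    """Empirical min_i P(A(D_i) in S_i) for counts spaced n apart and
    disjoint intervals S_i = [i*n - n/2, i*n + n/2)."""
    rng = np.random.default_rng(0)
    probs = []
    for i in range(k):
        true_count = i * n
        noisy = true_count + rng.laplace(scale=1.0 / epsilon, size=trials)
        in_interval = np.mean((noisy >= true_count - n / 2) & (noisy < true_count + n / 2))
        probs.append(in_interval)
    return min(probs)

for n in (2, 10, 50):
    print(n, discriminant_lower_bound(n))  # approaches 1 as n grows
```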

Page 14: Non-privacy game

- Privacy definition as a game:
  - Assume a data-generating mechanism P
  - The attacker guesses the true answer q(D) to a sensitive query q from the randomized answer A(D)

[Figure: P generates D; the attacker sees A(D) and tries to output q(D).]

Page 15: No free lunch theorem

- Providing both privacy (as a game) and utility is impossible if there is no restriction on the data-generating mechanism.
- If D is uniformly distributed over {D1, ..., Dk}, the attacker's strategy is to guess q(Di) whenever A(D) ∈ Si.
- The attacker's guess is correct w/ probability 1/k w/o seeing A(D).
- He wins w/ probability ~1 w/ A(D)! (In symbols below.)
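My paraphrase of the two bullets above, with D uniform over databases {D_1, ..., D_k} chosen (with disjoint S_i) to nearly achieve the discriminant:

```latex
\[
P(\text{attacker wins}) \;\ge\; \sum_{i=1}^{k} P(D = D_i)\, P\bigl(A(D_i) \in S_i\bigr)
\;\ge\; \min_i P\bigl(A(D_i) \in S_i\bigr) \;\approx\; \omega(k, A) \;\approx\; 1,
\quad\text{vs. the baseline } 1/k .
\]
```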

Page 16: No free lunch and differential privacy

- A privacy definition w/o assumptions about the data: ε-free-lunch privacy, where the DP condition must hold for every pair of databases (not just neighbors).
- Note: the discriminant ω(k; A) of any algorithm A satisfying ε-free-lunch privacy is bounded: ω(k; A) ≤ e^ε / (k − 1 + e^ε).
- (My interpretation) Let P(A(D1) ∈ S1) = c. There are at least k − 1 other possible DB instances {Di} with disjoint {Si} where c·e^(−ε) ≤ P(A(Di) ∈ Si) ≤ c·e^ε. Using Σi P(A(Di) ∈ Si) ≤ 1, we get c ≤ e^ε / (k − 1 + e^ε). (Spelled out below.)
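Spelling out the interpretation above:

```latex
\[
1 \;\ge\; \sum_{i=1}^{k} P\bigl(A(D_i) \in S_i\bigr)
\;\ge\; c + (k-1)\,c\,e^{-\varepsilon}
\quad\Longrightarrow\quad
c \;\le\; \frac{1}{1 + (k-1)e^{-\varepsilon}} \;=\; \frac{e^{\varepsilon}}{e^{\varepsilon} + k - 1}.
\]
```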

Page 17: Privacy risks in differential privacy

- General guideline for determining a privacy definition
- Note that DP tuned for a more knowledgeable attacker adds less noise!

Consider three kinds of DP algorithms (next slide).

Page 18: Example

- Consider a table with 1 tuple (Bob) and two 2-bit attributes R1 and R2.

[Figure: three panels, each showing the 16 possible records (R1, R2) ∈ {00, 01, 10, 11}². Under bounded DP (changing the whole tuple) every other record is a direct neighbor; under attribute DP (changing one attribute value) some records are only neighbors' neighbors; under bit DP (flipping one bit) some records are only reachable as neighbors' neighbors' neighbors. Each panel is annotated with the probability of answering the true record.]

Question: isn't this a bound...?

Page 19: Example (continued)

[Figure: the same three panels as Page 18, annotated with the probability of answering the true record: higher under bit DP, lower under bounded DP.]

Page 20: Problem: correlated data

- If several records are known to have the same attribute value, the sensitivity must be larger.
  - E.g. a disease database: Bob and his family might have the same disease. (See the toy sketch below.)
- How should we deal with this problem?
  - Hide the evidence of participation (any influence of a certain participation).
- Discussion:
  - Growing SNSs
  - Prior knowledge about exact statistics
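A toy numpy sketch of the sensitivity point (my own illustration; the family size and count values are invented):

```python
import numpy as np

rng = np.random.default_rng()
epsilon = 1.0
family_size = 4        # Bob plus three relatives assumed to share his disease status
true_count = 130       # hypothetical number of patients with the disease

# Naive DP: one record changes the count by at most 1, so noise scale 1/epsilon.
naive_answer = true_count + rng.laplace(scale=1.0 / epsilon)

# Correlation-aware: hiding Bob's influence also hides his family's records,
# so the effective sensitivity is family_size, and the noise must grow accordingly.
aware_answer = true_count + rng.laplace(scale=family_size / epsilon)
print(naive_answer, aware_answer)
```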

Page 21: Growing social networks

Assume edge-growing SNSs.
- Let the network grow, after which the attacker asks the query "how many edges are there between the two communities?"
- Can we preserve the privacy of Bob's external link?

→ Make assumptions about the data-generating model:
1. Forest Fire model
2. Copying model
3. MVS model

From simulation:
- 1, 2 → we cannot set the noise parameter ε reliably unless we know the network parameters (model parameters or final edge number).
- 3 → has a steady-state distribution, rather favorable.

[Figure: initial state of two clusters, where only Bob has an external link (to Charlie); illustration of the MVS model.]

Page 22: Privacy breach after some exact data releases

- Example: contingency tables (deterministic) plus an additional differentially private data release
- A demonstration of the additional privacy breach (Sec. 4.1):
  - Consider a table T and an attribute R w/ domain {r1, ..., rk}
  - k − 1 queries:
  - If we additionally knew the exact answer to "select count(*) from T where R = ri", we would be able to exactly reconstruct the table. → The tuples are correlated!
- Additional differentially private answers... (next slide)

Page 23: Privacy breach after some exact statistics releases (2)

- Consider a table T and an attribute R w/ domain {r1, ..., rk}
- k − 1 queries:
- Additional k ε-differentially private answers...
- If k is large (e.g. a d-bit vector w/ 2^d possible values), the variance is small (recall Sec. 2.2, knowledge vs. privacy risk) → T is reconstructed w/ very high probability, due to the correlation with the prior release of information. (See the sketch below.)
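A minimal numpy sketch of the variance intuition (my own illustration; it assumes the k noisy answers can be treated as independent Laplace-perturbed releases of the same constrained count): averaging k answers shrinks the variance as 1/k, so rounding the average recovers the exact count with high probability.

```python
import numpy as np

rng = np.random.default_rng(0)
true_count, epsilon = 42, 0.1   # one constrained cell count; illustrative values

for k in (1, 100, 10_000):
    # k independent epsilon-DP Laplace answers to the same count query
    answers = true_count + rng.laplace(scale=1.0 / epsilon, size=k)
    estimate = answers.mean()   # variance 2 / (k * epsilon**2), shrinking in k
    print(k, estimate, round(estimate) == true_count)
```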

Page 24: Plausible deniability (idea)

- What should we do to maintain consistency with previously released deterministic query answers?
- We should choose bounded DP:
  - If the number of tuples was answered previously, the number of tuples should stay the same.
- In general, we can maintain consistency in several ways...
  - For example, exchange attribute values collaboratively.

Page 25: Differential privacy subject to background knowledge

- Definitions

Table T:

id   | gender | handedness
-----|--------|-----------
taro | male   | left
hana | female | right
...  | ...    | ...

Contingency table (each entry is a cell count):

      | R  | L  | Total
M     | 43 | 9  | 52
F     | 44 | 4  | 48
Total | 87 | 13 | 100

A "move" (changing one cell count, here cell (M, R): 43 → 42):

      | R  | L  | Total
M     | 42 | 9  | 52
F     | 44 | 4  | 48
Total | 87 | 13 | 100

Page 26: Differential privacy subject to background knowledge (cont.)

- Define DP for neighboring tables

Page 27: Neighbors induced by other prior statistics

- Example: exact query answer for "select gender, count(*) from T group by gender"
  - × unbounded DP: the number of tuples is already published
  - × bounded DP: we cannot arbitrarily modify a single tuple...
- Define neighbors that maintain consistency with the prior query answers.

Page 28: Neighbor-based algorithm for DP

- Definitions:
  - A distance function d(Ta, Tb) between two contingency tables
  - To achieve 2ε-generic DP, the exponential mechanism [McSherry06] can be used (with Δq = d(Ta, Tb)). (See the sketch below.)
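A generic sketch of the exponential mechanism (my illustration, not the paper's exact instantiation): score each candidate output by its negated distance to the true table and sample with probability ∝ exp(ε · score / (2Δ)).

```python
import numpy as np

def exponential_mechanism(candidates, score, epsilon: float, sensitivity: float,
                          rng=None):
    """Sample one candidate with probability proportional to
    exp(epsilon * score(c) / (2 * sensitivity))."""
    rng = rng or np.random.default_rng()
    scores = np.array([score(c) for c in candidates], dtype=float)
    logits = epsilon * scores / (2.0 * sensitivity)
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return candidates[rng.choice(len(candidates), p=probs)]

# Hypothetical usage: candidates are consistent contingency tables, scored by
# negated move-distance d to the true table (d and tables are assumptions here).
# released = exponential_mechanism(tables, lambda t: -d(true_table, t),
#                                  epsilon=1.0, sensitivity=1.0)
```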

Page 29: Neighbor-based algorithm for DP (cont.)

- Laplace mechanism:
  - Sensitivity Δq: the maximum of |q(Ta) − q(Tb)| over neighboring tables Ta, Tb
  - The Laplace mechanism adds noise with density proportional to exp(−ε|x| / Δq) to the query answer.

Page 30: NP-hard problem

- Dealing with neighbors under constraints is an NP-hard problem.

→ The general problem of finding an upper bound on the sensitivity of a query is at least co-NP-hard, and the authors suspect that the problem is Π2^p-complete.

Page 31: The case where efficient algorithms exist

- Consider a 2-dimensional table.
- Let the query qall be: "SELECT R1, R2, COUNT(*) FROM T GROUP BY R1, R2".
- The sensitivity of qall can be computed using the following lemma:
  - By removing subsets of moves that form Hamiltonian cycles, one shows that the original set of moves was already the smallest set of moves.

Page 32: Related works

- Impossibility results [Dwork06, Dinur & Nissim 03, etc.]
  - Answering many queries w/ bounded noise does not preserve privacy
- SNS privacy: relationship privacy [Rastogi09]
  - Adversarial privacy: an algorithm is private if the posterior dist. P(t|O) is close to the prior P(t) (weaker than indistinguishability)
  - Makes assumptions about the data (SNSs?)
  - Small perturbation, higher utility than the existing Laplace mechanism
- Resistance to various attackers [Kasiviswanathan08]

Page 33: Summary

1. They proposed the no-free-lunch theorem (a simplified version of the impossibility result), based on a privacy game.
2. They proposed a guideline for applying DP (we must consider not only the data-generating mechanism but also previously released data).
3. They showed examples where naive DP fails:
   1. when applied to arbitrary SNS data
   2. when applied to tabular data when the attacker has aggregate-level background knowledge
4. They proposed a modification of DP for tabular data with aggregate-level background knowledge (contingency tables). The knowledge can be described as constraints on the existence of neighbors.