20130219 nofreelunch arai

PPDM Study Group, 3rd meeting: “No Free Lunch in Data Privacy”

(Daniel Kifer, Ashwin Machanavajjhala, SIGMOD 2011)

RIKEN, Hiromi Arai

Contributions of this paper

1.  Simplify the impossibility result*
2.  Show a privacy definition that relies on assumptions about the data-generating mechanism and compare it with DP
3.  Propose a guideline for determining whether DP is suitable for a given application
4.  Demonstrate cases where DP does not meet the guideline
    1.  when applied to arbitrary SNS data
    2.  when applied to tabular data when an attacker has aggregate-level background knowledge
5.  Propose a modification of DP for tabular data with aggregate-level background knowledge

*Briefly: “answering many queries w/ bounded noise does not preserve privacy”

Outline

•  Brief review of differential privacy (DP)
•  Analysis of the attacker
    •  Define the discriminant and the non-privacy game, note the no-free-lunch theorem
•  Show the relation between DP and the no-free-lunch theorem
•  Privacy risks of DP algorithms for various attackers
    •  Unsuitable cases: naïve application to correlated data
•  New DP definition subject to background knowledge
    •  Introduce restrictions that represent previously released exact query answers

Differential privacy: problem setting

•  Query-answering mechanism

[Figure: a query is sent to a DB holding a collection of private data, and an answer is returned. The DB wants to preserve the privacy of individuals. Query answers: statistical information etc.]

Collection of private data (example):

customer   age   Item A   Item B   …
A          25    0        1        …
B          42    1        1        …
…

Answer (example):

customer   Type X   Type Y   Type Z
20s        324      1        52
30s        34       34       13
…

Differential privacy: motivation

•  A privacy guarantee that limits the risk incurred by JOINING encourages participation in the dataset.
•  Minimize the increased risk to an individual incurred by joining (or leaving) the database. (NOT comparing an adversary's prior and posterior views)

[Figure: the answers computed w/ Yuko and w/o Yuko are not so different]

Recall: Dalenius’s problem
•  If the statistical database teaches us anything at all, then it should change our beliefs about individuals.
•  The things that statistical databases are designed to teach can, sometimes indirectly, cause damage to an individual, even if this individual is not in the database.

Differential privacy: definition [Dwork06]

[Figure: neighboring databases D and D’ are given to a randomized query-response algorithm K. For output sets S, S’, … within Range(K), P(K(D) ∈ S) and P(K(D’) ∈ S) are almost the same probability.]

The definition of the neighboring DBs is very important in this paper.
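Written out, the “almost same probability” condition in the figure is the standard ε-differential-privacy inequality of [Dwork06]. The privacy parameter ε does not appear in the text of this slide and is introduced here for reference.

```latex
\Pr[K(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[K(D') \in S]
\qquad \text{for all neighboring } D, D' \text{ and all } S \subseteq \mathrm{Range}(K)
```

Which pairs (D, D’) count as neighbors is exactly the degree of freedom the rest of the talk examines.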

Differential privacy: mechanism [Dwork06]

•  Density function of the Laplace distribution: p(z) ∝ exp(−ε|z|)
•  For any z, z’ such that |z − z’| ≤ 1, the density at z is at most e^ε times the density at z’, satisfying the condition in [Dwork06]
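A minimal sketch of the Laplace mechanism for a sensitivity-1 count query, together with a numerical check of the density-ratio property stated above. The function names and the value ε = 0.5 are illustrative choices, not taken from the slides.

```python
import math
import random


def laplace_density(z, epsilon):
    """Density of Laplace noise with scale 1/epsilon (sensitivity-1 query)."""
    return (epsilon / 2.0) * math.exp(-epsilon * abs(z))


def laplace_noise(epsilon, rng):
    """The difference of two iid exponentials with rate epsilon
    is Laplace-distributed with scale 1/epsilon."""
    return rng.expovariate(epsilon) - rng.expovariate(epsilon)


def noisy_count(true_count, epsilon, rng):
    """Answer a count query with Laplace noise added."""
    return true_count + laplace_noise(epsilon, rng)


epsilon = 0.5  # illustrative privacy parameter

# Largest density ratio over a grid of points z and shifts |dz| <= 1.
worst_ratio = max(
    laplace_density(z, epsilon) / laplace_density(z + dz, epsilon)
    for z in [x / 10.0 for x in range(-50, 51)]
    for dz in (-1.0, -0.5, 0.5, 1.0)
)

print("worst density ratio:", worst_ratio, "bound e^eps:", math.exp(epsilon))
print("one noisy answer:", noisy_count(100, epsilon, random.Random(0)))
```

The ratio never exceeds e^ε because shifting z by at most 1 changes |z| by at most 1 in the exponent.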

Definitions of DP

•  Two flavors of DP
    •  Deleting or inserting a tuple: unbounded
    •  Changing a tuple value: bounded
•  Note that the existence of the tuple ≠ participation!

The no-free-lunch theorem

•  It is not possible to guarantee privacy and utility w/o making assumptions about the data-generating mechanism…
•  To discuss this problem:
    •  Define the discriminant ω as a lower bound on utility
    •  Analyze ω of the Laplace mechanism
    •  Define the non-privacy game
    •  Propose the no-free-lunch theorem
    •  Free lunch theorem for DP

Discriminant (as a utility measure)

•  ω: a measure of query accuracy. If ω ~ 1*, A answers with reasonable accuracy.
•  A: randomized query-answering processor
•  Integer k: like an anonymity parameter?
•  Constraint c: lower bound on the utility of A with parameter k.

* Note that the discriminant is 1 for a deterministic algorithm, e.g. a k-anonymity algorithm.

Discriminant

[Figure: e.g. k = 2. Databases D, D’, … and output sets S, S’ within the output range, with P(A(D) ∈ S) ≥ c and P(A(D’) ∈ S’) ≥ c.]

Example of discriminant

•  Cancer-patient DB
•  # of cancer patients in DB: D1: 0 / D2: 10,000 / D3: 20,000
•  S1 = [0, 1000], S2 = [9000, 11000], S3 = [19000, ∞]
•  P(A(Di) ∈ Si) ≥ 0.95 for all i
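The example can be checked numerically for one concrete choice of A. The sketch below assumes A is a Laplace mechanism with noise scale 100 whose answers are clamped at 0 (so that negative noisy counts fall into S1); the scale and the clamping are assumptions made for illustration, not details from the slides.

```python
import math


def laplace_cdf(x, scale):
    """CDF of zero-mean Laplace noise with the given scale."""
    if x < 0:
        return 0.5 * math.exp(x / scale)
    return 1.0 - 0.5 * math.exp(-x / scale)


def prob_answer_in(true_count, lo, hi, scale):
    """P(max(0, true_count + noise) in [lo, hi]) for Laplace noise."""
    upper = laplace_cdf(hi - true_count, scale)
    # Clamping maps every negative answer to 0, so an interval that
    # starts at 0 also captures all of the negative noise mass.
    lower = 0.0 if lo <= 0 else laplace_cdf(lo - true_count, scale)
    return upper - lower


scale = 100.0  # assumed noise scale (e.g. sensitivity 1, epsilon 0.01)
cases = [
    (0, 0, 1000),              # D1 with S1
    (10_000, 9_000, 11_000),   # D2 with S2
    (20_000, 19_000, math.inf) # D3 with S3
]

probs = [prob_answer_in(n, lo, hi, scale) for n, lo, hi in cases]
c = min(probs)  # every P(A(Di) in Si) is at least c

print("P(A(Di) in Si):", probs)
print("lower bound on the discriminant for k = 3:", c)
```

Under these assumptions each probability is far above 0.95, so this mechanism distinguishes the three databases almost perfectly.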

Discriminant of the Laplace mechanism

Intuitive description:
•  Laplace mechanism w/ sensitivity 0.5
•  Choose n large enough → we can choose {Di} and {Si} so that the distances between the Di's and the ranges of the Si's are large enough → the discriminant becomes 1

[Figure annotation: ∝ n]

Non-privacy game

•  Privacy definition as a game:
    •  Assume a data-generating mechanism P
    •  The attacker guesses the true answer q(D) to a sensitive query q from a randomized answer A(D)

[Figure: P generates D; given A(D), the attacker tries to guess q(D) for the sensitive query q]

No free lunch theorem

•  Providing both privacy (as a game) and utility is impossible if there is no restriction on the data-generating mechanism
•  If D is uniformly distributed over D1, …, Dk, the attacker's strategy is to guess q(Di) if A(D) ∈ Si
    •  W/o A(D), the attacker's guess is correct w/ probability 1/k.
    •  W/ A(D), he wins w/ probability ~1!
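A small simulation of the game under illustrative assumptions: three databases whose sensitive answers are 0, 10,000 and 20,000, chosen uniformly; A is a Laplace mechanism with scale 100; and the attacker uses the regions S1 = (−∞, 1000], S2 = [9000, 11000], S3 = [19000, ∞). These numbers and names are chosen here for the sketch and are not part of the slides.

```python
import math
import random


def laplace_noise(scale, rng):
    """Difference of two iid exponentials with mean `scale` is Laplace(scale)."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def play_game(true_answers, regions, scale, trials, rng):
    """Fraction of rounds in which the attacker recovers q(D) from A(D)."""
    wins = 0
    for _ in range(trials):
        i = rng.randrange(len(true_answers))          # D drawn uniformly
        answer = true_answers[i] + laplace_noise(scale, rng)
        # Attacker's strategy: guess q(D_j) for the region S_j containing A(D).
        guess = next(
            (j for j, (lo, hi) in enumerate(regions) if lo <= answer <= hi),
            None,
        )
        wins += guess == i
    return wins / trials


true_answers = [0, 10_000, 20_000]
regions = [(-math.inf, 1000), (9000, 11000), (19000, math.inf)]

baseline = 1.0 / len(true_answers)  # blind guess without seeing A(D)
win_rate = play_game(true_answers, regions, scale=100.0,
                     trials=2000, rng=random.Random(0))

print("win probability without A(D):", baseline)
print("empirical win rate with A(D):", win_rate)
```

The gap between the two printed numbers is the attacker's gain: high utility (large discriminant) translates directly into a winning strategy when the data are uniform over well-separated databases.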

No free lunch and differential privacy

•  Privacy definition w/o assumptions about the data: ε-free-lunch privacy
•  Note: the discriminant ω(k, A) of any algorithm A satisfying ε-free-lunch privacy is bounded by e^ε / (k − 1 + e^ε)
•  (my interpretation) Let P(A(D1) ∈ S) = c. There are at least k − 1 possible DB instances {Di} where c·e^(−ε) ≤ P(A(Di) ∈ S) ≤ c·e^ε. Using Σi P(A(Di) ∈ S) = 1, c ≤ e^ε / (k − 1 + e^ε)
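A bound of this kind can be derived in a few lines. This is a sketch under two assumptions that are not spelled out on the slide: ε-free-lunch privacy requires P(A(D) ∈ S) ≤ e^ε P(A(D’) ∈ S) for every pair of databases D, D’ (not only neighbors), and the discriminant is witnessed by disjoint sets S_1, …, S_k with P(A(D_i) ∈ S_i) ≥ c.

```latex
\begin{align*}
1 &\ge \sum_{i=1}^{k} \Pr[A(D_1) \in S_i]
   && \text{($S_1,\dots,S_k$ are disjoint)} \\
  &\ge \Pr[A(D_1) \in S_1] + \sum_{i=2}^{k} e^{-\varepsilon}\,\Pr[A(D_i) \in S_i]
   && \text{(free-lunch privacy applied to $D_1, D_i$)} \\
  &\ge c + (k-1)\,e^{-\varepsilon}\,c \\
\Longrightarrow\quad
c &\le \frac{1}{1 + (k-1)\,e^{-\varepsilon}}
   = \frac{e^{\varepsilon}}{k - 1 + e^{\varepsilon}}
\end{align*}
```

For small ε and large k the right-hand side is close to 1/k, i.e. the algorithm is barely better than a blind guess.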

Privacy risks in differential privacy

•  General guideline for determining a privacy definition
•  Note that DP for a more knowledgeable attacker adds less noise!

Consider three kinds of DP algorithm

Example

•  Consider the table with 1 tuple (Bob) and two 2-bit attributes R1 and R2:

[Figure: the 16 possible values of Bob's tuple (R1 R2 = 00 00, 00 01, …, 11 11), drawn once per definition with the neighborhood structure marked. Bounded DP (tuple): neighbors. Attribute DP: neighbors, neighbors’ neighbors. Bit DP: neighbors, neighbors’ neighbors, neighbors’ neighbors’ neighbors.]

Probability of answering the true record

Question: isn't this a bound…?
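The three neighbor definitions can be enumerated directly for Bob's 4-bit record. The sketch below also computes the probability of answering the true record for one illustrative mechanism, which outputs record r with probability proportional to exp(−ε·d(true, r)), where d is the number of neighbor steps under the chosen definition. This mechanism is an assumption made here for illustration; the slides do not state which algorithm is used.

```python
import math
from itertools import product

# All 16 possible records: first two bits are R1, last two bits are R2.
RECORDS = ["".join(bits) for bits in product("01", repeat=4)]


def tuple_distance(a, b):
    """Bounded DP (tuple): any other record is one step away."""
    return 0 if a == b else 1


def attribute_distance(a, b):
    """Attribute DP: one step changes the whole value of R1 or of R2."""
    return (a[:2] != b[:2]) + (a[2:] != b[2:])


def bit_distance(a, b):
    """Bit DP: one step flips a single bit."""
    return sum(x != y for x, y in zip(a, b))


def neighbors(record, distance):
    return [r for r in RECORDS if distance(record, r) == 1]


def prob_true_record(true_record, distance, epsilon):
    """Output probabilities are proportional to exp(-epsilon * distance);
    the true record has weight exp(0) = 1."""
    total = sum(math.exp(-epsilon * distance(true_record, r)) for r in RECORDS)
    return 1.0 / total


bob = "0000"
epsilon = 1.0  # illustrative

for name, dist in [("bounded (tuple)", tuple_distance),
                   ("attribute", attribute_distance),
                   ("bit", bit_distance)]:
    print(name,
          "neighbors:", len(neighbors(bob, dist)),   # 15, 6, 4
          "P(true record):", prob_true_record(bob, dist, epsilon))
```

With fewer neighbors (a more knowledgeable attacker in the definition), the same ε forces less randomization, so the true record is output with higher probability. This matches the earlier remark that DP for a more knowledgeable attacker adds less noise.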

[Figure: the same slide repeated, with the probability of answering the true record annotated “Higher … lower” across the three definitions]


Problem: correlated data

•  If several records are known to have the same attribute value, the sensitivity must be larger
    •  E.g. disease database: Bob and his family might have the same disease
•  How should we deal with this problem?
    •  Hide evidence of participation (any influence of a certain participation)
•  Discussion
    •  Growing SNSs
    •  Prior knowledge about exact statistics
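A minimal sketch of why correlation inflates sensitivity, assuming the extreme case in which every member of a family always has the same disease status. The family data, the names and ε = 0.5 are hypothetical.

```python
def count_sensitivity(family_of, person):
    """Worst-case change in COUNT(has_disease) when `person` changes status,
    assuming everyone in the same family changes together with them."""
    family = family_of[person]
    return sum(1 for f in family_of.values() if f == family)


def laplace_scale(sensitivity, epsilon):
    """Noise scale the Laplace mechanism needs for a given sensitivity."""
    return sensitivity / epsilon


# Hypothetical database: person -> family id.
family_of = {"bob": "B", "bob_wife": "B", "bob_son": "B", "alice": "A"}
epsilon = 0.5

bob_sensitivity = count_sensitivity(family_of, "bob")      # 3 correlated records
alice_sensitivity = count_sensitivity(family_of, "alice")  # 1, no correlation

print("scale needed to hide Bob:", laplace_scale(bob_sensitivity, epsilon))
print("scale needed to hide Alice:", laplace_scale(alice_sensitivity, epsilon))
```

Noise calibrated to sensitivity 1 hides a single tuple, but not the evidence that Bob's participation leaves in his family's tuples.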

Growing social networks

Assume edge-growing SNSs.
•  Let the network grow, after which the attacker will ask the query “how many edges are there between the two communities?”

Can we preserve the privacy of Bob’s external link?

→ Make assumptions about the data-generating model:
1.  Forest Fire model
2.  Copying model
3.  MVS model

From simulation:
•  1, 2 → we cannot set a noise parameter ε reliably unless we know the network parameters (model parameters or final edge number).
•  3 → has a steady-state distribution, rather favorable

[Figure: initial state of two clusters; only Bob has an external link. Node labels: Charlie, Bob.]

[Figure: MVS model]

Privacy breach after some exact data releases

•  Example: contingency tables (deterministic) and an additional differentially private data release
    •  A demonstration of an additional privacy breach (4.1)
    •  Consider a table T, attribute R w/ domain {r1,…,rk}
    •  k-1 queries:
    •  If we additionally knew the exact answer to “select count(*) from T where R=ri”, we would be able to exactly reconstruct the table. → the tuples are correlated!!
    •  Additional differentially private answers...

Privacy breach after some exact statistics release (2)

•  Consider a table T, attribute R w/ domain {r1,…,rk}
•  k-1 queries:
•  Additional k ε-differentially private answers...
•  If k is large (e.g. a d-bit vector w/ 2^d possible values), the variance is small (recall 2.2, knowledge vs. privacy risk) → T is reconstructed w/ very high probability … (due to correlation w/ the prior release of information)
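The attack can be sketched as follows. The k-1 exact queries are not preserved in this transcript; the sketch assumes they are the adjacent-pair counts s_i = count(R = r_i) + count(R = r_{i+1}), which is enough to make one exact count determine all the others. Under that assumption each of the k noisy counts yields an independent estimate of count(R = r_1), and averaging them shrinks the variance by a factor of k. All names and numbers below are illustrative.

```python
import random


def laplace_noise(scale, rng):
    """Difference of two iid exponentials with mean `scale` is Laplace(scale)."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def estimates_of_first_count(pair_sums, noisy_counts):
    """Turn each (noisy) count of r_i into an estimate of count(r_1),
    using the exact pair sums pair_sums[i] = n[i] + n[i + 1]."""
    estimates = []
    offset, sign = 0.0, 1.0  # invariant: n[0] = offset + sign * n[i]
    for i, noisy in enumerate(noisy_counts):
        estimates.append(offset + sign * noisy)
        if i < len(pair_sums):
            # n[i] = pair_sums[i] - n[i + 1], substituted into the invariant.
            offset, sign = offset + sign * pair_sums[i], -sign
    return estimates


rng = random.Random(0)
k, epsilon = 1024, 0.1                       # e.g. d = 10 bits, 2^d values
counts = [rng.randrange(0, 50) for _ in range(k)]
pair_sums = [counts[i] + counts[i + 1] for i in range(k - 1)]  # exact release
noisy = [n + laplace_noise(1.0 / epsilon, rng) for n in counts]  # DP release

ests = estimates_of_first_count(pair_sums, noisy)
averaged = sum(ests) / len(ests)

print("true count of r_1:", counts[0])
print("single noisy answer:", noisy[0])
print("average of k derived estimates:", averaged)
```

Each noisy answer alone has variance 2/ε², but the average of the k derived estimates has variance 2/(kε²), so for large k the first count, and through the pair sums the whole table, is recovered with high probability.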

A plausible deniability (idea)

•  What should we do to maintain consistency w/ deterministic query answers that have been released previously?
    •  We should choose bounded DP
    •  If the number of tuples has been answered previously, the number of tuples should stay the same
•  In general, we can maintain consistency in several ways…
    •  exchange attribute values collaboratively, for example

Differential privacy subject to background knowledge

•  Definitions

contingency table (each entry is a cell; its value is the cell count):

        R    L    Total
M       43   9    52
F       44   4    48
Total   87   13   100

table T:

id      gender   handedness
taro    male     left
hana    female   right
…       …        …

move (contingency table after a move):

        R    L    Total
M       42   9    52
F       44   4    48
Total   87   13   100

Differential privacy subject to background knowledge

•  Define DP for neighboring tables

Neighbors induced by other prior statistics

•  Example: the exact query answer for “select gender, count(*) from T group by gender”
    •  × unbounded DP: the number of tuples is already published
    •  × bounded DP: we cannot arbitrarily modify a single tuple
    •  Define neighbors that maintain consistency with the prior query answers:
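A sketch of the idea of a move and of consistency with previously released statistics, using the gender/handedness example. It treats a move as taking one tuple out of one cell and putting it into another, and assumes the prior release is the per-gender count (M = 52, F = 48). The helper names are hypothetical and the full neighbor definition of the paper is not reproduced here.

```python
def apply_move(cells, src, dst):
    """A move: take one tuple out of cell `src` and put it into cell `dst`."""
    if cells.get(src, 0) <= 0:
        raise ValueError("source cell is empty")
    moved = dict(cells)
    moved[src] -= 1
    moved[dst] = moved.get(dst, 0) + 1
    return moved


def gender_counts(cells):
    """Answer to: select gender, count(*) from T group by gender."""
    totals = {}
    for (gender, _handedness), n in cells.items():
        totals[gender] = totals.get(gender, 0) + n
    return totals


def consistent_with_prior(cells, prior_gender_counts):
    """True if the table agrees with the exact answers already released."""
    return gender_counts(cells) == prior_gender_counts


# Contingency table from the slide: (gender, handedness) -> cell count.
start = {("M", "R"): 43, ("M", "L"): 9, ("F", "R"): 44, ("F", "L"): 4}
prior = {"M": 52, "F": 48}  # previously released exact statistics

# Changing one man's handedness keeps the released gender counts intact.
same_gender = apply_move(start, ("M", "R"), ("M", "L"))
# Changing one tuple's gender contradicts the released counts.
cross_gender = apply_move(start, ("M", "R"), ("F", "R"))

print("M,R -> M,L consistent:", consistent_with_prior(same_gender, prior))
print("M,R -> F,R consistent:", consistent_with_prior(cross_gender, prior))
```

A single cross-gender move is not a valid neighbor on its own; it would need a compensating move in the opposite direction, which is why neighbors under constraints can differ in more than one tuple.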

Neighbor-based algorithm for DP

•  Definitions:
    •  distance function between two contingency tables:
    •  To achieve 2ε-generic DP, the exponential mechanism [McSherry06] can be used. (Δq = d(Ta, Tb))

Neighbor-based algorithm for DP

•  Laplace mechanism:
    •  Sensitivity
    •  The Laplace mechanism adds noise (with probability density function …) to the query answer.

NP-hard problem

•  Dealing with neighbors under constraints is an NP-hard problem

→ The general problem of finding an upper bound on the sensitivity of a query is at least co-NP-hard, and we suspect that the problem is Π₂^p-complete.

The case where efficient algorithms exist…

•  Consider a 2-d table
•  Let the query q_all be: “SELECT R1, R2, COUNT(*) FROM T GROUP BY R1, R2”.
•  The sensitivity of q_all can be computed using the following lemma:
•  By removing a subset of paths that form Hamiltonian cycles, it is shown that the original set of moves was the smallest set of moves.

Related works

•  Impossibility result [Dwork06, Dinur & Nissim03, etc.]
    •  Answering many queries w/ bounded noise does not preserve privacy
•  SNS privacy
    •  Relationship privacy [Rastogi09]
        •  Adversarial privacy: an algorithm is private if the posterior distribution P(t|O) is close to the prior P(t) (weaker than indistinguishability).
        •  Assumptions about the data (SNSs?)
        •  Small perturbation, higher utility than the existing Laplace mechanism
•  Resistance to various attackers [Kasiviswanathan08]

Summary

1.  They proposed a no-free-lunch theorem (as a simplified version of the impossibility result) based on a privacy game.
2.  They proposed a guideline for applying DP (we must consider not only the data-generating mechanism but also previously released data).
3.  They showed examples where DP does not meet the guideline:
    1.  when applied to arbitrary SNS data
    2.  when applied to tabular data when an attacker has aggregate-level background knowledge
4.  They proposed a modification of DP for tabular data with aggregate-level background knowledge (contingency tables). The knowledge can be described as a constraint on the existence of neighbors.
