20130219 nofreelunch arai

PPDM Study Group, 3rd meeting: “No Free Lunch in Data Privacy”
(Daniel Kifer, Ashwin Machanavajjhala, SIGMOD 2011)
Hiromi Arai, RIKEN
Contributions of this paper
1. Simplify the impossibility result*
2. Present a privacy definition that relies on assumptions about the data-generating mechanism and compare it with DP
3. Propose a guideline for determining whether DP is suitable for a given application
4. Demonstrate cases in which DP does not meet the guideline:
   1. when applied to arbitrary SNS data
   2. when applied to tabular data when an attacker has aggregate-level background knowledge
5. Propose a modification of DP for tabular data with aggregate-level background knowledge
* Briefly: “answering many queries w/ bounded noise does not preserve privacy”
Outline
} Brief review of differential privacy (DP)
} Analysis of the attacker
  } Define the discriminant and the non-privacy game; state the no-free-lunch theorem
  } Show the relation between DP and the no-free-lunch theorem
} Privacy risks of DP algorithms for various attackers
  } Unsuitable cases: naïve application to correlated data
} New DP definition subject to background knowledge
  } Introduce restrictions that represent previously released exact query answers
Differential privacy: problem setting
} Query-answering mechanism
} The DB wants to preserve the privacy of individuals
} Query answers: statistical information etc.

Collection of private data
customer  age  Item A  Item B  …
A         25   0       1       …
B         42   1       1       …
…

query → answer
customer  Type X  Type Y  Type Z
20s       324     1       52
30s       34      34      13
…
Differential privacy: motivation
} A privacy guarantee that limits the risk incurred by JOINING encourages participation in the dataset.
} Minimize the increased risk to an individual incurred by joining (or leaving) the database. (NOT comparing an adversary's prior and posterior views)
[Figure: the answers w/ Yuko and w/o Yuko are not so different]
Recall: Dalenius’s problem
} If the statistical database teaches us anything at all, then it should change our beliefs about individuals.
} The things that statistical databases are designed to teach can, sometimes indirectly, cause damage to an individual, even if this individual is not in the database.
Differential privacy: definition [Dwork06]
} K: randomized query-response algorithm
} For neighboring databases D and D’ and every S ⊆ Range(K), the probabilities P(K(D) ∈ S) and P(K(D’) ∈ S) are almost the same: P(K(D) ∈ S) ≤ e^ε · P(K(D’) ∈ S)
} The definition of neighboring DBs is very important in this paper
Differential privacy: mechanism
• Density function of the Laplace distribution (for a query of sensitivity 1): p(z) ∝ exp(−ε|z|)
• For any z, z’ such that |z − z’| ≤ 1, the density at z is at most e^ε times the density at z’, satisfying the condition in
[Dwork06]
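The density-ratio property above can be checked numerically. The following is a minimal sketch, assuming a counting query of sensitivity 1 and Laplace noise of scale 1/ε; the function names are illustrative and do not come from the slides or the paper.

```python
import math
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_density(z, epsilon):
    """Density of Laplace noise with scale 1/epsilon (sensitivity 1 assumed)."""
    return 0.5 * epsilon * math.exp(-epsilon * abs(z))


def noisy_count(true_count, epsilon, rng):
    """Answer a counting query by adding Laplace noise of scale 1/epsilon."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


def max_density_ratio(epsilon, points):
    """Largest ratio p(z) / p(z') over all pairs of points with |z - z'| <= 1."""
    worst = 0.0
    for z in points:
        for z2 in points:
            if abs(z - z2) <= 1.0:
                ratio = laplace_density(z, epsilon) / laplace_density(z2, epsilon)
                worst = max(worst, ratio)
    return worst
```

On any grid of points, `max_density_ratio(epsilon, points)` should not exceed `math.exp(epsilon)`, which is the condition the slide refers to.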
Definitions of DP
} Two flavors of DP
  } Deleting or inserting a tuple: unbounded
  } Changing a tuple value: bounded
• Note that the existence of the tuple ≠ participation!
The no-free-lunch theorem
} It is not possible to guarantee privacy and utility w/o making assumptions about the data-generating mechanism…
} To discuss this problem:
  } Define the discriminant ω as a lower bound on utility
  } Analyze ω of the Laplace mechanism
  } Define the non-privacy game
  } Propose the no-free-lunch theorem
  } Free-lunch theorem for DP
Discriminant (as a utility measure)
} ω: a measure of query accuracy. If ω ≈ 1*, A answers with reasonable accuracy.
} A: randomized query-answering processor
} Integer k: like an anonymity parameter?
} Constraint c: lower bound on the utility of A with parameter k.
* Note that the discriminant is 1 for a deterministic algorithm, e.g. a k-anonymity algorithm.
Discriminant
[Figure: e.g. k = 2. Databases D and D’ are mapped into sets S and S’ of the output range, with P(A(D) ∈ S) ≥ c and P(A(D’) ∈ S’) ≥ c]
Example of discriminant
} Cancer-patient DB
} # of cancer patients in DB: D1: 0 / D2: 10,000 / D3: 20,000
} S1 = [0, 1000], S2 = [9000, 11000], S3 = [19000, ∞)
} P(A(Di) ∈ Si) ≥ 0.95 for all i
Discriminant of the Laplace mechanism
Intuitive description:
} Laplace mechanism w/ sensitivity 0.5
} Choose n large enough → we can choose {Di} and {Si} so that the distances between the Di’s (∝ n) and the ranges of the Si’s are large enough → the discriminant approaches 1
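The intuition can be made concrete with a small calculation. This is a sketch under stated assumptions: the true answers of the chosen databases are spaced `gap` apart, each S_i is the open interval of half-width gap/2 centred on the true answer of D_i, and the noise is Laplace with scale sensitivity/ε. The names `gap`, `prob_within` and `discriminant_lower_bound` are hypothetical.

```python
import math


def prob_within(half_width, scale):
    """P(|X| < half_width) for X ~ Laplace(0, scale)."""
    return 1.0 - math.exp(-half_width / scale)


def discriminant_lower_bound(gap, epsilon, sensitivity=1.0):
    """Lower bound on the discriminant of the Laplace mechanism.

    The intervals S_i are disjoint, so P(A(D_i) in S_i) >= c holds for
    every i with c equal to the value returned here, for any number k of
    databases whose true answers are `gap` apart.
    """
    scale = sensitivity / epsilon
    return prob_within(gap / 2.0, scale)
```

As `gap` grows with the database size, the returned value tends to 1 for any fixed ε, which is the behaviour the slide describes.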
Non-privacy game
} Privacy definition as a game:
  } Assume a data-generating mechanism P
  } The attacker guesses the true answer q(D) to a sensitive query q from a randomized answer A(D)
[Figure: P generates D; given the query q and the answer A(D), the attacker guesses q(D)]
No free lunch theorem
} Providing both privacy (as a game) and utility is impossible if there is no restriction on the data-generating mechanism
} If D is uniformly distributed over {D1, …, Dk}, the attacker’s strategy is to guess q(Di) if A(D) ∈ Si
} W/o A(D), the attacker’s guess is correct w/ probability 1/k.
} W/ A(D), he wins w/ probability ~1!
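The game can be simulated directly. The sketch below assumes k databases whose true answers q(D_i) are spaced `gap` apart, a Laplace mechanism with scale 1/ε as A, and an attacker who guesses the q(D_i) nearest to the noisy answer (which corresponds to intervals S_i centred on the true answers). All names and parameter values are illustrative.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def play_game(k, gap, epsilon, trials, seed=0):
    """Return (win rate using A(D), win rate of a blind guess)."""
    rng = random.Random(seed)
    answers = [i * gap for i in range(k)]      # q(D_i), all distinct
    wins_with = 0
    wins_without = 0
    for _ in range(trials):
        i = rng.randrange(k)                   # D uniform over {D_1..D_k}
        noisy = answers[i] + laplace_noise(1.0 / epsilon, rng)
        # Guess the index whose true answer is closest to A(D).
        guess = min(range(k), key=lambda j: abs(answers[j] - noisy))
        wins_with += int(guess == i)
        wins_without += int(rng.randrange(k) == i)
    return wins_with / trials, wins_without / trials
```

With a large `gap` relative to 1/ε, the first rate is close to 1 while the second stays near 1/k.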
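No free lunch and differential privacy
} Privacy definition w/o assumptions about the data (ε-free-lunch privacy): P(A(D) ∈ S) ≤ e^ε · P(A(D’) ∈ S) for every pair of databases D, D’ and every S
} Note: the discriminant ω(k; A) of any algorithm A satisfying ε-free-lunch privacy is bounded by e^ε / (k − 1 + e^ε)
} (my interpretation) Let P(A(D1) ∈ S) = c. There are at least k−1 other possible DB instances {Di} where c·e^(−ε) ≤ P(A(Di) ∈ S) ≤ c·e^ε. Using Σi P(A(Di) ∈ S) = 1, c ≤ e^ε / (k − 1 + e^ε)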
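One way to spell out the bound on the discriminant is sketched below. It assumes that ε-free-lunch privacy means P(A(D) ∈ S) ≤ e^ε P(A(D') ∈ S) for every pair of databases (not only neighbors), and that D_1, …, D_k and the disjoint sets S_1, …, S_k satisfy P(A(D_i) ∈ S_i) ≥ c for all i; this is a reading of the definitions, not a transcription of the paper's proof.

```latex
\begin{align*}
1 \;\ge\; \sum_{j=1}^{k} P(A(D_1) \in S_j)
  &= P(A(D_1) \in S_1) + \sum_{j=2}^{k} P(A(D_1) \in S_j) \\
  &\ge c + \sum_{j=2}^{k} e^{-\epsilon}\, P(A(D_j) \in S_j) \\
  &\ge c + (k-1)\, e^{-\epsilon}\, c ,
\end{align*}
so
\[
  c \;\le\; \frac{1}{1 + (k-1)\,e^{-\epsilon}}
    \;=\; \frac{e^{\epsilon}}{k - 1 + e^{\epsilon}} .
\]
```

For fixed ε the right-hand side goes to 0 as k grows, so such an algorithm cannot have a discriminant close to 1.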
Privacy risks in differential privacy
} General guideline for determining a privacy definition
} Note that a DP definition designed for a more knowledgeable attacker adds less noise!
Consider three kinds of DP algorithms
Example
} Consider a table with 1 tuple (Bob) and two 2-bit attributes R1 and R2.
[Figure: the 16 possible values of (R1, R2), from 00 00 to 11 11, drawn once each for Bounded DP (tuple), Attribute DP, and Bit DP. Under Bounded DP the other values are marked as neighbors; under Attribute DP they are marked as neighbors or neighbors’ neighbors; under Bit DP they are marked as neighbors, neighbors’ neighbors, or neighbors’ neighbors’ neighbors. The panels are compared by the probability of answering the true record (labeled “higher” and “lower”).]
Question: isn’t this a bound…?
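The three neighbor relations in this example can be enumerated mechanically. The sketch below assumes that bounded DP treats any change of the tuple as a neighbor, attribute DP a change of exactly one attribute, and bit DP a change of exactly one bit; the flavor names and helper functions are illustrative.

```python
from itertools import product

# All 16 values of Bob's record: two 2-bit attributes (R1, R2).
RECORDS = ["".join(bits) for bits in product("01", repeat=4)]


def attributes(rec):
    """Split a 4-bit record into its attributes (R1, R2)."""
    return rec[:2], rec[2:]


def is_neighbor(a, b, flavor):
    """Neighbor test under the assumed reading of each DP flavor."""
    if a == b:
        return False
    if flavor == "bounded":      # any change to the tuple
        return True
    if flavor == "attribute":    # exactly one attribute differs
        return sum(x != y for x, y in zip(attributes(a), attributes(b))) == 1
    if flavor == "bit":          # exactly one bit differs
        return sum(x != y for x, y in zip(a, b)) == 1
    raise ValueError(flavor)


def hops(start, flavor):
    """Breadth-first number of neighbor steps from `start` to every record."""
    dist = {start: 0}
    frontier = [start]
    while frontier:
        nxt = []
        for a in frontier:
            for b in RECORDS:
                if b not in dist and is_neighbor(a, b, flavor):
                    dist[b] = dist[a] + 1
                    nxt.append(b)
        frontier = nxt
    return dist
```

Starting from `"0000"`, the record has 15 direct neighbors under bounded DP, 6 under attribute DP and 4 under bit DP, and the farthest record is 1, 2 and 4 steps away respectively.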
Problem: correlated data
} If several records are known to have the same attribute value, the sensitivity must be larger
  } E.g. disease database: Bob and his family might have the same disease
} How should we deal with this problem?
  } Hide evidence of participation (any influence of a certain participation)
} Discussion
  } Growing SNSs
  } Prior knowledge about exact statistics
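The effect of correlation on sensitivity can be illustrated with a toy disease database. This is a sketch under a strong assumption: all records of a household change together, so hiding one person's value means hiding the whole household's. The data, field names and functions are hypothetical.

```python
def disease_count(db):
    """Counting query: number of rows with the disease."""
    return sum(1 for row in db if row["disease"])


def flip_household(db, household):
    """Copy of db with the disease bit flipped for every row of one household."""
    return [
        dict(row, disease=not row["disease"]) if row["household"] == household else dict(row)
        for row in db
    ]


def correlated_sensitivity(db):
    """Largest change in the count when one household changes together."""
    base = disease_count(db)
    households = {row["household"] for row in db}
    return max(abs(disease_count(flip_household(db, h)) - base) for h in households)


def laplace_scale(sensitivity, epsilon):
    """Scale of Laplace noise for a query with the given sensitivity."""
    return sensitivity / epsilon


# Hypothetical data: Bob's household of four shares the disease value.
EXAMPLE_DB = (
    [{"household": "bob", "disease": True} for _ in range(4)]
    + [{"household": "alice", "disease": False}, {"household": "carol", "disease": True}]
)
```

Here the count can move by 4 rather than 1, so the Laplace scale for the same ε is four times larger than for independent records.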
Growing social networks
Assume edge-growing SNSs.
} Let the network grow, after which the attacker will ask the query “how many edges are there between the two communities?”
} Can we preserve the privacy of Bob’s external link?
→ Make assumptions about the data-generating model:
1. Forest Fire model
2. Copying model
3. MVS model
From simulation:
} 1, 2 → we cannot set the noise parameter ε reliably unless we know the network parameters (model parameters or final number of edges).
} 3 → has a steady-state distribution, rather favorable
[Figure: initial state of two clusters, with nodes labeled Bob and Charlie; only Bob has an external link; MVS model]
Privacy breach after some exact data releases
} Example: contingency tables (deterministic) and an additional differentially private data release
} A demonstration of an additional privacy breach (4.1)
  } Consider a table T with an attribute R w/ domain {r1, …, rk}
  } k−1 queries have already been answered exactly
  } If we additionally knew the exact answer to “select count(*) from T where R=ri”, we would be able to reconstruct the table exactly. → the tuples are correlated!!
  } Additional differentially private answers...
Privacy breach after some exact statistics release (2)
} Consider a table T with an attribute R w/ domain {r1, …, rk}
} k−1 queries have already been answered exactly
} k additional ε-differentially private answers...
} If k is large (e.g. a d-bit vector w/ 2^d possible values), the variance of the combined estimate is small (recall 2.2, knowledge vs. privacy risk) → T is reconstructed w/ very high probability … (due to correlation w/ the prior release of information)
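The reconstruction argument can be tried out numerically. The sketch below rests on assumptions that the slide does not preserve: the k−1 exact queries are taken to be sums of adjacent counts, s_i = n_i + n_(i+1), and each of the k private answers is a count n_i plus Laplace noise of scale 1/ε. All names and parameter values are illustrative.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def reconstruct(exact_pair_sums, noisy_counts):
    """Recover all counts from k-1 exact pair sums and k noisy counts.

    Writes n_i = sign_i * n_1 + c_i, where sign_i and c_i follow from the
    exact answers. Each noisy count then gives an independent estimate of
    n_1; the estimates are averaged and rounded.
    """
    k = len(noisy_counts)
    signs, offsets = [1], [0]
    for s in exact_pair_sums:
        signs.append(-signs[-1])
        offsets.append(s - offsets[-1])
    estimates = [signs[i] * (noisy_counts[i] - offsets[i]) for i in range(k)]
    n1 = round(sum(estimates) / k)
    return [signs[i] * n1 + offsets[i] for i in range(k)]


def demo(k=1024, epsilon=1.0, seed=0):
    """Generate hypothetical counts, release them, and try to reconstruct."""
    rng = random.Random(seed)
    true_counts = [rng.randrange(0, 50) for _ in range(k)]
    exact = [true_counts[i] + true_counts[i + 1] for i in range(k - 1)]
    noisy = [n + laplace_noise(1.0 / epsilon, rng) for n in true_counts]
    return true_counts, reconstruct(exact, noisy)
```

Averaging k independent estimates divides the noise variance by k, so for large k the rounded estimate of n_1 is almost always exact and every other count follows from the exact answers.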
Plausible deniability (idea)
} What should we do to maintain consistency w/ deterministic query answers that have been released previously?
} We should choose bounded DP
  } If the number of tuples has been answered previously, the number of tuples should stay the same
} In general, we can maintain consistency in several ways…
  } exchange attribute values collaboratively, for example
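Differential privacy subject to background knowledge
} Definitions: table T, contingency table, cell, cell count, move

table T
id    gender  handedness
taro  male    left
hana  female  right
…     …       …

contingency table (each entry is a cell; its value is the cell count)
       R    L    total
M      43   9    52
F      44   4    48
total  87   13   100

contingency table after a move (as shown on the slide)
       R    L    total
M      42   9    52
F      44   4    48
total  87   13   100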
Differential privacy subject to background knowledge
} Define DP for neighboring tables
Neighbors induced by other prior statistics
} Example: exact query answer for “select gender, count(*) from T group by gender”
  } × unbounded DP: the number of tuples is already published
  } × bounded DP: we cannot arbitrarily modify a single tuple
  } Define neighbors that maintain consistency with the prior query answers
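The idea of neighbors that stay consistent with a released answer can be sketched for the gender/handedness example. It is assumed here that a move takes one tuple out of a cell and puts it into another cell, and that the released statistic is the per-gender count; the function names and the cell encoding are illustrative.

```python
from collections import Counter


def gender_counts(cells):
    """Per-gender totals of a contingency table keyed by (gender, hand)."""
    out = Counter()
    for (gender, _hand), n in cells.items():
        out[gender] += n
    return out


def consistent_neighbors(cells, genders=("M", "F"), hands=("R", "L")):
    """Contingency tables one move away that keep the gender counts fixed."""
    all_cells = [(g, h) for g in genders for h in hands]
    target = gender_counts(cells)
    result = []
    for src in all_cells:
        if cells[src] == 0:
            continue                      # nothing to move out of this cell
        for dst in all_cells:
            if dst == src:
                continue
            moved = Counter(cells)
            moved[src] -= 1
            moved[dst] += 1
            after = gender_counts(moved)
            if all(after[g] == target[g] for g in genders):
                result.append(moved)
    return result
```

For the table with cells M/R = 43, M/L = 9, F/R = 44, F/L = 4, only the four moves that change handedness within a gender survive the consistency check.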
Neighbor-based algorithm for DP
} Definitions:
  } Distance function d(Ta, Tb) between two contingency tables
  } To achieve 2ε-generic DP, the exponential mechanism [McSherry06] can be used (Δq = d(Ta, Tb))
Neighbor-based algorithm for DP
} Laplace mechanism:
  } Sensitivity: the largest change in the query answer between neighboring tables
  } The Laplace mechanism adds Laplace-distributed noise, scaled to the sensitivity, to the query answer.
NP-hard problem
} Dealing with neighbors under constraints is an NP-hard problem
→ The general problem of finding an upper bound on the sensitivity of a query is at least co-NP-hard, and we suspect that the problem is Π^p_2-complete.
The case where efficient algorithms exist…
} Consider a 2-dimensional table
} Let the query q_all be: “SELECT R1, R2, COUNT(*) FROM T GROUP BY R1, R2”.
} The sensitivity of q_all can be computed using a lemma from the paper.
} By removing subsets of paths that form Hamiltonian cycles, it is shown that the original set of moves was the smallest set of moves.
Related work
} Impossibility result [Dwork06, Dinur & Nissim03, etc.]
  } Answering many queries w/ bounded noise does not preserve privacy
} SNS privacy
  } Relationship privacy [Rastogi09]
    } Adversarial privacy: an algorithm is private if the posterior distribution P(t|O) is close to the prior P(t) (weaker than indistinguishability).
    } Assumptions about the data (SNSs?)
    } Small perturbation, higher utility than the existing Laplace mechanism
} Resistance to various attackers [Kasiviswanathan08]
Summary
1. They proposed the no-free-lunch theorem (a simplified version of the impossibility result) based on a privacy game.
2. They proposed a guideline for applying DP (we must consider not only the data-generating mechanism but also previously released data).
3. They showed examples:
   1. when applied to arbitrary SNS data
   2. when applied to tabular data when an attacker has aggregate-level background knowledge
4. They proposed a modification of DP for tabular data with aggregate-level background knowledge (contingency tables). The knowledge can be described as a constraint on the existence of neighbors.