Foundations of Privacy Lecture 5
Transcript of Foundations of Privacy, Lecture 5
Lecturer: Moni Naor
Desirable Properties of a Sanitization Mechanism
• Composability: applying the sanitization several times yields a graceful degradation.
– Will see: t releases, each ε-DP, are tε-DP.
– Next class: roughly (ε√t + tε², δ)-DP.
• Robustness to side information: no need to specify exactly what the adversary knows.
– May assume the adversary knows everything except one row.
Differential Privacy: satisfies both…
Differential Privacy
Protect individual participants. [Dwork, McSherry, Nissim & Smith 2006]
(Figure: a curator/sanitizer M is applied to two adjacent databases D1 and D2.)
Adjacency: D+I and D−I (the database with and without individual I).
Differential Privacy
Protect individual participants: the probability of every bad event (indeed, of any event) increases only by a small multiplicative factor when I enter the DB. May as well participate in the DB…
ε-differentially private sanitizer M: for all DBs D, all individuals I, and all events T,
e^(−ε) ≤ Pr[M(D+I) ∈ T] / Pr[M(D−I) ∈ T] ≤ e^ε ≈ 1+ε
Handles auxiliary input.
Differential Privacy
(Figure: the two output distributions Pr[response]; on bad responses the ratio between them stays bounded.)
Sanitizer M gives ε-differential privacy if for all adjacent D1 and D2 (differing in one user) and all A ⊆ range(M):
Pr[M(D1) ∈ A] ≤ e^ε · Pr[M(D2) ∈ A]
Participation in the data set poses no additional risk.
Example of Differential Privacy
X is a set of (name, tag ∈ {0,1}) tuples. One query: # of participants with tag = 1.
Sanitizer: output the # of 1's + noise
• noise from the Laplace distribution with parameter 1/ε
• Pr[noise = k−1] ≈ e^ε · Pr[noise = k]
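As an illustration (a minimal sketch, not from the lecture; all names and parameters are mine), the counting-query sanitizer can be written in a few lines of Python. Lap(b) is sampled as the difference of two exponentials with mean b:

```python
import random

def laplace_noise(scale: float) -> float:
    """Sample Lap(scale): the difference of two exponentials with mean `scale`."""
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def sanitize_count(db, eps: float) -> float:
    """Release the number of tag=1 participants plus Lap(1/eps) noise."""
    return sum(tag for _, tag in db) + laplace_noise(1 / eps)

random.seed(0)
eps = 0.5
db = [("alice", 0), ("bob", 1), ("carol", 1)]  # true count = 2

one_release = sanitize_count(db, eps)          # a single private release
avg = sum(sanitize_count(db, eps) for _ in range(20000)) / 20000
print(one_release, avg)                        # avg concentrates near the true count 2
```

Since the noise has mean 0, averaging many independent releases recovers the true count; a single release is accurate to about 1/ε.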
(ε, δ)-Differential Privacy
(Figure: the two output distributions Pr[response]; the ratio is bounded except on bad responses of total probability δ. This course: δ negligible.)
Sanitizer M gives (ε, δ)-differential privacy if for all adjacent D1 and D2 and all A ⊆ range(M):
Pr[M(D1) ∈ A] ≤ e^ε · Pr[M(D2) ∈ A] + δ
Typical setting: ε small and δ negligible.
Example: NO Differential Privacy
U is a set of (name, tag ∈ {0,1}) tuples. One counting query: # of participants with tag = 1.
Sanitizer A: choose and release a few random tags.
Bad event T: only my tag is 1, and my tag is released.
Pr[A(D+Me) ∈ T] ≥ 1/n, while Pr[A(D−Me) ∈ T] = 0, so no factor e^ε can bound the ratio Pr[A(D+Me) ∈ T] / Pr[A(D−Me) ∈ T].
• Not ε-differentially private for any ε!
• It is (0, 1/n)-differentially private.
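A quick simulation makes the failure concrete (a sketch; the instance and names are mine). The sanitizer that releases one random tag puts probability about 1/n on the bad event when I am in the database and probability exactly 0 when I am not, so no multiplicative bound can hold:

```python
import random

def release_random_tag(db):
    """The non-private sanitizer A: release one uniformly random tag."""
    return random.choice([tag for _, tag in db])

random.seed(0)
n = 10
db_with_me = [("user%d" % i, 0) for i in range(n - 1)] + [("me", 1)]
db_without_me = [("user%d" % i, 0) for i in range(n - 1)]

trials = 100_000
# Bad event T: a 1-tag is released (only my tag is 1).
hits_with = sum(release_random_tag(db_with_me) == 1 for _ in range(trials))
hits_without = sum(release_random_tag(db_without_me) == 1 for _ in range(trials))
print(hits_with / trials, hits_without / trials)  # about 1/n, versus exactly 0
```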
Counting Queries
Q is a set of predicates q: U → {0,1}. Query: how many of the participants satisfy q?
Relaxed accuracy: answer each query within α additive error w.h.p.
Not so bad: some error is anyway inherent in statistical analysis.
The database x is of size n: n individuals, each contributing a single point in U.
Sometimes we talk about the fraction of participants instead of the count.
Bounds on Achievable Privacy
Want to get bounds on the:
• Accuracy: the responses from the mechanism to all queries are within α, except with small probability.
• Number of queries t for which we can receive accurate answers.
• Privacy parameter ε for which ε-differential privacy is achievable
– or (ε, δ)-differential privacy is achievable.
Blatant Non-Privacy
Mechanism M is blatantly non-private if there is an adversary A that, on any database D of size n, can select queries and use the responses M(D) to reconstruct D' such that ||D − D'||_1 ∈ o(n), i.e., D' agrees with D in all but o(n) of the entries.
Claim: blatant non-privacy implies that M is not (ε, δ)-DP for any constant ε (with δ = o(1)).
Sanitization Can't Be Too Accurate
Usual counting queries:
– Query: q ⊆ [n]
– Response = Σ_{i∈q} d_i + noise
Blatant non-privacy: the adversary guesses 99% of the bits.
Theorem: if all responses are within o(n) of the true answer, then the algorithm is blatantly non-private.
But: this requires an exponential # of queries.
Proof: Exponential Adversary
• Focus on the column containing the super-private bit: "the database" is a vector d ∈ {0,1}^n.
• Assume all answers are within error bound α.
Will show that α cannot be o(n).
Proof: Exponential Adversary for Blatant Non-Privacy
• Estimate the # of 1's in all possible sets:
– ∀ S ⊆ [n]: |M(S) − Σ_{i∈S} d_i| ≤ α, where M(S) is the answer on S.
• Weed out "distant" DBs. For each candidate database c ∈ {0,1}^n:
– If for any S ⊆ [n]: |Σ_{i∈S} c_i − M(S)| > α, then rule out c.
– If c is not ruled out, halt and output c.
Claim: the real database d won't be ruled out.
Proof: Exponential Adversary
• Assume ∀ S ⊆ [n]: |M(S) − Σ_{i∈S} d_i| ≤ α.
Claim: for any c that has not been ruled out, the Hamming distance between c and d is ≤ 4α.
Let S0 = {i : d_i = 0} and S1 = {i : d_i = 1}.
On S0: |M(S0) − Σ_{i∈S0} c_i| ≤ α (c not ruled out) and |M(S0) − Σ_{i∈S0} d_i| ≤ α, so |Σ_{i∈S0} (c_i − d_i)| ≤ 2α; since d_i = 0 on S0, c disagrees with d on at most 2α positions of S0.
On S1: |M(S1) − Σ_{i∈S1} c_i| ≤ α (c not ruled out), and similarly c disagrees with d on at most 2α positions of S1.
Hence Hamming distance(c, d) ≤ 2α + 2α = 4α.
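The exponential adversary can be run verbatim on a toy instance (illustrative sketch; the instance and noise model are my choices). With n = 8 and error bound α = 1, brute force over all 2^n candidate databases confirms that d survives and every survivor is within Hamming distance 4α of d:

```python
import itertools
import random

random.seed(0)
n, alpha = 8, 1
d = [random.randint(0, 1) for _ in range(n)]

# All subsets S of [n], as 0/1 indicator vectors.
subsets = list(itertools.product([0, 1], repeat=n))

# Mechanism M: the count over S, perturbed by at most alpha.
answers = {S: sum(di * si for di, si in zip(d, S)) + random.randint(-alpha, alpha)
           for S in subsets}

def ruled_out(c):
    """Rule out c if some subset-sum disagrees with M's answer by more than alpha."""
    return any(abs(sum(ci * si for ci, si in zip(c, S)) - answers[S]) > alpha
               for S in subsets)

survivors = [c for c in subsets if not ruled_out(c)]
max_dist = max(sum(ci != di for ci, di in zip(c, d)) for c in survivors)
print(len(survivors), max_dist)  # d always survives; max_dist <= 4 * alpha
```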
Impossibility with Exponential Queries
The result means that we cannot sanitize the data and publish a data structure so that for all queries the answer can be deduced correctly to within o(n).
(Figure: queries sent to a sanitized database; answers returned.)
On the other hand: we will see that we can get accuracy up to roughly log |Q|.
What Can We Do Efficiently?
We allowed the adversary "too" much power:
• Number of queries: exponential
• Computation: exponential
• On the other hand: no wild errors in the responses were assumed.
Theorem: for any sanitization algorithm, if all responses are within o(√n) of the true answer, then it is blatantly non-private, even against a polynomial-time adversary making O(n log² n) random queries.
The Model
• As before: the database d is a bit string of length n.
• Counting queries:
– A query is a subset q ⊆ {1, …, n}.
– The (exact) answer is a_q = Σ_{i∈q} d_i.
• α-perturbation: for each answer, return a value in the range a_q ± α.
What If We Had Exact Answers?
• Consider a mechanism with 0-perturbation: we receive the exact answer a_q = Σ_{i∈q} d_i.
• Then with n linearly independent queries (over the reals) we could reconstruct d precisely: obtain n linear equations a_q = Σ_{i∈q} c_i and solve uniquely.
• With α-perturbations we instead get, for each query, an inequality: a_q − α ≤ Σ_{i∈q} c_i ≤ a_q + α.
Idea: use linear programming. A solution must exist: d itself.
Privacy Requires Ω(√n) Perturbation
Consider a database with o(√n) perturbation.
• The adversary makes t = n log² n random queries q_j, getting noisy answers a_j.
• Privacy-violating algorithm: construct a database c = {c_i}, 1 ≤ i ≤ n, by solving the linear program
0 ≤ c_i ≤ 1 for 1 ≤ i ≤ n
a_j − α ≤ Σ_{i∈q_j} c_i ≤ a_j + α for 1 ≤ j ≤ t
• Round the solution: if c_i > 1/2, set it to 1, and to 0 otherwise.
A solution must exist: d itself.
For every query q_j, the answer according to c is at most 2α far from the (real) answer in d.
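Here is a dependency-free sketch of the reconstruction attack (my simplification, not the lecture's algorithm: to avoid an LP solver it fits the noisy answers by ordinary least squares via the normal equations, then applies the same rounding step). With noise of magnitude α = 2, which is o(√n) in spirit, the rounded solution recovers almost every bit:

```python
import random

random.seed(0)
n, t, alpha = 30, 300, 2
d = [random.randint(0, 1) for _ in range(n)]

# t random subset queries, answered with alpha-perturbation.
queries = [[random.randint(0, 1) for _ in range(n)] for _ in range(t)]
answers = [sum(q[i] * d[i] for i in range(n)) + random.uniform(-alpha, alpha)
           for q in queries]

# Least-squares fit: solve the normal equations (Q^T Q) c = Q^T a.
A = [[sum(queries[k][i] * queries[k][j] for k in range(t)) for j in range(n)]
     for i in range(n)]
b = [sum(queries[k][i] * answers[k] for k in range(t)) for i in range(n)]

for col in range(n):                       # Gaussian elimination with pivoting
    piv = max(range(col, n), key=lambda r: abs(A[r][col]))
    A[col], A[piv] = A[piv], A[col]
    b[col], b[piv] = b[piv], b[col]
    for r in range(col + 1, n):
        f = A[r][col] / A[col][col]
        for j in range(col, n):
            A[r][j] -= f * A[col][j]
        b[r] -= f * b[col]

c = [0.0] * n                              # back substitution
for i in reversed(range(n)):
    c[i] = (b[i] - sum(A[i][j] * c[j] for j in range(i + 1, n))) / A[i][i]

recon = [1 if ci > 0.5 else 0 for ci in c] # the rounding step from the lecture
errors = sum(ri != di for ri, di in zip(recon, d))
print(errors)                              # reconstruction disagrees with d on few entries
```

The LP version gives the worst-case guarantee proved in the lecture; least squares is just a convenient stand-in that illustrates how strongly o(√n)-perturbed answers constrain the database.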
Bad Solutions to the LP Do Not Survive
A query q disqualifies a potential database c ∈ [0,1]^n if its answer on q is more than 2α far from the answer in d: |Σ_{i∈q} c_i − Σ_{i∈q} d_i| > 2α.
• Idea: show that for a database c that is far away from d, a random query disqualifies c with some constant probability η.
• Want to use the union bound: all far-away solutions are disqualified w.p. at least 1 − n^n (1 − η)^t = 1 − neg(n). But how do we limit the solution space so a union bound applies? Round each value to the closest multiple of 1/n.
Privacy Requires Ω(√n) Perturbation
Lemma: if c is far away from d, then a random query disqualifies c with some constant probability. Concretely: if Prob_{i∈[n]}[|d_i − c_i| ≥ 1/3] is at least a constant, then there is a constant η > 0 such that Prob_{q⊆[n]}[|Σ_{i∈q} (c_i − d_i)| ≥ 2α + 1] > η.
The proof uses Azuma's inequality.
Privacy Requires Ω(√n) Perturbation
We can discretize all potential databases c ∈ [0,1]^n: round each entry c_i to the closest fraction with denominator n, so |c_i − w_i/n| ≤ 1/n.
• The response on any q then changes by at most 1.
• If we disqualify all "discrete" databases, then we also effectively eliminate all c ∈ [0,1]^n.
• There are n^n "discrete" databases.
Privacy Requires Ω(√n) Perturbation
Claim: if c is far away from d (many entries of c are far from the corresponding entries of d), then a random query disqualifies c with some constant probability η.
• Therefore t = n log² n queries leave only a negligible survival probability for each far-away reconstruction.
• Union bound, applied via the discretization: all far-away suggestions are disqualified w.p. at least 1 − n^n (1 − η)^t = 1 − neg(n).
Review and Conclusion
• When the perturbation is o(√n), choosing Õ(n) random queries gives enough information to efficiently reconstruct an o(n)-close database.
• The database is reconstructed using linear programming, in polynomial time.
• Databases with o(√n) perturbation are blatantly non-private: poly(n)-time reconstructable.
Composition
Suppose we are going to apply a DP mechanism t times, perhaps on different databases. We want to argue that the result is differentially private.
• A value b ∈ {0,1} is chosen.
• In each of the t rounds, the adversary A picks two adjacent databases D_i^0 and D_i^1 and receives the result z_i of an ε-DP mechanism M_i on D_i^b.
• Want to argue that A's view has nearly the same distribution for both values of b.
• A's view: (z_1, z_2, …, z_t), plus the randomness used.
Differential Privacy: Composition
Handles auxiliary information, and composes naturally:
• A1(D) is ε1-diffP
• for all z1, A2(D, z1) is ε2-diffP
Then A2(D, A1(D)) is (ε1+ε2)-diffP.
Proof: write P[z1] = Pr_{z∼A1(D)}[z = z1], P'[z1] = Pr_{z∼A1(D')}[z = z1], P[z2] = Pr_{z∼A2(D,z1)}[z = z2], P'[z2] = Pr_{z∼A2(D',z1)}[z = z2]. For all adjacent D, D' and all (z1, z2):
e^(−ε1) ≤ P[z1] / P'[z1] ≤ e^ε1 and e^(−ε2) ≤ P[z2] / P'[z2] ≤ e^ε2,
so e^(−(ε1+ε2)) ≤ P[(z1,z2)] / P'[(z1,z2)] ≤ e^(ε1+ε2).
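The two-step composition bound can be checked numerically for a concrete pair of mechanisms (a sketch with made-up parameters): two Laplace-noised counting queries with privacy parameters ε1 and ε2, run on adjacent databases whose true counts differ by 1. Over a grid of joint outputs, the density ratio never exceeds e^(ε1+ε2):

```python
import math

def lap_density(y: float, scale: float) -> float:
    """Density of Lap(scale) at y: (1 / (2b)) * exp(-|y| / b)."""
    return math.exp(-abs(y) / scale) / (2 * scale)

eps1, eps2 = 0.3, 0.5
count_D, count_Dp = 10, 11        # adjacent DBs: each true count differs by 1

grid = [x / 10 for x in range(0, 250)]
worst = 0.0
for z1 in grid:                   # joint output (z1, z2) of the two mechanisms
    for z2 in grid:
        p = lap_density(z1 - count_D, 1 / eps1) * lap_density(z2 - count_D, 1 / eps2)
        q = lap_density(z1 - count_Dp, 1 / eps1) * lap_density(z2 - count_Dp, 1 / eps2)
        worst = max(worst, p / q, q / p)

print(worst, math.exp(eps1 + eps2))  # worst-case ratio equals e^(eps1 + eps2)
```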
Differential Privacy: Composition
• If all mechanisms M_i are ε-DP, then for any view, the probabilities that A sees that view when b = 0 and when b = 1 are within a factor e^(tε).
• Therefore, results for a single query translate to results on several queries.
Answering a Single Counting Query
U is a set of (name, tag ∈ {0,1}) tuples. One counting query: # of participants with tag = 1.
Sanitizer A: output the # of 1's + noise. This is differentially private, if we choose the noise properly: choose the noise from the Laplace distribution.
Laplacian Noise
The Laplace distribution Y = Lap(b) has density function Pr[Y = y] = (1/2b) e^(−|y|/b).
Standard deviation: O(b). Take b = 1/ε, and get Pr[Y = y] ∝ e^(−ε|y|).
Laplacian Noise: ε-Privacy
Take b = 1/ε, so Pr[Y = y] ∝ e^(−ε|y|). Release: q(D) + Lap(1/ε).
For adjacent D, D': |q(D) − q(D')| ≤ 1, so for every output a:
e^(−ε) ≤ Pr_{by D}[a] / Pr_{by D'}[a] ≤ e^ε
Theorem: the Laplace mechanism with parameter b = 1/ε is ε-differentially private.
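The theorem can be verified pointwise (a sketch with parameters of my choosing): for a sensitivity-1 query, the ratio of the output densities under D and D' is exp(ε(|y − a'| − |y − a|)), which the triangle inequality keeps within [e^(−ε), e^ε]:

```python
import math

def lap_density(y: float, scale: float) -> float:
    """Density of Lap(scale) at y: (1 / (2b)) * exp(-|y| / b)."""
    return math.exp(-abs(y) / scale) / (2 * scale)

eps = 0.4
a, a_prime = 7, 8                 # adjacent databases: true answers differ by 1

ratios = [lap_density(y - a, 1 / eps) / lap_density(y - a_prime, 1 / eps)
          for y in (x / 100 for x in range(-500, 1500))]
worst = max(max(ratios), 1 / min(ratios))
print(worst, math.exp(eps))       # worst-case ratio is exactly e^eps
```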
Laplacian Noise: Õ(1/ε) Error
Take b = 1/ε, so Pr[Y = y] ∝ e^(−ε|y|).
Concentration of the Laplace distribution: Pr_{y∼Y}[|y| > k·(1/ε)] = O(e^(−k)).
Setting k = O(log n): the expected error is 1/ε, and w.h.p. the error is Õ(1/ε).
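The tail bound is easy to confirm by sampling (a sketch; for Lap(1/ε) the tail is exactly Pr[|Y| > k/ε] = e^(−k)):

```python
import math
import random

def laplace_sample(scale: float) -> float:
    """Lap(scale) as the difference of two exponentials with mean `scale`."""
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

random.seed(0)
eps, k, trials = 0.5, 3, 200_000
b = 1 / eps

tail = sum(abs(laplace_sample(b)) > k * b for _ in range(trials)) / trials
print(tail, math.exp(-k))  # empirical tail frequency vs the exact value e^(-k)
```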