Differentially Private Marginals Release with Mutual Consistency and Error Independent of Sample Size Cynthia Dwork, Microsoft


Page 1:

Differentially Private Marginals Release with Mutual Consistency and Error Independent of Sample Size

Cynthia Dwork, Microsoft

Page 2:

Full Papers

- Privacy, Accuracy, and Consistency Too: A Holistic Solution to Contingency Table Release. Barak, Chaudhuri, Dwork, Kale, McSherry, and Talwar. ACM SIGMOD/PODS 2007
- Differentially Private Marginals Release with Mutual Consistency and Error Independent of Sample Size. Dwork, McSherry, and Talwar. This Workshop

Page 3:

Release of Contingency Table Marginals

Simultaneously ensure:
- Consistency
- Accuracy
- Differential Privacy

Page 4:

Release of Contingency Table Marginals

Terms to define:
- Contingency Table
- Marginal
- Consistency
- Accuracy
- Differential Privacy

Page 5:

Contingency Tables and Marginals

Contingency Table: Histogram / Table of Counts
- Each respondent (member of data set) described by a vector of k (binary) attributes
- Population in each of the 2^k cells
  - One cell for each setting of the k attributes

[Figure: axes A1, A2, A3]

Page 6:

Contingency Tables and Marginals

Marginal: sub-table
- Specified by a set of j ≤ k attributes, e.g., j = 1
- Histogram of population in each of 2^j (e.g., 2) cells
  - One cell for each setting of the j selected attributes
  - A1 = 0: 3, A1 = 1: 4, so the A1 marginal is (3, 4)

[Figure: axis A1]
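A minimal sketch of the two definitions above, assuming NumPy and a made-up data set of seven respondents with k = 3 binary attributes. The function names `contingency_table` and `marginal` are illustrative, not taken from the talk; the records are chosen so that the A1 marginal comes out as (3, 4), as on the slide.

```python
import numpy as np

def contingency_table(records, k):
    """Count respondents in each of the 2**k cells (one axis per binary attribute)."""
    table = np.zeros((2,) * k, dtype=int)
    for rec in records:
        table[tuple(rec)] += 1
    return table

def marginal(table, attrs):
    """Sum out every attribute not in `attrs`, leaving a sub-table with 2**j cells."""
    drop = tuple(a for a in range(table.ndim) if a not in attrs)
    return table.sum(axis=drop)

# Hypothetical data set: 7 respondents, attributes (A1, A2, A3).
records = [
    (0, 0, 1), (0, 1, 0), (0, 1, 1),
    (1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1),
]
T = contingency_table(records, k=3)
a1_marginal = marginal(T, attrs=[0])
print(a1_marginal)  # [3 4]
```

Here `T` has 2^3 = 8 cells and the A1 marginal has 2^1 = 2 cells; a 2-way marginal such as `marginal(T, [0, 1])` would have 4.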

Page 7:

All the Notation for the Entire Talk

- D: the real data set
- n: number of respondents
- k: number of attributes
- T = T(D): the contingency table representing D (2^k cells)
- T*: contingency table of a fictional data set
- M = M(T): a collection of marginals of T
  - M3: collection of all 3-way marginals
- R = R(M) = R(M(T)): reported marginals
  - Typically noisy, to protect privacy: R(M(T)) ≠ M(T)
- c = c(M): total number of cells in M
- α: name of a marginal (i.e., a set of attributes)
- ε: a privacy parameter

Page 8:

Consistency Across Reported Marginals

- There exists a fictional contingency table T* whose marginals equal the reported marginals: M(T*) = R(M(T))
- T*, M(T*) may have negative and/or non-integral counts
- Who cares about consistency? Not we. Software?

Page 9:

Consistency Across Reported Marginals

- Who cares about integrality, non-negativity? Not we. Software? See the paper.

Page 10:

Accuracy of Reported Values

- Roughly, described by $E[\|R(M(T)) - M(T)\|_1]$
  - Expected error in each cell: proportional to c(M)/ε
  - A little worse
  - Probabilistic guarantees on size of max error
  - Similar to change obtained by randomly adding/deleting c(M)/ε respondents to T and then computing M
- Key Point: Error is Independent of n (and k)
  - Depends on the “complexity” of M
  - Depends on the privacy parameter ε

Page 11:

ε-Differential Privacy

For all x, for all reported values r:

$\Pr[R(M) = r \mid x \in D] \in \exp(\pm\varepsilon) \cdot \Pr[R(M) = r \mid x \notin D]$

[Figure: Pr[r] plotted against r; ratio bounded]

Page 12:

ε-Differential Privacy

When ε is small: for all x, for all reported r

$\Pr[R(M) = r \mid x \in D] \in (1 \pm \varepsilon) \cdot \Pr[R(M) = r \mid x \notin D]$

- Probabilities taken over coins flipped by curator
- Independent of other sources of data, databases, or even knowledge of every element in D\{x}.
- “Anything, good or bad, is essentially equally likely to occur, whether I join the database or not.”
- Generalizes to groups of respondents
  - Although, if group is large, then outcomes should differ.
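A short worked note on how the two statements of the guarantee relate, and on the group case. The first line is the first-order Taylor expansion of the exponential. The second applies the single-respondent bound once per member of a group G of g respondents (g and G are names introduced here for illustration), which is why the guarantee weakens as the group grows.

```latex
e^{\pm\varepsilon} = 1 \pm \varepsilon + \tfrac{\varepsilon^{2}}{2} \pm \cdots
  \approx 1 \pm \varepsilon \qquad (\varepsilon \ll 1)

\Pr[R(M) = r \mid G \subseteq D] \in \exp(\pm g\,\varepsilon) \cdot
  \Pr[R(M) = r \mid G \cap D = \emptyset], \qquad g = |G|
```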

Page 13:

Why Differential Privacy?

- Dalenius’ Goal: “Anything that can be learned about a respondent, given access to the statistical database, can be learned without access” is Provably Unachievable.
  - Sam the (American) smoker tries to buy medical insurance
  - Statistical DB teaches smoking causes cancer
  - Sam harmed: high premiums for medical insurance
  - Sam need not be in the database!
- Differential Privacy guarantees that risk to Sam will not substantially increase if he enters the DB
  - DBs have intrinsic social value

Page 14:

An Ad Omnia Guarantee

- No perceptible risk is incurred by joining data set
- Anything adversary can do to Sam, it could do even if his data not in data set

[Figure: Pr[r] over outcomes r, with bad r's marked X]

Page 15:

Achieving Differential Privacy for d-ary f

- Curator adds noise according to Laplace distribution
  - “Hides” the presence/absence of any individual
- How much can the data of one person affect M(T)?
  - $\forall \alpha \in M$, one person can affect one cell in α(T), by 1
  - $\Delta f = \max_{D,x} \|f(D \cup \{x\}) - f(D \setminus \{x\})\|_1$
  - e.g., Δα = 1, ΔM ≤ |M|

Page 16:

Calibrate Noise to Sensitivity for d-ary f

$\Delta f = \max_{D,x} \|f(D \cup \{x\}) - f(D \setminus \{x\})\|_1$

Theorem: To achieve ε-differential privacy, use scaled symmetric noise ~ Lap(s)^d with s = Δf/ε

[Figure: Laplace density, horizontal axis marked from -4s to 5s in steps of s; ratio = e^ε]
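The theorem can be sketched in a few lines, assuming NumPy. The function name `laplace_mechanism` and the example values (the A1 marginal (3, 4), ε = 0.5) are illustrative choices, not part of the talk; the only step taken from the slide is the calibration s = Δf/ε.

```python
import numpy as np

def laplace_mechanism(true_values, sensitivity, epsilon, rng):
    """Add independent Lap(s) noise to each coordinate, with s = sensitivity / epsilon."""
    scale = sensitivity / epsilon
    true_values = np.asarray(true_values, dtype=float)
    noise = rng.laplace(loc=0.0, scale=scale, size=true_values.shape)
    return true_values + noise

rng = np.random.default_rng(0)

# A single marginal has sensitivity 1: one respondent changes one cell by 1.
a1_marginal = np.array([3, 4])
noisy = laplace_mechanism(a1_marginal, sensitivity=1.0, epsilon=0.5, rng=rng)
print(noisy)  # two non-integral, possibly negative, counts
```

The expected absolute noise per coordinate equals the scale s, so here it is 1/0.5 = 2 regardless of how many respondents the data set contains.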

Page 17:

Calibrate Noise to Sensitivity for d-ary f

f = α: s = 1/ε

Page 18:

Calibrate Noise to Sensitivity for d-ary f

f = T: s = 1/ε

Page 19:

Calibrate Noise to Sensitivity for d-ary f

f = M: s ≤ |M|/ε

Page 20:

Calibrate Noise to Sensitivity for d-ary f

f = M3: s ≤ (k choose 3)/ε

Page 21:

Application: Release of Marginals M

- Release noisy contingency table; compute marginals?
  - Consistency and differential privacy
  - Noise per cell of T: Lap(1/ε)
  - Noise per cell of M: about 2^{k/2}/ε for low-order marginals
- Release noisy versions of all marginals in M?
  - Noise per cell of M: Lap(|M|/ε)
  - Differential privacy and better accuracy
  - Inconsistent
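A rough sketch of where the 2^{k/2}/ε figure for the first approach comes from, assuming the Lap(1/ε) noise added to the cells of T is independent. Each cell of a j-way marginal is a sum of 2^{k-j} noisy cells of T, so the variances add.

```latex
\operatorname{Var}\big[\mathrm{Lap}(1/\varepsilon)\big] = \frac{2}{\varepsilon^{2}}
\quad\Longrightarrow\quad
\operatorname{sd}\big[\text{noise in one cell of a } j\text{-way marginal}\big]
  = \sqrt{2^{\,k-j} \cdot \frac{2}{\varepsilon^{2}}}
  = \frac{2^{(k-j+1)/2}}{\varepsilon}
  \approx \frac{2^{k/2}}{\varepsilon} \qquad (j \text{ small})
```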

Page 22:

Move to the Fourier Domain

- Just a change of basis. Why bother?
  - T represented by 2^k Fourier coefficients (it has 2^k cells)
  - To compute j-ary marginal α(T) only need 2^j coefficients
  - For any M, expected noise/cell depends on number of coefficients needed to compute M(T)
  - E.g., for M3, E[noise/cell] ≈ (k choose 3)/ε.
- The Algorithm for R(M(T)):
  - Compute set of Fourier coefficients of T needed for M(T)
  - Add noise; gives Fourier coefficients for M(T*)
    - 1-1 mapping between set of Fourier coeffs and tables ensures consistency!
  - Convert back to obtain M(T*)
  - Release R(M(T)) = M(T*)
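The steps above can be sketched as follows, under several assumptions that are not in the talk. The Fourier basis is taken to be the orthonormal Walsh-Hadamard basis on {0,1}^k. The full 2^k table is materialized, which is only feasible for small k. Coefficients that no requested marginal needs are set to zero when building T*. The names `wht`, `needed_mask`, and `release_marginals` are hypothetical.

```python
import itertools
import numpy as np

def wht(table):
    """Orthonormal Walsh-Hadamard transform along every binary axis (self-inverse)."""
    out = np.asarray(table, dtype=float)
    for axis in range(out.ndim):
        a, b = np.take(out, 0, axis=axis), np.take(out, 1, axis=axis)
        out = np.stack([a + b, a - b], axis=axis) / np.sqrt(2.0)
    return out

def needed_mask(k, marginals):
    """Mark coefficients whose support lies inside some requested attribute set."""
    mask = np.zeros((2,) * k, dtype=bool)
    for attrs in marginals:
        for bits in itertools.product([0, 1], repeat=len(attrs)):
            beta = [0] * k
            for a, b in zip(attrs, bits):
                beta[a] = b
            mask[tuple(beta)] = True
    return mask

def release_marginals(table, marginals, epsilon, rng):
    """Noise the needed coefficients, rebuild a fictional T*, and return M(T*)."""
    k = table.ndim
    coeffs = wht(table)
    mask = needed_mask(k, marginals)
    m = int(mask.sum())
    # One respondent moves each coefficient by 2**(-k/2), so the L1
    # sensitivity of the m needed coefficients is m * 2**(-k/2).
    scale = m * 2.0 ** (-k / 2) / epsilon
    noise = rng.laplace(0.0, scale, size=coeffs.shape)
    t_star = wht(np.where(mask, coeffs + noise, 0.0))
    return {
        tuple(attrs): t_star.sum(axis=tuple(a for a in range(k) if a not in attrs))
        for attrs in marginals
    }

rng = np.random.default_rng(0)
T = rng.integers(0, 20, size=(2, 2, 2))  # hypothetical contingency table, k = 3
released = release_marginals(T, [(0, 1), (1, 2)], epsilon=1.0, rng=rng)

# Both released marginals imply the same counts for attribute 1 alone.
assert np.allclose(released[(0, 1)].sum(axis=0), released[(1, 2)].sum(axis=1))
```

Because every released marginal is computed from the same fictional table, consistency holds by construction, while the noise scale depends on the number of needed coefficients rather than on n.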

Page 23:

Improving Accuracy

- Gaussian noise, instead of Laplacian
  - E[noise/cell] for M3 looks more like O(log(1/δ) k^{3/2}/ε)
  - Probabilistic (1 - δ) guarantee of ε-differential privacy
- Use Domain-Specific Knowledge
  - We have, so far, avoided this!
  - If most attributes are considered (socially) insensitive, can add less noise, and to fewer coefficients
  - E.g., M3 with 1 sensitive attribute ≈ k^2 (instead of k^3)

Page 24:

In Theory, Noise Must Depend on the Set M

- “Dinur-Nissim” style results: 1 sensitive attribute
  - Overly accurate answers to too many questions permit reconstruction of sensitive attributes of almost entire database, say, 99%.
  - Attacks use no linkage/external/auxiliary information
- Rough Translation: there are “bad” databases, in which a sensitive binary attribute can be learned for all respondents, from, say, 2√n degree-2 marginals, if per-cell errors are strictly less than √n
  - The “badness” relates to the distribution of the occurrence of the insensitive attributes!

Page 25:

Summary

- Introduced ε-Differential Privacy
  - Rigorous and ad omnia notion of privacy
- Showed how to achieve differential privacy
  - In general
  - In the special case of marginal release
  - Simple!
- Special attention paid to ensuring consistency among released marginals
- Per cell accuracy deteriorates with complexity of query and degree of privacy
- Noted that accuracy must deteriorate with complexity of query