Toward Privacy in Public Databases


Page 1: Toward Privacy in Public Databases

Toward Privacy in Public Databases

Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith, Larry Stockmeyer, Hoeteck Wee

Work Done at Microsoft Research

Page 2: Toward Privacy in Public Databases

Database Privacy

Think "Census": individuals provide information; the Census Bureau publishes sanitized records.

Privacy is legally mandated; what utility can we achieve?

Inherent privacy vs. utility trade-off:
One extreme – complete privacy, no information.
Other extreme – complete information, no privacy.

Goals: find a middle path:
preserve macroscopic properties;
"disguise" individual identifying information.

Change the nature of discourse: establish a framework for meaningful comparison of techniques.

Page 3: Toward Privacy in Public Databases

Current solutions

Statistical approaches: alter the frequency (PRAN/DS/PERT) of particular features while preserving means; additionally, erase values that reveal too much.

Query-based approaches: disallow queries that reveal too much; output perturbation (add noise to the true answer).

Unsatisfying:
Ad-hoc definitions of privacy and of what constitutes a breach.
Erasure can itself disclose information.
Noise can cancel (although, see the work of Nissim et al.).
Combinations of several seemingly innocuous queries could reveal information; refusal to answer can be revelatory.

Page 4: Toward Privacy in Public Databases

Everybody's First Suggestion

Learn the distribution, then output:
a description of the distribution, or
samples from the learned distribution.

Want to reflect facts on the ground: statistically insignificant clusters can be important for allocating resources.

Page 5: Toward Privacy in Public Databases

Our Approach

Crypto-flavored definitions:
mathematical characterization of the adversary's goal;
precise definition of when a sanitization procedure fails.
Intuition: seeing the sanitized DB gives the adversary an "advantage".

Statistical techniques: perturbation of attribute values.
Differs from previous work: perturbation amounts depend on local densities of points.

Highly abstracted version of the problem:
if we can't understand this, we can't understand real life (and we can't…);
if we get negative results here, the world is in trouble.

Page 6: Toward Privacy in Public Databases

What do WE mean by privacy?

[Ruth Gavison] Protection from being brought to the attention of others:
inherently valuable;
attention invites further privacy loss.

Privacy is assured to the extent that one blends in with the crowd

Appealing definition; can be converted into a precise mathematical statement…

Page 7: Toward Privacy in Public Databases

A geometric view

Abstraction: the database consists of points in the high-dimensional space R^d,
independent samples from some underlying distribution.

Points are unlabeled: you are your collection of attributes.

Distance is everything: points are similar if and only if they are close (L2 norm).

Real Database (RDB), private: n unlabeled points in d-dimensional space.

Sanitized Database (SDB), public: n' new points, possibly in a different space.

Page 8: Toward Privacy in Public Databases

The adversary, or Isolator – intuition

On input SDB and auxiliary information, the adversary outputs a point q ∈ R^d.

q "isolates" a real DB point x if it is much closer to x than to x's near neighbors.
q fails to isolate x if q looks roughly as much like everyone in x's neighborhood as it looks like x itself.

Tightly clustered points have a smaller radius of isolation.

Page 9: Toward Privacy in Public Databases

Isolation – the definition

I(SDB, aux) = q. Write δ_x = |q − x|.
x is isolated if B(q, cδ_x) contains fewer than T other points from RDB.

T-radius of x – distance to its T-th nearest neighbor.

x is "safe" if δ_x > (T-radius of x)/(c−1):
then B(q, cδ_x) contains x's entire T-neighborhood.

c – privacy parameter; e.g., 4.

[Figure: q at distance δ_x from x; the ball B(q, cδ_x) covers x's T-neighborhood, including a neighbor p.]

If |x−p| ≤ T-rad_x < (c−1)δ_x, then |q−p| ≤ |q−x| + |x−p| < δ_x + T-rad_x < cδ_x.
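As a concrete illustration, here is a minimal sketch of the isolation test and the safety condition above (the helper names and the use of NumPy are my own, not from the slides):

```python
import numpy as np

def t_radius(x, rdb, T):
    """Distance from x to its T-th nearest neighbor in rdb (x itself is a row of rdb)."""
    dists = np.sort(np.linalg.norm(rdb - x, axis=1))
    return dists[T]                       # dists[0] == 0 is x itself

def isolates(q, x, rdb, c=4, T=20):
    """True if q c-isolates x: with delta = |q - x|, the ball B(q, c*delta)
    contains fewer than T points of rdb other than x."""
    delta = np.linalg.norm(q - x)
    within = np.linalg.norm(rdb - q, axis=1) <= c * delta
    return int(within.sum()) - 1 < T      # subtract 1 to exclude x itself

def safe(x, q, rdb, c=4, T=20):
    """Sufficient condition from the slide: x is safe if |q - x| > T-radius(x)/(c-1),
    since then B(q, c*|q - x|) contains x's entire T-neighborhood."""
    return np.linalg.norm(q - x) > t_radius(x, rdb, T) / (c - 1)
```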

Page 10: Toward Privacy in Public Databases

Requirements for the sanitizer

No way of obtaining privacy if AUX already reveals too much!

The sanitization procedure compromises privacy if giving the adversary access to the SDB considerably increases its probability of success.

The definition of "considerably" can be forgiving, say n⁻². Made rigorous by quantification over adversaries, distributions, auxiliary information, sanitizations, and samples:

∀ I ∃ I′ w.o.p. ∀ D, ∀ aux z, ∀ x ∈ D:
|Pr[I(SDB, z) isolates x] − Pr[I′(z) isolates x]| is small (e.g., ε/n).

Provides a framework for describing the power of a sanitization method, and hence for comparisons

Page 11: Toward Privacy in Public Databases

The Sanitizer

The privacy of x is linked to its T-radius: randomly perturb x in proportion to its T-radius.

x′ = San(x) ∈_R B(x, T-rad(x))

Intuition: we are blending x in with its crowd.

We are adding to x random noise with mean zero, so several macroscopic properties should be preserved.
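A minimal sketch of this sanitizer, assuming the perturbation is drawn uniformly from the ball of radius T-rad(x) around x (function names are mine; NumPy assumed):

```python
import numpy as np

def sample_ball(center, radius, rng):
    """Draw a point uniformly at random from the ball B(center, radius)."""
    d = center.shape[0]
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)                          # uniform random direction
    r = radius * rng.uniform() ** (1.0 / d)         # radius density proportional to r^(d-1)
    return center + r * u

def sanitize(rdb, T, seed=0):
    """Perturb each point x to x' chosen uniformly from B(x, T-radius(x))."""
    rng = np.random.default_rng(seed)
    sdb = []
    for x in rdb:
        dists = np.sort(np.linalg.norm(rdb - x, axis=1))
        sdb.append(sample_ball(x, dists[T], rng))   # dists[T] is the T-radius of x
    return np.array(sdb)
```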

Page 12: Toward Privacy in Public Databases

Flavor of Results (Preliminary)

Assumptions:
data arises from a mixture of Gaussians;
dimension d and number of points n are large; d = ω(log n).

Results:
Privacy: an adversary who knows the Gaussians and some auxiliary information cannot isolate any point with probability more than 2^{−Ω(d)}.
Several special cases are proved; the general result is not yet proved. Very different proof techniques from anything in the statistics or crypto literatures!
Utility: a user who does not know the Gaussians can compute the means with high probability.

Page 13: Toward Privacy in Public Databases

The "simplest" interesting case

Two points, x and y, generated uniformly from the surface of a ball B(o, r).

The adversary knows x′, y′, and the distance |x − y|.

We prove there are 2^{Ω(d)} "decoy" pairs (x_i, y_i) such that |x_i − y_i| = |x − y| and Pr[x_i, y_i | x′, y′] = Pr[x, y | x′, y′].

Furthermore, the adversary can isolate only one point x_i or y_i at a time: they are "far apart" relative to |x − y|.

Proof based on symmetry arguments and coding theory.

High dimensionality is crucial.

Page 14: Toward Privacy in Public Databases

Finding Decoy Pairs

Consider a hyperplane H through x′, y′, and o.
Let x_H, y_H be the mirror reflections of x, y through H.

Note: reflections preserve distances! The world of x_H, y_H looks identical to the world of x, y:

Pr[x_H, y_H | x′, y′] = Pr[x, y | x′, y′]

[Figure: x, y and their reflections x_H, y_H across the hyperplane H containing x′, y′, and o.]
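A small numerical check of this argument (NumPy assumed; the random stand-ins for x, y, x′, y′ and the construction of the normal vector are my own illustration): a reflection through a hyperplane H that contains x′, y′, and the origin fixes x′ and y′ and preserves every pairwise distance, so (x_H, y_H) is exactly as consistent with the observations as (x, y).

```python
import numpy as np

rng = np.random.default_rng(1)
d = 50
x, y, xp, yp = (rng.normal(size=d) for _ in range(4))     # stand-ins for x, y, x', y'

# Orthonormal basis for span{x', y'} (Gram-Schmidt), then a unit normal u that is
# orthogonal to both, i.e. the normal of a hyperplane H containing o, x', y'.
b1 = xp / np.linalg.norm(xp)
b2 = yp - (yp @ b1) * b1
b2 /= np.linalg.norm(b2)
u = rng.normal(size=d)
u -= (u @ b1) * b1 + (u @ b2) * b2
u /= np.linalg.norm(u)

reflect = lambda p: p - 2 * (p @ u) * u                    # mirror image through H

xH, yH = reflect(x), reflect(y)
print(np.allclose(reflect(xp), xp), np.allclose(reflect(yp), yp))       # x', y' are fixed
print(np.allclose(np.linalg.norm(xH - yH), np.linalg.norm(x - y)))      # |x_H - y_H| = |x - y|
print(np.allclose(np.linalg.norm(xH - xp), np.linalg.norm(x - xp)),
      np.allclose(np.linalg.norm(yH - yp), np.linalg.norm(y - yp)))     # distances to x', y' kept
```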

Page 15: Toward Privacy in Public Databases

Lots of choices for H

x_H, y_H – reflections of x, y through H(x′, y′, o).

Note: reflections preserve distances! The world of x_H, y_H looks identical to the world of x, y.

How many different H are there such that the corresponding x_H are pairwise distant (and distant from x)?

Sufficient to pick r > 2/3 and θ = 30°.

Fact: there are 2^{Ω(d)} vectors in d dimensions at angle 60° from each other.

Probability that the adversary wins ≤ 2^{−Ω(d)}.

[Figure: x and two of its reflections x_1, x_2; reflections at angle θ apart lie at chord distance 2r sin θ, here > 2/3.]

Page 16: Toward Privacy in Public Databases

Towards the general case… n points

The adversary is given n−1 real points x_2, …, x_n and one sanitized point x′_1.

Symmetry does not work – too many constraints.

A more direct argument:
Let Z = { p ∈ R^d | p is a legal pre-image for x′_1 }
and Q = { p | if x_1 = p then x_1 is isolated by q }.
Show that Pr[x_1 ∈ Q∩Z | x′_1] ≤ 2^{−Ω(d)}:

Pr[x_1 ∈ Q∩Z | x′_1] = (probability mass contributed by Q∩Z) / (contribution from Z) ≤ 2^{1−d} / (1/4).
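Written out as a worked bound (my display; μ denotes probability mass, and the constants are the ones on the slide):

```latex
\Pr[\, x_1 \in Q \cap Z \mid x_1' \,]
  = \frac{\mu(Q \cap Z)}{\mu(Z)}
  \le \frac{2^{\,1-d}}{1/4}
  = 2^{\,3-d}
  = 2^{-\Omega(d)} .
```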

Page 17: Toward Privacy in Public Databases

Why does Q∩Z contribute so little mass?

Z = { p | p is a legal pre-image for x′_1 }
Q = { p | if x_1 = p then x_1 is isolated by q }

[Figure: q, x′_1, and the real points x_2, …, x_6, with the regions Z, Q, and Q∩Z. Here T = 1, so each point is perturbed to its 1-radius and |x′_1 − x_1| = 1-rad(x_1).]

Key observation: as |q − x′_1| increases, Q becomes larger.
But a larger distance from x′_1 implies smaller probability mass, as x_1 is randomized over a larger area.

Page 18: Toward Privacy in Public Databases

The general case… n sanitized points

Initial intuition is wrong: privacy of x_1 given x′_1 and all the other points in the clear does not imply privacy of x_1 given x′_1 and sanitizations of the others!

Sanitization of the other points reveals information about x_1.

Page 19: Toward Privacy in Public Databases

Digression: Histogram Sanitization

U = the d-dimensional cube with side 2. Cut it into 2^d subcubes:
split along each axis; each subcube has side 1.

For each subcube: if the number of RDB points in it exceeds 2T, then recurse.

Output: a list of cells and their counts. (A minimal sketch follows below.)
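Here is that recursion as a sketch (the cell representation and names are my own; empty subcubes are dropped for brevity, and NumPy is assumed):

```python
import numpy as np

def histogram_sanitize(points, lo, hi, T):
    """Recursive histogram: if a cell holds more than 2T points, split it into
    2^d subcells along every axis and recurse; otherwise report (cell, count).
    Empty subcells are omitted in this sketch."""
    if len(points) <= 2 * T:
        return [((lo, hi), len(points))]
    mid = (lo + hi) / 2.0
    cells = []
    codes = (points > mid).astype(int)               # which side of the midpoint, per axis
    for code in np.unique(codes, axis=0):            # visit only the nonempty subcells
        mask = (codes == code).all(axis=1)
        sub_lo = np.where(code == 1, mid, lo)
        sub_hi = np.where(code == 1, hi, mid)
        cells += histogram_sanitize(points[mask], sub_lo, sub_hi, T)
    return cells

# Usage: U is the d-dimensional cube of side 2 centered at the origin.
d, n, T = 8, 1000, 5
rng = np.random.default_rng(0)
rdb = rng.uniform(-1.0, 1.0, size=(n, d))
sdb = histogram_sanitize(rdb, -np.ones(d), np.ones(d), T)
print(len(sdb), sum(c for _, c in sdb))              # number of cells, total count == n
```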

Page 20: Toward Privacy in Public Databases

Digression: Histogram Sanitization

Theorem: If n = 2^{o(d)} and points are drawn uniformly from U, then histogram sanitizations are safe with respect to 8-isolation: Pr[I(SDB) succeeds] ≤ 2^{−Ω(d)}.

Rough intuition: for q ∈ C, the expected distance to any x ∈ C is relatively large (and even larger for x ∈ C′); distances are tightly concentrated. Increasing the radius by a factor of 8 captures almost all of the parent cell, which contains at least 2T points.

Page 21: Toward Privacy in Public Databases

Combining the Two Sanitizations

Partition RDB into two sets, A and B. Cross-training (sketched below):
Compute the histogram sanitization for B.
For each v ∈ A: let ρ_v = side length of the cell C containing v; output GSan(v, ρ_v).
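A sketch of the cross-training step under my reading of the slide: GSan is taken to be a spherical Gaussian perturbation whose scale is the side length of the histogram cell containing the point (an assumption, not confirmed by the slides). It reuses histogram_sanitize from the earlier sketch and NumPy.

```python
import numpy as np

def cell_side(v, B, lo, hi, T):
    """Side length of the histogram cell (built from the points B) containing v.
    Follows the same recursion as histogram_sanitize from the earlier sketch."""
    if len(B) <= 2 * T:                              # recursion stops: this is v's cell
        return float(hi[0] - lo[0])
    mid = (lo + hi) / 2.0
    code = (v > mid).astype(int)
    mask = ((B > mid).astype(int) == code).all(axis=1)
    return cell_side(v, B[mask], np.where(code == 1, mid, lo),
                     np.where(code == 1, hi, mid), T)

def gsan(v, scale, rng):
    """Spherical Gaussian perturbation of v at the given scale (my reading of GSan)."""
    return v + scale * rng.normal(size=v.shape)

def cross_train(rdb, T, seed=0):
    """Split RDB into halves A and B; publish B only through its histogram, and
    publish each v in A perturbed at the scale of the cell of B's histogram containing v."""
    rng = np.random.default_rng(seed)
    rdb = rdb.copy()
    rng.shuffle(rdb)
    A, B = rdb[: len(rdb) // 2], rdb[len(rdb) // 2:]
    d = rdb.shape[1]
    lo, hi = -np.ones(d), np.ones(d)
    cells = histogram_sanitize(B, lo, hi, T)         # histogram for B only
    san_A = np.array([gsan(v, cell_side(v, B, lo, hi, T), rng) for v in A])
    return cells, san_A
```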

Page 22: Toward Privacy in Public Databases

Cross-Training Privacy

Privacy for B: only histogram information about B is used.

Privacy for A: there is enough variance in enough coordinates of v, even given the cell C containing v and the sanitization v′ of v.

Page 23: Toward Privacy in Public Databases

Results on privacy… the special cases

Distribution | Num. of points | Revealed to adversary | Auxiliary information
Uniform on the surface of a sphere | 2 | Both sanitized points | Distribution, 1-radius
Uniform over a bounding box or the surface of a sphere | n | One sanitized point, all other real points | Distribution
Uniform over a hypercube | 2^{Ω(d)} | n/2 sanitized points | Distribution
Gaussian | 2^{o(d)} | n sanitized points | Distribution

Page 24: Toward Privacy in Public Databases

Learning mixtures of Gaussians – spectral techniques

Observation: an optimal low-rank approximation to a matrix of complex data yields the underlying structure, e.g., the means [M01, VW02].

We show that McSherry's algorithm works for clustering sanitized Gaussian data:
the original distribution (a mixture of Gaussians) is recovered.

Page 25: Toward Privacy in Public Databases

Spectral techniques for perturbed data

A sanitized point is the sum of two Gaussian variables – sample + noise.

W.h.p. the T-radius of a point is less than the "radius" of its Gaussian.

The variance of the noise is small, so the previous techniques work.
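A minimal sketch of the spectral step on sanitized data; this is a generic low-rank projection followed by nearest-mean clustering, a stand-in for (not a faithful copy of) the algorithm of [M01] (NumPy assumed):

```python
import numpy as np

def spectral_cluster_means(sdb, k, iters=25, seed=0):
    """Estimate the k Gaussian means from sanitized data: project onto the top-k
    singular subspace (optimal rank-k approximation), then run a simple
    Lloyd-style clustering in that subspace."""
    rng = np.random.default_rng(seed)
    center = sdb.mean(axis=0)
    _, _, Vt = np.linalg.svd(sdb - center, full_matrices=False)
    proj = (sdb - center) @ Vt[:k].T                 # top-k coordinates of every point

    centers = proj[rng.choice(len(proj), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(proj[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centers = np.array([proj[labels == j].mean(axis=0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    # Report the estimated means back in the original d-dimensional space.
    return np.array([sdb[labels == j].mean(axis=0) if (labels == j).any()
                     else center for j in range(k)])
```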

Page 26: Toward Privacy in Public Databases

Results on utility… an overview

Distributional / Worst-case | Objective | Assumptions | Result
Worst-case | Find k clusters minimizing the largest diameter | – | Diameter increases by a factor of 3
Distributional | Find k maximum-likelihood clusters | Mixture of k Gaussians | Correct clustering with high probability, as long as the means are pairwise sufficiently far apart

Page 27: Toward Privacy in Public Databases

What about the real world?

Lessons from the abstract model:
High dimensionality is our friend.
Gaussian perturbations seem to be the right thing to do.
Different attributes need to be scaled appropriately, so that the data is well rounded.

Moving towards real data (an illustrative preprocessing sketch follows):
Outliers – our notion of c-isolation deals with them; the existence of an outlier may be disclosed.
Discrete attributes – convert them into real-valued attributes, e.g., convert a binary variable into a probability.