
Page 1: Privacy vs. Utility

Privacy vs. Utility

Xintao Wu

University of North Carolina at Charlotte
Nov 10, 2008

Page 2: Privacy vs. Utility

Privacy

• Legal interpretation: a view of privacy in terms of the access that others have to us and our information. A general definition of privacy must be one that is measurable, of value, and actionable.

• Measuring privacy:

  Secrecy: concerns information that others may gather about us. It is measured as the probability of a data item being accessed, or as the change in an adversary's knowledge upon seeing the data.

  Anonymity: addresses how much we are in the public gaze. Privacy leakage is measured in terms of the size of the blurring accompanying the release of data.

  Solitude: measures the degree to which others have physical access to us.

Page 3: Privacy vs. Utility

Privacy vs. Utility

• Encryption does not work in the data publishing scenario.

• Utility: the goal of privacy preservation measures is to secure access to confidential information while at the same time releasing aggregate information to the public.

Page 4: Privacy vs. Utility

Data anonymization methods

• Random perturbation: input perturbation and output perturbation (see the sketch after this list).

• Generalization: the data domain has a natural hierarchical structure, and the degree of perturbation can be measured in terms of the height of the resulting generalization above the leaf values.

• Suppression

• Permutation: destroys the link between identifying and sensitive attributes that could lead to a privacy leakage.
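A minimal sketch of input perturbation, assuming additive uniform noise; the helper name `perturb_input` and the noise radius are illustrative, not from the slides:

```python
import random

def perturb_input(values, radius=1.0):
    """Input perturbation: add independent uniform noise drawn from
    [-radius, radius] to each record before it enters the database."""
    return [v + random.uniform(-radius, radius) for v in values]

ages = [23, 35, 35, 47, 52]
print(perturb_input(ages))  # e.g. [22.6, 35.8, 34.1, 47.3, 51.2]
```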

Page 5: Privacy vs. Utility

Statistical measures of anonymity

• Query restriction: for a database of size N and a fixed parameter k, all queries that return either fewer than k or more than N - k records are rejected (see the sketch after this list). This can be subverted by requesting a specific sequence of queries.

• Anonymity via variance: lower-bound the variance of estimators of the sensitive attributes. Utility is measured (by combining the perturbation scheme with a query restriction method) as the fraction of queries that are permitted after perturbation.

• Confidence interval: how hard it is to reconstruct the original data distribution.

• Anonymity via multiplicity: k-anonymity.
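The query-restriction rule can be stated directly in code. A minimal sketch, with the hypothetical helper `answer_query` standing in for the database's query interface:

```python
def answer_query(records, predicate, k):
    """Query restriction: reject any query whose answer set contains
    fewer than k or more than N - k of the N records."""
    n = len(records)
    matches = [r for r in records if predicate(r)]
    if len(matches) < k or len(matches) > n - k:
        return None  # query rejected
    return matches

rows = list(range(100))                            # toy database, N = 100
print(answer_query(rows, lambda r: r < 3, k=5))    # None: fewer than k match
print(answer_query(rows, lambda r: r > 2, k=5))    # None: more than N - k match
print(answer_query(rows, lambda r: r < 50, k=5))   # permitted
```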

Page 6: Privacy vs. Utility

Probabilistic measures of anonymity

• An adversary may know aggregate information about the data as well as the method of perturbation. If X is perturbed with a random value drawn from [-1, 1], the privacy achieved is 2. If the distribution of X is then revealed to be [0, 1] with probability 0.5 and [4, 5] with probability 0.5, the privacy achieved is reduced to 1, because the perturbed value betrays which of the two intervals X came from.

• Mutual information: P(A|B) = 1 - 2^{H(A|B)} / 2^{H(A)} = 1 - 2^{-I(A;B)}, where H(A) encodes the amount of uncertainty (the degree of privacy) in a random variable, H(A|B) is the amount of privacy left in A after B is released, and I(A;B) = H(A) - H(A|B) is the mutual information between A and B (a worked sketch follows this list).

• Utility: the statistical distance between the source distribution of the data and the perturbed distribution.
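A worked sketch of the entropy-based quantities above, under the convention from this line of work that the privacy of a random variable A is 2^{H(A)}; the function names are illustrative:

```python
import math

def interval_privacy(width):
    """For X uniform on an interval, the differential entropy is
    log2(width), so the privacy 2^{H(X)} is just the interval width."""
    return 2 ** math.log2(width)

def privacy_loss(h_a, h_a_given_b):
    """Conditional privacy loss:
    P(A|B) = 1 - 2^{H(A|B)} / 2^{H(A)} = 1 - 2^{-I(A;B)}."""
    return 1 - 2 ** (h_a_given_b - h_a)

print(interval_privacy(2.0))   # 2.0: noise uniform on [-1, 1], as on the slide
print(privacy_loss(1.0, 0.0))  # 0.5: releasing B with I(A;B) = 1 bit halves privacy
```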

Page 7: Privacy vs. Utility

On the Design and Quantification of Privacy Preserving Data Mining Algorithms, PODS 2001

Page 8: Privacy vs. Utility


Page 9: Privacy vs. Utility


Market basket data

• A privacy breach is defined as one in which the probability of some property of the input data is high, conditioned on the output perturbed data having certain properties (Evfimievski et al.).

• Privacy is measured in terms of the probability of correctly reconstructing an original bit, given the perturbed bit (Rizvi and Haritsa); a sketch of this bit randomization follows this list.

• Utility is the problem of reconstructing itemset frequencies accurately.
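A sketch of the bit-randomization idea behind the Rizvi and Haritsa measure, assuming each market-basket bit is kept with a single retention probability p (parameter names are illustrative):

```python
import random

def perturb_bits(bits, p=0.9):
    """Keep each 0/1 bit with probability p, flip it otherwise."""
    return [b if random.random() < p else 1 - b for b in bits]

def estimate_frequency(perturbed, p=0.9):
    """Reconstruct the original frequency f of 1-bits from the observed
    frequency f' using f' = f*p + (1-f)*(1-p), i.e. f = (f' - (1-p)) / (2p-1)."""
    f_obs = sum(perturbed) / len(perturbed)
    return (f_obs - (1 - p)) / (2 * p - 1)

bits = [1] * 300 + [0] * 700                   # true frequency 0.3
print(estimate_frequency(perturb_bits(bits)))  # close to 0.3 on average
```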

Page 10: Privacy vs. Utility

Measuring information transfer

Limiting Privacy Breaches in Privacy Preserving Data Mining, PODS 2003

If we look back from y, there is no easy way of telling whether the source is x1 or x2.

Page 11: Privacy vs. Utility

Measures based on generalization

• K-anonymity
• L-diversity
• P-sensitive k-anonymity
• T-closeness

L-diversity may be difficult and unnecessary to achieve. Suppose the sensitive attribute is the test result for a virus, with 99% of the records being negative: the positive and negative values have very different degrees of sensitivity.

L-diversity is also insufficient to prevent attribute disclosure:

  Skewness attack: e.g., one equivalence class has an equal number of positive and negative records.

  Similarity attack: the sensitive attribute values in an equivalence class are distinct but semantically similar.

T-closeness: an equivalence class has t-closeness if the distance between the distribution of a sensitive attribute in this class and that of the attribute in the whole table is no more than t.

Page 12: Privacy vs. Utility


Measuring distribution difference

Page 13: Privacy vs. Utility


Earth mover’s distance

Page 14: Privacy vs. Utility


EMD for numerical attribute
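The formula itself was an image on the original slide; below is a sketch of the standard ordered-distance form of EMD for a numerical attribute (as in the t-closeness paper), where adjacent values are unit distance apart and the result is scaled by 1/(m-1):

```python
from itertools import accumulate

def emd_ordered(p, q):
    """EMD between two distributions over an ordered attribute with
    m values: (1/(m-1)) * sum_i | sum_{j<=i} (p_j - q_j) |."""
    m = len(p)
    diffs = [pi - qi for pi, qi in zip(p, q)]
    return sum(abs(c) for c in accumulate(diffs)) / (m - 1)

# Distance between a class distribution and the whole-table
# distribution over three salary bins:
print(emd_ordered([0.5, 0.5, 0.0], [1/3, 1/3, 1/3]))  # 0.25
```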

Page 15: Privacy vs. Utility


EMD for categorical attribute

Page 16: Privacy vs. Utility


EMD for categorical attribute

Page 17: Privacy vs. Utility


Permutation

• The k-anonymous blocks are formed so that the diameter of the range of sensitive attribute values within each block is larger than a parameter e.

• Permutation-based anonymization can answer aggregate queries more accurately than generalization-based anonymization (see the sketch below).
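A minimal per-block sketch of the permutation step, assuming the k-anonymous blocks have already been formed and records are dictionaries; the names are illustrative:

```python
import random

def permute_block(block, sensitive="salary"):
    """Within one k-anonymous block, shuffle the sensitive values among
    the records: the quasi-identifiers and the multiset of sensitive
    values are both published exactly, but the link between them is
    destroyed."""
    values = [r[sensitive] for r in block]
    random.shuffle(values)
    return [{**r, sensitive: v} for r, v in zip(block, values)]

block = [{"zip": "28223", "salary": 50}, {"zip": "28224", "salary": 90},
         {"zip": "28227", "salary": 120}]
print(permute_block(block))
```

Because the quasi-identifier values stay exact (only the linkage is randomized), aggregate queries over a block carry no generalization error, which is the accuracy advantage claimed above.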

Page 18: Privacy vs. Utility


Anonymizing inference

• The goal is to protect against the possible inferences that can be made from the data.

• A privacy template is an inference on the data, coupled with a confidence bound. The requirement is that, in the anonymized data, this inference does not hold with a confidence larger than the provided bound (see the sketch below).

• Wang et al. Handicapping Attacker's Confidence: An Alternative to k-Anonymization.
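A sketch of checking one privacy template against a table, with hypothetical `premise`/`consequence` predicates standing in for the inference:

```python
def confidence(records, premise, consequence):
    """Estimated confidence of the inference premise -> consequence,
    i.e. P(consequence | premise) in the (anonymized) data."""
    matching = [r for r in records if premise(r)]
    if not matching:
        return 0.0
    return sum(1 for r in matching if consequence(r)) / len(matching)

def satisfies_template(records, premise, consequence, bound):
    """The template holds if the inference's confidence stays within
    the provided bound."""
    return confidence(records, premise, consequence) <= bound
```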

Page 19: Privacy vs. Utility

Measuring utility in generalization-based anonymity

• The precision of a generalization scheme is 1 minus the average height of a generalization, measured over all cells (a sketch follows).

Bayardo and Agrawal, ICDE 2005
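A sketch of the precision metric; normalizing each cell's generalization height by its hierarchy height is an assumption here (the slide says only "average height"):

```python
def precision(heights, hierarchy_heights):
    """Precision of a generalization scheme: 1 minus the average
    (normalized) generalization height over all cells."""
    ratios = [h / m for h, m in zip(heights, hierarchy_heights)]
    return 1 - sum(ratios) / len(ratios)

# Three cells generalized 1, 0, and 2 levels up hierarchies of height 3:
print(precision([1, 0, 2], [3, 3, 3]))  # 1 - 1/3 = 0.666...
```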

Page 20: Privacy vs. Utility

Utility vs. privacy

• Most schemes for ensuring data anonymity focus on defining measures of anonymity, while using ad hoc measures of utility.

• After performing a standard anonymization, Kifer & Gehrke publish carefully chosen marginals of the source data. From these marginals they construct a consistent maximum-entropy distribution, and measure utility as the KL-distance between this distribution and the source (a sketch follows this list).

Kifer & Gehrke. Injecting Utility into Anonymized Datasets. SIGMOD 2006

• Rastogi et al. The Boundary Between Privacy and Utility in Data Publishing.
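A sketch of the KL-based utility measure in the Kifer & Gehrke item, assuming both distributions share a support on which q is nonzero; the direction of the divergence here is a simplification, not necessarily the paper's exact estimator:

```python
import math

def kl_distance(p, q):
    """KL-distance between the source distribution p and the
    maximum-entropy distribution q fitted to the published marginals;
    smaller means the release preserves more of the source."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(kl_distance([0.5, 0.5], [0.75, 0.25]))  # ~0.207 bits
```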

Page 21: Privacy vs. Utility

Computational measures of anonymity

• Privacy statements are phrased in terms of the power of an adversary, rather than the amount of background knowledge the adversary possesses.

Dinur & Nissim. Revealing Information While Preserving Privacy. PODS 2003

• Measuring anonymity via information transfer

• Indistinguishability: a database is private if anything learnable from it can be learned in the absence of the database.

Page 22: Privacy vs. Utility


Anonymity via isolation

• A record is private if it cannot be singled out from its neighbors.

An adversary is defined as an algorithm that takes an anonymized database and some auxiliary information, and outputs a single point q.

An anonymization is successful if the adversary, combining the anonymization with auxiliary information, can do no better at isolation than a weaker adversary with no access to the anonymized data.

Page 23: Privacy vs. Utility


Metrics for quantifying data quality

• Quality of the data resulting from the PPDM process: accuracy, completeness, consistency

• Quality of the data mining results

• Chapter 8.4

Page 24: Privacy vs. Utility

Measures

Oliveira & Zaiane. Privacy Preserving Frequent Itemset Mining, 2002

Page 25: Privacy vs. Utility

Generalization based

• The data quality metric is based on the height of the generalization hierarchies. Data should be generalized in as few steps as possible to preserve maximum utility, but not all generalization steps are equal in the sense of information loss.

• General loss metric (a sketch follows this list)

• Classification metric (Iyengar, KDD 2002)

• Discernibility metric (Bayardo & Agrawal, ICDE 2005)
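A hedged sketch of the general loss metric for categorical cells, under the usual reading of Iyengar's LM: a cell generalized to a hierarchy node covering M_g of the attribute's M leaf values loses (M_g - 1) / (M - 1):

```python
def general_loss(leaf_counts, total_leaves):
    """Average information loss over cells: 0 for an ungeneralized
    cell (1 leaf), 1 for a fully suppressed cell (all leaves)."""
    losses = [(mg - 1) / (total_leaves - 1) for mg in leaf_counts]
    return sum(losses) / len(losses)

# Cells left specific, generalized to a 2-leaf node, and suppressed:
print(general_loss([1, 2, 4], total_leaves=4))  # (0 + 1/3 + 1) / 3
```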

Page 26: Privacy vs. Utility


Statistics-based perturbation