Additive Data Perturbation: the Basic Problem and Techniques
## Outline

- Motivation
- Definition
- Privacy metrics
- Distribution reconstruction methods
- Privacy-preserving data mining with additive data perturbation
- Summary

Note: the focus is on papers 10 and 11.
## Motivation

- Web-based computing
- Observations:
  - Only a few sensitive attributes need protection
  - Individual users can perform the protection at low cost
  - Some data mining algorithms work on distributions instead of individual records

(Figure: multiple users submit private info to web apps, which collect the data.)
## Definition of dataset

- A column-by-row table: each row is a record (a vector), and each column represents an attribute
- Also called multidimensional data

| A  | B   | C   |
|----|-----|-----|
| 10 | 1.0 | 100 |
| 12 | 2.0 | 20  |

Each row is a 3-dimensional record; the table shows 2 records in a 3-attribute dataset.
## Additive perturbation

- Definition: Z = X + Y, where X is the original value, Y is random noise, and Z is the perturbed value
- The perturbed data Z and the parameters of Y are published, e.g., Y is Gaussian N(0, 1)
- History: used in statistical databases to protect sensitive attributes (late 80s to 90s) [see paper 14]
- Benefits:
  - Allows distribution reconstruction
  - Allows each individual user to do the perturbation, since the noise distribution is published
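As a minimal illustration of the definition above (the column values and the noise scale are made up), each user adds independent Gaussian noise locally, and only the perturbed values plus the noise parameters are published:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sensitive column X (values are made up).
x = np.array([52.0, 61.5, 48.2, 70.3, 55.9])

# Each user adds independent Gaussian noise Y ~ N(0, sigma^2) locally.
sigma = 10.0
y = rng.normal(loc=0.0, scale=sigma, size=x.shape)

# Only Z = X + Y and the noise parameters (mean 0, std sigma) are published.
z = x + y
```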
## Applications in data mining

- Distribution reconstruction algorithms: Rakesh's algorithm, the Expectation-Maximization (EM) algorithm
- Column-distribution based algorithms: decision tree, naïve Bayes classifier
## Major issues

- Privacy metrics
- Distribution reconstruction algorithms
- Metrics for loss of information
- The tradeoff between loss of information and privacy
## Privacy metrics for additive perturbation

- Variance/confidence based definition
- Mutual information based definition
## Variance/confidence based definition

- Based on the attacker's view: value estimation
  - The attacker knows the perturbed data and the noise distribution, but has no other prior knowledge
- Estimation method: from a perturbed value, derive a confidence interval, i.e., the range that contains the real value with c% probability
- Y has zero mean, so its standard deviation σ is the key factor: var(Z − X) = var(Y) = σ²
- Given Z, the real value X lies within Z ± kσ with c% confidence. We often drop the confidence level c% and simply use σ to represent the difficulty of value estimation.
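A small sketch of the attacker's estimate under these assumptions (the published value and noise scale are hypothetical; 1.96 is the standard normal quantile for a 95% two-sided interval):

```python
# Published: a perturbed value z and the noise std sigma (numbers are made up).
sigma = 10.0
z = 57.3
k = 1.96   # standard normal quantile for a 95% two-sided interval

# The attacker's 95% confidence interval for the original value X.
low, high = z - k * sigma, z + k * sigma
```

A larger σ widens the interval, which is exactly why σ alone is used as the difficulty measure.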
## Problem with the var/conf metric

- No knowledge about the original data is incorporated
  - The original data distribution, which distribution reconstruction recovers in additive perturbation, may be known a priori in some applications
- Other prior knowledge may enable more types of attacks; privacy evaluation needs to incorporate these attacks
## Mutual information based method

- Incorporates the original data distribution
- Concept: uncertainty (entropy) captures the difficulty of estimation, i.e., the amount of privacy
- Intuition: knowing the perturbed data Z and the distribution of the noise Y, how much is the uncertainty of X reduced?
  - If Z and Y do not help estimate X, all uncertainty of X is preserved: privacy = 1
  - Otherwise: 0 ≤ privacy < 1
## Definition of mutual information

- Entropy h(A) evaluates the uncertainty of A: the higher the entropy, the harder A is to estimate
  - Among distributions on a fixed interval, the uniform distribution has the largest entropy
- Conditional entropy h(A|B): knowing the random variable B, how much uncertainty of A remains?
  - If B is not independent of A, the uncertainty of A is reduced (B helps explain A), i.e., h(A|B) < h(A)
- Mutual information: I(A;B) = h(A) − h(A|B), the information B brings when estimating A
  - Note: I(A;B) = I(B;A)
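These definitions can be checked numerically. The sketch below uses a made-up joint distribution over two binary variables and computes h(A), h(A|B) via the identity h(A|B) = h(A,B) − h(B), and I(A;B):

```python
import numpy as np

# A made-up joint distribution p(a, b) over two binary variables.
p = np.array([[0.4, 0.1],
              [0.1, 0.4]])

pa = p.sum(axis=1)   # marginal of A
pb = p.sum(axis=0)   # marginal of B

def h(dist):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    dist = dist[dist > 0]
    return float(-(dist * np.log2(dist)).sum())

h_a = h(pa)                   # h(A)
h_ab = h(p.ravel())           # joint entropy h(A,B)
h_a_given_b = h_ab - h(pb)    # h(A|B) = h(A,B) - h(B)
mi = h_a - h_a_given_b        # I(A;B) = h(A) - h(A|B)
```

Here B is correlated with A, so h(A|B) < h(A) and I(A;B) > 0, as the slide states.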
## Inherent privacy of a random variable

- Use a uniform variable as the reference: 2^h(A); this makes the definition consistent with Rakesh's approach
- MI-based privacy metric: the lost privacy P(A|B) = 1 − 2^(−I(A;B))
  - I(A;B) = 0: B does not help estimate A, privacy is fully preserved, and the lost privacy P(A|B) = 0
  - I(A;B) > 0: 0 < P(A|B) < 1
- Calculation for additive perturbation: I(X;Z) = h(Z) − h(Z|X) = h(Z) − h(Y)
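For Gaussian noise this calculation has a closed form: if X ~ N(0, σx²) and Y ~ N(0, σy²) are independent, then Z ~ N(0, σx² + σy²), the 2πe terms in the differential entropies cancel, and I(X;Z) = h(Z) − h(Y) = ½·log₂(1 + σx²/σy²). A small sketch (the function name is ours):

```python
import math

def lost_privacy_gaussian(sigma_x, sigma_y):
    """I(X;Z) and lost privacy P = 1 - 2^(-I) for X ~ N(0, sx^2), Y ~ N(0, sy^2).
    Uses I(X;Z) = h(Z) - h(Y) = 0.5 * log2(1 + sx^2 / sy^2)."""
    mi = 0.5 * math.log2(1.0 + sigma_x ** 2 / sigma_y ** 2)
    return mi, 1.0 - 2.0 ** (-mi)

mi, p = lost_privacy_gaussian(sigma_x=1.0, sigma_y=1.0)   # I = 0.5 bits
```

Stronger noise (larger σy) lowers I(X;Z) and hence the lost privacy.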
## Distribution reconstruction

- Problem: Z = X + Y
  - Know the noise distribution F_Y
  - Know the perturbed values z1, z2, …, zn
  - Estimate the distribution F_X
- Basic methods: Rakesh's method and EM estimation
## Rakesh's algorithm (paper 10)

- Finds the posterior distribution P(X | X+Y); three key points:
  - Bayes rule: P(X | X+Y) = P(X+Y | X) P(X) / P(X+Y)
  - Conditional probability: f_{X+Y}(X+Y = w | X = x) = f_Y(w − x)
  - The density estimate at a point a averages the estimates from all samples, each plugging in the current estimate of f_X(a)
## The iterative algorithm

- Update rule (paper 10): f_X^{j+1}(a) = (1/n) Σ_{i=1}^{n} f_Y(z_i − a) f_X^j(a) / ∫ f_Y(z_i − t) f_X^j(t) dt
- Stop criterion: the difference between two consecutive estimates of f_X is small
## Make it more efficient

- Partition the range of X into bins and discretize the previous formula
- Notation: m(x) is the mid-point of the bin containing x; L_t is the length of interval t
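A minimal sketch of the discretized iteration under the assumptions above (Gaussian noise, uniform initial estimate, stop when consecutive estimates are close); all names and the toy data are ours, not from the paper:

```python
import numpy as np

def gaussian_pdf(u, sigma):
    return np.exp(-0.5 * (u / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def reconstruct(z, edges, sigma, iters=100, tol=1e-6):
    """Discretized iterative reconstruction of the density of X on the
    given bins from perturbed values z = x + Gaussian noise."""
    mids = 0.5 * (edges[:-1] + edges[1:])      # m(x): bin mid-points
    widths = np.diff(edges)                    # L_t: interval lengths
    dens = np.full(mids.size, 1.0 / (edges[-1] - edges[0]))  # uniform start
    fy = gaussian_pdf(z[:, None] - mids[None, :], sigma)     # f_Y(z_j - m_t)
    for _ in range(iters):
        post = fy * dens                       # Bayes numerator per (sample, bin)
        post /= (post * widths).sum(axis=1, keepdims=True)   # normalize rows
        new = post.mean(axis=0)                # average over all samples
        if np.abs(new - dens).max() < tol:     # stop when estimates stabilize
            return mids, new
        dens = new
    return mids, dens

rng = np.random.default_rng(1)
x = rng.uniform(2.0, 4.0, size=2000)           # hidden originals (toy data)
z = x + rng.normal(0.0, 1.0, size=x.size)      # published perturbed values
mids, dens = reconstruct(z, np.linspace(0.0, 6.0, 25), sigma=1.0)
```

On this toy data most of the reconstructed mass falls back into [2, 4], where the hidden originals actually lie.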
## Weakness of Rakesh's algorithm

- No convergence proof: we do not know whether the iteration gives the globally optimal result
## EM algorithm (paper 11)

- Uses discretized bins to approximate the distribution; the density (height) of bin i is denoted θ_i
- Maximum Likelihood Estimation (MLE): x1, x2, …, xn are independent and identically distributed
  - Joint distribution: f(x1, x2, …, xn | θ) = f(x1 | θ) f(x2 | θ) … f(xn | θ)
  - MLE principle: find the θ that maximizes f(x1, x2, …, xn | θ); in practice, maximize log f(x1, x2, …, xn | θ) = Σ log f(xi | θ)
## Basic idea of the EM algorithm

- Q(θ, θ̂) is the expected log-likelihood, where θ = (θ1, θ2, …, θk) is the vector of bin densities and θ̂ is the previous estimate of θ
- EM algorithm:
  1. Initialize θ̂ to the uniform distribution
  2. In each iteration, find the θ that maximizes Q(θ, θ̂) based on the previous estimate θ̂ and the perturbed values z
- A sample zj can originate from bin i when zj − upper(i) ≤ Y ≤ zj − lower(i)
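One E-step/M-step pass under these assumptions can be sketched as follows. With Gaussian noise, the probability that zj came from bin i uses the interval condition above and is a difference of normal CDFs; the helper names and toy data are ours:

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(u, sigma):
    """CDF of the N(0, sigma^2) noise."""
    return 0.5 * (1.0 + erf(u / (sigma * sqrt(2.0))))

def em_step(theta, z, edges, sigma):
    """One EM pass. E-step: for each sample z_j, the posterior probability
    it came from bin i is proportional to
    theta_i * P(z_j - upper(i) <= Y <= z_j - lower(i)).
    M-step: average these posteriors over all samples."""
    k = theta.size
    new = np.zeros(k)
    for zj in z:
        w = np.array([theta[i] * (norm_cdf(zj - edges[i], sigma)
                                  - norm_cdf(zj - edges[i + 1], sigma))
                      for i in range(k)])
        new += w / w.sum()          # posterior over bins for this sample
    return new / z.size

rng = np.random.default_rng(2)
x = rng.uniform(1.0, 3.0, size=500)            # hidden originals (toy data)
z = x + rng.normal(0.0, 0.5, size=x.size)
edges = np.linspace(0.0, 4.0, 9)               # 8 equal bins over the domain
theta = np.full(8, 1.0 / 8.0)                  # start from uniform
for _ in range(30):
    theta = em_step(theta, z, edges, sigma=0.5)
```

Here θ holds bin masses; with equal-width bins these equal the bin densities up to the constant bin width.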
## Properties of the EM algorithm

- Unique global optimal solution: θ̂ converges to the MLE solution
## Evaluating loss of information

- The information that additive perturbation wants to preserve: the column distribution
- First metric: the difference between the estimated and the original distribution
## Evaluating loss of information: indirect metric

- Modeling quality: the accuracy of the classifier, if the data is used for classification modeling
- Evaluation method: compare the accuracy of a classifier trained on the original data with that of a classifier trained on the reconstructed distribution
## DM with additive perturbation

- Example: decision tree; a brief introduction to the decision tree algorithm
- There are many versions; here we use one that works on continuous attributes
## Split evaluation

- gini(S) = 1 − Σ_j p_j², where p_j is the relative frequency of class j in S
- gini_split(S) = (n1/n)·gini(S1) + (n2/n)·gini(S2); the smaller, the better
- Procedure:
  - Get the distribution of each attribute
  - Scan through each bin of the attribute and calculate the gini_split index (problem: how to determine p_j?)
  - Pick the split with the minimum gini_split
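A direct transcription of these formulas (the function names are ours); a pure split scores 0, a perfectly mixed two-class split scores 0.5:

```python
import numpy as np

def gini(counts):
    """gini(S) = 1 - sum_j p_j^2, with p_j the relative class frequencies."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - float((p ** 2).sum())

def gini_split(counts1, counts2):
    """Weighted gini of a binary split: n1/n * gini(S1) + n2/n * gini(S2)."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return n1 / n * gini(counts1) + n2 / n * gini(counts2)

print(gini([5, 5]))                  # 0.5
print(gini_split([10, 0], [0, 10]))  # 0.0
```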
## An approximate method to determine p_j

- The original domain is partitioned into m bins; reconstruction gives a distribution over the bins: n1, n2, …, nm
- Sort the perturbed data by the target attribute and assign the records sequentially to the bins according to that distribution
- Read off the class labels associated with the records in each bin
- Errors occur because perturbed values determine which bin each record is assigned to
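The assignment step above can be sketched as follows (a hypothetical helper; the reconstructed counts n1, …, nm are passed as bin_counts, and the toy data is ours):

```python
import numpy as np

def assign_to_bins(z, labels, bin_counts):
    """Sort records by the perturbed attribute, then deal them out
    sequentially so bin i receives bin_counts[i] records; return the
    per-bin class counts, which give the p_j for the gini computation."""
    order = np.argsort(z)
    classes = np.unique(labels)
    per_bin = []
    start = 0
    for c in bin_counts:
        chunk = labels[order[start:start + c]]
        per_bin.append([int((chunk == cl).sum()) for cl in classes])
        start += c
    return per_bin

# Toy data: 6 perturbed values, 2 classes, reconstruction says 3 records per bin.
z = np.array([0.9, 2.1, 0.4, 2.8, 1.2, 3.0])
labels = np.array([0, 1, 0, 1, 0, 1])
print(assign_to_bins(z, labels, [3, 3]))   # [[3, 0], [0, 3]]
```

When noise moves a record across a bin boundary in sorted order, it is counted in the wrong bin, which is exactly the error source the slide mentions.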
## When to reconstruct the distribution

- Global: calculate once
- By class: calculate once per class
- Local: calculate by class at each tree node
- Empirical study shows By-class and Local are more effective
## Problems with papers 10 and 11

- Privacy evaluation: did not consider in-depth attacking methods or data reconstruction methods
- Loss of information: negatively related to privacy, but not directly related to modeling
  - Accuracy of distribution reconstruction vs. accuracy of the classifier?
## Summary

- We discussed the basic methods of additive perturbation: definition, privacy metrics, distribution reconstruction
- The privacy evaluation is incomplete: attacks are covered in the next class