Randomization in Privacy Preserving Data Mining
Agrawal, R., and Srikant, R. Privacy-Preserving Data Mining, ACM SIGMOD’00
The following slides include material from this paper.
Privacy-Preserving Data Mining
• Problem: How do we publish data without compromising individual privacy?
• Solutions: randomization, anonymization
Randomization
• Add random noise to the original dataset
• Challenge
– Is the data still useful for further analysis?
Randomization
• Model: data is distorted by adding random noise
• Original data X = {x1 ... xN}; for each record xi ∈ X, a random value yi from Y = {y1 ... yN} is added, so the new data is Z = {z1 ... zN}, with zi = xi + yi
• yi is a random value
– Uniform, [-α, +α]
– Gaussian, N(0, σ²)
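The additive model above can be sketched in a few lines of Python. This is only an illustration: the function names, the sample ages, and the values chosen for α and σ are assumptions, not taken from the slides.

```python
import random

def perturb_uniform(values, alpha, rng):
    """Return z_i = x_i + y_i with y_i drawn uniformly from [-alpha, +alpha]."""
    return [x + rng.uniform(-alpha, alpha) for x in values]

def perturb_gaussian(values, sigma, rng):
    """Return z_i = x_i + y_i with y_i drawn from N(0, sigma^2)."""
    return [x + rng.gauss(0.0, sigma) for x in values]

rng = random.Random(0)            # fixed seed so the run is repeatable
ages = [23, 35, 41, 58, 62]       # hypothetical original records x_i
z_uniform = perturb_uniform(ages, alpha=10.0, rng=rng)
z_gaussian = perturb_gaussian(ages, sigma=5.0, rng=rng)
print(z_uniform)
print(z_gaussian)
```

Each record is perturbed independently of the others, so the noise can be added one record at a time.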
Reconstruction
• Perturbed data hides the original data distribution, which needs to be reconstructed before data mining
• Given
– x1+y1, x2+y2, ..., xn+yn
– the probability distribution of Y
• Estimate the probability distribution of X
Clifton AusDM‘11
Reconstruction
• Bayes' rule is used to estimate the cumulative distribution function
• Reconstruction algorithm
1. f_X^0 = uniform distribution
2. Repeat the update until the stop criterion is met
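A minimal sketch of the two steps above, assuming a discretized Bayesian update: the domain is cut into intervals, and each interval's probability is re-estimated as the average posterior probability of that interval over all perturbed values. The interval count, the L1 stopping rule, the synthetic two-cluster data, and the names `gaussian_pdf` and `reconstruct` are illustrative assumptions, not taken from the slides.

```python
import math
import random

def gaussian_pdf(sigma):
    """Density of the noise Y ~ N(0, sigma^2)."""
    def pdf(y):
        return math.exp(-0.5 * (y / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
    return pdf

def reconstruct(z, noise_pdf, lo, hi, bins=20, max_iter=50, tol=1e-3):
    """Estimate the distribution of X over [lo, hi] from perturbed values z."""
    width = (hi - lo) / bins
    mids = [lo + (k + 0.5) * width for k in range(bins)]
    p = [1.0 / bins] * bins                      # step 1: uniform start
    # Noise likelihood of every observation at every interval midpoint.
    like = [[noise_pdf(w - m) for m in mids] for w in z]
    for _ in range(max_iter):                    # step 2: repeat the update
        new_p = [0.0] * bins
        used = 0
        for row in like:
            weights = [l * pk for l, pk in zip(row, p)]
            total = sum(weights)
            if total == 0.0:                     # observation unreachable from any interval
                continue
            used += 1
            for k, wgt in enumerate(weights):
                new_p[k] += wgt / total          # Bayes posterior for this observation
        if used == 0:
            raise ValueError("no observation is compatible with the domain")
        new_p = [v / used for v in new_p]
        change = sum(abs(a - b) for a, b in zip(new_p, p))
        p = new_p
        if change < tol:                         # assumed stop criterion
            break
    return mids, p

rng = random.Random(1)
# Hypothetical original data: two clusters, around 30 and around 70.
x = [rng.gauss(30, 5) for _ in range(250)] + [rng.gauss(70, 5) for _ in range(250)]
sigma = 10.0
z = [xi + rng.gauss(0.0, sigma) for xi in x]     # randomized data
mids, p = reconstruct(z, gaussian_pdf(sigma), lo=0.0, hi=100.0)
rec_mean = sum(m * pk for m, pk in zip(mids, p))
print(round(rec_mean, 2))
```

The function only needs the perturbed values and the noise density, which matches the "Given" list on the previous slide; the original values `x` are used here solely to generate test data.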
[Figure: two plots comparing the original, randomized, and reconstructed distributions; panel labels: N(0, 0.25) and (-0.5, 0.5)]
Privacy Metric
• If a value x can be estimated to lie in the interval [α, β] with c% confidence, then the interval width (β - α) defines the amount of privacy at the c% confidence level.
• Example: Age 20-40 (range 20), 95% confidence, 50% privacy with uniform noise: 2α = 20 × 0.5 / 0.95 ≈ 10.5
| Confidence | 50% | 95% | 99.9% |
|---|---|---|---|
| Uniform | 0.5 × 2α | 0.95 × 2α | 0.999 × 2α |
| Gaussian | 1.34 × σ | 3.92 × σ | 6.8 × σ |
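The table entries and the age example can be reproduced with a short calculation. The sketch below assumes the privacy width is the length of the central interval that contains the noise with the given confidence; the function names are illustrative, not from the slides.

```python
from statistics import NormalDist

def uniform_privacy(alpha, confidence):
    """Interval width holding uniform noise on [-alpha, +alpha] with the given confidence."""
    return confidence * 2.0 * alpha

def gaussian_privacy(sigma, confidence):
    """Width of the central interval holding N(0, sigma^2) noise with the given confidence."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return 2.0 * z * sigma

def uniform_two_alpha_for(privacy_width, confidence):
    """Noise range 2*alpha needed to reach a target privacy width."""
    return privacy_width / confidence

attribute_range = 40 - 20                    # Age 20-40
target_width = 0.5 * attribute_range         # 50% privacy
two_alpha = uniform_two_alpha_for(target_width, 0.95)
print(round(two_alpha, 2))                   # noise range for the age example
print(round(gaussian_privacy(1.0, 0.50), 2)) # multiple of sigma at 50% confidence
print(round(gaussian_privacy(1.0, 0.95), 2)) # multiple of sigma at 95% confidence
```

Computed this way, the 99.9% Gaussian width comes out near 6.58 × σ, slightly below the 6.8 × σ quoted in the table, so that entry is best read as the source's figure rather than the output of this formula.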
Decision Tree
Training Decision Tree
• Split points
– Interval boundaries
• Reconstruction algorithm
– Global
– ByClass
– Local
• Dataset
– Synthetic dataset: training set of 100,000 records and testing set of 5,000 records, equally split into two classes
[Figures: series labeled original, randomized, global, ByClass, and local]
Extended Work
• '02: a method was proposed to quantify information loss
– Mutual information
• '07: randomization was evaluated in combination with public information
– Gaussian is better than uniform
– A dataset with an inherent cluster pattern will improve randomization performance
– Varying density and outliers will decrease performance
Multiplicative Randomization
• Rotation randomization
– Data is distorted by an orthogonal matrix
• Projection randomization
– Projects a high-dimensional dataset into a low-dimensional space
• Preserves Euclidean distance (exactly for rotation, approximately for projection), so it can be applied with distance-based classification (KNN, SVM) and clustering (K-means)
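A small sketch of rotation randomization, assuming the orthogonal matrix is built as a product of random plane (Givens) rotations. That construction, the sample points, and the names `random_rotation` and `rotate` are illustrative choices, not taken from the slides.

```python
import math
import random

def random_rotation(dim, rng):
    """Build an orthogonal matrix as a product of random plane (Givens) rotations."""
    q = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    for i in range(dim):
        for j in range(i + 1, dim):
            theta = rng.uniform(0.0, 2.0 * math.pi)
            c, s = math.cos(theta), math.sin(theta)
            for row in q:                      # multiply q by a rotation in the (i, j) plane
                a, b = row[i], row[j]
                row[i] = c * a - s * b
                row[j] = s * a + c * b
    return q

def rotate(points, q):
    """Apply the orthogonal matrix q to every point (p -> q p)."""
    return [[sum(q[r][k] * p[k] for k in range(len(p))) for r in range(len(q))]
            for p in points]

rng = random.Random(42)
points = [[rng.uniform(0, 10) for _ in range(3)] for _ in range(5)]  # hypothetical records
q = random_rotation(3, rng)
rotated = rotate(points, q)

# Pairwise Euclidean distances are unchanged up to floating-point error.
original_d = math.dist(points[0], points[1])
rotated_d = math.dist(rotated[0], rotated[1])
print(abs(original_d - rotated_d) < 1e-9)
```

Because the distances survive the transformation, a distance-based method such as KNN or K-means run on `rotated` sees the same neighborhood structure as on `points`.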
Summary
• Pros: noise is independent of the data, can be applied at data collection time, useful for stream data
• Cons: information loss, curse of dimensionality
Questions?