Privacy Preserving Data Miningfabian/courses/CS600.624/slides/... · Overview What is Data Mining?...

Privacy Preserving Data Mining Moheeb Rajab


Page 1:

Privacy Preserving Data Mining

Moheeb Rajab

Page 2:

Agenda

Overview and Terminology
Motivation
Active Research Areas
    Secure Multi-party Computation (SMC)
    Randomization approach
Limitations
Summary and Insights

Page 3:

Overview

What is Data Mining? Extracting implicit, non-obvious patterns and relationships from a warehouse of data sets.

This information can be useful to increase the efficiency of the organization and aid future plans.

Can be done at an organizational level by establishing a data warehouse.

Can also be done at a global scale.

Page 4:

Data Mining System Architecture

[Figure: Entity I, Entity II, ..., Entity n and a Global Aggregator]

Page 5:

Distributed Data Mining Architecture

[Figure: five chart panels, labeled "Lower scale Mining"]

Page 6:

Challenges

Privacy concerns
Proprietary information disclosure
Concerns about association breaches
Misuse of mining

These concerns provide the motivation for privacy preserving data mining solutions.

Page 7:

Approaches to preserve privacy

Restrict access to data (protect individual records)

Protect both the data and its source:
    Secure Multi-party Computation (SMC)
    Input Data Randomization

No single solution fits all purposes.

Page 8:

SMC vs Randomization

Pinkas et al

[Figure: SMC and Randomization Schemes placed along three axes: Overhead, Accuracy, Privacy]

Page 9:

Secure Multi-party Computation

Multiple parties share the burden of creating the data aggregate.

Final processing, if needed, can be delegated to any party.

Computation is considered secure if each party only knows its input and the result of its computation.

Page 10:

SMC

[Figure: five chart panels]

Each party knows its input and the result of the operation and nothing else

Page 11:

Key Assumptions

The ONLY information that can be leaked is the information that we can get as an overall output from the computation (aggregation) process

Users are not malicious but may be honest-but-curious. All users are supposed to abide by the SMC protocol

Otherwise, the case of malicious participants is not easy to model! [Pinkas et al, Agrawal]

Page 12:

“Tools for Privacy Preserving Distributed Data Mining”, Clifton et al [SIGKDD]

Secure Sum

Given a number of values $x_1, x_2, \ldots, x_n$ belonging to n entities

We need to compute $\sum_{i=1}^{n} x_i$

Such that each entity ONLY knows its input and the result of the computation (the aggregate sum of the data)

Page 13:

Examples (Secure Sum)

Problem: Colluding members

Solution: Divide values into shares and have each share travel along a disjoint path (no site has the same neighbor twice)

[Figure: a ring of sites holding the values 10, 20, 15, 45, 50. The Master picks R = 15, and the running total passed from site to site is R + 10, R + 30, R + 45, R + 90, R + 140.]

Sum = (R + 140) - R = 140
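The ring protocol in the figure can be simulated in a few lines. This is only a sketch under assumptions: the function name `secure_sum`, the public `modulus` (a bound larger than the true sum, so the masked totals wrap around), and running every site inside one process are illustrative choices that do not come from the slides.

```python
import secrets

def secure_sum(values, modulus):
    """Simulate a ring-based secure sum; values[0] belongs to the master site.

    Returns the recovered total and the masked running totals that were
    passed from site to site (the only numbers a site ever sees).
    """
    if not values:
        raise ValueError("need at least one site")
    if not 0 <= sum(values) < modulus:
        # Simulation-only sanity check: no real site could compute this.
        raise ValueError("modulus must exceed the true sum")
    r = secrets.randbelow(modulus)          # master's secret mask R
    running = (r + values[0]) % modulus     # master sends R + x_1
    messages = [running]
    for x in values[1:]:                    # each site adds its own value
        running = (running + x) % modulus
        messages.append(running)
    total = (running - r) % modulus         # master removes the mask
    return total, messages

total, _ = secure_sum([10, 20, 15, 45, 50], modulus=1000)
print(total)  # 140
```

Each intermediate message is shifted by the secret R, so a single honest-but-curious site learns nothing from it. Two colluding neighbors of a site can still subtract their messages to recover that site's value, which is the problem the slide raises.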

Page 14:

Split path solution

[Figure: each value (10, 20, 15, 45, 50) is split into two halves. One half travels along a path masked with R1 = 15, the other along a second path masked with R2 = 12. The running totals on a path are R + 5, R + 15, R + 22.5, R + 45, R + 70.]

Sum = (R1 + 70 - R1) + (R2 + 70 - R2) = 140
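The split-path idea can be sketched the same way. The slide halves every value; the sketch below instead uses random additive shares modulo a public bound, which is a swapped-in assumption, and the names `split_into_shares`, `ring_sum`, and `split_path_secure_sum` as well as the two visiting orders are hypothetical.

```python
import secrets

def split_into_shares(value, n_shares, modulus):
    """Split value into n_shares random pieces that add up to it mod modulus."""
    shares = [secrets.randbelow(modulus) for _ in range(n_shares - 1)]
    shares.append((value - sum(shares)) % modulus)
    return shares

def ring_sum(values, order, modulus):
    """One masked pass around the ring, visiting sites in the given order."""
    r = secrets.randbelow(modulus)
    running = r
    for site in order:
        running = (running + values[site]) % modulus
    return (running - r) % modulus

def split_path_secure_sum(values, orders, modulus):
    """Send the k-th share of every site along the k-th path and add the results."""
    per_site = [split_into_shares(v, len(orders), modulus) for v in values]
    total = 0
    for k, order in enumerate(orders):
        kth_shares = [shares[k] for shares in per_site]
        total = (total + ring_sum(kth_shares, order, modulus)) % modulus
    return total

# Two paths over five sites; no site has the same neighbor on both paths.
orders = [[0, 1, 2, 3, 4], [0, 2, 4, 1, 3]]
print(split_path_secure_sum([10, 20, 15, 45, 50], orders, modulus=1000))  # 140
```

Because a site's neighbors differ on every path, a pair of colluding neighbors can recover at most one share of its value, not the value itself.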

Page 15:

Secure Set Union

Consider n sets $S_1, S_2, \ldots, S_n$

Compute $U = S_1 \cup S_2 \cup S_3 \cup \cdots \cup S_n$

Such that each entity ONLY knows U and nothing else.

Page 16:

Secure Set Union

Using the properties of Commutative Encryption

For any permutations i, j of the keys the following holds:

$E_{K_{i_1}}(\ldots E_{K_{i_n}}(M) \ldots) = E_{K_{j_1}}(\ldots E_{K_{j_n}}(M) \ldots)$

and two different messages $M_1 \neq M_2$ collide only with negligible probability:

$P\left(E_{K_{i_1}}(\ldots E_{K_{i_n}}(M_1) \ldots) = E_{K_{j_1}}(\ldots E_{K_{j_n}}(M_2) \ldots)\right) < \epsilon$
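A short worked check may help show why such a scheme can exist. The slides do not name a cipher, so as an assumption take modular exponentiation, $E_K(M) = M^K \bmod p$, for a public prime $p$:

```latex
\begin{align*}
E_{K_1}(E_{K_2}(M)) &= (M^{K_2})^{K_1} \bmod p \\
                    &= M^{K_1 K_2} \bmod p \\
                    &= (M^{K_1})^{K_2} \bmod p \\
                    &= E_{K_2}(E_{K_1}(M))
\end{align*}
```

The order of the keys therefore does not matter. Decryption exists when $\gcd(K, p - 1) = 1$, because $K$ then has an inverse modulo $p - 1$.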

Page 17:

Secure Set Union

Global union set U.

Each site:
    Encrypts its items
    Creates an array M[n] and adds it to U

Upon receiving U, an entity should encrypt all items in U that it did not encrypt before.

In the end: all entries are encrypted with all keys $K_1, K_2, \ldots, K_n$

Remove the duplicates:
    Identical plaintext will result in the same ciphertext regardless of the order in which the encryption keys are used.

Decrypting U: done by all entities in any order.

Page 18:

Secure Set Union

[Figure: an encrypted item $E_{K_{i_1}}(\ldots E_{K_{i_n}}(M) \ldots)$ shown next to a bit array (0 1 0 ...) whose positions are numbered 1, 2, 3, ..., n]

Page 19:

[Figure: three sites in a ring. Site 1 holds A, site 2 holds C, site 3 holds A. U evolves as it is passed around:]

U = {E1(A)}

U = {E2(E1(A)), E2(C)}

U = {E3(E2(E1(A))), E3(E2(C)), E3(A)}

U = {E3(E2(E1(A))), E1(E3(E2(C))), E1(E3(A))}

U = {E3(E2(E1(A))), E1(E3(E2(C))), E2(E1(E3(A)))}

Problem: computation overhead; the number of exchanged messages is O(n*m)
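The three-site example above can be reproduced with a toy simulation. Everything here is an assumption made for illustration: modular exponentiation stands in for the commutative cipher, `P` is a fixed public prime, items are short strings, and the helper names (`make_key`, `encrypt`, `decrypt`, `encode`, `decode`, `secure_set_union`) are hypothetical. It is not a hardened implementation.

```python
import math
import secrets

P = 2**127 - 1  # public prime modulus (a Mersenne prime), picked for illustration

def make_key():
    """Pick an exponent that is invertible modulo P - 1, so decryption exists."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k

def encrypt(value, key):
    return pow(value, key, P)

def decrypt(value, key):
    return pow(value, pow(key, -1, P - 1), P)  # modular inverse needs Python 3.8+

def encode(item):
    """Map a short string to an integer in [1, P - 1] (toy, reversible encoding)."""
    n = int.from_bytes(item.encode("utf-8"), "big")
    if not 0 < n < P:
        raise ValueError("item is empty or too long for this toy encoding")
    return n

def decode(n):
    return n.to_bytes((n.bit_length() + 7) // 8, "big").decode("utf-8")

def secure_set_union(site_items):
    """Encrypt every item under all keys, drop duplicates, then decrypt."""
    keys = [make_key() for _ in site_items]
    n = len(keys)
    union = set()
    for owner, items in enumerate(site_items):
        for item in items:
            c = encode(item)
            for hop in range(n):            # owner encrypts first, then the ring
                c = encrypt(c, keys[(owner + hop) % n])
            union.add(c)                    # equal plaintexts collapse here
    plain = set()
    for c in union:
        for key in keys:                    # any decryption order works
            c = decrypt(c, key)
        plain.add(decode(c))
    return plain

print(sorted(secure_set_union([{"A"}, {"C"}, {"A"}])))  # ['A', 'C']
```

The two copies of A are encrypted in different key orders (1, 2, 3 and 3, 1, 2) yet produce the same ciphertext, which is what lets duplicates be removed without revealing who contributed them.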

Page 20:

Problems with SMC

Scalability
High overhead
Details of the trust model assumptions
    Users are honest and follow the protocol

Page 21:

Randomization Approach

“Privacy Preserving Data Mining”, Agrawal et al [SIGKDD]

Applied generally to provide estimates for data distributions rather than single point estimates

A user is allowed to alter the value provided to the aggregator

The alteration scheme should be known to the aggregator

The aggregator estimates the overall global distribution of the input by removing the randomization from the aggregate data

Page 22:

Randomization Approach (cntd.)

Assumptions:
    Users are willing to divulge some form of their data
    The aggregator is not malicious but may be honest-but-curious (it follows the protocol)

Two main data perturbation schemes:
    Value-class membership (Discretization)
    Value distortion

Page 23:

Randomization Methods

Value Distortion Method

Given a value $x_i$ the client is allowed to report a distorted value $(x_i + r)$ where $r$ is a random variable drawn from a known distribution

Uniform Distribution: $\mu = 0$, $[-\alpha, +\alpha]$

Gaussian Distribution: $\mu = 0$, $\sigma$
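A minimal sketch of the client side of value distortion follows. The function names and the optional `rng` parameter are assumptions introduced here for illustration; only the two noise distributions come from the slide.

```python
import random

def distort_uniform(x, alpha, rng=random):
    """Report x + r with r drawn uniformly from [-alpha, +alpha]."""
    return x + rng.uniform(-alpha, alpha)

def distort_gaussian(x, sigma, rng=random):
    """Report x + r with r drawn from a Gaussian with mean 0 and std. dev. sigma."""
    return x + rng.gauss(0.0, sigma)

rng = random.Random(42)          # seeded only to make the demo repeatable
ages = [23, 35, 41, 58]
reported = [distort_uniform(a, alpha=10, rng=rng) for a in ages]
print(reported)                  # each entry lies within 10 of the true age
```

The aggregator only ever sees `reported`; it knows `alpha` (or `sigma`) but not the individual noise draws.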

Page 24:

Quantifying the privacy of different randomization schemes

                  Confidence (α)
Distribution      50 %        95 %         99.9 %
Discretization    0.5 x W     0.95 x W     0.999 x W
Uniform           0.5 x 2α    0.95 x 2α    0.999 x 2α
Gaussian          1.34 x σ    3.92 x σ     6.8 x σ

[Figure: an interval of width W; a uniform noise range from −α to +α]

Gaussian Distribution provides the best accuracy at higher confidence levels
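The table entries can be read as the width of the interval that contains the true value at a given confidence. The sketch below recomputes those widths; the function names are hypothetical, and treating the Gaussian entry as twice the two-sided normal quantile is an interpretation rather than something the slides state.

```python
from statistics import NormalDist

def discretization_width(confidence, w):
    """Interval width when a value is only known to fall in a bin of width w."""
    return confidence * w

def uniform_width(confidence, alpha):
    """Interval width for noise drawn uniformly from [-alpha, +alpha]."""
    return confidence * 2 * alpha

def gaussian_width(confidence, sigma):
    """Interval width for zero-mean Gaussian noise with std. dev. sigma."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return 2 * z * sigma

for c in (0.5, 0.95, 0.999):
    print(c, uniform_width(c, 1.0), round(gaussian_width(c, 1.0), 2))
```

Under this reading the Gaussian widths come out near 1.35 x σ, 3.92 x σ, and 6.58 x σ, so the last value is somewhat below the 6.8 x σ shown in the table.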

Page 25:

Problem Statement

[Figure: Entity I, Entity II, ..., Entity m send randomized values to the Global Aggregator.]

Entity I: $x_{11} + y_{11}, x_{12} + y_{12}, \ldots, x_{1k} + y_{1k}$

Entity II: $x_{21} + y_{21}, x_{22} + y_{22}, \ldots, x_{2k} + y_{2k}$

Entity m: $x_{m1} + y_{m1}, x_{m2} + y_{m2}, \ldots, x_{mk} + y_{mk}$

Global Aggregator: $x_1 + y_1, x_2 + y_2, \ldots, x_n + y_n$ and $f_Y(a)$ are fed to an Estimator, which outputs $f_X(z)$

Page 26:

Reconstruction of the Original Distribution

The reconstruction problem can be viewed in the general framework of “Inverse Problems”

Inverse Problems: describing a system's internal structure from indirect, noisy data.

Bayesian Estimation is an effective tool for such settings

Page 27:

Formal problem statement

Given a one-dimensional array of randomized data $x_1 + y_1, x_2 + y_2, \ldots, x_n + y_n$

Where the $x_i$'s are iid random variables, each with the same distribution as the random variable X

And the $y_i$'s are realizations of a globally known random distribution with CDF $F_Y$

Purpose: Estimate $F_X$

Page 28:

Background: Bayesian Inference

An estimation method that involves collecting observational data and using it as a tool to adjust (either support or refute) a prior belief.

The previous knowledge (hypothesis) has an established probability called the prior probability

The probability of the adjusted hypothesis, given the new observational data, is called the posterior probability

Page 29:

Bayesian Inference

Let $P(H_0)$ be the prior probability; then Bayes' rule states that the posterior probability of $(H_0)$ given an observation (D) is given by:

$P(H_0|D) = \frac{P(D|H_0)\,P(H_0)}{P(D)}$

Bayes' rule is a cyclic application of the general form of the joint probability theorem:

$P(D, H_0) = P(H_0|D)\,P(D)$

Page 30:

Bayesian Inference (Classical Example)

Two boxes:
    Box-I: 30 Red balls and 10 White balls
    Box-II: 20 Red balls and 20 White balls

A person draws a Red ball; what is the probability that the ball is from Box-I?

Prior probability: P(Box-I) = 0.5

From the data we know that:
    P(Red|Box-I) = 30/40 = 0.75
    P(Red|Box-II) = 20/40 = 0.5

Page 31:

Example (cntd.)

Now, given the new observation (the Red ball), we want to know the posterior probability of Box-I (i.e. P(Box-I | Red))

$P(\text{Box-I}|\text{RED}) = \frac{P(\text{RED}|\text{Box-I})\,P(\text{Box-I})}{P(\text{RED})}$

$P(\text{RED}) = P(\text{RED}, \text{Box-I}) + P(\text{RED}, \text{Box-II})$

$P(\text{RED}) = P(\text{RED}|\text{Box-I})\,P(\text{Box-I}) + P(\text{RED}|\text{Box-II})\,P(\text{Box-II})$

$P(\text{RED}) = 0.5 \times 0.75 + 0.5 \times 0.5$

Page 32:

Example (cntd.)

Computing the joint probability:

$P(\text{RED}) = P(\text{RED}|\text{Box-I})\,P(\text{Box-I}) + P(\text{RED}|\text{Box-II})\,P(\text{Box-II})$

$P(\text{RED}) = 0.5 \times 0.75 + 0.5 \times 0.5$

Substituting,

$P(\text{Box-I}|\text{RED}) = \frac{0.75 \times 0.5}{0.5 \times 0.75 + 0.5 \times 0.5} = 0.6$

The posterior probability of Box-I is amplified by the observation of the Red ball
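The two-box calculation can be checked mechanically. The helper `posterior` below is a hypothetical name, and exact fractions are used only to avoid floating-point rounding in the demo.

```python
from fractions import Fraction

def posterior(prior, likelihood):
    """Return P(H | D) for every hypothesis H from P(H) and P(D | H)."""
    evidence = sum(prior[h] * likelihood[h] for h in prior)   # P(D)
    return {h: prior[h] * likelihood[h] / evidence for h in prior}

prior = {"Box-I": Fraction(1, 2), "Box-II": Fraction(1, 2)}
p_red = {"Box-I": Fraction(30, 40), "Box-II": Fraction(20, 40)}

result = posterior(prior, p_red)
print(result["Box-I"])   # 3/5
```

The prior of 1/2 for Box-I rises to 3/5 once the Red ball is observed, matching the 0.6 on the slide.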

Page 33:

Back: Formal problem statement

Given a one-dimensional array of randomized data $x_1 + y_1, x_2 + y_2, \ldots, x_n + y_n$

Where the $x_i$'s are iid random variables, each with the same distribution as the random variable X

And the $y_i$'s are realizations of a globally known random distribution with CDF $F_Y$

Purpose: Estimate $F_X$

Page 34:

Continuous probability distributions

$P\{r \le z\} = \int_{-\infty}^{z} f_X(k)\,dk = CDF(z) = F_X(z)$

$P\{r = z\} = 0$

$\int_{-\infty}^{+\infty} f_X(k)\,dk = 1$

Page 35:

CDF and PDF

Page 36:

Estimation of $F_X$

Bayes Rule:

$P(H_0|D) = \frac{P(D|H_0)\,P(H_0)}{P(D)}$

Posterior Probability

$F_{X_1}(a) = \int_{-\infty}^{a} f_{X_1}(z \mid X_1 + Y_1 = w_1)\,dz$

Applying Bayes rule

$F_{X_1}(a) = \int_{-\infty}^{a} \frac{f_{X_1+Y_1}(w_1 \mid X_1 = z)\, f_{X_1}(z)}{f_{X_1+Y_1}(w_1)}\,dz$

Page 37:

Estimation of $F_X$

We want to evaluate $f_{X_1+Y_1}(w_1)$

$f_{X_1+Y_1}(w_1) = \int_{-\infty}^{\infty} f_{X_1+Y_1}(w_1 \mid X_1 = k)\, f_{X_1}(k)\,dk$

Substituting:

$F_{X_1}(a) = \frac{\int_{-\infty}^{a} f_{X_1+Y_1}(w_1 \mid X_1 = z)\, f_{X_1}(z)\,dz}{\int_{-\infty}^{\infty} f_{X_1+Y_1}(w_1 \mid X_1 = k)\, f_{X_1}(k)\,dk}$

Page 38:

Estimation of $F_X$

Simplification (independence): because the noise $Y_i$ is independent of $X_i$, $f_{X_i+Y_i}(w_i \mid X_i = z) = f_Y(w_i - z)$, so for any observation $w_i$:

$F_{X_i}(a) = \frac{\int_{-\infty}^{a} f_Y(w_i - z)\, f_X(z)\,dz}{\int_{-\infty}^{\infty} f_Y(w_i - k)\, f_X(k)\,dk}$

Page 39:

Estimation of $F_X$

For all n observations:

$F_X(a) = \frac{1}{n} \sum_{i=1}^{n} \frac{\int_{-\infty}^{a} f_Y(w_i - z)\, f_X(z)\,dz}{\int_{-\infty}^{\infty} f_Y(w_i - k)\, f_X(k)\,dk}$

Page 40:

Estimation of the PDF $f_X$

$f_X$ is just the derivative of the CDF:

$f_X(a) = \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y(w_i - a)\, f_X(a)}{\int_{-\infty}^{\infty} f_Y(w_i - k)\, f_X(k)\,dk}$

Page 41:

Algorithm

$f_X^0$ := Uniform distribution

j := 0

While (not stopping condition):

    $f_X^{j+1}(a) := \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y(w_i - a)\, f_X^j(a)}{\int_{-\infty}^{\infty} f_Y(w_i - z)\, f_X^j(z)\,dz}$

    j := j + 1
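The iteration above can be tried on a discretized grid. This is a sketch under stated assumptions: the density is represented by its values at equally spaced grid points, the integral becomes a Riemann sum, and the stopping rule is a simple threshold on the change between successive estimates rather than a goodness-of-fit test. The names `reconstruct_density` and `gaussian_pdf` and all parameter values are hypothetical.

```python
import math
import random

def gaussian_pdf(y, sigma):
    """Density of a zero-mean Gaussian with standard deviation sigma."""
    return math.exp(-0.5 * (y / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def reconstruct_density(w, noise_pdf, grid, max_iters=100, tol=1e-3):
    """Estimate f_X at the points of `grid` from randomized values w_i = x_i + y_i."""
    step = grid[1] - grid[0]
    m = len(grid)
    f = [1.0 / (m * step)] * m                       # f_X^0 := uniform
    # kernel[i][g] = f_Y(w_i - a_g), computed once and reused every round
    kernel = [[noise_pdf(wi - a) for a in grid] for wi in w]
    for _ in range(max_iters):
        new_f = [0.0] * m
        for row in kernel:
            denom = sum(k * fa for k, fa in zip(row, f)) * step
            if denom == 0.0:
                continue                             # w_i unreachable from the grid
            for g in range(m):
                new_f[g] += row[g] * f[g] / denom
        new_f = [v / len(w) for v in new_f]
        change = sum(abs(a - b) for a, b in zip(new_f, f)) * step
        f = new_f
        if change < tol:                             # successive estimates agree
            break
    return f

random.seed(0)
sigma = 0.25
xs = [random.gauss(0.5, 0.1) for _ in range(300)]    # hidden original values
ws = [x + random.gauss(0.0, sigma) for x in xs]      # what the aggregator sees
grid = [i / 20 for i in range(21)]                   # assumed support [0, 1]
estimate = reconstruct_density(ws, lambda y: gaussian_pdf(y, sigma), grid)
```

The aggregator needs only `ws` and the noise density; `xs` appears here solely to generate test data. Each round keeps the estimate non-negative and integrating to one on the grid.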

Page 42:

Stopping Criteria

The algorithm should terminate if: $f_X^{j+1}(a) \approx f_X^{j}(a)$

For each round a $\chi^2$ goodness of fit test is performed.

Iteration is stopped when the difference between the two estimates is too small (lower than a certain threshold)

Page 43:

Evaluation

Gaussian Randomizing Function µ=0, σ = 0.25

Page 44:

Evaluation

Uniform Randomizing Function [-0.5,0.5]

Page 45:

How is this different from Kalman Estimator?

Both are estimation techniques

Kalman is stateless

In the Kalman filter case we know the distribution, and estimation is used to validate whether the trend of the data matches that distribution

In Bayesian Inference the observation data is used to adjust the prior hypothesis (probability distribution)

Page 46:

Is the Problem Solved?

Suppose a client randomizes Age records using a uniform random variable on [-50, 50]

If the aggregator receives the value 120, it knows with 100% confidence that the actual age is $\ge 70$

Simple randomization does not guarantee absolute privacy
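The inference the aggregator makes here is plain interval arithmetic. The function name `feasible_interval` and the optional domain bounds `lo` and `hi` are assumptions added for illustration.

```python
def feasible_interval(w, alpha, lo=float("-inf"), hi=float("inf")):
    """Range of true values x consistent with w = x + r, where r is in [-alpha, alpha].

    lo and hi optionally clip the range to what is known about the domain.
    """
    return max(lo, w - alpha), min(hi, w + alpha)

print(feasible_interval(120, 50))          # (70, 170)
print(feasible_interval(-40, 50, lo=0))    # (0, 10)
```

Extreme reported values are the leaky ones: a reported 120 pins the age to at least 70, and a reported -40 combined with the knowledge that ages are non-negative pins it to at most 10.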

Page 47:

How to achieve a better randomization scheme

“Limiting Privacy Breaches in Privacy Preserving Data Mining”, Evfimievski et al

Define an evaluation metric of how privacy preserving a scheme is.

Based on the developed metric, develop a randomization scheme that abides by this metric

Page 48:

How privacy preserving is a scheme?

Information Theoretic Approach:
    Computes the average information disclosed in a randomized attribute by computing the mutual information between the actual and the randomized attribute

Privacy breach:
    Defines a criterion that should be satisfied for a randomization scheme to be privacy preserving

Page 49:

What is a privacy breach?

A privacy breach occurs when the disclosure of a randomized value $y_i$ to the aggregator reveals that a certain property $Q(x)$ of the “individual” input $x_i$ holds with high probability

Page 50:

Privacy Breach

Back to Bayes’

Prior probability: $P(Q(x))$, where $Q(x)$ is the property

Posterior probability: $P(Q(x) \mid y_i)$

Page 51:

Amplification

Is defined in terms of the transition probability $P[x \to y]$, where y is a fixed randomized output value

Intuitive definition: if there are many $x_i$'s that can be mapped to y by the randomizing scheme, then disclosing y gives little information about $x_i$

We say we amplify the probability $P[x \to y]$

Page 52:

Amplification factor

Let R be a randomization operator and $y \in V_y$ a randomized value of x.

Revealing R will not cause a privacy breach if:

$\frac{p_2 (1 - p_1)}{p_1 (1 - p_2)} > \gamma$
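A small sketch of how this condition could be checked for a concrete scheme. It assumes, following the usual reading of Evfimievski et al., that $\gamma$ bounds the ratio $P[x_1 \to y] / P[x_2 \to y]$ over all inputs, and that $p_1 < p_2$ are the prior and posterior thresholds that define a breach. The function names and the randomized-response example are hypothetical.

```python
def amplification(transition, y):
    """Largest ratio P[x1 -> y] / P[x2 -> y] over all inputs x1, x2.

    transition[x][y] is the probability that input x is randomized to y.
    """
    probs = [row[y] for row in transition.values()]
    if min(probs) == 0:
        return float("inf")       # some input can never produce y
    return max(probs) / min(probs)

def breach_ruled_out(p1, p2, gamma):
    """True when p2 (1 - p1) / (p1 (1 - p2)) > gamma, for 0 < p1 < p2 < 1."""
    if not 0 < p1 < p2 < 1:
        raise ValueError("need 0 < p1 < p2 < 1")
    return p2 * (1 - p1) / (p1 * (1 - p2)) > gamma

# Randomized response on one bit: keep the true bit with probability 0.75.
transition = {0: {0: 0.75, 1: 0.25}, 1: {0: 0.25, 1: 0.75}}
gamma = amplification(transition, y=1)
print(gamma)                                # 3.0
print(breach_ruled_out(0.1, 0.5, gamma))    # True
print(breach_ruled_out(0.4, 0.5, gamma))    # False
```

With $\gamma = 3$, a property with prior probability 0.1 cannot be pushed to a posterior of 0.5 by seeing y, but the bound says nothing that strong about a property whose prior is already 0.4.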

Page 53:

Summary

No one solution can fit all.

Which area looks more promising?

Can we create randomization schemes that are robust across a wide range of applications and different distributions of data?

How to deal with the case of malicious participants?