PPID3 AICCSA08

Saeed Samet and Ali Miri School of Information Technology and Engineering University of Ottawa Privacy Preserving ID3 using Gini Index over Horizontally Partitioned Data


Privacy-Preserving ID3, presented at the AICCSA 2008 conference in Doha, Qatar, April 2008


Page 1: PPID3 AICCSA08

Saeed Samet and Ali Miri

School of Information Technology and Engineering

University of Ottawa

Privacy Preserving ID3 using Gini Index over

Horizontally Partitioned Data

Page 2: PPID3 AICCSA08

Outline

Our Motivation

Decision Tree and ID3

Information Gain, Entropy, and Gini Index

Privacy-Preserving Data Mining

Background

Our Main Protocol and Sub-Protocols

Complexity

Future Work

Page 3: PPID3 AICCSA08

Our Motivation

All previous work on privacy-preserving decision trees uses entropy

The Gini Index can also be used to compute information gain

Using the Gini Index, the largest class goes into one pure node, while the other classes go into the other node

Entropy normally tries to create a balanced tree

Page 4: PPID3 AICCSA08

Decision Tree and ID3

A Decision Tree describes a tree structure wherein leaves represent classifications and branches represent conjunctions of features that lead to those classifications

ID3 is a decision tree induction algorithm developed by Quinlan. ID3 stands for "Iterative Dichotomiser 3"

Page 5: PPID3 AICCSA08

Decision Tree

[Figure: a decision tree as a predictive model. Observations are the variables (normal attributes) with their possible values; conclusions are the predicted values of the target variable (class attribute).]

Page 6: PPID3 AICCSA08

Decision Tree Example

Day  Outlook  Humidity  Wind    Play
1    Sunny    High      Weak    No
2    Sunny    High      Weak    No
3    Cloudy   High      Strong  No
4    Rain     Normal    Strong  No
5    Rain     Normal    Weak    Yes
6    Rain     High      Weak    No
7    Cloudy   High      Weak    Yes
8    Sunny    Normal    Strong  No
9    Sunny    Normal    Strong  No
10   Sunny    High      Strong  No
11   Rain     Normal    Weak    Yes
12   Cloudy   Normal    Weak    Yes
13   Cloudy   High      Weak    Yes
...  ...      ...       ...     ...

Outlook, Humidity, and Wind are the normal (or independent) attributes; Play is the class (or dependent) attribute.

Page 7: PPID3 AICCSA08

Decision Tree Example (cont.)

[Figure: decision tree for the example data, with internal nodes Wind (Weak, Strong), Humidity (High, Normal), and Outlook (Sunny, Cloudy, Rain), and leaves labeled with Yes/No counts.]

Target data: (Outlook = Cloudy, Wind = Strong, Humidity = High) → Play = No

Page 8: PPID3 AICCSA08

Information Gain

The information gain of a given attribute A with respect to the class attribute C is the reduction in uncertainty about the value of C when we know the value of A.

The uncertainty about the value of C is measured by its entropy.

The uncertainty about the value of C when we know the value of A is given by the conditional entropy of C given A.

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{a \in A} \frac{|S_a|}{|S|}\, \mathrm{Entropy}(S_a)$$

where a is a value of A, $S_a$ is the subset of instances of $S$ where A takes the value a, and $|S|$ is the number of instances.

Page 9: PPID3 AICCSA08

Entropy

Amount of uncertainty about an event associated with a given probability distribution

Shannon defines entropy in terms of a discrete random event X, with possible states $x_1, \dots, x_n$, as:

$$H(X) = \sum_{i=1}^{n} p(x_i) \log_2\!\left(\frac{1}{p(x_i)}\right) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$

where $p(x_i) = \Pr(X = x_i)$ is the probability of the i-th outcome of X.
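As a rough illustration of the two definitions above, the following Python sketch computes Shannon entropy and entropy-based information gain over the 13 rows shown in the decision tree example. The names `ROWS`, `ATTRIBUTES`, `entropy`, and `entropy_gain` are hypothetical and chosen only for this sketch; the slides do not prescribe an implementation.

```python
from collections import Counter
from math import log2

# The 13 rows shown in the example table: (Outlook, Humidity, Wind, Play).
ROWS = [
    ("Sunny", "High", "Weak", "No"),
    ("Sunny", "High", "Weak", "No"),
    ("Cloudy", "High", "Strong", "No"),
    ("Rain", "Normal", "Strong", "No"),
    ("Rain", "Normal", "Weak", "Yes"),
    ("Rain", "High", "Weak", "No"),
    ("Cloudy", "High", "Weak", "Yes"),
    ("Sunny", "Normal", "Strong", "No"),
    ("Sunny", "Normal", "Strong", "No"),
    ("Sunny", "High", "Strong", "No"),
    ("Rain", "Normal", "Weak", "Yes"),
    ("Cloudy", "Normal", "Weak", "Yes"),
    ("Cloudy", "High", "Weak", "Yes"),
]
# Column index of each normal attribute (hypothetical helper mapping).
ATTRIBUTES = {"Outlook": 0, "Humidity": 1, "Wind": 2}


def entropy(labels):
    """H = -sum_i p(x_i) * log2 p(x_i) over the observed class labels."""
    total = len(labels)
    return -sum((k / total) * log2(k / total) for k in Counter(labels).values())


def entropy_gain(rows, attr_index):
    """Gain(S, A) = Entropy(S) - sum_a |S_a|/|S| * Entropy(S_a)."""
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in set(r[attr_index] for r in rows):
        subset = [r[-1] for r in rows if r[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


if __name__ == "__main__":
    for name, index in ATTRIBUTES.items():
        print(name, round(entropy_gain(ROWS, index), 4))
```

ID3 would split on the attribute with the largest gain; on these 13 rows that attribute is Outlook.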

Page 10: PPID3 AICCSA08

Gini Index

Another sensible measure of impurity is the Gini Index:

$$\mathrm{Gini}(S) = 1 - \sum_{i=1}^{n} p(x_i)^2$$

where $p(x_i)$ is the relative frequency of $x_i$ in S.

Therefore, information gain using Gini Index is:

$$\mathrm{Gain}(S, A) = \mathrm{Gini}(S) - \sum_{a \in A} \frac{|S_a|}{|S|}\, \mathrm{Gini}(S_a)$$

We will come back to this formula…
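The Gini-based gain can be sketched in the same way. The snippet below is a minimal illustration using exact fractions; the class counts per Outlook, Wind, and Humidity value were tallied by hand from the 13 rows shown in the example table, and the function names `gini` and `gini_gain` are assumptions made for this sketch.

```python
from fractions import Fraction


def gini(counts):
    """Gini(S) = 1 - sum_i p(x_i)^2, with p(x_i) the relative class frequency."""
    total = sum(counts)
    return 1 - sum(Fraction(k, total) ** 2 for k in counts)


def gini_gain(partition):
    """Gain(S, A) = Gini(S) - sum_a |S_a|/|S| * Gini(S_a).

    `partition` maps each value a of attribute A to the class counts in S_a.
    """
    n_classes = len(next(iter(partition.values())))
    class_totals = [sum(c[i] for c in partition.values()) for i in range(n_classes)]
    size = sum(class_totals)
    remainder = sum(Fraction(sum(c), size) * gini(c) for c in partition.values())
    return gini(class_totals) - remainder


# (Yes, No) counts per attribute value in the 13 rows shown.
outlook = {"Sunny": (0, 5), "Cloudy": (3, 1), "Rain": (2, 2)}
wind = {"Weak": (5, 3), "Strong": (0, 5)}
humidity = {"High": (2, 5), "Normal": (3, 3)}

if __name__ == "__main__":
    for name, part in [("Outlook", outlook), ("Wind", wind), ("Humidity", humidity)]:
        print(name, gini_gain(part))
```

Working with `Fraction` avoids rounding noise when comparing attributes whose gains are close.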

Page 11: PPID3 AICCSA08

Privacy-Preserving Data Mining

Privacy-preserving data mining: extracting desired knowledge without revealing the private data values, by developing new algorithms or modifying the standard algorithms

In co-operative and distributed computation, it prevents access to unnecessary and private information while each party still obtains the aggregate results it wants

Page 12: PPID3 AICCSA08

Privacy-Preserving Approaches

Data distribution: centralized data environment, or distributed data environment (horizontal, vertical, or arbitrary partitioning)

Main approaches: Secure Multi-party Computation (SMC); randomization and perturbation

Page 13: PPID3 AICCSA08

Background

Pinkas and Lindell: compute information gain using entropy; present a secure protocol to compute $x \ln x$ when x is distributed between two parties; works only for two parties (because it uses the Oblivious Polynomial Evaluation protocol)

Xiao et al.: compute information gain using entropy; works for the multi-party case; uses homomorphic encryption

Page 14: PPID3 AICCSA08

Our Protocol

Privacy Preserving ID3 over Horizontally Partitioned Data

Using Gini Index to compute information gain

Working for multi-party cases

Sub-protocols: secure multi-party addition, secure multi-party multiplication, and secure multi-party square-division

Page 15: PPID3 AICCSA08

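Main Protocol

Computing information gain using Gini Index:

$$\mathrm{Gini}(S_a) = 1 - \sum_{c \in C} p_{ac}^2$$

$$\mathrm{Gain}(S, A) = \mathrm{Gini}(S) - \sum_{a \in A} \frac{|S_a|}{|S|}\, \mathrm{Gini}(S_a)$$

$p_{ac}$ is the probability that the value of class attribute C is c while the value of attribute A is a.

$$\mathrm{Gini}(S_a) = 1 - \sum_{c \in C} \frac{|S_{ac}|^2}{|S_a|^2}$$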

Page 16: PPID3 AICCSA08

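Main Protocol (cont.)

$\mathrm{Gini}(S)$ is fixed. Therefore, we have to compute

$$F(S, A) = \sum_{a \in A} \frac{|S_a|}{|S|} \left(1 - \sum_{c \in C} \frac{|S_{ac}|^2}{|S_a|^2}\right)$$

$$F(S, A) = 1 - \frac{1}{|S|} \sum_{a \in A} \sum_{c \in C} \left(\frac{|S_{ac}|^2}{|S_a|}\right)$$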
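The step between the two forms of F(S, A) can be spelled out as follows. This is a short worked derivation that relies only on the definitions above and on the fact that the subsets $S_a$ partition $S$, so the sizes $|S_a|$ sum to $|S|$.

```latex
\begin{align*}
F(S, A)
  &= \sum_{a \in A} \frac{|S_a|}{|S|}
     \left(1 - \sum_{c \in C} \frac{|S_{ac}|^2}{|S_a|^2}\right) \\
  &= \frac{1}{|S|} \sum_{a \in A} |S_a|
     \;-\; \frac{1}{|S|} \sum_{a \in A} |S_a| \sum_{c \in C} \frac{|S_{ac}|^2}{|S_a|^2} \\
  &= 1 - \frac{1}{|S|} \sum_{a \in A} \sum_{c \in C} \frac{|S_{ac}|^2}{|S_a|}
\end{align*}
```

Since $\mathrm{Gain}(S, A) = \mathrm{Gini}(S) - F(S, A)$ and $\mathrm{Gini}(S)$ does not depend on A, the attribute with the largest gain is the one with the largest double sum.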

Page 17: PPID3 AICCSA08

Main Protocol (cont.)

For instance, in the previous example, for attribute Outlook we have to compute

$$\sum_{c \in C} \frac{|S_{Sunny,c}|^2}{|S_{Sunny}|} = \frac{|S_{Sunny,Yes}|^2 + |S_{Sunny,No}|^2}{|S_{Sunny}|}$$

$$\sum_{c \in C} \frac{|S_{Rain,c}|^2}{|S_{Rain}|} = \frac{|S_{Rain,Yes}|^2 + |S_{Rain,No}|^2}{|S_{Rain}|}$$

$$\sum_{c \in C} \frac{|S_{Cloudy,c}|^2}{|S_{Cloudy}|} = \frac{|S_{Cloudy,Yes}|^2 + |S_{Cloudy,No}|^2}{|S_{Cloudy}|}$$
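To make the three Outlook terms concrete, the following sketch evaluates them on the 13 rows shown in the example table, using exact fractions. It runs in the clear (no privacy protection) and only illustrates the arithmetic that the secure protocol has to reproduce; `ROWS`, `value_terms`, and `weighted_gini` are hypothetical names introduced for this illustration.

```python
from collections import Counter
from fractions import Fraction

# (Outlook, Play) pairs for the 13 rows shown in the example table.
ROWS = [
    ("Sunny", "No"), ("Sunny", "No"), ("Cloudy", "No"), ("Rain", "No"),
    ("Rain", "Yes"), ("Rain", "No"), ("Cloudy", "Yes"), ("Sunny", "No"),
    ("Sunny", "No"), ("Sunny", "No"), ("Rain", "Yes"), ("Cloudy", "Yes"),
    ("Cloudy", "Yes"),
]


def value_terms(rows):
    """Return {a: sum_c |S_ac|^2 / |S_a|} for each attribute value a."""
    pair_counts = Counter(rows)                    # |S_ac|
    value_counts = Counter(a for a, _ in rows)     # |S_a|
    terms = {}
    for (a, _c), k in pair_counts.items():
        terms[a] = terms.get(a, Fraction(0)) + Fraction(k * k, value_counts[a])
    return terms


def weighted_gini(rows):
    """F(S, A) = 1 - (1/|S|) * sum_a sum_c |S_ac|^2 / |S_a|."""
    return 1 - sum(value_terms(rows).values()) / len(rows)


if __name__ == "__main__":
    print(value_terms(ROWS))
    print(weighted_gini(ROWS))
```

On these rows the Sunny term is 25/5, the Cloudy term is (9 + 1)/4, and the Rain term is (4 + 4)/4.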

Page 18: PPID3 AICCSA08

Main Protocol (cont.)

For instance, two parties have to compute

$$\frac{(x_1 + x_2)^2 + (y_1 + y_2)^2}{(x_1 + x_2) + (y_1 + y_2)}$$

$x_1$ and $y_1$ belong to party 1

$x_2$ and $y_2$ belong to party 2
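The sketch below shows where $x_i$ and $y_i$ could come from under horizontal partitioning: each party counts the Yes and No rows it holds locally for one attribute value. The split of the 13 example rows between the two parties (rows 1 to 7 and rows 8 to 13) is an assumption made purely for illustration, and the expression is evaluated in the clear, which is exactly what the secure protocol is meant to avoid.

```python
from fractions import Fraction

# Hypothetical horizontal split of the example rows, as (Outlook, Play) pairs.
PARTY_1 = [
    ("Sunny", "No"), ("Sunny", "No"), ("Cloudy", "No"), ("Rain", "No"),
    ("Rain", "Yes"), ("Rain", "No"), ("Cloudy", "Yes"),
]
PARTY_2 = [
    ("Sunny", "No"), ("Sunny", "No"), ("Sunny", "No"), ("Rain", "Yes"),
    ("Cloudy", "Yes"), ("Cloudy", "Yes"),
]


def local_counts(rows, value):
    """Return (x, y): the local Yes and No counts for one Outlook value."""
    yes = sum(1 for a, c in rows if a == value and c == "Yes")
    no = sum(1 for a, c in rows if a == value and c == "No")
    return yes, no


def joint_term(x1, y1, x2, y2):
    """((x1 + x2)^2 + (y1 + y2)^2) / ((x1 + x2) + (y1 + y2)), computed in the clear."""
    return Fraction((x1 + x2) ** 2 + (y1 + y2) ** 2, (x1 + x2) + (y1 + y2))


if __name__ == "__main__":
    x1, y1 = local_counts(PARTY_1, "Cloudy")
    x2, y2 = local_counts(PARTY_2, "Cloudy")
    print(joint_term(x1, y1, x2, y2))
```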

Page 19: PPID3 AICCSA08

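Secure Multi-party Addition

n parties are involved, $P_1, \dots, P_n$

Inputs: $P_i : x_i$, Outputs: $P_i : l_i$, $1 \le i \le n$,

such that:

$$\sum_{i=1}^{n} x_i = \prod_{i=1}^{n} l_i$$

Suppose E is an Additive Homomorphic Encryption, with public key e and private key d:

$$E(m_1, e) \cdot E(m_2, e) = E(m_1 + m_2, e)$$

$$D\big(E(m_1, e)^{m_2}, d\big) = D\big(E(m_1 m_2, e), d\big)$$

Thus we have:

$$E(l_1, e)^{l_2 \cdots l_n} = E\Big(\prod_{i=1}^{n} l_i, e\Big) = E\Big(\sum_{i=1}^{n} x_i, e\Big) = \prod_{i=1}^{n} E(x_i, e)$$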
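The two homomorphic properties can be checked with a toy Paillier cryptosystem. Paillier is used here only as one well-known example of an additive homomorphic scheme; the slides do not say which scheme the authors use. The tiny primes and the names `keygen`, `encrypt`, and `decrypt` are assumptions for illustration and offer no real security.

```python
import math
import random


def keygen(p=293, q=433):
    """Toy Paillier keys: returns public modulus N and private key (lam, mu)."""
    N = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    mu = pow(lam, -1, N)  # valid because the generator is fixed to g = N + 1
    return N, (lam, mu)


def encrypt(m, N):
    """E(m) = (N + 1)^m * r^N mod N^2 for a random r coprime to N."""
    N2 = N * N
    while True:
        r = random.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return (pow(N + 1, m, N2) * pow(r, N, N2)) % N2


def decrypt(c, N, d):
    """D(c) = L(c^lam mod N^2) * mu mod N, with L(u) = (u - 1) // N."""
    lam, mu = d
    u = pow(c, lam, N * N)
    return ((u - 1) // N * mu) % N


if __name__ == "__main__":
    N, d = keygen()
    c1, c2 = encrypt(17, N), encrypt(25, N)
    # Multiplying ciphertexts adds the plaintexts.
    print(decrypt(c1 * c2 % (N * N), N, d))
    # Raising a ciphertext to a constant multiplies the plaintext by it.
    print(decrypt(pow(c1, 25, N * N), N, d))
```

All plaintext arithmetic is modulo N, so results are only meaningful while the true sum or product stays below N.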

Page 20: PPID3 AICCSA08

Secure Multi-party Addition (cont.)

1. $P_1$ selects an additive homomorphic encryption and sends the public key e to all other parties.

2. $P_1$ encrypts its input $x_1$, $E(x_1, e)$, and sends it to $P_2$.

3. For i = 2 to n-1, $P_i$ encrypts its input $x_i$, $E(x_i, e)$, multiplies it by $\prod_{j=1}^{i-1} E(x_j, e)$, and sends $\prod_{j=1}^{i} E(x_j, e)$ to $P_{i+1}$.

4. $P_n$ encrypts its input $x_n$, $E(x_n, e)$, and computes $\prod_{i=1}^{n} E(x_i, e)$.

5. $P_n$ randomly selects its nonzero output share $l_n$, calculates $\left(\prod_{i=1}^{n} E(x_i, e)\right)^{l_n^{-1}}$, and sends it to $P_{n-1}$.

Page 21: PPID3 AICCSA08

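Secure Multi-party Addition (cont.)

6. For i = n-1 to 2, $P_i$ randomly selects its nonzero output $l_i$, calculates $\left(\prod_{j=1}^{n} E(x_j, e)\right)^{l_n^{-1} \cdots\, l_i^{-1}}$, and sends it to $P_{i-1}$.

7. $P_1$ decrypts the received value from $P_2$ and sets it as its output $l_1$:

$$l_1 = D\left(\left(\prod_{i=1}^{n} E(x_i, e)\right)^{l_n^{-1} \cdots\, l_2^{-1}}, d\right)$$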
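Steps 1 to 7 can be simulated in a single process as sketched below. This is an illustration under several assumptions: toy Paillier stands in for the unspecified additive homomorphic scheme, each inverse $l_i^{-1}$ is taken modulo the Paillier modulus N (so shares must be coprime to N), and the final identity holds modulo N. All function names are hypothetical. A real deployment would run each party separately; here one function sees every value.

```python
import math
import random


def random_unit(N):
    """Random element of 1..N-1 that is invertible modulo N."""
    while True:
        v = random.randrange(1, N)
        if math.gcd(v, N) == 1:
            return v


def keygen(p=293, q=433):
    """Toy Paillier keys (generator g = N + 1); illustration only."""
    N = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    return N, (lam, pow(lam, -1, N))


def encrypt(m, N):
    N2 = N * N
    return (pow(N + 1, m, N2) * pow(random_unit(N), N, N2)) % N2


def decrypt(c, N, d):
    lam, mu = d
    return ((pow(c, lam, N * N) - 1) // N * mu) % N


def secure_addition(inputs, N, d):
    """Return shares l_1..l_n with prod(l_i) == sum(x_i) (mod N)."""
    N2 = N * N
    # Steps 2-4: the running product of ciphertexts travels from P_1 to P_n.
    acc = encrypt(inputs[0], N)
    for x in inputs[1:]:
        acc = acc * encrypt(x, N) % N2
    # Steps 5-6: P_n down to P_2 each pick a share l_i and raise the
    # ciphertext to l_i^{-1} (mod N) before passing it on.
    shares = [None] * len(inputs)
    for i in range(len(inputs) - 1, 0, -1):
        shares[i] = random_unit(N)
        acc = pow(acc, pow(shares[i], -1, N), N2)
    # Step 7: P_1 decrypts and keeps the result as its own share l_1.
    shares[0] = decrypt(acc, N, d)
    return shares


if __name__ == "__main__":
    N, d = keygen()
    shares = secure_addition([3, 1, 4], N, d)
    print(shares, math.prod(shares) % N)
```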

Page 22: PPID3 AICCSA08

Other Sub-Protocols

Secure Multi-party Multiplication: same as Secure Multi-party Addition

Secure Multi-party Square-Division: using the two previous sub-protocols

Page 23: PPID3 AICCSA08

Complexity

Common parameters: size of the database, number of parties involved, number of attributes, and number of possible values for attributes (on average)

To compute the cost of the protocol, suppose:

n is the number of parties involved in the protocol

a is the number of remaining normal attributes at the current step and node

v is the average number of possible values for those normal attributes

c is the number of possible values for the class attribute

b is the number of bits exchanged from one party to another

CPn is the computational overhead of sub-protocols 1 and 2

Page 24: PPID3 AICCSA08


Complexity (cont.)

CPn includes n encryptions, one decryption, n-1 multiplications and n-1 power computations.

The overall computational cost, by assuming that c is dominated by b and n, is

The overall communication cost is

nCPCPnva 22

bnva 2

Page 25: PPID3 AICCSA08

Future Work

Using the proposed sub-protocols in other PPDM techniques and presenting new building blocks in SMC

Implementing the protocol to find the exact cost and efficiency of the algorithm and to compare it with other existing techniques

Page 26: PPID3 AICCSA08

References

1. Rakesh Agrawal, Alexandre V. Evfimievski, and Ramakrishnan Srikant. Information sharing across private databases. In ACM Special Interest Group on Management of Data (SIGMOD) Conference, pages 86–97, 2003.

2. Rakesh Agrawal and Ramakrishnan Srikant. Privacy-preserving data mining. In ACM Special Interest Group on Management of Data (SIGMOD) Conference, pages 439–450, 2000.

3. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Chapman & Hall, New York, 1984.

4. Leo Breiman. Technical note: Some properties of splitting criteria. Machine Learning, 24(1):41–47, 1996.

5. Christian Cachin and Jan Camenisch, editors. Advances in Cryptology - EUROCRYPT 2004, International Conference on the Theory and Applications of Cryptographic Techniques, Interlaken, Switzerland, May 2-6, 2004, Proceedings, volume 3027 of Lecture Notes in Computer Science. Springer, 2004.

6. Chris Clifton, Murat Kantarcioglu, Jaideep Vaidya, Xiaodong Lin, and Michael Y. Zhu. Tools for privacy preserving distributed data mining. ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD), 4(2):28–34, 2003.

7. DTREG. How trees are built. http://www.dtreg.com/treebuild.htm, 2006. (Last posted: 22/7/2006).

8. W. Du and M. Atallah. Privacy-preserving cooperative statistical analysis. In ACSAC ’01: Proceedings of the 17th Annual Computer Security Applications Conference, pages 102–110, New Orleans, Louisiana, USA, December 10-14 2001.

9. Wenliang Du and Zhijun Zhan. Building decision tree classifier on private data. In CRPITS’14: Proceedings of the IEEE international conference on Privacy, security and data mining, pages 1–8, Darlinghurst, Australia, 2002. Australian Computer Society, Inc.

10. Bart Goethals, Sven Laur, Helger Lipmaa, and Taneli Mielikäinen. On private scalar product computation for privacy preserving data mining. In ICISC, pages 104–120, 2004.

11. R. J. Light and B. H. Margolin. An analysis of variance for categorical data. In Journal of The American Statistical Association, volume 66, pages 534–544, 1971.

12. Yehuda Lindell and Benny Pinkas. Privacy preserving data mining. In CRYPTO, pages 36–54, 2000.

Page 27: PPID3 AICCSA08

References

13. Behzad Malek and Ali Miri. Secure dot-product protocol using trace functions. 2006 IEEE International Symposium on Information Theory, 2006.

14. Moni Naor and Benny Pinkas. Oblivious transfer and polynomial evaluation. In STOC ’99: Proceedings of the thirty-first annual ACM Symposium on Theory of Computing, pages 245–254, New York, NY, USA, 1999. ACM Press.

15. Moni Naor and Benny Pinkas. Efficient oblivious transfer protocols. In SODA ’01: Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms, pages 448–457, Philadelphia, PA, USA, 2001. Society for Industrial and Applied Mathematics.

16. Benny Pinkas. Cryptographic techniques for privacy-preserving data mining. ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD), 4(2):12–19, 2002.

17. Laura Elena Raileanu and Kilian Stoffel. Theoretical comparison between the gini index and information gain criteria. Annals of Mathematics and Artificial Intelligence, 41(1):77–93, 2004.

18. Eakalak Suthampan and Songrit Maneewongvatana. Privacy preserving decision tree in multi party environment. In Asia Information Retrieval Symposium (AIRS), pages 727–732, 2005.

19. Salford Systems. Do splitting rules really matter? http://www.salford-systems.com/423.php, 2006.

20. Jaideep Vaidya and Chris Clifton. Privacy-preserving decision trees over vertically partitioned data. In Data and Application Security (DBSec), pages 139–152, 2005.

21. Jaideep Vaidya and Chris Clifton. Secure set intersection cardinality with application to association rule mining. Journal of Computer Security, 13(4):593–622, 2005.

22. Ming-Jun Xiao, Liu-Sheng Huang, Yong-Long Luo, and Hong Shen. Privacy preserving ID3 algorithm over horizontally partitioned data. In Parallel and Distributed Computing, Applications and Technologies, pages 239–243, 2005.

Page 28: PPID3 AICCSA08

Secure Multi-party Multiplication

n parties are involved, $P_1, \dots, P_n$

Inputs: $P_i : x_i$, Outputs: $P_i : l_i$, $1 \le i \le n$,

such that:

$$\prod_{i=1}^{n} x_i = \sum_{i=1}^{n} l_i$$

Suppose E is an Additive Homomorphic Encryption, with public key e and private key d. Thus we have:

$$E(x_1, e)^{x_2 \cdots x_n} = E\Big(\prod_{i=1}^{n} x_i, e\Big) = E\Big(\sum_{i=1}^{n} l_i, e\Big) = \prod_{i=1}^{n} E(l_i, e)$$

Page 29: PPID3 AICCSA08

Secure Multi-party Multiplication (cont.)

1. $P_1$ selects an additive homomorphic encryption and sends the public key e to all other parties.

2. $P_1$ encrypts its input $x_1$, $E(x_1, e)$, and sends it to $P_2$.

3. For i = 2 to n-1, $P_i$ raises the received value to the power of its input $x_i$, $E(x_1, e)^{x_2 \cdots x_i}$, and sends it to $P_{i+1}$.

4. For i = n to 2, $P_i$ randomly selects its nonzero output share $l_i$, encrypts it, $E(l_i, e)$, computes its inverse, $E(l_i, e)^{-1}$, multiplies the received value by it, $E(x_1, e)^{x_2 \cdots x_n} \cdot \prod_{j=i}^{n} E(l_j, e)^{-1}$, and sends it to $P_{i-1}$.

Page 30: PPID3 AICCSA08

Secure Multi-party Multiplication (cont.)

5. $P_1$ decrypts the received value from $P_2$ and sets it as its output $l_1$:

$$l_1 = D\left(E(x_1, e)^{x_2 \cdots x_n} \cdot \prod_{i=2}^{n} E(l_i, e)^{-1}, d\right)$$
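The multiplication steps can be simulated in the same single-process style. The sketch again assumes toy Paillier as the additive homomorphic scheme, works modulo the Paillier modulus N, and assumes that $P_n$ also raises the ciphertext to its own input $x_n$ before choosing its share, as the exponent $x_2 \cdots x_n$ in the formula above implies. Function names are hypothetical.

```python
import math
import random


def random_unit(N):
    """Random element of 1..N-1 that is invertible modulo N."""
    while True:
        v = random.randrange(1, N)
        if math.gcd(v, N) == 1:
            return v


def keygen(p=293, q=433):
    """Toy Paillier keys (generator g = N + 1); illustration only."""
    N = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    return N, (lam, pow(lam, -1, N))


def encrypt(m, N):
    N2 = N * N
    return (pow(N + 1, m, N2) * pow(random_unit(N), N, N2)) % N2


def decrypt(c, N, d):
    lam, mu = d
    return ((pow(c, lam, N * N) - 1) // N * mu) % N


def secure_multiplication(inputs, N, d):
    """Return shares l_1..l_n with sum(l_i) == prod(x_i) (mod N)."""
    N2 = N * N
    # Steps 2-3: P_1 encrypts x_1; every later party raises the ciphertext
    # to its own input, giving E(x_1)^(x_2 ... x_n).
    acc = encrypt(inputs[0], N)
    for x in inputs[1:]:
        acc = pow(acc, x, N2)
    # Step 4: P_n down to P_2 each pick a share l_i and multiply by the
    # inverse ciphertext E(l_i)^{-1}, which subtracts l_i from the plaintext.
    shares = [None] * len(inputs)
    for i in range(len(inputs) - 1, 0, -1):
        shares[i] = random.randrange(1, N)
        acc = acc * pow(encrypt(shares[i], N), -1, N2) % N2
    # Final step: P_1 decrypts and keeps the result as its share l_1.
    shares[0] = decrypt(acc, N, d)
    return shares


if __name__ == "__main__":
    N, d = keygen()
    shares = secure_multiplication([3, 4, 5], N, d)
    print(shares, sum(shares) % N)
```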

Page 31: PPID3 AICCSA08

Secure Multi-party Square-Division

Suppose two parties, $P_1$ and $P_2$, hold horizontal partitions of the data.

Inputs: $P_1 : x_1, y_1$ and $P_2 : x_2, y_2$

Outputs:

$$\frac{(x_1 + x_2)^2 + (y_1 + y_2)^2}{(x_1 + x_2) + (y_1 + y_2)}$$

Using two-party multiplication:

Inputs: $P_1 : x_1, y_1$ and $P_2 : x_2, y_2$

Outputs: $P_1 : k_1, l_1$ and $P_2 : k_2, l_2$

such that: $x_1 x_2 = k_1 + k_2$ and $y_1 y_2 = l_1 + l_2$

Page 32: PPID3 AICCSA08

Secure Multi-party Square-Division (cont.)

Next step

Inputs: $P_1 : z_1 = x_1^2 + y_1^2 + 2k_1 + 2l_1$ and $w_1 = x_1 + y_1$

$P_2 : z_2 = x_2^2 + y_2^2 + 2k_2 + 2l_2$ and $w_2 = x_2 + y_2$

Outputs:

$$\frac{z_1 + z_2}{w_1 + w_2}$$

Sub-step, using two-party addition

Inputs: $P_1 : z_1, w_1$ and $P_2 : z_2, w_2$

Outputs: $P_1 : m_1, n_1$ and $P_2 : m_2, n_2$

Such that: $z_1 + z_2 = m_1 m_2$ and $w_1 + w_2 = n_1 n_2$

$P_1$ computes $r_1 = \frac{m_1}{n_1}$ and sends it to $P_2$

$P_2$ computes $r_2 = \frac{m_2}{n_2}$ and sends it to $P_1$

Page 33: PPID3 AICCSA08

Secure Multi-party Square-Division (cont.)

Each party computes

$$r_1 r_2 = \frac{m_1}{n_1} \cdot \frac{m_2}{n_2} = \frac{m_1 m_2}{n_1 n_2} = \frac{z_1 + z_2}{w_1 + w_2} = \frac{(x_1 + x_2)^2 + (y_1 + y_2)^2}{(x_1 + x_2) + (y_1 + y_2)}$$
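The algebra of the square-division steps can be checked with the sketch below. To keep it short, the two sub-protocols are replaced by trusted stand-in functions that simply hand out shares with the required property, and all values are exact rationals rather than encrypted residues. Both choices are assumptions made for illustration, so the code shows why the shares combine correctly, not how privacy is achieved. The function names are hypothetical.

```python
from fractions import Fraction
import random


def ideal_multiplication(a, b):
    """Stand-in for two-party multiplication: s1 + s2 == a * b."""
    s1 = Fraction(random.randint(-1000, 1000))
    return s1, a * b - s1


def ideal_addition(a, b):
    """Stand-in for two-party addition: s1 * s2 == a + b, with s1 nonzero."""
    s1 = Fraction(random.randint(1, 1000))
    return s1, Fraction(a + b) / s1


def square_division(x1, y1, x2, y2):
    """Follow the slide steps; requires (x1 + x2) + (y1 + y2) != 0."""
    # Two-party multiplication: x1*x2 = k1 + k2 and y1*y2 = l1 + l2.
    k1, k2 = ideal_multiplication(x1, x2)
    l1, l2 = ideal_multiplication(y1, y2)
    # Local computation of z_i and w_i by each party.
    z1, w1 = x1 * x1 + y1 * y1 + 2 * k1 + 2 * l1, x1 + y1
    z2, w2 = x2 * x2 + y2 * y2 + 2 * k2 + 2 * l2, x2 + y2
    # Two-party addition: z1 + z2 = m1*m2 and w1 + w2 = n1*n2.
    m1, m2 = ideal_addition(z1, z2)
    n1, n2 = ideal_addition(w1, w2)
    # Each party publishes r_i = m_i / n_i; the product is the result.
    r1, r2 = m1 / n1, m2 / n2
    return r1 * r2


if __name__ == "__main__":
    print(square_division(1, 1, 2, 0))
```

With the hypothetical local counts $x_1 = 1$, $y_1 = 1$, $x_2 = 2$, $y_2 = 0$, the joint value is $(3^2 + 1^2)/(3 + 1)$.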