Data Classification Using Anatomized Training Data


CERIAS: The Center for Education and Research in Information Assurance and Security

The Problem

Anatomized Learning Problem: Given some l-diverse data in the anatomy model, can we learn accurate data mining models?

• What is the anatomy model?
• What is l-diverse?
• Under which assumptions?

Anatomy Model

Separate the data table (D) into two tables, an identifier table (IT) and a sensitive table (ST), instead of generalizing records in the same group:

• Divide D into m groups G_j, with group id (GID) j
• IT attributes (A_id): A_1, ⋯, A_d
• ST attribute: A_s
• Publish IT and ST instead of D
• L-diverse
• Xiao et al. (2006), Nergiz et al. (2011, 2013)
• Patient Data Example (HIPAA 2002)
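The split into IT and ST can be sketched in a few lines. The following is a hypothetical Python sketch, not the authors' implementation: `anatomize`, `keyed_hash`, and the dict-based row format are illustrative assumptions, and records are bucketed naively into consecutive groups (a real anatomizer would bucket so that sensitive values within each group are distinct, which is what yields l-diversity).

```python
import hashlib

def keyed_hash(key, seq):
    # Stand-in for the poster's H_k2(SEQ): a keyed hash so IT and ST
    # rows cannot be re-linked without knowing the key.
    return hashlib.sha256(f"{key}:{seq}".encode()).hexdigest()[:8]

def anatomize(records, qi_attrs, sensitive_attr, group_size, key):
    """Split plain records into an identifier table (IT) and a
    sensitive table (ST), linked only by the group id (GID)."""
    it_rows, st_rows = [], []
    for start in range(0, len(records), group_size):
        gid = start // group_size + 1
        for seq, rec in enumerate(records[start:start + group_size], start + 1):
            it_row = {a: rec[a] for a in qi_attrs}
            it_row.update({"GID": gid, "SEQ": seq})
            it_rows.append(it_row)
            # ST carries the hashed sequence number, the group id,
            # and the sensitive value -- but no identifiers.
            st_rows.append({"H(SEQ)": keyed_hash(key, seq), "GID": gid,
                            sensitive_attr: rec[sensitive_attr]})
    return it_rows, st_rows

# Two of the patients from the example table:
patients = [
    {"Patient": "Ike", "Age": 41, "Address": "Dayton", "Disease": "Cold"},
    {"Patient": "Eric", "Age": 22, "Address": "Richmond", "Disease": "Fever"},
]
it, st = anatomize(patients, ["Patient", "Age", "Address"], "Disease", 2, key="k2")
# IT rows carry no Disease; ST rows carry no identifiers and no raw SEQ.
```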

L-diverse: Privacy Standard

Every instance in IT can be associated with L different instances in ST

• Patient Data Example: L=2

• ∀ G_j, v ∈ π_{A_s}(G_j): freq(v, G_j) / |G_j| ≤ 1/l

• Machanavajjhala et al. (2007)
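The frequency condition can be checked mechanically. A minimal sketch, assuming ST rows are dicts carrying a "GID" key; the helper name `is_l_diverse` is illustrative:

```python
from collections import Counter, defaultdict

def is_l_diverse(st_rows, sensitive_attr, l):
    """For every group G_j and every sensitive value v occurring in G_j,
    require freq(v, G_j) / |G_j| <= 1/l."""
    groups = defaultdict(list)
    for row in st_rows:
        groups[row["GID"]].append(row[sensitive_attr])
    return all(
        count / len(values) <= 1 / l
        for values in groups.values()
        for count in Counter(values).values()
    )

# The example ST: every group holds two distinct diseases, so l = 2 holds.
st = [
    {"GID": 1, "Disease": "Cold"},  {"GID": 1, "Disease": "Fever"},
    {"GID": 2, "Disease": "Flu"},   {"GID": 2, "Disease": "Cough"},
    {"GID": 3, "Disease": "Flu"},   {"GID": 3, "Disease": "Fever"},
    {"GID": 4, "Disease": "Cough"}, {"GID": 4, "Disease": "Flu"},
]
```

Adding a duplicate disease to a group, or asking for l = 3 with groups of size 2, makes the check fail, as the inequality requires.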

Sensitive table (ST):

H(SEQ)   GID (G)   Disease (D)
Hk2(1)   1         Cold
Hk2(2)   1         Fever
Hk2(3)   2         Flu
Hk2(4)   2         Cough
Hk2(5)   3         Flu
Hk2(6)   3         Fever
Hk2(7)   4         Cough
Hk2(8)   4         Flu

Identifier table (IT):

Patient (P)   Age (A)   Address (AD)   GID (G)   SEQ (S)
Ike           41        Dayton         1         1
Eric          22        Richmond       1         2
Olga          30        Lafayette      2         3
Kelly         35        Lafayette      2         4
Faye          24        Richmond       3         5
Mike          47        Richmond       3         6
Jason         45        Lafayette      4         7
Max           31        Lafayette      4         8

Approaches

Learning Problem Assumptions

• Training set of n_tr instances in the anatomy model and a test set of n_te instances without any anonymization (Inan et al. (2009))

• No background knowledge for IT
• Can't predict the sensitive attribute A_s. If we could, we would be violating privacy!
• Prediction task of A_i or C (binary!):
  • Type 1: A_i ∈ {A_1, ⋯, A_d}
  • Type 2: C ∉ {A_1, ⋯, A_d} ∧ C ≠ A_s
• No IT and ST linking
• Data must remain L-diverse
• No involvement of D's publisher
• Relaxed Assumptions (some models):
  • Minimal involvement of D's publisher, limited resources of D's publisher
  • Link IT and ST on small subsets
  • "Distributed data mining" between third party (server) and data publisher (client)

Collaborative Decision Tree Analysis:
• Type 1 prediction task with relaxed assumptions
• Advantages:
  1. Preserves privacy with reasonable accuracy
  2. A big part of the decision tree is learnt by the third party, a desired situation in a cloud server/client architecture
• Limitations:
  1. Hard to give any bound on the model performance, in particular on the conditional risk (error rate) of the classification
  2. What about execution time guarantees in a cloud client/server architecture?
• Need for a more justified model with conditional risk guarantees!

Nearest Neighbor Rule in the Anatomy Model
• Type 2 prediction task without relaxed assumptions
• Anatomized Training Data (D_A): IT ⋈_(IT.GID = ST.GID) ST
• Augmentation of the nearest neighbor rule (Cover and Hart 1967): expand the training set such that the expanded version has size n_tr * l
• For every fixed l, the conditional risk follows as a corollary of Cover and Hart when n_tr → ∞
• One Critical Question: "How does the Bayes Risk change?"

Collaborative Decision Tree Learning
1. Distributed data mining in the cloud (client/server architecture)
2. On-the-fly encrypted subtrees (Mancuhan et al. 2014)
3. Experiments with four datasets from the UCI collection: adult, vote, autos and Australian credit
4. 10-fold cross validation on each dataset, measuring accuracy

Empirical Results

[Bar chart: classification accuracy (0.4 to 1.1) of the three methods cdtl, cdbsl and cnl on five tasks: Adult (Age sensitive), Adult (Relationship sensitive), Vote, Auto, and Australian Credit]

[Figure: collaborative decision tree on the patient example. The root splits the identifier table on Address (AD = Dayton, AD = Richmond, AD = Lafayette). For the Lafayette branch, joining with the sensitive-table rows of groups 2 and 4 attaches Disease labels, and the resulting split on D = Flu versus D = Cough is computed as an encrypted subtree.]

Current Work

• Experimentation with the nearest neighbor classifier using real data

• SVM classification generalization: How to adjust the margin for a good generalization property when the training data is anatomized?

• Real-world case study: How this could inform data retention policies

Ongoing work supported by the Northrop Grumman Cybersecurity Research Consortium

Theorem: Let M ⊆ ℝ^(d+1) be a metric space, D be the training data and D_A be the anatomized training data. Let P_A1(X) and P_A2(X) be the smooth probability density functions of X, and let P_A1 and P_A2 be the class priors such that P_A(X) = P_A1 P_A1(X) + P_A2 P_A2(X). Similarly, let P_1(X) and P_2(X) be the smooth probability density functions of X such that P(X) = P_1 P_1(X) + P_2 P_2(X) with class priors P_1 and P_2. Let h_A(X) = −ln(P_A1(X)/P_A2(X)) and h(X) = −ln(P_1(X)/P_2(X)) be the classifiers with biases Δh_A(X) and Δh(X) respectively. Let t = ln(P_1/P_2) be the decision threshold with threshold bias Δt. Let ε_A > 0 be the small change on P_1(X) and P_2(X) that results in P_A1(X) and P_A2(X); and let R*_A and R* be the Bayesian error estimates with respective biases ΔR*_A and ΔR*. Let P_Ai(X) and P_i(X) be the Parzen density estimates, and K(·) be the kernel function for D with shape matrix A and size/volume parameter r. Last, assume that 1) A_id and A_s are independent in the training data D and the anatomized training data D_A, 2) R*_A = R* holds, and 3) Δt < 1. Then the estimated Bayes risk is

R*_A ≅ a_1 r^2 + a_2 r^4 + a_3 r^(−d−1)/N + ε_A a_4 r^2 + ε_A a_5 r^4 − ε_A a_6 r^(−d−1)/N,

where ε_A a_6 r^(−d−1)/N > 0 always holds.

• Another critical question: "How does the convergence rate to the asymptotic conditional risk change?"
• O(1/(Nl)^(d+1)) versus O(1/N^(d+1))
• Faster convergence to the asymptotic conditional risk using anatomized training data

• What is the asymptotic conditional risk?
• It depends on the Bayes risk (Theorem above)

This work was supported by NPRP grant 09-256-1-046 from the Qatar National Research Fund and a 1415 PRF Research Grant from Purdue University

Theoretical Results

Koray Mancuhan

kmancuha@purdue.edu

Chris Clifton

clifton@cs.purdue.edu
