Post on 04-Mar-2020
CERIASThe Center for Education and Research in Information Assurance and Security
The ProblemAnatomized Learning ProblemGiven some l-diverse data in the anatomy model, can we learn accurate data miningmodels?
โข What is the anatomy model?โข What is l-diverse?โข Under which assumptions?
Anatomy ModelSeparate data table (๐ท) into two tables, identifier (๐ผ๐) and sensitive table (๐๐) insteadof generalizing records in the same group:
โข Divide ๐ท in ๐ groups ๐บ๐, group id (๐บ๐ผ๐ท) ๐โข IT attributes (๐ด๐๐): ๐ด1, โฏ , ๐ด๐โข ST attribute: ๐ด๐ โข Publish ๐ผ๐ and ๐๐ instead of ๐ทโข L-diverseโข Xiao et al. (2006), Nergiz et al. (2011, 2013)โข Patient Data Example (HIPAA 2002)
L-diverse: Privacy StandardEvery instance in ๐ผ๐ can be associated with L different instances in ๐๐
โข Patient Data Example: L=2
โข โ ๐บ๐ , ๐ฃ โ ๐๐ด๐ ๐บ๐ ,๐๐๐๐ ๐ฃ,๐บ๐
๐บ๐โค1
๐
โข Machanavajjhala et al. (2007)
H(SEQ) GID (G)
Disease (D)
Hk2(1) 1 Cold
Hk2(2) 1 Fever
Hk2(3) 2 Flu
Hk2(4) 2 Cough
Hk2(5) 3 Flu
Hk2(6) 3 Fever
Hk2(7) 4 Cough
Hk2(8) 4 Flu
Patient (P)
Age (A)
Address (AD)
GID (G)
SEQ (S)
Ike 41 Dayton 1 1
Eric 22 Richmond 1 2
Olga 30 Lafayette 2 3
Kelly 35 Lafayette 2 4
Faye 24 Richmond 3 5
Mike 47 Richmond 3 6
Jason 45 Lafayette 4 7
Max 31 Lafayette 4 8
Sensitive table (ST)Identifier table (IT)
ApproachesLearning Problem Assumptions
โข Training set of ๐๐ก๐ instances in anatomy model and test set of ๐๐ก๐ instanceswithout any anonymization (Inan et al. (2009))
โข No background knowledge for ๐ผ๐โข Canโt predict the sensitive attribute ๐จ๐. If we could, we would be violating
privacy!โข Prediction task of ๐ด๐or ๐ถ (binary!):
โข Type 1: ๐ด๐ โ ๐ด1, โฏ , ๐ด๐โข Type 2: ๐ถ โ ๐ด1, โฏ , ๐ด๐ โง ๐ถ โ ๐ด๐
โข No ๐ผ๐ and ๐๐ linkingโข Data must remain L-diverseโข No involvement of Dโs publisherโข Relaxed Assumptions (some models):
โข Minimal involvement of ๐ทโs publisher, limited sources of ๐ทโs publisherโข Link ๐ผ๐ and ๐๐ on small subsetsโข โDistributed data miningโ between third party (server) and data publisher
(client)
Collaborative Decision Tree Analysis:โข Type 1 prediction task with relaxed assumptionsโข Advantages:
1. Preserves privacy with reasonable accuracy2. Big part of the decision tree is learnt by the third party, a desired situation in
cloud server/client architectureโข Limitations:
1. Hard to give any bound on the model performance, in particular on the conditional risk (error rate) of the classification.
2. What about the execution time guarantees in a cloud client/server architecture?
โข Need of a more justified model with the conditional risk guarantees!
Nearest Neighbor Rule in Anatomy Modelโข Type 2 prediction task without relaxed assumptions.
โข Anatomized Training Data (๐ซ๐จ): ๐ผ๐โ
๐ผ๐. ๐บ๐ผ๐ท = ๐๐. ๐บ๐ผ๐ท๐๐
โข Augmentation of nearest neighbor rule (Cover and Hart 1967): Expand the
training set such that the expanded version has size ๐๐ก๐๐โข For all fixed ๐, the conditional risk is the corollary of Cover and Hart when
๐๐ก๐ โ โโข One Critical Question: โHow does the Bayes Risk change?โ
Collaborative Decision Tree Learning1. Distributed Data Mining in the cloud (Client/server architecture)2. On-the fly encrypted subtrees (Mancuhan and al. 2014)3. Experiments with four datasets from the UCI collection: adult, vote, autos and
Australian credit4. 10 fold cross validation on each dataset measuring accuracy
Empirical Results
0.4 0.5 0.6 0.7 0.8 0.9 1 1.1
cdtl
cdbsl
cnl
cdtl
cdbsl
cnl
cdtl
cdbsl
cnl
cdtl
cdbsl
cnl
cdtl
cdbsl
cnl
Ad
ult
Age
Sen
siti
ve
Ad
ult
Rel
atio
nsh
ipSe
nsi
tive
Vo
teA
uto
Au
stra
lian
Cre
dit
Accuracy
AD=Dayton
AD=RichmondAD=Lafayette
A AD G S
30 Lafayette 2 3
35 Lafayette 2 4
45 Lafayette 4 7
31 Lafayette 4 8
A AD G S
22 Richmond 1 2
24 Richmond 3 5
47 Richmond 3 6
A AD G S
41 Dayton 1 1
A AD G S
41 Dayton 1 1
22 Richmond 1 2
30 Lafayette 2 3
35 Lafayette 2 4
24 Richmond 3 5
47 Richmond 3 6
45 Lafayette 4 7
31 Lafayette 4 8
A AD D
30 Lafayette Flu
35 Lafayette Cough
45 Lafayette Cough
31 Lafayette Flu A AD D
30 Lafayette Flu
31 Lafayette Flu
A AD D
35 Lafayette Cough
45 Lafayette Cough
D=Flu
D=Cough
Encrypted Subtree!
H(SEQ) G D
Hk2(3) 2 Flu
Hk2(4) 2 Cough
Hk2(7) 4 Cough
Hk2(8) 4 Flu
Current Work
โข Experimentation of the nearest neighbor classifier using real data
โข SVM classification generalization: How to adjust the right margin for the good
generalization property when the training data is anatomized?
โข Real-world case study: How this could inform data retention policies
Ongoing work supported by Northrup Grumman Cybersecurity Research Consortium
Theorem: Let ๐ โ โ๐+1 be a metric space, ๐ท be the training data and ๐ท๐ด be the
anatomized training data. Let ๐๐ด1(๐) and ๐๐ด2(๐) be the smooth probability density
functions of ๐ . Let ๐๐ด1(๐) and ๐๐ด2(๐) be the class priors such that ๐๐ด(๐) =
๐๐ด1๐๐ด1(๐) + ๐๐ด2๐๐ด2(๐). Similarly, let ๐1(๐) and ๐2(๐) be the smooth probability
density functions of ๐ such that ๐(๐) = ๐1๐1(๐) + ๐2๐2(๐) with class priors ๐1 and ๐2.Let โ๐ด(๐) = โln (๐๐ด1(๐)/๐๐ด2(๐)) and โ(๐) = โln (๐1(๐)/๐2(๐)) be the classifiers with
biases ฮโ๐ด(๐) and ฮโ(๐) respectively. Let ๐ก = ln(๐1/๐2) be the decision threshold
with threshold bias ฮ๐ก. Let ๐๐ด > 0 be the small changes on ๐1(๐) and ๐2(๐) resulting
in ๐๐ด1(๐) and ๐๐ด2(๐); and ๐ ๐ดโ, ๐ โ be the Bayesian error estimations with respective
biases ฮ๐ ๐ดโ, ฮ๐ โ. Let ๐๐ด๐(๐)and ๐๐(๐) be the Parzen density estimations; and ๐พ(โ) be
the kernel function for ๐ท with shape matrix ๐ด and size/volume parameter ๐. Last, let's
assume that 1) ๐ด๐๐ and ๐ด๐ are independent in the training data ๐ท and the anatomized
training data ๐ท๐ด 2) ๐ ๐ดโ = ๐ โ hold 3) ฮ๐ก < 1. Therefore, the estimated Bayes risk is:
๐ ๐ดโ โ ๐1๐
2 + ๐2๐4 + ๐3
๐โ๐โ1
๐+ ๐๐ด๐4๐
2 + ๐๐จ๐5๐4 โ ๐๐จ๐6
๐โ๐โ1
๐
where ๐๐จ๐6๐โ๐โ1
๐> 0 always holds.
โข Another critical question: โHow does the convergence rate to the asymptotical
conditional risk change?โ
โข ๐(1/([๐๐]๐+1) versus ๐(1/[๐]๐+1)โข Faster convergence to the asymptotical conditional risk using anatomized
training data.
โข How is the asymptotical conditional risk?
โข Depends on the Bayes risk (Theorem above)
This work supported by NPRP grant 09-256-1-046 from the Qatar National Research
Foundation, 1415 PRF Research Grant from Purdue University
Theoretical Results
Koray Mancuhan
kmancuha@purdue.edu
Chris Clifton
clifton@cs.purdue.edu
Data Classification Using Anatomized Training Data