
Privacy-preserving data mining (1)

Outline
- A brief introduction to learning algorithms
  - Classification algorithms
  - Clustering algorithms
- Addressing privacy issues in learning
  - Single dataset publishing
  - Distributed multiple datasets
  - How data is partitioned

A quick review

Machine learning algorithms
- Supervised learning (classification)
  - Training data have class labels
  - Find the boundary between classes
- Unsupervised learning (clustering)
  - Training data have no labels
  - Similarity measure is the key
  - Group records based on the similarity measure

A quick review

Good tutorials:
- http://www.cs.utexas.edu/~mooney/cs391L/
- "Top 10 data mining algorithms": www.cs.uvm.edu/~icdm/algorithms/10Algorithms-08.pdf

We will review the basic ideas of some algorithms

C4.5 decision tree (classification)
- Based on the ID3 algorithm
- Convert the decision tree to a rule set
  - Each path from the root to a leaf becomes a rule
  - Prune the rules
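A minimal sketch of the train-then-extract-rules workflow, using scikit-learn (an assumption: the slides name no library, and scikit-learn implements CART, a close relative of C4.5, rather than C4.5 itself; the iris dataset is a stand-in):

```python
# Decision-tree sketch: scikit-learn's CART stands in for C4.5 here.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Each root-to-leaf path in the printout corresponds to one rule.
print(export_text(tree, feature_names=["sepal_len", "sepal_wid",
                                       "petal_len", "petal_wid"]))
```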

Cross validation
- Split the data into N folds
- In each round: training, validating, testing
- For choosing the best parameters
- For testing the generalization power
- Final result: the average of the N testing results
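A minimal N-fold cross-validation sketch with scikit-learn (the classifier and the iris dataset are stand-ins, not from the slides):

```python
# N-fold cross validation: train on N-1 folds, test on the held-out fold,
# and average the N test results.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)  # N = 5 folds
print(scores.mean())  # final result: the average of the N testing results
```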

Naïve Bayes (classification)
- Two classes: 0/1; feature vector x = (x1, x2, …, xn)
- Apply Bayes rule: P(c | x) = P(x | c) P(c) / P(x)
- Assume independent features: P(x | c) = P(x1 | c) P(x2 | c) … P(xn | c)
- Easy to count f(xi | class label) with the training data
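A minimal counting-based sketch of this idea in plain NumPy, assuming binary features and a toy dataset invented for illustration: the class-conditional frequencies f(xi | class) are estimated by counting and multiplied under the independence assumption.

```python
import numpy as np

# Toy binary data: rows are records, columns are features x1..x3.
X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])
y = np.array([1, 1, 0, 0])

def predict(x, X, y):
    scores = {}
    for c in (0, 1):
        Xc = X[y == c]
        prior = len(Xc) / len(X)                        # P(class)
        # P(xi | class), estimated by counting (Laplace-smoothed).
        cond = (np.sum(Xc == x, axis=0) + 1) / (len(Xc) + 2)
        scores[c] = prior * np.prod(cond)               # independence assumption
    return max(scores, key=scores.get)

print(predict(np.array([1, 0, 1]), X, y))  # -> 1
```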

K nearest neighbor (classification)
- "Instance-based learning"
- Classify a point z by the labels of the training points in its decision area Dz (its k nearest neighbors)
- More general: kernel methods
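A minimal kNN sketch in plain NumPy (the toy dataset is invented for illustration): the test point z is labeled by majority vote over the k training points in Dz.

```python
import numpy as np
from collections import Counter

def knn_predict(z, X, y, k=3):
    # Euclidean distance from z to every training point.
    dist = np.linalg.norm(X - z, axis=1)
    # The k nearest neighbors define the decision area D_z.
    nearest = y[np.argsort(dist)[:k]]
    return Counter(nearest).most_common(1)[0][0]  # majority vote

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(np.array([0.95, 0.9]), X, y, k=3))  # -> 1
```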

Linear classifier (classification)
- Decision boundary: wTx + b = 0
- Decision rule: f(x) = sign(wTx + b); one class where wTx + b > 0, the other where wTx + b < 0
- Examples: Perceptron, Linear Discriminant Analysis (LDA); see the sketch below
- There are infinitely many linear separators; which one is optimal?
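As a concrete instance of f(x) = sign(wTx + b), here is a minimal perceptron training loop in plain NumPy (toy data invented for illustration). It finds *some* separating hyperplane, not necessarily an optimal one, which is exactly the question the SVM answers next.

```python
import numpy as np

def perceptron(X, y, epochs=100, lr=0.1):
    # y must be in {-1, +1}; learns w, b for f(x) = sign(w.x + b).
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # misclassified point: update
                w += lr * yi * xi
                b += lr * yi
    return w, b

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
print(np.sign(X @ w + b))  # all training points correctly classified
```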

Support Vector Machine (classification)

- Distance from example xi to the separator: r = yi (wT xi + b) / ||w||
- Examples closest to the hyperplane are support vectors
- Margin ρ of the separator is the distance between support vectors: ρ = 2 / ||w||
- Maximizing ρ amounts to minimizing ||w||² / 2 subject to yi (wT xi + b) ≥ 1 for all i
- Extended to handle: 1. nonlinear boundaries (kernels) 2. noisy margins (soft margin) 3. large datasets
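A minimal sketch with scikit-learn's SVC (an assumption: the slides name no library, and the iris dataset is a stand-in). A linear kernel gives the max-margin separator; the kernel and C parameters cover the nonlinear and noisy-margin extensions above.

```python
# Max-margin linear SVM; support vectors are the points on the margin.
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X, y = X[y < 2], y[y < 2]                    # two classes: a binary problem
clf = SVC(kernel="linear", C=1.0).fit(X, y)  # kernel="rbf" for nonlinear
print(len(clf.support_vectors_))             # examples closest to the hyperplane
```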

Boosting (classification)
- Classifier ensembles: average the predictions of a set of classifiers trained on the same set of data
- Intuition
  - The output of a classifier has a certain amount of variance
  - Averaging can reduce the variance and improve the accuracy
- AdaBoost: Freund Y, Schapire RE (1997) "A decision-theoretic generalization of on-line learning and an application to boosting." J Comput Syst Sci
- Gradient boosting: J. Friedman, "Stochastic Gradient Boosting," http://citeseer.ist.psu.edu/old/126259.html
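A minimal ensemble sketch using scikit-learn's AdaBoostClassifier (library and dataset are stand-ins, not from the slides): many shallow trees are trained in sequence and their weighted votes are combined, reducing the variance of any single classifier.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# An ensemble of 50 depth-1 trees ("stumps"), combined by weighted vote.
ens = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50)
print(cross_val_score(ens, X, y, cv=5).mean())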

Challenges in Clustering

Definition of similarity measures
- Point-wise: Euclidean, cosine (document similarity), correlation, … (see the sketch below)
- Set-wise: min/max distance between two sets, entropy-based (categorical data)
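A minimal sketch of two of the point-wise measures listed above, in plain NumPy (the vectors are invented for illustration):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

euclidean = np.linalg.norm(a - b)                         # distance: 0 iff identical
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # similarity: 1 iff same direction
print(euclidean, cosine)  # b is a scaled copy of a, so cosine = 1.0
```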

Challenges in Clustering

- Hierarchical:
  1. Merge the most similar pair at each step
  2. Stop when reaching the desired number of clusters
- Partitioning (k-means), as sketched after this list:
  1. Set initial centroids
  2. Partition the data
  3. Adjust the centroids
  4. Iterate on 2 and 3 until converging
- Other classifications of algorithms: agglomerative (bottom-up) methods, divisive (partitional, top-down)
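A minimal NumPy sketch of the k-means loop above, assuming random initial centroids and toy data invented for illustration:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]  # 1. set initial centroids
    for _ in range(iters):                               # 4. iterate until converging
        # 2. partition the data: assign each point to its nearest centroid
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids, axis=2), axis=1)
        # 3. adjust the centroids to the mean of each partition
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = kmeans(X, k=2)
print(centroids)
```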

Challenges in Clustering

- Efficiency of the algorithm on large datasets
  - Linear-cost algorithms: k-means
  - However, the costs of many algorithms are quadratic
  - Workaround: perform a three-phase processing: 1. sampling, 2. clustering, 3. labeling

Challenges in Clustering

- Irregularly shaped clusters and noise

Clustering algorithms
- Typical ones: k-means, Expectation-Maximization (EM)
- Many clustering algorithms address different challenges
- Good survey: A. K. Jain et al., "Data Clustering: A Review," ACM Computing Surveys, 1999

PPDM issues

How data is distributed
- Single party releases data
- Multiple parties collaboratively mine data
  - Pooling data
  - Cryptographic protocols

How data is partitioned
- Horizontally / vertically

Single party

- Data perturbation (illustrated below)
  - Rakesh00: for decision trees
  - Chen05: for many classifiers and clustering algorithms
- Anonymization
  - Top-down/bottom-up: decision trees
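The cited methods differ in detail, but the common additive-perturbation idea can be sketched as follows (a minimal illustration in plain NumPy, not the Rakesh00 or Chen05 algorithms themselves; the noise scale is an arbitrary choice here):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))             # original, private records

# Additive perturbation: publish X + R, where R is i.i.d. random noise.
noise_scale = 0.5                          # privacy/utility trade-off knob
X_published = X + rng.normal(0.0, noise_scale, size=X.shape)

# A miner can still estimate aggregate statistics from the perturbed data:
print(X.mean(axis=0), X_published.mean(axis=0))  # means remain close
```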

Multiple parties

[Diagram: Party 1 … Party n each hold local data; perturbed data flows over the network to a server, which answers users]

- Architectures: service-based computing, peer-to-peer computing
- Perturbation & anonymization (papers 89, 92, 94, 185, …)
- Cryptographic approaches (papers 95-99, 104, 107, 108)

How data is partitioned

Horizontally partitioned
- All additive (and some multiplicative) perturbation methods
- Protocols: k-means, SVM, naïve Bayes, Bayesian network, …

Vertically partitioned
- All additive perturbation methods
- Protocols: k-means, Bayesian network, …
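A minimal illustration of the two partitioning schemes in plain NumPy (party counts and sizes are arbitrary): horizontal partitioning splits the records across parties, vertical partitioning splits the attributes.

```python
import numpy as np

X = np.arange(24).reshape(6, 4)   # 6 records, 4 attributes

# Horizontally partitioned: each party holds a subset of the records (rows).
party_a, party_b = X[:3], X[3:]

# Vertically partitioned: each party holds a subset of the attributes (columns).
party_c, party_d = X[:, :2], X[:, 2:]

print(party_a.shape, party_b.shape)  # (3, 4) (3, 4)
print(party_c.shape, party_d.shape)  # (6, 2) (6, 2)
```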

Challenges and opportunities

- Many modeling methods have no privacy-preserving version
- Cost of protocol-based approaches
- Limitations of column-based additive perturbation
- Complexity
- The advantage of geometric data perturbation: it covers many different modeling methods