INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04
INTRODUCTION TO MACHINE LEARNING
Clustering with k-means
Clustering, what?
● Similar within cluster
● Dissimilar between clusters
● Clustering: grouping objects in clusters
● No labels: unsupervised classification
● Plenty of possible clusterings
● Cluster: collection of objects
Clustering, why?
● Pattern Analysis
● Visualise Data
● Pre-processing Step
● Outlier Detection
● …
● Targeted Marketing Programs
● Student Segmentations
● Data Mining
● …
Clustering, how?
● Measure of Similarity: d(…, …)
• Numerical variables: metrics such as Euclidean, Manhattan, …
• Categorical variables: construct your own distance
● Clustering Methods
• k-means
• Hierarchical
• …
Many variations
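As a quick illustration (a toy example, not from the slides), base R's `dist()` computes both of the metrics mentioned above:

```r
# Two points in the plane: the distance depends on the chosen metric
m <- rbind(a = c(0, 0), b = c(3, 4))

d_euc <- dist(m, method = "euclidean")  # sqrt(3^2 + 4^2) = 5
d_man <- dist(m, method = "manhattan")  # |3| + |4| = 7
```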
Compactness and Separation
● Within Cluster Sums of Squares (WSS): measure of compactness

  WSS = Σ_{i=1}^{k} Σ_{x ∈ C_i} d(x, c_i)²

  (k = #clusters, C_i = cluster, x = object, c_i = cluster centroid)

● Between Cluster Sums of Squares (BSS): measure of separation

  BSS = Σ_{i=1}^{k} n_i · d(c_i, x̄)²

  (n_i = #objects in cluster i, x̄ = sample mean)

Goal: minimise WSS and maximise BSS.
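The two sums of squares can be checked numerically. The sketch below (on made-up data) computes WSS and BSS by hand and compares them with the values `kmeans()` reports:

```r
set.seed(1)
x <- matrix(rnorm(40), ncol = 2)         # 20 made-up objects in 2D
km <- kmeans(x, centers = 2, nstart = 10)

# WSS: squared distance from every object to its cluster centroid
wss <- sum((x - km$centers[km$cluster, ])^2)

# BSS: cluster size times squared distance from centroid to the sample mean
grand <- colMeans(x)
bss <- sum(km$size * rowSums(sweep(km$centers, 2, grand)^2))

# Compactness + separation add up to the total sum of squares (TSS)
tss <- sum(sweep(x, 2, grand)^2)
```

Since WSS + BSS = TSS is constant for a fixed dataset, minimising WSS automatically maximises BSS.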
k-Means Algorithm
Goal: Partition the data into k disjoint subsets
[Scatter plot of the data in the x–y plane]
Let’s take k = 3
k-Means Algorithm (k = 3)
1. Randomly assign k centroids
[Scatter plot: the k randomly assigned centroids]
k-Means Algorithm (k = 3)
1. Randomly assign k centroids
2. Assign data to the closest centroid
[Scatter plot: objects assigned to their closest centroid]
k-Means Algorithm (k = 3)
1. Randomly assign k centroids
2. Assign data to the closest centroid
3. Move centroids to the average location
[Scatter plot: centroids moved to the mean of their clusters]
k-Means Algorithm (k = 3)
1. Randomly assign k centroids
2. Assign data to the closest centroid
3. Move centroids to the average location
4. Repeat steps 2 and 3
[Scatter plot: reassignment and centroid updates repeated]
k-Means Algorithm (k = 3)
1. Randomly assign k centroids
2. Assign data to the closest centroid
3. Move centroids to the average location
4. Repeat steps 2 and 3
[Scatter plot: the final clustering]
The algorithm has converged!
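The four steps can be sketched directly in R. This is a minimal Lloyd-style version on made-up data; `my_kmeans` and the demo points are illustrative only, and base R's `kmeans()` uses the more refined Hartigan–Wong algorithm by default:

```r
my_kmeans <- function(x, k, iters = 100) {
  # 1. Randomly assign k centroids (here: k randomly chosen objects)
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  for (i in seq_len(iters)) {
    # 2. Assign each object to the closest centroid
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    cluster <- max.col(-d2)
    # 3. Move each centroid to the average location of its objects
    new_centers <- t(sapply(seq_len(k), function(j)
      colMeans(x[cluster == j, , drop = FALSE])))
    # 4. Repeat steps 2 and 3 until the centroids stop moving
    if (all(abs(new_centers - centers) < 1e-9)) break
    centers <- new_centers
  }
  list(cluster = cluster, centers = centers)
}

# Demo on two well-separated blobs
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 5), ncol = 2))
res <- my_kmeans(x, k = 2)
```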
Choosing k
● Goal: Find the k that minimises WSS
● Problem: WSS keeps decreasing as k increases!
● Solution: Look for the point where WSS starts decreasing slowly
● Rule of thumb: fix k such that WSS / TSS < 0.2
Choosing k: Scree Plot
Scree Plot: visualising the ratio WSS / TSS as a function of k
[Scree plot: WSS / TSS against k = 1, …, 7]
Look for the elbow in the plot
Choose k = 3
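A scree plot takes only a short loop. This sketch uses made-up data with three obvious groups (the variable names are illustrative):

```r
set.seed(2)
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 5), ncol = 2),
           matrix(rnorm(50, mean = 10), ncol = 2))

# WSS / TSS ratio for k = 1, ..., 7
ratio <- sapply(1:7, function(k) {
  km <- kmeans(x, centers = k, nstart = 20)
  km$tot.withinss / km$totss
})

plot(1:7, ratio, type = "b", xlab = "k", ylab = "WSS / TSS")
```

For k = 1 the ratio is exactly 1, and at the true number of groups it drops below the 0.2 rule of thumb.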
k-Means in R
> my_km <- kmeans(data, centers, nstart)
● centers: starting centroids or #clusters
● nstart: #times R restarts with different centroids
● Distance: Euclidean metric
> my_km$tot.withinss  # WSS
> my_km$betweenss     # BSS
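Putting it together on made-up data (the identity WSS + BSS = TSS also holds for `kmeans()` output):

```r
set.seed(3)
data <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
              matrix(rnorm(40, mean = 6), ncol = 2))

my_km <- kmeans(data, centers = 2, nstart = 20)

my_km$tot.withinss  # WSS
my_km$betweenss     # BSS
my_km$cluster       # cluster membership of every object
my_km$centers       # final centroids
```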
INTRODUCTION TO MACHINE LEARNING
Let’s practice!
INTRODUCTION TO MACHINE LEARNING
Performance and Scaling
Cluster Evaluation
Not trivial! There is no ground truth:
● No true labels
● No true response
Evaluation method? Depends on the goal.
Goal: compact and separated clusters → measurable!
Cluster Measures
● WSS and BSS. Underlying idea: compare the variance within clusters with the separation between clusters
● Alternative:
• Diameter
• Intercluster Distance
Both give a good indication.
Diameter
[Scatter plot: the diameter of a cluster is the largest distance between two of its objects]
Measure of Compactness
Intercluster Distance
[Scatter plot: the intercluster distance is the smallest distance between objects of two different clusters]
Measure of Separation
Dunn’s Index
Dunn’s index = minimal intercluster distance / maximal diameter
[Scatter plot: the two distances that define Dunn’s index]
Higher Dunn → better separated / more compact clusters
Notes:
● High computational cost
● Worst-case indicator
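Dunn's index can be written out directly from its definition. `dunn_index` below is an illustrative helper (clusters are assumed to be labelled 1, …, k), not the `clValid` implementation:

```r
# Dunn's index: minimal intercluster distance / maximal diameter
dunn_index <- function(x, cluster) {
  d <- as.matrix(dist(x))
  k <- max(cluster)
  # Diameter: largest distance between two objects in the same cluster
  diam <- sapply(seq_len(k), function(j) max(d[cluster == j, cluster == j]))
  # Intercluster distance: smallest distance between objects in different clusters
  inter <- min(unlist(lapply(seq_len(k - 1), function(i)
    sapply((i + 1):k, function(j) min(d[cluster == i, cluster == j])))))
  inter / max(diam)
}

# Demo: two far-apart groups give a high (good) Dunn's index
set.seed(4)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 20), ncol = 2))
dunn_val <- dunn_index(x, rep(1:2, each = 20))
```

The nested loop over all object pairs is what makes the index expensive, and taking the worst diameter and the worst gap is what makes it a worst-case indicator.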
Alternative Measures
● Internal validation: based on intrinsic knowledge
• BIC Index
• Silhouette’s Index
● External validation: based on previous knowledge
• Hubert’s Correlation
• Jaccard’s Coefficient
Evaluating in R
Libraries: cluster and clValid
Dunn’s Index:
> dunn(clusters = my_km$cluster, Data = ...)
● clusters: cluster partitioning vector
● Data: original dataset
Scale Issues
Metrics are often scale dependent!
Which pair is most similar? (Age, Income, IQ)
● X1 = (28, 72000, 120)
● X2 = (56, 73000, 80)
● X3 = (29, 74500, 118)
● Intuition: (X1, X3)
● Euclidean: (X1, X2)
Solution: Rescale, e.g. income in thousands of dollars
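The three profiles above make this concrete; with base R's `dist()`:

```r
x <- rbind(X1 = c(28, 72000, 120),
           X2 = c(56, 73000,  80),
           X3 = c(29, 74500, 118))

# Raw Euclidean distances: income dominates, so X1 and X2 come out closest
d_raw <- as.matrix(dist(x))

# Rescale income to thousands of dollars: now X1 and X3 are closest
x2 <- x
x2[, 2] <- x2[, 2] / 1000
d_rescaled <- as.matrix(dist(x2))
```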
Standardizing
Problem: multiple variables on different scales
Solution: standardize your data
1. Subtract the mean
2. Divide by the standard deviation
> scale(data)
Note: standardizing → a different interpretation of the variables
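A sketch showing that `scale()` is exactly these two steps (on made-up data):

```r
set.seed(5)
data <- matrix(rnorm(30, mean = 10, sd = 4), ncol = 3)

std <- scale(data)  # 1. subtract the mean, 2. divide by the sd, per column

# The same thing by hand
manual <- sweep(sweep(data, 2, colMeans(data), "-"),
                2, apply(data, 2, sd), "/")
```

Each column of the result has mean 0 and standard deviation 1, so no single variable dominates the distance.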
INTRODUCTION TO MACHINE LEARNING
Let’s practice!
INTRODUCTION TO MACHINE LEARNING
Hierarchical Clustering
Hierarchical Clustering
Hierarchy:
● Which objects cluster first?
● Which cluster pairs merge? When?
Bottom-up:
● Starts from the objects
● Builds a hierarchy of clusters
Bottom-Up: Algorithm
Pre: calculate the distances between all objects
[Diagram: objects with their pairwise distances]
Bottom-Up: Algorithm
1. Put every object in its own cluster
2. Find the closest pair of clusters and merge them
3. Compute the distances between the new cluster and the old ones
4. Repeat steps 2 and 3 until one cluster remains
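On a tiny made-up dataset, `hclust()` records exactly this merge sequence: negative entries in `$merge` are single objects, positive entries refer to clusters formed in earlier steps, and `$height` gives the distance at which each merge happened.

```r
x <- c(1, 2, 9, 10, 30)            # five objects on a line
d <- dist(x)                       # pre-step: all pairwise distances
hc <- hclust(d, method = "single")

hc$merge   # row i: the two clusters merged in step i
hc$height  # distance at which each merge took place
```

Here 1 and 2 merge at distance 1, then 9 and 10 at distance 1, those two clusters merge at distance 7, and finally 30 joins at distance 20.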
Linkage Methods
● Single-Linkage: minimal distance between clusters
● Complete-Linkage: maximal distance between clusters
● Average-Linkage: average distance between clusters
Different linkage methods → different clusterings
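The three definitions, computed on two tiny hand-made clusters A = {0, 1} and B = {4, 7} on a line:

```r
A <- c(0, 1)
B <- c(4, 7)
d_pairs <- outer(A, B, function(a, b) abs(a - b))  # all 4 object-to-object distances

single_d   <- min(d_pairs)   # minimal distance:  |1 - 4| = 3
complete_d <- max(d_pairs)   # maximal distance:  |0 - 7| = 7
average_d  <- mean(d_pairs)  # average distance:  (4 + 7 + 3 + 6) / 4 = 5
```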
Single-Linkage
Minimal distance between objects in each cluster
Complete-Linkage
Maximal distance between objects in each cluster
Single-Linkage: Chaining
● Often undesired
● Can be a great outlier detector
Dendrogram
[Diagram: dendrogram with height on the vertical axis and the leaves (objects) at the bottom; clusters merge as the height increases, and a horizontal cut selects the final clusters]
Hierarchical Clustering in R
Library: stats
> dist(x, method)
● x: dataset
● method: distance
> hclust(d, method)
● d: distance matrix
● method: linkage
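A complete pipeline on made-up data, cutting the dendrogram into k clusters with `cutree()` (also in the `stats` library):

```r
set.seed(7)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 6), ncol = 2))

d  <- dist(x, method = "euclidean")   # x: dataset, method: distance
hc <- hclust(d, method = "complete")  # d: distance matrix, method: linkage

plot(hc)                       # draws the dendrogram
clusters <- cutree(hc, k = 2)  # cut the tree into k = 2 clusters
```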
Hierarchical: Pros and Cons
● Pros
• In-depth analysis
• Linkage methods → different patterns
● Cons
• High computational cost
• Can never undo merges
k-Means: Pros and Cons
● Pros
• Can revise cluster assignments at every step
• Fast computations
● Cons
• Fixed #Clusters
• Dependent on the starting centroids
INTRODUCTION TO MACHINE LEARNING
Let’s practice!