Lecture 12: Unsupervised learning. Clustering

• Introduction to unsupervised learning

• Clustering problem

• K-means clustering

• Hierarchical clustering


Unsupervised learning

• In supervised learning, data is in the form of pairs 〈x, y〉, where y = f(x), and the goal is to approximate f well.

• In unsupervised learning, the data just contains x!

• Goal is to “summarize” or find “patterns” or “structure” in the data

• A variety of problems and uses:
  – Clustering: "flat" clustering or partitioning, hierarchical clustering
  – Density estimation
  – Dimensionality reduction, for: visualization, compression, pre-processing

• The definition of "ground truth" is often missing: no clear error function, or at least many reasonable alternatives

• Often useful in exploratory data analysis, and as a pre-processing step for supervised learning


What is clustering?

• Clustering is grouping similar objects together:
  – To establish prototypes, or detect outliers.
  – To simplify data for further analysis/learning.
  – To visualize data (in conjunction with dimensionality reduction).

• Clusterings are usually not "right" or "wrong" – different clusterings/clustering criteria can reveal different things about the data.

• Some clustering criteria/algorithms have natural probabilistic interpretations.

• Clustering algorithms:
  – Employ some notion of distance between objects
  – Have an explicit or implicit criterion defining what a good cluster is
  – Heuristically optimize that criterion to determine the clustering


K-means clustering

• One of the most commonly-used clustering algorithms, because it is easy to implement and quick to run.

• Assumes the objects (instances) to be clustered are n-dimensional vectors, x_i.

• Uses Euclidean distance

• The goal is to partition the data into K disjoint subsets


K-means clustering with real-valued data

• Inputs:
  – A set of n-dimensional real vectors {x_1, x_2, ..., x_m}.
  – K, the desired number of clusters.

• Output: A mapping of the vectors into K clusters (disjoint subsets), C : {1, ..., m} → {1, ..., K}.

1. Initialize C randomly.
2. Repeat
   (a) Compute the centroid of each cluster (the mean of all the instances in the cluster).
   (b) Reassign each instance to the cluster with the closest centroid.
   until C stops changing.
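A minimal sketch of this procedure in Python/NumPy; the function name `kmeans`, its defaults, and the empty-cluster re-seeding are illustrative choices, not part of the lecture:

```python
import numpy as np

def kmeans(X, K, max_iter=100, rng=None):
    """Lloyd's algorithm sketch: X is an (m, n) array; returns (assignments, centroids)."""
    rng = np.random.default_rng(rng)
    m = X.shape[0]
    C = rng.integers(0, K, size=m)                     # 1. initialize C randomly
    for _ in range(max_iter):                          # 2. repeat ...
        # (a) centroid of each cluster (re-seed a cluster that became empty)
        centroids = np.array([X[C == k].mean(axis=0) if np.any(C == k) else X[rng.integers(m)]
                              for k in range(K)])
        # (b) reassign each instance to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_C = dists.argmin(axis=1)
        if np.array_equal(new_C, C):                   # ... until C stops changing
            break
        C = new_C
    return C, centroids
```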

Example (figures): K-means with 3 clusters
• initial data
• assign into 3 clusters randomly
• compute centroids
• reassign clusters
• recompute centroids
• reassign clusters
• recompute centroids – done!

What if we do not know the right number of clusters?

Example (figures): K-means with 4 clusters
• assign into 4 clusters randomly
• compute centroids
• reassign clusters
• recompute centroids
• reassign clusters
• recompute centroids – done!

Questions

• What is K-means trying to optimize?
• Will it terminate?
• Will it always find the same answer?
• How should we choose the initial cluster centers?
• Can we automatically choose the number of centers?


Does K-means clustering terminate?

• For given data {x_1, ..., x_m} and a clustering C, consider the sum of the squared Euclidean distances between each vector and the centroid of its cluster:

  J = \sum_{i=1}^{m} \| x_i - \mu_{C(i)} \|^2 ,

  where \mu_{C(i)} denotes the centroid of the cluster containing x_i.

• There are finitely many possible clusterings: at most K^m.

• Each time we reassign a vector to a cluster with a nearer centroid, J decreases.

• Each time we recompute the centroids of each cluster, J decreases (or stays the same).

• Thus J strictly decreases whenever C changes; since there are only finitely many clusterings, no clustering can be revisited, and the algorithm must terminate.
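For the hypothetical `kmeans` sketch above, J can be computed directly; this helper (its name and interface are assumptions) is handy for checking that J never increases between iterations:

```python
def kmeans_objective(X, C, centroids):
    """J: sum of squared Euclidean distances from each point to its assigned centroid."""
    return float(((X - centroids[C]) ** 2).sum())
```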


Does K-means always find the same answer?

• K-means is a version of coordinate descent, where the parameters are the cluster center coordinates and the assignments of points to clusters.

• It minimizes the sum of squared Euclidean distances from vectors to their cluster centroid.

• This error function has many local minima!

• The solution found is locally optimal, but not necessarily globally optimal.

• Because the solution depends on the initial assignment of instances to clusters, random restarts will give different solutions.


Example

Two solutions obtained from different initializations: J = 0.22870 and J = 0.3088.


Example application: Color quantization

• Suppose you have an image stored with 24 bits per pixel

• You want to compress it so that you use only 8 bits per pixel (256 colors).

• You want the compressed image to look as similar as possible to the original image.

⇒ Perform K-means clustering on the original set of color vectors with K = 256 colors:
  – Cluster centers (rounded to integer intensities) form the entries in the 256-color colormap.
  – Each pixel is represented by an 8-bit index into the colormap.
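A sketch of this recipe, reusing the hypothetical `kmeans` function above (the helper name and array shapes are assumptions for illustration):

```python
import numpy as np

def quantize_colors(image, K=256, **kw):
    """Cluster the RGB vectors of an (H, W, 3) uint8 image and index each pixel
    by its nearest cluster center, as described above."""
    h, w, _ = image.shape
    pixels = image.reshape(-1, 3).astype(float)
    C, centroids = kmeans(pixels, K, **kw)                            # hypothetical sketch above
    colormap = np.clip(np.rint(centroids), 0, 255).astype(np.uint8)   # 256-entry colormap
    indices = C.astype(np.uint8).reshape(h, w)                        # 8-bit index per pixel
    return indices, colormap                                          # reconstruct as colormap[indices]
```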


Example (Bishop): the image quantized with K = 2, K = 3, and K = 10, alongside the original image.


More generally: Vector quantization with Euclidean loss

• Suppose we want to send all the instances over a communication channel.

• In order to compress the message, we cluster the data and encode each instance as the center of the cluster to which it belongs.

• The reconstruction error for real-valued data can be measured as the Euclidean distance between the true value and its encoding.

• An optimal (J-minimizing) K-means clustering minimizes the reconstruction error among all possible codings of the same type.


Finding good initial configurations

• The initial configuration can influence the final clustering

• Assigning each item to a random cluster in {1, ..., K} is unbiased... but typically results in cluster centroids near the centroid of all the data in the first round.

• A different heuristic tries to spread the initial centroids around as much as possible (sketched below):
  – Place the first center on top of a randomly chosen data point.
  – Place the second center on a data point as far away as possible from the first one.
  – Place the i-th center as far away as possible from the closest of centers 1 through i − 1.

• K-means clustering typically runs quickly. With a randomized initialization step, you can run the algorithm multiple times and take the clustering with the smallest J.
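A sketch of that spreading heuristic (farthest-point initialization); the function name and the use of NumPy are assumptions:

```python
import numpy as np

def farthest_point_init(X, K, rng=None):
    """First center: a random data point; each next center: the data point farthest
    from its closest already-chosen center."""
    rng = np.random.default_rng(rng)
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, K):
        # distance from each point to its nearest already-chosen center
        d = np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2).min(axis=1)
        centers.append(X[d.argmax()])
    return np.array(centers)
```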


Choosing the number of clusters

• A difficult problem; various ideas are floating around:

• Delete clusters that cover too few points

• Split clusters that cover too many points

• Add extra clusters for “outliers”

• Minimum description length: minimize loss + complexity of the clustering

• Use a hierarchical method first (see in a bit)


Why the sum of squared Euclidean distances?

Subjective reason: It produces nice, round clusters.


Why the sum of squared Euclidean distances?

Objective reason: Maximum Likelihood Principle

• Suppose the data really does divide into K clusters.

• Suppose the data in each cluster is generated by independent samples from a multivariate Gaussian distribution, where:
  – The mean of the Gaussian is the centroid of the cluster
  – The covariance matrix is of the form σ²I

• Then the probability of the data is highest when the sum of squared Euclidean distances is smallest.


Derivation: similar to the MSE motivation in supervised learning

l(x_1, ..., x_m | C, \mu) = \prod_{i=1}^{m} l(x_i | C, \mu)
                          = \prod_{i=1}^{m} \frac{1}{(2\pi)^{n/2} \sigma^{n}} \exp\left( -\frac{1}{2\sigma^{2}} \| x_i - \mu_{C(i)} \|^{2} \right)

\log l(x_1, ..., x_m | C, \mu) \propto -\sum_{i=1}^{m} \| x_i - \mu_{C(i)} \|^{2} = -J

so maximizing the (log-)likelihood is the same as minimizing J.


Why not the sum of squared Euclidean distances?

1. It produces nice round clusters!

2. Differently scaled axes can dramatically affect results.

3. There may be symbolic attributes, which have to be treated differently


K-means-like clustering in general

• Given a set of objects (need not be real vectors):
  – Choose a notion of pairwise distance / similarity between the objects.
  – Choose a scoring function for the clustering.
  – Optimize the scoring function to find a good clustering.

• For most choices, the optimization problem will be intractable. Local optimization is often necessary.


Distance metrics

• Euclidean distance

• Hamming distance (number of mismatches between two strings)

• Travel distance along a manifold (e.g. for geographic points)

• Tempo / rhythm similarity (for songs)

• Shared keywords (for web pages), or shared in-links

• . . .
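Any of these can be plugged into the general recipe from the previous slide. A small sketch using the Hamming distance defined above (helper names are illustrative):

```python
def hamming(s, t):
    """Number of mismatched positions between two equal-length strings."""
    assert len(s) == len(t)
    return sum(a != b for a, b in zip(s, t))

def pairwise_distances(objects, dist):
    """D[i][j] = dist(objects[i], objects[j]) for any plug-in distance."""
    return [[dist(a, b) for b in objects] for a in objects]
```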


Scoring functions

• Minimize: Summed distances between all pairs of objects in the same cluster. (Also known as "within-cluster scatter.")

• Minimize: Maximum distance between any two objects in the same cluster. (Can be hard to optimize.)

• Maximize: Minimum distance between any two objects in different clusters.


Common uses of K-means

• Often used in exploratory data analysis

• Often used as a pre-processing step before supervised learning

• In one dimension, it is a good way to discretize real-valued variables into non-uniform buckets.

• Used in speech understanding/recognition to convert wave forms into one of k categories (vector quantization).


Hierarchical clustering

• Organizes data instances into trees.

• For visualization, exploratory data analysis.

• Agglomerative methods build the tree bottom-up, successively grouping together the clusters deemed most similar.

• Divisive methods build the tree top-down, recursively partitioning the data.


What is a hierarchical clustering?

• Given instances D = {x1, . . . ,xm}.

• A hierarchical clustering is a set of subsets (clusters) of D, C = {C_1, ..., C_K}, where:
  – Every element in D is in at least one set of C (the root).
  – The C_j can be assigned to the nodes of a tree such that the cluster at any node is precisely the union of the clusters at the node's children (if any).


Example of a hierarchical clustering

• Suppose D = {1, 2, 3, 4, 5, 6, 7}. A hierarchical clustering is C = {{1}, {2, 3}, {4, 5}, {1, 2, 3, 4, 5}, {6, 7}, {1, 2, 3, 4, 5, 6, 7}}:

  {1,2,3,4,5,6,7}
  ├── {1,2,3,4,5}
  │   ├── {1}
  │   ├── {2,3}
  │   └── {4,5}
  └── {6,7}

• In this example:
  – Leaves of the tree need not correspond to single instances.
  – The branching factor of the tree is not limited.

• However, most hierarchical clustering algorithms produce binary trees, and take single instances as the smallest clusters.


Agglomerative clustering

• Input: Pairwise distances d(x,x′) between a set of data objects {xi}.

• Output: A hierarchical clustering

• Algorithm:
  1. Assign each instance as its own cluster on a working list W.
  2. Repeat
     (a) Find the two clusters in W that are most "similar".
     (b) Remove them from W.
     (c) Add their union to W.
     until W contains a single cluster with all the data objects.
  3. Return all clusters appearing in W at any stage of the algorithm.
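A compact sketch of this procedure (the names, the tuple representation of clusters, and the `linkage` callable are illustrative; concrete linkage criteria appear two slides below):

```python
def agglomerative(objects, dist, linkage):
    """Bottom-up clustering: returns every cluster that ever appears on the working list.
    `linkage(objects, A, B, dist)` scores the dissimilarity of clusters A and B
    (tuples of indices into `objects`)."""
    W = [(i,) for i in range(len(objects))]            # 1. each instance is its own cluster
    all_clusters = list(W)
    while len(W) > 1:                                  # 2. repeat until one cluster remains
        # (a) find the two most similar (least dissimilar) clusters on the working list
        _, A, B = min((linkage(objects, A, B, dist), A, B)
                      for i, A in enumerate(W) for B in W[i + 1:])
        W.remove(A); W.remove(B)                       # (b) remove them from W
        W.append(A + B)                                # (c) add their union to W
        all_clusters.append(A + B)
    return all_clusters                                # 3. all clusters seen along the way
```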


How many clusters?

• How many clusters are generated by the agglomerative clustering algorithm?

• Answer: 2m − 1, where m is the number of data objects.

• Why? A binary tree with m leaves has m − 1 internal nodes, thus 2m − 1 nodes total.

• More explicitly:
  – The working list W starts with m singleton clusters.
  – Each iteration removes two clusters from W and adds one new one.
  – The algorithm stops when W has one cluster, which is after m − 1 iterations.
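A quick sanity check of this count using the hypothetical sketches above (the toy strings and the inline single-linkage function are made up; the single-linkage criterion is defined properly on the next slide):

```python
objs = ["karel", "kamel", "tunel", "tonic", "topic"]        # toy equal-length strings
single = lambda objects, A, B, dist: min(dist(objects[i], objects[j]) for i in A for j in B)
clusters = agglomerative(objs, hamming, single)
assert len(clusters) == 2 * len(objs) - 1                   # m singletons + (m - 1) merges
```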


How do we measure dissimilarity between clusters?

• Distance between nearest objects ("single-linkage" agglomerative clustering, or "nearest neighbor"):

  \min_{x \in C, x' \in C'} d(x, x')

• Distance between farthest objects ("complete-linkage" agglomerative clustering, or "furthest neighbor"):

  \max_{x \in C, x' \in C'} d(x, x')

• Average distance between objects ("group-average" agglomerative clustering):

  \frac{1}{|C| |C'|} \sum_{x \in C, x' \in C'} d(x, x')
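Straightforward Python sketches of the three criteria, matching the hypothetical `agglomerative` function earlier (cluster arguments are tuples of indices):

```python
def single_link(objects, A, B, dist):
    """Nearest-neighbor dissimilarity between clusters A and B."""
    return min(dist(objects[i], objects[j]) for i in A for j in B)

def complete_link(objects, A, B, dist):
    """Furthest-neighbor dissimilarity between clusters A and B."""
    return max(dist(objects[i], objects[j]) for i in A for j in B)

def average_link(objects, A, B, dist):
    """Group-average dissimilarity between clusters A and B."""
    return sum(dist(objects[i], objects[j]) for i in A for j in B) / (len(A) * len(B))
```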


Example 1 (figures): a data set clustered agglomeratively; snapshots at iterations 30, 60, 70, 78, and 79, each shown for single, average, and complete linkage.

Example 2 (figures): a second data set; snapshots at iterations 50, 80, 90, 95, and 99, each shown for single, average, and complete linkage.

Intuitions about cluster similarity

• Single-linkage:
  – Favors spatially-extended / filamentous clusters.
  – Often leaves singleton clusters until near the end.

• Complete-linkage favors compact clusters

• Average-linkage is somewhere in between


Monotonicity

• Let A, B, C be clusters.

• Let d be one of the dissimilarity measures: single linkage (see below), average linkage, or complete linkage.

• If d(A, B) ≤ d(A, C) and d(A, B) ≤ d(B, C), then d(A, B) ≤ d(A ∪ B, C).


Monotonicity of single-linkage criterion: Proof

• Suppose that d(A, B) ≤ d(A, C) and d(A, B) ≤ d(B, C).

• Then:

d(A ∪ B, C) = \min_{x \in A \cup B, x' \in C} d(x, x')
            = \min\left( \min_{x_a \in A, x' \in C} d(x_a, x'),\ \min_{x_b \in B, x' \in C} d(x_b, x') \right)
            = \min(d(A, C), d(B, C))
            ≥ \min(d(A, B), d(A, B))
            = d(A, B)

• Proofs for group-average and complete-linkage are similar.


Dendrograms

• The monotonicity property implies that every time agglomerative clustering merges two clusters, the dissimilarity of those clusters is ≥ the dissimilarity of all previous merges.

• Dendrograms (trees depicting hierarchical clusterings) are often drawn so that the height of a node corresponds to the dissimilarity of the merged clusters.

• We can form a flat clustering by cutting the tree at any height.

• Jumps in the height of the dendrogram can suggest natural cutoffs.
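For real-valued data, SciPy's hierarchical-clustering utilities can build and cut such a dendrogram; a sketch (the toy data, linkage method, and cut height are arbitrary choices for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.rand(30, 2)                        # toy 2-D data
Z = linkage(X, method="single")                  # also "complete" or "average"
# dendrogram(Z)                                  # draws the tree (needs matplotlib)
flat = fcluster(Z, t=0.2, criterion="distance")  # flat clustering: cut the tree at height 0.2
```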


Dendrograms for Example 1 (figures): single, average, and complete linkage.

Dendrograms for Example 2 (figures): single, average, and complete linkage.

Divisive clustering

• Works by recursively partitioning the instances.

• How might you do that?
  – K-means?
  – Max weighted cut on a graph where edges are weighted by pairwise distances?
  – Maximum margin?

• Many heuristics for partitioning the instances have been proposed... but many are computationally hard and/or violate monotonicity, making it hard to draw dendrograms.


Summary

• K-means:
  – Fast way of partitioning data into K clusters.
  – Minimizes the sum of squared Euclidean distances to the cluster centroids.
  – Different clusterings can result from different initializations.
  – Can be interpreted as fitting a mixture distribution.

• Hierarchical clustering:
  – Organizes data objects into a tree based on similarity.
  – Agglomerative (bottom-up) tree construction is most popular.
  – There are several choices of distance metric (linkage criterion).
  – Monotonicity allows us to draw dendrograms in which the height of a node corresponds to the dissimilarity of the clusters merged.
  – Trees can be cut off at some level to generate a flat partitioning of the data.
