Mean shift and Hierarchical clustering

Clustering for new discovery in data

Mean shift clustering Hierarchical clustering

- Kunal Parmar

Houston Machine Learning Meetup

1/21/2017

Clustering: A world without labels

• Finding hidden structure in data when we don’t have labels/classes for the data
• We group data together based on some notion of similarity in the feature space

Clustering approaches covered in previous lecture

• k-means clustering
  o Iterative partitioning into k clusters based on the proximity of an observation to the cluster mean
• DBSCAN
  o Partition the feature space based on density

In this segment:

• Mean shift clustering
• Hierarchical clustering

Mean shift clustering

• Mean shift clustering is a non-parametric, iterative, mode-based clustering technique based on kernel density estimation.
• It is very commonly used in the field of computer vision because of its effectiveness in image segmentation.

• It assumes that our data is sampled from an underlying probability distribution
• The algorithm finds the modes (peaks) of the estimated probability density. Each mode corresponds to a cluster: the data points that climb to the same mode are grouped together

Kernel density estimation

[Figure: a set of points and the corresponding KDE surface]
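To make the idea of a KDE surface concrete, here is a minimal one-dimensional sketch, assuming a Gaussian kernel. The function name `gaussian_kde_1d`, the sample points and the bandwidth `h` are made up for illustration; they are not from the slides.

```python
import math

def gaussian_kde_1d(x, points, h):
    """Density estimate at x: the average of Gaussian bumps of width h,
    one bump centred on each data point."""
    norm = 1.0 / (len(points) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - p) / h) ** 2) for p in points)

# Hypothetical sample: one group of points near 1, another near 5.
points = [0.8, 1.0, 1.2, 4.9, 5.0, 5.1]
h = 0.5  # bandwidth (assumed value)

density_near_group = gaussian_kde_1d(1.0, points, h)      # on a peak
density_between_groups = gaussian_kde_1d(3.0, points, h)  # in the valley
```

The estimate is high where points are dense and close to zero in the gap between the two groups; the peaks are the modes that mean shift looks for.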

Algorithm: Mean shift

1. Define a window (its size is the bandwidth of the kernel used for estimation) and place the window on a data point
2. Calculate the mean of all the points within the window
3. Move the window to the location of the mean
4. Repeat steps 2-3 until convergence

• The procedure is run from every data point. On convergence, the data points whose windows end at the same mode form a cluster.
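The steps above can be sketched in a few lines of NumPy. This is a rough illustration under stated assumptions, not a reference implementation: it uses a flat kernel (a plain mean inside the window) rather than a Gaussian one, and the names `mean_shift_flat`, `group_by_mode`, `merge_radius` and the sample data are hypothetical.

```python
import numpy as np

def mean_shift_flat(points, bandwidth, max_iter=100, tol=1e-6):
    """Run steps 1-4 for a window started on every data point (flat kernel)."""
    points = np.asarray(points, dtype=float)
    windows = points.copy()  # step 1: one window centred on each data point
    for _ in range(max_iter):
        largest_move = 0.0
        for i, centre in enumerate(windows):
            # step 2: mean of all data points inside the window
            inside = np.linalg.norm(points - centre, axis=1) <= bandwidth
            new_centre = points[inside].mean(axis=0)
            largest_move = max(largest_move, np.linalg.norm(new_centre - centre))
            windows[i] = new_centre  # step 3: move the window to the mean
        if largest_move < tol:  # step 4: stop once no window moves any more
            break
    return windows

def group_by_mode(windows, merge_radius):
    """Give the same label to points whose windows ended at nearly the same place."""
    modes, labels = [], []
    for centre in windows:
        for k, mode in enumerate(modes):
            if np.linalg.norm(centre - mode) < merge_radius:
                labels.append(k)
                break
        else:  # no existing mode is close enough: this is a new mode
            modes.append(centre)
            labels.append(len(modes) - 1)
    return np.array(modes), np.array(labels)

# Hypothetical 2-D data: two small, well-separated groups.
data = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
        [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]]
windows = mean_shift_flat(data, bandwidth=1.0)
modes, labels = group_by_mode(windows, merge_radius=0.5)
```

Each inner pass compares one window against all n points, so a full sweep costs on the order of n² distance computations.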

Example: Mean shift

Types of kernels

• Generally, a Gaussian kernel is used for density estimation in mean shift clustering.
• However, other kinds of kernels can also be used, e.g.
  o Rectangular kernel (also called a flat or uniform kernel), etc.
• The choice of kernel affects the clustering result

• The choice of the kernel bandwidth (window size) will also impact the clustering result
  o Small bandwidths will result in lots of clusters, some even being individual data points
  o Large bandwidths will result in one or two huge clusters
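The effect of the bandwidth is easy to see on a toy problem. The sketch below is a simplified one-dimensional, flat-kernel mean shift written for this illustration; `count_modes`, the data and the three bandwidth values are assumptions, not something from the slides.

```python
import numpy as np

def count_modes(points, bandwidth, n_iter=50):
    """Flat-kernel mean shift in 1-D; returns how many distinct modes are reached."""
    points = np.asarray(points, dtype=float)
    windows = points.copy()
    for _ in range(n_iter):
        for i, centre in enumerate(windows):
            inside = np.abs(points - centre) <= bandwidth
            windows[i] = points[inside].mean()
    # Windows that ended at the same location count as one mode.
    return len(np.unique(np.round(windows, 6)))

# Hypothetical 1-D data: two loose groups.
data = [1.0, 1.5, 2.0, 6.0, 6.5, 7.0]

modes_small = count_modes(data, bandwidth=0.1)   # every point is its own cluster
modes_medium = count_modes(data, bandwidth=1.0)  # the two groups are recovered
modes_large = count_modes(data, bandwidth=10.0)  # everything merges into one cluster
```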

Pros and cons: Mean shift

• Pros
  o Model-free; doesn’t assume a predefined shape for the clusters
  o Relies on only one parameter: the kernel bandwidth h
  o Robust to outliers
• Cons
  o Selecting the window size is not trivial
  o Computationally expensive: O(n^2) per iteration
  o Sensitive to the choice of kernel bandwidth: a small h slows down convergence; a large h speeds it up but might merge two modes

Applications: Mean shift

• Clustering and segmentation


Hierarchical clustering

• Hierarchical clustering builds a hierarchy of nested clusters, ordered from top to bottom.
• There are two types of hierarchical clustering:
  o Divisive: top-down approach (start with one cluster and split it recursively)
  o Agglomerative: bottom-up approach (start with single points and merge them)

Algorithm: Hierarchical agglomerative clustering

1. Place each data point in its own singleton group
2. Merge the two closest groups
3. Repeat step 2 until all the data points are merged into a single cluster

• We obtain a dendrogram (tree-like structure) at the final step. We cut the dendrogram at a certain level to obtain the final set of clusters.
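The three steps can be sketched in plain Python as follows. This is a deliberately naive version (it recomputes every group-to-group distance on each pass), and the function name `agglomerative`, the 1-D sample points and the choice of `max` as the linkage are assumptions made for illustration.

```python
def agglomerative(points, distance, linkage):
    """Merge the two closest groups until one remains; return the merge history."""
    groups = [[p] for p in points]  # step 1: every point is its own group
    history = []
    while len(groups) > 1:  # step 3: repeat until a single group is left
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                # group-to-group distance: linkage over all point pairs
                d = linkage(distance(a, b) for a in groups[i] for b in groups[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best  # step 2: merge the two closest groups
        history.append((groups[i], groups[j], d))
        merged = groups[i] + groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return history

# Hypothetical 1-D points, absolute difference as distance, complete linkage (max).
history = agglomerative([1.0, 1.1, 5.0, 5.3, 9.0], lambda a, b: abs(a - b), max)
for left, right, height in history:
    print(left, "+", right, "merged at", round(height, 3))
```

The recorded merge heights are exactly what a dendrogram draws on its vertical axis.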

Cluster similarity or dissimilarity

• Distance metric
  o Euclidean distance
  o Manhattan distance
  o Jaccard distance (1 - Jaccard index), etc.
• Linkage criteria
  o Single linkage
  o Complete linkage
  o Average linkage
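For reference, the three distance metrics listed above can be written out directly. The function names and sample inputs below are illustrative choices; the Jaccard version assumes the samples are sets and returns a dissimilarity (1 minus the Jaccard index).

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| for two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention assumed here: two empty sets are identical
    return 1.0 - len(a & b) / len(a | b)

d_euclid = euclidean((0, 0), (3, 4))   # 5.0
d_manhat = manhattan((0, 0), (3, 4))   # 7
d_jaccard = jaccard_distance({"x", "y", "z"}, {"y", "z", "w"})  # 1 - 2/4 = 0.5
```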

Linkage criteria

• A linkage criterion quantifies the distance between sets of observations, i.e. the intermediate clusters formed during the agglomeration process

Single linkage

• Distance between two clusters is the shortest distance between a point in one cluster and a point in the other

Complete linkage

• Distance between two clusters is the longest distance between a point in one cluster and a point in the other

Average linkage

• Distance between two clusters is the average distance between every point in one cluster and every point in the other cluster
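The three criteria differ only in how they summarise the pairwise distances between two clusters. A minimal sketch, with hypothetical function names and two made-up 1-D clusters:

```python
from itertools import product

def pairwise(cluster_a, cluster_b, distance):
    """All distances between a point of cluster_a and a point of cluster_b."""
    return [distance(a, b) for a, b in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b, distance):
    return min(pairwise(cluster_a, cluster_b, distance))  # closest pair

def complete_linkage(cluster_a, cluster_b, distance):
    return max(pairwise(cluster_a, cluster_b, distance))  # farthest pair

def average_linkage(cluster_a, cluster_b, distance):
    d = pairwise(cluster_a, cluster_b, distance)
    return sum(d) / len(d)  # mean over all pairs

# Hypothetical clusters; the pairwise distances are 3, 7, 2 and 6.
left, right = [0.0, 1.0], [3.0, 7.0]
dist = lambda a, b: abs(a - b)

d_single = single_linkage(left, right, dist)      # 2.0
d_complete = complete_linkage(left, right, dist)  # 7.0
d_average = average_linkage(left, right, dist)    # 4.5
```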

Example: Hierarchical clustering

• We consider a small dataset with seven samples:
  o (A, B, C, D, E, F, G)
• Metrics used in this example
  o Distance metric: Jaccard distance (1 - Jaccard index)
  o Linkage criterion: complete linkage


• We construct a dissimilarity matrix based on Jaccard index.

• B and F are merged in this step as they have the lowest dissimilarity


• How do we calculate the distance of (B,F) to the other clusters?
  o This is where the choice of linkage criterion comes in
  o Since we are using complete linkage, we use the maximum distance between the members of the two clusters
  o So,
    • Dissimilarity(B, A): 0.5000
    • Dissimilarity(F, A): 0.6250
    • Hence, Dissimilarity((B,F), A): 0.6250
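The update for the merged cluster is a one-liner. The snippet below uses the two dissimilarities quoted above; the single- and average-linkage values are added only for comparison and are not part of the slide's example.

```python
# Dissimilarities to A, as given in the example.
dissimilarity_to_a = {"B": 0.5000, "F": 0.6250}

# Complete linkage: (B,F) is as far from A as its farthest member.
complete = max(dissimilarity_to_a.values())

# For comparison only: single linkage takes the nearest member,
# average linkage the mean over both members.
single = min(dissimilarity_to_a.values())
average = sum(dissimilarity_to_a.values()) / len(dissimilarity_to_a)

print(complete, single, average)
```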


• We iteratively merge clusters at each step until all the data points are covered:
  i. merge the two clusters with the lowest dissimilarity
  ii. update the dissimilarity matrix based on the merged clusters

Dendrogram

• At the end of the agglomeration process, we obtain a dendrogram of the seven samples.

[Figure: dendrogram]

Cutting the tree

• We cut the dendrogram at a level where there is a large jump between successive merge dissimilarities

• If we cut the tree at 0.5, then we can say that within each cluster every pair of samples has at least 50% similarity
• So our final set of clusters is:
  i. (B,F),
  ii. (A,E,C,G) and
  iii. (D)
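Cutting a dendrogram at a fixed height can be tried out with SciPy's `linkage` and `fcluster`. The slide's actual dissimilarity matrix is not reproduced in this transcript, so the sketch below uses a made-up matrix for four hypothetical samples P, Q, R, S; only the cut level of 0.5 and the use of complete linkage follow the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical samples and dissimilarity matrix (not the slide's data).
names = ["P", "Q", "R", "S"]
dissimilarity = np.array([
    [0.0, 0.2, 0.7, 0.9],
    [0.2, 0.0, 0.6, 0.8],
    [0.7, 0.6, 0.0, 0.4],
    [0.9, 0.8, 0.4, 0.0],
])

# linkage() expects the condensed (upper-triangle) form of the matrix.
tree = linkage(squareform(dissimilarity), method="complete")

# Cut at height 0.5: only merges made at a dissimilarity of 0.5 or less are kept.
labels = fcluster(tree, t=0.5, criterion="distance")

clusters = {}
for name, label in zip(names, labels):
    clusters.setdefault(int(label), []).append(name)
print(sorted(clusters.values()))
```

Here P and Q merge at 0.2 and R and S at 0.4, while the two pairs only join at 0.9, so a cut at 0.5 leaves two clusters.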

Final set of clusters

Impact of metrics

• The metrics chosen for hierarchical clustering can lead to vastly different clusters.
• Distance metric
  o In a 2-dimensional space, the distance between the point (1,1) and the origin (0,0) is 2 under Manhattan distance but √2 under Euclidean distance.
• Linkage criteria
  o The distance between two clusters can differ depending on the linkage criterion used

Linkage criteria

• Complete linkage is one of the most popular linkage criteria for hierarchical clustering. It is less sensitive to noise and outliers than single linkage.
• Single linkage can handle non-elliptical shapes. However, it can produce clusters that are quite heterogeneous internally, and it is more sensitive to outliers and noise

Pros and cons: Hierarchical clustering

• Pros
  o No assumption of a particular number of clusters
  o May correspond to meaningful taxonomies
• Cons
  o Once a decision is made to combine two clusters, it can’t be undone
  o Too slow for large data sets: O(n^2 log n)

References

i. https://spin.atomicobject.com/2015/05/26/mean-shift-clustering/
ii. http://vision.stanford.edu/teaching/cs131_fall1314_nope/lectures/lecture13_kmeans_cs131.pdf
iii. http://84.89.132.1/~michael/stanford/maeb7.pdf


Thank you!
