Mean shift and Hierarchical clustering
-
Upload
yan-xu -
Category
Data & Analytics
-
view
111 -
download
4
Transcript of Mean shift and Hierarchical clustering
Clustering for new discovery in data
Mean shift clustering Hierarchical clustering
- Kunal Parmar
Houston Machine
Learning Meetup
1/21/2017
Clustering : A world without labels
• Finding hidden structure in data when we don’t
have labels/classes for the data
• We group data
together based
on some notion
of similarity in
the feature space
Clustering approaches covered in previous lecture
• k-means clustering o Iterative partitioning into k clusters based on proximity of an observation to
the cluster mean
Clustering approaches covered in previous lecture
• DBSCAN o Partition the feature space based on density
In this segment,
Mean shift clustering Hierarchical clustering
Mean shift clustering • Mean shift clustering is a non-parametric iterative
mode-based clustering technique based on kernel
density estimation.
• It is very commonly used in the field of computer
vision because of it’s high efficiency in image
segmentation.
Mean shift clustering
• It assumes that our data is sampled from an
underlying probability distribution
• The algorithm finds out the modes(peaks) of the
probability distribution. The underlying kernel
distribution at the mode corresponds to a cluster
Kernel density estimation
Set of points KDE surface
Algorithm: Mean shift 1. Define a window (bandwidth of the kernel to be
used for estimation) and place the window on a
data point
2. Calculate mean of all the points within the window
3. Move the window to the location of the mean
4. Repeat step 2-3 until convergence
• On convergence, all data points within that window
form a cluster.
Example: Mean shift
Example: Mean shift
Example: Mean shift
Example: Mean shift
Types of kernels
• Generally, a Gaussian kernel is used for probability
estimation in mean shift clustering.
• However, other kinds of kernels that can be used
are, o Rectangular kernel
o Flat kernel, etc.
• The choice of kernel affects the clustering result
Types of kernels
• The choice of the bandwidth of the kernel(window)
will also impact the clustering result o Small kernels will result in lots of clusters, some even being individual data
points
o Big kernels will result in one or two huge clusters
Pros and cons : Mean Shift • Pros
o Model-free, doesn’t assume predefined shape of clusters
o Only relies on one parameter: kernel bandwidth h
o Robust to outliers
• Cons o The selection of window size is not trivial
o Computationally expensive; O(𝑛2)
o Sensitive to selection of kernel bandwidth; small h will slow down convergence,
large h speeds it up but might merge two modes
Applications : Mean Shift • Clustering and segmentation
• dfsn
Applications : Mean Shift • Clustering and Segmentation
Hierarchical Clustering
• Hierarchical clustering creates clusters that have a
predetermined ordering from top to bottom.
• There are two types of hierarchical clustering: o Divisive
• Top to bottom approach
o Agglomerative
• Bottom to top approach
Algorithm:
Hierarchical agglomerative clustering
1. Place each data point in it’s own singleton group
2. Iteratively merge the two closest groups
3. Repeat step 2 until all the data points are merged
into a single cluster
• We obtain a dendogram(tree-like structure) at the
final step. We cut the dendogram at a certain level
to obtain the final set of clusters.
Cluster similarity or dissimilarity
• Distance metric o Euclidean distance
o Manhattan distance
o Jaccard index, etc.
• Linkage criteria o Single linkage
o Complete linkage
o Average linkage
Linkage criteria • It is the quantification of the distance between sets
of observations/intermediate clusters formed in the
agglomeration process
Single linkage • Distance between two clusters is the shortest
distance between two points in each cluster
Complete linkage • Distance between two clusters is the longest
distance between two points in each cluster
Average linkage
• Distance between clusters is the average distance
between each point in one cluster to every point in
other cluster
Example: Hierarchical clustering
• We consider a small dataset with seven samples; o (A, B, C, D, E, F, G)
• Metrics used in this example o Distance metric: Jaccard index
o Linkage criteria: Complete linkage
Example: Hierarchical clustering
• We construct a dissimilarity matrix based on Jaccard index.
• B and F are merged in this step as they have the lowest dissimilarity
Example: Hierarchical clustering
• How do we calculate distance of (B,F) with other
clusters?
o This is where the choice of linkage criteria comes in
o Since we are using complete linkage, we use the maximum distance
between two clusters
o So,
• Dissimilarity(B, A) : 0.5000
• Dissimilarity(F, A) : 0.6250
• Hence, Dissimilarity((B,F), A) : 0.6250
Example: Hierarchical clustering
• We iteratively merge clusters at each step until all
the data points are covered, i. merge two clusters with lowest dissimilarity
ii. update the dissimilarity matrix based on merged clusters
o sfs
Dendogram • At the end of the agglomeration process, we
obtain a dendogram that looks like this,
• sfdafdfsdfsd
Cutting the tree • We cut the dendogram at a level where there is a
jump in the clustering levels/dissimilarities
Cutting the tree • If we cut the tree at 0.5, then we can say that within
each cluster the samples have more than 50%
similarity
• So our final set of clusters is, i. (B,F),
ii. (A,E,C,G) and
iii. (D)
Final set of clusters
Impact of metrics • The metrics chosen for hierarchical clustering can
lead to vastly different clusters.
• Distance metric o In a 2-dimensional space, the distance between the point (1,0) and the
origin (0,0) can be 2 under Manhattan distance, 2 under Euclidean
distance.
• Linkage criteria o Distance between two clusters can be different based on linkage criteria
used
Linkage criteria • Complete linkage is the most popular metric used
for hierarchical clustering. It is less sensitive to
outliers.
• Single linkage can handle non-elliptical shapes. But,
single linkage can lead to clusters that are quite
heterogeneous internally and it more sensitive to
outliers and noise
Pros and Cons : Hierarchical Clustering
• Pros o No assumption of a particular number of clusters
o May correspond to meaningful taxonomies
• Cons o Once a decision is made to combine two clusters, it can’t be undone
o Too slow for large data sets, O(𝑛2 log(𝑛))
References i. https://spin.atomicobject.com/2015/05/26/mean-
shift-clustering/
ii. http://vision.stanford.edu/teaching/cs131_fall1314
_nope/lectures/lecture13_kmeans_cs131.pdf
iii. http://84.89.132.1/~michael/stanford/maeb7.pdf
Thank you!