Clustering with k-means and mixture of Gaussian densities
Jakob Verbeek, December 3, 2010
Course website: http://lear.inrialpes.fr/~verbeek/MLCR.10.11.php
Plan for the course
• Session 1, October 1 2010
  – Cordelia Schmid: Introduction
  – Jakob Verbeek: Introduction to Machine Learning
• Session 2, December 3 2010
  – Jakob Verbeek: Clustering with k-means, mixture of Gaussians
  – Cordelia Schmid: Local invariant features
  – Student presentation 1: Scale and affine invariant interest point detectors, Mikolajczyk and Schmid, IJCV 2004.
• Session 3, December 10 2010
  – Cordelia Schmid: Instance-level recognition: efficient search
  – Student presentation 2: Scalable Recognition with a Vocabulary Tree, Nister and Stewenius, CVPR 2006.
• Session 4, December 17 2010
  – Jakob Verbeek: Mixture of Gaussians, EM algorithm, Fisher Vector image representation
  – Cordelia Schmid: Bag-of-features models for category-level classification
  – Student presentation 3: Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, Lazebnik, Schmid and Ponce, CVPR 2006.
• Session 5, January 7 2011
  – Jakob Verbeek: Classification 1: generative and non-parametric methods
  – Student presentation 4: Large-Scale Image Retrieval with Compressed Fisher Vectors, Perronnin, Liu, Sanchez and Poirier, CVPR 2010.
  – Cordelia Schmid: Category level localization: sliding window and shape model
  – Student presentation 5: Object Detection with Discriminatively Trained Part Based Models, Felzenszwalb, Girshick, McAllester and Ramanan, PAMI 2010.
• Session 6, January 14 2011
  – Jakob Verbeek: Classification 2: discriminative models
  – Student presentation 6: TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation, Guillaumin, Mensink, Verbeek and Schmid, ICCV 2009.
  – Student presentation 7: IM2GPS: estimating geographic information from a single image, Hays and Efros, CVPR 2008.
Clustering
• Finding a group structure in the data
  – Data in one cluster similar to each other
  – Data in different clusters dissimilar
• Map each data point to a discrete cluster index
  – "flat" methods find k groups (k known, or automatically set)
  – "hierarchical" methods define a tree structure over the data
Hierarchical clustering
• Data set is partitioned into a tree structure
• Top-down construction
  – Start with all data in one cluster: the root node
  – Apply "flat" clustering into k groups
  – Recursively cluster the data in each group
• Bottom-up construction (see the sketch below)
  – Start with each point in a separate cluster
  – Recursively merge the "closest" clusters
  – Distance between clusters A and B
    • Min, max, or mean distance between x in A and y in B
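The bottom-up (agglomerative) variant is available in SciPy, where the linkage method corresponds directly to the min/max/mean cluster distances above; a small illustrative sketch on synthetic data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.random((50, 2))

# Bottom-up merging of the closest clusters; the linkage method defines the
# cluster distance: 'single' = min, 'complete' = max, 'average' = mean.
Z = linkage(X, method='average')

# Cut the tree to obtain a "flat" clustering with 4 groups.
labels = fcluster(Z, t=4, criterion='maxclust')
print(labels[:10])
```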
Clustering example
• Learn face similarity from training pairs labeled as same/different
• Cluster faces based on identity
• Example: Picasa web albums, label face clusters
[Guillaumin, Verbeek, Schmid, ICCV 2009]
Clustering example: visual words
[Figure: example images and visual words for the categories Airplanes, Motorbikes, Faces, Wild Cats, Leaves, People, Bikes]
Clustering for visual vocabulary construction
• Clustering of local image descriptors
  – Most often done using k-means or mixture of Gaussians
  – Divides the space of region descriptors into a collection of non-overlapping cells
• Recap of the image representation pipeline
  – Extract image regions at different locations and scales: randomly, on a regular grid, or using an interest point detector
  – Compute a descriptor for each region (e.g., SIFT)
  – Assign each descriptor to a cluster center (see the sketch below)
    • Or do "soft assignment" or "multiple assignment"
  – Make a histogram for the complete image
    • Possibly separate histograms for different image regions
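A minimal sketch of the descriptor-assignment and histogram steps of this pipeline, assuming the local descriptors and the cluster centers (the visual vocabulary) are already available as NumPy arrays; all names and sizes below are illustrative:

```python
import numpy as np

def bow_histogram(descriptors, centers):
    """Hard-assign each local descriptor to its nearest cluster center
    and return the normalized histogram of visual-word counts."""
    # Squared Euclidean distances between all descriptors and all centers.
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)                        # index of the closest center
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()                         # L1-normalized histogram

# Example: 500 SIFT-like 128-d descriptors, vocabulary of 100 visual words.
rng = np.random.default_rng(0)
descriptors = rng.random((500, 128))
vocabulary = rng.random((100, 128))
print(bow_histogram(descriptors, vocabulary).shape)  # (100,)
```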
Definition of k-means clustering
• Given: data set of N points x_n, n = 1,…,N
• Goal: find K cluster centers m_k, k = 1,…,K
• Clustering: assignment of data points to cluster centers
  – Binary indicator variables: r_nk = 1 if x_n is assigned to m_k, 0 otherwise
• Error criterion: sum of squared distances between each data point and assigned cluster center
E(\{m_k\}_{k=1}^{K}) = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \| x_n - m_k \|^2
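As a concrete illustration of this criterion, the error can be evaluated in a few lines of NumPy; the data here are made up:

```python
import numpy as np

def kmeans_error(X, centers, assignments):
    """Sum of squared distances between each point and its assigned center.
    X: (N, d) data, centers: (K, d), assignments: (N,) with values in 0..K-1."""
    return np.sum((X - centers[assignments]) ** 2)

rng = np.random.default_rng(1)
X = rng.random((200, 2))
centers = rng.random((5, 2))
# Assign each point to its closest center, then evaluate E({m_k}).
d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
print(kmeans_error(X, centers, d2.argmin(axis=1)))
```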
Examples of k-means clustering
• Data uniformly sampled in unit square, running k-means with 5, 10, 15, 20 and 25 centers
Minimizing the error function
• Goal: find centers m_k and assignments r_nk that minimize the error function
• An iterative algorithm
  1) Initialize cluster centers, somehow
  2) Update assignments r_nk for fixed m_k
  3) Update centers m_k for fixed assignments r_nk
  4) If cluster centers changed: return to step 2
  5) Return cluster centers
• Iterations monotonically decrease the error function
E(\{m_k\}_{k=1}^{K}) = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \| x_n - m_k \|^2
Examples of k-means clustering
• Several iterations with two centers
[Figure: assignments and centers over several iterations, with the corresponding error function values]
Minimizing the error function
• Update assignments r_nk for fixed m_k
  – Decouples over the data points
  – Only one r_nk = 1, rest zero
  – Assign to closest center
• Update centers m_k for fixed assignments r_nk
  – Decouples over the centers
  – Set derivative to zero
  – Put center at mean of assigned data points
Error function:

E(\{m_k\}_{k=1}^{K}) = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \| x_n - m_k \|^2

Assignment update, for fixed centers: each data point independently minimizes

\sum_{k=1}^{K} r_{nk}\, \| x_n - m_k \|^2

which is achieved by setting r_nk = 1 for the closest center.

Center update, for fixed assignments: each center independently minimizes

\sum_{n=1}^{N} r_{nk}\, \| x_n - m_k \|^2

\frac{\partial E}{\partial m_k} = -2 \sum_{n=1}^{N} r_{nk}\,(x_n - m_k) = 0
\quad\Rightarrow\quad
m_k = \frac{\sum_{n=1}^{N} r_{nk}\, x_n}{\sum_{n=1}^{N} r_{nk}}
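Putting the two updates together gives the iterative algorithm from the previous slide; the following NumPy sketch is only illustrative (it keeps a center unchanged if its cluster becomes empty, one of several possible choices):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain k-means: alternate assignment and center updates until
    the assignments no longer change. X: (N, d) data, K: number of centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)].astype(float)  # init at data points
    assignments = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assignment step: r_nk = 1 for the closest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_assignments = d2.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):
            break                                    # assignments unchanged: converged
        assignments = new_assignments
        # Center step: m_k = mean of the points assigned to cluster k.
        for k in range(K):
            members = X[assignments == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    return centers, assignments
```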
Minimizing the error function
• Goal: find centers m_k and assignments r_nk that minimize the error function
• An iterative algorithm
  1) Initialize cluster centers, somehow
  2) Update assignments r_nk for fixed m_k
  3) Update centers m_k for fixed assignments r_nk
  4) If cluster centers changed: return to step 2
  5) Return cluster centers
• Iterations monotonically decrease the error function
  – Both steps reduce the error function
  – Only a finite number of possible assignments
E(\{m_k\}_{k=1}^{K}) = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \| x_n - m_k \|^2
Examples of k-means clustering
• Several iterations with two centers
[Figure: assignments and centers over several iterations, with the corresponding error function values]
Examples of k-means clustering
• Solutions for different initializations
Examples of k-means clustering
• Clustering RGB vectors of pixels in images
• Compression of the image file (original: N × 3 × 8 bits)
  – Store RGB values of the cluster centers: K × 24 bits
  – Store the cluster index of each pixel: N × log2 K bits
[Figure: reconstructed images with compressed file sizes of 4.2%, 8.3%, and 16.7% of the original]
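A hedged sketch of this compression scheme, using scikit-learn's k-means on the pixel RGB vectors; the "pixels" here are random stand-in data. For large N the size ratio approaches log2(K)/24, so 4.2%, 8.3% and 16.7% correspond roughly to K = 2, 4 and 16.

```python
import numpy as np
from sklearn.cluster import KMeans  # any k-means implementation works here

def quantize_image(pixels, K):
    """Cluster the RGB vectors of an image and return the quantized pixels
    plus the compressed size relative to storing 24 bits per pixel.
    pixels: (N, 3) array of RGB values."""
    km = KMeans(n_clusters=K, n_init=10).fit(pixels)
    quantized = km.cluster_centers_[km.labels_]       # replace each pixel by its center
    # Original: N * 24 bits; compressed: K * 24 bits + N * log2(K) bits for the indices.
    N = len(pixels)
    ratio = (K * 24 + N * np.log2(K)) / (N * 24)
    return quantized, ratio

# Illustrative example on random "pixels"; with K = 4 the index costs
# log2(4) = 2 bits per pixel, so the ratio approaches 2/24 ~ 8.3%.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(10000, 3)).astype(float)
_, ratio = quantize_image(pixels, K=4)
print(f"compressed size = {100 * ratio:.1f}% of original")
```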
Clustering with Gaussian mixture density
• Each cluster represented by a Gaussian density
  – Center, as in k-means
  – Covariance matrix: cluster spread around the center
p(x) = N(x; m, C) = (2\pi)^{-d/2}\, |C|^{-1/2} \exp\!\left( -\tfrac{1}{2} (x - m)^T C^{-1} (x - m) \right)

where d is the data dimension, |C| is the determinant of the covariance matrix C, and the exponent is a quadratic function of the point x and the mean m.
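The density can be evaluated directly from this formula; the sketch below also checks it against scipy.stats.multivariate_normal, using illustrative parameter values:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, m, C):
    """N(x; m, C) for a single point x of dimension d."""
    d = len(m)
    diff = x - m
    quad = diff @ np.linalg.solve(C, diff)            # (x-m)^T C^{-1} (x-m)
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(C) ** (-0.5)
    return norm * np.exp(-0.5 * quad)

m = np.array([0.0, 1.0])
C = np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([0.5, 0.5])
print(gaussian_pdf(x, m, C))
print(multivariate_normal(mean=m, cov=C).pdf(x))      # same value
```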
Clustering with Gaussian mixture density
• Mixture density is a weighted sum of Gaussians
– Mixing weight: importance of each cluster
• Density has to integrate to 1, so we require
p(x) = \sum_{k=1}^{K} \pi_k\, N(x; m_k, C_k), \qquad \pi_k \ge 0, \qquad \sum_{k=1}^{K} \pi_k = 1
Clustering with Gaussian mixture density
• Given: data set of N points x_n, n = 1,…,N
• Find the mixture of Gaussians (MoG) that best explains the data
  – Assigns maximum likelihood to the data
  – Assume data points are drawn independently from the MoG
  – Maximize the log-likelihood of the fixed data set X w.r.t. the parameters of the MoG
• As with k-means, the objective function has local minima
  – Can use the Expectation-Maximization (EM) algorithm
  – Similar to the iterative k-means algorithm
p(X) = \prod_{n=1}^{N} p(x_n) = \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k\, N(x_n; m_k, C_k)

\theta = \{\pi_k, m_k, C_k\}_{k=1}^{K}

L(\theta) = \log p(X) = \sum_{n=1}^{N} \log p(x_n) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k\, N(x_n; m_k, C_k)
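A sketch of evaluating this log-likelihood; the sum over components is computed in the log domain with logsumexp for numerical stability, and all parameter values below are made up:

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def mog_log_likelihood(X, pis, means, covs):
    """log p(X) = sum_n log sum_k pi_k N(x_n; m_k, C_k).
    X: (N, d); pis: (K,); means: (K, d); covs: (K, d, d)."""
    # log_p[n, k] = log pi_k + log N(x_n; m_k, C_k)
    log_p = np.stack([np.log(pi) + multivariate_normal(m, C).logpdf(X)
                      for pi, m, C in zip(pis, means, covs)], axis=1)
    return logsumexp(log_p, axis=1).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
pis = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [2.0, 2.0]])
covs = np.array([np.eye(2), np.eye(2)])
print(mog_log_likelihood(X, pis, means, covs))
```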
Assignment of data points to clusters
• As with k-means, z_n indicates the cluster index for x_n
• To sample a point from the MoG:
  – Select cluster index k with probability given by the mixing weight
  – Sample the point from the k-th Gaussian
  – The MoG is recovered if we marginalize over the unknown cluster index
p(z_n = k) = \pi_k, \qquad p(x_n \mid z_n = k) = N(x_n; m_k, C_k)

p(x_n) = \sum_{k=1}^{K} p(z_n = k)\, p(x_n \mid z_n = k) = \sum_{k=1}^{K} \pi_k\, N(x_n; m_k, C_k)
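This generative view translates directly into a sampling procedure; a minimal NumPy sketch with illustrative parameters:

```python
import numpy as np

def sample_mog(n_samples, pis, means, covs, seed=0):
    """Draw samples from a mixture of Gaussians: first pick a component
    index z with probability pi_z, then sample from that Gaussian."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pis), size=n_samples, p=pis)         # component indices
    samples = np.array([rng.multivariate_normal(means[k], covs[k]) for k in z])
    return samples, z

pis = np.array([0.7, 0.3])
means = np.array([[0.0, 0.0], [3.0, 3.0]])
covs = np.array([np.eye(2), 0.5 * np.eye(2)])
X, z = sample_mog(500, pis, means, covs)
print(X.shape, np.bincount(z) / len(z))   # fractions roughly [0.7, 0.3]
```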
Soft assignment of data points to clusters
• Given data point x_n, infer the value of z_n
  – Conditional probability of z_n given x_n:
p(z_n = k \mid x_n) = \frac{p(z_n = k)\, p(x_n \mid z_n = k)}{p(x_n)} = \frac{\pi_k\, N(x_n; m_k, C_k)}{\sum_{s=1}^{K} \pi_s\, N(x_n; m_s, C_s)}
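A sketch of computing these posteriors (often called responsibilities) for all points at once, again working in the log domain to avoid underflow; shapes and names are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def responsibilities(X, pis, means, covs):
    """p(z_n = k | x_n) for all n and k; each row sums to one.
    X: (N, d); pis: (K,); means: (K, d); covs: (K, d, d)."""
    log_p = np.stack([np.log(pi) + multivariate_normal(m, C).logpdf(X)
                      for pi, m, C in zip(pis, means, covs)], axis=1)
    # Normalizing in the log domain gives a soft version of the k-means r_nk.
    return np.exp(log_p - logsumexp(log_p, axis=1, keepdims=True))
```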
Maximum likelihood estimation of a Gaussian
• Given data points x_n, n = 1,…,N
• Find the Gaussian that maximizes the data log-likelihood
• Set derivative of data log-likelihood w.r.t. parameters to zero
• Parameters set as data covariance and mean
L = \log p(X) = \sum_{n=1}^{N} \log N(x_n; m, C) = -\frac{Nd}{2}\log(2\pi) - \frac{N}{2}\log|C| - \frac{1}{2}\sum_{n=1}^{N} (x_n - m)^T C^{-1} (x_n - m)

\frac{\partial L}{\partial m} = \sum_{n=1}^{N} C^{-1}(x_n - m) = 0
\quad\Rightarrow\quad
m = \frac{1}{N}\sum_{n=1}^{N} x_n

\frac{\partial L}{\partial C^{-1}} = \frac{N}{2}\, C - \frac{1}{2}\sum_{n=1}^{N} (x_n - m)(x_n - m)^T = 0
\quad\Rightarrow\quad
C = \frac{1}{N}\sum_{n=1}^{N} (x_n - m)(x_n - m)^T
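These closed-form estimates are simply the sample mean and the (biased, 1/N) sample covariance; a quick check on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
true_m = np.array([1.0, -2.0])
true_C = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(true_m, true_C, size=5000)

m_hat = X.mean(axis=0)                        # m = (1/N) sum_n x_n
diff = X - m_hat
C_hat = diff.T @ diff / len(X)                # C = (1/N) sum_n (x_n - m)(x_n - m)^T

print(m_hat)     # close to true_m
print(C_hat)     # close to true_C (note: 1/N, not the unbiased 1/(N-1))
```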
Maximum likelihood estimation of a MoG
• Use the EM algorithm
  – Initialize the MoG: parameters or soft-assignments
  – E-step: soft-assign data points to clusters
  – M-step: update the cluster parameters
  – Repeat EM steps, terminate if converged
• Convergence of parameters or assignments
• E-step: compute posterior on z given x:
• M-step: update Gaussians from data points weighted by posterior
E-step:

q_{nk} = p(z_n = k \mid x_n) = \frac{\pi_k\, N(x_n; m_k, C_k)}{\sum_{s=1}^{K} \pi_s\, N(x_n; m_s, C_s)}

M-step:

\pi_k = \frac{\sum_{n=1}^{N} q_{nk}}{\sum_{s=1}^{K}\sum_{n=1}^{N} q_{ns}} = \frac{1}{N}\sum_{n=1}^{N} q_{nk}

m_k = \frac{\sum_{n=1}^{N} q_{nk}\, x_n}{\sum_{n=1}^{N} q_{nk}} = \frac{1}{N\pi_k}\sum_{n=1}^{N} q_{nk}\, x_n

C_k = \frac{1}{N\pi_k}\sum_{n=1}^{N} q_{nk}\, (x_n - m_k)(x_n - m_k)^T
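A compact NumPy sketch of these EM updates, assuming initialization of the means at randomly chosen data points and of the covariances at the data covariance; this is only an illustrative implementation, not the one used for the course figures:

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def em_mog(X, K, n_iter=100, tol=1e-6, seed=0):
    """Fit a mixture of K Gaussians to X (N, d) by EM; returns the parameters."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pis = np.full(K, 1.0 / K)
    means = X[rng.choice(N, K, replace=False)].astype(float)
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities q_nk = p(z_n = k | x_n), computed in the log domain.
        log_p = np.stack([np.log(pi) + multivariate_normal(m, C).logpdf(X)
                          for pi, m, C in zip(pis, means, covs)], axis=1)
        ll = logsumexp(log_p, axis=1).sum()
        q = np.exp(log_p - logsumexp(log_p, axis=1, keepdims=True))
        # M-step: update mixing weights, means and covariances.
        Nk = q.sum(axis=0)                              # effective counts N * pi_k
        pis = Nk / N
        means = (q.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - means[k]
            covs[k] = (q[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        if ll - prev_ll < tol:                          # log-likelihood converged
            break
        prev_ll = ll
    return pis, means, covs

# Illustrative usage on two well-separated synthetic clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 1, (200, 2))])
pis, means, covs = em_mog(X, K=2)
print(pis, means, sep="\n")   # weights near 0.5, means near (0,0) and (4,4)
```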
Maximum likelihood estimation of MoG
• Example of several EM iterations
Clustering with k-means and MoG
• Hard assignment in k-means is not robust near the border of quantization cells
• Soft assignment in MoG accounts for ambiguity in the assignment
• Both algorithms are sensitive to initialization
  – Run from several initializations (see the sketch below)
  – Keep the best result
• The number of clusters needs to be set
• Both algorithms can be generalized to other types of distances or densities
Images from [van Gemert et al., IEEE TPAMI, 2010]
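One common way to deal with the initialization sensitivity mentioned above is to run from several random initializations and keep the run with the lowest error (k-means) or the highest log-likelihood (MoG). For k-means, scikit-learn does this internally via n_init; a brief illustrative sketch:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((1000, 2))

# Run k-means from 20 random initializations and keep the run with the
# lowest within-cluster sum of squares (the error function above).
best = KMeans(n_clusters=10, n_init=20, random_state=0).fit(X)
print(best.inertia_)          # error of the best of the 20 runs
```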
Further reading material
• Paper by Radford Neal & Geoffrey Hinton, "A view of the EM algorithm that justifies incremental, sparse, and other variants", in "Learning in Graphical Models", 1998. (available online)
• Chapter 9 of the book "Pattern Recognition and Machine Learning" by Chris Bishop (Springer, 2006)