CLUSTERING - Alexandru Ioan Cuza University (ciortuz/SLIDES/2015/cluster.pdf)

Page 1:

CLUSTERING

Based on

“Foundations of Statistical NLP”, C. Manning & H. Schütze, MIT Press, 2002, ch. 14

and “Machine Learning”, T. Mitchell, McGraw-Hill, 1997, ch. 6.12

0.

Page 2:

Plan

1. Introduction to clustering

◦ Clustering vs Classification

◦ Hierarchical vs non-hierarchical clustering

◦ Soft vs hard assignments in clustering

2. Hierarchical clustering

• Bottom-up (agglomerative) clustering

• Top-down (divisive) clustering

◦ Similarity functions in clustering:
single link, complete link, group average

3. Non-hierarchical clustering

• the k-means clustering algorithm

• the EM algorithm for Gaussian Mixture Modelling (estimating the means of k Gaussians)

1.

Page 3:

1 Introduction to clustering

Clustering vs Classification

Classification = supervised learning, i.e. we need a set of labeled training instances for each group/class.

Clustering = unsupervised learning, because there is no teacher who provides the examples in the training set with class labels. It assumes no pre-existing categorization scheme; the clusters are induced from data.

2.

Page 4:

• Clustering: partition a set of objects into groups/clusters.

• The goal: place objects which are similar (according to a certain similarity measure) in the same group, and assign dissimilar objects to different groups.

• Objects are usually described and clustered using a set of features and values (often known as the data representation model).

3.

Page 5:

Hierarchical vs Non-hierarchical Clustering

Hierarchical Clustering produces a tree of groups/clusters, each node being a subgroup of its mother.

Non-hierarchical Clustering (or, flat clustering): the relation between clusters is often left undetermined.

Most non-hierarchical clustering algorithms are iterative.

They start with a set of initial clusters and then iteratively improve them using a reallocation scheme.

4.

Page 6:

An Example of Hierarchical Clustering:

A Dendrogram showing a clustering of 22 high frequency words from the Brown corpus

[Dendrogram over the 22 words, with leaves (left to right): was, is, as, to, of, from, at, for, with, on, in, but, and, a, his, the, this, it, I, he, not, be]

5.

Page 7:

The Dendrogram Commented

• Similarity in this case is based on the left and right context of words. (Firth: “one can characterize a word by the words that occur around it”.)

◦ For instance:

he, I, it, this have more in common with each other than they have with and, but;

in, on have a greater similarity than he, I.

• Each node in the tree represents a cluster that was created by merging two child nodes.

• The height of a connection corresponds to the apparent (dis)similarity between the nodes at the bottom of the diagram.

6.

Page 8:

Exemplifying the Main Uses of Clustering (I)

Generalisation

We want to figure out the correct preposition to use with the noun Friday when translating a text from French into English.

The days of the week get put in the same cluster by a clustering algorithm which measures similarity of words based on their contexts.

Under the assumption that an environment that is correct for one member of the cluster is also correct for the other members, we can infer the correctness of on Friday from the presence (in the given corpus) of on Sunday, on Monday.

7.

Page 9:

Main Uses of Clustering (II)

Exploratory Data Analysis (EDA)

Any technique that lets one better visualise the data is likely to

− bring to the fore new generalisations, and

− stop one from making wrong assumptions about data.

This is a ‘must’ for domains like Statistical Natural Language Processing and Biological Sequence Analysis.

8.

Page 10:

2 Hierarchical Clustering

Bottom-up (Agglomerative) Clustering:

Form all possible singleton clusters (each containing a single object).

Greedily combine clusters with “maximum similarity” (or “minimum distance”) together into a new cluster.

Continue until all objects are contained in a single cluster.

Top-down (Divisive) Clustering:

Start with a cluster containing all objects. Greedily split the cluster into two, assigning objects to clusters so as to maximize the within-group similarity. Continue splitting the least coherent clusters until either only singleton clusters remain or the desired number of clusters is reached.

9.

Page 11:

The Bottom-up Hierarchical Clustering Algorithm

Given: a set X = {x1, . . . , xn} of objects
       a function sim: P(X) × P(X) → R

for i = 1, n do
    ci = {xi}
end
C = {c1, . . . , cn}
j = n + 1
while |C| > 1
    (cn1, cn2) = argmax_{(cu,cv)∈C×C} sim(cu, cv)
    cj = cn1 ∪ cn2
    C = C \ {cn1, cn2} ∪ {cj}
    j = j + 1
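Below is a minimal Python sketch of this bottom-up procedure. The 2-D points, the single-link cluster similarity and the pairwise similarity 1/(1 + d) (see the next slide) are illustrative assumptions, not part of the slides.

from itertools import combinations
from math import dist

def sim_points(x, y):
    # pairwise similarity derived from a distance: 1 / (1 + d(x, y))
    return 1.0 / (1.0 + dist(x, y))

def sim_single_link(cu, cv):
    # cluster similarity given by the two most similar members
    return max(sim_points(x, y) for x in cu for y in cv)

def agglomerative(X, sim=sim_single_link):
    C = [frozenset([x]) for x in X]          # start from singleton clusters
    merges = []                              # record of the dendrogram
    while len(C) > 1:
        cu, cv = max(combinations(C, 2), key=lambda pair: sim(*pair))
        cj = cu | cv                         # merge the two most similar clusters
        C = [c for c in C if c not in (cu, cv)] + [cj]
        merges.append((set(cu), set(cv), set(cj)))
    return merges

points = [(0, 0), (0, 1), (4, 0), (4, 1), (9, 5)]
for cu, cv, cj in agglomerative(points):
    print(cu, "+", cv, "->", cj)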

10.

Page 12:

Bottom-up Hierarchical Clustering:

Further Comments

• In general, if d is a distance measure, then one can take

sim(x, y) = 1 / (1 + d(x, y))

• Monotonicity of the similarity function:

The operation of merging must not increase the similarity:

∀c, c′, c′′ : min(sim(c, c′), sim(c, c′′)) ≥ sim(c, c′ ∪ c′′).

11.

Page 13:

The Top-down Hierarchical Clustering Algorithm

Given: a set X = {x1, . . . , xn} of objects
       a function coh: P(X) → R
       a function split: P(X) → P(X) × P(X)

C = {X} (= {c1})
j = 1
while ∃ci ∈ C such that |ci| > 1
    cu = argmin_{cv∈C} coh(cv)
    (cj+1, cj+2) = split(cu)
    C = C \ {cu} ∪ {cj+1, cj+2}
    j = j + 2

12.

Page 14:

Top-down Hierarchical Clustering:

Further Comments

• Similarity functions (see next slide) can also be used here as coherence.

• To split a cluster into two sub-clusters: any bottom-up or non-hierarchical clustering algorithm can be used;

or, better, use the relative entropy (the Kullback-Leibler (KL) divergence):

D(p || q) = Σ_{x∈X} p(x) log (p(x) / q(x))

where it is assumed that 0 log(0/q) = 0, and p log(p/0) = ∞.
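As a small illustration, here is a sketch of the KL divergence above in Python, assuming p and q are given as dictionaries mapping events to probabilities (an assumed representation):

from math import log, inf

def kl_divergence(p, q):
    d = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue                 # convention: 0 log(0/q) = 0
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return inf               # convention: p log(p/0) = infinity
        d += px * log(px / qx)
    return d

print(kl_divergence({"a": 0.5, "b": 0.5}, {"a": 0.9, "b": 0.1}))  # ~ 0.51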

13.

Page 15:

Classes of Similarity Functions

• single link: similarity of two clusters considered for merging is determined by the two most similar members of the two clusters

• complete link: similarity of two clusters is determined by the two least similar members of the two clusters

• group average: similarity is determined by the average similarity between all members of the clusters considered.
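The three criteria can be written as small Python functions, usable for instance as the "sim" argument of the agglomerative sketch on page 11. The pairwise similarity 1/(1 + Euclidean distance) is an assumed choice, and the group average is taken here over cross-cluster pairs only:

from math import dist

def sim_points(x, y):
    return 1.0 / (1.0 + dist(x, y))

def single_link(cu, cv):
    return max(sim_points(x, y) for x in cu for y in cv)    # two most similar members

def complete_link(cu, cv):
    return min(sim_points(x, y) for x in cu for y in cv)    # two least similar members

def group_average(cu, cv):
    return sum(sim_points(x, y) for x in cu for y in cv) / (len(cu) * len(cv))

cu, cv = [(0, 0), (0, 1)], [(3, 0), (8, 0)]
print(single_link(cu, cv), complete_link(cu, cv), group_average(cu, cv))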

14.

Page 16:

[Four plots over the same eight points in the plane: a set of points in a plane; the first step in single/complete clustering; single-link clustering; complete-link clustering.]

15.

Page 17:

Single-link vs Complete-link Clustering:

Pros and Cons

Single-link Clustering:

• good local coherence, since the similarity function is locally defined

• can produce elongated clusters (“the chaining effect”)

• Closely related to the Minimum Spanning Tree (MST) of a set of points. (Of all trees connecting the set of objects, the sum of the edges of the MST is minimal.)

• In graph theory, it corresponds to finding a maximally connected graph. Complexity: O(n²).

Complete-link Clustering:

• The focus is on the global cluster quality.

• In graph theory, it corresponds to finding a clique (maximally complete subgraph) of a given graph. Complexity: O(n³).

16.

Page 18:

Group-average Agglomerative Clustering

The criterion for merges: average similarity, which in some cases can be efficiently computed, implying O(n²). For example, one can take

sim(x, y) = cos(x, y) = (x · y) / (|x| |y|) = Σ_{i=1}^{m} xi yi

with x, y being length-normalised, i.e., |x| = |y| = 1.

Therefore, it is a good compromise between single-link and complete-link clustering.

17.

Page 19:

Group-average Agglomerative Clustering: Computation

Let X ⊆ R^m be the set of objects to be clustered.

The average similarity of a cluster cj is:

S(cj) = (1 / (|cj| (|cj| − 1))) Σ_{x∈cj} Σ_{y∈cj, y≠x} sim(x, y)

Considering s(cj) = Σ_{x∈cj} x and assuming |x| = 1, then:

s(cj) · s(cj) = Σ_{x∈cj} Σ_{y∈cj} x · y = |cj| (|cj| − 1) S(cj) + Σ_{x∈cj} x · x = |cj| (|cj| − 1) S(cj) + |cj|

Therefore: S(cj) = (s(cj) · s(cj) − |cj|) / (|cj| (|cj| − 1))

and

S(ci ∪ cj) = ((s(ci) + s(cj)) · (s(ci) + s(cj)) − (|ci| + |cj|)) / ((|ci| + |cj|) (|ci| + |cj| − 1))

and s(ci ∪ cj) = s(ci) + s(cj),

which requires constant time to compute.
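A small NumPy sketch of this bookkeeping (NumPy and the random unit-length test vectors are assumptions of the sketch, not part of the slides):

import numpy as np

def average_sim(s, n):
    # S(c) from above: (s(c) · s(c) - |c|) / (|c| (|c| - 1))
    return (s @ s - n) / (n * (n - 1))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # enforce |x| = 1

ci, cj = X[:3], X[3:]
s_i, s_j = ci.sum(axis=0), cj.sum(axis=0)         # s(ci), s(cj)

# s(ci ∪ cj) = s(ci) + s(cj), so S(ci ∪ cj) is obtained in constant time
print(average_sim(s_i + s_j, len(ci) + len(cj)))

# brute-force check: average x · y over all ordered pairs of distinct points
merged = np.vstack([ci, cj])
pairs = [x @ y for a, x in enumerate(merged) for b, y in enumerate(merged) if a != b]
print(np.mean(pairs))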

18.

Page 20:

Application of Hierarchical Clustering:

Improving Language Modeling

[Brown et al., 1992], [Manning & Schütze, 2002], pages 509–512

Using cross-entropy (−(1/N) log P(w1, . . . , wN)) and bottom-up clustering, Brown obtained a cluster-based language model which didn't prove better than the word-based model. But the linear interpolation of the two models was better than both!

Example of 3 clusters obtained by Brown:

- plan, letter, request, memo, case, question, charge, statement, draft
- day, year, week, month, quarter, half
- evaluation, assessment, analysis, understanding, opinion, conversation, discussion

Note that the words in these clusters have similar syntactic and semantic properties.

19.

Page 21:

Soft vs Hard Assignments in Clustering

Hard assignment: each object is assigned to one and only one cluster. This is the typical choice for hierarchical clustering.

Soft assignment: allows degrees of membership, and membership in multiple clusters.

In a vector space model, the centroid (or, center of gravity) of each cluster c is

µ = (1 / |c|) Σ_{x∈c} x

and the degree of membership of x in multiple clusters can be (for instance) the distance between x and µ.

Non-hierarchical clustering works with both hard assignments and soft assignments.

20.

Page 22:

3 Non-hierarchical Clustering

As already mentioned, start with an initial set of seeds (one seed for each cluster), then iteratively refine it.

The initial centers for clusters can be computed by applying a hierarchical clustering algorithm on a subset of the objects to be clustered (especially in the case of ill-behaved sets).

Stopping criteria (examples):

− group-average similarity

− the likelihood of data, given the clusters

− the Minimum Description Length (MDL) principle

− mutual information between adjacent clusters

− ...

21.

Page 23:

An Example of Non-hierarchical Clustering:

3.1 The k-Means Algorithm

[S. P. Lloyd, 1957]

Given a set X = {x1, . . . , xn} ⊆ R^m,
a distance measure d on R^m,
a function for computing the mean µ : P(R^m) → R^m,

build k clusters so as to satisfy a certain (“stopping”) criterion (e.g., maximization of group-average similarity).

Procedure:

Select (arbitrarily) k initial centers f1, . . . , fk in R^m;
while the stopping criterion is not satisfied
    for all clusters cj do cj = {xi | ∀fl d(xi, fj) ≤ d(xi, fl)} end
    for all means fj do fj ← µ(cj) end
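A minimal NumPy sketch of this procedure; the Euclidean distance, the random choice of initial centers and the stopping criterion "assignments no longer change" are assumed choices for the illustration:

import numpy as np

def k_means(X, k, rng=np.random.default_rng(0)):
    centers = X[rng.choice(len(X), size=k, replace=False)]   # arbitrary initial centers
    assign = None
    while True:
        # assignment step: each xi goes to the cluster with the nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assign = d.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            return centers, assign
        assign = new_assign
        # recomputation step: each center becomes the mean of its cluster
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)

X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.], [9., 0.]])
centers, assign = k_means(X, k=2)
print(centers, assign)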

22.

Page 24:

Illustrating the k-Means Clustering Algorithm

[Figure: two cluster means c1 and c2 before and after one k-means iteration: assignment of the points to the nearest mean, then recomputation of the means.]

23.

Page 25:

Remark

The k-means algorithm can be used to solve, employing hard assignments of objects to clusters, the following clustering problem:

24.

Page 26:

3.2 Gaussian Mixture Modeling

3.2.1 The Problem

Given

• D, a set of instances from X generated by a mixture of k Gaussian distributions;

• the unknown means 〈µ1, . . . , µk〉 of the k Gaussians;

◦ to simplify the presentation, all Gaussians are assumed to have the same variance σ², and they are selected with equal probability;

• we don’t know which xi was generated by which Gaussian;

determine

• h, the ML estimates of 〈µ1, . . . , µk〉, i.e. argmax_h P(D | h).

25.

Page 27:

Generating Data from a Mixture of k Gaussians

[Figure: the density p(x) of a mixture of Gaussians, plotted against x.]

Each instance x is obtained by

1. Choosing one of the k Gaussians (all having the same variance σ²) with, for simplicity, uniform probability;

2. Randomly generating an instance according to that Gaussian.

26.

Page 28:

Notations

For the previously given example (k = 2), we can think of the full description of each instance as

yi = 〈xi, zi1, zi2〉, where

• xi is observable, zij is unobservable

• zij is 1 if xi was generated by the jth Gaussian and 0 otherwise

27.

Page 29:

Remark

For k = 1 there will be no unobservable variables. We have already shown (see the Bayesian Learning chapter, the ML hypothesis section) that the ML hypothesis is the one that minimizes the sum of squared errors:

µ_ML = argmin_µ Σ_{i=1}^{m} (xi − µ)² = (1/m) Σ_{i=1}^{m} xi

Indeed, it is in this way that the k-means algorithm works towards solving the problem of estimating the means of k Gaussians.

28.

Page 30:

Note

The k-means algorithm finds a local optimum for a [certain] “sum of squares” criterion. While it, too, is unable to find the global optimum, the following algorithm, which uses soft assignments of instances to clusters (the assignment of xi to cluster j is the probability P(zij = 1), with Σ_{j=1}^{k} P(zij = 1) = 1, rather than a hard zij ∈ {0, 1}), may lead to better results, since it uses slower/“softer” changes to the values (and means) of the unknown variables.

[Figure: two cluster means c1 and c2 at the initial state, after iteration 1 and after iteration 2.]

29.

Page 31:

3.2.2 The EM Algorithm for

Gaussian Mixture Modeling

The Idea

EM finds a local maximum of E[ln P(Y | h)], where

• Y is the complete set of (observable plus unobservable) variables/data

• the expected value of ln P(Y | h) is taken over the possible values of the unobserved variables in Y.

30.

Page 32:

EM for GMM: Algorithm Overview

Initial step: Pick at random h′ = 〈µ′1, µ′2, . . . , µ′k〉, then, until a certain condition is met, iterate:

Estimation step: Assuming that the current hypothesis h′ = 〈µ′1, µ′2, . . . , µ′k〉 holds, for each hidden variable zij estimate (in the sense of Maximum Likelihood) the probability associated with its possible values, and calculate the expected value E[zij]:

E[zij] = p(x = xi | µ = µ′j) / Σ_{l=1}^{k} p(x = xi | µ = µ′l)
       = exp(−(1/(2σ²)) (xi − µ′j)²) / Σ_{l=1}^{k} exp(−(1/(2σ²)) (xi − µ′l)²)

Maximization step: Assuming that the value of each hidden variable zij is its own expected value E[zij] as calculated above, choose a new ML hypothesis h′′ = 〈µ′′1, µ′′2, . . . , µ′′k〉 so as to maximize E[ln P(y1, . . . , ym | h)] (see the next slides):

µ′′j ← Σ_{i=1}^{m} E[zij] xi / Σ_{i=1}^{m} E[zij]

Replace h′ = 〈µ′1, µ′2, . . . , µ′k〉 by 〈µ′′1, µ′′2, . . . , µ′′k〉.
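A minimal NumPy sketch of this EM iteration for k Gaussians in one dimension with known, shared variance σ² and uniform mixing weights (the simplifying assumptions stated earlier); NumPy and the synthetic data are assumptions of the sketch:

import numpy as np

def em_gmm_means(x, k, sigma=1.0, n_iter=50, rng=np.random.default_rng(0)):
    mu = rng.choice(x, size=k, replace=False)                 # initial h' = <mu'_1, ..., mu'_k>
    for _ in range(n_iter):
        # Estimation step: E[z_ij] proportional to exp(-(x_i - mu'_j)^2 / (2 sigma^2))
        w = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2))
        E_z = w / w.sum(axis=1, keepdims=True)
        # Maximization step: mu''_j = sum_i E[z_ij] x_i / sum_i E[z_ij]
        mu = (E_z * x[:, None]).sum(axis=0) / E_z.sum(axis=0)
    return mu

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-4, 1, 200), rng.normal(3, 1, 200)])
print(em_gmm_means(data, k=2))   # the estimates should approach the true means -4 and 3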

31.

Page 33:

Calculus for the Expectation Step

E[zij] (by definition) = 0 · P(zij = 0 | xi; h′) + 1 · P(zij = 1 | xi; h′)
                       = P(zij = 1 | xi; h′)
(Bayes' Theorem)       = P(xi | zij = 1; h′) · P(zij = 1 | h′) / Σ_{l=1}^{k} P(xi | zil = 1; h′) · P(zil = 1 | h′)
                       = p(x = xi | µ = µ′j) / Σ_{l=1}^{k} p(x = xi | µ = µ′l)

Note: The a priori probabilities P(zil = 1 | h′) have been assumed to be identical, irrespective of l.

32.

Page 34:

Calculus for the Maximization Step (I)

p(yi | h) = p(xi, zi1, . . . , zik | h) = p(xi | zi1, . . . , zik; h) p(zi1, . . . , zik | h)

          = (1/k) (1/√(2πσ²)) exp(−(1/(2σ²)) Σ_{j=1}^{k} zij (xi − µj)²)

ln P(Y | h) = ln Π_{i=1}^{m} p(yi | h) = Σ_{i=1}^{m} ln p(yi | h)

            = Σ_{i=1}^{m} (− ln k + ln (1/√(2πσ²)) − (1/(2σ²)) Σ_{j=1}^{k} zij (xi − µj)²)

E[ln P(Y | h)] = Σ_{i=1}^{m} (− ln k + ln (1/√(2πσ²)) − (1/(2σ²)) Σ_{j=1}^{k} E[zij] (xi − µj)²)

(The last step uses the linearity of expectation; (xi − µj)² does not depend on the hidden variables.)

33.

Page 35:

Calculus for the Maximization Step (II)

argmax_h E[ln P(Y | h)] = argmax_h Σ_{i=1}^{m} (− ln k + ln (1/√(2πσ²)) − (1/(2σ²)) Σ_{j=1}^{k} E[zij] (xi − µj)²)

= argmin_h Σ_{i=1}^{m} Σ_{j=1}^{k} E[zij] (xi − µj)² = argmin_h Σ_{j=1}^{k} Σ_{i=1}^{m} E[zij] (xi − µj)²

= argmin_h Σ_{j=1}^{k} { (Σ_{i=1}^{m} E[zij]) µj² − 2 (Σ_{i=1}^{m} E[zij] xi) µj + Σ_{i=1}^{m} E[zij] xi² }

Minimizing each quadratic in µj independently gives:

⇒ µ′′j ← Σ_{i=1}^{m} E[zij] xi / Σ_{i=1}^{m} E[zij]

34.

Page 36:

EM for GMM: Justification

It can be shown (Baum et al. 1970) that after each iteration P(Y | h) increases, unless it is a local maximum. Therefore the previously defined EM algorithm

• converges to a (local) maximum likelihood hypothesis h,

• by providing iterative estimates of the hidden variables zij.

35.

Page 37:

Hierarchical vs. Non-hierarchical Clustering:

Pros and Cons

Hierarchical Clustering:

− preferable for detailed data analysis: provides more information than non-hierarchical clustering;

− less efficient than non-hierarchical clustering: one has to compute at least n × n similarity coefficients and then update them during the clustering process.

Non-hierarchical Clustering:

− preferable if data sets are very large, or efficiency is a key issue;

− the k-means algorithm is conceptually the simplest method and should be used first on a new data set (its results are often sufficient);

− k-means (using a simple Euclidean metric) is not usable on “nominal” data like colours. In such cases, use the EM algorithm.

36.

Page 38:

Some proofs

0.

Page 39:

K-means as an optimisation algorithm:

The monotonicity of the J_K criterion

[CMU, 2009 spring, Ziv Bar-Joseph, HW5, pr. 2.1]

1.

Page 40:

The K-means algorithm

Input: x1, . . . , xn ∈ R^d, with n ≥ K.

Output: a certain K-partition of {x1, . . . , xn}.

Procedure:

[Initialization:] t ← 0;
the initial cluster centroids µ^0_1, . . . , µ^0_K are fixed arbitrarily, and each instance xi is assigned to the nearest centroid, thus forming the clusters C^0_1, . . . , C^0_K.

[Recursion:] Iteration ++t is executed:

Step 1: the new positions of the centroids are computed:
µ^t_i = (1 / |C^{t−1}_i|) Σ_{x ∈ C^{t−1}_i} x, for i = 1, . . . , K;

Step 2: each xi is reassigned to [the cluster with] the nearest centroid, i.e. the composition of the clusters at iteration t is established: C^t_1, . . . , C^t_K;

[Termination:] until a certain condition is fulfilled
(for example: until the positions of the centroids, or the composition of the clusters, no longer change from one iteration to the next).

2.

Page 41:

a. Prove that, from one iteration to the next, the K-means algorithm increases the overall cohesion of the clusters. I.e., considering the function

J(C^t, µ^t) = Σ_{i=1}^{n} ||xi − µ^t_{C^t(xi)}||² = Σ_{i=1}^{n} (xi − µ^t_{C^t(xi)}) · (xi − µ^t_{C^t(xi)}),

where:

C^t = (C^t_1, C^t_2, . . . , C^t_K) is the collection of clusters (i.e., the K-partition) at time t,

µ^t = (µ^t_1, µ^t_2, . . . , µ^t_K) is the collection of cluster centroids at time t,

C^t(xi) denotes the cluster to which the element xi is assigned at iteration t,

the operator · denotes the scalar product of vectors in R^d,

show that J(C^t, µ^t) ≥ J(C^{t+1}, µ^{t+1}) for every t.
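A minimal NumPy sketch (1-D data, K = 2 and the particular numbers are assumptions chosen for the illustration) that runs the two steps of the K-means iteration and prints J(C^t, µ^t) at every t, so that the non-increasing behaviour proved below can be observed numerically:

import numpy as np

def J(x, mu, assign):
    # J(C, mu) = sum of squared distances of each point to the centroid of its cluster
    return float(((x - mu[assign]) ** 2).sum())

x = np.array([-9., -8., -7., -6., -5., 5., 6., 7., 8., 9.])
mu = np.array([-9., -8.])                                     # arbitrary initial centroids
assign = np.abs(x[:, None] - mu[None, :]).argmin(axis=1)      # initial assignment

for t in range(4):
    print(f"t = {t}: J = {J(x, mu, assign):.3f}")             # non-increasing sequence
    # Step 1: recompute centroid positions (an empty cluster keeps its centroid)
    mu = np.array([x[assign == i].mean() if np.any(assign == i) else mu[i]
                   for i in range(len(mu))])
    # Step 2: reassign each instance to the nearest centroid
    assign = np.abs(x[:, None] - mu[None, :]).argmin(axis=1)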

3.

Page 42:

The idea of the proof

The inequality above follows from two inequalities (corresponding to steps 1 and 2 of iteration t):

J(C^t, µ^t) ≥(1) J(C^t, µ^{t+1}) ≥(2) J(C^{t+1}, µ^{t+1})

For the first inequality (the one corresponding to step 1) one can consider the parameter C^t fixed and µ variable, while for the second inequality (the one corresponding to step 2) the centroids µ^{t+1} are considered fixed and C variable.

The first inequality can be obtained by summing a series of inequalities, namely one for each cluster C^t_j. The second inequality is proved immediately.

An illustration of this idea, on a particular example:

See the next 3 slides.
[Edinburgh, 2009 fall, C. Williams, V. Lavrenko, HW4, pr. 3]

4.

Page 43:

[Figure: four number-line plots of the same 1-D data set, showing the two centroids at initialisation / iteration 0 and after iterations 1, 2 and 3 (centroid values such as 7/3, 22/7, −7 and 7 appear in the plots).]

5.

Page 44:

For this example of applying the K-means algorithm, we write the numerical expressions for the value of the criterion J2(C^i, µ^i) at each iteration (i = 0, 1, 2, 3).

iter. J2(C^i, µ^i)

0. 0 + {(−9 − (−10))² + . . . + (−5 − (−10))² + 2[(5 − (−10))² + . . . + (9 − (−10))²]} ≥

1. (−9 − (−20))² + {(−8 − 7/3)² + . . . + (−5 − 7/3)² + 2[(5 − 7/3)² + . . . + (9 − 7/3)²]} ≥

2. (−9 − (−9))² + . . . + (−5 − (−9))² + 2[(5 − 22/7)² + . . . + (9 − 22/7)²] ≥

3. (−9 − (−7))² + . . . + (−5 − (−7))² + 2[(5 − 7)² + . . . + (9 − 7)²]

Remark: At first sight, it is hard to prove these inequalities (J2(C^{i−1}, µ^{i−1}) ≥ J2(C^i, µ^i), for i = 1, 2, 3) other than by actually computing the values of the expressions being compared. And yet, by introducing some intermediate terms, the comparisons can be settled in a very elegant way...

6.

Page 45:

iter. J2(C^{i−1}, µ^i) and J2(C^i, µ^i)

0.                                  J2(C^0, µ^0) = 0 + {(−9 − (−10))² + . . . + (−5 − (−10))² + 2[(5 − (−10))² + . . . + (9 − (−10))²]} ≥

1. J2(C^0, µ^1) = 0 + {(−9 − 7/3)² + . . . + (−5 − 7/3)² + 2[(5 − 7/3)² + . . . + (9 − 7/3)²]} ≥
   J2(C^1, µ^1) = (−9 − (−20))² + {(−8 − 7/3)² + . . . + (−5 − 7/3)² + 2[(5 − 7/3)² + . . . + (9 − 7/3)²]} ≥

2. J2(C^1, µ^2) = (−9 − (−9))² + {(−8 − 22/7)² + . . . + (−5 − 22/7)² + 2[(5 − 22/7)² + . . . + (9 − 22/7)²]} ≥
   J2(C^2, µ^2) = (−9 − (−9))² + (−8 − (−9))² + . . . + (−5 − (−9))² + 2[(5 − 22/7)² + . . . + (9 − 22/7)²] ≥

3. J2(C^2, µ^3) = (−9 − (−7))² + . . . + (−5 − (−7))² + 2[(5 − 7)² + . . . + (9 − 7)²] =
   J2(C^3, µ^3) = (−9 − (−7))² + . . . + (−5 − (−7))² + 2[(5 − 7)² + . . . + (9 − 7)²]

Explanations:
1. The inequalities J2(C^{i−1}, µ^i) ≥ J2(C^i, µ^i) (for i = 1, 2, 3, within each row above) are easy to prove, based on the term-by-term correspondence.
2. The remaining inequalities (J2(C^i, µ^i) ≥ J2(C^i, µ^{i+1}), connecting one row to the next) are settled by a simple optimisation argument. For example, it is immediate that the function (−9 − x)² + . . . + (−5 − x)² + 2[(5 − x)² + . . . + (9 − x)²] attains its minimum at x = 7/3, hence J2(C^0, µ^0) ≥ J2(C^0, µ^1).

7.

Page 46:

Proof, for the general case

Remark: For convenience, we restrict ourselves to the case d = 1. Extending the proof to the case d > 1 poses no difficulties.

Proof of inequality (1): J(C^t, µ^t) ≥ J(C^t, µ^{t+1})
(See step 1 of iteration t.)

Fix j ∈ {1, . . . , K}. If we write C^t_j = {x_{i1}, x_{i2}, . . . , x_{il}}, where l := |C^t_j|, then

J(C^t_j, µ^t_j) = Σ_{p=1}^{l} (x_{ip} − µ^t_j)², hence J(C^t, µ^t) = Σ_{j=1}^{K} J(C^t_j, µ^t_j).

If C^t_j is considered fixed and µ^t_j variable, then we can immediately minimize the function

f(µ) := J(C^t_j, µ) = l µ² − 2µ Σ_{p=1}^{l} x_{ip} + Σ_{p=1}^{l} x_{ip}²  ⇒  argmin_µ J(C^t_j, µ) = (1/l) Σ_{p=1}^{l} x_{ip} =: µ^{t+1}_j.

Therefore J(C^t_j, µ) ≥ J(C^t_j, µ^{t+1}_j) for every µ. In particular, for µ = µ^t_j we get:
J(C^t_j, µ^t_j) ≥ J(C^t_j, µ^{t+1}_j). This inequality holds for all clusters j = 1, . . . , K. Summing all these inequalities yields: J(C^t, µ^t) ≥ J(C^t, µ^{t+1}).

8.

Page 47:

Proof of inequality (2): J(C^t, µ^{t+1}) ≥ J(C^{t+1}, µ^{t+1})
(See step 2 of iteration t.)

At this step, an arbitrary instance xi, with i ∈ {1, . . . , n}, is reassigned from the cluster with centroid µ^{t+1}_j to another centroid µ^{t+1}_q if

||xi − µ^{t+1}_{j′}||² ≥ ||xi − µ^{t+1}_q||² ⇔ (xi − µ^{t+1}_{j′})² ≥ (xi − µ^{t+1}_q)², for every j′ = 1, . . . , K.

In the context of iteration t, this implies

(xi − µ^{t+1}_{C^t(xi)})² ≥ (xi − µ^{t+1}_{C^{t+1}(xi)})².

Summing these inequalities term by term for i = 1, . . . , n yields: J(C^t, µ^{t+1}) ≥ J(C^{t+1}, µ^{t+1}), which was to be proved.

9.

Page 48:

b. What can you say about the termination of the K-means algorithm? (Does this algorithm terminate in a finite number of steps, or is it possible for it to revisit a previous configuration µ?)

10.

Page 49:

If the algorithm revisits a K-partition, then it follows that for some t we have J(C^{t−1}, µ^t) = J(C^t, µ^{t+1}). It is possible for this to happen, namely when:

− there are repeated instances (i.e., xi = xj although i ≠ j),

− the stopping criterion of the K-means algorithm is of the form "until the composition of the clusters no longer changes",

− it is assumed that, when an instance xi lies at equal distance from two or more centroids, it may be assigned randomly to any of them.

This is what happens in the example in the adjacent figure if we consider that at some iteration t we have x2 = 0 ∈ C^t_1 and x3 = 0 ∈ C^t_2, while at the next iteration we choose x3 = 0 ∈ C^{t+1}_1 and x2 = 0 ∈ C^{t+1}_2, and again the other way around at iteration t + 2.

[Figure: four points on the real line, x1 = −1, x2 = x3 = 0, x4 = 1, with centroids µ1 = −1/2 and µ2 = 1/2.]

11.

Page 50:

Observations

• If we keep the criterion given as an example in the problem statement, that is, we iterate until the centroids "stand still", the algorithm may stop without J(C, µ) having reached, at the last iteration, the minimum possible value. In the case of the example above, we will have 1/4 + 2 · 1/4 + 1/4 = 1 > 2/3.

• If there are no repeated instances situated at equal distances from two or more centroids at some iteration of the K-means algorithm (as x2 and x3 are in the example above), or if we impose the restriction that in such situations identical instances must be assigned to a single cluster, it is clear that the K-means algorithm stops in a finite number of steps.

12.

Page 51:

Conclusions

• Starting from a given initialization of the K centroids, the K-means algorithm explores only a subset of the total of K^n possible K-partitions, while guaranteeing that the property J(C^0, µ^1) ≥ J(C^1, µ^2) ≥ . . . ≥ J(C^{t−1}, µ^t) ≥ J(C^t, µ^{t+1}) holds, according to part a of this problem.

• Reaching the global minimum of the function J(C, µ), where C is a variable ranging over the set of all K-partitions that can be formed with the instances {x1, . . . , xn}, is not guaranteed for the K-means algorithm. The value of J obtained when the K-means algorithm stops depends on the initial placement of the centroids µ, as well as on the concrete way the clusters are formed when some instance lies at equal distance from two or more centroids, as shown in the example above.

13.