Clustering Lecture


Transcript of Clustering Lecture


    Unsupervised Clustering

Clustering is a very general problem that appears in many different settings (not necessarily in a data mining context):

Grouping “similar products” together to improve the efficiency of a production line

Packing “similar items” into containers

Grouping “similar customers” together

Grouping “similar stocks” together

The Similarity Concept

Obviously, the concept of similarity is key to clustering. Using similarity definitions that are specific to a domain may generate more acceptable clusters. E.g., products that require the same or similar tools/processes in the production line are similar.

Articles that are in the course pack of the same course are similar.

General similarity measures are required for general-purpose algorithms.

Clustering: The K-Means Algorithm (Lloyd, 1982)

1. Choose a value for K, the total number of clusters.

2. Randomly choose K points as cluster centers.

3. Assign the remaining instances to their closest cluster center.

4. Calculate a new cluster center for each cluster.

5. Repeat steps 3–4 until the cluster centers do not change.

(A minimal code sketch of these steps follows below.)
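Below is a minimal NumPy sketch of the five steps above (Lloyd's algorithm). It is illustrative only; the function name, the seed handling, and the tiny example data set are my own choices rather than part of the lecture.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly choose K points as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each instance to its closest cluster center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recalculate each cluster center as the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step 5: stop when the cluster centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Tiny two-dimensional example with K = 2
X = np.array([[1.0, 1.5], [1.0, 4.5], [2.0, 1.5],
              [2.0, 3.5], [3.0, 2.5], [5.0, 6.0]])
print(kmeans(X, k=2))
```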

Distance Measure

The similarity is captured by a distance measure in this algorithm. The originally proposed measure of distance is the Euclidean distance:

$$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}, \qquad X = (x_1, x_2, \ldots, x_n),\; Y = (y_1, y_2, \ldots, y_n)$$
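A quick worked example (the two points are chosen arbitrarily, not taken from the slides): for $X = (1.0, 1.5)$ and $Y = (1.0, 4.5)$,

$$d(X, Y) = \sqrt{(1.0-1.0)^2 + (1.5-4.5)^2} = \sqrt{9} = 3.0$$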

An Example Using K-Means

Table 3.6 • K-Means Input Values

Instance    X      Y
1           1.0    1.5
2           1.0    4.5
3           2.0    1.5
4           2.0    3.5
5           3.0    2.5
6           5.0    6.0

[Figure: the six instances of Table 3.6 plotted in two dimensions (x vs. f(x)), labeled 1–6.]

Table 3.7 • Several Applications of the K-Means Algorithm (K = 2)

Outcome    Cluster Centers    Cluster Points    Squared Error
1          (2.67, 4.67)       2, 4, 6           14.50
           (2.00, 1.83)       1, 3, 5
2          (1.5, 1.5)         1, 3              15.94
           (2.75, 4.125)      2, 4, 5, 6
3          (1.8, 2.7)         1, 2, 3, 4, 5      9.60
           (5, 6)             6
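As a check (this calculation is mine, not part of the original slides), the squared error for Outcome 3 can be recomputed from the Table 3.6 coordinates and the center (1.8, 2.7):

$$SSE = (0.8^2 + 1.2^2) + (0.8^2 + 1.8^2) + (0.2^2 + 1.2^2) + (0.2^2 + 0.8^2) + (1.2^2 + 0.2^2) + 0 = 2.08 + 3.88 + 1.48 + 0.68 + 1.48 = 9.60$$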

[Figure: the six instances plotted again (x vs. f(x)).]


    General Considerations

Works best when the clusters in the data are of approximately equal size.

Attribute significance cannot be determined.

Lacks explanation capabilities.

Requires real-valued data. Categorical data can be converted to real values, but the distance function needs to be worked out carefully.

We must select the number of clusters present in the data.

Data normalization may be required if attribute ranges vary significantly.

Alternative distance measures may generate different clusters.

K-means Clustering

Partitional clustering approach.

Each cluster is associated with a centroid (center point).

Each point is assigned to the cluster with the closest centroid.

K-means Clustering – Details

Initial centroids are often chosen randomly. Clusters produced vary from one run to another.

The centroid is (typically) the mean of the points in the cluster.

“Closeness” is measured by Euclidean distance, cosine similarity, correlation, etc.

K-means will converge for the common similarity measures mentioned above. Most of the convergence happens in the first few iterations, so the stopping condition is often changed to “until relatively few points change clusters.”

Complexity is O(n × K × I × d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes.

Two Different K-means Clusterings

[Figure: three panels (Original Points, Optimal Clustering, Sub-optimal Clustering) showing the same data grouped two different ways.]

Importance of Choosing Initial Centroids

[Figure: snapshots of K-means Iterations 1 through 6 for one choice of initial centroids.]

Importance of Choosing Initial Centroids

[Figure: K-means Iterations 1 through 6, shown as separate panels.]

Evaluating K-means Clusters

The most common measure is the Sum of Squared Error (SSE). For each point, the error is the distance to the nearest cluster center; to get the SSE, we square these errors and sum them:

$$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist(m_i, x)^2$$

Here $x$ is a data point in cluster $C_i$ and $m_i$ is the representative point for cluster $C_i$; one can show that $m_i$ corresponds to the center (mean) of the cluster.

Given two clusterings, we can choose the one with the smaller error.
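A short NumPy helper for the SSE above, applied to Outcome 3 of Table 3.7. This is a sketch of my own, not code from the lecture:

```python
import numpy as np

def sse(X, labels, centers):
    # Sum over clusters of squared Euclidean distances to each cluster's center
    return sum(np.sum((X[labels == i] - m) ** 2) for i, m in enumerate(centers))

# Outcome 3 of Table 3.7: instances 1-5 in one cluster, instance 6 alone
X = np.array([[1.0, 1.5], [1.0, 4.5], [2.0, 1.5],
              [2.0, 3.5], [3.0, 2.5], [5.0, 6.0]])
labels = np.array([0, 0, 0, 0, 0, 1])
centers = np.array([X[labels == 0].mean(axis=0),   # (1.8, 2.7)
                    X[labels == 1].mean(axis=0)])  # (5.0, 6.0)
print(sse(X, labels, centers))                     # -> 9.6, matching the table
```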

Importance of Choosing Initial Centroids …

[Figure: K-means Iterations 1 through 5 for a different choice of initial centroids.]

Importance of Choosing Initial Centroids …

[Figure: K-means Iterations 1 through 5.]

Problems with Selecting Initial Points

If there are K “real” clusters, then the chance of selecting one centroid from each cluster is small. The chance is relatively small when K is large.

If clusters are the same size, n, then the probability is K!·n^K / (Kn)^K = K!/K^K.

For example, if K = 10, then probability = 10!/10^10 = 0.00036.

Sometimes the initial centroids will readjust themselves in the “right” way, and sometimes they don’t.

Consider an example of five pairs of clusters.
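A one-liner to confirm the arithmetic above (my own check, not from the slides):

```python
from math import factorial

K = 10
print(factorial(K) / K**K)   # 0.00036288, i.e. roughly 0.00036
```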

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree.

Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits.

[Figure: nested clusters of six points and the corresponding dendrogram, with merge heights between 0 and 0.2.]

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters: any desired number of clusters can be obtained by “cutting” the dendrogram at the proper level.

They may correspond to meaningful taxonomies, e.g., in the biological sciences (animal kingdom, phylogeny reconstruction, …).

Hierarchical Clustering

Two main types of hierarchical clustering:

Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left.

Divisive: start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters).

Traditional hierarchical algorithms use a similarity or distance matrix, and merge or split one cluster at a time.

Agglomerative Clustering Algorithm

The more popular hierarchical clustering technique. The basic algorithm is straightforward:

1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the proximity matrix
6. Until only a single cluster remains

The key operation is the computation of the proximity of two clusters. Different approaches to defining the distance between clusters distinguish the different algorithms. (A short SciPy sketch follows below.)
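A minimal sketch of agglomerative clustering using SciPy's hierarchical-clustering utilities, run on the Table 3.6 points (the data choice and the cut at two clusters are my own, not from the lecture):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.5], [1.0, 4.5], [2.0, 1.5],
              [2.0, 3.5], [3.0, 2.5], [5.0, 6.0]])

# Build the merge tree; 'single' = MIN linkage ('complete', 'average', 'ward' also work)
Z = linkage(X, method="single", metric="euclidean")

# Cut the dendrogram to obtain, e.g., two clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```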

Starting Situation

Start with clusters of individual points and a proximity matrix.

[Figure: a row of individual points and the corresponding initial proximity matrix.]

Intermediate Situation

After some merging steps, we have some clusters.

[Figure: clusters C1–C5 and the corresponding proximity matrix.]

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

[Figure: clusters C1–C5, with C2 and C5 highlighted for merging, and the proximity matrix.]

After Merging

The question is: “How do we update the proximity matrix?”

[Figure: the merged cluster C2 ∪ C5 alongside C1, C3, and C4; the proximity-matrix entries involving C2 ∪ C5 are marked “?”.]

How to Define Inter-Cluster Similarity

[Figure: two clusters with “Similarity?” between them, next to the proximity matrix.]

MIN
MAX
Group Average
Distance Between Centroids
Other methods driven by an objective function (Ward's Method uses squared error)

(A small sketch of the first three definitions follows below.)
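A small sketch of the three pairwise-based definitions, for two clusters given as index lists into a precomputed distance matrix D (the function and variable names are mine, for illustration only):

```python
import numpy as np

def inter_cluster_distance(D, a, b, how="min"):
    pair = D[np.ix_(a, b)]        # all pairwise distances between the two clusters
    if how == "min":              # MIN / single link
        return pair.min()
    if how == "max":              # MAX / complete link
        return pair.max()
    if how == "average":          # group average
        return pair.mean()
    raise ValueError(f"unknown linkage: {how}")
```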


Cluster Similarity: MIN or Single Link

Similarity of two clusters is based on the two most similar (closest) points in the different clusters. Determined by one pair of points, i.e., by one link in the proximity graph.

       I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00

Hierarchical Clustering: MIN

[Figure: Nested Clusters and Dendrogram produced by MIN (single link).]

Strength of MIN

[Figure: Original Points and the resulting Two Clusters.]

Can handle non-elliptical shapes.

Limitations of MIN

[Figure: Original Points and the resulting Two Clusters.]

Sensitive to noise and outliers.

Cluster Similarity: MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters. Determined by all pairs of points in the two clusters.

       I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00

Hierarchical Clustering: MAX

[Figure: Nested Clusters and Dendrogram produced by MAX (complete link).]

Strength of MAX

[Figure: Original Points and the resulting Two Clusters.]

Less susceptible to noise and outliers.

Limitations of MAX

[Figure: Original Points and the resulting Two Clusters.]

Tends to break large clusters.

Biased towards globular clusters.

Cluster Similarity: Group Average

Proximity of two clusters is the average of pairwise proximity between points in the two clusters.

Hierarchical Clustering: Group Average

[Figure: Nested Clusters and Dendrogram produced by Group Average.]

Hierarchical Clustering: Group Average

Compromise between Single and Complete Link.

Strengths: less susceptible to noise and outliers.

Limitations: biased towards globular clusters.

Cluster Similarity: Ward's Method

Similarity of two clusters is based on the increase in squared error when the two clusters are merged. Similar to group average if the distance between points is the squared distance.

Less susceptible to noise and outliers.

Biased towards globular clusters.

Hierarchical analogue of K-means; can be used to initialize K-means. (A short SciPy example follows below.)
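Ward's method is available through the same SciPy call used earlier; a sketch, with the data set again being my own choice:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.5], [1.0, 4.5], [2.0, 1.5],
              [2.0, 3.5], [3.0, 2.5], [5.0, 6.0]])

# At each step, merge the pair of clusters that yields the smallest
# increase in total within-cluster squared error
Z = linkage(X, method="ward")
print(fcluster(Z, t=2, criterion="maxclust"))   # e.g., cut into two clusters
```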

Hierarchical Clustering: Comparison

[Figure: the same data clustered with MIN, MAX, Group Average, and Ward's Method, shown side by side.]

Hierarchical Clustering: Time and Space Requirements

O(N²) space, since it uses the proximity matrix (N is the number of points).

O(N³) time in many cases: there are N merge steps, and at each step the proximity matrix must be searched and updated.

Hierarchical Clustering: Problems and Limitations

Once a decision is made to combine two clusters, it cannot be undone.