ผู้ช่วยศาสตราจารย์ ดร.จิรัฎฐา ...

Post on 05-Feb-2016

38 views 0 download

description

HIERARCHICAL AND k - MEANS CLUSTERING. 8. ผู้ช่วยศาสตราจารย์ ดร.จิรัฎฐา ภูบุญอบ ( jiratta . p@msu . ac . th, 08-9275-9797 ). 1. Clustering Task. Clustering refers to grouping records, observations, or tasks into classes of similar objects - PowerPoint PPT Presentation

Transcript of ผู้ช่วยศาสตราจารย์ ดร.จิรัฎฐา ...

1

ผู้��ช่�วยศาสตราจารย� ดร.จ�ร�ฎฐา ภู�บุ�ญอบุ

(jiratta.p@msu.ac.th, 08-9275-9797)

HIERARCHICAL HIERARCHICAL ANDANDkk--MEANS MEANS CLUSTERINGCLUSTERING

2

1. Clustering TaskClustering refers to grouping records, observations, or tasks into classes of similar objectsCluster is collection records similar to one anotherRecords in one cluster dissimilar to records in other clustersClustering is unsupervised data mining taskTherefore, no target variable specified

3

1. Clustering Task (cont’d)

For example, Claritas, Inc. provides demographic profiles of geographic areas, according to zip codePRIZM segmentation system clusters zip codes in terms of lifestyle typesRecall clusters identified for 90210 Beverly Hills, CA

Cluster 01: Blue Blood Estates“Established executives, professionals, and ‘old money’ heirs that live in America’s wealthiest suburbs...”Cluster 10: Bohemian MixCluster 02: Winner’s CircleCluster 07: Money and BrainsCluster 08: Young Literati

4

1. Clustering Task (cont’d)

Clustering Tasks in Business and Research

Target marketing for niche product, without large marketing budgetSegment financial behavior into benign and suspicious categoriesGene expression clustering, where genes exhibit similar characteristics

5

1. Clustering Task (cont’d)

Cluster analysis addresses similar issues encountered in classification

Similarity measurementRecoding categorical variablesStandardizing and normalizing variablesNumber of clusters

6

1. Clustering Task (cont’d)

Measuring SimilarityEuclidean Distance measures distance between records

Other distance measurements include City-Block Distance and Minkowski Distance

i

ii yxd 2Euclidean )(),( yx

q

iii

iii

yxd

yxd

),(

),(

Minkowski

Block-City

yx

yx

7

1. Clustering Task (cont’d)

SequentialThreshold

ParallelThreshold

SquareError

Clustering Procedures

Nonhierarchical

Hierarchical

Agglomerative

Divisive

SingleLinkage

CompleteLinkage

AverageLinkage

k-Means

8

2. Hierarchical Clustering Methods

Treelike cluster structure (dendogram) created through recursive partitioning (Divisive Methods) or combining (Agglomerative Methods) existing clusters

Divisive MethodsAgglomerative Methods

9

3. Agglomerative Methods

Each observation initialized to become own clusterAt each iteration two closest clusters aggregated togetherNumber of clusters reduced by one, each stepEventually, all records combined into single clusterAgglomerative more popular hierarchical methodTherefore, focus remains on this approach

Measuring distance between records straightforward once recoding and normalization appliedHowever, how is distance between clusters determined?

10

3. Agglomerative Methods (cont’d)

Single Linkage

Minimum DistanceCluster 1 Cluster 2

Complete Linkage

Maximum Distance

Cluster 1 Cluster 2

Average Linkage

Average Distance

Cluster 1 Cluster 2

11

3. Agglomerative Methods (cont’d)

สมมต�ว�าเราจะ cluster ลู�กค้�า based on ลู�กษณะการด"#ม Beer หร"อ Wine หน่�วยเป็'น่แก�ว แลูะป็ร�มาณพอๆ ก�น่

ลู�กค้�า Beer Wine

A 1 1

B 1 2

C 8 2

D 6 3

E 8 0

12

3.1 Single Linkage

ค้+าน่วณระยะห�างระหว�างจ�ดต�างๆ ระหว�าง observation โดยใช่� Squared Euclidean Distance แลู�วเลู"อกระยะทางท0#ส� 1น่ท0#ส�ด

A B C D

Dist

A B C D E

A 0B 0C 0D 0E 0

Distance MatrixInitial Data Items

E

13

3.1 Single Linkage (cont’d)

A B C D

Dist A B C D E

A 0B 0C 0D 0E 0

Distance MatrixInitial Data Items

E

14

3.1 Single Linkage (cont’d)

A B C D

Dist A B C D E

ABCDE

Distance MatrixCurrent Clusters

E

1

15

3.1 Single Linkage (cont’d)

A B C D

Dist

AB

C D E

ABCDE

Distance MatrixCurrent Clusters

E

1

16

3.1 Single Linkage (cont’d)

A B C D

Dist

AB

C D E

ABCDE

Distance MatrixCurrent Clusters

E

1

17

3.1 Single Linkage (cont’d)

A B C E

Dist

AB

C D E

ABCDE

Distance MatrixCurrent Clusters

D

1

4

18

3.1 Single Linkage (cont’d)

A B C E

Dist

AB

CE

D

ABCED

Distance MatrixCurrent Clusters

D

1

4

19

3.1 Single Linkage (cont’d)

A B C E

Dist

AB

CE

D

ABCED

Distance MatrixCurrent Clusters

D

14

20

3.1 Single Linkage (cont’d)

A B C E

Dist

AB

CE

D

ABCED

Distance MatrixCurrent Clusters

D

14

5

21

3.1 Single Linkage (cont’d)

A B C E

Dist ABCED

ABCED

Distance MatrixCurrent Clusters

D

14

5

22

3.1 Single Linkage (cont’d)

A B C E

Dist ABCED

ABCED

Distance MatrixFinal Result

D

14

5

26

23

3.2 Complete Linkage

ค้+าน่วณระยะห�างระหว�างจ�ดต�างๆ ระหว�าง observation โดยใช่� Squared Euclidean Distance แลู�วเลู"อกระยะทางท0#ยาวท0#ส�ด

A B C D

Dist

A B C D E

ABCDE

Distance MatrixInitial Data Items

E

24

3.2 Complete Linkage (cont’d)

A B C D

Dist A B C D E

ABCDE

Distance MatrixInitial Data Items

E

25

3.2 Complete Linkage (cont’d)

A B C D

Dist A B C D E

ABCDE

Distance MatrixCurrent Clusters

E

1

26

3.2 Complete Linkage (cont’d)

A B C D

Dist

AB

C D E

ABCDE

Distance MatrixCurrent Clusters

E

1

27

3.2 Complete Linkage (cont’d)

A B C D

Dist

AB

C D E

ABCDE

Distance MatrixCurrent Clusters

E

1

28

3.2 Complete Linkage (cont’d)

A B C E

Dist

AB

C D E

ABCDE

Distance MatrixCurrent Clusters

D

1

4

29

3.2 Complete Linkage (cont’d)

A B C E

Dist

AB

CE

D

ABCED

Distance MatrixCurrent Clusters

D

1

4

30

3.2 Complete Linkage (cont’d)

A B C E

Dist

AB

CE

D

ABCED

Distance MatrixCurrent Clusters

D

14

31

3.2 Complete Linkage (cont’d)

A B C E

Dist

AB

CE

D

ABCED

Distance MatrixCurrent Clusters

D

14

13

32

3.2 Complete Linkage (cont’d)

A B C E

Dist ABCED

ABCED

Distance MatrixCurrent Clusters

D

14

13

33

3.2 Complete Linkage (cont’d)

A B C E

Dist ABCED

ABCED

Distance MatrixFinal Result

D

14

13

53

34

4. k-Means Clustering

k-Means AlgorithmStep 1: Analyst specifies k = number of clusters to partition dataStep 2: k records randomly assigned to initial clustersStep 3: For each record, find cluster centerEach cluster center “owns” subset of recordsResults in k clusters, C1, C2, ...., Ck Step 4: For each of k clusters, find cluster centroidUpdate cluster center location to centroidStep 5: Repeats Steps 3 – 5 until convergence or termination

35

4. k-Means Clustering (cont’d)

Nearest criterion in Step 3 typically Euclidean Distance

Determining Cluster CentroidAssume n data points (a1, b1, c1), (a2, b2, c2), ..., (an, bn, cn)Centroid of points is center of gravity of pointsLocated at point (Σ ai/n, Σ bi/n, Σ ci/n)

For example, points (1, 1, 1), (1, 2, 1), (1, 3, 1), and (2, 1, 1) have centroid

)00.1,75.1,25.1(4

1111,

4

1321,

4

2111

36

4. k-Means Clustering (cont’d)

k-Means algorithm terminates when centroids no longer changeFor k clusters, C1, C2, ...., Ck, all records “owned” by cluster remain in clusterConvergence criterion may also cause terminationFor example, no significant reduction in SSE

icluster of centroid represents

icluster in point dataeach

where,),(1

2

i

i

k

i iCpi

m

Cp

mpdSSE

37

5. Example of k-Means Clustering at Work

Assume k = 2 to cluster following data points

a b c d e f g h(1

,3

)

(3

,3

)

(4,3)

(5,3)

(1

,2

)

(4

,2

)

(1

,1

)

(2

,1

)

Step 1: k = 2 specifies number of clusters to partitionStep 2: Randomly assign k = 2 cluster centers

For example, m1 = (1, 1) and m2 = (2, 1)

First IterationStep 3: For each record, find nearest cluster center

Euclidean distance from points to m1 and m2 shown

38

5. Example of k-Means Clustering at Work (cont’d)

Point a b c d e f g h

Distance from m1

2.00

2.83

3.61

4.47

1.00

3.16

0.00

1.00

Distance from m2

2.24

2.24

2.83

3.61

1.41

2.24

1.00

0.00

Cluster Membershi

pC1 C2 C2 C2 C1 C2 C1 C2

Cluster m1 contains {a, e, g} and m2 has {b, c, d, f, h}

39

5. Example of k-Means Clustering at Work (cont’d)

Cluster membership assigned, now SSE calculated

360024.2161.383.224.22

),(

22222222

1

2

k

i iCpimpdSSE

Recall clusters constructed where between-cluster variation (BCV) large, as compared to within-cluster variation (WCV)

for WCV surrogate SSE

BCVfor surrogate ),(

where,0278.036

1

SSE

),(

WCV

BCV

21

21

mmd

mmd

Ratio BCV/WCV expected to increase for successive iterations

40

5. Example of k-Means Clustering at Work (cont’d)

Step 4: For k clusters, find cluster centroid, update locationCluster 1 = [(1 + 1 + 1)/3, (3 + 2 + 1)/3] = (1, 2), Cluster 2 = [(3 + 4 + 5 + 4 + 2)/5, (3 + 3 + 3 + 2 + 1)/5] = (3.6, 2.4)

Figure shows movement of clusters m1 and m2

(triangles) after first iteration of algorithm

Step 5: Repeats Steps 3 – 4 until convergence or termination

0 1 2 3 4 5 60

1

2

5

4

3

41

5. Example of k-Means Clustering at Work (cont’d)

Point

a(1

,3

)

b(3

,3

)

c (4,3)

d (5,3)

e(1

,2

)

f(4

,2

)

g(1

,1

)

h(2

,1

)Distance from m1

12.24

3.16

4.12

0 3 11.41

Distance from m2

2.66

0.84

0.72

1.52

2.63

0.57

2.95

2.13

Cluster Membershi

p1 2 2 2 1 2 1 1

Cluster m1 contains {a, e, g , h} and m2 has {b, c, d, f}

Cluster 1 = [(1 + 1 + 1)/3, (3 + 2 + 1)/3] = (1, 2), Cluster 2 = [(3 + 4 + 5 + 4 + 2)/5, (3 + 3 + 3 + 2 + 1)/5] = (3.6, 2.4)

42

5. Example of k-Means Clustering at Work (cont’d)

Second IterationRepeat procedure for Steps 3 – 4Again, for each record find nearest cluster center m1 = (1, 2) or m2 = (3.6, 2.4)Cluster m1 contains {a, e, g, h} and m2 has {b, c, d, f}SSE = 7.86, and BCV/WCV = 0.3346Note 0.3346 has increased compared to First Iteration value = 0.0278Between-cluster variation increasing with respect to Within-cluster variation

43

5. Example of k-Means Clustering at Work (cont’d)

Cluster centroids updated to m1 = (1.25, 1.75) or m2 = (4, 2.75)

After Second Iteration, cluster centroids shown to move slightly

0 1 2 3 4 5 60

1

2

5

4

3

44

5. Example of k-Means Clustering at Work (cont’d)

Third (Final) IterationRepeat procedure for Steps 3 – 4Now, for each record find nearest cluster center m1 = (1.25, 1.75) or m2 = (4, 2.75)SSE = 6.23, and BCV/WCV = 0.4703Again, BCV/WCV has increased compared to previous = 0.3346This time, no records shift cluster membershipCentroids remain unchanged, therefore algorithm terminates