Posted on 05-Feb-2016
Asst. Prof. Dr. Jiratta Phuboon-ob
(jiratta.p@msu.ac.th, 08-9275-9797)
HIERARCHICAL AND k-MEANS CLUSTERING
1. Clustering Task
- Clustering refers to grouping records, observations, or cases into classes of similar objects.
- A cluster is a collection of records that are similar to one another.
- Records in one cluster are dissimilar to records in other clusters.
- Clustering is an unsupervised data mining task; therefore, no target variable is specified.
1. Clustering Task (cont’d)
For example, Claritas, Inc. provides demographic profiles of geographic areas according to zip code. Its PRIZM segmentation system clusters zip codes in terms of lifestyle types. Recall the clusters identified for zip code 90210, Beverly Hills, CA:
- Cluster 01: Blue Blood Estates ("Established executives, professionals, and 'old money' heirs that live in America's wealthiest suburbs...")
- Cluster 02: Winner's Circle
- Cluster 07: Money and Brains
- Cluster 08: Young Literati
- Cluster 10: Bohemian Mix
1. Clustering Task (cont’d)
Clustering Tasks in Business and Research
- Target marketing for a niche product without a large marketing budget
- Segmenting financial behavior into benign and suspicious categories
- Gene expression clustering, where genes exhibit similar characteristics
1. Clustering Task (cont’d)
Cluster analysis addresses similar issues encountered in classification
- Similarity measurement
- Recoding categorical variables
- Standardizing and normalizing variables
- Determining the number of clusters
1. Clustering Task (cont’d)
Measuring Similarity: Euclidean distance measures the distance between records:

$$d_{\text{Euclidean}}(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$$

Other distance measures include city-block (Manhattan) distance and Minkowski distance:

$$d_{\text{City-Block}}(x, y) = \sum_i |x_i - y_i|$$

$$d_{\text{Minkowski}}(x, y) = \Big( \sum_i |x_i - y_i|^q \Big)^{1/q}$$
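As a quick illustration, the three distance measures can be written in a few lines of Python (a sketch; the function names are mine, not from the slides):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two records of equal length."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def city_block(x, y):
    """City-block (Manhattan) distance: sum of absolute differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def minkowski(x, y, q):
    """Minkowski distance; q = 1 gives city-block, q = 2 gives Euclidean."""
    return sum(abs(xi - yi) ** q for xi, yi in zip(x, y)) ** (1 / q)
```

Setting q = 2 in minkowski recovers euclidean, and q = 1 recovers city_block.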
1. Clustering Task (cont’d)
Clustering procedures fall into two broad families:
- Hierarchical
  - Agglomerative: single linkage, complete linkage, average linkage
  - Divisive
- Nonhierarchical: sequential threshold, parallel threshold, square error (k-means)
2. Hierarchical Clustering Methods
A treelike cluster structure (dendrogram) is created through recursive partitioning (divisive methods) or combining (agglomerative methods) of existing clusters.
3. Agglomerative Methods
- Each observation is initialized as its own cluster.
- At each iteration, the two closest clusters are aggregated together.
- The number of clusters is thus reduced by one at each step.
- Eventually, all records are combined into a single cluster.
- Agglomerative is the more popular hierarchical method; therefore, the focus remains on this approach.

Measuring the distance between records is straightforward once recoding and normalization are applied. However, how is the distance between clusters determined?
3. Agglomerative Methods (cont’d)
- Single linkage: the distance between two clusters is the minimum distance between any record in Cluster 1 and any record in Cluster 2.
- Complete linkage: the maximum distance between any record in Cluster 1 and any record in Cluster 2.
- Average linkage: the average distance between records in Cluster 1 and records in Cluster 2.
3. Agglomerative Methods (cont’d)
Suppose we want to cluster customers based on how many glasses of beer and wine they drink, with the two variables on roughly the same scale.

Customer | Beer | Wine
A | 1 | 1
B | 1 | 2
C | 8 | 2
D | 6 | 3
E | 8 | 0
3.1 Single Linkage
Compute the distance between every pair of observations using squared Euclidean distance, then select the shortest distance between clusters.
Initial distance matrix (squared Euclidean distances between the five customers):

Dist | A  | B  | C | D  | E
A    | 0  |    |   |    |
B    | 1  | 0  |   |    |
C    | 50 | 49 | 0 |    |
D    | 29 | 26 | 5 | 0  |
E    | 50 | 53 | 4 | 13 | 0

- Step 1: The smallest entry is d(A, B) = 1, so A and B merge into cluster {A, B} at height 1.
- Step 2: The next smallest distance between current clusters is d(C, E) = 4, so C and E merge into {C, E} at height 4.
- Step 3: The single-linkage distance from {C, E} to D is min(5, 13) = 5, the smallest remaining, so D joins to form {C, D, E} at height 5.
- Step 4: The single-linkage distance from {A, B} to {C, D, E} is min(50, 29, 50, 49, 26, 53) = 26, so all records merge into a single cluster at height 26.

Final result: the dendrogram joins clusters at heights 1, 4, 5, and 26.
3.2 Complete Linkage
Compute the distance between every pair of observations using squared Euclidean distance; here the distance between two clusters is the longest (maximum) distance between their members.
The initial distance matrix is the same as for single linkage.

- Step 1: The smallest entry is d(A, B) = 1, so A and B merge into {A, B} at height 1.
- Step 2: The smallest remaining distance is d(C, E) = 4, so C and E merge into {C, E} at height 4.
- Step 3: The complete-linkage distance from {C, E} to D is max(5, 13) = 13, smaller than the distances involving {A, B}, so D joins to form {C, D, E} at height 13.
- Step 4: The complete-linkage distance from {A, B} to {C, D, E} is max(50, 29, 50, 49, 26, 53) = 53, so all records merge at height 53.

Final result: the dendrogram joins clusters at heights 1, 4, 13, and 53.
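Both linkage walkthroughs can be reproduced with a short brute-force sketch in Python (the function names are my own; squared Euclidean distance matches the matrices above):

```python
def sq_euclidean(x, y):
    """Squared Euclidean distance, as used in the slides' matrices."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def agglomerate(records, cluster_dist):
    """Merge the two closest clusters until one remains; return merge heights."""
    clusters = [[p] for p in records.values()]
    heights = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under the given linkage.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        heights.append(d)
    return heights

def single_link(c1, c2):
    return min(sq_euclidean(p, q) for p in c1 for q in c2)

def complete_link(c1, c2):
    return max(sq_euclidean(p, q) for p in c1 for q in c2)

customers = {"A": (1, 1), "B": (1, 2), "C": (8, 2), "D": (6, 3), "E": (8, 0)}
print(agglomerate(customers, single_link))    # [1, 4, 5, 26]
print(agglomerate(customers, complete_link))  # [1, 4, 13, 53]
```

The printed merge heights match the two walkthroughs: 1, 4, 5, 26 for single linkage and 1, 4, 13, 53 for complete linkage.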
4. k-Means Clustering
The k-means algorithm:
Step 1: The analyst specifies k, the number of clusters into which the data are partitioned.
Step 2: k records are randomly assigned as the initial cluster centers.
Step 3: Each record is assigned to its nearest cluster center; each center thus "owns" a subset of the records, yielding k clusters C1, C2, ..., Ck.
Step 4: For each of the k clusters, the cluster centroid is found and the cluster center is updated to the centroid's location.
Step 5: Steps 3 – 4 are repeated until convergence or termination.
4. k-Means Clustering (cont’d)
The "nearest" criterion in Step 3 is typically Euclidean distance.

Determining the cluster centroid: assume n data points (a1, b1, c1), (a2, b2, c2), ..., (an, bn, cn). The centroid of the points is their center of gravity, located at the point (Σ ai/n, Σ bi/n, Σ ci/n).

For example, the points (1, 1, 1), (1, 2, 1), (1, 3, 1), and (2, 1, 1) have centroid

$$\left(\frac{1+1+1+2}{4}, \frac{1+2+3+1}{4}, \frac{1+1+1+1}{4}\right) = (1.25, 1.75, 1.00)$$
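The centroid arithmetic above is easy to verify with a throwaway Python snippet (not from the slides):

```python
# Center of gravity: average each coordinate across the points.
points = [(1, 1, 1), (1, 2, 1), (1, 3, 1), (2, 1, 1)]
centroid = tuple(sum(coord) / len(points) for coord in zip(*points))
print(centroid)  # (1.25, 1.75, 1.0)
```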
4. k-Means Clustering (cont’d)
The k-means algorithm terminates when the centroids no longer change: for the k clusters C1, C2, ..., Ck, all records "owned" by each cluster remain in that cluster. A convergence criterion may also cause termination, for example, no significant reduction in SSE:

$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2$$

where p represents each data point in cluster i and m_i represents the centroid of cluster i.
5. Example of k-Means Clustering at Work
Assume k = 2 and the following data points are to be clustered:

Point | a | b | c | d | e | f | g | h
(x, y) | (1, 3) | (3, 3) | (4, 3) | (5, 3) | (1, 2) | (4, 2) | (1, 1) | (2, 1)

Step 1: k = 2 specifies the number of clusters to partition.
Step 2: Randomly assign k = 2 cluster centers, for example m1 = (1, 1) and m2 = (2, 1).

First iteration
Step 3: For each record, find the nearest cluster center. The Euclidean distances from each point to m1 and m2 are shown in the following table.
5. Example of k-Means Clustering at Work (cont’d)
Point | a | b | c | d | e | f | g | h
Distance from m1 | 2.00 | 2.83 | 3.61 | 4.47 | 1.00 | 3.16 | 0.00 | 1.00
Distance from m2 | 2.24 | 2.24 | 2.83 | 3.61 | 1.41 | 2.24 | 1.00 | 0.00
Cluster membership | C1 | C2 | C2 | C2 | C1 | C2 | C1 | C2

Cluster m1 contains {a, e, g} and cluster m2 contains {b, c, d, f, h}.
5. Example of k-Means Clustering at Work (cont’d)
With cluster membership assigned, the SSE is calculated:

$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2 = 2^2 + 2.24^2 + 2.83^2 + 3.61^2 + 1^2 + 2.24^2 + 0^2 + 0^2 \approx 36$$

Recall that clusters should be constructed so that the between-cluster variation (BCV) is large compared to the within-cluster variation (WCV). Using d(m1, m2) as a surrogate for BCV and SSE as a surrogate for WCV:

$$\frac{\mathrm{BCV}}{\mathrm{WCV}} = \frac{d(m_1, m_2)}{\mathrm{SSE}} = \frac{1}{36} \approx 0.0278$$

The ratio BCV/WCV is expected to increase for successive iterations.
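The first-iteration numbers can be checked with a short script (variable names are mine; using exact squared distances the SSE comes out to exactly 36):

```python
import math

m1, m2 = (1, 1), (2, 1)
points = [(1, 3), (3, 3), (4, 3), (5, 3), (1, 2), (4, 2), (1, 1), (2, 1)]

def sq_dist(p, m):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, m))

# Each point contributes its squared distance to the nearer center.
sse = sum(min(sq_dist(p, m1), sq_dist(p, m2)) for p in points)
ratio = math.dist(m1, m2) / sse  # BCV surrogate / WCV surrogate
print(sse, round(ratio, 4))  # 36 0.0278
```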
5. Example of k-Means Clustering at Work (cont’d)
Step 4: For each of the k clusters, find the cluster centroid and update the cluster center location:
Cluster 1 centroid = [(1 + 1 + 1)/3, (3 + 2 + 1)/3] = (1, 2); Cluster 2 centroid = [(3 + 4 + 5 + 4 + 2)/5, (3 + 3 + 3 + 2 + 1)/5] = (3.6, 2.4)
The figure shows the movement of cluster centers m1 and m2 (triangles) after the first iteration of the algorithm.

Step 5: Repeat Steps 3 – 4 until convergence or termination.
5. Example of k-Means Clustering at Work (cont’d)
Point | a (1, 3) | b (3, 3) | c (4, 3) | d (5, 3) | e (1, 2) | f (4, 2) | g (1, 1) | h (2, 1)
Distance from m1 | 1.00 | 2.24 | 3.16 | 4.12 | 0.00 | 3.00 | 1.00 | 1.41
Distance from m2 | 2.66 | 0.84 | 0.72 | 1.52 | 2.63 | 0.57 | 2.95 | 2.13
Cluster membership | C1 | C2 | C2 | C2 | C1 | C2 | C1 | C1

Cluster m1 contains {a, e, g, h} and cluster m2 contains {b, c, d, f}.
Cluster 1 centroid = [(1 + 1 + 1 + 2)/4, (3 + 2 + 1 + 1)/4] = (1.25, 1.75); Cluster 2 centroid = [(3 + 4 + 5 + 4)/4, (3 + 3 + 3 + 2)/4] = (4, 2.75)
5. Example of k-Means Clustering at Work (cont’d)
Second iteration: repeat the procedure for Steps 3 – 4.
- Again, for each record find the nearest cluster center, m1 = (1, 2) or m2 = (3.6, 2.4).
- Cluster m1 contains {a, e, g, h} and m2 contains {b, c, d, f}.
- SSE = 7.86 and BCV/WCV = 0.3346.
- Note that 0.3346 has increased compared to the first-iteration value of 0.0278: between-cluster variation is increasing relative to within-cluster variation.
5. Example of k-Means Clustering at Work (cont’d)
The cluster centroids are updated to m1 = (1.25, 1.75) and m2 = (4, 2.75). After the second iteration, the cluster centroids are shown to move only slightly.
5. Example of k-Means Clustering at Work (cont’d)
Third (final) iteration: repeat the procedure for Steps 3 – 4.
- Now, for each record find the nearest cluster center, m1 = (1.25, 1.75) or m2 = (4, 2.75).
- SSE = 6.23 and BCV/WCV = 0.4703; again, BCV/WCV has increased compared to the previous value of 0.3346.
- This time, no records shift cluster membership.
- The centroids remain unchanged; therefore, the algorithm terminates.
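The whole worked example can be reproduced with a compact k-means sketch (my own code, following Steps 1 – 5 above; it assumes no cluster ever goes empty, which holds for this data):

```python
def kmeans(points, centers, max_iter=100):
    """Lloyd's k-means with Euclidean distance; stops when centroids stop moving."""
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Step 3: assign each record to its nearest cluster center.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[i].append(p)
        # Step 4: move each center to the centroid of the records it owns.
        new_centers = [tuple(sum(col) / len(c) for col in zip(*c)) for c in clusters]
        if new_centers == centers:  # Step 5: converged, centroids unchanged
            return new_centers, clusters
        centers = new_centers
    return centers, clusters

data = [(1, 3), (3, 3), (4, 3), (5, 3), (1, 2), (4, 2), (1, 1), (2, 1)]
centers, clusters = kmeans(data, [(1, 1), (2, 1)])
print(centers)  # [(1.25, 1.75), (4.0, 2.75)]
```

Starting from m1 = (1, 1) and m2 = (2, 1), the run converges to the final centroids (1.25, 1.75) and (4, 2.75) found in the third iteration above.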