ผู้ช่วยศาสตราจารย์ ดร.จิรัฎฐา ...

ผู้��ช่�วยศาสตราจารย� ดร.จ�ร�ฎฐา ภู�บุ�ญอบุ

(jiratta.p@msu.ac.th, 08-9275-9797)

HIERARCHICAL HIERARCHICAL ANDANDkk--MEANS MEANS CLUSTERINGCLUSTERING

1. Clustering TaskClustering refers to grouping records, observations, or tasks into classes of similar objectsCluster is collection records similar to one anotherRecords in one cluster dissimilar to records in other clustersClustering is unsupervised data mining taskTherefore, no target variable specified

1. Clustering Task (cont’d)

For example, Claritas, Inc. provides demographic profiles of geographic areas, according to zip codePRIZM segmentation system clusters zip codes in terms of lifestyle typesRecall clusters identified for 90210 Beverly Hills, CA

Cluster 01: Blue Blood Estates“Established executives, professionals, and ‘old money’ heirs that live in America’s wealthiest suburbs...”Cluster 10: Bohemian MixCluster 02: Winner’s CircleCluster 07: Money and BrainsCluster 08: Young Literati

Clustering Tasks in Business and Research

Target marketing for niche product, without large marketing budgetSegment financial behavior into benign and suspicious categoriesGene expression clustering, where genes exhibit similar characteristics

Cluster analysis addresses similar issues encountered in classification

Similarity measurementRecoding categorical variablesStandardizing and normalizing variablesNumber of clusters

Measuring SimilarityEuclidean Distance measures distance between records

Other distance measurements include City-Block Distance and Minkowski Distance

ii yxd 2Euclidean )(),( yx

Minkowski

Block-City

SequentialThreshold

ParallelThreshold

SquareError

Clustering Procedures

Nonhierarchical

Hierarchical

Agglomerative

Divisive

SingleLinkage

CompleteLinkage

AverageLinkage

k-Means

2. Hierarchical Clustering Methods

Treelike cluster structure (dendogram) created through recursive partitioning (Divisive Methods) or combining (Agglomerative Methods) existing clusters

Divisive MethodsAgglomerative Methods

3. Agglomerative Methods

Each observation initialized to become own clusterAt each iteration two closest clusters aggregated togetherNumber of clusters reduced by one, each stepEventually, all records combined into single clusterAgglomerative more popular hierarchical methodTherefore, focus remains on this approach

Measuring distance between records straightforward once recoding and normalization appliedHowever, how is distance between clusters determined?

3. Agglomerative Methods (cont’d)

Single Linkage

Minimum DistanceCluster 1 Cluster 2

Complete Linkage

Maximum Distance

Cluster 1 Cluster 2

Average Linkage

Average Distance

Cluster 1 Cluster 2

3. Agglomerative Methods (cont’d)

สมมต�ว�าเราจะ cluster ลู�กค้�า based on ลู�กษณะการด"#ม Beer หร"อ Wine หน่�วยเป็'น่แก�ว แลูะป็ร�มาณพอๆ ก�น่

ลู�กค้�า Beer Wine

3.1 Single Linkage

ค้+าน่วณระยะห�างระหว�างจ�ดต�างๆ ระหว�าง observation โดยใช่� Squared Euclidean Distance แลู�วเลู"อกระยะทางท0#ส� 1น่ท0#ส�ด

A B C D

A B C D E

A 0B 0C 0D 0E 0

Distance MatrixInitial Data Items

3.1 Single Linkage (cont’d)

A B C D

Dist A B C D E

A 0B 0C 0D 0E 0

A B C D

Dist A B C D E

Distance MatrixCurrent Clusters

A B C D

A B C E

Dist ABCED

A B C E

Dist ABCED

Distance MatrixFinal Result

3.2 Complete Linkage

ค้+าน่วณระยะห�างระหว�างจ�ดต�างๆ ระหว�าง observation โดยใช่� Squared Euclidean Distance แลู�วเลู"อกระยะทางท0#ยาวท0#ส�ด

A B C D

A B C D E

3.2 Complete Linkage (cont’d)

A B C D

Dist A B C D E

A B C D

Dist A B C D E

A B C D

A B C E

Dist ABCED

A B C E

Dist ABCED

Distance MatrixFinal Result

4. k-Means Clustering

k-Means AlgorithmStep 1: Analyst specifies k = number of clusters to partition dataStep 2: k records randomly assigned to initial clustersStep 3: For each record, find cluster centerEach cluster center “owns” subset of recordsResults in k clusters, C1, C2, ...., Ck Step 4: For each of k clusters, find cluster centroidUpdate cluster center location to centroidStep 5: Repeats Steps 3 – 5 until convergence or termination

4. k-Means Clustering (cont’d)

Nearest criterion in Step 3 typically Euclidean Distance

Determining Cluster CentroidAssume n data points (a1, b1, c1), (a2, b2, c2), ..., (an, bn, cn)Centroid of points is center of gravity of pointsLocated at point (Σ ai/n, Σ bi/n, Σ ci/n)

For example, points (1, 1, 1), (1, 2, 1), (1, 3, 1), and (2, 1, 1) have centroid

)00.1,75.1,25.1(4

4. k-Means Clustering (cont’d)

k-Means algorithm terminates when centroids no longer changeFor k clusters, C1, C2, ...., Ck, all records “owned” by cluster remain in clusterConvergence criterion may also cause terminationFor example, no significant reduction in SSE

icluster of centroid represents

icluster in point dataeach

where,),(1

i iCpi

mpdSSE

5. Example of k-Means Clustering at Work

Assume k = 2 to cluster following data points

a b c d e f g h(1

Step 1: k = 2 specifies number of clusters to partitionStep 2: Randomly assign k = 2 cluster centers

For example, m1 = (1, 1) and m2 = (2, 1)

First IterationStep 3: For each record, find nearest cluster center

Euclidean distance from points to m1 and m2 shown

5. Example of k-Means Clustering at Work (cont’d)

Point a b c d e f g h

Distance from m1

Distance from m2

Cluster Membershi

pC1 C2 C2 C2 C1 C2 C1 C2

Cluster m1 contains {a, e, g} and m2 has {b, c, d, f, h}

Cluster membership assigned, now SSE calculated

360024.2161.383.224.22

22222222

i iCpimpdSSE

Recall clusters constructed where between-cluster variation (BCV) large, as compared to within-cluster variation (WCV)

for WCV surrogate SSE

BCVfor surrogate ),(

where,0278.036

Ratio BCV/WCV expected to increase for successive iterations

Step 4: For k clusters, find cluster centroid, update locationCluster 1 = [(1 + 1 + 1)/3, (3 + 2 + 1)/3] = (1, 2), Cluster 2 = [(3 + 4 + 5 + 4 + 2)/5, (3 + 3 + 3 + 2 + 1)/5] = (3.6, 2.4)

Figure shows movement of clusters m1 and m2

(triangles) after first iteration of algorithm

Step 5: Repeats Steps 3 – 4 until convergence or termination

0 1 2 3 4 5 60

c (4,3)

d (5,3)

)Distance from m1

0 3 11.41

Distance from m2

Cluster Membershi

p1 2 2 2 1 2 1 1

Cluster m1 contains {a, e, g , h} and m2 has {b, c, d, f}

Cluster 1 = [(1 + 1 + 1)/3, (3 + 2 + 1)/3] = (1, 2), Cluster 2 = [(3 + 4 + 5 + 4 + 2)/5, (3 + 3 + 3 + 2 + 1)/5] = (3.6, 2.4)

Second IterationRepeat procedure for Steps 3 – 4Again, for each record find nearest cluster center m1 = (1, 2) or m2 = (3.6, 2.4)Cluster m1 contains {a, e, g, h} and m2 has {b, c, d, f}SSE = 7.86, and BCV/WCV = 0.3346Note 0.3346 has increased compared to First Iteration value = 0.0278Between-cluster variation increasing with respect to Within-cluster variation

Cluster centroids updated to m1 = (1.25, 1.75) or m2 = (4, 2.75)

After Second Iteration, cluster centroids shown to move slightly

0 1 2 3 4 5 60

Third (Final) IterationRepeat procedure for Steps 3 – 4Now, for each record find nearest cluster center m1 = (1.25, 1.75) or m2 = (4, 2.75)SSE = 6.23, and BCV/WCV = 0.4703Again, BCV/WCV has increased compared to previous = 0.3346This time, no records shift cluster membershipCentroids remain unchanged, therefore algorithm terminates

ผู้ช่วยศาสตราจารย์ ดร.จิรัฎฐา ...

Documents

Transcript of ผู้ช่วยศาสตราจารย์ ดร.จิรัฎฐา ...

Kw 4 Nagorathna Project Ramanagar Indent 9275

วารสารพิกุล - research.kpru.ac.th fileผู้ช่วยศาสตราจารย์ ดร.การุณันทน์ รัตนแสนวงษ์.

บรรยายโดย ผู้ช่วยศาสตราจารย์ ......ตรรกศาสตร เป นว ชาแขนงหน งท เสนอข อค

วิวัฒนาการของมนุษย์ · Title: วิวัฒนาการของมนุษย์ Author: ผู้ช่วยศาสตราจารย์

ผู้ช่วยศาสตราจารย์ ดร.ประสิทธิ์ นางทิน ประธานหลักสูตร ...ecct-th.org/acf/seminar_62-03.pdf ·

โดย ผู้ช่วยศาสตราจารย์ ดร.ชนินทร์ วะสีนนท์ รักษาราชการแทนรองอธิการบดีฝ่ายวางแผนและพัฒนา

SISTEMA K - Alvenius Equipamentos Tubulares · ASTM A123 ABNT NBR 6323 ABNT NBR 9797 ABNT NBR 9797 ASTM A134/A139 ABNT NBR 9797 ASTM A134/A139 ABNT NBR 9797 ABNT NBR 9797 AWWA C200

ผู้ช่วยศาสตราจารย์ เภสัชกร อนุชัย ธีระเรืองไชยศรี คณะเภสัช ... · Learning

robotica eduaperu. 9797

แนะนำการเรียนการสอน ระดับบัณฑิตศึกษาในภาพรวม โดย ผู้ช่วยศาสตราจารย์

ผู้ช่วยศาสตราจารย์ ดร. สงครามชัย ลีทองดี คณบดีคณะสาธารณสุขศาสตร์

Clean Water Act of 2004 (RA 9275)

‘Value Innovation’ Strategic Product Planning · ผู้ช่วยศาสตราจารย์ ดร.ไปรมา อิศรเสนา ณ อยุธยา

0131 524 9797

หลักการใช้ยาบาบัดในโรคหัวใจล้มเหลวเรื้อรังPrinciples of Pharmacotherapy in Chronic Heart Failureผู้ช่วยศาสตราจารย์

# 6 ($8 9797

ผู้ช่วยศาสตราจารย์ ดร.อัทธ์ พิศาลวานิช คณบดีคณะเศรษฐศาสตร์ และ

9797 1 copia

ผู้ช่วยศาสตราจารย์ ดร วีระชัย สาระคร หัวหน้าทีมพัฒนา ... fileKKU Smart Learning Academy

9797 Basic Car