HW 4 Answers

1. Consider the xy coordinates of 7 points shown in Table 1.

(a) Construct the distance matrix using Euclidean distance and perform single-link and complete-link hierarchical clustering. Show your results by drawing a dendrogram. The dendrogram should clearly show the order in which the points are merged.

(b) Following (a), compute the cophenetic correlation coefficient for the derived dendrograms.

(a) The distance matrix (Euclidean distance):

     p1    p2    p3    p4    p5    p6    p7
p1  0.00  0.23  0.22  0.37  0.34  0.24  0.19
p2        0.00  0.14  0.19  0.14  0.24  0.06
p3              0.00  0.16  0.28  0.10  0.17
p4                    0.00  0.28  0.22  0.25
p5                          0.00  0.39  0.15
p6                                0.00  0.26
p7                                      0.00
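Table 1 itself is not reproduced here, so the coordinates below are placeholders; this is only a minimal sketch of how such a Euclidean distance matrix can be built with SciPy.

```python
# Sketch: building a Euclidean distance matrix for 7 xy points with SciPy.
# NOTE: Table 1 is not reproduced above, so these coordinates are placeholders,
# not the actual homework data.
import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([
    [0.10, 0.20],  # p1 (placeholder)
    [0.15, 0.35],  # p2 (placeholder)
    [0.30, 0.25],  # p3 (placeholder)
    [0.45, 0.10],  # p4 (placeholder)
    [0.05, 0.45],  # p5 (placeholder)
    [0.40, 0.30],  # p6 (placeholder)
    [0.18, 0.38],  # p7 (placeholder)
])

condensed = pdist(points, metric="euclidean")  # 21 pairwise distances
dist_matrix = squareform(condensed)            # full symmetric 7x7 matrix
print(np.round(dist_matrix, 2))
```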

Single-link hierarchical clustering, step by step:

Step 1: start from the distance matrix above. The smallest entry is d(p2, p7) = 0.06.

Step 2: merge p2 and p7 (distance 0.06).

        p1    p2,p7  p3    p4    p5    p6
p1     0.00  0.19   0.22  0.37  0.34  0.24
p2,p7        0.00   0.14  0.19  0.14  0.24
p3                  0.00  0.16  0.28  0.10
p4                        0.00  0.28  0.22
p5                              0.00  0.39
p6                                    0.00

Step 3: merge p3 and p6 (distance 0.10).

        p1    p2,p7  p3,p6  p4    p5
p1     0.00  0.19   0.22   0.37  0.34
p2,p7        0.00   0.14   0.19  0.14
p3,p6               0.00   0.16  0.28
p4                         0.00  0.28
p5                               0.00

There is now a tie: d({p2, p7}, p5) = d({p2, p7}, {p3, p6}) = 0.14, so the next merge can be done in two ways.

Step 4 (Case 1: merge p5 into {p2, p7} first, distance 0.14):

           p1    p2,p5,p7  p3,p6  p4
p1        0.00  0.19      0.22   0.37
p2,p5,p7        0.00      0.14   0.19
p3,p6                     0.00   0.16
p4                               0.00

Step 5: merge {p2, p5, p7} and {p3, p6} (distance 0.14).

                 p1    p2,p3,p5,p6,p7  p4
p1              0.00  0.19            0.37
p2,p3,p5,p6,p7        0.00            0.16
p4                                    0.00

Step 6: merge p4 (distance 0.16).

                    p1    p2,p3,p4,p5,p6,p7
p1                 0.00  0.19
p2,p3,p4,p5,p6,p7        0.00

The final merge adds p1 at distance 0.19.

Two possible dendrograms for single-link hierarchical clustering:

Case 1: merge p5 with {p2, p7} first (leaf order p2, p7, p5, p3, p6, p4, p1).

Case 2: merge {p3, p6} with {p2, p7} first.

Both cases use the same merge heights (0.06, 0.10, 0.14, 0.14, 0.16, 0.19), so they give the same cophenetic distances; only the order of the two 0.14 merges differs.
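The merge order above can be checked with SciPy's agglomerative clustering, feeding it the 21 pairwise distances from the distance matrix (condensed, row-major order). A minimal sketch; note that SciPy breaks the 0.14 tie one way, so it reproduces only one of the two cases, and method="complete" gives the complete-link dendrogram used further below.

```python
# Sketch: reproducing the merge order with SciPy, using the pairwise
# distances from the distance matrix above (condensed, row-major order).
import numpy as np
from scipy.cluster.hierarchy import linkage

condensed = np.array([
    0.23, 0.22, 0.37, 0.34, 0.24, 0.19,  # p1 to p2..p7
    0.14, 0.19, 0.14, 0.24, 0.06,        # p2 to p3..p7
    0.16, 0.28, 0.10, 0.17,              # p3 to p4..p7
    0.28, 0.22, 0.25,                    # p4 to p5..p7
    0.39, 0.15,                          # p5 to p6, p7
    0.26,                                # p6 to p7
])

Z_single = linkage(condensed, method="single")      # single link (MIN)
Z_complete = linkage(condensed, method="complete")  # complete link (MAX)

# Each row of Z is (cluster_a, cluster_b, merge_distance, new_cluster_size);
# leaf index i corresponds to point p(i+1).
print(np.round(Z_single, 2))
print(np.round(Z_complete, 2))
```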

(b) The cophenetic distance matrix for single-link clustering (each entry is the height at which the two points are first merged; both tie cases give the same matrix):

     p1    p2    p3    p4    p5    p6    p7
p1  0.00  0.19  0.19  0.19  0.19  0.19  0.19
p2        0.00  0.14  0.16  0.14  0.14  0.06
p3              0.00  0.16  0.14  0.10  0.14
p4                    0.00  0.16  0.16  0.16
p5                          0.00  0.14  0.14
p6                                0.00  0.14
p7                                      0.00

(a) The dendrogram for complete-link clustering, built from the same distance matrix: p2 and p7 merge at 0.06, p3 and p6 at 0.10, p5 joins {p2, p7} at 0.15, p4 joins {p3, p6} at 0.22, p1 joins {p2, p5, p7} at 0.34, and the final merge of {p1, p2, p5, p7} with {p3, p4, p6} is at 0.39.

(b) The cophenetic distance matrix for complete-link clustering:

     p1    p2    p3    p4    p5    p6    p7
p1  0.00  0.34  0.39  0.39  0.34  0.39  0.34
p2        0.00  0.39  0.39  0.15  0.39  0.06
p3              0.00  0.22  0.39  0.10  0.39
p4                    0.00  0.39  0.22  0.39
p5                          0.00  0.39  0.15
p6                                0.00  0.39
p7                                      0.00
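Part (b) asks for the cophenetic correlation coefficient itself, i.e., the correlation between the original pairwise distances and the cophenetic distances in the two matrices above. A minimal sketch with SciPy, using the same condensed distance vector as in the earlier sketch:

```python
# Sketch: cophenetic correlation coefficients for part (b).
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet

condensed = np.array([
    0.23, 0.22, 0.37, 0.34, 0.24, 0.19,
    0.14, 0.19, 0.14, 0.24, 0.06,
    0.16, 0.28, 0.10, 0.17,
    0.28, 0.22, 0.25,
    0.39, 0.15,
    0.26,
])

for method in ("single", "complete"):
    Z = linkage(condensed, method=method)
    # c = correlation between the original distances and the cophenetic
    # distances; coph = the cophenetic distances (the matrices shown above).
    c, coph = cophenet(Z, condensed)
    print(f"{method:8s} link: cophenetic correlation coefficient = {c:.3f}")
```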

2. Consider the following four faces shown in Figure 2. Again, darkness or number of dots represents density. Lines are used only to distinguish regions and do not represent points.

(a) For each figure, could you use single link to find the patterns represented by the nose, eyes, and mouth? Explain.

(b) For each figure, could you use K-means to find the patterns represented by the nose, eyes, and mouth? Explain.

Ans:

(a) Only for (b) and (d).

For (b), the points in the nose, eyes, and mouth are much closer together than the points between these areas.

For (d), there is only empty space between these regions.

(b) Only for (b) and (d).

For (b), K-means would find the nose, eyes, and mouth, but the lower density points would also be included.

For (d), K-means would find the nose, eyes, and mouth straightforwardly as long as the number of clusters was set to 4.

3. Compute the entropy and purity for the confusion matrix in Table 2.

• Purity
– pij = mij / mi: the fraction of objects in cluster i that belong to class j (mij is the number of objects of class j in cluster i, mi is the number of objects in cluster i).
– The purity of a cluster: purity_i = max_j pij
– The overall purity of a clustering: purity = Σ_{i=1..K} (mi / m) purity_i

From Table 2 the cluster sizes are m1 = 693, m2 = 1562, m3 = 949, and m = 3204.

Purity (cluster #1): 676/693 = 0.98
Purity (cluster #2): 827/1562 = 0.53
Purity (cluster #3): 465/949 = 0.49
Purity (total): (693/3204)(0.98) + (1562/3204)(0.53) + (949/3204)(0.49) = 0.61

• Entropy
– pij: the probability that a member of cluster i belongs to class j, pij = mij / mi
– mij: the number of objects of class j in cluster i; mi: the number of objects in cluster i
– The entropy of a cluster: e_i = -Σ_{j=1..L} pij log2 pij, where L is the number of classes (ground truth, given)
– The entropy of a clustering is the total entropy: e = Σ_{i=1..K} (mi / m) e_i, where m is the total number of data points and K is the number of clusters

Entropy (cluster #1): -(676/693)log2(676/693) - (4/693)log2(4/693) - (11/693)log2(11/693) - (1/693)log2(1/693) - (1/693)log2(1/693) = 0.20

Entropy (cluster #2): -(827/1562)log2(827/1562) - (333/1562)log2(333/1562) - (253/1562)log2(253/1562) - (89/1562)log2(89/1562) - (33/1562)log2(33/1562) - (27/1562)log2(27/1562) = 1.84

Entropy (cluster #3): -(465/949)log2(465/949) - (326/949)log2(326/949) - (105/949)log2(105/949) - (29/949)log2(29/949) - (16/949)log2(16/949) - (8/949)log2(8/949) = 1.70

Entropy (total): (693/3204)(0.20) + (1562/3204)(1.84) + (949/3204)(1.70) = 1.44
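A minimal sketch that recomputes the purities and entropies above from the per-cluster class counts used in those calculations (the counts from Table 2; the column order within a row does not affect either measure):

```python
# Sketch: purity and entropy for Problem 3, using the class counts that
# appear in the calculations above.
import numpy as np

# Rows = clusters; each row lists the class counts m_ij. The trailing 0 pads
# cluster 1 to six classes; within-row order does not change purity/entropy.
counts = np.array([
    [676,   4,  11,   1,   1,   0],   # cluster 1, m_1 = 693
    [827, 333, 253,  89,  33,  27],   # cluster 2, m_2 = 1562
    [465, 326, 105,  29,  16,   8],   # cluster 3, m_3 = 949
], dtype=float)

m_i = counts.sum(axis=1)          # cluster sizes
m = m_i.sum()                     # total number of points (3204)
p = counts / m_i[:, None]         # p_ij = m_ij / m_i

purity_i = p.max(axis=1)                              # max_j p_ij
with np.errstate(divide="ignore", invalid="ignore"):
    terms = np.where(p > 0, p * np.log2(p), 0.0)      # treat 0*log(0) as 0
entropy_i = -terms.sum(axis=1)                        # e_i

print("purity per cluster :", np.round(purity_i, 2))   # ~0.98, 0.53, 0.49
print("entropy per cluster:", np.round(entropy_i, 2))  # ~0.20, 1.84, 1.70
print("overall purity     :", round(float(m_i @ purity_i / m), 2))   # ~0.61
print("overall entropy    :", round(float(m_i @ entropy_i / m), 2))  # ~1.44
```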

4. Using the distance matrix in Table 3, compute the silhouette coefficient for each point, each of the two clusters, and the overall clustering. (Cluster 1 contains {P1, P2} and Cluster 2 contains {P3, P4}.)

Cluster 1: {P1, P2}; Cluster 2: {P3, P4}

Internal measures: silhouette coefficient
• The silhouette coefficient combines ideas of both cohesion and separation, for individual points as well as for clusters and clusterings.
• For an individual point i:
– Calculate a = the average distance of i to the points in its own cluster (the within-cluster average).
– Calculate b = the minimum, over the other clusters, of the average distance of i to the points in that cluster (the smallest between-cluster average).
– The silhouette coefficient for the point is then s = 1 - a/b if a < b (or s = b/a - 1 if a ≥ b, which is not the usual case).
– s typically lies between 0 and 1; the closer to 1, the better.
• The average silhouette width for a cluster or a clustering is the average of s over its points.
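Table 3 is not reproduced above, so the 4x4 distance matrix in the sketch below uses placeholder values; it only shows how the point, cluster, and overall silhouette values are computed for Cluster 1 = {P1, P2} and Cluster 2 = {P3, P4}.

```python
# Sketch: silhouette coefficients for two 2-point clusters.
# NOTE: Table 3 is not reproduced above, so this distance matrix is a
# placeholder, not the actual homework data.
import numpy as np

D = np.array([            # hypothetical pairwise distances
    [0.00, 0.10, 0.65, 0.55],
    [0.10, 0.00, 0.70, 0.60],
    [0.65, 0.70, 0.00, 0.30],
    [0.55, 0.60, 0.30, 0.00],
])
labels = np.array([0, 0, 1, 1])   # Cluster 1 = {P1, P2}, Cluster 2 = {P3, P4}

s = np.zeros(len(labels))
for i in range(len(labels)):
    same = (labels == labels[i]) & (np.arange(len(labels)) != i)
    other = labels != labels[i]
    a = D[i, same].mean()    # average distance within the point's own cluster
    b = D[i, other].mean()   # average distance to the (only) other cluster
    s[i] = 1 - a / b if a < b else b / a - 1

print("per point  :", np.round(s, 3))
print("per cluster:", np.round([s[labels == c].mean() for c in (0, 1)], 3))
print("overall    :", round(float(s.mean()), 3))
```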

5. Given the set of cluster labels and similarity matrix shown in Tables 4 and 5, respectively, compute the correlation between the similarity matrix and the ideal similarity matrix, i.e., the matrix whose ij-th entry is 1 if two objects belong to the same cluster, and 0 otherwise.

Ideal similarity matrix:

1 1 0 0

1 1 0 0

0 0 1 1

0 0 1 1

Taking the entries above the diagonal of each matrix as vectors:
y = <1, 0, 0, 0, 0, 1> (ideal similarity matrix)
x = <0.8, 0.65, 0.55, 0.7, 0.6, 0.9> (similarity matrix, Table 5)

mean of x = 0.7, σ_x = 0.13; mean of y = 0.33, σ_y = 0.52 (sample standard deviations, divisor n - 1 = 5)

Σ_i (x_i - 0.7)(y_i - 0.33) = (0.8 - 0.7)(1 - 0.33) + (0.65 - 0.7)(0 - 0.33) + (0.55 - 0.7)(0 - 0.33) + (0.7 - 0.7)(0 - 0.33) + (0.6 - 0.7)(0 - 0.33) + (0.9 - 0.7)(1 - 0.33) = 0.30

Corr(x, y) = Σ_i (x_i - x̄)(y_i - ȳ) / ((n - 1) σ_x σ_y) = 0.30 / (5 × 0.13 × 0.52) ≈ 0.89

Note: remember to take the square root when computing σ.
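A quick check of this correlation with NumPy, using the vectors x and y listed above:

```python
# Sketch: correlation between the similarity matrix and the ideal similarity
# matrix, using the upper-triangle entries x and y listed above.
import numpy as np

x = np.array([0.8, 0.65, 0.55, 0.7, 0.6, 0.9])   # given similarities
y = np.array([1, 0, 0, 0, 0, 1], dtype=float)    # ideal similarities

corr = np.corrcoef(x, y)[0, 1]   # Pearson correlation
print(round(corr, 2))
```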

6. Compute the hierarchical F-measure for the eight objects {p1, p2, p3, p4, p5,p6, p7, p8} and hierarchical clustering shown in Figure 3. Class A contains points p1, p2, and p3, while p4, p5, p6, p7, and p8 belong to class B.

• F-measure (for class i and cluster j)
– Precision: P(i, j) = mij / mj, where mj is the number of objects in cluster j
– Recall: R(i, j) = mij / mi, where mi is the number of objects in class i
– F(i, j) = 2 P(i, j) R(i, j) / (P(i, j) + R(i, j))

• Hierarchical F-measure
– F = Σ_i (mi / m) max_j F(i, j)
– The maximum is taken over all clusters j at all levels.
– mi is the number of objects in class i; m is the total number of objects.

Class A: {p1, p2, p3}; Class B: {p4, p5, p6, p7, p8}

Class B: the best-matching cluster is cluster 1, the root containing all eight points: R(B, 1) = 5/5 = 1, P(B, 1) = 5/8 = 0.625, F(B, 1) = 2(1)(0.625)/(1 + 0.625) = 0.77.

Class A: max_j F(A, j) = 0.80, attained by a cluster containing exactly two of the three class-A points and nothing else (P = 1, R = 2/3).

Overall clustering: F = Σ_i (mi / m) max_j F(i, j) = (3/8) F(A) + (5/8) F(B) = (3/8)(0.80) + (5/8)(0.77) = 0.78.
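Figure 3 is not reproduced above, so the non-root clusters in the sketch below are illustrative placeholders (they happen to reproduce the 0.78 above, but the real clusters must be read off the figure); the sketch only shows how the hierarchical F-measure combines the per-pair precision, recall, and F values.

```python
# Sketch: hierarchical F-measure for Problem 6. Only the root cluster and the
# class labels come from the problem statement; the other clusters are
# placeholders standing in for the levels of Figure 3.
points = [f"p{i}" for i in range(1, 9)]
classes = {"A": {"p1", "p2", "p3"}, "B": {"p4", "p5", "p6", "p7", "p8"}}

clusters = [
    set(points),                        # root: all eight points
    {"p1", "p2"}, {"p3", "p4", "p5"},   # placeholder intermediate clusters
    {"p6", "p7", "p8"},
]

def f_measure(cls, clu):
    """F-measure of one (class, cluster) pair: harmonic mean of P and R."""
    m_ij = len(cls & clu)
    if m_ij == 0:
        return 0.0
    precision = m_ij / len(clu)   # P(i, j) = m_ij / m_j
    recall = m_ij / len(cls)      # R(i, j) = m_ij / m_i
    return 2 * precision * recall / (precision + recall)

m = len(points)
F = sum(len(cls) / m * max(f_measure(cls, clu) for clu in clusters)
        for cls in classes.values())
print(round(F, 2))   # 0.78 with these placeholder clusters
```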

7. Figure 4 shows a clustering of a two-dimensional point data set with two clusters: The leftmost cluster, whose points are marked by asterisks, is somewhat diffuse, while the rightmost cluster, whose points are marked by circles, is compact. To the right of the compact cluster, there is a single point (marked by an arrow) that belongs to the diffuse cluster, whose center is farther away than that of the compact cluster. Explain why this is possible with EM clustering, but not K-means clustering.

Ans:

In EM clustering, we compute the probability that a point belongs to a cluster. In turn, this probability depends on both the distance from the cluster center and the spread (variance) of the cluster. Hence, a point that is closer to the centroid of one cluster than another can still have a higher probability with respect to the more distant cluster if that cluster has a higher spread than the closer cluster. K-means only takes into account the distance to the closest cluster when assigning points to clusters. This is equivalent to an EM approach where all clusters are assumed to have the same variance.
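A small numeric illustration of this point (the means, variances, and the test point below are made up, not taken from Figure 4): with equal mixture weights, a point that is much closer to the compact cluster's mean can still have a higher density, and hence a higher membership probability, under the diffuse cluster because of its larger variance.

```python
# Sketch: a point closer to a compact cluster's mean can be more likely under
# a more distant but more diffuse cluster. Values are illustrative only, and
# equal mixture weights are assumed so the densities compare directly.
from scipy.stats import norm

diffuse = norm(loc=0.0, scale=4.0)    # diffuse cluster: distant mean, large spread
compact = norm(loc=10.0, scale=0.5)   # compact cluster: nearby mean, small spread
x = 12.0                              # the arrowed point: 2 from compact, 12 from diffuse

print("K-means (distance only) would assign the point to: compact")
print(f"density under diffuse cluster: {diffuse.pdf(x):.4f}")   # larger
print(f"density under compact cluster: {compact.pdf(x):.4f}")   # smaller
```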