Clustering… in General In vector space, clusters are vectors found within of a cluster vector,...

Date posted: 19-Dec-2015


Page 1:

Clustering… in General

• In vector space, clusters are vectors found within a distance ε of a cluster vector, with different techniques for determining the cluster vector and ε.
• Clustering is unsupervised pattern classification.
• Unsupervised means there is no correct answer or feedback.
• Patterns typically are samples of feature vectors or matrices.
• Classification means collecting the samples into groups of similar members.

Page 2:

Clustering Decisions

• Pattern representation: feature selection (e.g., stop word removal, stemming), number of categories
• Pattern proximity: distance measure on pairs of patterns
• Grouping: characteristics of clusters (e.g., fuzzy, hierarchical)

Clustering algorithms embody different assumptions about these decisions and the form of clusters.

Page 3:

Formal Definitions

• Feature vector x is a single datum of d measurements.
• Hard clustering techniques assign a single class label to each x; cluster memberships are mutually exclusive.
• Fuzzy clustering techniques assign a fractional degree of membership to each label for each x.
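
The difference between the two kinds of output can be shown with a small data-structure sketch. The sample values are invented for illustration, and the convention that each sample's membership degrees sum to 1 is an assumption borrowed from fuzzy c-means, not something the slide states.

```python
# Hard clustering: exactly one label per sample x.
hard_labels = [0, 0, 1, 2]

# Fuzzy clustering: one membership degree per (sample, label) pair.
# Assumption: the degrees for each sample sum to 1 (as in fuzzy c-means).
fuzzy_memberships = [
    [0.9, 0.1, 0.0],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
]

def harden(memberships):
    """Turn fuzzy memberships into hard labels by taking the largest degree."""
    return [max(range(len(row)), key=row.__getitem__) for row in memberships]

print(harden(fuzzy_memberships))
```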

Page 4:

Proximity Measures

• Generally, use Euclidean distance or mean squared distance.
• In IR, use the similarity measure from retrieval (e.g., cosine measure for TFIDF).
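
As a minimal sketch of the two measures named above, the following computes Euclidean distance and cosine similarity on plain Python lists. The function names and the three-term example vectors are hypothetical; real TFIDF vectors would be much longer and sparse.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

# Hypothetical TFIDF-style vectors over a three-term vocabulary.
doc_a = [0.5, 0.0, 1.2]
doc_b = [1.0, 0.0, 2.4]   # same direction as doc_a, twice the length

print(euclidean(doc_a, doc_b))   # nonzero: the vectors differ in length
print(cosine(doc_a, doc_b))      # close to 1: the vectors point the same way
```

The example illustrates why cosine is popular for documents: it ignores vector length, so a long and a short document with the same term proportions count as similar.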

Page 5:

[Jain, Murty & Flynn] Taxonomy of Clustering

Clustering
    Hierarchical
        Single Link
        Complete Link
        HAC
    Partitional
        Square Error
            k-means
        Graph Theoretic
        Mixture Resolving
            Expectation Maximization
        Mode Seeking

Page 6:

Clustering Issues

• Agglomerative: begin with each sample in its own cluster and merge
• Divisive: begin with a single cluster and split
• Hard: mutually exclusive cluster membership
• Fuzzy: degrees of membership in clusters
• Deterministic vs. stochastic
• Incremental: samples may be added to clusters
• Batch: clusters created over entire sample space

Page 7:

Hierarchical Algorithms

• Produce a hierarchy of classes (taxonomy), from singleton clusters to just one cluster.
• Select a level for extracting the cluster set.
• Representation is a dendrogram.

[Figure: dendrogram over documents D1 to D4, which start as singleton clusters C1 to C4. C1 and C3 merge into C1,3 at level 0.99; C2 joins to form C1,3,2 at 0.29; C4 joins to form C1,3,2,4 at 0.00.]

Page 8:

Complete-Link Revisited

• Used to create a statistical thesaurus
• Agglomerative, hard, deterministic, batch

1. Start with one cluster per sample.
2. Find the two clusters with the lowest distance.
3. Merge the two clusters and add the merge to the hierarchy.
4. Repeat from step 2 until a termination criterion is met or until all clusters have merged.

Page 9:

Single-Link

Like complete-link, except:

• Use the minimum of the distances between all pairs of samples in the two clusters (complete-link uses the maximum).
• Single-link has a chaining effect that produces elongated clusters, but it can construct more complex shapes.
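
The four-step procedure and the minimum/maximum distinction can be sketched in one function. This is an illustrative brute-force version, not the slides' implementation: the function name, the `num_clusters` stopping rule, and the one-dimensional example points are assumptions, and `math.dist` requires Python 3.8 or later.

```python
import math

def agglomerate(points, linkage="complete", num_clusters=1):
    """Merge clusters bottom-up until num_clusters remain.

    linkage="single" uses the minimum pairwise distance between two clusters;
    linkage="complete" uses the maximum. Returns (clusters, merges), where
    merges records each merge as (cluster_a, cluster_b, distance).
    """
    pick = min if linkage == "single" else max
    clusters = [[p] for p in points]          # step 1: one cluster per sample
    merges = []
    while len(clusters) > num_clusters:
        best = None
        # Step 2: find the two clusters with the lowest linkage distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = pick(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Step 3: merge the pair and record the merge in the hierarchy.
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges

# Hypothetical samples on a line, chosen so the two linkages disagree.
pts = [(0, 0), (1, 0), (3, 0), (5.5, 0)]
print(agglomerate(pts, "single", 2)[0])
print(agglomerate(pts, "complete", 2)[0])
```

On these points single-link chains (3, 0) onto the left-hand pair, while complete-link groups it with (5.5, 0), which is the chaining behaviour described above in miniature.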

Page 10:

Example: Plot

[Figure: plot of the example samples; both axes run from 0 to 50.]

Page 11:

Example: Proximity Matrix

Pairwise Euclidean distances between samples (upper triangle only):

           (21,15)  (26,25)  (29,22)  (31,15)  (21,27)  (23,32)  (29,26)  (33,21)
(21,15)          0     11.2     10.6     10.0     12.0     17.1     13.6     13.4
(26,25)                   0      4.2     11.2      5.4      7.6      3.2      8.1
(29,22)                            0      7.3      9.4     11.7      4.0      4.1
(31,15)                                     0     15.6     18.8     11.2      6.3
(21,27)                                              0      5.4      8.1     13.4
(23,32)                                                       0      8.5     14.9
(29,26)                                                                0      6.4
(33,21)                                                                         0
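
The entries of this matrix can be recomputed directly from the sample coordinates. The sketch below assumes the entries are plain Euclidean distances rounded to one decimal; the function name is hypothetical and `math.dist` requires Python 3.8 or later.

```python
import math

# The eight samples listed on the slide.
samples = [(21, 15), (26, 25), (29, 22), (31, 15),
           (21, 27), (23, 32), (29, 26), (33, 21)]

def proximity_matrix(points):
    """Return the full symmetric matrix of pairwise Euclidean distances."""
    return [[math.dist(p, q) for q in points] for p in points]

matrix = proximity_matrix(samples)
for row_index, row in enumerate(matrix):
    # Print only the upper triangle, rounded to one decimal place.
    cells = ["    " if col < row_index else f"{value:4.1f}"
             for col, value in enumerate(row)]
    print(samples[row_index], " ".join(cells))
```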

Page 12:

Complete-Link Solution

[Figure: complete-link hierarchy over the 16 samples (1,28), (4,9), (9,16), (13,18), (21,15), (29,22), (31,15), (33,21), (35,35), (42,45), (45,42), (46,30), (23,32), (21,27), (29,26), (26,25), with merged clusters labelled C1 to C15.]

Page 13:

Single-Link Solution

[Figure: single-link hierarchy over the 16 samples (1,28), (4,9), (9,16), (13,18), (21,15), (29,22), (31,15), (33,21), (35,35), (42,45), (45,42), (46,30), (23,32), (21,27), (29,26), (26,25), with merged clusters labelled C1 to C15.]

Page 14:

Hierarchical Agglomerative Clustering (HAC)

• Agglomerative, hard, deterministic, batch

1. Start with one cluster per sample and compute a proximity matrix between pairs of clusters.
2. Merge the most similar pair of clusters and update the proximity matrix.
3. Repeat step 2 until all clusters are merged.

• Variants differ in how the proximity matrix is updated.
• Ability to combine the benefits of both single-link and complete-link algorithms.
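
A sketch of HAC driven purely by the proximity matrix makes the "how the matrix is updated" point concrete. Here the update rule is a parameter: passing `min` reproduces single link and `max` reproduces complete link. The function name, the dictionary representation of the matrix, and the example distances are illustrative assumptions.

```python
def hac(distances, update=max):
    """Agglomerate using only a proximity matrix of distances.

    distances: symmetric list of lists of pairwise distances between samples.
    update: combines d(k, a) and d(k, b) into d(k, a+b) after a and b merge;
            min gives single link, max gives complete link.
    Returns the merge history as (members_a, members_b, distance) tuples.
    """
    n = len(distances)
    members = {i: (i,) for i in range(n)}     # active cluster id -> sample ids
    prox = {(i, j): distances[i][j] for i in range(n) for j in range(i + 1, n)}
    history = []
    next_id = n
    while len(members) > 1:
        # Most similar pair = smallest distance in the current matrix.
        (a, b), d = min(prox.items(), key=lambda item: item[1])
        history.append((members[a], members[b], d))
        # Update the matrix: drop rows a and b, add a row for the merged cluster.
        new_row = {}
        for k in members:
            if k in (a, b):
                continue
            d_ka = prox[(min(k, a), max(k, a))]
            d_kb = prox[(min(k, b), max(k, b))]
            new_row[(k, next_id)] = update(d_ka, d_kb)
        prox = {pair: v for pair, v in prox.items()
                if a not in pair and b not in pair}
        prox.update(new_row)
        members[next_id] = members.pop(a) + members.pop(b)
        next_id += 1
    return history

# Hypothetical distances between four samples at positions 0, 1, 3 and 5.5.
D = [[0, 1, 3, 5.5],
     [1, 0, 2, 4.5],
     [3, 2, 0, 2.5],
     [5.5, 4.5, 2.5, 0]]
print(hac(D, min))
print(hac(D, max))
```

Other update rules can be passed the same way, which is one way to read the remark about combining the benefits of single and complete link.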

Page 15:

HAC for IR

Intra-cluster similarity:

$$\mathrm{Sim}(X) = \sum_{d \in X} \cos(d, c), \qquad c = \frac{1}{|S|} \sum_{d \in S} d$$

where S is the set of TFIDF vectors for the documents, c is the centroid of cluster X, and d is a document.

• Proximity is the similarity of all documents to the cluster centroid.
• Select the pair of clusters that produces the smallest decrease in similarity; e.g., if merge(X, Y) => Z, then choose the pair that gives max[Sim(Z) - (Sim(X) + Sim(Y))].
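
The intra-cluster criterion can be sketched as follows, assuming each cluster is a list of equal-length document vectors and that the centroid is taken over the cluster's own members. All function names and the tiny two-term vectors are hypothetical.

```python
import math

def centroid(docs):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(docs) for col in zip(*docs)]

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms if norms else 0.0

def intra_sim(cluster):
    """Sim(X): summed cosine similarity of each document to the centroid."""
    c = centroid(cluster)
    return sum(cosine(d, c) for d in cluster)

def merge_gain(x, y):
    """Sim(Z) - (Sim(X) + Sim(Y)) for Z = X merged with Y; larger is better."""
    return intra_sim(x + y) - (intra_sim(x) + intra_sim(y))

def best_merge(clusters):
    """Index pair (i, j) whose merge loses the least intra-cluster similarity."""
    pairs = [(i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))]
    return max(pairs, key=lambda p: merge_gain(clusters[p[0]], clusters[p[1]]))

# Three hypothetical singleton clusters; the first two point almost the same way.
clusters = [[[1.0, 0.0]], [[1.0, 0.1]], [[0.0, 1.0]]]
print(best_merge(clusters))
```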

Page 16:

HAC for IR: Alternatives

Centroid similarity: cosine similarity between the centroids of the two clusters.

$$\mathrm{Sim}(X, Y) = \cos(c_X, c_Y), \qquad c = \frac{1}{|S|} \sum_{d \in S} d$$

UPGMA:

$$\mathrm{Sim}(X, Y) = \frac{\sum_{d_1 \in X,\, d_2 \in Y} \cos(d_1, d_2)}{|X| \cdot |Y|}$$
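
The two alternative proximities can be compared on a small example. This is a sketch under the same assumptions as the formulas (clusters as lists of equal-length vectors); the function names and example vectors are hypothetical.

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms if norms else 0.0

def centroid(docs):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(docs) for col in zip(*docs)]

def centroid_sim(x, y):
    """Cosine similarity between the centroids of clusters x and y."""
    return cosine(centroid(x), centroid(y))

def upgma_sim(x, y):
    """Average cosine similarity over all cross-cluster document pairs."""
    total = sum(cosine(d1, d2) for d1 in x for d2 in y)
    return total / (len(x) * len(y))

# Hypothetical clusters: X holds two orthogonal documents, Y one in between.
X = [[1.0, 0.0], [0.0, 1.0]]
Y = [[1.0, 1.0]]
print(centroid_sim(X, Y))   # the centroid of X lines up with Y
print(upgma_sim(X, Y))      # individual documents in X are less similar to Y
```

The gap between the two numbers shows that centroid similarity can hide spread inside a cluster, while UPGMA averages over the actual document pairs.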

Page 17:

Partitional Algorithms

• Result in a set of unrelated clusters.
• Issues:
    • How many clusters is enough?
    • How to search the space of possible partitions?
    • What is an appropriate clustering criterion?

Page 18:

K Means

• Number of clusters is set by the user to be k.
• Non-deterministic
• Clustering criterion is squared error:

$$e^2(S, L) = \sum_{j=1}^{K} \sum_{i=1}^{n_j} \left\| x_i^{(j)} - c_j \right\|^2$$

where S is the document set, L is a clustering, K is the number of clusters, $n_j$ is the number of documents in the jth cluster, $x_i^{(j)}$ is the ith document in the jth cluster, and $c_j$ is the centroid of the jth cluster.
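
As a small worked instance of the squared-error criterion, the sketch below sums squared distances from every member to its cluster centroid. It assumes a clustering is given as a list of clusters, each a non-empty list of coordinate tuples; the function name and example data are hypothetical.

```python
def squared_error(clusters):
    """Sum over clusters of squared distances from members to the centroid."""
    total = 0.0
    for members in clusters:
        # Centroid c_j: component-wise mean of the cluster's members.
        c = [sum(col) / len(members) for col in zip(*members)]
        for x in members:
            total += sum((a - b) ** 2 for a, b in zip(x, c))
    return total

# Hypothetical clustering of three samples into K = 2 clusters.
clustering = [[(0, 0), (2, 0)], [(5, 5)]]
print(squared_error(clustering))
```

Here the first cluster has centroid (1, 0), so each of its two members contributes 1, and the singleton contributes 0.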

Page 19:

k-Means Clustering Algorithm

1. Randomly select k samples as cluster centroids.
2. Assign each pattern to the closest cluster centroid.
3. Recompute centroids.
4. If the convergence criterion (e.g., minimal decrease in error or no change in cluster composition) is not met, return to step 2.
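
The four steps can be sketched as below. This is a minimal illustration, not a tuned implementation: the `initial`, `seed`, and `max_iter` parameters are assumptions added so that runs are repeatable, convergence is tested by "no change in cluster composition", and an emptied cluster simply keeps its old centroid. `math.dist` requires Python 3.8 or later.

```python
import math
import random

def kmeans(points, k, initial=None, seed=0, max_iter=100):
    """Return (centroids, assignment) for a list of coordinate tuples.

    initial: optional list of k starting centroids; if omitted, k samples
             are drawn at random (step 1).
    """
    rng = random.Random(seed)
    centroids = list(initial) if initial is not None else rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each pattern to the closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 4: stop when cluster composition no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return centroids, assignment

# Hypothetical samples forming two well-separated groups.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(pts, 2, initial=[(0, 0), (0, 1)])
print(labels)
print(centroids)
```

Passing different `initial` centroids or different `seed` values is a simple way to observe how the result depends on the starting point.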

Page 20:

Example: K-Means Solutions

[Figure: k-means solutions plotted for the example samples; both axes run from 0 to 50.]

Page 21:

k-Means Sensitivity to Initialization

[Figure: seven samples labelled A to G, with two alternative clusterings drawn in red and yellow.]

With K = 3, the red clustering started with A, D, and F; the yellow clustering started with A, B, and C.

Page 22:

k-Means for IR

• Update centroids incrementally.
• Calculate the centroid as with hierarchical methods.
• Can refine into a divisive hierarchical method by starting with a single cluster and splitting using k-means until it forms k clusters with the highest summed similarities (bisecting k-means).
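
A rough sketch of bisecting k-means follows. Several choices here are assumptions rather than details from the slide: the largest cluster is the one split each round, each split is tried a few times with random seeds, and the best split is the one with the lowest summed squared error (swapped in for "highest summed similarities" so the example can use plain Euclidean geometry). All names are hypothetical.

```python
import math
import random

def mean(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

def sse(points):
    """Summed squared distance of points to their centroid."""
    c = mean(points)
    return sum(math.dist(p, c) ** 2 for p in points)

def bisect(cluster, rng, iterations=20):
    """Split one cluster in two with 2-means started from two random samples."""
    c0, c1 = rng.sample(cluster, 2)
    left, right = cluster[:1], cluster[1:]   # fallback if 2-means degenerates
    for _ in range(iterations):
        new_left = [p for p in cluster if math.dist(p, c0) <= math.dist(p, c1)]
        new_right = [p for p in cluster if math.dist(p, c0) > math.dist(p, c1)]
        if not new_left or not new_right:
            break
        left, right = new_left, new_right
        c0, c1 = mean(left), mean(right)
    return left, right

def bisecting_kmeans(points, k, trials=5, seed=0):
    """Divisive clustering into k clusters; assumes 1 <= k <= len(points)."""
    rng = random.Random(seed)
    clusters = [list(points)]                 # start with a single cluster
    while len(clusters) < k:
        target = max(clusters, key=len)       # assumption: split the largest
        clusters.remove(target)
        candidates = [bisect(target, rng) for _ in range(trials)]
        best = min(candidates, key=lambda pair: sse(pair[0]) + sse(pair[1]))
        clusters.extend(best)
    return clusters

# Hypothetical samples in three loose groups.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0), (21, 0)]
for cluster in bisecting_kmeans(pts, 3):
    print(cluster)
```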

Page 23:

Other Types of Clustering Algorithms

• Graph theoretic: construct the minimal spanning tree and delete the edges with the largest lengths.
• Expectation Maximization (EM): assume clusters are drawn from distributions; use maximum likelihood to estimate the parameters of the distributions.
• Nearest neighbors: iteratively assign each sample to the cluster of its nearest labelled neighbor, so long as the distance is below a set threshold.
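
The graph-theoretic item in the list above can be sketched in a few lines: build a minimal spanning tree over the complete graph of samples, drop the longest edges, and read off the connected components. Deleting exactly k - 1 edges to obtain k clusters is an assumption made here for concreteness; the function names and example points are hypothetical.

```python
import math

def mst_edges(points):
    """Prim's algorithm on the complete graph; returns (length, i, j) edges."""
    n = len(points)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Cheapest edge from a vertex inside the tree to one outside it.
        length, i, j = min(
            (math.dist(points[i], points[j]), i, j)
            for i in in_tree for j in range(n) if j not in in_tree
        )
        edges.append((length, i, j))
        in_tree.add(j)
    return edges

def mst_clusters(points, k):
    """Delete the k - 1 longest MST edges and return the connected components."""
    kept = sorted(mst_edges(points))[: len(points) - k]
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for _, i, j in kept:
        parent[find(i)] = find(j)             # union the two endpoints
    groups = {}
    for index, point in enumerate(points):
        groups.setdefault(find(index), []).append(point)
    return list(groups.values())

# Hypothetical samples on a line with one large gap.
pts = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
print(mst_clusters(pts, 2))
```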

Page 24:

Comparison of Clustering Algorithms [Steinbach et al.]

• Implement 3 versions of HAC and 2 versions of k-means.
• Compare performance on documents hand-labelled as relevant to one of a set of classes.
• Well-known data sets (TREC).
• Found that UPGMA is the best of the hierarchical methods, but bisecting k-means seems to do better if considered over many runs.

M. Steinbach, G. Karypis, V. Kumar. A Comparison of Document Clustering Techniques. KDD Workshop on Text Mining, 2000.

Page 25:

Evaluation Metrics 1

• Evaluation: how to measure cluster quality?
• Entropy:

$$E_j = -\sum_i p_{ij} \log(p_{ij})$$

$$E_{CS} = \sum_{j=1}^{m} \frac{n_j \cdot E_j}{n}$$

where $p_{ij}$ is the probability that a member of cluster j belongs to class i, $n_j$ is the size of cluster j, m is the number of clusters, n is the number of docs, and CS is a clustering solution.
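
The entropy measure can be computed from the true class labels of each cluster's members. The sketch below assumes a clustering is given as a list of clusters, each a list of class labels, and uses the natural logarithm; the slide does not state a base, so that choice is an assumption, as are the function names.

```python
import math
from collections import Counter

def cluster_entropy(class_labels):
    """E_j for one cluster, given the true class label of each member."""
    n_j = len(class_labels)
    counts = Counter(class_labels).values()
    return -sum((c / n_j) * math.log(c / n_j) for c in counts)

def total_entropy(clusters):
    """E_CS: size-weighted average of the per-cluster entropies."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) * cluster_entropy(c) for c in clusters) / n

# Hypothetical solution: one mixed cluster and one pure cluster.
solution = [["a", "b"], ["a", "a"]]
print(total_entropy(solution))
```

Lower values are better: a solution whose clusters each contain a single class has entropy 0.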

Page 26:

Comparison Measure 2

• F measure: combines precision and recall.
• Treat each cluster as the result of a query and each class as the relevant set of docs.

$$\mathrm{Recall}(i, j) = n_{ij} / n_i$$

$$\mathrm{Precision}(i, j) = n_{ij} / n_j$$

$$F(i, j) = \frac{2 \cdot \mathrm{Recall}(i, j) \cdot \mathrm{Precision}(i, j)}{\mathrm{Precision}(i, j) + \mathrm{Recall}(i, j)}$$

$$F = \sum_i \frac{n_i}{n} \max_j \left[ F(i, j) \right]$$

where $n_{ij}$ is the number of members of class i in cluster j, $n_j$ is the number of members of cluster j, $n_i$ is the number of members of class i, and n is the number of docs.
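
A short sketch of the overall F measure, again assuming a clustering is represented as a list of clusters, each a list of true class labels. The function name and the example solutions are hypothetical.

```python
from collections import Counter

def f_measure(clusters):
    """Overall F: class-size-weighted best F(i, j) over clusters j."""
    n = sum(len(c) for c in clusters)
    class_sizes = Counter(label for c in clusters for label in c)
    total = 0.0
    for label, n_i in class_sizes.items():
        best = 0.0
        for c in clusters:
            n_ij = c.count(label)
            if n_ij == 0:
                continue                      # F(i, j) is 0 for this cluster
            recall = n_ij / n_i
            precision = n_ij / len(c)
            best = max(best, 2 * recall * precision / (precision + recall))
        total += n_i / n * best
    return total

# Hypothetical solutions over classes "a" and "b".
print(f_measure([["a", "a"], ["b", "b", "b"]]))   # every class has its own cluster
print(f_measure([["a", "b"]]))                    # everything lumped together
```

Higher values are better, in contrast to entropy.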