Lecture 17: Hierarchical Clustering


Transcript of Lecture 17: Hierarchical Clustering

Page 1: Lecture 17:  Hierarchical Clustering

Introduction to Information Retrieval

Lecture 17: Hierarchical Clustering


Page 2: Lecture 17:  Hierarchical Clustering


Hierarchical clustering

Our goal in hierarchical clustering is to create a hierarchy like the one we saw earlier in Reuters:

We want to create this hierarchy automatically. We can do this either top-down or bottom-up. The best known bottom-up method is hierarchical agglomerative clustering.


Page 3: Lecture 17:  Hierarchical Clustering


Overview

❶ Recap

❷ Introduction

❸ Single-link/ Complete-link

❹ Centroid/ GAAC

❺ Variants

❻ Labeling clusters


Page 4: Lecture 17:  Hierarchical Clustering


Outline

❶ Recap

❷ Introduction

❸ Single-link/ Complete-link

❹ Centroid/ GAAC

❺ Variants

❻ Labeling clusters


Page 5: Lecture 17:  Hierarchical Clustering


Top-down vs. Bottom-up

Divisive clustering:
Start with all docs in one big cluster.
Then recursively split clusters.
Eventually each node forms a cluster on its own. → Bisecting K-means

Not included in this lecture.


Page 6: Lecture 17:  Hierarchical Clustering


Hierarchical agglomerative clustering (HAC)

HAC creates a hierarchy in the form of a binary tree.
It assumes a similarity measure for determining the similarity of two clusters.
Start with each document in a separate cluster.
Then repeatedly merge the two clusters that are most similar until there is only one cluster.
The history of merging is a hierarchy in the form of a binary tree.
The standard way of depicting this history is a dendrogram.


Page 7: Lecture 17:  Hierarchical Clustering


A dendrogram

The vertical line of each merger tells us what the similarity of the merger was.
We can cut the dendrogram at a particular point (e.g., at 0.1 or 0.4) to get a flat clustering.
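To make the cut concrete, here is a minimal sketch using SciPy's hierarchical-clustering utilities; the random data matrix and the 0.4 similarity threshold are illustrative assumptions, not from the slides.

```python
# A minimal sketch of cutting a dendrogram to get a flat clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 5)            # 20 documents, 5-dimensional vectors

# SciPy works with distances; for cosine, distance = 1 - similarity, so a
# similarity cut at 0.4 corresponds to a distance cut at 0.6.
Z = linkage(X, method="single", metric="cosine")
labels = fcluster(Z, t=0.6, criterion="distance")
print(labels)                        # flat cluster label for each document
```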


Page 8: Lecture 17:  Hierarchical Clustering


Naive HAC algorithm
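The algorithm figure is not reproduced in this transcript. Below is a minimal Python sketch of the naive algorithm as described on the previous slide, assuming `docs` is an N × d numpy matrix of unit-length document vectors and using the single-link (max) merge rule; the function name is an assumption.

```python
import numpy as np

def naive_hac(docs):
    n = len(docs)
    sim = docs @ docs.T                      # N x N pairwise similarities
    clusters = [[i] for i in range(n)]       # start: one doc per cluster
    active = list(range(n))
    merges = []                              # merge history = the hierarchy
    while len(active) > 1:
        # scan all O(N^2) pairs of active clusters for the most similar pair
        s, i, j = max((sim[a, b], a, b)
                      for x, a in enumerate(active) for b in active[x + 1:])
        merges.append((clusters[i][:], clusters[j][:], float(s)))
        clusters[i] += clusters[j]           # merge cluster j into cluster i
        active.remove(j)
        for k in active:                     # single-link similarity update
            if k != i:
                sim[i, k] = sim[k, i] = max(sim[i, k], sim[j, k])
    return merges
```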


Page 9: Lecture 17:  Hierarchical Clustering


Computational complexity of the naive algorithm

First, we compute the similarity of all N × N pairs of documents.
Then, in each of N iterations:
We scan the O(N × N) similarities to find the maximum similarity.
We merge the two clusters with maximum similarity.
We compute the similarity of the new cluster with all other (surviving) clusters.
There are O(N) iterations, each performing an O(N × N) “scan” operation.
Overall complexity is O(N³). We’ll look at more efficient algorithms later.


Page 10: Lecture 17:  Hierarchical Clustering


Key question: How to define cluster similarity

Single-link: Maximum similarity
Maximum similarity of any two documents.
Complete-link: Minimum similarity
Minimum similarity of any two documents.
Centroid: Average “intersimilarity”
Average similarity of all document pairs (but excluding pairs of docs in the same cluster). This is equivalent to the similarity of the centroids.
Group-average: Average “intrasimilarity”
Average similarity of all document pairs, including pairs of docs in the same cluster.
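As a concrete reference (not from the slides), here is a sketch of the four definitions, assuming clusters A and B are lists of unit-length numpy vectors so that the dot product of two documents is their cosine similarity.

```python
import numpy as np
from itertools import product

def single_link(A, B):        # maximum intersimilarity
    return max(float(a @ b) for a, b in product(A, B))

def complete_link(A, B):      # minimum intersimilarity
    return min(float(a @ b) for a, b in product(A, B))

def centroid_sim(A, B):       # average intersimilarity = similarity of centroids
    return float(np.mean(A, axis=0) @ np.mean(B, axis=0))

def group_average(A, B):      # average intrasimilarity, self-pairs excluded
    M = list(A) + list(B)
    sims = [float(m @ n) for i, m in enumerate(M) for n in M[i + 1:]]
    return sum(sims) / len(sims)
```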


Page 11: Lecture 17:  Hierarchical Clustering


Cluster similarity: Example


Page 12: Lecture 17:  Hierarchical Clustering


Single-link: Maximum similarity


Page 13: Lecture 17:  Hierarchical Clustering


Complete-link: Minimum similarity


Page 14: Lecture 17:  Hierarchical Clustering


Centroid: Average intersimilarity

intersimilarity = similarity of two documents in different clusters


Page 15: Lecture 17:  Hierarchical Clustering


Group average: Average intrasimilarity

intrasimilarity = similarity of any pair, including cases where the two documents are in the same cluster


Page 16: Lecture 17:  Hierarchical Clustering


Cluster similarity: Larger Example


Page 17: Lecture 17:  Hierarchical Clustering


Single-link: Maximum similarity


Page 18: Lecture 17:  Hierarchical Clustering


Complete-link: Minimum similarity


Page 19: Lecture 17:  Hierarchical Clustering


Centroid: Average intersimilarity


Page 20: Lecture 17:  Hierarchical Clustering


Group average: Average intrasimilarity


Page 21: Lecture 17:  Hierarchical Clustering


Outline

❶ Recap

❷ Introduction

❸ Single-link/ Complete-link

❹ Centroid/ GAAC

❺ Variants

❻ Labeling clusters


Page 22: Lecture 17:  Hierarchical Clustering


Single-link HAC

The similarity of two clusters is the maximum intersimilarity – the maximum similarity of a document from the first cluster and a document from the second cluster.
Once we have merged two clusters, we update the similarity matrix with this:

SIM(ωi , (ωk1 ∪ ωk2)) = max(SIM(ωi , ωk1), SIM(ωi , ωk2))
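A minimal sketch of this update rule on a numpy similarity matrix; the function name and the `active` index set are illustrative assumptions. Complete-link (two slides ahead) is the same with min instead of max.

```python
import numpy as np

def merge_single_link(sim, active, k1, k2):
    """Merge cluster k2 into k1 and update row/column k1 in place."""
    for i in active:
        if i not in (k1, k2):
            sim[i, k1] = sim[k1, i] = max(sim[i, k1], sim[i, k2])
    active.discard(k2)        # k2 no longer participates in clustering
    return sim, active
```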


Page 23: Lecture 17:  Hierarchical Clustering


This dendrogram was produced by single-link.

Notice: many small clusters (1 or 2 members) being added to the main cluster.
There is no balanced 2-cluster or 3-cluster clustering that can be derived by cutting the dendrogram.


Page 24: Lecture 17:  Hierarchical Clustering


Complete-link HAC

The similarity of two clusters is the minimum intersimilarity – the minimum similarity of a document from the first cluster and a document from the second cluster.
Once we have merged two clusters, we update the similarity matrix with this:

SIM(ωi , (ωk1 ∪ ωk2)) = min(SIM(ωi , ωk1), SIM(ωi , ωk2))

We measure the similarity of two clusters by computing the diameter of the cluster that we would get if we merged them.


Page 25: Lecture 17:  Hierarchical Clustering


Complete-link dendrogram

Notice that this dendrogram is much more balanced than the single-link one.

We can create a 2-cluster clustering with two clusters of about the same size.


Page 26: Lecture 17:  Hierarchical Clustering


Exercise: Compute single-link and complete-link clustering


Page 27: Lecture 17:  Hierarchical Clustering


Single-link vs. Complete-link clustering


Page 28: Lecture 17:  Hierarchical Clustering


Single-link: Chaining

Single-link clustering often produces long, straggly clusters. For most applications, these are undesirable.


Page 29: Lecture 17:  Hierarchical Clustering


Complete-link: Sensitivity to outliers

The complete-link clustering of this set splits d2 from its right neighbors – clearly undesirable.
The reason is the outlier d1. This shows that a single outlier can negatively affect the outcome of complete-link clustering.
Single-link clustering does better in this case.


Page 30: Lecture 17:  Hierarchical Clustering


Outline

❶ Recap

❷ Introduction

❸ Single-link/ Complete-link

❹ Centroid/ GAAC

❺ Variants

❻ Labeling clusters


Page 31: Lecture 17:  Hierarchical Clustering


Centroid HAC

The similarity of two clusters is the average intersimilarity – the average similarity of documents from the first cluster with documents from the second cluster.
An implementation of centroid HAC is to compute the similarity of the centroids:

SIM(ωi , ωj) = μ(ωi) · μ(ωj), where μ(ω) = (1/|ω|) Σd∈ω d

Hence the name: centroid HAC.
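A small sketch of this computation, assuming clusters are given as 2-D numpy arrays with one document vector per row.

```python
import numpy as np

def centroid(cluster):
    return cluster.mean(axis=0)              # μ(ω) = (1/|ω|) Σ_{d∈ω} d

def sim_centroid(A, B):
    return float(centroid(A) @ centroid(B))  # dot product of the centroids
```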


Page 32: Lecture 17:  Hierarchical Clustering


Example: Compute centroid clustering


Page 33: Lecture 17:  Hierarchical Clustering


Centroid clustering


Page 34: Lecture 17:  Hierarchical Clustering


The inversion in centroid clustering

In an inversion, the similarity increases during a merge sequence. This results in an “inverted” dendrogram.
Below: Similarity of the first merger (d1 ∪ d2) is −4.0, similarity of the second merger ((d1 ∪ d2) ∪ d3) is ≈ −3.5.


Page 35: Lecture 17:  Hierarchical Clustering


Group-average agglomerative clustering (GAAC)

The similarity of two clusters is the average intrasimilarity – the average similarity of all document pairs (including those from the same cluster).
But we exclude self-similarities. The similarity is defined as:

SIM-GA(ωi , ωj) = [(Σd∈ωi∪ωj d) · (Σd∈ωi∪ωj d) − (Ni + Nj)] / [(Ni + Nj)(Ni + Nj − 1)]

where Ni = |ωi|, Nj = |ωj|, and documents are length-normalized vectors.
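A sketch of the formula above, assuming length-normalized numpy vectors so that every self-similarity d · d equals 1 (which the sum trick subtracts out); the function name is an assumption.

```python
import numpy as np

def sim_gaac(A, B):
    M = np.vstack([A, B])                # the merged cluster ωi ∪ ωj
    n = len(M)                           # Ni + Nj
    total = M.sum(axis=0)                # Σ d over the merged cluster
    # (Σd)·(Σd) sums every ordered pair once, including the n self-pairs
    return float(total @ total - n) / (n * (n - 1))
```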


Page 36: Lecture 17:  Hierarchical Clustering


Comparison of HAC algorithms


method | combination similarity | comment
single-link | max intersimilarity of any 2 docs | chaining effect
complete-link | min intersimilarity of any 2 docs | sensitive to outliers
group-average | average of all sims | best choice for most applications
centroid | average intersimilarity | inversions can occur

Page 37: Lecture 17:  Hierarchical Clustering


Outline

❶ Recap

❷ Introduction

❸ Single-link/ Complete-link

❹ Centroid/ GAAC

❺ Variants

❻ Labeling clusters


Page 38: Lecture 17:  Hierarchical Clustering


Efficient HAC


Page 39: Lecture 17:  Hierarchical Clustering


Time complexity of HAC

The algorithm we looked at earlier is O(N³).
There is no known O(N²) algorithm for complete-link, centroid and GAAC.
The best time complexity for these three is O(N² log N). (Single-link can be done in O(N²); a sketch follows.)
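Below is a sketch of an O(N²) single-link algorithm using a next-best-merge (NBM) array, in the spirit of the more efficient algorithms referred to above; the function name and input format are assumptions.

```python
import numpy as np

def single_link_nbm(sim):
    """sim: symmetric N x N numpy array of pairwise similarities."""
    n = len(sim)
    sim = sim.astype(float).copy()
    np.fill_diagonal(sim, -np.inf)           # a cluster never merges with itself
    active = set(range(n))
    nbm = sim.argmax(axis=1)                 # best merge partner per cluster
    merges = []
    for _ in range(n - 1):
        # O(N): the globally best pair is the best of the per-cluster bests
        i = max(active, key=lambda k: sim[k, nbm[k]])
        j = int(nbm[i])
        merges.append((i, j, float(sim[i, j])))
        active.discard(j)
        for k in active - {i}:
            # single-link update, then repair k's best-partner pointer
            sim[i, k] = sim[k, i] = max(sim[i, k], sim[j, k])
            if nbm[k] == j or sim[k, i] > sim[k, nbm[k]]:
                nbm[k] = i
        sim[j, :] = sim[:, j] = -np.inf      # retire cluster j
        nbm[i] = sim[i].argmax()             # O(N) refresh for the merged row
    return merges
```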


Page 40: Lecture 17:  Hierarchical Clustering


Outline

❶ Recap

❷ Introduction

❸ Single-link/ Complete-link

❹ Centroid/ GAAC

❺ Variants

❻ Labeling clusters


Page 41: Lecture 17:  Hierarchical Clustering


Discriminative labeling

To label cluster ω, compare ω with all other clusters.
Find terms or phrases that distinguish ω from the other clusters.
We can use any of the feature selection criteria we introduced in text classification to identify discriminating terms, e.g., mutual information.
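For instance, a sketch (not from the slides) of the mutual-information score computed from 2×2 contingency counts, following the text-classification convention: n11 = docs in cluster ω containing the term, n10 = docs outside ω containing it, n01 = docs in ω without it, n00 = docs outside ω without it.

```python
import math

def mutual_information(n11, n10, n01, n00):
    n = n11 + n10 + n01 + n00
    mi = 0.0
    # each tuple: (joint count, marginal count 1, marginal count 2)
    for n_joint, n_m1, n_m2 in [(n11, n11 + n10, n11 + n01),
                                (n10, n11 + n10, n10 + n00),
                                (n01, n01 + n00, n11 + n01),
                                (n00, n01 + n00, n10 + n00)]:
        if n_joint > 0:
            mi += (n_joint / n) * math.log2(n * n_joint / (n_m1 * n_m2))
    return mi
```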


Page 42: Lecture 17:  Hierarchical Clustering


Non-discriminative labeling

Select terms or phrases based solely on information from the cluster itself.

Terms with high weights in the centroid (if we are using a vector space model).

Non-discriminative methods sometimes select frequent terms that do not distinguish clusters.

For example, MONDAY, TUESDAY, . . . in newspaper text
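Despite that caveat, here is a minimal sketch of the centroid-based selection, assuming `tfidf` holds the tf-idf rows of the cluster's documents and `vocab` maps column indices to terms (both names are assumptions).

```python
import numpy as np

def centroid_label(tfidf, vocab, k=10):
    mu = np.asarray(tfidf).mean(axis=0)     # the cluster centroid
    top = np.argsort(mu)[::-1][:k]          # indices of the k largest weights
    return [vocab[t] for t in top]
```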


Page 43: Lecture 17:  Hierarchical Clustering


Using titles for labeling clusters

Terms and phrases are hard to scan and condense into a holistic idea of what the cluster is about.
Alternative: titles. For example, the titles of the two or three documents that are closest to the centroid.
Titles are easier to scan than a list of phrases.
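A minimal sketch of this approach, assuming `vectors` is a 2-D numpy array of length-normalized document vectors and `titles` is a parallel list (both names are assumptions); the dot product with the centroid then ranks documents by cosine similarity.

```python
import numpy as np

def title_label(vectors, titles, k=3):
    mu = vectors.mean(axis=0)               # cluster centroid
    order = np.argsort(vectors @ mu)[::-1]  # docs ranked by closeness to centroid
    return [titles[i] for i in order[:k]]
```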


Page 44: Lecture 17:  Hierarchical Clustering


Cluster labeling: Example


# | docs | centroid | mutual information | title
4 | 622 | oil plant mexico production crude power 000 refinery gas bpd | plant oil production barrels crude bpd mexico dolly capacity petroleum | MEXICO: Hurricane Dolly heads for Mexico coast
9 | 1017 | police security russian people military peace killed told grozny court | police killed military security peace told troops forces rebels people | RUSSIA: Russia’s Lebed meets rebel chief in Chechnya
10 | 1259 | 00 000 tonnes traders futures wheat prices cents september tonne | delivery traders futures tonne tonnes desk wheat prices 000 00 | USA: Export Business - Grain/oilseeds complex

We apply three labeling methods to a K-means clustering: most prominent terms in the centroid, differential labeling using MI, and the title of the doc closest to the centroid.
All three methods do a pretty good job.