HCS Clustering Algorithm


A Clustering Algorithm Based on Graph Connectivity

ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle 2

Presentation Outline

• The Problem
• HCS Algorithm Overview
  – Main Players
  – General Algorithm
  – Properties
  – Improvements
• Conclusion

The Problem

• Clustering:
  – Group elements into subsets based on similarity between pairs of elements
• Requirements:
  – Elements in the same cluster are highly similar to each other
  – Elements in different clusters have low similarity to each other
• Challenges:
  – Large sets of data
  – Inaccurate and noisy measurements

HCS Algorithm Overview

• Highly Connected Subgraphs (HCS) Algorithm
  – Uses graph-theoretic techniques
• Basic Idea
  – Uses similarity information to construct a similarity graph
  – Groups elements that are highly connected with each other

HCS: Main Players

• Similarity Graph
  – Nodes correspond to elements (genes)
  – Edges connect similar elements (those whose similarity value is above some threshold)

[Figure: similarity graph on gene1, gene2, and gene3. Gene1 is similar to gene2, gene1 is similar to gene3, and gene2 is similar to gene3, so each pair is joined by an edge.]
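A minimal sketch of how a similarity graph could be built from pairwise scores. The function name `similarity_graph`, the score values, the extra element gene4, and the threshold of 0.5 are all assumptions made for illustration; the slides do not specify a similarity measure or a threshold.

```python
def similarity_graph(similarity, threshold):
    """Build an undirected graph from pairwise similarity scores.

    similarity: dict mapping (element_a, element_b) -> score (hypothetical input format).
    Returns (nodes, edges); an edge is kept only when the score is above the threshold.
    """
    nodes = set()
    edges = []
    for (a, b), score in similarity.items():
        nodes.update((a, b))          # every element becomes a node
        if score > threshold:         # only sufficiently similar pairs get an edge
            edges.append((a, b))
    return nodes, edges


# Invented scores: gene1, gene2, gene3 are mutually similar, gene4 is not.
scores = {
    ("gene1", "gene2"): 0.90,
    ("gene1", "gene3"): 0.80,
    ("gene2", "gene3"): 0.85,
    ("gene1", "gene4"): 0.20,
}
nodes, edges = similarity_graph(scores, threshold=0.5)
print(sorted(edges))
```

With these made-up scores, the three mutually similar genes form a triangle and gene4 is left as an isolated node.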

• Edge Connectivity
  – Minimum number of edges whose removal results in a disconnected graph

[Figure: graph on gene1, gene2, gene3, and gene4. At least 3 edges must be removed to disconnect the graph, so it has edge connectivity k(G) = 3.]
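The definition can be checked by brute force on a small graph. The sketch below is an illustration only: it tries every way of splitting the nodes into two non-empty sides and counts the crossing edges, which is exponential in the number of nodes and not the method used by the HCS authors. It assumes the four-gene example is the complete graph, which is the only simple graph on four nodes with edge connectivity 3.

```python
from itertools import combinations


def edge_connectivity(nodes, edges):
    """Smallest number of edges crossing any split of the nodes into two non-empty sides."""
    nodes = list(nodes)
    first, rest = nodes[0], nodes[1:]
    best = len(edges)
    # Fix one node on the left side so each split is examined only once;
    # r stops before len(rest) so the right side is never empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            side = {first, *extra}
            crossing = sum((u in side) != (v in side) for u, v in edges)
            best = min(best, crossing)
    return best


genes = ["gene1", "gene2", "gene3", "gene4"]
complete = list(combinations(genes, 2))   # every pair of genes is joined by an edge
print(edge_connectivity(genes, complete))  # 3
```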

• Highly Connected Subgraphs
  – Subgraphs whose edge connectivity exceeds half the number of nodes

[Figure: graph on gene1 through gene8. Entire graph: nodes = 8, edge connectivity = 1. Not HCS!]

[Figure: the same graph on gene1 through gene8 with a subgraph highlighted. Subgraph: nodes = 5, edge connectivity = 3. HCS!]
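The two figures amount to a simple numeric test. As a small sketch (the function name `is_highly_connected` is made up for this example), the criterion can be written as:

```python
def is_highly_connected(edge_connectivity, num_nodes):
    """HCS criterion: edge connectivity must exceed half the number of nodes."""
    return edge_connectivity > num_nodes / 2


print(is_highly_connected(1, 8))  # entire graph: 1 > 4 is False
print(is_highly_connected(3, 5))  # subgraph: 3 > 2.5 is True
```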

• Cut
  – A set of edges whose removal disconnects the graph

[Figure: an example cut in the graph on gene1 through gene8]

• Minimum Cut
  – A cut with a minimum number of edges

[Figure: a minimum cut in the graph on gene1 through gene8]

HCS: Algorithm (by example)

[Figure sequence: the HCS algorithm applied step by step to an example graph with nodes 1 through 12]

• Find and remove a minimum cut
• Are the resulting subgraphs highly connected? One of them is (Highly Connected!), and it becomes Cluster 1
• Repeat the process on non-highly connected subgraphs
• Find and remove a minimum cut
• Are the resulting subgraphs highly connected? Both are (Highly Connected!)
• Resulting clusters: Cluster 1, Cluster 2, Cluster 3

HCS: Algorithm

HCS( G ) {
    MINCUT( G ) = { H1, … , Ht }
    for each Hi, i = [ 1, t ] {
        if k( Hi ) > n ÷ 2
            return Hi
        else
            HCS( Hi )
    }
}

Find a minimum cut in graph G. This returns a set of subgraphs { H1, … , Ht } resulting from the removal of the cut set.

For each subgraph…

If the subgraph is highly connected, then return that subgraph as a cluster. (Note: k( Hi ) denotes the edge connectivity of graph Hi, and n denotes the number of nodes in Hi.)

Otherwise, repeat the algorithm on the subgraph (recursive function). This continues until there are no more subgraphs, and all clusters have been found.
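The recursion can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: `min_cut` and `hcs` are hypothetical names, the minimum cut is found by brute force over all two-sided splits (only workable for small graphs), node labels are assumed to be sortable, each graph is tested for high connectivity before it is cut (a slight reordering of the pseudocode), and singletons are simply dropped.

```python
from itertools import combinations


def min_cut(nodes, edges):
    """Brute-force minimum cut; returns (cut_edges, one_side), or None for a single node."""
    nodes = sorted(nodes)
    first, rest = nodes[0], nodes[1:]
    best = None
    for r in range(len(rest)):                 # the other side is never empty
        for extra in combinations(rest, r):
            side = {first, *extra}
            cut = [(u, v) for u, v in edges if (u in side) != (v in side)]
            if best is None or len(cut) < len(best[0]):
                best = (cut, side)
    return best


def hcs(nodes, edges):
    """Return highly connected clusters as a list of node sets (singletons are dropped)."""
    nodes = set(nodes)
    if len(nodes) < 2:
        return []
    cut, side = min_cut(nodes, edges)
    if len(cut) > len(nodes) / 2:              # edge connectivity exceeds n / 2
        return [nodes]
    clusters = []
    for part in (side, nodes - side):          # recurse on both sides of the cut
        sub_edges = [(u, v) for u, v in edges if u in part and v in part]
        clusters += hcs(part, sub_edges)
    return clusters


# Invented example: two 4-node cliques joined by the single edge (4, 5).
edges = (list(combinations([1, 2, 3, 4], 2))
         + list(combinations([5, 6, 7, 8], 2))
         + [(4, 5)])
print(hcs(range(1, 9), edges))  # [{1, 2, 3, 4}, {5, 6, 7, 8}]
```

In this made-up graph the only minimum cut is the bridging edge, and each clique has edge connectivity 3 on 4 nodes, so both sides are returned as clusters after a single cut.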

Running time is bounded by 2N × f( n, m ), where N is the number of clusters found and f( n, m ) is the time complexity of computing a minimum cut in a graph with n nodes and m edges.

Deterministic minimum cut for an unweighted graph: takes O(nm) steps, where n is the number of nodes and m is the number of edges.
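Combining the two running-time statements gives a rough overall bound. This is a worked combination of the figures on the slides, assuming the O(nm) figure is the minimum-cut cost f( n, m ); it is not an additional result from the talk.

```latex
T_{\mathrm{HCS}}(n, m) \;\le\; 2N \cdot f(n, m),
\qquad f(n, m) = O(nm)
\;\;\Longrightarrow\;\;
T_{\mathrm{HCS}}(n, m) = O(N \, n \, m)
```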

HCS: Properties

• Homogeneity
  – Each cluster has a diameter of at most 2
    • Distance is the length of the shortest path between two nodes
      – Determined by number of EDGES traveled between nodes
    • Diameter is the longest distance in the graph
  – Each cluster is at least half as dense as a clique
    • Clique is a graph with maximum possible edge connectivity

[Figure: a clique, and an example graph G on nodes a through f with Dist( a, d ) = 2, Dist( a, e ) = 3, Diam( G ) = 4]
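Both homogeneity claims follow from the definition of a highly connected graph by a short degree-counting argument. The derivation below is a standard sketch supplied for clarity, not quoted from the slides; N(v) denotes the set of neighbors of v, a symbol introduced here.

```latex
\begin{align*}
&\text{Let } G = (V, E),\ |V| = n \ge 2,\ k(G) > n/2. \\
&\deg(v) \ge k(G) > n/2 \ \text{ for every } v \in V,
  \text{ since removing all edges at } v \text{ disconnects } G. \\
&\text{If } u, v \text{ are not adjacent: } N(u), N(v) \subseteq V \setminus \{u, v\}
  \ \text{ and } \ |N(u)| + |N(v)| > n > n - 2, \\
&\text{so } N(u) \cap N(v) \neq \emptyset \ \text{ and } \ \mathrm{Dist}(u, v) = 2.
  \ \text{ Hence } \mathrm{Diam}(G) \le 2. \\
&|E| = \tfrac{1}{2} \sum_{v \in V} \deg(v) > \tfrac{n^2}{4} > \tfrac{1}{2} \binom{n}{2},
  \ \text{ i.e. more than half the edges of a clique on } n \text{ nodes.}
\end{align*}
```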

• Separation
  – Any non-trivial split is unlikely to have diameter of two
  – Number of edges removed by each iteration is linear in the size of the underlying subgraph
    • Compared to quadratic number of edges within final clusters
    • Indicates separation unless sizes are small
    • Does not imply a bound on the number of edges removed overall

HCS: Improvements

[Figure sequence: choosing between cut sets in the example graph]

• Iterated HCS
  – Sometimes there are multiple minimum cuts to choose from
    • Some cuts may create “singletons” or nodes that become disconnected from the rest of the graph
  – Performs several iterations of HCS until no new cluster is found (to find best final clusters)
    • Theoretically adds another O(n) factor to running time, but typically only needs 1 – 5 more iterations

• Remove low degree nodes first
  – If node has low degree, likely will just be separated from rest of graph
  – Calculating separation for those nodes is expensive
  – Removal helps eliminate unnecessary iterations and significantly reduces running time
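One way this pre-processing step could look is sketched below. The function name `remove_low_degree` and the `min_degree` parameter are assumptions for illustration; the slides do not say which degree threshold is used. Removal is repeated because deleting one node can lower the degree of its neighbors. The sketch assumes the edge list has no self-loops or duplicate edges.

```python
def remove_low_degree(nodes, edges, min_degree):
    """Repeatedly drop nodes whose degree is below min_degree.

    Returns (kept_nodes, kept_edges, removed_nodes).
    """
    nodes, edges = set(nodes), list(edges)
    removed = set()
    while True:
        # Recompute degrees on the current (already pruned) graph.
        degree = {v: 0 for v in nodes}
        for u, v in edges:
            degree[u] += 1
            degree[v] += 1
        low = {v for v in nodes if degree[v] < min_degree}
        if not low:
            return nodes, edges, removed
        nodes -= low
        removed |= low
        edges = [(u, v) for u, v in edges if u not in low and v not in low]


# Invented example: a triangle a-b-c with a two-node tail c-d-e.
tail_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
kept, kept_edges, removed = remove_low_degree("abcde", tail_edges, min_degree=2)
print(sorted(kept), sorted(removed))
```

In this example node e goes first, which drops d to degree 1 so it is removed in the next pass, leaving only the triangle for the more expensive minimum-cut computation.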

Conclusion

• Performance
  – With improvements, can handle problems with up to thousands of elements in reasonable computing time
  – Generates clusters with high homogeneity and separation
  – More robust (responds better when noise is introduced) than other approaches based on connectivity

References

“A Clustering Algorithm based on Graph Connectivity”

By Erez Hartuv and Ron Shamir

March 1999 (Revised December 1999)

http://www.math.tau.ac.il/~rshamir/papers.html