Vladyslav Kolbasin Stable Clustering. Clustering data Clustering is part of exploratory process...


Vladyslav Kolbasin

Stable Clustering

Clustering data

Clustering is part of the exploratory process

Standard definition:

Clustering - grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups

There is no “true” solution for clustering

We don't have any “true Y values”

Usually we want to do some data exploration or simplification or even find some data taxonomy

Usually we don't have a precise mathematical definition of clustering

Usually we iterate through different methods, each with its own mathematical objective function, and then keep the one that works best


Usual clustering methods

Methods:

K-means

Hierarchical clustering

Spectral clustering

DBSCAN

BIRCH

Issues:

Need to estimate the number of clusters

Non-determinism

Non-stability
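The non-determinism issue can be illustrated with a small sketch. It is not from the slides: `kmeans_1d` and `as_partition` are hypothetical helpers, the data is a toy 1-D set, and the only point is that the random start can lead K-means to different partitions of the same data.

```python
import random

def kmeans_1d(points, k, seed, n_iter=20):
    """Plain Lloyd iterations on 1-D data with a seeded random start."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers depend on the seed
    labels = [0] * len(points)
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest center
        labels = [min(range(k), key=lambda c: abs(p - centers[c])) for p in points]
        # update step: each non-empty cluster moves its center to the mean
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels

def as_partition(labels):
    """Label-independent view of a clustering: a set of index sets."""
    groups = {}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, set()).add(i)
    return frozenset(frozenset(g) for g in groups.values())

# Four tight pairs but k=3, so there is no single obvious answer
points = [0.0, 0.1, 1.0, 1.1, 2.0, 2.1, 3.0, 3.1]
partitions = {as_partition(kmeans_1d(points, k=3, seed=s)) for s in range(20)}
print(len(partitions), "distinct partitions over 20 seeds")
```

If more than one distinct partition is printed, the result depends on the seed alone, with no change in the data.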


Are standard methods stable? K-means


Are standard methods stable? Hierarchical clustering


Audience data

A lot of attributes: 9000-30000- ...

All attributes are binary

There are several data providers

No single attribute is very important


Stability importance

Data comes from different providers and it is very noisy

We expect results not to change much from run to run

Usually the audience doesn't change a lot over a short period

Many algorithms “explode” when we increase data size

Nonlinear complexity of clustering

The best number of clusters shifts to higher values for bigger data


Let's add an additional requirement to clustering:

The clustering result should be a structure on the data set that is “stable”

So there should be similar results when:

We change some small portion of data

We apply clustering to several datasets from the same underlying model

We apply clustering to several subsets of the initial dataset

We don't want to process gigabytes and terabytes to get several stable clusters which are independent of randomness in sampling
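One way to make "similar results" measurable is a pair-counting agreement score between two clusterings of the same points. The slides do not name a measure; the Rand index below is an assumption chosen for illustration, and `rand_index` is a hypothetical helper.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree.

    A pair agrees if both clusterings put the two points together,
    or both put them apart. Label names themselves do not matter.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:
            agree += 1
    return agree / len(pairs)

run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [1, 1, 0, 0, 2, 2]  # same partition, labels renamed
run_3 = [0, 1, 0, 1, 0, 1]  # a very different partition

print(rand_index(run_1, run_2))  # 1.0
print(rand_index(run_1, run_3))  # 0.4
```

A stable method should score close to 1 when the same points are clustered again after a small change in the data.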

Stable clustering. Requirements

Natural restrictions:

We don't want to have too many clusters

We don't want clusters that are too small or too big

Too small clusters are usually useless for further processing

Too big clusters do not bring significantly new information

Some points can be noise points, so let's try to find only significant tendencies

It will be a big benefit if we can easily scale results

We want to be able to look at the inner structure of a selected cluster without a full rerun

Any additional instruments for manual analysis of clustering are welcome


Stable clustering ideas

Do not use the whole dataset; use many small subsamples

Use several samplings to mine as much information as possible from the data

Average the clusterings over all samples to get a stable result
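A minimal sketch of the subsampling idea, assuming uniform sampling of row indices without replacement within each sample. The slides do not specify the sampling scheme; `draw_subsamples` and its parameters are hypothetical.

```python
import random

def draw_subsamples(n_rows, n_samples, sample_size, seed=0):
    """Return n_samples lists of row indices, each of length sample_size.

    Rows are drawn without replacement inside a sample; different
    samples are drawn independently and may overlap.
    """
    if sample_size > n_rows:
        raise ValueError("sample_size cannot exceed n_rows")
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    return [sorted(rng.sample(range(n_rows), sample_size))
            for _ in range(n_samples)]

samples = draw_subsamples(n_rows=1000, n_samples=5, sample_size=50)
print(len(samples), len(samples[0]))
```

Each index list can then be clustered on its own, which is also what makes the per-sample work easy to run in parallel.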


Stable clustering algorithm

1) Select N samples of the whole dataset

2) Do clustering for each sample

So for each sample we have a set of clusters (possibly very different)

3) Do some preprocessing of the clusters

4) Associate clusters from different samples to each other

Build some relationship structure - a cluster graph

Set a relationship measure - a distance measure

5) Do clustering on the relationship structure

Do a community search


2. Sample clustering

Any clustering method:

K-means

Hierarchical clustering

It is convenient to use hierarchical clustering:

It is a rather fast clustering method

We can estimate the number of clusters using natural restrictions, not using special criteria like we usually do for K-means

We can dig into the internal structure without any additional calculations


2.1. Dendrogram clustering

Recursive splitting of large clusters

With natural restrictions:

Set max possible cluster size (in %)

Set min cluster size (in %); any smaller cluster is treated as noise

Set max count of splits
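The recursive splitting can be sketched as follows. This is one possible reading of the slide, not the author's implementation: the dendrogram is a nested tuple `(left, right)` with integer leaves, sizes are fractions rather than percentages, and the largest pending node is split first. `leaves` and `cut_dendrogram` are hypothetical names.

```python
def leaves(node):
    """All leaf indices under a dendrogram node."""
    if isinstance(node, tuple):
        return leaves(node[0]) + leaves(node[1])
    return [node]

def cut_dendrogram(root, max_frac, min_frac, max_splits):
    """Split oversized nodes top-down under the natural restrictions.

    max_frac   : a cluster above this share of the data gets split
    min_frac   : a cluster below this share is reported as noise
    max_splits : upper bound on the number of splits performed
    """
    total = len(leaves(root))
    clusters, noise = [], []
    pending = [root]
    splits = 0
    while pending:
        # examine the largest pending node first
        pending.sort(key=lambda n: len(leaves(n)))
        node = pending.pop()
        members = leaves(node)
        share = len(members) / total
        if share < min_frac:
            noise.extend(members)
        elif share > max_frac and isinstance(node, tuple) and splits < max_splits:
            splits += 1
            pending.extend(node)  # replace the node by its two children
        else:
            clusters.append(sorted(members))
    return clusters, sorted(noise)

# 10 points: two groups of four and one small pair
tree = ((((0, 1), (2, 3)), ((4, 5), (6, 7))), (8, 9))
clusters, noise = cut_dendrogram(tree, max_frac=0.5, min_frac=0.25, max_splits=10)
print(sorted(clusters))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(noise)             # [8, 9]
```

Because the tree is kept, any returned cluster can be split further later without rerunning the clustering.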


3. Cluster preprocessing

Reduce noise points

Cluster smoothing

Make clusters more convenient for associating:

A cluster can be similar to several other clusters (1-to-many)

If we split it, it can turn into clusters with 1-to-1 relationships (1-to-1 & 1-to-1)

And some other heuristics...


4. Associate clusters from different samples to each other

How similar to each other are clusters?

Set relationship measure:

Simplest measure - distance between cluster centers

But we can use any suitable measure
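A minimal sketch of the simplest measure for binary attribute rows. Euclidean distance is an assumption here (the slides only say "distance between centers"), and `centroid` and `center_distance` are hypothetical helpers.

```python
import math

def centroid(rows):
    """Per-attribute mean of a cluster; for binary data this is the
    share of members that have each attribute."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def center_distance(cluster_a, cluster_b):
    """Euclidean distance between the two cluster centers."""
    return math.dist(centroid(cluster_a), centroid(cluster_b))

# Clusters from different samples, each row is one object with 4 binary attributes
cluster_a = [[1, 1, 0, 0], [1, 0, 0, 0]]
cluster_b = [[1, 1, 0, 0], [1, 1, 0, 1]]
cluster_c = [[0, 0, 1, 1], [0, 0, 1, 1]]

d_ab = center_distance(cluster_a, cluster_b)
d_ac = center_distance(cluster_a, cluster_c)
print(round(d_ab, 3), round(d_ac, 3))  # 0.707 1.803
```

Only the centers are compared, so clusters from different samples can be matched even though they share no objects.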


Cluster relationship structure - cluster graph

But we are not interested in edges between very different clusters

So we need some threshold:

Can estimate manually, then hard-code

Can estimate automatically
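Building the thresholded cluster graph might look like the sketch below. It assumes each cluster is identified by a `(sample_id, cluster_id)` pair, that edges only link clusters from different samples, and that the threshold is supplied by hand. `build_cluster_graph` is a hypothetical helper.

```python
import math

def build_cluster_graph(centers, threshold):
    """Return {(node_a, node_b): distance} for sufficiently close clusters.

    centers   : dict mapping (sample_id, cluster_id) -> center coordinates
    threshold : pairs farther apart than this get no edge
    """
    edges = {}
    nodes = sorted(centers)
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            if a[0] == b[0]:
                continue  # assumption: do not link clusters of the same sample
            d = math.dist(centers[a], centers[b])
            if d <= threshold:
                edges[(a, b)] = d
    return edges

centers = {
    (0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0],  # clusters found in sample 0
    (1, 0): [0.9, 0.1], (1, 1): [0.1, 0.9],  # clusters found in sample 1
}
edges = build_cluster_graph(centers, threshold=0.3)
for (a, b), d in sorted(edges.items()):
    print(a, b, round(d, 3))
```

With this toy input, only the two pairs of matching clusters across the samples stay connected.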


5. Communities search in networks

Methods:

walktrap.community

edge.betweenness.community

fastgreedy.community

spinglass.community

It is possible that some clusters will not be in any community. We then mark these clusters as a special type of community


5.1 Community structure detection based on edge betweenness

edge.betweenness.community() implements the Girvan–Newman algorithm

Betweenness - the number of geodesics (shortest paths) going through an edge

Algorithm:

Calculate edge-betweenness for all edges

Remove the edge with highest betweenness

Recalculate betweenness

Repeat until all edges are removed, or modularity function is optimized (depending on variation)
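The steps above can be sketched in plain Python for a small unweighted, undirected graph. This is an illustration, not the igraph implementation: it stops once a requested number of communities is reached (an assumed stopping rule, instead of optimizing modularity), and all function names are hypothetical.

```python
from collections import deque

def edge_betweenness(adj):
    """Number of shortest paths through each edge (Brandes-style BFS)."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for source in adj:
        dist = {source: 0}
        sigma = {n: 0.0 for n in adj}  # shortest-path counts from source
        sigma[source] = 1.0
        preds = {n: [] for n in adj}
        order, queue = [], deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {n: 0.0 for n in adj}
        for w in reversed(order):  # farthest nodes first
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # every undirected pair was counted once from each endpoint
    return {e: s / 2.0 for e, s in score.items()}

def components(adj):
    """Connected components as a list of node sets."""
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n not in comp:
                comp.add(n)
                stack.extend(adj[n] - comp)
        seen |= comp
        result.append(comp)
    return result

def girvan_newman(adj, n_communities):
    """Remove the highest-betweenness edge until enough components exist."""
    adj = {n: set(nb) for n, nb in adj.items()}  # work on a copy
    while len(components(adj)) < n_communities:
        scores = edge_betweenness(adj)  # recalculated after every removal
        if not scores:
            break  # no edges left
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Two triangles joined by a single bridge c-d
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
bridge_score = edge_betweenness(graph)[frozenset(("c", "d"))]
communities = girvan_newman(graph, n_communities=2)
print(bridge_score)                            # 9.0
print(sorted(sorted(c) for c in communities))  # [['a', 'b', 'c'], ['d', 'e', 'f']]
```

The bridge carries every shortest path between the two triangles, so it is removed first and the two triangles come out as communities.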


5. Communities examples


Algorithm analysis


Summary

Issues in clustering algorithms

Why stability is important for business questions

2-stage clustering algorithm

1st stage – apply simple clustering on samples

2nd stage – do clustering on cluster graph

Real data clustering example

The algorithm can be easily parallelized:

Most time is spent on the 2nd step