An Empirical Comparison of Stream Clustering Algorithms · Hahsler, Michael and Matthew Bolaños...

26
An Empirical Comparison of Stream Clustering Algorithms Matthias Carnein , Dennis Assenmacher, Heike Trautmann CF’17 | BigDAW Workshop | Siena, Italy | May 15–18, 2017 WESTFÄLISCHE WILHELMS-UNIVERSITÄT MÜNSTER

Transcript of An Empirical Comparison of Stream Clustering Algorithms · Hahsler, Michael and Matthew Bolaños...

Page 1: An Empirical Comparison of Stream Clustering Algorithms · Hahsler, Michael and Matthew Bolaños (2016).‘Clustering Data Streams Based on Shared Density between Micro-Clusters’.In:

An Empirical Comparison of Stream ClusteringAlgorithms

Matthias Carnein, Dennis Assenmacher, Heike Trautmann

CF’17 | BigDAW Workshop | Siena, Italy | May 15–18, 2017

WESTFÄLISCHEWILHELMS-UNIVERSITÄTMÜNSTER

Page 2: An Empirical Comparison of Stream Clustering Algorithms · Hahsler, Michael and Matthew Bolaños (2016).‘Clustering Data Streams Based on Shared Density between Micro-Clusters’.In:

livin

gkn

owle

dge

WW

UM

ünst

er

WESTFÄLISCHEWILHELMS-UNIVERSITÄTMÜNSTER An Empirical Comparison of Stream Clustering Algorithms 2 /29

Clustering

0 5

0

2

4

6

0 5

0

2

4

6

I Clustering aims to discover groups of similar objectsI Traditional algorithms assume a fixed set of dataI If new data arrives: Repeat the entire procedure

,,

Matthias Carnein Stream Clustering Setup Results Conclusion

Page 3: An Empirical Comparison of Stream Clustering Algorithms · Hahsler, Michael and Matthew Bolaños (2016).‘Clustering Data Streams Based on Shared Density between Micro-Clusters’.In:

livin

gkn

owle

dge

WW

UM

ünst

er

WESTFÄLISCHEWILHELMS-UNIVERSITÄTMÜNSTER An Empirical Comparison of Stream Clustering Algorithms 3 /29

Stream Clustering

0 50

2

46 data stream

0 50

2

46

Stream Model1. Data points arrive as a continuous stream2. The size of the stream is large and potentially unbounded3. Data points can only be evaluated once4. The order of data points cannot be influenced5. The distribution of the data can change over time

,,

Matthias Carnein Stream Clustering Setup Results Conclusion

Page 4: An Empirical Comparison of Stream Clustering Algorithms · Hahsler, Michael and Matthew Bolaños (2016).‘Clustering Data Streams Based on Shared Density between Micro-Clusters’.In:

livin

gkn

owle

dge

WW

UM

ünst

er

WESTFÄLISCHEWILHELMS-UNIVERSITÄTMÜNSTER An Empirical Comparison of Stream Clustering Algorithms 4 /29

Outline

1. Stream Clustering

2. Setup

3. Results

4. Conclusion

,,

Matthias Carnein Stream Clustering Setup Results Conclusion

Page 5: An Empirical Comparison of Stream Clustering Algorithms · Hahsler, Michael and Matthew Bolaños (2016).‘Clustering Data Streams Based on Shared Density between Micro-Clusters’.In:

livin

gkn

owle

dge

WW

UM

ünst

er

WESTFÄLISCHEWILHELMS-UNIVERSITÄTMÜNSTER An Empirical Comparison of Stream Clustering Algorithms 4 /29

Outline

1. Stream Clustering

2. Setup

3. Results

4. Conclusion

,,

Matthias Carnein Stream Clustering Setup Results Conclusion

Page 6: An Empirical Comparison of Stream Clustering Algorithms · Hahsler, Michael and Matthew Bolaños (2016).‘Clustering Data Streams Based on Shared Density between Micro-Clusters’.In:

livin

gkn

owle

dge

WW

UM

ünst

er

WESTFÄLISCHEWILHELMS-UNIVERSITÄTMÜNSTER An Empirical Comparison of Stream Clustering Algorithms 5 /29

Two-Stage Clusteringdata stream

0 2 4 60

2

4

6

online

‘micro-clusters’ ‘macro-clusters’

‘Reclustering’

offline

0 2 4 60

2

4

6

[Aggarwal et al. 2003]

,,

Matthias Carnein Stream Clustering Setup Results Conclusion

Page 7: An Empirical Comparison of Stream Clustering Algorithms · Hahsler, Michael and Matthew Bolaños (2016).‘Clustering Data Streams Based on Shared Density between Micro-Clusters’.In:

Distance-Based

I BIRCH [Zhang et al. 1996]

I STREAM [O’Callaghan et al. 2002]

I CluStream [Aggarwal et al. 2003]

I DenStream [Cao et al. 2006]

I ClusTree [Kranen et al. 2009]

I streamKM++ [Ackermann et al. 2012]

I BICO [Fichtenberger et al. 2013]

I DBSTREAM [Hahsler et al. 2016]

Grid-Based

I D-Stream [Chen et al. 2007]

I D-Stream + attraction

[Tu et al. 2009]

Challenges

I Many parametersI No clear guidelines how to

set them

Page 8: An Empirical Comparison of Stream Clustering Algorithms · Hahsler, Michael and Matthew Bolaños (2016).‘Clustering Data Streams Based on Shared Density between Micro-Clusters’.In:

livin

gkn

owle

dge

WW

UM

ünst

er

WESTFÄLISCHEWILHELMS-UNIVERSITÄTMÜNSTER An Empirical Comparison of Stream Clustering Algorithms 7 /29

DenStream

I Clustering Feature CF = ( n ,−→LS ,−→SS ) describes micro-cluster

I Decay the weight of clusters over time: ω(t) = 2−λt

I Core clusters can decay to outlier clusters

#pointslinear sum

per dimensionsquared sum

per dimension

attempt 3

attempt 1 attempt 2

core micro-clusteroutlier micro-cluster

,,

Matthias Carnein Stream Clustering Setup Results Conclusion

Page 9: An Empirical Comparison of Stream Clustering Algorithms · Hahsler, Michael and Matthew Bolaños (2016).‘Clustering Data Streams Based on Shared Density between Micro-Clusters’.In:

livin

gkn

owle

dge

WW

UM

ünst

er

WESTFÄLISCHEWILHELMS-UNIVERSITÄTMÜNSTER An Empirical Comparison of Stream Clustering Algorithms 8 /29

D-Stream

I Maintains density of grid cellsI Reclustering: combine neighbouring dense cellsI Extension using attraction: only combine cells that share

points at the border

,,

Matthias Carnein Stream Clustering Setup Results Conclusion

Page 10: An Empirical Comparison of Stream Clustering Algorithms · Hahsler, Michael and Matthew Bolaños (2016).‘Clustering Data Streams Based on Shared Density between Micro-Clusters’.In:

livin

gkn

owle

dge

WW

UM

ünst

er

WESTFÄLISCHEWILHELMS-UNIVERSITÄTMÜNSTER An Empirical Comparison of Stream Clustering Algorithms 9 /29

DBSTREAM

I Assign points based on distance thresholdI Move clusters towards new points (competitive learning)I Maintains shared density between micro-clustersI Reclustering: Merge clusters with high shared density

,,

Matthias Carnein Stream Clustering Setup Results Conclusion

Page 11: An Empirical Comparison of Stream Clustering Algorithms · Hahsler, Michael and Matthew Bolaños (2016).‘Clustering Data Streams Based on Shared Density between Micro-Clusters’.In:

livin

gkn

owle

dge

WW

UM

ünst

er

WESTFÄLISCHEWILHELMS-UNIVERSITÄTMÜNSTER An Empirical Comparison of Stream Clustering Algorithms 10 /29

Remaining Algorithms

I BIRCH: maintain Clustering Features in a tree structureI STREAM: repeatedly apply k-median approximation on chunks

of dataI CluStream: store latest snapshots for various time intervals to

extract portion of entire streamI streamKM++: divisive clustering approach using medoidsI BICO: similar tree structure as BIRCH but with medoids similar

to streamKM++

Research Goal

Fair evaluation and comparison of ten popular stream clusteringalgorithms

,,

Matthias Carnein Stream Clustering Setup Results Conclusion

Page 12: An Empirical Comparison of Stream Clustering Algorithms · Hahsler, Michael and Matthew Bolaños (2016).‘Clustering Data Streams Based on Shared Density between Micro-Clusters’.In:

livin

gkn

owle

dge

WW

UM

ünst

er

WESTFÄLISCHEWILHELMS-UNIVERSITÄTMÜNSTER An Empirical Comparison of Stream Clustering Algorithms 11 /29

Outline

1. Stream Clustering

2. Setup

3. Results

4. Conclusion

,,

Matthias Carnein Stream Clustering Setup Results Conclusion

Page 13: An Empirical Comparison of Stream Clustering Algorithms · Hahsler, Michael and Matthew Bolaños (2016).‘Clustering Data Streams Based on Shared Density between Micro-Clusters’.In:

livin

gkn

owle

dge

WW

UM

ünst

er

WESTFÄLISCHEWILHELMS-UNIVERSITÄTMÜNSTER An Empirical Comparison of Stream Clustering Algorithms 12 /29

Data sets

0 10

0

10

−5 0 5

−5

0

5

0 2004006000

200

400

0 200400600

0

200

400

Data set Description Observations Dimension Cluster Noise

square1 Spherical 1, 000 2 4 0%spiral Connected 1, 000 2 2 0%complex9 Various shapes 3, 031 2 9 0%t8.8k Shapes + Noise 8, 000 2 8 4%KDD Cup'99 Network Intrusion 4, 898, 431 34 23 –Sensor Environment sensors 2, 219, 803 4 54 –Covertype Forest type 581, 012 10 6 –

,,

Matthias Carnein Stream Clustering Setup Results Conclusion

Page 14: An Empirical Comparison of Stream Clustering Algorithms · Hahsler, Michael and Matthew Bolaños (2016).‘Clustering Data Streams Based on Shared Density between Micro-Clusters’.In:

livin

gkn

owle

dge

WW

UM

ünst

er

WESTFÄLISCHEWILHELMS-UNIVERSITÄTMÜNSTER An Empirical Comparison of Stream Clustering Algorithms 13 /29

EvaluationI Static Data: Evaluate clusters at the end of the streamI Real world Data: Prequential evaluation [Cao et al. 2006]

1. Evaluate2. Update

3. Evaluate4. Update

5. Evaluate6. Update

External EvaluationI Micro-clusters: Purity

I Proportion of points that belong to the majority class in acluster

I Macro-clusters: Adjusted Rand Index (ARI)I Similarity between true groups and generated clusters

,,

Matthias Carnein Stream Clustering Setup Results Conclusion

Page 15: An Empirical Comparison of Stream Clustering Algorithms · Hahsler, Michael and Matthew Bolaños (2016).‘Clustering Data Streams Based on Shared Density between Micro-Clusters’.In:

livin

gkn

owle

dge

WW

UM

ünst

er

WESTFÄLISCHEWILHELMS-UNIVERSITÄTMÜNSTER An Empirical Comparison of Stream Clustering Algorithms 14 /29

Parameter-Tuning

I Choice of parameters is important for a fair comparisonI Automatic parameter-configuration using Iterated Racing

(irace) [López-Ibáñez et al. 2016]

I Search for configurations that optimize the ARI of themacro-clustering

I Upper limit on the number of micro-clustersI Static data: 5% of the number of observations in the streamI Real world data: fixed upper limit of 5000

,,

Matthias Carnein Stream Clustering Setup Results Conclusion

Page 16: An Empirical Comparison of Stream Clustering Algorithms · Hahsler, Michael and Matthew Bolaños (2016).‘Clustering Data Streams Based on Shared Density between Micro-Clusters’.In:

livin

gkn

owle

dge

WW

UM

ünst

er

WESTFÄLISCHEWILHELMS-UNIVERSITÄTMÜNSTER An Empirical Comparison of Stream Clustering Algorithms 15 /29

Outline

1. Stream Clustering

2. Setup

3. Results

4. Conclusion

,,

Matthias Carnein Stream Clustering Setup Results Conclusion

Page 17: An Empirical Comparison of Stream Clustering Algorithms · Hahsler, Michael and Matthew Bolaños (2016).‘Clustering Data Streams Based on Shared Density between Micro-Clusters’.In:

Quality – square1

BIRCH STREAM CluStream DenStream D-Stream

D-Stream + attr. ClusTree StreamKM++ BICO DBSTREAM

0

0.5

1

Purity

0

0.5

1

ARI

I Micro-clusters: generally high purity, inconsistency forDenStream

I Macro-clusters: good results but lowest ARI for DBSTREAMI Example: Clustering result of DBSTREAM

Page 18: An Empirical Comparison of Stream Clustering Algorithms · Hahsler, Michael and Matthew Bolaños (2016).‘Clustering Data Streams Based on Shared Density between Micro-Clusters’.In:

Quality – spiral

BIRCH STREAM CluStream DenStream D-Stream

D-Stream + attr. ClusTree StreamKM++ BICO DBSTREAM

0

0.5

1

Purity

0

0.5

1

ARI

I Micro-clusters: high purity for CluStream, DBSTREAM, BICO,streamKM++

I Macro-clusters: only decent results for DBSTREAMI Example: clustering result of STREAM

Page 19: An Empirical Comparison of Stream Clustering Algorithms · Hahsler, Michael and Matthew Bolaños (2016).‘Clustering Data Streams Based on Shared Density between Micro-Clusters’.In:

Quality – Sensor

BIRCH STREAM CluStream DenStream D-Stream

D-Stream + attr. ClusTree StreamKM++ BICO DBSTREAM

0 500 1,000 1,500 2,0000

0.5

1

Time (in 1000)

Puri

ty

0 500 1,000 1,500 2,0000

0.5

1

Time (in 1000)

ARI

Page 20: An Empirical Comparison of Stream Clustering Algorithms · Hahsler, Michael and Matthew Bolaños (2016).‘Clustering Data Streams Based on Shared Density between Micro-Clusters’.In:

Runtime

I Various programming languages (R, C, C++, Java)I Different interfaces, different authors

BIRCH STREAM CluStream DenStream D-Stream

D-Stream + attr. ClusTree StreamKM++ BICO DBSTREAM

square1

spiral

complex9

t8.8k

0

1

2

3

4

Runtime(ins)

KDDCup'99

Sensor

Covertype

102

103

104

Page 21: An Empirical Comparison of Stream Clustering Algorithms · Hahsler, Michael and Matthew Bolaños (2016).‘Clustering Data Streams Based on Shared Density between Micro-Clusters’.In:

livin

gkn

owle

dge

WW

UM

ünst

er

WESTFÄLISCHEWILHELMS-UNIVERSITÄTMÜNSTER An Empirical Comparison of Stream Clustering Algorithms 20 /29

Outline

1. Stream Clustering

2. Setup

3. Results

4. Conclusion

,,

Matthias Carnein Stream Clustering Setup Results Conclusion

Page 22: An Empirical Comparison of Stream Clustering Algorithms · Hahsler, Michael and Matthew Bolaños (2016).‘Clustering Data Streams Based on Shared Density between Micro-Clusters’.In:

livin

gkn

owle

dge

WW

UM

ünst

er

WESTFÄLISCHEWILHELMS-UNIVERSITÄTMÜNSTER An Empirical Comparison of Stream Clustering Algorithms 21 /29

ConclusionI Empirical comparison of 10 stream clustering algorithmsI DBSTREAM performed best but dependent on insertion orderI D-Stream good results but requires more micro-clusterI DenStream poor results and very inconsistent

Future Work

I More algorithms and data sets and runtime experimentsI Alternative configuration methodsI Use results to design an improved algorithmI Application to Multi-Channel Customer Relationship

Management

,,

Matthias Carnein Stream Clustering Setup Results Conclusion

Page 23: An Empirical Comparison of Stream Clustering Algorithms · Hahsler, Michael and Matthew Bolaños (2016).‘Clustering Data Streams Based on Shared Density between Micro-Clusters’.In:

livin

gkn

owle

dge

WW

UM

ünst

er

WESTFÄLISCHEWILHELMS-UNIVERSITÄTMÜNSTER An Empirical Comparison of Stream Clustering Algorithms 22 /29

Questions?

?????,,

Matthias Carnein Stream Clustering Setup Results Conclusion

Page 24: An Empirical Comparison of Stream Clustering Algorithms · Hahsler, Michael and Matthew Bolaños (2016).‘Clustering Data Streams Based on Shared Density between Micro-Clusters’.In:

livin

gkn

owle

dge

WW

UM

ünst

er

WESTFÄLISCHEWILHELMS-UNIVERSITÄTMÜNSTER An Empirical Comparison of Stream Clustering Algorithms 23 /29

References I

Ackermann, Marcel R., Marcus Märtens, Christoph Raupach, Kamil Swierkot,Christiane Lammersen and Christian Sohler (2012). ‘StreamKM++: A ClusteringAlgorithm for Data Streams’. In: J. Exp. Algorithmics 17, 2.4:2.1–2.4:2.30. (Visited on16/02/2017).

Aggarwal, Charu C., Jiawei Han, Jianyong Wang and Philip S. Yu (2003). ‘A Framework forClustering Evolving Data Streams’. In: Proceedings of the 29th International Conferenceon Very Large Data Bases. Vol. 29. VLDB ’03. Berlin, Germany: VLDB Endowment,pp. 81–92.

Cao, Feng, Martin Ester, Weining Qian and Aoying Zhou (2006). ‘Density-based clusteringover an evolving data stream with noise’. In: Conference on Data Mining (SIAM ’06),pp. 328–339. DOI: 10.1109/CCCM.2009.5267735.

Chen, Yixin and Li Tu (2007). ‘Density-based Clustering for Real-time Stream Data’. In:Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discoveryand Data Mining. KDD ’07. San Jose, California, USA: ACM, pp. 133–142. DOI:10.1145/1281192.1281210.

,,

Matthias Carnein Stream Clustering Setup Results Conclusion

Page 25: An Empirical Comparison of Stream Clustering Algorithms · Hahsler, Michael and Matthew Bolaños (2016).‘Clustering Data Streams Based on Shared Density between Micro-Clusters’.In:

livin

gkn

owle

dge

WW

UM

ünst

er

WESTFÄLISCHEWILHELMS-UNIVERSITÄTMÜNSTER An Empirical Comparison of Stream Clustering Algorithms 24 /29

References II

Fichtenberger, Hendrik, Marc Gillé, Melanie Schmidt, Chris Schwiegelshohn andChristian Sohler (2013). ‘BICO: BIRCH Meets Coresets for k-Means Clustering’. In:Algorithms - ESA 2013 - 21st Annual European Symposium, Sophia Antipolis, France,September 2-4, 2013. Proceedings. URL:http://ls2-www.cs.uni-dortmund.de/bico/ (visited on 02/03/2017),pp. 481–492. DOI: 10.1007/978-3-642-40450-4_41.

Guha, S., A. Meyerson, N. Mishra, R. Motwani and L. O’Callaghan (2003). ‘Clustering datastreams: Theory and practice’. In: IEEE Transactions on Knowledge and DataEngineering 15.3, pp. 515–528. DOI: 10.1109/TKDE.2003.1198387.

Hahsler, Michael and Matthew Bolaños (2016). ‘Clustering Data Streams Based on SharedDensity between Micro-Clusters’. In: IEEE Transactions on Knowledge and DataEngineering 28.6, pp. 1449–1461. DOI: 10.1109/TKDE.2016.2522412.

Kranen, Philipp, Ira Assent, Corinna Baldauf and Thomas Seidl (2009). ‘Self-AdaptiveAnytime Stream Clustering’. In: Ninth IEEE International Conference on Data Mining(ICDM ’09), pp. 249–258. DOI: 10.1109/ICDM.2009.47.

,,

Matthias Carnein Stream Clustering Setup Results Conclusion

Page 26: An Empirical Comparison of Stream Clustering Algorithms · Hahsler, Michael and Matthew Bolaños (2016).‘Clustering Data Streams Based on Shared Density between Micro-Clusters’.In:

livin

gkn

owle

dge

WW

UM

ünst

er

WESTFÄLISCHEWILHELMS-UNIVERSITÄTMÜNSTER An Empirical Comparison of Stream Clustering Algorithms 25 /29

References III

López-Ibáñez, Manuel, Jérémie Dubois-Lacoste, Leslie Pérez Cáceres, Thomas Stützle andMauro Birattari (2016). ‘The irace package: Iterated Racing for Automatic AlgorithmConfiguration’. In: Operations Research Perspectives. DOI:10.1016/j.orp.2016.09.002.

O’Callaghan, L., N. Mishra, A. Meyerson, S. Guha and R. Motwani (2002). ‘Streaming-dataalgorithms for high-quality clustering’. In: Proceedings of the 18th InternationalConference on Data Engineering (ICDE). URL: http://infolab.stanford.edu/~loc/(visited on 02/03/2017), pp. 685–694. DOI: 10.1109/ICDE.2002.994785.

Tu, Li and Yixin Chen (2009). ‘Stream Data Clustering Based on Grid Density and Attraction’.In: ACM Trans. Knowl. Discov. Data 3.3, 12:1–12:27. DOI: 10.1145/1552303.1552305.

Zhang, Tian, Raghu Ramakrishnan and Miron Livny (1996). ‘BIRCH: An Efficient DataClustering Method for Very Large Databases’. In: Proceedings of the 1996 ACM SIGMODInternational Conference on Management of Data. SIGMOD ’96. New York, NY, USA:ACM, pp. 103–114. DOI: 10.1145/233269.233324.

,,

Matthias Carnein Stream Clustering Setup Results Conclusion