Presentation ucb 2012

Page 1: Presentation ucb 2012

The ClusTree: Indexing Micro-Clusters for Anytime Stream Mining

Philipp Kranen (1), Ira Assent (2), Corinna Baldauf (1), Thomas Seidl (1)

(1) Data Management and Data Exploration Group, RWTH Aachen University, Germany

(2) Department of Computer Science, Aarhus University, Denmark

Page 2: Presentation ucb 2012

Motivating examples

[Figure labels: normal, emergency, pre-classifier, full classifier, professional decision]

P. Kranen, I. Assent, C. Baldauf, T. Seidl – The ClusTree: Indexing Micro-Clusters for Anytime Stream Mining

Page 3: Presentation ucb 2012

Applications and tasks

Classification

Modeling

Outlier detection

[Figure labels: varying data rate, constant data rate]

Page 4: Presentation ucb 2012

Agenda

I. The Anytime principle: Anytime algorithms for stream data mining

II. The ClusTree: Self-adaptive anytime stream clustering

III. The MOA Framework: An open source framework for stream mining algorithms

Page 5: Presentation ucb 2012

Definitions I

Stream
A stream $S$ is an infinite sequence of objects $o_i$ from a $d$-dimensional input space, where $t_i$ is the discrete arrival time of object $o_i$ for all $i$.

Inter-arrival time
The inter-arrival time between two consecutive objects $o_i$ and $o_{i+1}$ is denoted as $\Delta t_i$, i.e. $\Delta t_i = t_{i+1} - t_i$.

Constant and varying streams
A stream $S$ is called constant $\Leftrightarrow \Delta t_i = \Delta t_j \;\; \forall i, j$; otherwise it is called varying.

Stream algorithms
– Online algorithms: the input is given one object at a time
– Budget algorithms: tailored to a specific time budget $b$
– Anytime algorithms: provide a result after any amount of processing time
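
The stream definitions can be checked on a finite prefix of arrival times. The following is a minimal sketch; the function names and the use of integer timestamps are assumptions for illustration and do not come from the slides.

```python
from typing import List


def inter_arrival_times(arrivals: List[int]) -> List[int]:
    """Differences between consecutive arrival times t_i and t_(i+1)."""
    return [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]


def is_constant(arrivals: List[int]) -> bool:
    """True if all inter-arrival times in this finite prefix are equal."""
    deltas = inter_arrival_times(arrivals)
    return all(delta == deltas[0] for delta in deltas)
```

A real stream is infinite, so a check like this can only say that the observed prefix is consistent with a constant stream.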

Page 6: Presentation ucb 2012

Definitions II

Budget algorithms: tailored to a specific time budget
– Available time < budget → no result
– Available time > budget → idle times

How should stream processing be done?
– Little time → fast result
– More time → use it to improve the result

Anytime algorithms: provide a result after any time

For a given input, an anytime algorithm can provide a first result after a very short initialization time, and it uses additional time to improve its result. The algorithm is interruptible at any time and delivers the best result obtained up to the point of interruption.

[Figure: result quality plotted over time]
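
The anytime principle can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not one of the published anytime classifiers: it performs a linear nearest-neighbor scan, counts refinement steps instead of measuring wall-clock time, and uses the hypothetical names `anytime_nearest` and `run_until_interrupted`.

```python
from typing import Iterator, List, Tuple

Result = Tuple[float, float]  # (distance to query, neighbor value)


def anytime_nearest(query: float, data: List[float]) -> Iterator[Result]:
    """Yield the best neighbor found so far after every refinement step."""
    best = (abs(data[0] - query), data[0])  # short initialization: one object
    yield best
    for x in data[1:]:
        candidate = (abs(x - query), x)
        if candidate < best:  # keep the closer neighbor
            best = candidate
        yield best


def run_until_interrupted(steps: Iterator[Result], budget: int) -> Result:
    """Consume at most `budget` steps (budget >= 1) and return the last result."""
    result = next(steps)
    for _ in range(budget - 1):
        refined = next(steps, None)
        if refined is None:  # the algorithm finished before the interruption
            break
        result = refined
    return result
```

With more steps the reported distance never increases, which mirrors the monotone quality curve an anytime algorithm is expected to show.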

Page 7: Presentation ucb 2012

Anytime algorithms on constant streams

Can we do better than using all available time?

Yes we can!

Distribute computation time according to confidence values
– Spend less time on confident items
– Use additional time for uncertain objects

Prerequisites
– Anytime algorithm
– Confidence measure

[Figure: constant data stream with arrival interval t_a; objects of type 1, type 2, …, type m; further labels t_f and t_d]
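
One way to picture confidence-based time distribution is a shared pool of refinement steps. This sketch is not the scheduling scheme from the cited work; the pool model, the function name `distribute_steps`, and the `confidence` callback are all assumptions, and it presumes that waiting items can be buffered.

```python
from typing import Callable, List


def distribute_steps(
    items: List[float],
    steps_per_arrival: int,
    threshold: float,
    confidence: Callable[[float, int], float],
) -> List[int]:
    """Return the number of refinement steps spent on each item.

    Every arrival adds `steps_per_arrival` (>= 1) steps to a shared pool.
    An item always gets one step, then keeps refining while it is still
    uncertain and the pool is not empty. Unused steps stay in the pool
    for later items.
    """
    pool = 0
    spent: List[int] = []
    for item in items:
        pool += steps_per_arrival
        used = 1
        pool -= 1
        while pool > 0 and confidence(item, used) < threshold:
            used += 1
            pool -= 1
        spent.append(used)
    return spent
```

In this model two confident items that stop after a single step leave enough steps in the pool for a later uncertain item to exceed its own arrival interval.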

Page 8: Presentation ucb 2012

Existing anytime classification approaches

Anytime support vector machines

Anytime nearest neighbor classification

Anytime Bayesian classification
– Categorical data
– Continuous data

Others
– Anytime induction of decision trees
– Anytime A* algorithm
– Anytime clustering
– Anytime outlier detection

[References on last slide.]

Page 9: Presentation ucb 2012

Sampling, buffering, anytime clustering

What about sampling?
– Not appropriate for classification or outlier detection.

What about buffering?
– Durations of bursts are unknown.

Why anytime clustering?
– “Smart buffering”
– Use micro-clusters as input for further analysis
– Provide constant (maximal) granularity at regular intervals

Page 10: Presentation ucb 2012

Agenda

I. The Anytime principle: Anytime algorithms for stream data mining

II. The ClusTree: Self-adaptive anytime stream clustering

III. The MOA Framework: An open source framework for stream mining algorithms

Page 11: Presentation ucb 2012

Problem statement

Clustering is a frequently used technique
– Provides an overview, reduces amount of data, groups similar objects

Streaming scenario:
– Use summaries (micro-clusters) as input for further analysis
– But: endless amounts of data (streams) are hard to handle

Stream clustering challenges:
– Single pass clustering
– Limited time, varying time allowance
– Limited memory, yet least information loss
– Evolving data
– Flexible number and size of clusters

[Labels: Self-adaptive, Anytime, Drift & Novelty, Fine grained]

Page 12: Presentation ucb 2012

Related work

Stream clustering approaches and paradigms
– Convex clustering approaches (k-center)
– Density-based, grid-based approaches
– Kernels, graphs, fractal dimensions, …
– Process chunks, merge results
– Maintain list, remove oldest or merge closest pair
– Online and offline component

All approaches have to restrict themselves to the worst-case time

Page 13: Presentation ucb 2012

Goals

Anytime clustering
– Don’t miss any point, no matter at which speed

Adaptive model size
– Don’t restrict the model to worst-case assumptions

Fine-grained representation
– Provide more detailed input for the offline component

Compatible with existing work on drift and novelty
– Aging / decay
– Snapshots / drift & novelty

[Labels: Self-adaptive, Anytime, Drift & Novelty, Fine grained]

Page 14: Presentation ucb 2012

ClusTree – basic idea

Cluster features CF = (N, LS, SS) represent micro-clusters
– Allow computing statistics such as mean and variance

Maintain a balanced hierarchical data structure
– Insert each new object into the closest subtree
– Insertion stops if the next object arrives
– The most detailed model is stored at leaf level
– The tree (= model) grows if more time is available

[Figure: tree with labels "less time" and "more time"]
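
As a reminder of how mean and variance follow from a cluster feature, the standard per-dimension formulas are given below. Reading LS as the linear sum and SS as the squared sum of the N objects is the usual interpretation and is assumed here.

```latex
\mu_i = \frac{LS_i}{N}, \qquad
\sigma_i^2 = \frac{SS_i}{N} - \left(\frac{LS_i}{N}\right)^2, \qquad i = 1, \dots, d
```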

Page 15: Presentation ucb 2012

ClusTree structure and anytime insert

[Figure: layout of an inner entry and a leaf entry. Each stores n(t), LS1…LSd(t) and SS1…SSd(t); the inner entry additionally stores the same fields for its buffer b.]

Hierarchy of micro-clusters CF = (N, LS, SS)

New objects $(x_1, \dots, x_d)$ are simply added to the cluster feature:

$N = N + 1,\quad LS_i = LS_i + x_i,\quad SS_i = SS_i + x_i^2$

Anytime insert: buffer the object locally in a buffer CF

[Labels: Anytime, Fine grained]
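
The additive update of a cluster feature can be sketched in a few lines. The class name `ClusterFeature` and its methods are assumptions for illustration; only the update rule itself comes from the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterFeature:
    """Cluster feature (N, LS, SS) for d-dimensional objects."""
    n: int
    ls: List[float]
    ss: List[float]

    @classmethod
    def empty(cls, d: int) -> "ClusterFeature":
        return cls(0, [0.0] * d, [0.0] * d)

    def insert(self, x: List[float]) -> None:
        """N = N + 1, LS_i = LS_i + x_i, SS_i = SS_i + x_i^2."""
        self.n += 1
        for i, xi in enumerate(x):
            self.ls[i] += xi
            self.ss[i] += xi * xi

    def merge(self, other: "ClusterFeature") -> None:
        """Cluster features are additive, so merging is a component-wise sum."""
        self.n += other.n
        for i in range(len(self.ls)):
            self.ls[i] += other.ls[i]
            self.ss[i] += other.ss[i]

    def mean(self) -> List[float]:
        """Per-dimension mean; requires at least one inserted object."""
        return [v / self.n for v in self.ls]
```

Because the components are plain sums, a buffered cluster feature can later be merged into another entry without revisiting the original objects.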

Page 16: Presentation ucb 2012

Buffer and hitchhiker

Buffer: interrupt insertion – aggregate objects on interrupt

Hitchhiker: resume insertion – take buffer along (if same way)

At most two objects descend together

Tree grows through splitting nodes, starting from the leaf

Entry structure: (CF, pointer, CF_b)

[Figure: example descent. Level 1: root; level 2: hitchhike; level 3: buffer; level 4: insert]

[Label: Self-adaptive]
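
The interplay of buffer and hitchhiker can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the interruption is modeled by a level limit `max_levels` instead of an arriving object, leaf entries simply absorb the object, and splitting and decay are left out. All class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CF:
    """Cluster feature (N, LS, SS)."""
    n: float
    ls: List[float]
    ss: List[float]

    @staticmethod
    def empty(d: int) -> "CF":
        return CF(0.0, [0.0] * d, [0.0] * d)

    @staticmethod
    def of(x: List[float]) -> "CF":
        return CF(1.0, list(x), [v * v for v in x])

    def add(self, other: "CF") -> None:
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss = [a + b for a, b in zip(self.ss, other.ss)]

    def mean(self) -> List[float]:
        return [v / self.n for v in self.ls]


@dataclass
class Entry:
    cf: CF                          # summarizes subtree plus buffer
    buffer: CF                      # CF_b: objects parked on interruption
    child: Optional["Node"] = None  # None marks a leaf entry


@dataclass
class Node:
    entries: List[Entry] = field(default_factory=list)


def closest(node: Node, cf: CF) -> Entry:
    """Entry whose mean is nearest to the mean of cf (entries must be non-empty)."""
    m = cf.mean()
    return min(node.entries,
               key=lambda e: sum((a - b) ** 2 for a, b in zip(e.cf.mean(), m)))


def anytime_insert(root: Node, x: List[float], max_levels: int) -> int:
    """Descend at most max_levels (>= 1) levels; return the levels visited."""
    d = len(x)
    obj, hitch = CF.of(x), CF.empty(d)
    node, level = root, 0
    while True:
        level += 1
        target = closest(node, obj)
        if hitch.n > 0:
            h_target = closest(node, hitch)
            h_target.cf.add(hitch)
            if h_target is not target:      # different way: drop the hitchhiker
                if h_target.child is not None:
                    h_target.buffer.add(hitch)
                hitch = CF.empty(d)
        target.cf.add(obj)
        if target.child is None:            # leaf entry absorbs the object
            return level
        if level >= max_levels:             # interrupted: park in the buffer
            target.buffer.add(obj)
            target.buffer.add(hitch)
            return level
        hitch.add(target.buffer)            # resume: take the buffer along
        target.buffer = CF.empty(d)
        node = target.child
```

After every call each inner entry's CF still equals the sum of its child entries plus its own buffer, which is the property the descent relies on when an object stops early.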

Page 17: Presentation ucb 2012

Maintaining an up-to-date view

Goal: Compatible with existing work on drift and novelty
– New leaf entries get a unique ID
– Aging by an exponential decay function $w(\Delta t) = \beta^{-\lambda \Delta t}$

Benefits of the employed decay function
– Avoid splits by reusing insignificant entries
– An entry’s CF still represents exactly its subtree and its buffer

Lemma 1 (ClusTree Invariant): For each inner entry $e_s$ with timestamp $t + \Delta t$ and decay function $w(\Delta t) = 2^{-\lambda \Delta t}$ it holds

$$e_s.CF^{(t+\Delta t)} = w(\Delta t) \left( \sum_{i=1}^{\nu_s} e_{s_i}.CF^{(t)} \right) + e_s.\mathit{buffer}^{(t+\Delta t)}$$

where $e_{s_1}, \dots, e_{s_{\nu_s}}$ are the entries in the child node of $e_s$.

[Proof in the paper.]

[Label: Drift & Novelty]
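
The exponential decay can be applied directly to the components of a cluster feature. The sketch below assumes a cluster feature stored as a plain tuple and uses the hypothetical names `decay_weight` and `decay_cf`; the default base 2 follows the lemma on this slide.

```python
from typing import List, Tuple

CF = Tuple[float, List[float], List[float]]  # (N, LS, SS)


def decay_weight(dt: float, lam: float, beta: float = 2.0) -> float:
    """w(dt) = beta^(-lam * dt); with beta = 2 the weight halves every 1/lam."""
    return beta ** (-lam * dt)


def decay_cf(cf: CF, dt: float, lam: float, beta: float = 2.0) -> CF:
    """Scale every component of the cluster feature by the decay weight."""
    w = decay_weight(dt, lam, beta)
    n, ls, ss = cf
    return (w * n, [w * v for v in ls], [w * v for v in ss])
```

Since all components are scaled by the same factor, the mean LS/N of an entry is unchanged by aging; only its weight shrinks, which is what makes old entries insignificant over time.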

Page 18: Presentation ucb 2012

Extensions of the ClusTree

Insertion of aggregates for extremely fast streams

Iterative depth-first descent for slower streams

Local look-ahead to reduce overlapping

Explicit noise handling and noise-to-cluster events

[Figure: four panels, a) to d)]

Page 19: Presentation ucb 2012

Evaluation – anytime clustering and aggregation

Anytime clustering (90,000 points per second, pps)
– 88% purity on leaf level
– Purity on higher levels corresponds to faster streams
– >70% purity starting three levels under the root

Aggregation (varying streams)
– Purity drops under 70% at 150,000 pps
– Aggregation significantly improves the purity on the leaf level

[Figure: results on the Forest Covertype data set]
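
The slides do not state how purity is aggregated, so the following sketch uses one common definition as an assumption: the share of objects that carry the majority class label of their cluster. The function name `purity` is hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def purity(clusters: Sequence[Sequence[Hashable]]) -> float:
    """Weighted purity over clusters given as sequences of class labels.

    Requires at least one non-empty cluster.
    """
    total = sum(len(labels) for labels in clusters)
    majority = sum(Counter(labels).most_common(1)[0][1]
                   for labels in clusters if labels)
    return majority / total
```

For example, two clusters with labels `["a", "a", "b"]` and `["b", "b"]` contain four majority-label objects out of five.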

Page 20: Presentation ucb 2012

Evaluation – adaptive clustering

Setup for constant streams
– ClusTree: stream speed → maintainable #MC (number of micro-clusters)
– DenStream [SDM06] & CluStream [VLDB03]: #MC → processable pps

ClusTree results: #MC is exponential (#dists, the number of distance computations, is logarithmic)
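
One way to read the exponential versus logarithmic relationship is through the arithmetic of a balanced tree. The sketch assumes a tree in which every node holds at most `fanout` entries and an insertion compares the object with every entry on one root-to-leaf path; the fanout values are hypothetical and not taken from the slides.

```python
def max_leaf_entries(fanout: int, height: int) -> int:
    """Upper bound on micro-clusters at leaf level: fanout^height."""
    return fanout ** height


def max_distance_computations(fanout: int, height: int) -> int:
    """Upper bound on distance computations for one full descent."""
    return fanout * height
```

Under these assumptions, adding one level multiplies the possible number of micro-clusters by the fanout but adds only one node's worth of distance computations per insertion.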

Page 21: Presentation ucb 2012

Agenda

I. The Anytime principle: Anytime algorithms for stream data mining

II. The ClusTree: Self-adaptive anytime stream clustering

III. The MOA Framework: An open source framework for stream mining algorithms

Page 22: Presentation ucb 2012

The MOA framework

Extensible open source software
– Data generators, file streams
– Stream mining algorithms
– Measure collection

Supported stream mining tasks
– Stream clustering, stream classification, outlier detection, …

Repeatable/benchmark settings

In collaboration with [Figure: logos]

Page 23: Presentation ucb 2012

References

Anytime SVM:
– DeCoste: Anytime Query-Tuned Kernel Machines via Cholesky Factorization. SDM, 2003
– DeCoste et al.: Fast query-optimized kernel machine classification via incremental approximate nearest support vectors. ICML, 2003

Bayes (continuous data):
– Seidl et al.: Indexing density models for incremental learning and anytime classification on data streams. EDBT, 2009

Bayes (categorical):
– Yang et al.: Classifying under computational resource constraints: anytime classification using probabilistic estimators. Machine Learning, 2007

Anytime Nearest Neighbor:
– Ueno et al.: Anytime Classification Using the Nearest Neighbor Algorithm with Applications to Stream Mining. ICDM, 2006

Anytime + constant:
– Kranen et al.: Harnessing the strengths of anytime algorithms for constant data streams. DMKD Journal, 2009

ClusTree:
– Kranen et al.: Self-Adaptive Anytime Stream Clustering. ICDM, 2009

A complete list of references including stream clustering, MOA, evaluation, etc.:
– Kranen: Anytime Algorithms for Stream Data Mining. PhD Thesis, RWTH Aachen, 2011