Presentation ucb 2012

The ClusTree: Indexing Micro-Clusters for Anytime Stream Mining

Philipp Kranen (1), Ira Assent (2), Corinna Baldauf (1), Thomas Seidl (1)

(1) Data Management and Data Exploration Group, RWTH Aachen University, Germany

(2) Department of Computer Science, Aarhus University, Denmark

[Figure labels: emergency, professional decision, full classifier, preclassifier, normal]

P. Kranen, I. Assent, C. Baldauf, T. Seidl – The ClusTree: Indexing Micro-Clusters for Anytime Stream Mining

Motivating examples


Applications and tasks

Classification Modeling

Outlier detection

[Figure labels: varying data rate, constant data rate]


Agenda

I. The Anytime principle: Anytime algorithms for stream data mining

II. The ClusTree: Self-adaptive anytime stream clustering

III. The MOA Framework: An open source framework for stream mining algorithms

Stream: A stream $S$ is an infinite sequence of objects $o_i \in D$ from a d-dimensional input space $D$, where $t_i \in T$, $\forall i$, is the discrete arrival time of object $o_i$.

Inter-arrival time: The inter-arrival time between two consecutive objects $o_i$ and $o_{i+1}$ is denoted as $\Delta t_i = t_{i+1} - t_i$, i.e. $0 \le \Delta t_i \in T$.

Constant and varying streams

A stream is called constant ↔ $\Delta t_i = \Delta t_j\ \forall i, j$
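The definition of a constant stream can be checked on a finite prefix of arrival times. The following is a minimal sketch; the function names and the tolerance parameter `tol` are assumptions made for illustration and do not come from the slides.

```python
from typing import List, Sequence


def inter_arrival_times(arrival_times: Sequence[float]) -> List[float]:
    """Return the gaps between consecutive arrival times."""
    return [later - earlier
            for earlier, later in zip(arrival_times, arrival_times[1:])]


def is_constant(arrival_times: Sequence[float], tol: float = 1e-9) -> bool:
    """True if all inter-arrival times in this prefix are equal (up to tol)."""
    gaps = inter_arrival_times(arrival_times)
    return all(abs(g - gaps[0]) <= tol for g in gaps)


# Hypothetical example prefixes of two streams.
constant_stream = [0.0, 0.5, 1.0, 1.5]
varying_stream = [0.0, 0.1, 1.0, 1.2]
```

Since a stream is infinite, such a check can only confirm constancy for the objects seen so far.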

Stream algorithms

– Online algorithms – the input is given one at a time

– Budget algorithms – tailored to a specific time budget b

– Anytime algorithms – provide a result after any amount of processing time


Definitions I

For a given input an anytime algorithm can provide a first result after a very short initialization time and it uses additional time to improve its result. The algorithm is interruptible after any time and will deliver the best result obtained until the point of interruption.
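The interruptible behaviour described above can be sketched with a generator that always exposes its best result so far. This is an illustrative toy (a linear nearest-neighbour scan), not one of the algorithms from the talk; `anytime_nearest`, `run_with_budget` and the step-based budget are assumptions chosen so that the example is deterministic.

```python
from typing import Iterator, Optional, Sequence


def anytime_nearest(query: float, candidates: Sequence[float]) -> Iterator[float]:
    """Yield the best candidate found so far after every processing step."""
    best = candidates[0]          # first result after a very short initialization
    yield best
    for c in candidates[1:]:      # additional time improves the result
        if abs(c - query) < abs(best - query):
            best = c
        yield best


def run_with_budget(steps: Iterator[float], budget: int) -> Optional[float]:
    """Consume at most `budget` steps and return the last result obtained."""
    result = None
    for used, result in enumerate(steps, start=1):
        if used >= budget:        # interruption point
            break
    return result


candidates = [9.0, 1.0, 6.0, 5.5]
quick = run_with_budget(anytime_nearest(5.0, candidates), budget=1)
better = run_with_budget(anytime_nearest(5.0, candidates), budget=3)
full = run_with_budget(anytime_nearest(5.0, candidates), budget=100)
```

With more steps allowed, the returned candidate is never farther from the query than with fewer steps, which is the quality-over-time behaviour an anytime algorithm is meant to provide.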

Budget algorithms – tailored to a specific time budget

– Available time < budget → no result

– Available time > budget → idle times

How should stream processing be done?

– Little time → fast result

– More time → use it to improve the result

Anytime algorithms – provide a result after any time


Definitions II

[Figure: result quality over time]

Can we do better than using all available time?

Yes we can!

Distribute computation time according to confidence values

– Spend less time on confident items

– Use additional time for uncertain objects

Prerequisites

– Anytime algorithm

– Confidence measure


Anytime algorithms on constant streams

[Figure: constant data stream with object types 1, 2, …, m; labels: arrival interval ta, tf, td]

Anytime support vector machines

Anytime nearest neighbor classification

Anytime Bayesian classification

Categorical data

Continuous data

Others

Anytime induction of decision trees

Anytime A* algorithm

Anytime clustering

Anytime outlier detection

[References on last slide.]


Existing anytime classification approaches


What about sampling?

Not appropriate for classification or outlier detection.

What about buffering?

Durations of bursts are unknown.

Why anytime clustering?

…

“Smart buffering”

Use micro‐clusters as input for further analysis

Provide constant (maximal) granularity at regular intervals


Sampling, buffering, anytime clustering


Agenda






Problem statement

Clustering is a frequently used technique

– Provides an overview, reduces amount of data, groups similar objects

– Streaming scenario: use summaries (micro-clusters) as input for further analysis

– But: endless amounts of data (streams) are hard to handle

Stream clustering challenges

– Single pass clustering

– Limited time, varying time allowance

– Limited memory, yet least information loss

– Evolving data

– Flexible number and size of clusters

Self-adaptive, Anytime, Drift & Novelty, Fine grained


Related work

Stream clustering approaches and paradigms

– Convex clustering approaches (k-center)

– Density-based, grid-based approaches

– Kernels, graphs, fractal dimensions, …

– Process chunks, merge results

– Maintain list, remove oldest or merge closest pair

– Online and offline component

All approaches have to restrict themselves to the worst case time


Goals

Anytime clustering: don’t miss any point, no matter at which speed

Adaptive model size: don’t restrict model to worst case assumptions

Fine grained representation: provide more detailed input for offline component

Compatible with existing work on drift and novelty: aging / decay, snapshots / drift & novelty

Self-adaptive, Anytime, Drift & Novelty, Fine grained


ClusTree – basic idea

Cluster features CF = (N, LS, SS) represent micro-clusters

– Allow computing statistics like mean and variance

Maintain a balanced hierarchical data structure

– Insert new object into the closest subtree

– Insertion stops if next object arrives

– Most detailed model is stored at leaf level

– Tree (= model) grows if more time is available

[Figure labels: less time, more time]
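How mean and variance follow from a cluster feature CF = (N, LS, SS) can be sketched as follows. The function names are illustrative assumptions, and the variance shown is the per-dimension population variance SS/N − (LS/N)², which is one common choice.

```python
from typing import List


def cf_mean(n: float, ls: List[float]) -> List[float]:
    """Per-dimension mean: LS_i / N."""
    return [s / n for s in ls]


def cf_variance(n: float, ls: List[float], ss: List[float]) -> List[float]:
    """Per-dimension population variance: SS_i / N - (LS_i / N)^2."""
    return [q / n - (s / n) ** 2 for s, q in zip(ls, ss)]


# Hypothetical micro-cluster built from three 2-d points.
points = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
n = len(points)
ls = [sum(p[i] for p in points) for i in range(2)]       # linear sums
ss = [sum(p[i] ** 2 for p in points) for i in range(2)]  # squared sums

mean = cf_mean(n, ls)
variance = cf_variance(n, ls, ss)
```

The raw points are not needed once N, LS and SS are known, which is why a micro-cluster can summarize arbitrarily many objects in constant space.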


ClusTree structure and anytime insert

[Figure: structure of an inner entry and a leaf entry; each stores n(t), LS(t)1…LS(t)d and SS(t)1…SS(t)d, with buffer fields marked b]

Hierarchy of micro-clusters CF = (N, LS, SS)

New objects (x1 … xd) are simply added to the cluster feature:

$N = N + 1,\ LS_i = LS_i + x_i,\ SS_i = SS_i + x_i^2$

Anytime insert: buffer object locally in a local buffer CF

Anytime

Fine grained
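The update rule for adding an object to a cluster feature, and the idea of collecting objects in a local buffer CF, can be sketched as below. The class and method names (`ClusterFeature`, `insert`, `merge`) are assumptions for illustration; the slides only give the update rule itself.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ClusterFeature:
    n: int
    ls: List[float]
    ss: List[float]

    @classmethod
    def empty(cls, d: int) -> "ClusterFeature":
        return cls(0, [0.0] * d, [0.0] * d)

    def insert(self, x: Sequence[float]) -> None:
        """N = N + 1, LS_i = LS_i + x_i, SS_i = SS_i + x_i^2."""
        self.n += 1
        for i, xi in enumerate(x):
            self.ls[i] += xi
            self.ss[i] += xi * xi

    def merge(self, other: "ClusterFeature") -> None:
        """Cluster features are additive, so a buffer CF can be added later."""
        self.n += other.n
        for i in range(len(self.ls)):
            self.ls[i] += other.ls[i]
            self.ss[i] += other.ss[i]


cf = ClusterFeature.empty(2)
for x in ([1.0, 2.0], [3.0, 4.0]):
    cf.insert(x)

# An object parked in a local buffer CF is merged once time allows.
buffer_cf = ClusterFeature.empty(2)
buffer_cf.insert([5.0, 6.0])
cf.merge(buffer_cf)
```

Because the three components are plain sums, merging a buffer gives the same cluster feature as inserting the buffered objects one by one.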


Buffer and hitchhiker

Buffer: interrupt insertion – aggregate objects on interrupt

Hitchhiker: resume insertion – take buffer along (if same way)

Maximally two objects to descend with

Tree grows through splitting nodes starting from the leaf

Self-adaptive

[Figure: insertion example over four tree levels. Level 1: root, Level 2: hitchhike, Level 3: buffer, Level 4: insert; the destinations of the inserted objects are marked]

Entry structure: (CF, pointer, CFb)
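The buffer and hitchhiker mechanism can be sketched on a tree of fixed shape. This is a simplified illustration under several assumptions, not the authors' implementation: node splitting and decay are left out, an interrupt is modelled by a level budget `max_levels` instead of the arrival of the next object, distance is measured between means, and all names (`CF`, `Entry`, `Node`, `closest`, `anytime_insert`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CF:
    n: float
    ls: List[float]
    ss: List[float]

    @staticmethod
    def empty(d: int) -> "CF":
        return CF(0.0, [0.0] * d, [0.0] * d)

    @staticmethod
    def of(x: List[float]) -> "CF":
        return CF(1.0, list(x), [v * v for v in x])

    def add(self, other: "CF") -> None:
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss = [a + b for a, b in zip(self.ss, other.ss)]

    def mean(self) -> List[float]:
        return [v / self.n for v in self.ls]


@dataclass
class Entry:                         # entry structure: (CF, pointer, CFb)
    cf: CF
    child: Optional["Node"] = None   # None marks a leaf entry
    buffer: Optional[CF] = None

    def __post_init__(self) -> None:
        if self.buffer is None:
            self.buffer = CF.empty(len(self.cf.ls))


@dataclass
class Node:
    entries: List[Entry]


def closest(node: Node, cf: CF) -> Entry:
    m = cf.mean()
    return min(node.entries,
               key=lambda e: sum((a - b) ** 2 for a, b in zip(e.cf.mean(), m)))


def anytime_insert(root: Node, x: List[float], max_levels: int) -> None:
    obj, hitch, node, level = CF.of(x), None, root, 0
    while True:
        level += 1
        target = closest(node, obj)
        if hitch is not None:
            other = closest(node, hitch)
            if other is not target:          # different way: drop the hitchhiker
                other.cf.add(hitch)
                if other.child is not None:
                    other.buffer.add(hitch)
                hitch = None
        target.cf.add(obj)
        if hitch is not None:
            target.cf.add(hitch)
        if target.child is None:             # leaf level reached: done
            return
        if level >= max_levels:              # interrupt: park in the buffer
            target.buffer.add(obj)
            if hitch is not None:
                target.buffer.add(hitch)
            return
        if target.buffer.n > 0:              # take the buffer along
            if hitch is None:
                hitch = target.buffer
            else:
                hitch.add(target.buffer)
            target.buffer = CF.empty(len(x))
        node = target.child


leaf_a = Node([Entry(CF.of([0.0])), Entry(CF.of([2.0]))])
leaf_b = Node([Entry(CF.of([10.0])), Entry(CF.of([12.0]))])
root = Node([Entry(CF(2.0, [2.0], [4.0]), leaf_a),
             Entry(CF(2.0, [22.0], [244.0]), leaf_b)])

anytime_insert(root, [1.5], max_levels=1)   # interrupted at the root level
anytime_insert(root, [2.5], max_levels=2)   # picks up the buffered object
```

In this example the first object is parked in the buffer of the first root entry; the second insertion takes it along as a hitchhiker, and both end up in the same leaf entry. At every step the inner entry's CF equals the sum of its child entries plus its buffer.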


Maintaining an up-to-date view

Goal: compatible with existing work on drift and novelty

– New leaf entries get a unique ID

– Aging by an exponential decay function $w(\Delta t) = \beta^{-\lambda \Delta t}$

Benefits of the employed decay function

– Avoid splits by reusing insignificant entries

– An entry’s CF still represents exactly its subtree and its buffer
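The effect of the exponential decay function on a cluster feature can be sketched as follows, assuming that aging multiplies every CF component by the weight w(Δt). The function names and the tuple representation are illustrative assumptions; β defaults to 2 here.

```python
from typing import List, Tuple

CFTuple = Tuple[float, List[float], List[float]]  # (N, LS, SS)


def decay_weight(dt: float, lam: float, beta: float = 2.0) -> float:
    """w(dt) = beta ** (-lam * dt)."""
    return beta ** (-lam * dt)


def age(cf: CFTuple, dt: float, lam: float, beta: float = 2.0) -> CFTuple:
    """Scale all components of a cluster feature by the decay weight."""
    n, ls, ss = cf
    w = decay_weight(dt, lam, beta)
    return (w * n, [w * v for v in ls], [w * v for v in ss])


# Hypothetical 1-d micro-cluster aged by dt = 2 with lambda = 0.5.
old = (4.0, [8.0], [20.0])
aged = age(old, dt=2.0, lam=0.5)
```

With β = 2 the weight halves every 1/λ time units. Scaling all components by the same factor leaves the mean LS/N unchanged while the weight N shrinks, so entries that receive no new objects gradually become insignificant.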

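Lemma 1 (ClusTree Invariant): For each inner entry $e_s$ with timestamp $t + \Delta t$ and decay function $w(\Delta t) = 2^{-\lambda \Delta t}$ it holds

$$e_s.CF^{(t+\Delta t)} = w(\Delta t) \left( \sum_{i=1}^{\nu_s} e_{s_i}.CF^{(t)} \right) + e_s.\mathit{buffer}^{(t+\Delta t)}$$

where $\nu_s$ is the number of entries in the child node of $e_s$.

[Proof in the paper.]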

Drift&Novelty


Extensions of the ClusTree

Insertion of aggregates for extremely fast streams

Iterative depth first descent for slower streams

Local look ahead to reduce overlapping

Explicit noise handling and noise to cluster events

[Figure: panels a) to d)]


Evaluation – anytime clustering and aggregation

Anytime clustering (90,000 points per second, pps)

– 88% purity on leaf level

– Purity on higher levels corresponds to faster streams

– >70% purity starting three levels under root

Aggregation (varying streams)

– Purity drops under 70% at 150,000 pps

– Aggregation significantly improves the purity on the leaf level

Data set: Forest Covertype


Evaluation – adaptive clustering

Setup for constant streams

– ClusTree: stream speed → maintainable #MC

– DenStream [SDM06] & CluStream [VLDB03]: #MC → processable pps

ClusTree results: #MC is exponential (#dists is logarithmic)


Agenda





Extensible open source software

– Data generators, file streams

– Stream mining algorithms

– Measure collection

Supported stream mining tasks

– Stream clustering, stream classification, outlier detection, …

Repeatable/benchmark settings

In collaboration with


The MOA framework


References

Anytime SVM: DeCoste: Anytime Query-Tuned Kernel Machines via Cholesky Factorization. SDM, 2003

DeCoste et al.: Fast query-optimized kernel machine classification via incremental approximate nearest support vectors. ICML, 2003

Bayes (continuous data): Seidl et al.: Indexing density models for incremental learning and anytime classification on data streams. EDBT, 2009

Bayes (categorical): Yang et al.: Classifying under computational resource constraints: anytime classification using probabilistic estimators. Machine Learning, 2007

Anytime Nearest Neighbor: Ueno et al.: Anytime Classification Using the Nearest Neighbor Algorithm with Applications to Stream Mining. ICDM, 2006

Anytime + constant: Kranen et al.: Harnessing the strengths of anytime algorithms for constant data streams. DMKD Journal, 2009

ClusTree: Kranen et al.: Self-Adaptive Anytime Stream Clustering. ICDM, 2009

A complete list of references including stream clustering, MOA, evaluation, etc.:

Kranen: Anytime Algorithms for Stream Data Mining. PhD Thesis, RWTH Aachen, 2011
