Presentation ucb 2012
The ClusTree: Indexing Micro-Clusters for Anytime Stream Mining
Philipp Kranen¹, Ira Assent², Corinna Baldauf¹, Thomas Seidl¹
¹ Data Management and Data Exploration Group, RWTH Aachen University, Germany
² Department of Computer Science, Aarhus University, Denmark
[Figure: classification example with the outcomes "normal" and "emergency", involving a pre-classifier, a full classifier and a professional decision]
P. Kranen, I. Assent, C. Baldauf, T. Seidl – The ClusTree: Indexing Micro-Clusters for Anytime Stream Mining
Motivating examples
Applications and tasks
– Classification
– Modeling
– Outlier detection
[Figure: applications arranged by varying data rate and constant data rate]
Agenda
I. The Anytime principle: Anytime algorithms for stream data mining
II. The ClusTree: Self-adaptive anytime stream clustering
III. The MOA Framework: An open source framework for stream mining algorithms
Stream: A stream $S$, mapping each index $i$ to a pair $(o_i, t_i)$, is an infinite sequence of objects $o_i$ from a $d$-dimensional input space, where $t_i$ is the discrete arrival time of object $o_i$.
Inter-arrival time: The inter-arrival time between two consecutive objects $o_i$ and $o_{i+1}$ is denoted as $\Delta t_i$, i.e. $\Delta t_i = t_{i+1} - t_i$ with $0 \le \Delta t_i$.
Constant and varying streams: A stream is called constant $\Leftrightarrow \Delta t_i = \Delta t_j \ \forall i, j$; otherwise it is varying.
Stream algorithms
– Online algorithms – the input is given one at a time
– Budget algorithms – tailored to a specific time budget b
– Anytime algorithms – provide a result after any amount of processing time
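The definitions of inter-arrival time and constant streams above can be checked on a finite prefix of a stream. The following is a minimal sketch; the function names and the tolerance parameter `tol` are illustrative assumptions, not part of the slides.

```python
from typing import List


def inter_arrival_times(arrival_times: List[float]) -> List[float]:
    """Return the differences between consecutive arrival times."""
    return [b - a for a, b in zip(arrival_times, arrival_times[1:])]


def is_constant(arrival_times: List[float], tol: float = 1e-9) -> bool:
    """A finite prefix looks constant if all inter-arrival times agree.

    `tol` is an assumed tolerance for floating-point timestamps.
    """
    deltas = inter_arrival_times(arrival_times)
    return all(abs(d - deltas[0]) <= tol for d in deltas)
```

A real stream is infinite, so such a check can only ever say that the prefix seen so far is consistent with a constant stream.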
Definitions I
For a given input an anytime algorithm can provide a first result after a very short initialization time and it uses additional time to improve its result. The algorithm is interruptible after any time and will deliver the best result obtained until the point of interruption.
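To make the definition concrete, here is a minimal sketch of an interruptible algorithm, using a linear nearest-neighbour scan as a stand-in task. The names `anytime_nearest`, `step_budget` and the callback-style interrupt are assumptions chosen for illustration; the slides do not prescribe an interface.

```python
import math
from typing import Callable, Sequence, Tuple

Point = Tuple[float, ...]


def distance(p: Point, q: Point) -> float:
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def anytime_nearest(query: Point, data: Sequence[Point],
                    interrupted: Callable[[], bool]) -> Point:
    """Return the best candidate found up to the point of interruption."""
    best = data[0]                      # first result after a short initialization
    best_dist = distance(query, best)
    for candidate in data[1:]:
        if interrupted():               # the caller may stop us after any step
            break
        d = distance(query, candidate)
        if d < best_dist:               # additional time can only improve the result
            best, best_dist = candidate, d
    return best


def step_budget(steps: int) -> Callable[[], bool]:
    """Hypothetical interrupt source: allows `steps` refinement steps."""
    remaining = [steps]

    def check() -> bool:
        remaining[0] -= 1
        return remaining[0] < 0

    return check
```

In a streaming setting the interrupt would come from the arrival of the next object rather than from a step counter; the counter only keeps the sketch deterministic.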
Budget algorithms – tailored to a specific time budget
– Available time < budget → no result
– Available time > budget → idle times
How should stream processing be done?
– Little time → fast result
– More time → use it to improve the result
Anytime algorithms – provide a result after any time
Definitions II
[Figure: result quality over time]
Can we do better than using all available time?
Yes we can!
Distribute computation time according to confidence values
– Spend less time on confident items
– Use additional time for uncertain objects
Prerequisites
– Anytime algorithm
– Confidence measure
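The idea of distributing time by confidence could be sketched as follows. This is an illustrative scheduling loop, not the method from the cited work: `distribute_time`, the `refine` callback (one anytime improvement step that returns the new confidence) and the step-based budget are all assumptions.

```python
import heapq
from typing import Callable, Dict


def distribute_time(confidences: Dict[str, float],
                    refine: Callable[[str, float], float],
                    extra_steps: int) -> Dict[str, int]:
    """Spend `extra_steps` refinement steps, always on the least confident item.

    Returns how many steps each item received.
    """
    spent = {item: 0 for item in confidences}
    heap = [(conf, item) for item, conf in confidences.items()]
    heapq.heapify(heap)
    if not heap:
        return spent
    for _ in range(extra_steps):
        conf, item = heapq.heappop(heap)          # currently most uncertain item
        spent[item] += 1
        heapq.heappush(heap, (refine(item, conf), item))
    return spent
```

Items that are already confident receive no extra steps, while uncertain items are refined until they overtake the others or the budget runs out.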
Anytime algorithms on constant streams
[Figure: constant data stream carrying object types 1, 2, …, m; arrival interval ta, with further time marks tf and td]
Anytime support vector machines
Anytime nearest neighbor classification
Anytime Bayesian classification
– Categorical data
– Continuous data
Others
– Anytime induction of decision trees
– Anytime A* algorithm
– Anytime clustering
– Anytime outlier detection
[References on last slide.]
Existing anytime classification approaches
What about sampling?
Not appropriate for classification or outlier detection.
What about buffering?
Durations of bursts are unknown.
Why anytime clustering?
…
“Smart buffering”
Use micro‐clusters as input for further analysis
Provide constant (maximal) granularity at regular intervals
Sampling, buffering, anytime clustering
Agenda
I. The Anytime principle: Anytime algorithms for stream data mining
II. The ClusTree: Self-adaptive anytime stream clustering
III. The MOA Framework: An open source framework for stream mining algorithms
Problem statement
Clustering is a frequently used technique
– Provides an overview, reduces amount of data, groups similar objects
Streaming scenario:
– Use summaries (micro-clusters) as input for further analysis
– But: endless amounts of data (streams) are hard to handle
Stream clustering challenges:
– Single pass clustering
– Limited time, varying time allowance
– Limited memory, yet least information loss
– Evolving data
– Flexible number and size of clusters
Labels: Self-adaptive, Anytime, Drift & Novelty, Fine grained
Related work
Stream clustering approaches and paradigms
– Convex clustering approaches (k-center)
– Density-based, grid-based approaches
– Kernels, graphs, fractal dimensions, …
– Process chunks, merge results
– Maintain list, remove oldest or merge closest pair
– Online and offline component
All approaches have to restrict themselves to the worst-case time
Goals
Anytime clustering: don't miss any point, no matter at which speed
Adaptive model size: don't restrict the model to worst-case assumptions
Fine grained representation: provide more detailed input for the offline component
Compatible with existing work on drift and novelty
– Aging / Decay
– Snapshots / Drift & Novelty
Labels: Self-adaptive, Anytime, Drift & Novelty, Fine grained
ClusTree – basic idea
Cluster features CF = (N, LS, SS) represent micro-clusters
– Allow computing statistics like mean and variance
Maintain a balanced hierarchical data structure
– Insert new object into the closest subtree
– Insertion stops if next object arrives
– Most detailed model is stored at leaf level
– Tree (= model) grows if more time is available
[Figure: insertion depth with less time vs. more time]
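As a minimal sketch of how a cluster feature CF = (N, LS, SS) yields mean and variance, consider the class below. The class and method names are illustrative assumptions; only the triple of count, linear sum and squared sum comes from the slides.

```python
from typing import List, Sequence


class ClusterFeature:
    """Cluster feature (N, LS, SS) of a micro-cluster in d dimensions."""

    def __init__(self, d: int):
        self.n = 0.0            # N: number of objects
        self.ls = [0.0] * d     # LS: per-dimension linear sum
        self.ss = [0.0] * d     # SS: per-dimension squared sum

    def add(self, x: Sequence[float]) -> None:
        """Absorb one object by updating the three summaries."""
        self.n += 1
        for i, xi in enumerate(x):
            self.ls[i] += xi
            self.ss[i] += xi * xi

    def mean(self) -> List[float]:
        return [ls_i / self.n for ls_i in self.ls]

    def variance(self) -> List[float]:
        """Per-dimension variance: SS_i / N - (LS_i / N)^2."""
        return [ss_i / self.n - (ls_i / self.n) ** 2
                for ls_i, ss_i in zip(self.ls, self.ss)]
```

For example, after adding the points (1, 2) and (3, 6) the mean is (2, 4) and the per-dimension variance is (1, 4), without the points themselves being stored.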
ClusTree structure and anytime insert
[Figure: node layout with inner entries and leaf entries; each entry stores n(t), LS1…LSd(t) and SS1…SSd(t), with additional boxes labelled b]
Hierarchy of micro-clusters CF = (N, LS, SS)
New objects (x_1 … x_d) are simply added to the cluster feature:
$N = N + 1, \quad LS_i = LS_i + x_i, \quad SS_i = SS_i + x_i^2$
Anytime insert: buffer object locally in a local buffer CF
Anytime
Fine grained
Buffer and hitchhiker
Buffer: interrupt insertion – aggregate objects on interrupt
Hitchhiker: resume insertion – take buffer along (if same way)
At most two objects descend together
Tree grows through splitting nodes starting from the leaf
Self-adaptive
[Figure: descent example. Level 1: root; Level 2: hitchhike; Level 3: buffer; Level 4: insert]
Entry structure: (CF, pointer, CFb)
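The buffer and hitchhiker mechanism can be sketched roughly as follows. This is a simplified illustration under several assumptions: the tree shape is fixed (no node splits or new leaf entries), an object reaching the leaf level is merged into the closest leaf entry, and the interrupt is modelled by a `max_levels` parameter instead of the arrival of the next object. All class and function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence


class CF:
    """Additive cluster feature (N, LS, SS)."""
    def __init__(self, d: int):
        self.n, self.ls, self.ss = 0.0, [0.0] * d, [0.0] * d

    @classmethod
    def of(cls, x: Sequence[float]) -> "CF":
        cf = cls(len(x))
        cf.n, cf.ls, cf.ss = 1.0, list(x), [v * v for v in x]
        return cf

    def merge(self, other: "CF") -> None:
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss = [a + b for a, b in zip(self.ss, other.ss)]

    def mean(self) -> List[float]:
        return [v / self.n for v in self.ls]


class Entry:
    """Entry structure (CF, pointer, CFb): summary, child node, local buffer."""
    def __init__(self, cf: CF, child: Optional["Node"] = None):
        self.cf, self.child, self.buffer = cf, child, CF(len(cf.ls))


class Node:
    def __init__(self, entries: List[Entry]):
        self.entries = entries


def closest(node: Node, cf: CF) -> Entry:
    """Entry whose mean is nearest to the mean of `cf` (entries are non-empty)."""
    target = cf.mean()
    return min(node.entries, key=lambda e: math.sqrt(
        sum((a - b) ** 2 for a, b in zip(e.cf.mean(), target))))


def anytime_insert(root: Node, x: Sequence[float], max_levels: int) -> None:
    """Descend at most `max_levels` levels, then park the object in a buffer."""
    obj, hitch = CF.of(x), None
    node, level = root, 1
    while True:
        entry = closest(node, obj)
        if hitch is not None:
            h_entry = closest(node, hitch)
            h_entry.cf.merge(hitch)
            if h_entry is not entry:            # different way: leave hitchhiker behind
                if h_entry.child is not None:
                    h_entry.buffer.merge(hitch)
                hitch = None
        entry.cf.merge(obj)                     # summary always covers the new object
        if entry.child is None:                 # leaf level reached: insertion done
            return
        if level >= max_levels:                 # interrupted: aggregate in local buffer
            entry.buffer.merge(obj)
            if hitch is not None:
                entry.buffer.merge(hitch)
            return
        if entry.buffer.n > 0:                  # pick up waiting buffer as hitchhiker
            if hitch is None:
                hitch = entry.buffer
            else:
                hitch.merge(entry.buffer)
            entry.buffer = CF(len(x))
        node, level = entry.child, level + 1
```

In this sketch each inner entry's CF always equals the sum of its child entries' CFs plus its own buffer, so an interrupted insertion loses no information; a later, slower insertion carries the buffered aggregate further down.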
Maintaining an up-to-date view
Goal: compatible with existing work on drift and novelty
– New leaf entries get a unique ID
– Aging by an exponential decay function $w(\Delta t) = \beta^{-\lambda \Delta t}$
Benefits of the employed decay function
– Avoid splits by reusing insignificant entries
– An entry's CF still represents exactly its subtree and its buffer
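Lemma 1 (ClusTree Invariant): For each inner entry $e_s$ with timestamp $t + \Delta t$ and decay function $w(\Delta t) = 2^{-\lambda \Delta t}$ it holds
$e_s.CF^{(t+\Delta t)} = w(\Delta t) \left( \sum_{i=1}^{\nu_s} e_{s \circ i}.CF^{(t)} \right) + e_s.\mathit{buffer}^{(t+\Delta t)}$
[Proof in the paper.]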
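The exponential decay can be illustrated with a small sketch. It assumes that aging multiplies every component of a cluster feature by the weight w(Δt) and represents a CF as a plain tuple; the function names are illustrative, and β = 2 is used as the default to match the lemma.

```python
from typing import List, Tuple

CFTuple = Tuple[float, List[float], List[float]]   # (N, LS, SS)


def decay_weight(dt: float, lam: float, beta: float = 2.0) -> float:
    """Decay function w(dt) = beta ** (-lam * dt)."""
    return beta ** (-lam * dt)


def decay_cf(cf: CFTuple, dt: float, lam: float, beta: float = 2.0) -> CFTuple:
    """Age a cluster feature by scaling N, LS and SS with w(dt)."""
    w = decay_weight(dt, lam, beta)
    n, ls, ss = cf
    return (n * w, [v * w for v in ls], [v * w for v in ss])


def add_cf(a: CFTuple, b: CFTuple) -> CFTuple:
    """Component-wise sum of two cluster features."""
    return (a[0] + b[0],
            [x + y for x, y in zip(a[1], b[1])],
            [x + y for x, y in zip(a[2], b[2])])
```

Because the weight is a single scalar factor, decaying a sum of cluster features gives the same result as summing the decayed features. This is the property that lets a parent's CF stay consistent with its children when all of them are aged by the same Δt. With β = 2, a weight halves every 1/λ time units.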
Drift&Novelty
Extensions of the ClusTree
Insertion of aggregates for extremely fast streams
Iterative depth first descent for slower streams
Local look ahead to reduce overlapping
Explicit noise handling and noise to cluster events
[Figure: panels a) to d)]
Evaluation – anytime clustering and aggregation
Anytime clustering (90,000 pps)
– 88% purity on leaf level
– Purity on higher levels corresponds to faster streams
– >70% purity starting three levels under root
Aggregation (varying streams)
– Purity drops under 70% at 150,000 pps
– Aggregation significantly improves the purity on the leaf level
Data set: Forest Covertype
Evaluation – adaptive clustering
Setup for constant streams
– ClusTree: stream speed → maintainable #MC
– DenStream [SDM06] & CluStream [VLDB03]: #MC → processable pps
ClusTree results: #MC is exponential (#dists is logarithmic)
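One way to read the result above, as a sketch under assumptions not stated on the slide: suppose the tree is balanced with height $h$ and every node holds at most $M$ entries ($M$ and $h$ are illustrative symbols). Then the leaf level holds at most $M^h$ micro-clusters, while a single insertion compares the object with at most $M$ entries per level:

```latex
\#MC \le M^{h}
\qquad\text{and}\qquad
\#dists \le M \cdot h = M \cdot \log_M\!\left(M^{h}\right)
```

Under these assumptions the model size grows exponentially with the tree height, while the number of distance computations per insertion grows only linearly in the height, that is, logarithmically in the number of micro-clusters.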
Agenda
I. The Anytime principle: Anytime algorithms for stream data mining
II. The ClusTree: Self-adaptive anytime stream clustering
III. The MOA Framework: An open source framework for stream mining algorithms
Extensible open source software
– Data generators, file streams
– Stream mining algorithms
– Measure collection
Supported stream mining tasks
– Stream clustering, stream classification, outlier detection, …
Repeatable/benchmark settings
In collaboration with
The MOA framework
References
Anytime SVM: DeCoste: Anytime Query-Tuned Kernel Machines via Cholesky Factorization. SDM, 2003
DeCoste et al.: Fast query-optimized kernel machine classification via incremental approximate nearest support vectors. ICML, 2003
Bayes (continuous data): Seidl et al.: Indexing density models for incremental learning and anytime classification on data streams. EDBT, 2009
Bayes (categorical): Yang et al.: Classifying under computational resource constraints: anytime classification using probabilistic estimators. Machine Learning, 2007
Anytime Nearest Neighbor: Ueno et al.: Anytime Classification Using the Nearest Neighbor Algorithm with Applications to Stream Mining. ICDM, 2006
Anytime + constant: Kranen et al.: Harnessing the strengths of anytime algorithms for constant data streams. DMKD Journal, 2009
ClusTree: Kranen et al.: Self-Adaptive Anytime Stream Clustering. ICDM, 2009
A complete list of references including stream clustering, MOA, evaluation, etc.:
Kranen: Anytime Algorithms for Stream Data Mining. PhD Thesis, RWTH Aachen, 2011