Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun...

30
Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB Beyond Scaling: Real-Time Event Processing with Stream Mining Mikio L. Braun, @mikiobraun http://blog.mikiobraun.de TWIMPACT http://twimpact.com

Transcript of Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun...

Page 1: Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords

Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB

Beyond Scaling: Real-Time Event Processing with

Stream Mining

Mikio L. Braun, @mikiobraunhttp://blog.mikiobraun.de

TWIMPACThttp://twimpact.com

Page 2: Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords

Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB

Event Data

Attribution: flickr users kenteegardin, fguillen, torkildr, Docklandsboy, brewbooks, ellbrown, JasonAHowie

FinanceFinance GamingGaming MonitoringMonitoring

Social MediaSocial MediaSensor NetworksSensor NetworksAdvertismentAdvertisment

Page 3: Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords

Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB

Event Data is Huge: Volume

● The problem: You easily get A LOT OF DATA!– 100 events per second

– 360k events per hour

– 8.6M events per day

– 260M events per month

– 3.2B events per year

Page 4: Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords

Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB

Event Data is Huge: Diversity

● Potentially large spaces:– distinct words: >100k

– IP addresses: >100M

– users in a social network: >10M

http://www.flickr.com/photos/arenamontanus/269158554/http://wordle.net

Page 5: Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords

Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB

To Scale Or Not To Scale?

Page 6: Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords

Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB

Stream Mining to the rescue

● Stream mining algorithms:– answer “stream queries” with finite resources

● Typical examples:– how often does an item appear in a stream?

– how many distinct elements are in the stream?

– what are the top-k most frequent items?

Continuous Stream of Data

Bounded ResourceAnalyzer Stream Queries

Page 7: Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords

Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB

The Trade-Off

ExactFast

Big Data

Stream Mining Map Reduce and friends

First seen here: http://www.slideshare.net/acunu/realtime-analytics-with-apache-cassandra

Page 8: Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords

Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB

Heavy Hitters (a.k.a. Top-k)

● Count activities over large item sets (millions, even more, e.g. IP addresses, Twitter users)

● Interested in most active elements only.

Metwally, Agrawal, Abbadi, Efficient computation of Frequent and Top-k Elements in Data Streams, Internation Conference on Database Theory, 2005

132

142

432

553

712

023

15

12

8

5

3

2

Fixed tables of counts

Case 1: element already in data base

Case 2: new element

142 142 12 13

713 023 2

713 3

Page 9: Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords

Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB

Heavy Hitters over Time-Window

Time

DB

● Keep quite a big log (a month?)

● Constant write/erase in database

● Alternative: Exponential decay

Page 10: Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords

Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB

Exponential Decay

● Instead of a fixed window, use exponential decay

● The beauty: updates are recursivehalftime

score

timestamp

time shift term

Page 11: Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords

Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB

Exponential Decay

● Collect stats by a table of expdecay counterscounters[item] # counters

ts[item] # last timestamp

● update(C, item, timestamp, count) – update countsC.counters[item] = count + weight(timestamp, C.ts[item]) * C.counters[item]

C.ts[item] = timestampC.lastupdate = timestamp

● score(C, item) – return scorereturn weight(C.lastupdate, C.ts[item]) * C.counters[item]

Page 12: Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords

Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB

Count-Min Sketches

● Summarize histograms over large feature sets● Like bloom filters, but better

● Query: Take minimum over all hash functions

0 0 3 0

1 1 0 2

0 2 0 0

0 3 5 2

0 5 3 2

2 4 5 0

1 3 7 3

0 2 0 8

m bins

n differenthash functions

Updates for new entryQuery result: 1

G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. LATIN 2004, J. Algorithm 55(1): 58-75 (2005) .

Page 13: Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords

Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB

Wait a minute? Only Counting?

● Well, getting the top most active items is already useful.– Web analytics, Users, Trending Topics

● Counting is statistics!

Page 14: Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords

Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB

Counting is Statistics

● Empirical mean:

● Correlations:

● Principal Component Analysis:

Page 15: Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords

Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB

Example 1: Least Squares Regression

d

d

with entries

this could be huge!

d x d is probably ok

Idea: Batch method like least squares on recent portion of the data.

But: It's just a sum!

Page 16: Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords

Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB

Least Squares Regression

● Need to compute● For each do

● Then, reconstruct –

As a reminder:

Page 17: Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords

Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB

Example 2: Maximum-Likelihood

● Estimate probabilistic models

● But wait, how do I “1/n” with randomly spaced events?

based on

which is slightly biased, but simpler

Page 18: Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords

Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB

Example 3: Outlier detection

● Once you have a model, you can compute p-values (based on recent time frames!)

Page 19: Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords

Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB

Example 4: Clustering● Online clustering

– For each data point:● Map to closest centroid ( compute distances)⇒● Update centroid

– count-min sketches to represent sum over all vectors in a class

Aggarwal, A Framework for Clustering Massive-Domain Data Streams, IEEE International Conference on Data Engineering , 2009

0 0 3 01 1 0 2

0 2 0 00 3 5 2

0 5 3 22 4 5 0

1 3 7 30 2 0 8

Page 20: Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords

Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB

Example 5: TF-IDF

● estimate word – document frequencies

● for each word: update(word, t, 1.0)● for each document: update(“#docs”, t, 1.0)● query: score(word) / score(“#docs”)

Page 21: Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords

Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB

Example 6: Classification with Naïve Bayes● Naive Bayes is also just counting, right?

class priors

Multinomnial naïve Bayes

Priors

Total number of words in class

Number of times word appears in classfrequency of word in document

Page 22: Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords

Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB

Example 6: Classification with Naive Bayes

ICML 2003

Page 23: Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords

Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB

Example 6: Classification with Naive Bayes● 7 Steps to improve NB:

– transform TF to log( . + 1)

– IDF-style normalization

– square length normalization

– use complement probability

– another log

– normalize those weights again

– Predict linearly using those weights

Page 24: Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords

Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB

What about non-parametric methods and Kernel Methods?● Problem here, no real accumulation of information

in statistics, e.g. SVMs

● Could still use streamdrill to extract a representative subset.

sum over all elements!

Page 25: Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords

Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB

Streamdrill

● Heavy Hitters counting + exponential decay● Instant counts & top-k results over time

windows.● Indices!● Snapshots for historical analysis● Beta demo available at http://streamdrill.com,

launch imminent

Page 26: Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords

Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB

Architecture Overview

Page 27: Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords

Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB

Example: Twitter Stock Analysis

http://play.streamdrill.com/vis/

Page 28: Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords

Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB

Example: Twitter Stock Analysis

Page 29: Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords

Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB

Example: Twitter Stock Analysis

Page 30: Beyond Scaling: Real-Time Event Processing with Stream Mining - … · 2013. 6. 12. · Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords

Mikio Braun Beyond Scaling: Real-Time Event Processing with Stream Mining Berlin Buzzwords 2013 © by MB

Summary

● Doesn't always have to be scaling!● Stream mining: Approximate results with

finite resources.● streamdrill: stream analysis engine