MLconf NYC Edo Liberty

32
Streaming Data Mining PRESENTED BY Edo Liberty April 11, 2014 Copyright © 2014 Yahoo! All rights reserved. No reproduction or distribution allowed without express written permission. Parts of this presentation were given with Jelani Nelson (Harvard) as a KDD tutorial on streaming data mining.

description

 

Transcript of MLconf NYC Edo Liberty

Page 1: MLconf NYC Edo Liberty

St reaming Da ta M in ing

PRESENTED BY Edo Liberty⎪ April 11, 2014

Copyright © 2014 Yahoo! All rights reserved. No reproduction or distribution allowed without express written permission.

Parts of this presentation were given with Jelani Nelson (Harvard) as a KDD tutorial on streaming data mining.

Page 2: MLconf NYC Edo Liberty

2 Yahoo Confidential & Proprietary

Data

Computation Result

The World

Single machine data mining

Page 3: MLconf NYC Edo Liberty

3 Yahoo Confidential & Proprietary

Data Data Data Data

Computation Result

The World

Distributed storage

Page 4: MLconf NYC Edo Liberty

4 Yahoo Confidential & Proprietary

Data + Compute

Data + Compute

Data + Compute

Data + Compute

Computation Result

The World

Data + Compute

Data + Compute

Data + Compute

Data + Compute

Distributed model (map/reduce, message passing, …)

Page 5: MLconf NYC Edo Liberty

5 Yahoo Confidential & Proprietary

Data + Compute

Data + Compute

Data + Compute

Data + Compute

Computation Result

The World

Data + Compute

Data + Compute

Data + Compute

Data + Compute

Computation Query

Distributed model (indexes, tables, databases, …)

Page 6: MLconf NYC Edo Liberty

207 big-data infographics (meta infographic)

6 Yahoo Confidential & Proprietary

Page 7: MLconf NYC Edo Liberty

7 Yahoo Confidential & Proprietary

Page 8: MLconf NYC Edo Liberty

8 Yahoo Confidential & Proprietary

Sketch

The World

Query Algorithm Result Query

Result

Computation

The streaming model

Page 9: MLconf NYC Edo Liberty

9 Yahoo Confidential & Proprietary

Aggregate+ Sketch

The World

Query Algorithm Result Query

Result

Compute + Sketch

Compute + Sketch

Compute + Sketch

Compute + Sketch

The parallel streaming model

Page 10: MLconf NYC Edo Liberty

10 Yahoo Confidential & Proprietary

1 7 8 1 0 1 7 7

Sketch

Result

Iterator

Computation

The streaming model (more accurately)

O(n) Items

O(polylog(n)) Space

O(polylog(n)) Computation per item

Page 11: MLconf NYC Edo Liberty

11 Yahoo Confidential & Proprietary

Sketch Result

Iterator Iterator

Communication complexity

1 7 8 1 0 1 7 7

Page 12: MLconf NYC Edo Liberty

Frequent i tems

Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet streams with limited space, 2002 Karp, Shenker, Papadimitriou. A simple algorithm for finding frequent elements in streams and bags, 2003 The name ``Lossy Counting" was used for a different algorithm by Manku and Motwani, 2002 Metwally, Agrawal, Abbadi, Efficient Computation of Frequent and Top-k Elements in Data Streams, 2006

Page 13: MLconf NYC Edo Liberty

13 Yahoo Confidential & Proprietary

d

n

f( ) = 5

Page 14: MLconf NYC Edo Liberty

14 Yahoo Confidential & Proprietary

f( ) = 5

d

Page 15: MLconf NYC Edo Liberty

15 Yahoo Confidential & Proprietary

`

Page 16: MLconf NYC Edo Liberty

16 Yahoo Confidential & Proprietary

`

Page 17: MLconf NYC Edo Liberty

17 Yahoo Confidential & Proprietary

`

Page 18: MLconf NYC Edo Liberty

18 Yahoo Confidential & Proprietary

`

Page 19: MLconf NYC Edo Liberty

19 Yahoo Confidential & Proprietary

`

Page 20: MLconf NYC Edo Liberty

20 Yahoo Confidential & Proprietary

`

Page 21: MLconf NYC Edo Liberty

21 Yahoo Confidential & Proprietary

`

Page 22: MLconf NYC Edo Liberty

22 Yahoo Confidential & Proprietary

f 0( ) = 0

`

f 0( ) = 2

Page 23: MLconf NYC Edo Liberty

23 Yahoo Confidential & Proprietary

Assume we do this times t

Second fact: f 0(x) � f(x)� t

f

0(x) f(x) First fact:

The proof (very short)

Page 24: MLconf NYC Edo Liberty

24 Yahoo Confidential & Proprietary

Third (not so obvious) fact: Which gives . In words: We can only delete items times!

t n/`

0 �P

f

0(x) =P

f(x)� t · ` = n� t · `

The proof (very short)

` n/`

|f 0(x)� f(x)| n/`

Page 25: MLconf NYC Edo Liberty

Useful form…

25 Yahoo Confidential & Proprietary

Define And We get that This is very useful for keeping approx’ distributions!

p(x) = f(x)/np

0(x) = f

0(x)/n

|p0(x)� p(x)| 1/`

Page 26: MLconf NYC Edo Liberty

Threading Machine Generated Emai l

Page 27: MLconf NYC Edo Liberty

27 Yahoo Confidential & Proprietary

Email threads

A simple email thread (that’s not very hard to do…)

Page 28: MLconf NYC Edo Liberty

Threading Machine Generated Email

28 Yahoo Confidential & Proprietary

Ailon, Karnin, Maarek, Liberty, Threading Machine Generated Email, WSDM 2013

Page 29: MLconf NYC Edo Liberty

29 Yahoo Confidential & Proprietary

Threading Machine Generated Email

Page 30: MLconf NYC Edo Liberty

30 Yahoo Confidential & Proprietary

Threading Machine Generated Email

Page 31: MLconf NYC Edo Liberty

What else can we do in the streaming model…

31 Yahoo Confidential & Proprietary

Items (words, IP-adresses, events, clicks,...): §  Item frequencies §  Counting distinct elements §  Moment and entropy estimation §  Approximate set operations

Vectors (text documents, images, example features,...) §  Dimensionality reduction §  Clustering (k-means, k-median,…) §  Linear Regression §  Machine learning (some of it at least)

Matrices (text corpora, user preferences, graphs...) §  Covariance estimation matrix §  Low rank approximation §  Sparsification

Page 32: MLconf NYC Edo Liberty

Thanks!

32 Yahoo Confidential & Proprietary

Yahoo does big data algorithms, software and systems! Speak to our Talent Team or visit Careers.Yahoo.com and explore our career opportunities in NYC or Sunnyvale, CA

Seth Tropper [email protected]

Doug DeSimone [email protected]

Keith Daniels [email protected]

Yahoo is an equal opportunity employer.