MLconf NYC Edo Liberty

Post on 01-Nov-2014

Streaming Data Mining

PRESENTED BY Edo Liberty | April 11, 2014

Copyright © 2014 Yahoo! All rights reserved. No reproduction or distribution allowed without express written permission.

Parts of this presentation were given with Jelani Nelson (Harvard) as a KDD tutorial on streaming data mining.

2 Yahoo Confidential & Proprietary

Single machine data mining

[Diagram: The World, Data, Computation, Result]

Distributed storage

[Diagram: The World; four Data stores; Computation; Result]

Distributed model (map/reduce, message passing, …)

[Diagram: The World; eight Data + Compute nodes; Computation; Result]

Distributed model (indexes, tables, databases, …)

[Diagram: The World; eight Data + Compute nodes; Computation and Result; Computation and Query]

207 big-data infographics (meta infographic)

The streaming model

[Diagram: The World; Computation; Sketch; Query, Algorithm, Result]

The parallel streaming model

[Diagram: The World; four Compute + Sketch nodes; Aggregate + Sketch; Query, Algorithm, Result]
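To make the "Compute + Sketch" and "Aggregate + Sketch" roles concrete, here is a minimal sketch in Python. The `Summary` class and its fields are hypothetical and are not taken from the talk; it only illustrates the property the parallel model relies on, namely that sketches built on separate partitions can be merged into the sketch of the whole stream.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    """Hypothetical constant-size sketch: count, sum, min and max of a stream."""
    count: int = 0
    total: float = 0.0
    lo: float = float("inf")
    hi: float = float("-inf")

    def update(self, x):
        # "Compute + Sketch": fold one item into the local summary.
        self.count += 1
        self.total += x
        self.lo = min(self.lo, x)
        self.hi = max(self.hi, x)

    def merge(self, other):
        # "Aggregate + Sketch": combine two summaries without the raw data.
        return Summary(self.count + other.count, self.total + other.total,
                       min(self.lo, other.lo), max(self.hi, other.hi))


def sketch(stream):
    """Build a Summary in a single pass over the stream."""
    s = Summary()
    for x in stream:
        s.update(x)
    return s


# Four partitions of an example stream, each sketched independently.
partitions = [[1, 7], [8, 1], [0, 1], [7, 7]]
combined = Summary()
for part in partitions:
    combined = combined.merge(sketch(part))
```

Here `combined` equals the summary of the unpartitioned stream, which is what lets the aggregation step run on sketches alone.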

The streaming model (more accurately)

[Diagram: stream 1 7 8 1 0 1 7 7; Iterator; Computation; Sketch; Result]

O(n) Items

O(polylog(n)) Space

O(polylog(n)) Computation per item
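As a small illustration of these constraints, the Python sketch below reads items once from an iterator and keeps only three numbers of state. It uses Welford's running mean and variance update, which is a stand-in example chosen here and not an algorithm from the talk; the function name is hypothetical.

```python
def running_variance(stream):
    """One pass, constant memory: returns (count, mean, population variance)."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in stream:              # each item is seen exactly once
        count += 1
        delta = x - mean
        mean += delta / count     # constant work per item
        m2 += delta * (x - mean)  # running sum of squared deviations
    variance = m2 / count if count else 0.0
    return count, mean, variance


# The iterator hides the data source; the function never stores the items.
count, mean, variance = running_variance(iter([1, 7, 8, 1, 0, 1, 7, 7]))
```

The state plays the role of the sketch: it is far smaller than the stream, and the result is computed from it alone.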

Communication complexity

[Diagram: stream 1 7 8 1 0 1 7 7; two Iterators; Sketch; Result]

Frequent items

Misra, Gries. Finding repeated elements, 1982.
Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet streams with limited space, 2002.
Karp, Shenker, Papadimitriou. A simple algorithm for finding frequent elements in streams and bags, 2003.
The name "Lossy Counting" was used for a different algorithm by Manku and Motwani, 2002.
Metwally, Agrawal, Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams, 2006.

[Figure sequence illustrating the frequent items algorithm. Labels: $d$, $n$, and a true frequency $f(\cdot) = 5$; then a series of frames labeled $\ell$; the last frame shows approximate frequencies $f'(\cdot) = 0$ and $f'(\cdot) = 2$.]
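The slides show the algorithm only as pictures, so the following Python sketch is a reconstruction under assumptions, in the spirit of the Misra-Gries reference above. The update rule is chosen to match the proof that follows: whenever the sketch holds $\ell$ distinct items, one copy of each is deleted, so every deletion step removes exactly $\ell$ items. The function name and return values are hypothetical.

```python
def frequent_items(stream, ell):
    """Return (approximate counts f', number of deletion steps t).

    Assumed rule: count every item; whenever ell distinct items are held,
    delete one copy of each of them and drop counters that reach zero.
    """
    counters = {}
    deletions = 0
    for x in stream:
        counters[x] = counters.get(x, 0) + 1
        if len(counters) == ell:
            deletions += 1
            # Remove one copy of each held item, i.e. exactly ell items.
            counters = {k: c - 1 for k, c in counters.items() if c > 1}
    return counters, deletions


stream = [1, 7, 8, 1, 0, 1, 7, 7]        # the example stream from the slides
approx, t = frequent_items(stream, ell=3)
print(approx, t)                          # prints {1: 1, 7: 1} 2
```

At most $\ell - 1$ counters survive each step, so the memory used depends on $\ell$ and not on the stream length. Here the true counts of items 1 and 7 are both 3, and each estimate is low by exactly $t = 2$.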

The proof (very short)

Assume we do this $t$ times.

First fact: $f'(x) \le f(x)$

Second fact: $f'(x) \ge f(x) - t$

The proof (very short)

Third (not so obvious) fact: $0 \le \sum_x f'(x) = \sum_x f(x) - t \cdot \ell = n - t \cdot \ell$

Which gives $t \le n/\ell$. In words: we can only delete $\ell$ items $n/\ell$ times!

Useful form…

$|f'(x) - f(x)| \le n/\ell$
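The useful form is stated without its last step. One way to fill it in, using only the three facts of the proof: the first two facts sandwich $f'(x)$, and the third bounds $t$.

```latex
\begin{align*}
f(x) - t \;\le\; f'(x) \;\le\; f(x)
  &\quad\Longrightarrow\quad 0 \;\le\; f(x) - f'(x) \;\le\; t, \\
t \;\le\; n/\ell
  &\quad\Longrightarrow\quad |f'(x) - f(x)| \;\le\; t \;\le\; n/\ell .
\end{align*}
```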

Define $p(x) = f(x)/n$ and $p'(x) = f'(x)/n$.

We get that $|p'(x) - p(x)| \le 1/\ell$.

This is very useful for keeping approximate distributions!
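A small Python sketch of how this bound might be used. Since the approximate counts never exceed the true counts, the bound says the true frequency $p(x)$ lies between $p'(x)$ and $p'(x) + 1/\ell$. The function name is hypothetical, and the counter values below are assumed sketch contents for the stream 1 7 8 1 0 1 7 7 with $\ell = 3$, not numbers from the talk.

```python
def frequency_interval(x, counters, n, ell):
    """Return (low, high) bounds on the true relative frequency p(x)."""
    p_approx = counters.get(x, 0) / n      # p'(x) = f'(x) / n
    return p_approx, p_approx + 1.0 / ell  # p'(x) <= p(x) <= p'(x) + 1/ell


stream = [1, 7, 8, 1, 0, 1, 7, 7]
counters = {1: 1, 7: 1}   # assumed sketch contents f'(x) for ell = 3
n, ell = len(stream), 3

low, high = frequency_interval(1, counters, n, ell)
true_p = stream.count(1) / n               # 3/8, known here only for checking
```

Items missing from the sketch get the interval from 0 to $1/\ell$, so any item whose true frequency exceeds $1/\ell$ must still be present in the sketch.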

Threading Machine Generated Email

Email threads

A simple email thread (that’s not very hard to do…)

Threading Machine Generated Email

Ailon, Karnin, Maarek, Liberty, Threading Machine Generated Email, WSDM 2013

What else can we do in the streaming model…

Items (words, IP addresses, events, clicks, ...):
- Item frequencies
- Counting distinct elements
- Moment and entropy estimation
- Approximate set operations

Vectors (text documents, images, example features, ...):
- Dimensionality reduction
- Clustering (k-means, k-median, …)
- Linear regression
- Machine learning (some of it at least)

Matrices (text corpora, user preferences, graphs, ...):
- Covariance matrix estimation
- Low rank approximation
- Sparsification
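To give one of the listed problems a concrete shape, here is a minimal Python sketch for counting distinct elements. The talk names the problem but not a method, so the technique below (keeping the k smallest hash values, often called a k-minimum-values sketch) is a choice made here; the function names and the default k are hypothetical.

```python
import hashlib
import heapq


def unit_hash(item):
    """Map an item to a pseudo-random value in (0, 1] using SHA-256."""
    digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2**64


def estimate_distinct(stream, k=256):
    """Estimate the number of distinct items from the k smallest hash values."""
    heap = []     # negated hashes, so -heap[0] is the largest value kept
    kept = set()  # hash values currently in the heap, used to skip repeats
    for item in stream:
        h = unit_hash(item)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # Replace the largest kept value with the new, smaller one.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct hashes: exact, barring collisions
    return (k - 1) / -heap[0]    # the k-th smallest hash value sets the estimate


small = estimate_distinct([1, 7, 8, 1, 0, 1, 7, 7], k=16)
large = estimate_distinct((i % 10000 for i in range(30000)), k=256)
```

Memory is proportional to k regardless of the stream length, and the estimate becomes more accurate as k grows.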

Thanks!

Yahoo does big data algorithms, software and systems! Speak to our Talent Team or visit Careers.Yahoo.com and explore our career opportunities in NYC or Sunnyvale, CA

Seth Tropper satropper@yahoo-inc.com

Doug DeSimone desimone@yahoo-inc.com

Keith Daniels kdnl@yahoo-inc.com

Yahoo is an equal opportunity employer.