GoTo Amsterdam 2013 Skinned

53
1 ©MapR Technologies - Confidential Old and New Building Blocks Come Together For Big Data

description

 

Transcript of GoTo Amsterdam 2013 Skinned

Page 1: GoTo Amsterdam 2013 Skinned

1©MapR Technologies - Confidential

Old and New Building Blocks Come Together For Big Data

Page 2: GoTo Amsterdam 2013 Skinned

2©MapR Technologies - Confidential

Contact:[email protected]@ted_dunning

Slides and such http://slideshare.net/tdunning

Hash tags: #mapr #gotoams #d3 #node

Page 3: GoTo Amsterdam 2013 Skinned

3©MapR Technologies - Confidential

Embarrassment of Riches

d3.js allows really pretty pictures node.js allows simple (not just web) servers Storm does real-time Hadoop does big data d3 allows very cool visualizations

Page 4: GoTo Amsterdam 2013 Skinned

4©MapR Technologies - Confidential

D3 demo

Page 5: GoTo Amsterdam 2013 Skinned

5©MapR Technologies - Confidential

node demo

Page 6: GoTo Amsterdam 2013 Skinned

6©MapR Technologies - Confidential

Hadoop demo

Page 7: GoTo Amsterdam 2013 Skinned

7©MapR Technologies - Confidential

But …

Web camp– everything is a service with a URL or a DOM

Big data camp– non-traditional file systems

Everybody else– files and databases

They don’t like to talk to each other

Page 8: GoTo Amsterdam 2013 Skinned

8©MapR Technologies - Confidential

Why Not Tiered Architectures?

Tiered architectures – translations between services and cultures – standard corporate answer

Feels like molasses

Page 9: GoTo Amsterdam 2013 Skinned

9©MapR Technologies - Confidential

The Vision

Integrate – multiple computing paradigms– many computing communities

How?– common storage, queuing and data platforms

Page 10: GoTo Amsterdam 2013 Skinned

10©MapR Technologies - Confidential

For Example, …

Incoming documents with text– store in file-based queues– index in real-time using Storm and Solr– add initial engagement class, “don’t-know”

Search for documents using original text– add random noise, small for well understood docs, large for “don’t-know”

docs

Record engagement

Page 11: GoTo Amsterdam 2013 Skinned

11©MapR Technologies - Confidential

Add Analysis

Process engagement logs– item-item cooccurrence– user-item histories

Update search index– indicator items– decrease uncertainty on well understood docs

Update user profile– item history

Page 12: GoTo Amsterdam 2013 Skinned

12©MapR Technologies - Confidential

Search Again

Now searches use recent views + text– recent views query indicator fields– text queries normal text data– add noise as appropriate

Page 13: GoTo Amsterdam 2013 Skinned

13©MapR Technologies - Confidential

And Draw a Picture

Searches and clicks can be logged– real-time metrics– real-time trending topics

What’s hot, what’s not Popular searches Document clusters Word clouds

Page 14: GoTo Amsterdam 2013 Skinned

14©MapR Technologies - Confidential

In Pictures

Page 15: GoTo Amsterdam 2013 Skinned

15©MapR Technologies - Confidential

In Pictures

Doc queue

Search index

Real-time indexing

Doc sources

Page 16: GoTo Amsterdam 2013 Skinned

16©MapR Technologies - Confidential

In Pictures

Doc queue

Search index

Real-time indexing

Doc sources

User queries

Search engine

Page 17: GoTo Amsterdam 2013 Skinned

17©MapR Technologies - Confidential

In Pictures

Doc queue

Search index

Real-time indexing

Doc sources

User queries

Search engine Logs

Recommendation analysis

Page 18: GoTo Amsterdam 2013 Skinned

18©MapR Technologies - Confidential

In Pictures

Doc queue

Search index

Real-time indexing

Doc sources

User queries

Search engine Logs

Recommendation analysis

Usage analysisRenderingAdmin

queries

Page 19: GoTo Amsterdam 2013 Skinned

19©MapR Technologies - Confidential

Which Technology?

Doc queue

Search index

Real-time indexing

Doc sources

User queries

Search engine Logs

Recommendation analysis

Usage analysis

Admin queries

Rendering

Storm/node

Solr

MapR

D3/node

Other

Page 20: GoTo Amsterdam 2013 Skinned

20©MapR Technologies - Confidential

Yeah, But …

This isn’t as easy as it looks

Take the real-time / long-time part

Page 21: GoTo Amsterdam 2013 Skinned

21©MapR Technologies - Confidential

t

now

Hadoop is Not Very Real-time

UnprocessedData

Fully processed

Latest full period

Hadoop job takes this long for this data

Page 22: GoTo Amsterdam 2013 Skinned

22©MapR Technologies - Confidential

t

now

Hadoop works great back here

Storm workshere

Real-time and Long-time together

Blended view

Blended view

Blended View

Page 23: GoTo Amsterdam 2013 Skinned

23©MapR Technologies - Confidential

SolRIndexerSolR

IndexerSolrindexing

Cooccurrence(Mahout)

Item meta-data

Indexshards

Complete history

Page 24: GoTo Amsterdam 2013 Skinned

24©MapR Technologies - Confidential

SolRIndexerSolR

IndexerSolrsearchWeb tier

Item meta-data

Indexshards

User history

Page 25: GoTo Amsterdam 2013 Skinned

25©MapR Technologies - Confidential

Users

Catcher Storm

Topic Queue

Web-server

http

Web Data

MapR

Page 26: GoTo Amsterdam 2013 Skinned

26©MapR Technologies - Confidential

Closer Look – Catcher Protocol

Data Sources

Catcher ClusterCatcher Cluster

Data Sources

The data sources and catchers communicate with a very simple protocol.

Hello() => list of catchersLog(topic,message) => (OK|FAIL, redirect-to-catcher)

Page 27: GoTo Amsterdam 2013 Skinned

27©MapR Technologies - Confidential

Closer Look – Catcher Queues

Catcher Cluster

Catcher Cluster

The catchers forward log requests to the correct catcher and return that host in the reply to allow the client to avoid the extra hop.

Each topic file is appended by exactly one catcher.

Topic files are kept in shared file storage.

TopicFile

TopicFile

Page 28: GoTo Amsterdam 2013 Skinned

28©MapR Technologies - Confidential

Closer Look – ProtoSpout

The ProtoSpout tails the topic files,parses log records into tuples andinjects them into the Storm topology.

Last fully acked position stored in shared file system.

TopicFile

TopicFile

ProtoSpout

Page 29: GoTo Amsterdam 2013 Skinned

29©MapR Technologies - Confidential

Yeah, But …

What was that about adding noise in scoring?

Why would I do that??

Is there a simple answer?

Page 30: GoTo Amsterdam 2013 Skinned

30©MapR Technologies - Confidential

Thompson Sampling

Select each shell according to the probability that it is the best

Probability that it is the best can be computed using posterior

But I promised a simple answer

Page 31: GoTo Amsterdam 2013 Skinned

31©MapR Technologies - Confidential

Thompson Sampling – Take 2

Sample θ

Pick i to maximize reward

Record result from using i

Page 32: GoTo Amsterdam 2013 Skinned

32©MapR Technologies - Confidential

Nearly Forgotten until Recently

Citations for Thompson sampling

Page 33: GoTo Amsterdam 2013 Skinned

33©MapR Technologies - Confidential

Bayesian Bandit for the Search

Compute distributions based on data so far Sample scores s1, s2 …

– based on actual score– plus per doc noise from these distributions

Rank docs by si

Lemma 1: The probability of showing doc i at first position will match the probability it is the best

Lemma 2: This is as good as it gets

Page 34: GoTo Amsterdam 2013 Skinned

34©MapR Technologies - Confidential

And it works!

Page 35: GoTo Amsterdam 2013 Skinned

35©MapR Technologies - Confidential

Yeah, But …

Isn’t recommendations complicated?

How can I implement this?

Page 36: GoTo Amsterdam 2013 Skinned

36©MapR Technologies - Confidential

Recommendation Basics

History:

User Thing1 3

2 4

3 4

2 3

3 2

1 1

2 1

Page 37: GoTo Amsterdam 2013 Skinned

37©MapR Technologies - Confidential

Recommendation Basics

History as matrix:

(t1, t3) cooccur 2 times, (t1, t4) once, (t2, t4) once, (t3, t4) once

t1 t2 t3 t4

u1 1 0 1 0

u2 1 0 1 1

u3 0 1 0 1

Page 38: GoTo Amsterdam 2013 Skinned

38©MapR Technologies - Confidential

A Quick Simplification

Users who do h

Also do r

User-centric recommendations

Item-centric recommendations

Page 39: GoTo Amsterdam 2013 Skinned

39©MapR Technologies - Confidential

Recommendation Basics

Coocurrence

t1 t2 t3 t4

t1 2 0 2 1

t2 0 1 0 1

t3 2 0 1 1

t4 1 1 1 2

Page 40: GoTo Amsterdam 2013 Skinned

40©MapR Technologies - Confidential

Problems with Raw Cooccurrence

Very popular items co-occur with everything– Welcome document– Elevator music

That isn’t interesting– We want anomalous cooccurrence

Page 41: GoTo Amsterdam 2013 Skinned

41©MapR Technologies - Confidential

Recommendation Basics

Coocurrence

t1 t2 t3 t4

t1 2 0 2 1

t2 0 1 0 1

t3 2 0 1 1

t4 1 1 1 2t3 not t3

t1 2 1

not t1 1 1

Page 42: GoTo Amsterdam 2013 Skinned

42©MapR Technologies - Confidential

Spot the Anomaly

Root LLR is roughly like standard deviations

A not A

B 13 1000

not B 1000 100,000

A not A

B 1 0

not B 0 2

A not A

B 1 0

not B 0 10,000

A not A

B 10 0

not B 0 100,000

0.44 0.98

2.26 7.15

Page 43: GoTo Amsterdam 2013 Skinned

43©MapR Technologies - Confidential

Root LLR Details

In Rentropy = function(k) { -sum(k*log((k==0)+(k/sum(k))))}rootLLr = function(k) { sign = … sign * sqrt( (entropy(rowSums(k))+entropy(colSums(k)) - entropy(k))/2)}

Like sqrt(mutual information * N/2)See http://bit.ly/16DvLVK

Page 44: GoTo Amsterdam 2013 Skinned

44©MapR Technologies - Confidential

Threshold by Score

Coocurrence

t1 t2 t3 t4

t1 2 0 2 1

t2 0 1 0 1

t3 2 0 1 1

t4 1 1 1 2

Page 45: GoTo Amsterdam 2013 Skinned

45©MapR Technologies - Confidential

Threshold by Score

Significant cooccurrence => Indicators

t1 t2 t3 t4

t1 1 0 0 1t2 0 1 0 1t3 0 0 1 1t4 1 0 0 1

Page 46: GoTo Amsterdam 2013 Skinned

46©MapR Technologies - Confidential

Yeah, But …

Why go to all this trouble?

Does it really help?

Page 47: GoTo Amsterdam 2013 Skinned

47©MapR Technologies - Confidential

Real-life example

Page 48: GoTo Amsterdam 2013 Skinned

48©MapR Technologies - Confidential

The Real Life Issues

Exploration Diversity Speed

Not the last percent

Page 49: GoTo Amsterdam 2013 Skinned

49©MapR Technologies - Confidential

The Second Page

Page 50: GoTo Amsterdam 2013 Skinned

50©MapR Technologies - Confidential

Make it Worse to Make It Better

Add noise to rank

1 2 8 7 6 3 5 4 10 13 21 18 12 9 14 24 34 28 32 1711 27 40 30 41 49 16 15 35 23 19 22 26 31 20 43 25 29 33 6238 60 74 53 36 37 39 70 45 44 46 71 42 69 47 63 52 57 51 48

Results are worse today But better tomorrow

Page 51: GoTo Amsterdam 2013 Skinned

51©MapR Technologies - Confidential

Anti-Flood

200 of the same result is no better than 2

The recommender list is a portfolio of results– If probability of success is highly correlated, then probability of at least one

success is much lower

Suppressing items similar to higher ranking items helps

Page 52: GoTo Amsterdam 2013 Skinned

52©MapR Technologies - Confidential

The Punchline

Hybrid systems really can work today

Middle tiers aren’t as interesting as they used to be– No need for Flume … queue directly in big data system– No need for external queues, tail the data directly with Storm– No need for query systems for presentation data … read it directly with

node

Absolutely require common frameworks and standard interfaces

You can do this today!

Page 53: GoTo Amsterdam 2013 Skinned

53©MapR Technologies - Confidential

Contact:[email protected]@ted_dunning

Slides and such http://slideshare.net/tdunning

Hash tags: #mapr #gotoams #d3 #node