News from Mahout


Description

Presentation to the NYC HUG on March 5, 2013 regarding the upcoming Mahout release.

Transcript of News from Mahout

Page 1: News from Mahout


News From Mahout

Page 2: News from Mahout


whoami – Ted Dunning

Chief Application Architect, MapR Technologies
Committer and member, Apache Software Foundation
– particularly Mahout, ZooKeeper and Drill

(we’re hiring)

Contact me: [email protected], [email protected], @ted_dunning

Page 3: News from Mahout


Slides and such (available late tonight):
– http://www.mapr.com/company/events/nyhug-03-05-2013

Hash tags: #mapr #nyhug #mahout

Page 4: News from Mahout


New in Mahout

0.8 is coming soon (1-2 months)
Gobs of fixes
QR decomposition is 10x faster
– makes ALS 2-3 times faster

May include Bayesian Bandits
Super fast k-means
– fast
– online (!?!)

Page 5: News from Mahout


New in Mahout

0.8 is coming soon (1-2 months)
Gobs of fixes
QR decomposition is 10x faster
– makes ALS 2-3 times faster

May include Bayesian Bandits
Super fast k-means
– fast
– online (!?!)
– fast

Possible new edition of MiA coming
– Japanese and Korean editions released, Chinese coming


Page 7: News from Mahout


Real-time Learning

Page 8: News from Mahout


We have a product to sell … from a web-site

Page 9: News from Mahout


What picture?

What tag-line?

What call to action?

Page 10: News from Mahout


The Challenge

Design decisions affect probability of success
– Cheesy web-sites don’t even sell cheese

The best designers do better when allowed to fail
– Exploration juices creativity

But failing is expensive
– If only because we could have succeeded
– But also because offending or disappointing customers is bad

Page 11: News from Mahout


More Challenges

Too many designs
– 5 pictures
– 10 tag-lines
– 4 calls to action
– 3 background colors
=> 5 x 10 x 4 x 3 = 600 designs

It gets worse quickly
– What about changes on the back-end?
– Search engine variants?
– Checkout process variants?

Page 12: News from Mahout


Example – AB testing in real-time

I have 15 versions of my landing page
Each visitor is assigned to a version
– Which version?

A conversion or sale or whatever can happen
– How long to wait?

Some versions of the landing page are horrible
– Don’t want to give them traffic

Page 13: News from Mahout


A Quick Diversion

You see a coin
– What is the probability of heads?
– Could it be larger or smaller than that?

I flip the coin and while it is in the air ask again
I catch the coin and ask again
I look at the coin (and you don’t) and ask again
Why does the answer change?
– And did it ever have a single value?

Page 14: News from Mahout


A Philosophical Conclusion

Probability as expressed by humans is subjective and depends on information and experience

Page 15: News from Mahout


I Dunno

Page 16: News from Mahout


5 heads out of 10 throws

Page 17: News from Mahout


2 heads out of 12 throws
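Not in the original deck: a minimal R sketch, assuming a uniform Beta(1,1) prior, of the posterior curves behind the last three slides ("I dunno", 5 heads out of 10, 2 heads out of 12):

p = seq(0, 1, length.out = 501)
plot(p, dbeta(p, 1, 1), type = "l", lty = 3, ylim = c(0, 4),
     xlab = "probability of heads", ylab = "density")   # "I dunno": the flat prior
lines(p, dbeta(p, 1 + 5, 1 + 5), lty = 1)               # after 5 heads in 10 throws
lines(p, dbeta(p, 1 + 2, 1 + 10), lty = 2)              # after 2 heads in 12 throws
legend("topright", lty = c(3, 1, 2),
       legend = c("prior", "5/10 heads", "2/12 heads"))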

Page 18: News from Mahout


So now you understand Bayesian probability

Page 19: News from Mahout


Another Quick Diversion

Let’s play a shell game
This is a special shell game
It costs you nothing to play
The pea has constant probability of being under each shell

(trust me)

How do you find the best shell? How do you find it while maximizing the number of wins?

Page 20: News from Mahout


Pause for short con-game

Page 21: News from Mahout


Interim Thoughts

Can you identify winners or losers without trying them out?

Can you ever completely eliminate a shell with a bad streak?

Should you keep trying apparent losers?

Page 22: News from Mahout


So now you understand multi-armed bandits

Page 23: News from Mahout


Conclusions

Can you identify winners or losers without trying them out? No

Can you ever completely eliminate a shell with a bad streak? No

Should you keep trying apparent losers? Yes, but at a decreasing rate

Page 24: News from Mahout


Is there an optimum strategy?

Page 25: News from Mahout


Bayesian Bandit

Compute distributions based on data so far
Sample p1, p2 and p3 from these distributions

Pick shell i where i = argmax_i p_i

Lemma 1: The probability of picking shell i will match the probability it is the best shell

Lemma 2: This is as good as it gets

Page 26: News from Mahout


And it works!

Page 27: News from Mahout


Video Demo

Page 28: News from Mahout


The Code

Select an alternative:

select = function(k) {
  # k is an n x 2 matrix of counts per bandit: column 1 = failures, column 2 = successes
  n = dim(k)[1]
  p0 = rep(0, length.out = n)
  for (i in 1:n) {
    # sample a plausible success rate from the Beta posterior for bandit i
    p0[i] = rbeta(1, k[i, 2] + 1, k[i, 1] + 1)
  }
  return (which(p0 == max(p0)))
}

Select and learn – but we already know how to count!

# (named learn here; the slide shows only the body)
learn = function(k, steps, test) {
  for (z in 1:steps) {
    i = select(k)            # pick a bandit by Thompson sampling
    j = test(i)              # observe the outcome: 1 for failure, 2 for success
    k[i, j] = k[i, j] + 1    # learning is just counting
  }
  return (k)
}
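Not in the original deck: a minimal usage sketch for the two functions above; the simulated test function and the count layout (column 1 = failures, column 2 = successes, as implied by the rbeta call) are my own illustration.

# Hypothetical simulation: three landing pages with unknown conversion rates.
true.rates = c(0.05, 0.12, 0.08)
test = function(i) if (runif(1) < true.rates[i]) 2 else 1   # 2 = success column, 1 = failure column

k = matrix(0, nrow = 3, ncol = 2)    # no data yet
k = learn(k, steps = 1000, test = test)
print(k)                             # most of the traffic should have gone to page 2
print(k[, 2] / rowSums(k))           # observed conversion rate per page

Running it, the count matrix should show most trials going to the page with the highest true rate, while the apparent losers still get occasional, decreasing traffic.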

Page 29: News from Mahout


The Basic Idea

We can encode a distribution by sampling
Sampling allows unification of exploration and exploitation

Can be extended to more general response models

Page 30: News from Mahout


The Original Problem

[Figure: the original landing-page problem with design variables x1, x2, x3]

Page 31: News from Mahout


Response Function

Page 32: News from Mahout


Generalized Banditry

Suppose we have an infinite number of bandits
– suppose they are each labeled by two real numbers x and y in [0,1]
– also that expected payoff is a parameterized function of x and y
– now assume a distribution for θ that we can learn online

Selection works by sampling θ, then computing f
Learning works by propagating updates back to θ
– If f is linear, this is very easy
– For special other kinds of f it isn’t too hard

We don’t just have to have two labels – we could have labels and context
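Not in the original deck: a minimal sketch of the linear case described above, assuming payoff f(x, θ) = θᵀx, a Gaussian prior on θ, and Gaussian noise; the candidate grid, noise level sigma, and true.theta are illustrative only.

library(MASS)    # for mvrnorm

# Candidate bandits labeled by (x, y) in [0,1], plus an intercept term.
candidates = as.matrix(cbind(1, expand.grid(x = seq(0, 1, 0.25), y = seq(0, 1, 0.25))))
d = ncol(candidates)

sigma = 0.1                      # assumed observation noise
A = diag(d)                      # posterior precision, starting from a N(0, I) prior on theta
b = rep(0, d)                    # accumulates x * reward / sigma^2

true.theta = c(0.1, 0.8, -0.3)   # hidden payoff model, used only to simulate rewards

for (t in 1:500) {
  theta = mvrnorm(1, mu = solve(A, b), Sigma = solve(A))   # sample theta from its posterior
  i = which.max(candidates %*% theta)                      # select: best arm under the sampled theta
  x = candidates[i, ]
  reward = sum(true.theta * x) + rnorm(1, sd = sigma)      # observe a noisy payoff
  A = A + (x %*% t(x)) / sigma^2                           # learn: conjugate Gaussian update
  b = b + reward * x / sigma^2
}
print(solve(A, b))               # posterior mean; should be close to true.theta

Here "propagating updates back to θ" is just the conjugate Gaussian update; a different response model would swap in its own online update.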

Page 33: News from Mahout


Context Variables

[Figure: the same design variables x1, x2, x3 together with context variables user.geo, env.time, env.day_of_week, env.weekend]

Page 34: News from Mahout


Caveats

Original Bayesian Bandit only requires real-time counts

Generalized Bandit may require access to long history for learning
– Pseudo-online learning may be easier than true online

Bandit variables can include content, time of day, day of week

Context variables can include user id, user features

Bandit × context variables provide the real power

Page 35: News from Mahout


You can do this yourself!

Page 36: News from Mahout


Super-fast k-means Clustering

Page 37: News from Mahout


Rationale

Page 38: News from Mahout


What is Quality?

Robust clustering not a goal
– we don’t care if the same clustering is replicated

Generalization is critical
Agreement to "gold standard" is a non-issue

Page 39: News from Mahout


An Example

Page 40: News from Mahout


An Example

Page 41: News from Mahout


Diagonalized Cluster Proximity

Page 42: News from Mahout


Clusters as Distribution Surrogate

Page 43: News from Mahout


Clusters as Distribution Surrogate

Page 44: News from Mahout


Theory

Page 45: News from Mahout


For Example

Grouping these two clusters seriously hurts squared distance

Page 46: News from Mahout


Algorithms

Page 47: News from Mahout


Typical k-means Failure

Selecting two seeds here cannot be fixed with Lloyd’s algorithm

Result is that these two clusters get glued together

Page 48: News from Mahout


Ball k-means

Provably better for highly clusterable data
Tries to find initial centroids in the “core” of each real cluster
Avoids outliers in centroid computation

initialize centroids randomly with distance maximizing tendency
for each of a very few iterations:
    for each data point:
        assign point to nearest cluster
    recompute centroids using only points much closer than the closest cluster
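Not Mahout's implementation: a toy R sketch of the recipe above, using k-means++-style seeding for the distance-maximizing tendency and a trim parameter (my own knob) to decide which points are "much closer" to their centroid than to the next-closest one.

ball.kmeans = function(X, k, iterations = 3, trim = 0.5) {
  n = nrow(X)
  # seeding with a distance-maximizing tendency (k-means++ style)
  centers = X[sample(n, 1), , drop = FALSE]
  while (nrow(centers) < k) {
    d2 = apply(X, 1, function(x) min(colSums((t(centers) - x)^2)))
    centers = rbind(centers, X[sample(n, 1, prob = d2), , drop = FALSE])
  }
  for (iter in 1:iterations) {
    # toy version: computes the full point-to-center distance matrix
    d = as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k, drop = FALSE]
    cluster = apply(d, 1, which.min)
    nearest = d[cbind(1:n, cluster)]
    for (j in 1:k) {
      members = which(cluster == j)
      if (length(members) == 0) next
      # only points much closer to this centroid than to the next-closest one vote
      second = apply(d[members, -j, drop = FALSE], 1, min)
      core = members[nearest[members] < trim * second]
      if (length(core) > 0) centers[j, ] = colMeans(X[core, , drop = FALSE])
    }
  }
  list(centers = centers, cluster = cluster)
}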

Page 49: News from Mahout


Still Not a Win

Ball k-means is nearly guaranteed with k = 2
Probability of successful seeding drops exponentially with k
Alternative strategy has high probability of success, but takes O(nkd + k^3 d) time

Page 50: News from Mahout


Still Not a Win

Ball k-means is nearly guaranteed with k = 2
Probability of successful seeding drops exponentially with k
Alternative strategy has high probability of success, but takes O(nkd + k^3 d) time

But for big data, k gets large

Page 51: News from Mahout


Surrogate Method

Start with sloppy clustering into lots of clusters
– κ = k log n clusters

Use this sketch as a weighted surrogate for the data
Results are provably good for highly clusterable data

Page 52: News from Mahout


Algorithm Costs

Surrogate methods
– fast, sloppy single pass clustering with κ = k log n
– fast, sloppy search for nearest cluster: O(d log κ) = O(d (log k + log log n)) per point
– fast, in-memory, high-quality clustering of κ weighted centroids:
  O(κ k d + k^3 d) = O(k^2 d log n + k^3 d) for small k, high quality
  O(κ d log k) or O(d log κ log k) for larger k, looser quality
– result is k high-quality centroids
• Even the sloppy surrogate may suffice

Page 53: News from Mahout


Algorithm Costs

Surrogate methods
– fast, sloppy single pass clustering with κ = k log n
– fast, sloppy search for nearest cluster: O(d log κ) = O(d (log k + log log n)) per point
– fast, in-memory, high-quality clustering of κ weighted centroids:
  O(κ k d + k^3 d) = O(k^2 d log n + k^3 d) for small k, high quality
  O(κ d log k) or O(d log k (log k + log log n)) for larger k, looser quality
– result is k high-quality centroids
• For many purposes, even the sloppy surrogate may suffice

Page 54: News from Mahout


Algorithm Costs

How much faster for the sketch phase?
– take k = 2000, d = 10, n = 100,000
– k d log n = 2000 x 10 x 26 ≈ 500,000
– d (log k + log log n) = 10 x (11 + 5) ≈ 170
– 3,000 times faster is a bona fide big deal


Page 56: News from Mahout


How It Works

For each point
– Find approximately nearest centroid (distance = d)
– If (d > threshold) new centroid
– Else if (u > d/threshold) new centroid (u a uniform random draw in [0,1])
– Else add to nearest centroid

If centroids > κ ≈ C log N
– Recursively cluster centroids with higher threshold
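Not Mahout's StreamingKMeans: a toy R sketch of the pass above. I read the middle rule as "start a new centroid with probability min(1, d/threshold)", and I use an exact nearest-centroid search where the real code uses an approximate one (next slides).

streaming.sketch = function(X, w = rep(1, nrow(X)), kappa, threshold) {
  centers = X[1, , drop = FALSE]
  weights = w[1]
  for (i in seq_len(nrow(X))[-1]) {
    x = X[i, ]
    d2 = colSums((t(centers) - x)^2)        # exact nearest search; the real code is approximate
    j = which.min(d2)
    if (runif(1) < sqrt(d2[j]) / threshold) {
      centers = rbind(centers, x)           # distant (or unlucky) point: start a new centroid
      weights = c(weights, w[i])
    } else {
      # fold the point into its nearest centroid as a weighted mean
      centers[j, ] = (weights[j] * centers[j, ] + w[i] * x) / (weights[j] + w[i])
      weights[j] = weights[j] + w[i]
    }
    if (nrow(centers) > kappa) {
      # too many centroids: recursively re-cluster them with a larger threshold
      threshold = threshold * 1.5
      s = streaming.sketch(centers, weights, kappa, threshold)
      centers = s$centers
      weights = s$weights
    }
  }
  list(centers = centers, weights = weights)
}

The weighted centroids it returns are the sketch that the in-memory ball k-means step then reduces to k clusters.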

Page 57: News from Mahout


Implementation

Page 58: News from Mahout


But Wait, …

Finding nearest centroid is inner loop

This could take O( d κ ) per point and κ can be big

Happily, approximate nearest centroid works fine

Page 59: News from Mahout


Projection Search

[Figure: projection gives a total ordering]
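Not from the deck: a toy R sketch of projection search for the approximate nearest-centroid step; the probe parameter (how many neighbors in the ordering get an exact check) is my own knob, and practical implementations combine several random projections.

projection.index = function(centers) {
  u = rnorm(ncol(centers))
  u = u / sqrt(sum(u^2))                  # a random unit direction
  proj = as.vector(centers %*% u)
  ord = order(proj)                       # sorting the projections gives the total ordering
  list(u = u, proj = proj[ord], ord = ord, centers = centers)
}

nearest.approx = function(index, x, probe = 8) {
  q = sum(index$u * x)
  pos = findInterval(q, index$proj)       # where x's projection lands in the ordering
  lo = max(1, pos - probe + 1)
  hi = min(length(index$ord), pos + probe)
  candidates = index$ord[lo:hi]           # only these centroids get an exact distance check
  d2 = colSums((t(index$centers[candidates, , drop = FALSE]) - x)^2)
  candidates[which.min(d2)]
}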

Page 60: News from Mahout


LSH Bit-match Versus Cosine
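Not from the deck: a small R simulation of the relationship this slide plotted, namely that the fraction of matching random-hyperplane LSH bits tracks cosine similarity (expected match rate is 1 − θ/π for angle θ).

set.seed(1)
d = 100; bits = 64
planes = matrix(rnorm(bits * d), nrow = bits)     # one random hyperplane per bit
lsh.bits = function(x) as.integer(planes %*% x > 0)

pairs = replicate(1000, {
  x = rnorm(d)
  y = rnorm(d) + runif(1, 0, 3) * x               # pairs with varying similarity
  c(cosine = sum(x * y) / sqrt(sum(x^2) * sum(y^2)),
    match  = mean(lsh.bits(x) == lsh.bits(y)))
})
plot(pairs["cosine", ], pairs["match", ],
     xlab = "cosine similarity", ylab = "fraction of matching LSH bits")
curve(1 - acos(x) / pi, add = TRUE)               # theoretical relationship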

Page 61: News from Mahout


Results

Page 62: News from Mahout


Parallel Speedup?

Page 63: News from Mahout


Quality

Ball k-means implementation appears significantly better than simple k-means

Streaming k-means + ball k-means appears to be about as good as ball k-means alone

All evaluations on 20 newsgroups with held-out data

Figure of merit is mean and median squared distance to nearest cluster

Page 64: News from Mahout


Contact Me!

We’re hiring at MapR in US and Europe

MapR software available for research use

Get the code as part of Mahout trunk (or 0.8 very soon)

Contact me at [email protected] or @ted_dunning

Share news with @apachemahout