Introduction to Mahout


description

Slides for the talk I gave to the Twin Cities HUG

Transcript of Introduction to Mahout

Page 1: Introduction to Mahout

1©MapR Technologies 2013- Confidential

Introduction to Mahout
And How To Build a Recommender

Page 2: Introduction to Mahout

Me, Us

Ted Dunning, Chief Application Architect, MapR
– Committer, PMC member, Mahout, Zookeeper, Drill
– Bought the beer at the first HUG

MapR
– Distributes more open source components for Hadoop
– Adds major technology for performance, HA, industry standard APIs

Tonight
– Hash tag: #tchug
– See also: @ApacheMahout, @ApacheDrill, @ted_dunning and @mapR

Page 3: Introduction to Mahout

Sidebar on Drill

Apache Drill
– SQL on Hadoop (and other things)
– Intended to solve problems for 1-5 years from now
  • Not the problems from 1-10 years ago
– Multiple levels of API supported
  • SQL-2003
  • Logical plan language (DAG in JSON)
  • Physical plan language (DAG with push-down, exchange markers)
  • Execution plan language (many DAGs)

Current state
– SQL 2003 support in place
– Logical plan interpreter useful for testing
– Value vectors near completion
– High performance RPC working

Page 4: Introduction to Mahout

More on Drill

Just completed OSCON workshop

Workshop materials available shortly
– Extracted technology demonstrators
– Sample queries

Send me email or tweet for more info

Page 5: Introduction to Mahout

What’s Up?

What is Mahout?
– Math library
– Clustering, classifiers, other stuff

Recommendation
– Generalities
– Algorithm Specifics
– System Design
– Important things never mentioned

Final thoughts

Page 6: Introduction to Mahout

What is Mahout?

“Scalable machine learning”
– not just Hadoop-oriented machine learning
– not entirely, that is. Just mostly.

Components
– math library
– clustering
– classification
– decompositions
– recommendations

Page 8: Introduction to Mahout

Mahout Math

Page 9: Introduction to Mahout

Mahout Math

Goals are
– basic linear algebra,
– and statistical sampling,
– and good clustering,
– decent speed,
– extensibility,
– especially for sparse data

But not
– totally badass speed
– comprehensive set of algorithms
– optimization, root finders, quadrature

Page 10: Introduction to Mahout

Matrices and Vectors

At the core:
– DenseVector, RandomAccessSparseVector
– DenseMatrix, SparseRowMatrix

Highly composable API

Important ideas:
– view*, assign and aggregate
– iteration

m.viewDiagonal().assign(v)

Page 11: Introduction to Mahout

Assign? View?

Why assign?
– Copying is the major cost for naïve matrix packages
– In-place operations critical to reasonable performance
– Many kinds of updates required, so functional style very helpful

Why view?
– In-place operations often required for blocks, rows, columns or diagonals
– With views, we need #assign + #views methods
– Without views, we need #assign x #views methods

Synergies
– With both views and assign, many loops become single line

Page 12: Introduction to Mahout

Assign

Matrices

Matrix assign(double value);
Matrix assign(double[][] values);
Matrix assign(Matrix other);
Matrix assign(DoubleFunction f);
Matrix assign(Matrix other, DoubleDoubleFunction f);

Vectors

Vector assign(double value);
Vector assign(double[] values);
Vector assign(Vector other);
Vector assign(DoubleFunction f);
Vector assign(Vector other, DoubleDoubleFunction f);
Vector assign(DoubleDoubleFunction f, double y);

Page 13: Introduction to Mahout

Views

Matrices

Matrix viewPart(int[] offset, int[] size);
Matrix viewPart(int row, int rlen, int col, int clen);
Vector viewRow(int row);
Vector viewColumn(int column);
Vector viewDiagonal();

Vectors

Vector viewPart(int offset, int length);

Page 14: Introduction to Mahout

Aggregates

Matrices

double zSum();
Vector aggregateRows(VectorFunction f);
Vector aggregateColumns(VectorFunction f);
double aggregate(DoubleDoubleFunction combiner, DoubleFunction mapper);

Vectors

double zSum();
double aggregate(DoubleDoubleFunction reduce, DoubleFunction map);
double aggregate(Vector other, DoubleDoubleFunction aggregator, DoubleDoubleFunction combiner);

Page 15: Introduction to Mahout

Predefined Functions

Many handy functions

ABS, ACOS, ASIN, ATAN, CEIL, COS, EXP, FLOOR, IDENTITY, INV, LOGARITHM, LOG2, NEGATE, RINT, SIGN, SIN, SQRT, SQUARE, SIGMOID, SIGMOIDGRADIENT, TAN

Page 16: Introduction to Mahout

Examples

double alpha;
a.assign(alpha);

a.assign(b, Functions.chain(Functions.plus(beta), Functions.times(alpha)));

Page 17: Introduction to Mahout

Sparse Optimizations

DoubleDoubleFunction abstract properties

public boolean isLikeRightPlus();
public boolean isLikeLeftMult();
public boolean isLikeRightMult();
public boolean isLikeMult();
public boolean isCommutative();
public boolean isAssociative();
public boolean isAssociativeAndCommutative();
public boolean isDensifying();

And Vector properties

public boolean isDense();
public boolean isSequentialAccess();
public double getLookupCost();
public double getIteratorAdvanceCost();
public boolean isAddConstantTime();

Page 21: Introduction to Mahout

Examples

The trace of a matrix

m.viewDiagonal().zSum()

Set diagonal to zero

m.viewDiagonal().assign(0)

Set diagonal to negative of row sums excluding the diagonal

Vector diag = m.viewDiagonal().assign(0);
diag.assign(m.rowSums().assign(Functions.NEGATE));
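The view-plus-assign idea behind these one-liners can be sketched without Mahout. What follows is a minimal plain-Java illustration, not Mahout's implementation: `DiagonalView` and its methods are hypothetical names that wrap a `double[][]` and write through to it, loosely mirroring `viewDiagonal()`, `zSum()` and `assign(...)`.

```java
import java.util.function.DoubleUnaryOperator;

// Hypothetical stand-in for a diagonal view; it shares storage with the matrix.
final class DiagonalView {
    private final double[][] m;

    DiagonalView(double[][] m) {
        this.m = m;
    }

    // Number of diagonal entries (handles rectangular matrices).
    int size() {
        return m.length == 0 ? 0 : Math.min(m.length, m[0].length);
    }

    // Sum of the diagonal entries, i.e. the trace of the underlying matrix.
    double zSum() {
        double sum = 0;
        for (int i = 0; i < size(); i++) {
            sum += m[i][i];
        }
        return sum;
    }

    // In-place assignment of a constant; no copy of the matrix is made.
    DiagonalView assign(double value) {
        for (int i = 0; i < size(); i++) {
            m[i][i] = value;
        }
        return this;
    }

    // In-place functional update of each diagonal entry.
    DiagonalView assign(DoubleUnaryOperator f) {
        for (int i = 0; i < size(); i++) {
            m[i][i] = f.applyAsDouble(m[i][i]);
        }
        return this;
    }

    // Zero the diagonal, then store the negated sum of each row on it.
    static void negativeRowSumsOnDiagonal(double[][] m) {
        DiagonalView diag = new DiagonalView(m).assign(0);
        for (int i = 0; i < diag.size(); i++) {
            double rowSum = 0;
            for (double x : m[i]) {
                rowSum += x;
            }
            m[i][i] = -rowSum;
        }
    }
}
```

Because the view holds a reference rather than a copy, one `assign` implementation serves any kind of view, which is the "#assign + #views" argument made earlier in the deck.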

Page 22: Introduction to Mahout

Iteration

Matrices are Iterable in Mahout

// compute both row and column sums in one pass
for (MatrixSlice row: m) {
  rSums.set(row.index(), row.zSum());
  cSums.assign(row, Functions.PLUS);
}

Vectors are densely or sparsely iterable

double entropy = 0;
for (Vector.Element e: v.nonZeroes()) {
  entropy -= e.get() * Math.log(e.get());
}

Page 23: Introduction to Mahout

Random Sampling

Samples from some type

public interface Sampler<T> {
  T sample();
}

Lots of kinds

ChineseRestaurant, Empirical, IndianBuffet, Missing, Multinomial, MultiNormal, Normal, PoissonSampler, Sampler

public abstract class AbstractSamplerFunction extends DoubleFunction implements Sampler<Double>
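To illustrate the "sampler as function" idea in isolation, here is a small plain-Java sketch. It does not use Mahout: `ExponentialSampler` and `fill` are hypothetical names, and `java.util.function.DoubleUnaryOperator` stands in for Mahout's `DoubleFunction`.

```java
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

// Same shape as the Sampler interface on the slide.
interface Sampler<T> {
    T sample();
}

// Hypothetical exponential sampler that also works as a double -> double function.
final class ExponentialSampler implements Sampler<Double>, DoubleUnaryOperator {
    private final Random rng;
    private final double rate;

    ExponentialSampler(double rate, long seed) {
        this.rate = rate;
        this.rng = new Random(seed);
    }

    @Override
    public Double sample() {
        // Inverse-CDF sampling; 1 - nextDouble() lies in (0, 1], so the log is finite.
        return -Math.log(1.0 - rng.nextDouble()) / rate;
    }

    @Override
    public double applyAsDouble(double ignored) {
        // As a function, the argument is ignored and a fresh sample is returned.
        return sample();
    }

    // Fill an array in place, the way a sampler-as-function could be handed to assign().
    static double[] fill(double[] v, DoubleUnaryOperator f) {
        for (int i = 0; i < v.length; i++) {
            v[i] = f.applyAsDouble(v[i]);
        }
        return v;
    }
}
```

The design point is that anything accepting a unary function can be filled with random values without a dedicated "randomize" method.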

Page 24: Introduction to Mahout

Clustering and Such

Streaming k-means and ball k-means
– streaming reduces very large data to a cluster sketch
– ball k-means is a high quality k-means implementation
– the cluster sketch is also usable for other applications
– single machine threaded and map-reduce versions available

SVD and friends
– stochastic SVD has in-memory, single machine out-of-core and map-reduce versions
– good for reducing very large sparse matrices to tall skinny dense ones

Spectral clustering
– based on SVD, allows massive dimensional clustering

Page 25: Introduction to Mahout

Mahout Math Summary

Matrices, Vectors
– views
– in-place assignment
– aggregations
– iterations

Functions
– lots built-in
– cooperate with sparse vector optimizations

Sampling
– abstract samplers
– samplers as functions

Other stuff … clustering, SVD

Page 26: Introduction to Mahout

Recommenders

Page 27: Introduction to Mahout

Recommendations

Often known as collaborative filtering

Actors interact with items
– observe successful interaction

We want to suggest additional successful interactions

Observations inherently very sparse

Page 28: Introduction to Mahout

The Big Ideas

Cooccurrence is the core operation (and it is pretty simple)

Cooccurrence can be extended to handle important new capabilities

Recommendation systems can be deployed ideally using search technology

Page 29: Introduction to Mahout

Examples of Recommendations

Customers buying books (Linden et al)

Web visitors rating music (Shardanand and Maes) or movies (Riedl, et al), (Netflix)

Internet radio listeners not skipping songs (Musicmatch)

Internet video watchers watching >30 s (Veoh)

Visibility in a map UI (new Google maps)

Page 30: Introduction to Mahout

A simple recommendation architecture

Look at the history of interactions

Find significant item cooccurrence in user histories

Use these cooccurring items as “indicators”

For all indicators in user history, accumulate scores for related items
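The last step, accumulating scores from indicators found in a user's history, can be sketched as follows. This is an assumption-laden illustration rather than the talk's implementation: the class name, the map-of-sets representation, and the unweighted hit count are all hypothetical choices.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class IndicatorScoring {
    /**
     * indicators maps each candidate item to the set of items that indicate it.
     * The score of a candidate is the number of its indicators present in the
     * user's history (an unweighted count, chosen here only for simplicity).
     */
    static Map<String, Integer> score(Map<String, Set<String>> indicators,
                                      Set<String> history) {
        Map<String, Integer> scores = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : indicators.entrySet()) {
            int hits = 0;
            for (String indicator : e.getValue()) {
                if (history.contains(indicator)) {
                    hits++;
                }
            }
            if (hits > 0) {
                scores.put(e.getKey(), hits);
            }
        }
        return scores;
    }
}
```

A text search engine performs essentially this overlap scoring (with term weighting) when indicator ids are indexed as fields and the user history is issued as the query.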

Page 31: Introduction to Mahout

Recommendation Basics

History:

User  Thing
1     3
2     4
3     4
2     3
3     2
1     1
2     1

Page 32: Introduction to Mahout

Recommendation Basics

History as matrix:

     t1  t2  t3  t4
u1   1   0   1   0
u2   1   0   1   1
u3   0   1   0   1

(t1, t3) cooccur 2 times, (t1, t4) once, (t2, t4) once, (t3, t4) once

Page 33: Introduction to Mahout

A Quick Simplification

Users who do h

Also do r

User-centric recommendations

Item-centric recommendations

Page 34: Introduction to Mahout

Recommendation Basics

Cooccurrence

     t1  t2  t3  t4
t1   2   0   2   1
t2   0   1   0   1
t3   2   0   2   1
t4   1   1   1   2
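Treating the history as a binary user-by-item matrix A, the cooccurrence counts are the entries of the product of A transposed with A. The sketch below computes that product in plain Java for the seven (user, thing) pairs from the history slide; the class and method names are hypothetical, and dense `int[][]` arrays are used only because the example is tiny.

```java
final class Cooccurrence {
    // Builds the binary user-by-item matrix from (user, item) pairs with 1-based ids.
    static int[][] historyMatrix(int[][] pairs, int users, int items) {
        int[][] a = new int[users][items];
        for (int[] p : pairs) {
            a[p[0] - 1][p[1] - 1] = 1;
        }
        return a;
    }

    // Computes A'A: entry [i][j] counts the users who interacted with both item i and item j.
    static int[][] cooccurrence(int[][] a) {
        int items = a[0].length;
        int[][] c = new int[items][items];
        for (int[] userRow : a) {
            for (int i = 0; i < items; i++) {
                if (userRow[i] == 0) {
                    continue; // skip zeros; real histories are very sparse
                }
                for (int j = 0; j < items; j++) {
                    c[i][j] += userRow[i] * userRow[j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int[][] pairs = {{1, 3}, {2, 4}, {3, 4}, {2, 3}, {3, 2}, {1, 1}, {2, 1}};
        int[][] c = cooccurrence(historyMatrix(pairs, 3, 4));
        for (int[] row : c) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

At scale the same product is computed over sparse rows, one user history at a time, which is what makes it a natural fit for map-reduce.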

Page 35: Introduction to Mahout

Problems with Raw Cooccurrence

Very popular items co-occur with everything
– Welcome document
– Elevator music

That isn’t interesting
– We want anomalous cooccurrence

Page 36: Introduction to Mahout

Recommendation Basics

Cooccurrence

     t1  t2  t3  t4
t1   2   0   2   1
t2   0   1   0   1
t3   2   0   2   1
t4   1   1   1   2

         t3   not t3
t1       2    1
not t1   1    1

Page 37: Introduction to Mahout

Spot the Anomaly

Root LLR is roughly like standard deviations

         A      not A
B        13     1000
not B    1000   100,000
Root LLR: 0.44

         A      not A
B        1      0
not B    0      2
Root LLR: 0.98

         A      not A
B        1      0
not B    0      10,000
Root LLR: 2.26

         A      not A
B        10     0
not B    0      100,000
Root LLR: 7.15
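For readers who want to reproduce this kind of score, here is a plain-Java sketch of the log-likelihood ratio (G²) test on a 2x2 contingency table, written in the entropy form. The class and method names are hypothetical and this is not copied from Mahout. Under this particular definition the second table (1, 0, 0, 2) scores about 1.95, roughly twice the figure printed on the slide, so the slide's numbers appear to use a different scaling; the ranking of the four tables is the same either way.

```java
final class RootLlr {
    private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized Shannon entropy: N log N minus the sum of k log k over the counts.
    private static double entropy(long... counts) {
        long sum = 0;
        double sumXLogX = 0;
        for (long c : counts) {
            sum += c;
            sumXLogX += xLogX(c);
        }
        return xLogX(sum) - sumXLogX;
    }

    // G^2 statistic: 2 * (H(rows) + H(columns) - H(cells)), clamped at zero against rounding.
    static double llr(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    // Signed square root: negative when the first row's rate is below the second row's.
    static double rootLlr(long k11, long k12, long k21, long k22) {
        double root = Math.sqrt(llr(k11, k12, k21, k22));
        double rateFirstRow = (double) k11 / (k11 + k12);
        double rateSecondRow = (double) k21 / (k21 + k22);
        return rateFirstRow < rateSecondRow ? -root : root;
    }
}
```

The square root puts the score on a scale comparable to standard deviations, which is why a fixed threshold can be applied across item pairs of very different popularity.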

Page 38: Introduction to Mahout

Threshold by Score

Cooccurrence

     t1  t2  t3  t4
t1   2   0   2   1
t2   0   1   0   1
t3   2   0   2   1
t4   1   1   1   2

Page 39: Introduction to Mahout

Threshold by Score

Significant cooccurrence => Indicators

     t1  t2  t3  t4
t1   1   0   0   1
t2   0   1   0   1
t3   0   0   1   1
t4   1   0   0   1

Page 40: Introduction to Mahout

So Far, So Good

Classic recommendation systems based on these approaches
– Musicmatch (ca 2000)
– Veoh Networks (ca 2005)

Currently available in Mahout
– See RowSimilarityJob

Very simple to deploy
– Compute indicators
– Store in search engine
– Works very well with enough data

Page 41: Introduction to Mahout

What’s right about this?

Page 42: Introduction to Mahout

Virtues of Current State of the Art

Lots of well publicized history
– Musicmatch, Veoh, Netflix, Amazon, Overstock

Lots of support
– Mahout, commercial offerings like Myrrix

Lots of existing code
– Mahout, commercial codes

Proven track record

Well socialized solution

Page 43: Introduction to Mahout

What’s wrong about this?

Page 44: Introduction to Mahout

Problems for Recommenders

Cold start

Disjoint populations

Long tail

Multiple kinds of evidence (multi-modal recommendations)
– unstructured add-on data
– other transaction streams
– textual descriptions

Page 45: Introduction to Mahout

What is this multi-modal stuff?

But people don’t just do one thing

One kind of behavior is useful for predicting other kinds

Having a complete picture is important for accuracy

What has the user said, viewed, clicked, closed, bought lately?

Page 46: Introduction to Mahout

Example Multi-modal Inputs

Overlap in restaurant visits is useful

Big spender cues

Cuisine as an indicator

Review text as an indicator

Page 47: Introduction to Mahout

Too Limited

People do more than one kind of thing

Different kinds of behaviors give different quality, quantity and kind of information

We don’t have to do co-occurrence

We can do cross-occurrence

Result is cross-recommendation

Page 48: Introduction to Mahout

Heh?

Page 49: Introduction to Mahout

For example

Users enter queries (A)
– (actor = user, item = query)

Users view videos (B)
– (actor = user, item = video)

A’A gives query recommendation
– “did you mean to ask for”

B’B gives video recommendation
– “you might like these videos”

Page 50: Introduction to Mahout

The punch-line

B’A recommends videos in response to a query
– (isn’t that a search engine?)
– (not quite, it doesn’t look at content or meta-data)
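The cross-occurrence product, the transpose of the viewing matrix B multiplied by the query matrix A, can be sketched in plain Java as below. The class and method names are hypothetical and dense arrays are used only for brevity; rows of both inputs must refer to the same users in the same order.

```java
final class CrossOccurrence {
    /**
     * Computes B'A where b is users x videos and a is users x queries.
     * Entry [v][q] counts the users who both viewed video v and entered query q.
     */
    static int[][] cross(int[][] b, int[][] a) {
        if (b.length != a.length) {
            throw new IllegalArgumentException("both matrices need one row per user");
        }
        int videos = b[0].length;
        int queries = a[0].length;
        int[][] result = new int[videos][queries];
        for (int user = 0; user < b.length; user++) {
            for (int v = 0; v < videos; v++) {
                if (b[user][v] == 0) {
                    continue; // this user did not view video v
                }
                for (int q = 0; q < queries; q++) {
                    result[v][q] += b[user][v] * a[user][q];
                }
            }
        }
        return result;
    }
}
```

Reading a column of the result gives, for one query, the videos most often watched by the people who typed it; as with plain cooccurrence, raw counts would normally be filtered by a significance test before use.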

Page 51: Introduction to Mahout

Real-life example

Query: “Paco de Lucia”

Conventional meta-data search results:
– “hombres del paco” times 400
– not much else

Recommendation based search:
– Flamenco guitar and dancers
– Spanish and classical guitar
– Van Halen doing a classical/flamenco riff

Page 52: Introduction to Mahout

Real-life example

Page 53: Introduction to Mahout

Hypothetical Example

Want a navigational ontology?

Just put labels on a web page with traffic
– This gives A = users x label clicks

Remember viewing history
– This gives B = users x items

Cross recommend
– B’A = label to item mapping

After several users click, results are whatever users think they should be

Page 54: Introduction to Mahout

Page 55: Introduction to Mahout

Nice. But we can do better?

Page 56: Introduction to Mahout

users

things

Page 57: Introduction to Mahout

users

thing type 1

thing type 2

Page 58: Introduction to Mahout

Page 59: Introduction to Mahout

Summary

Input: Multiple kinds of behavior on one set of things

Output: Recommendations for one kind of behavior with a different set of things

Cross recommendation is a special case

Page 60: Introduction to Mahout

Now again, without the scary math

Page 62: Introduction to Mahout

Input Data

User transactions
– user id, merchant id
– SIC code, amount
– Descriptions, cuisine, …

Offer transactions
– user id, offer id
– vendor id, merchant id’s
– offers, views, accepts

Derived user data
– merchant id’s
– anomalous descriptor terms
– offer & vendor id’s

Derived merchant data
– local top40
– SIC code
– vendor code
– amount distribution

Page 63: Introduction to Mahout

Cross-recommendation

Per merchant indicators
– merchant id’s
– chain id’s
– SIC codes
– indicator terms from text
– offer vendor id’s

Computed by finding anomalous (indicator => merchant) rates

Page 64: Introduction to Mahout

How can we deploy this?

Page 68: Introduction to Mahout

Search-based Recommendations

Sample document

Original data and meta-data:
– Merchant Id
– Field for text description
– Phone
– Address
– Location

Derived from cooccurrence and cross-occurrence analysis:
– Indicator merchant id’s
– Indicator industry (SIC) id’s
– Indicator offers
– Indicator text
– Local top40

Sample query (recommendation query)
– Current location
– Recent merchant descriptions
– Recent merchant id’s
– Recent SIC codes
– Recent accepted offers
– Local top40

Page 69: Introduction to Mahout

Analyze with Map-Reduce

[Diagram labels: complete history, cooccurrence (Mahout), item meta-data, Solr indexers, Solr indexing, index shards]

Page 70: Introduction to Mahout

Deploy with Conventional Search System

[Diagram labels: user history, web tier, Solr search, item meta-data, Solr indexers, index shards]

Page 71: Introduction to Mahout

Objective Results

At a very large credit card company

History is all transactions

Development time to minimal viable product about 4 months

General release 2-3 months later

Search-based recs at least equal in quality to other techniques

Page 72: Introduction to Mahout

Contact:
– [email protected]
– @ted_dunning
– @apachemahout
– @[email protected]

Slides and such
http://www.slideshare.net/tdunning

Hash tags: #mapr #apachemahout #recommendations