Introduction to Mahout

Post on 10-May-2015

Slides for the talk I gave to the Twin Cities HUG

1©MapR Technologies 2013- Confidential

Introduction to Mahout
And How To Build a Recommender

Me, Us

Ted Dunning, Chief Application Architect, MapR
– Committer, PMC member: Mahout, Zookeeper, Drill
– Bought the beer at the first HUG

MapR
– Distributes more open source components for Hadoop
– Adds major technology for performance, HA, industry standard API’s

Tonight
– Hash tag: #tchug
– See also: @ApacheMahout @ApacheDrill
– @ted_dunning and @mapR

Sidebar on Drill

Apache Drill
– SQL on Hadoop (and other things)
– Intended to solve problems for 1-5 years from now, not the problems from 1-10 years ago
– Multiple levels of API supported
  • SQL-2003
  • Logical plan language (DAG in JSON)
  • Physical plan language (DAG with push-down, exchange markers)
  • Execution plan language (many DAG’s)

Current state
– SQL 2003 support in place
– Logical plan interpreter useful for testing
– Value vectors near completion
– High performance RPC working

More on Drill

Just completed OSCON workshop

Workshop materials available shortly
– Extracted technology demonstrators
– Sample queries

Send me email or tweet for more info

What’s Up?

What is Mahout?
– Math library
– Clustering, classifiers, other stuff

Recommendation
– Generalities
– Algorithm Specifics
– System Design
– Important things never mentioned

Final thoughts

What is Mahout?

“Scalable machine learning”
– not just Hadoop-oriented machine learning
– not entirely, that is. Just mostly.

Components
– math library
– clustering
– classification
– decompositions
– recommendations

Mahout Math

Goals are
– basic linear algebra,
– and statistical sampling,
– and good clustering,
– decent speed,
– extensibility,
– especially for sparse data

But not
– totally badass speed
– comprehensive set of algorithms
– optimization, root finders, quadrature

Matrices and Vectors

At the core:
– DenseVector, RandomAccessSparseVector
– DenseMatrix, SparseRowMatrix

Highly composable API

Important ideas:
– view*, assign and aggregate
– iteration

m.viewDiagonal().assign(v)

Assign? View?

Why assign?
– Copying is the major cost for naïve matrix packages
– In-place operations critical to reasonable performance
– Many kinds of updates required, so functional style very helpful

Why view?
– In-place operations often required for blocks, rows, columns or diagonals
– With views, we need #assign + #views methods
– Without views, we need #assign x #views methods

Synergies
– With both views and assign, many loops become single line
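
The "#assign + #views" point can be sketched in plain Java. This is not Mahout's implementation: the VectorView interface and the viewRow/viewDiagonal helpers below are hypothetical stand-ins over a double[][]. Each view supplies only size/get/set, and the in-place assign is written once and shared by every view.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

/** Plain-Java sketch of the view + assign idea; NOT Mahout's implementation. */
final class ViewAssignSketch {

    /** A mutable window onto some cells of a backing matrix (hypothetical interface). */
    interface VectorView {
        int size();
        double get(int i);
        void set(int i, double value);

        /** In-place update of every cell in the view; written once for all views. */
        default VectorView assign(DoubleUnaryOperator f) {
            for (int i = 0; i < size(); i++) {
                set(i, f.applyAsDouble(get(i)));
            }
            return this;
        }

        /** Sum of the cells in the view. */
        default double zSum() {
            double sum = 0;
            for (int i = 0; i < size(); i++) {
                sum += get(i);
            }
            return sum;
        }
    }

    /** View of one row; writes go straight through to the backing array. */
    static VectorView viewRow(double[][] m, int row) {
        return new VectorView() {
            @Override public int size() { return m[row].length; }
            @Override public double get(int i) { return m[row][i]; }
            @Override public void set(int i, double v) { m[row][i] = v; }
        };
    }

    /** View of the main diagonal. */
    static VectorView viewDiagonal(double[][] m) {
        return new VectorView() {
            @Override public int size() { return Math.min(m.length, m[0].length); }
            @Override public double get(int i) { return m[i][i]; }
            @Override public void set(int i, double v) { m[i][i] = v; }
        };
    }

    public static void main(String[] args) {
        double[][] m = {{1, 2}, {3, 4}};
        System.out.println(viewDiagonal(m).zSum());   // trace, read through a view
        viewDiagonal(m).assign(x -> 0.0);              // zero the diagonal in place
        viewRow(m, 0).assign(x -> x * 10);             // scale one row in place
        System.out.println(Arrays.deepToString(m));
    }
}
```

Adding a new kind of view (a column, a block) costs three small methods and gets every assign variant for free, which is the additive rather than multiplicative growth the slide describes.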

Assign

Matrices

Matrix assign(double value);
Matrix assign(double[][] values);
Matrix assign(Matrix other);
Matrix assign(DoubleFunction f);
Matrix assign(Matrix other, DoubleDoubleFunction f);

Vectors

Vector assign(double value);
Vector assign(double[] values);
Vector assign(Vector other);
Vector assign(DoubleFunction f);
Vector assign(Vector other, DoubleDoubleFunction f);
Vector assign(DoubleDoubleFunction f, double y);

Views

Matrices

Matrix viewPart(int[] offset, int[] size);
Matrix viewPart(int row, int rlen, int col, int clen);
Vector viewRow(int row);
Vector viewColumn(int column);
Vector viewDiagonal();

Vectors

Vector viewPart(int offset, int length);

Aggregates

Matrices

double zSum();
Vector aggregateRows(VectorFunction f);
Vector aggregateColumns(VectorFunction f);
double aggregate(DoubleDoubleFunction combiner, DoubleFunction mapper);

Vectors

double zSum();
double aggregate(DoubleDoubleFunction reduce, DoubleFunction map);
double aggregate(Vector other, DoubleDoubleFunction aggregator, DoubleDoubleFunction combiner);

Predefined Functions

Many handy functions

ABS, ACOS, ASIN, ATAN, CEIL, COS, EXP, FLOOR, IDENTITY, INV, LOGARITHM, LOG2, NEGATE, RINT, SIGN, SIN, SQRT, SQUARE, SIGMOID, SIGMOIDGRADIENT, TAN

Examples

double alpha;
a.assign(alpha);

a.assign(b, Functions.chain(
    Functions.plus(beta),
    Functions.times(alpha)));

Sparse Optimizations

DoubleDoubleFunction abstract properties

public boolean isLikeRightPlus();
public boolean isLikeLeftMult();
public boolean isLikeRightMult();
public boolean isLikeMult();
public boolean isCommutative();
public boolean isAssociative();
public boolean isAssociativeAndCommutative();
public boolean isDensifying();

And Vector properties

public boolean isDense();
public boolean isSequentialAccess();
public double getLookupCost();
public double getIteratorAdvanceCost();
public boolean isAddConstantTime();

More Examples

The trace of a matrix

m.viewDiagonal().zSum()

Set diagonal to zero

m.viewDiagonal().assign(0)

Set diagonal to negative of row sums excluding the diagonal

Vector diag = m.viewDiagonal().assign(0);
diag.assign(m.rowSums().assign(Functions.NEGATE));

Iteration

Matrices are Iterable in Mahout

Vectors are densely or sparsely iterable

// compute both row and column sums in one pass
for (MatrixSlice row : m) {
  rSums.set(row.index(), row.zSum());
  cSums.assign(row, Functions.PLUS);
}

// entropy of v, assuming its non-zero entries are probabilities
double entropy = 0;
for (Vector.Element e : v.nonZeroes()) {
  entropy -= e.get() * Math.log(e.get());
}

Random Sampling

Samples from some type

public interface Sampler<T> {
  T sample();
}

public abstract class AbstractSamplerFunction extends DoubleFunction implements Sampler<Double>

Lots of kinds

ChineseRestaurant, Empirical, IndianBuffet, Missing, Multinomial, MultiNormal, Normal, PoissonSampler, Sampler

Clustering and Such

Streaming k-means and ball k-means
– streaming reduces very large data to a cluster sketch
– ball k-means is a high quality k-means implementation
– the cluster sketch is also usable for other applications
– single machine threaded and map-reduce versions available

SVD and friends
– stochastic SVD has in-memory, single machine out-of-core and map-reduce versions
– good for reducing very large sparse matrices to tall skinny dense ones

Spectral clustering
– based on SVD, allows massive dimensional clustering

Mahout Math Summary

Matrices, Vectors
– views
– in-place assignment
– aggregations
– iterations

Functions
– lots built-in
– cooperate with sparse vector optimizations

Sampling
– abstract samplers
– samplers as functions

Other stuff … clustering, SVD

Recommenders

Recommendations

Often known as collaborative filtering

Actors interact with items
– observe successful interaction

We want to suggest additional successful interactions

Observations inherently very sparse

The Big Ideas

Cooccurrence is the core operation (and it is pretty simple)

Cooccurrence can be extended to handle important new capabilities

Recommendation systems can be deployed ideally using search technology

Examples of Recommendations

Customers buying books (Linden et al)

Web visitors rating music (Shardanand and Maes) or movies (Riedl, et al), (Netflix)

Internet radio listeners not skipping songs (Musicmatch)

Internet video watchers watching >30 s (Veoh)

Visibility in a map UI (new Google maps)

A simple recommendation architecture

Look at the history of interactions

Find significant item cooccurrence in user histories

Use these cooccurring items as “indicators”

For all indicators in user history, accumulate scores for related items

Recommendation Basics

History:

User  Thing
1     3
2     4
3     4
2     3
3     2
1     1
2     1

Recommendation Basics

History as matrix:

     t1  t2  t3  t4
u1    1   0   1   0
u2    1   0   1   1
u3    0   1   0   1

(t1, t3) cooccur 2 times, (t1, t4) once, (t2, t4) once, (t3, t4) once

A Quick Simplification

Users who do h

Also do r

User-centric recommendations

Item-centric recommendations

Recommendation Basics

Cooccurrence

     t1  t2  t3  t4
t1    2   0   2   1
t2    0   1   0   1
t3    2   0   2   1
t4    1   1   1   2
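
A minimal sketch of how such a matrix can be computed, assuming the history is held as a small dense 0/1 array A (users x things). The cooccurrence matrix is then A'A, and an item-centric recommendation for a history vector h is (A'A)h. The class and method names are made up for illustration; at scale this would be done with sparse structures, for example in Mahout.

```java
import java.util.Arrays;

/** Dense toy version of cooccurrence counting; names are illustrative only. */
final class CooccurrenceSketch {

    /** Returns A'A for a user x item history matrix A with 0/1 entries. */
    static int[][] cooccurrence(int[][] history) {
        int items = history[0].length;
        int[][] c = new int[items][items];
        for (int[] user : history) {
            for (int i = 0; i < items; i++) {
                if (user[i] == 0) {
                    continue;               // item i not in this user's history
                }
                for (int j = 0; j < items; j++) {
                    c[i][j] += user[i] * user[j];
                }
            }
        }
        return c;
    }

    /** Item-centric scores (A'A)h, leaving items already in h at zero. */
    static int[] score(int[][] c, int[] h) {
        int[] r = new int[h.length];
        for (int i = 0; i < c.length; i++) {
            if (h[i] != 0) {
                continue;                   // do not recommend what was already done
            }
            for (int j = 0; j < c.length; j++) {
                r[i] += c[i][j] * h[j];
            }
        }
        return r;
    }

    public static void main(String[] args) {
        int[][] a = {
            {1, 0, 1, 0},   // u1
            {1, 0, 1, 1},   // u2
            {0, 1, 0, 1},   // u3
        };
        int[][] c = cooccurrence(a);
        for (int[] row : c) {
            System.out.println(Arrays.toString(row));
        }
        // Scores for a user whose history contains only t1.
        System.out.println(Arrays.toString(score(c, new int[] {1, 0, 0, 0})));
    }
}
```

The off-diagonal entries are the pair counts listed with the history matrix; the diagonal simply counts how many users touched each thing.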

Problems with Raw Cooccurrence

Very popular items co-occur with everything
– Welcome document
– Elevator music

That isn’t interesting
– We want anomalous cooccurrence

Recommendation Basics

Cooccurrence, as a 2 x 2 table for one pair of things:

         t3   not t3
t1        2     1
not t1    1     1

Spot the Anomaly

Root LLR is roughly like standard deviations

         A      not A
B        13     1000
not B    1000   100,000
Root LLR: 0.44

         A      not A
B        1      0
not B    0      2
Root LLR: 0.98

         A      not A
B        1      0
not B    0      10,000
Root LLR: 2.26

         A      not A
B        10     0
not B    0      100,000
Root LLR: 7.15
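
A hedged sketch of the score for a 2 x 2 table, using the common G² form of the log-likelihood ratio, 2 Σ k ln(k N / (row total x column total)), and taking its square root. The class and method names are illustrative. Under this convention the four tables come out at about 0.9, 1.95, 4.5 and 14.3, roughly twice the figures on the slide, which presumably used a different constant factor; the ranking is the same.

```java
/** Root log-likelihood ratio for a 2x2 contingency table (illustrative sketch). */
final class LlrSketch {

    /** One cell's contribution, k * ln(k * n / (row * col)), with 0 * ln(0) taken as 0. */
    private static double term(long k, long row, long col, long n) {
        if (k == 0) {
            return 0.0;
        }
        return k * Math.log((double) k * n / ((double) row * col));
    }

    /** G^2 statistic: twice the sum of observed * ln(observed / expected). */
    static double llr(long k11, long k12, long k21, long k22) {
        long r1 = k11 + k12;
        long r2 = k21 + k22;
        long c1 = k11 + k21;
        long c2 = k12 + k22;
        long n = r1 + r2;
        double sum = term(k11, r1, c1, n) + term(k12, r1, c2, n)
                   + term(k21, r2, c1, n) + term(k22, r2, c2, n);
        return Math.max(0.0, 2.0 * sum);    // guard against tiny negative round-off
    }

    /** Unsigned square root of G^2. */
    static double rootLlr(long k11, long k12, long k21, long k22) {
        return Math.sqrt(llr(k11, k12, k21, k22));
    }

    public static void main(String[] args) {
        System.out.println(rootLlr(13, 1000, 1000, 100_000));
        System.out.println(rootLlr(1, 0, 0, 2));
        System.out.println(rootLlr(1, 0, 0, 10_000));
        System.out.println(rootLlr(10, 0, 0, 100_000));
    }
}
```

The first table has the largest raw cooccurrence count (13) but the lowest score, because A and B are both common enough that 13 is about what chance would produce.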

Threshold by Score

Significant cooccurrence => Indicators

     t1  t2  t3  t4
t1    1   0   0   1
t2    0   1   0   1
t3    0   0   1   1
t4    1   0   0   1

So Far, So Good

Classic recommendation systems based on these approaches
– Musicmatch (ca 2000)
– Veoh Networks (ca 2005)

Currently available in Mahout
– See RowSimilarityJob

Very simple to deploy
– Compute indicators
– Store in search engine
– Works very well with enough data

What’s right about this?

Virtues of Current State of the Art

Lots of well publicized history
– Musicmatch, Veoh, Netflix, Amazon, Overstock

Lots of support
– Mahout, commercial offerings like Myrrix

Lots of existing code
– Mahout, commercial codes

Proven track record

Well socialized solution

What’s wrong about this?

Problems for Recommenders

Cold start

Disjoint populations

Long tail

Multiple kinds of evidence (multi-modal recommendations)
– unstructured add-on data
– other transaction streams
– textual descriptions

What is this multi-modal stuff?

But people don’t just do one thing

One kind of behavior is useful for predicting other kinds

Having a complete picture is important for accuracy

What has the user said, viewed, clicked, closed, bought lately?

Example Multi-modal Inputs

Overlap in restaurant visits is useful

Big spender cues

Cuisine as an indicator

Review text as an indicator

Too Limited

People do more than one kind of thing

Different kinds of behaviors give different quality, quantity and kind of information

We don’t have to do co-occurrence

We can do cross-occurrence

Result is cross-recommendation

Heh?

For example

Users enter queries (A)
– (actor = user, item = query)

Users view videos (B)
– (actor = user, item = video)

A’A gives query recommendation
– “did you mean to ask for”

B’B gives video recommendation
– “you might like these videos”

The punch-line

B’A recommends videos in response to a query
– (isn’t that a search engine?)
– (not quite, it doesn’t look at content or meta-data)
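
A toy sketch of cross-occurrence, assuming A (users x queries) and B (users x videos) are small dense 0/1 arrays over the same users. B’A then counts, for each video and query, how many users did both, and multiplying it by a query vector scores the videos. The names and data are invented for illustration, and a real system would keep only the anomalous entries rather than raw counts.

```java
import java.util.Arrays;

/** Dense toy version of cross-occurrence B'A; names and data are illustrative. */
final class CrossOccurrenceSketch {

    /** Returns B'A: one row per item of B (video), one column per item of A (query). */
    static int[][] crossOccurrence(int[][] b, int[][] a) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("A and B must describe the same users");
        }
        int bItems = b[0].length;
        int aItems = a[0].length;
        int[][] c = new int[bItems][aItems];
        for (int u = 0; u < a.length; u++) {
            for (int i = 0; i < bItems; i++) {
                if (b[u][i] == 0) {
                    continue;               // user u did not view video i
                }
                for (int j = 0; j < aItems; j++) {
                    c[i][j] += b[u][i] * a[u][j];
                }
            }
        }
        return c;
    }

    /** Scores every video for a query vector q: (B'A) q. */
    static int[] recommend(int[][] bta, int[] q) {
        int[] scores = new int[bta.length];
        for (int i = 0; i < bta.length; i++) {
            for (int j = 0; j < q.length; j++) {
                scores[i] += bta[i][j] * q[j];
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        int[][] a = {{1, 0}, {1, 0}, {0, 1}};            // users x queries
        int[][] b = {{1, 1, 0}, {1, 0, 0}, {0, 0, 1}};   // users x videos
        int[][] bta = crossOccurrence(b, a);
        System.out.println(Arrays.deepToString(bta));
        // Video scores for someone who just entered the first query.
        System.out.println(Arrays.toString(recommend(bta, new int[] {1, 0})));
    }
}
```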

Real-life example

Query: “Paco de Lucia”

Conventional meta-data search results:
– “hombres del paco” times 400
– not much else

Recommendation based search:
– Flamenco guitar and dancers
– Spanish and classical guitar
– Van Halen doing a classical/flamenco riff

Hypothetical Example

Want a navigational ontology?

Just put labels on a web page with traffic
– This gives A = users x label clicks

Remember viewing history
– This gives B = users x items

Cross recommend
– B’A = label to item mapping

After several users click, results are whatever users think they should be

Nice. But we can do better?

[Diagrams: a users x things matrix, then users x thing type 1 and users x thing type 2 matrices]

Summary

Input: Multiple kinds of behavior on one set of things

Output: Recommendations for one kind of behavior with a different set of things

Cross recommendation is a special case

Now again, without the scary math

Input Data

User transactions
– user id, merchant id
– SIC code, amount
– Descriptions, cuisine, …

Offer transactions
– user id, offer id
– vendor id, merchant id’s
– offers, views, accepts

Derived user data
– merchant id’s
– anomalous descriptor terms
– offer & vendor id’s

Derived merchant data
– local top40
– SIC code
– vendor code
– amount distribution

Cross-recommendation

Per merchant indicators
– merchant id’s
– chain id’s
– SIC codes
– indicator terms from text
– offer vendor id’s

Computed by finding anomalous (indicator => merchant) rates

How can we deploy this?

Search-based Recommendations

Sample document

Original data and meta-data
– Merchant Id
– Field for text description
– Phone
– Address
– Location

Derived from cooccurrence and cross-occurrence analysis
– Indicator merchant id’s
– Indicator industry (SIC) id’s
– Indicator offers
– Indicator text
– Local top40

Sample query (the recommendation query)
– Current location
– Recent merchant descriptions
– Recent merchant id’s
– Recent SIC codes
– Recent accepted offers
– Local top40
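
A rough sketch of how the recommendation query might be assembled: the user's recent behavior becomes search terms against the indicator fields of the merchant documents. The field names (indicator_merchant_ids, indicator_sic_ids) and the ids are invented for illustration, the field:(term term) grouping follows Lucene's classic query syntax, and escaping, boosting and the location filter are left out.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

/** Builds a Lucene-style query string from recent user behavior (illustrative sketch). */
final class RecommendationQuerySketch {

    /**
     * Each map entry is a hypothetical indicator field and the user's recent values for it.
     * Fields with no recent values are skipped.
     */
    static String buildQuery(Map<String, List<String>> recentByField) {
        StringJoiner clauses = new StringJoiner(" OR ");
        for (Map.Entry<String, List<String>> e : recentByField.entrySet()) {
            if (e.getValue().isEmpty()) {
                continue;
            }
            clauses.add(e.getKey() + ":(" + String.join(" ", e.getValue()) + ")");
        }
        return clauses.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> recent = new LinkedHashMap<>();
        recent.put("indicator_merchant_ids", List.of("m17", "m42"));   // recent merchant ids
        recent.put("indicator_sic_ids", List.of("5812"));              // recent SIC codes
        recent.put("indicator_offers", List.of());                     // nothing accepted lately
        System.out.println(buildQuery(recent));
    }
}
```

The search engine's ordinary relevance ranking then does the score accumulation: merchants whose indicator fields match more of the user's recent history rank higher.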

Analyze with Map-Reduce

[Diagram labels: complete history, cooccurrence (Mahout), item meta-data, Solr indexers, Solr indexing, index shards]

Deploy with Conventional Search System

[Diagram labels: user history, web tier, Solr search, item meta-data, Solr indexers, index shards]

Objective Results

At a very large credit card company

History is all transactions

Development time to minimal viable product about 4 months

General release 2-3 months later

Search-based recs at or equal in quality to other techniques

Contact:
– tdunning@maprtech.com
– @ted_dunning
– @apachemahout
– user-subscribe@mahout.apache.org

Slides and such
http://www.slideshare.net/tdunning

Hash tags: #mapr #apachemahout #recommendations