New Directions for Mahout

description

I gave this talk at Buzzwords just now to fill in for an ill speaker. The topics include things that are being added to or taken out of Mahout. These include cruft (out), fast clustering (in), nearest neighbor search (in), Pig bindings for Mahout (who knows).

Transcript of New Directions for Mahout

Page 1: New Directions for Mahout

1©MapR Technologies - Confidential

New Directions in Mahout

Page 2: New Directions for Mahout

Cut Out Bloat

Page 4: New Directions for Mahout

Bloat is Leaving in 0.7

Lots of abandoned code in Mahout
– average code quality is poor
– no users
– no maintainers
– why do we care?

Examples
– old LDA
– old Naïve Bayes
– genetic algorithms

If you care, get on the mailing list

0.7 is about to be released

Page 5: New Directions for Mahout

Integration of Collections

Page 6: New Directions for Mahout

Nobody Cares about Collections

We need it, math is built on it

Pull it into math

Broke the build (battle of the code expanders)

Fixed now (thanks Grant)

Page 7: New Directions for Mahout

K-nearest Neighbor with Super Fast k-means

Page 8: New Directions for Mahout

What’s that?

Find the k nearest training examples

Use the average value of the target variable from them

This is easy … but hard
– easy because it is so conceptually simple and you don’t have knobs to turn or models to build
– hard because of the stunning amount of math
– also hard because we need top 50,000 results

Initial prototype was massively too slow
– 3K queries x 200K examples takes hours
– needed 20M x 25M in the same time
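
The "find the k nearest training examples and average their targets" idea can be sketched as a brute-force Java routine. This is only an illustration of the concept, not the prototype from the talk; the class and method names are made up here, and the full sort per query is exactly the kind of cost that makes the naive approach too slow at scale.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Brute-force k-nearest-neighbor regression (illustrative sketch, not Mahout code). */
final class KnnRegressionSketch {

    /** Squared Euclidean distance between two equal-length vectors. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int j = 0; j < a.length; j++) {
            double diff = a[j] - b[j];
            sum += diff * diff;
        }
        return sum;
    }

    /** Predicts the target for query as the mean target of its k nearest training examples. */
    static double predict(double[][] examples, double[] targets, double[] query, int k) {
        Integer[] order = new Integer[examples.length];
        for (int n = 0; n < order.length; n++) {
            order[n] = n;
        }
        // Sort example indices by distance to the query: O(n log n) work for every query.
        Arrays.sort(order, Comparator.comparingDouble(idx -> squaredDistance(examples[idx], query)));
        int limit = Math.min(k, order.length);
        double sum = 0;
        for (int n = 0; n < limit; n++) {
            sum += targets[order[n]];
        }
        return sum / limit;
    }

    public static void main(String[] args) {
        double[][] examples = {{0, 0}, {1, 0}, {0, 1}, {10, 10}};
        double[] targets = {1.0, 2.0, 3.0, 100.0};
        // The three examples near the origin are averaged; the far one is ignored.
        System.out.println(predict(examples, targets, new double[] {0.2, 0.2}, 3));
    }
}
```

There are no knobs beyond k and the distance measure, which is the "easy" part; every query touching every example is the "hard" part.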

Page 10: New Directions for Mahout

How We Did It

2-week hackathon with 6 developers from customer bank

Agile-ish development

To avoid IP issues
– all code is Apache Licensed (no ownership question)
– all data is synthetic (no question of private data)
– all development done on individual machines, hosting on GitHub
– open is easier than closed (in this case)

Goal is new open technology to facilitate new closed solutions

Ambitious goal of ~ 1,000,000 x speedup
– well, really only 100-1000x after basic hygiene
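
A back-of-the-envelope check on the 1,000,000x figure, using the query and example counts from the prototype slide. It assumes cost scales with the number of query-example pairs; the slides do not spell this derivation out, so treat it as a plausible reading rather than the speaker's own arithmetic.

```latex
\[
\frac{(2\times10^{7})\,(2.5\times10^{7})}{(3\times10^{3})\,(2\times10^{5})}
= \frac{5\times10^{14}}{6\times10^{8}}
\approx 8.3\times10^{5} \approx 10^{6}
\]
```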

Page 11: New Directions for Mahout

What We Did

Mechanism for extending Mahout Vectors
– DelegatingVector, WeightedVector, Centroid

Searcher interface
– ProjectionSearch, KmeansSearch, LshSearch, Brute

Super-fast clustering
– Kmeans, StreamingKmeans

Page 12: New Directions for Mahout

Projection Search

java.util.TreeSet!
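
One way a TreeSet can drive projection search, sketched under assumptions: each point is projected onto a few random directions, the projections are kept sorted in a java.util.TreeSet, and a query only compares true distances against the points whose projections sit next to its own. The class below is hypothetical and is not Mahout's ProjectionSearch; the number of projections and the window size are free parameters chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

/** Approximate nearest-neighbor search over random projections (sketch only). */
final class ProjectionSearchSketch {

    /** A data point's index together with its scalar projection; ordered by projection. */
    private static final class Entry implements Comparable<Entry> {
        final double projection;
        final int index;

        Entry(double projection, int index) {
            this.projection = projection;
            this.index = index;
        }

        @Override
        public int compareTo(Entry other) {
            int byProjection = Double.compare(projection, other.projection);
            return byProjection != 0 ? byProjection : Integer.compare(index, other.index);
        }
    }

    private final double[][] data;
    private final double[][] directions;
    private final List<TreeSet<Entry>> trees = new ArrayList<>();

    ProjectionSearchSketch(double[][] data, int numProjections, long seed) {
        this.data = data;
        Random random = new Random(seed);
        int dim = data[0].length;
        directions = new double[numProjections][dim];
        for (int p = 0; p < numProjections; p++) {
            for (int j = 0; j < dim; j++) {
                directions[p][j] = random.nextGaussian();
            }
            TreeSet<Entry> tree = new TreeSet<>();
            for (int i = 0; i < data.length; i++) {
                tree.add(new Entry(dot(directions[p], data[i]), i));
            }
            trees.add(tree);
        }
    }

    /** Returns the index of the closest point among those adjacent to the query in any projection. */
    int nearest(double[] query, int searchSize) {
        Set<Integer> candidates = new HashSet<>();
        for (int p = 0; p < directions.length; p++) {
            Entry probe = new Entry(dot(directions[p], query), -1);
            TreeSet<Entry> tree = trees.get(p);
            // Walk outward from the probe: searchSize entries below it and searchSize at or above it.
            take(tree.headSet(probe, false).descendingIterator(), searchSize, candidates);
            take(tree.tailSet(probe, true).iterator(), searchSize, candidates);
        }
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i : candidates) {
            double d = squaredDistance(data[i], query);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }

    private static void take(Iterator<Entry> it, int limit, Set<Integer> out) {
        for (int n = 0; n < limit && it.hasNext(); n++) {
            out.add(it.next().index);
        }
    }

    private static double dot(double[] a, double[] b) {
        double sum = 0;
        for (int j = 0; j < a.length; j++) {
            sum += a[j] * b[j];
        }
        return sum;
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int j = 0; j < a.length; j++) {
            sum += (a[j] - b[j]) * (a[j] - b[j]);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {1, 1}, {5, 5}, {9, 9}};
        ProjectionSearchSketch search = new ProjectionSearchSketch(data, 4, 42L);
        System.out.println(search.nearest(new double[] {5.2, 4.9}, 2));
    }
}
```

More projections and a wider window raise the chance that the true nearest neighbor lands in the candidate set, at the price of more exact distance computations per query.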

Page 13: New Directions for Mahout

How Many Projections?

Page 14: New Directions for Mahout

K-means Search

Simple Idea
– pre-cluster the data
– to find the nearest points, search the nearest clusters

Recursive application
– to search a cluster, use a Searcher!
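
The pre-cluster-then-search idea might look like the following minimal sketch. It assumes the centroids are already available (for example from a k-means run) and scans each selected cluster by brute force, where the talk suggests plugging in another Searcher recursively. All names are hypothetical and this is not Mahout's KmeansSearch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Nearest-neighbor search restricted to the clusters closest to the query (sketch only). */
final class ClusterSearchSketch {
    private final double[][] data;
    private final double[][] centroids;
    private final List<List<Integer>> members = new ArrayList<>();

    /** Assigns every data point to its nearest centroid; centroids are assumed to be precomputed. */
    ClusterSearchSketch(double[][] data, double[][] centroids) {
        this.data = data;
        this.centroids = centroids;
        for (int c = 0; c < centroids.length; c++) {
            members.add(new ArrayList<>());
        }
        for (int i = 0; i < data.length; i++) {
            members.get(nearestIndex(centroids, data[i])).add(i);
        }
    }

    /** Searches only the clustersToSearch clusters whose centroids are closest to the query. */
    int nearest(double[] query, int clustersToSearch) {
        Integer[] order = new Integer[centroids.length];
        for (int n = 0; n < order.length; n++) {
            order[n] = n;
        }
        Arrays.sort(order, Comparator.comparingDouble(idx -> squaredDistance(centroids[idx], query)));
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int rank = 0; rank < Math.min(clustersToSearch, order.length); rank++) {
            // Brute-force scan of one cluster; a recursive design would delegate to another searcher here.
            for (int i : members.get(order[rank])) {
                double d = squaredDistance(data[i], query);
                if (d < bestDistance) {
                    bestDistance = d;
                    best = i;
                }
            }
        }
        return best;
    }

    static int nearestIndex(double[][] points, double[] query) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < points.length; i++) {
            double d = squaredDistance(points[i], query);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int j = 0; j < a.length; j++) {
            sum += (a[j] - b[j]) * (a[j] - b[j]);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {1, 1}, {10, 10}, {11, 11}};
        double[][] centroids = {{0.5, 0.5}, {10.5, 10.5}};
        ClusterSearchSketch search = new ClusterSearchSketch(data, centroids);
        System.out.println(search.nearest(new double[] {9, 9}, 1));
    }
}
```

Searching more than one cluster guards against the case where the true nearest neighbor sits just across a cluster boundary.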

Page 20: New Directions for Mahout

But This Requires k-means!

Need a new k-means algorithm to get speed
– Hadoop is very slow at iterative map-reduce
– Maybe Pregel clones like Giraph would be better
– Or maybe not

Streaming k-means is
– One pass (through the original data)
– Very fast (20 µs per data point with threads)
– Very parallelizable

Page 21: New Directions for Mahout

How It Works

For each point
– Find approximately nearest centroid (distance = d)
– If d > threshold, new centroid
– Else possibly new cluster
– Else add to nearest centroid

If centroids > K ~ C log N
– Recursively cluster centroids with higher threshold

Result is large set of centroids
– these provide approximation of original distribution
– we can cluster centroids to get a close approximation of clustering original
– or we can just use the result directly
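
The steps above can be sketched in Java as follows. Several details are assumptions made for illustration and are not stated on the slide: "possibly new cluster" is modeled as a coin flip with probability d / threshold, the threshold grows by a fixed factor of 1.5 when there are too many centroids, and the nearest centroid is found by an exact linear scan rather than an approximate searcher. The class is hypothetical and is not Mahout's StreamingKmeans.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** One-pass streaming k-means that keeps a bounded set of weighted centroids (sketch only). */
final class StreamingKMeansSketch {

    /** A running weighted mean of the points merged into it. */
    static final class Centroid {
        final double[] mean;
        double weight;

        Centroid(double[] point, double weight) {
            this.mean = point.clone();
            this.weight = weight;
        }

        void merge(double[] point, double pointWeight) {
            weight += pointWeight;
            for (int j = 0; j < mean.length; j++) {
                mean[j] += (point[j] - mean[j]) * pointWeight / weight;
            }
        }
    }

    private static final double GROWTH = 1.5; // assumed growth factor, not from the talk

    private final int maxCentroids;
    private final Random random;
    private double threshold;
    private List<Centroid> centroids = new ArrayList<>();

    StreamingKMeansSketch(int maxCentroids, double initialThreshold, long seed) {
        if (maxCentroids < 1 || !(initialThreshold > 0)) {
            throw new IllegalArgumentException("need maxCentroids >= 1 and a positive threshold");
        }
        this.maxCentroids = maxCentroids;
        this.threshold = initialThreshold;
        this.random = new Random(seed);
    }

    /** Processes one point, then collapses the centroid set while it is too large. */
    void add(double[] point) {
        insert(centroids, point, 1.0);
        while (centroids.size() > maxCentroids) {
            // Too many centroids: raise the threshold and re-cluster the centroids themselves.
            threshold *= GROWTH;
            List<Centroid> previous = centroids;
            centroids = new ArrayList<>();
            for (Centroid c : previous) {
                insert(centroids, c.mean, c.weight);
            }
        }
    }

    List<Centroid> centroids() {
        return centroids;
    }

    private void insert(List<Centroid> target, double[] point, double weight) {
        Centroid nearest = null;
        double d = Double.POSITIVE_INFINITY;
        for (Centroid c : target) {
            double candidate = distance(c.mean, point);
            if (candidate < d) {
                d = candidate;
                nearest = c;
            }
        }
        // Far away: always a new centroid. Closer: a new centroid with probability d / threshold (assumed).
        if (nearest == null || d > threshold || random.nextDouble() < d / threshold) {
            target.add(new Centroid(point, weight));
        } else {
            nearest.merge(point, weight);
        }
    }

    private static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int j = 0; j < a.length; j++) {
            sum += (a[j] - b[j]) * (a[j] - b[j]);
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        StreamingKMeansSketch sketch = new StreamingKMeansSketch(20, 0.1, 1L);
        Random data = new Random(2);
        for (int i = 0; i < 1000; i++) {
            sketch.add(new double[] {data.nextGaussian(), data.nextGaussian()});
        }
        System.out.println(sketch.centroids().size() + " centroids");
    }
}
```

Because each centroid carries a weight, the resulting set can be fed to an ordinary weighted k-means to produce the final clusters, or used directly as a compact summary of the data.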

Page 22: New Directions for Mahout

Parallel Speedup?

Page 24: New Directions for Mahout

Warning, Recursive Descent

Inner loop requires finding nearest centroid

With lots of centroids, this is slow

But wait, we have classes to accelerate that!

(Let’s not use k-means searcher, though)

Page 25: New Directions for Mahout

Pig Vector

Page 27: New Directions for Mahout

What is it?

Supports Pig access to Mahout functions

So far text vectorization

And classification

And model saving

Kind of works (see pigML from Twitter for better function)

Page 28: New Directions for Mahout

Compile and Install

Start by compiling and installing mahout in your local repository:

    cd ~/Apache
    git clone https://github.com/apache/mahout.git
    cd mahout
    mvn install -DskipTests

Then do the same with pig-vector:

    cd ~/Apache
    git clone git@github.com:tdunning/pig-vector.git
    cd pig-vector
    mvn package

Page 29: New Directions for Mahout

Tokenize and Vectorize Text

Tokenization is done using a text encoder, which is defined with three arguments
– the dimension of the resulting vectors (typically 100,000-1,000,000)
– a description of the variables to be included in the encoding
– the schema of the tuples that pig will pass together with their data types

Example:

    define EncodeVector
      org.apache.mahout.pig.encoders.EncodeVector
      ('10','x+y+1', 'x:numeric, y:word, z:text');

You can also add a Lucene 3.1 analyzer in parentheses if you want something fancier
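
A fixed vector dimension chosen up front suggests a hashed encoding, where each variable or word is hashed to a slot instead of being looked up in a dictionary. The sketch below illustrates that general technique under this assumption; it is not the pig-vector or Mahout encoder, and every name in it (including the "intercept" key and the name:word hashing scheme) is made up for illustration.

```java
import java.util.Locale;

/** Hashed feature encoding into a fixed-size vector (sketch only, not the pig-vector encoder). */
final class HashedEncoderSketch {
    private final double[] vector;

    HashedEncoderSketch(int dimension) {
        this.vector = new double[dimension];
    }

    /** Maps an arbitrary string key to a slot in the vector; distinct keys may collide. */
    private int slot(String key) {
        return Math.floorMod(key.hashCode(), vector.length);
    }

    /** Corresponds to the "+1" offset term of a formula like x+y+1. */
    void addIntercept() {
        vector[slot("intercept")] += 1.0;
    }

    /** Numeric variable: the value itself is added at the slot for the variable name. */
    void addNumeric(String name, double value) {
        vector[slot(name)] += value;
    }

    /** Word-valued variable: one count at the slot for the (name, word) pair. */
    void addWord(String name, String word) {
        vector[slot(name + ":" + word)] += 1.0;
    }

    /** Text variable: split on whitespace and encode each token as a word. */
    void addText(String name, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                addWord(name, token);
            }
        }
    }

    double[] vector() {
        return vector.clone();
    }

    public static void main(String[] args) {
        HashedEncoderSketch encoder = new HashedEncoderSketch(10);
        encoder.addIntercept();
        encoder.addNumeric("x", 2.5);
        encoder.addWord("y", "red");
        System.out.println(java.util.Arrays.toString(encoder.vector()));
    }
}
```

With a dimension as small as 10, as in the example definition, collisions between features are likely; the typical 100,000 to 1,000,000 range keeps them rare.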

Page 31: New Directions for Mahout

The Formula

Not normal arithmetic

Describes which variables to use, whether offset is included

Also describes which interactions to use
– but that doesn’t do anything yet!

Page 32: New Directions for Mahout

Load and Encode Data

Load the data

    a = load '/Users/tdunning/Downloads/NNBench.csv' using PigStorage(',')
        as (x1:int, x2:int, x3:int);

And encode it

    b = foreach a generate 1 as key, EncodeVector(*) as v;

Note that the true meaning of * is very subtle

Now store it

    store b into 'vectors.dat' using com.twitter.elephantbird.pig.store.SequenceFileStorage (
      '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
      '-c com.twitter.elephantbird.pig.util.GenericWritableConverter -t org.apache.mahout.math.VectorWritable');

Page 33: New Directions for Mahout

Train a Model

Pass previously encoded data to a sequential model trainer

    define train org.apache.mahout.pig.LogisticRegression('iterations=5, inMemory=true, features=100000, categories=alt.atheism comp.sys.mac.hardware rec.motorcycles sci.electronics talk.politics.guns comp.graphics comp.windows.x rec.sport.baseball sci.med talk.politics.mideast comp.os.ms-windows.misc misc.forsale rec.sport.hockey sci.space talk.politics.misc comp.sys.ibm.pc.hardware rec.autos sci.crypt soc.religion.christian talk.religion.misc');

Note that the argument is a string with its own syntax

Page 34: New Directions for Mahout

Reservations and Qualms

Pig-vector isn’t done

And it is ugly

And it doesn’t quite work

And it is hard to build

But there seems to be promise

Page 35: New Directions for Mahout

Potential

Add Naïve Bayes Model?

Somehow simplify the syntax?

Try a recent version of elephant-bird?

Switch to pigML?

Page 36: New Directions for Mahout

Contact:
– [email protected]
– @ted_dunning

Slides and such:
– http://info.mapr.com/ted-bbuzz-2012

Hash tags: #bbuzz #mahout