MAHOUT classifier tour


Transcript of MAHOUT classifier tour

Page 1: MAHOUT classifier tour

Mahout


Page 2: MAHOUT classifier tour

Mahout: Scalable Data Mining for Everybody


Page 3: MAHOUT classifier tour

What is Mahout?

• Recommendations (people who x this also x that)

• Clustering (segment data into groups of similar things)

• Classification (learn decision making from examples)

• Stuff (LDA, SVD, frequent item-set, math)


Page 5: MAHOUT classifier tour

Classification in Detail

• Naive Bayes Family

  • Hadoop-based training

• Decision Forests

  • Hadoop-based training

• Logistic Regression (aka SGD)

  • fast on-line (sequential) training



Page 11: MAHOUT classifier tour

So What?

Online training has low overhead for small and moderate size data-sets

[Chart annotation: “big” starts here]

Page 12: MAHOUT classifier tour

An Example



Page 19: MAHOUT classifier tour

And Another

From: Dr. Paul Acquah
Dear Sir,
Re: Proposal for over-invoice Contract Benevolence

Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit. I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 eur (thirty-three million one hundred thousand euros, only) into a foreign company's bank account for our favor....


Page 20: MAHOUT classifier tour

And Another

Date: Thu, May 20, 2010 at 10:51 AM
From: George <[email protected]>

Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?



Page 23: MAHOUT classifier tour

Mahout’s SGD

• Learns on-line per example

• O(1) memory

• O(1) time per training example

• Sequential implementation

  • fast, but not parallel
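A minimal sketch of driving this on-line learner, assuming Mahout's OnlineLogisticRegression API from org.apache.mahout.classifier.sgd (the feature vectors, labels, and sizes below are placeholders, not part of the talk):

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class SgdSketch {
      public static void main(String[] args) {
        int numFeatures = 1000;                      // size of the (hashed) feature space
        OnlineLogisticRegression learner =
            new OnlineLogisticRegression(2, numFeatures, new L1());  // 2 classes, L1 prior

        // One pass over the data: each example updates the model immediately,
        // so memory and time per training example stay O(1).
        for (int i = 0; i < 10000; i++) {
          Vector features = new RandomAccessSparseVector(numFeatures);
          features.setQuick(i % numFeatures, 1.0);   // placeholder features
          learner.train(i % 2, features);            // placeholder label: 0 or 1
        }

        // Score a new example: probability of class 1 for a two-class model.
        Vector query = new RandomAccessSparseVector(numFeatures);
        query.setQuick(7, 1.0);
        System.out.println("p(class = 1) = " + learner.classifyScalar(query));
      }
    }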

Page 24: MAHOUT classifier tour

Special Features

• Hashed feature encoding

• Per-term annealing

  • learn the boring stuff once

• Auto-magical learning knob turning

  • learns correct learning rate, learns correct learning rate for learning learning rate, ...

Page 25: MAHOUT classifier tour

Feature Encoding



Page 27: MAHOUT classifier tour

Hashed Encoding


Page 28: MAHOUT classifier tour

Feature Collisions

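A sketch of hashed feature encoding, assuming Mahout's encoder classes in org.apache.mahout.vectorizer.encoders (the field names, text, and vector size here are illustrative). Every feature is hashed into a fixed-size vector, so new words never grow the model; occasional collisions, where two features land on the same slot, are simply tolerated by the learner:

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class HashedEncodingSketch {
      public static void main(String[] args) {
        int numFeatures = 1000;                      // fixed vector size; never grows
        Vector v = new RandomAccessSparseVector(numFeatures);

        // Each encoder hashes (field name, value) pairs to positions in v.
        StaticWordValueEncoder subject = new StaticWordValueEncoder("subject");
        StaticWordValueEncoder body = new StaticWordValueEncoder("body");
        ConstantValueEncoder intercept = new ConstantValueEncoder("intercept");

        intercept.addToVector("1", v);               // bias term
        for (String word : "proposal for over-invoice contract".split(" ")) {
          subject.addToVector(word, v);
        }
        for (String word : "pleasure talking to you at the hadoop user group".split(" ")) {
          body.addToVector(word, v);
        }

        System.out.println("non-zero entries: " + v.getNumNondefaultElements());
      }
    }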

Page 29: MAHOUT classifier tour

Learning Rate Annealing

[Chart: learning rate vs. # training examples seen]

Page 30: MAHOUT classifier tour

Per-term Annealing

[Chart: learning rate vs. # training examples seen, shown for a common feature and a rare feature]
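A rough sketch of the per-term annealing idea, not Mahout's exact schedule: each feature keeps its own update count, so a common feature's learning rate decays quickly once it has been seen many times, while a rare feature keeps learning at close to the initial rate (the constants below are illustrative):

    /** Illustrative per-term annealing for a sparse SGD update. */
    public class PerTermAnnealingSketch {
      private final double[] weights;
      private final int[] updates;           // per-feature update counts
      private final double mu0 = 0.1;        // initial learning rate (illustrative)
      private final double exponent = 0.5;   // decay exponent (illustrative)

      public PerTermAnnealingSketch(int numFeatures) {
        weights = new double[numFeatures];
        updates = new int[numFeatures];
      }

      /** One SGD step for an example whose active (binary) features are listed. */
      public void train(int[] activeFeatures, double error) {
        for (int j : activeFeatures) {
          // Rate shrinks with how often *this* feature has been updated.
          double rate = mu0 / Math.pow(1 + updates[j], exponent);
          weights[j] += rate * error;
          updates[j]++;                      // common features anneal fast, rare ones slowly
        }
      }
    }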

Page 33: MAHOUT classifier tour

General Structure

• OnlineLogisticRegression

  • Traditional logistic regression

  • Stochastic Gradient Descent

  • Per-term annealing

  • Too fast (for the disk + encoder)

Page 34: MAHOUT classifier tour

Next Level

• CrossFoldLearner

  • contains multiple primitive learners

  • online cross validation

  • 5x more work

Page 35: MAHOUT classifier tour

And again

• AdaptiveLogisticRegression

  • 20 x CrossFoldLearner

  • evolves good learning and regularization rates

  • 100 x more work than basic learner

  • still faster than disk + encoding
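A sketch of how this adaptive layer is typically driven, assuming Mahout's AdaptiveLogisticRegression API (the getBest().getPayload().getLearner() chain follows the Mahout 0.x SGD package; the data below is a placeholder):

    import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
    import org.apache.mahout.classifier.sgd.CrossFoldLearner;
    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class AdaptiveSketch {
      public static void main(String[] args) {
        int numFeatures = 1000;
        // Runs a population of CrossFoldLearners and evolves learning/regularization rates.
        AdaptiveLogisticRegression learner =
            new AdaptiveLogisticRegression(2, numFeatures, new L1());

        for (int i = 0; i < 100000; i++) {
          Vector v = new RandomAccessSparseVector(numFeatures);
          v.setQuick(i % numFeatures, 1.0);          // placeholder features
          learner.train(i % 2, v);                   // placeholder label
        }
        learner.close();                             // flush pending training

        // The best surviving CrossFoldLearner carries an on-line AUC estimate
        // from its held-out folds.
        CrossFoldLearner best = learner.getBest().getPayload().getLearner();
        System.out.println("held-out AUC estimate: " + best.auc());
      }
    }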

Page 36: MAHOUT classifier tour

A comparison

• Traditional view

  • 400 x (read + OLR)

• Revised Mahout view

  • 1 x (read + mu x 100 x OLR) x eta

  • mu = efficiency from killing losers early

  • eta = efficiency from stopping early
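For illustration only, with made-up values of the efficiency factors: if killing losers early gives mu ≈ 0.1 and stopping early gives eta ≈ 0.5, the revised view costs roughly 0.5 x (read + 10 x OLR), against 400 x (read + OLR) for a traditional sweep that re-reads the data for every parameter setting; since reading and encoding usually dominate a single OLR update, the single-pass adaptive approach comes out far ahead.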

Page 37: MAHOUT classifier tour

Deployment

• Training

  • ModelSerializer.writeBinary(..., model)

• Deployment

  • m = ModelSerializer.readBinary(...)

  • r = m.classifyScalar(featureVector)
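A sketch of the train-then-deploy round trip following the calls named on the slide, assuming Mahout's ModelSerializer (exact overloads vary between Mahout versions; the file path and vectors below are placeholders):

    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.ModelSerializer;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class DeploySketch {
      public static void main(String[] args) throws Exception {
        // Training side: persist the trained model to disk.
        OnlineLogisticRegression model = new OnlineLogisticRegression(2, 1000, new L1());
        // ... train as in the earlier sketch ...
        ModelSerializer.writeBinary("/tmp/model.bin", model);

        // Serving side: load the model and score incoming feature vectors.
        try (InputStream in = new FileInputStream("/tmp/model.bin")) {
          OnlineLogisticRegression m =
              ModelSerializer.readBinary(in, OnlineLogisticRegression.class);
          Vector featureVector = new RandomAccessSparseVector(1000);
          featureVector.setQuick(7, 1.0);             // placeholder features
          double r = m.classifyScalar(featureVector); // probability of class 1
          System.out.println("score = " + r);
        }
      }
    }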

Page 38: MAHOUT classifier tour

The Upshot

• One machine can go fast

  • SITM trains on 2 billion examples in 3 hours

• Deployability pays off big

  • simple sample server farm