MAHOUT classifier tour
Uploaded by ted-dunning
Mahout: Scalable Data Mining for Everybody
Wednesday, March 16, 2011
What is Mahout?
• Recommendations (people who x this also x that)
• Clustering (segment data into groups of)
• Classification (learn decision making from examples)
• Stuff (LDA, SVD, frequent item-set, math)
Classification in Detail
• Naive Bayes Family
  • Hadoop-based training
• Decision Forests
  • Hadoop-based training
• Logistic Regression (aka SGD)
  • fast on-line (sequential) training
So What?
Online training has low overhead for small and moderate-sized data sets
[Chart annotation: big starts here]
An Example
And Another
From: Dr. Paul Acquah
Dear Sir,
Re: Proposal for over-invoice Contract Benevolence
Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit. I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 eur thirty-three million one hundred thousand euros, only) into a foreign company's bank account for our favor....
And Another
Date: Thu, May 20, 2010 at 10:51 AM
From: George <[email protected]>
Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?
Mahout’s SGD
• Learns on-line per example
  • O(1) memory
  • O(1) time per training example
• Sequential implementation
  • fast, but not parallel
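To make the per-example cost concrete, here is a minimal sketch of online logistic regression trained by SGD. It is plain Java and not Mahout's implementation; the class name `OnlineSgdSketch`, the constant learning rate, and the synthetic data are assumptions made for illustration. The model is a fixed-size weight array, so memory does not grow with the number of examples seen, and each update touches each weight once.

```java
import java.util.Random;

/** Minimal online logistic regression trained by SGD, one example at a time. */
final class OnlineSgdSketch {
    private final double[] weights;    // fixed-size model: does not grow with examples seen
    private final double learningRate; // hypothetical constant step size

    OnlineSgdSketch(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures];
        this.learningRate = learningRate;
    }

    /** Estimated probability that the example belongs to class 1. */
    double classify(double[] x) {
        double dot = 0.0;
        for (int i = 0; i < weights.length; i++) {
            dot += weights[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-dot));
    }

    /** One SGD step on a single example; the example itself is not stored. */
    void train(int label, double[] x) {
        double error = label - classify(x);
        for (int i = 0; i < weights.length; i++) {
            weights[i] += learningRate * error * x[i];
        }
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        OnlineSgdSketch model = new OnlineSgdSketch(3, 0.1);
        // Synthetic stream: label is 1 when x1 > x2; the first feature is a bias term.
        for (int n = 0; n < 20000; n++) {
            double x1 = rng.nextDouble();
            double x2 = rng.nextDouble();
            model.train(x1 > x2 ? 1 : 0, new double[] {1.0, x1, x2});
        }
        System.out.println(model.classify(new double[] {1.0, 0.9, 0.1}));
        System.out.println(model.classify(new double[] {1.0, 0.1, 0.9}));
    }
}
```

The loop in `main` is strictly sequential: each update depends on the weights left by the previous one, which is why this style of training is fast on one machine but does not parallelize trivially.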
Special Features
• Hashed feature encoding
• Per-term annealing
  • learn the boring stuff once
• Auto-magical learning knob turning
  • learns correct learning rate, learns correct learning rate for learning learning rate, ...
Feature Encoding
Hashed Encoding
Feature Collisions
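The slide images are not preserved here, so the following is a rough sketch of the general idea of hashed feature encoding rather than Mahout's encoder classes. Each feature is hashed into a fixed-size vector, and two different features may collide on the same slot. The class name `HashedEncoderSketch`, the use of CRC32 as the hash, and the `probes` parameter (writing each feature to more than one slot so that a single collision does not erase it) are all assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;

/** Sketch of hashed feature encoding into a fixed-size vector. */
final class HashedEncoderSketch {
    private final int numFeatures; // vector size, chosen up front and never grown
    private final int probes;      // hypothetical: number of slots touched per feature

    HashedEncoderSketch(int numFeatures, int probes) {
        this.numFeatures = numFeatures;
        this.probes = probes;
    }

    /** Slot for a feature and probe number; CRC32 is only a stand-in hash. */
    int slot(String name, String value, int probe) {
        CRC32 crc = new CRC32();
        crc.update((probe + ":" + name + "=" + value).getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % numFeatures);
    }

    /** Adds the weight of one (name, value) feature at each of its probe slots. */
    void addToVector(String name, String value, double weight, double[] vector) {
        for (int p = 0; p < probes; p++) {
            vector[slot(name, value, p)] += weight;
        }
    }

    public static void main(String[] args) {
        HashedEncoderSketch encoder = new HashedEncoderSketch(20, 2);
        double[] vector = new double[20];
        // No dictionary is built: words go straight to slots, and collisions simply add up.
        for (String word : "lunch tomorrow at noon".split(" ")) {
            encoder.addToVector("body", word, 1.0, vector);
        }
        System.out.println(Arrays.toString(vector));
    }
}
```

The design trade-off is that the vector size is fixed before any data is seen, at the price of occasional collisions between unrelated features.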
Learning Rate Annealing
[Plot: learning rate vs. # training examples seen]
Per-term Annealing
[Plot: learning rate vs. # training examples seen; curves labeled Common Feature and Rare Feature]
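As a rough illustration of per-term annealing, the sketch below keeps one update counter per feature and lowers that feature's learning rate as its own counter grows. A common feature is therefore annealed quickly, while a rare feature keeps a high rate until it has actually been seen. The class name `PerTermAnnealingSketch` and the `1 / sqrt(1 + count)` schedule are assumptions chosen for illustration; the slides do not give Mahout's exact schedule.

```java
/** Sketch of per-term learning-rate annealing with one counter per feature. */
final class PerTermAnnealingSketch {
    private final double baseRate;    // rate used for a feature that has never been updated
    private final int[] updateCounts; // how many training examples contained each feature

    PerTermAnnealingSketch(int numFeatures, double baseRate) {
        this.baseRate = baseRate;
        this.updateCounts = new int[numFeatures];
    }

    /** Learning rate for one feature; the decay schedule is a hypothetical choice. */
    double rateFor(int feature) {
        return baseRate / Math.sqrt(1.0 + updateCounts[feature]);
    }

    /** Records that a training example contained these (non-zero) features. */
    void recordUpdate(int[] activeFeatures) {
        for (int feature : activeFeatures) {
            updateCounts[feature]++;
        }
    }

    public static void main(String[] args) {
        PerTermAnnealingSketch schedule = new PerTermAnnealingSketch(2, 1.0);
        // Feature 0 occurs in every example, feature 1 only in every 100th.
        for (int n = 0; n < 1000; n++) {
            schedule.recordUpdate(n % 100 == 0 ? new int[] {0, 1} : new int[] {0});
        }
        System.out.println("common feature rate: " + schedule.rateFor(0));
        System.out.println("rare feature rate:   " + schedule.rateFor(1));
    }
}
```

In a full learner, `rateFor(j)` would scale the gradient step applied to weight `j`, in place of a single global rate.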
General Structure
• OnlineLogisticRegression
  • Traditional logistic regression
  • Stochastic Gradient Descent
  • Per-term annealing
  • Too fast (for the disk + encoder)
Next Level
• CrossFoldLearner
  • contains multiple primitive learners
  • online cross validation
  • 5x more work
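One way to picture online cross validation is sketched below, assuming five folds to match the "5x more work" figure. Each incoming example is held out from exactly one primitive learner, which scores it before any learner has trained on it, while the other learners train on it. This is plain Java written for illustration, not Mahout's `CrossFoldLearner`; the nested `Learner` class, the round-robin hold-out rule, and the accuracy metric are assumptions.

```java
import java.util.Random;

/** Sketch of online cross-validation over several simple learners. */
final class CrossFoldSketch {
    /** Tiny logistic learner used as the primitive model (hypothetical stand-in). */
    static final class Learner {
        private final double[] w;
        Learner(int numFeatures) { w = new double[numFeatures]; }
        double classify(double[] x) {
            double dot = 0.0;
            for (int i = 0; i < w.length; i++) dot += w[i] * x[i];
            return 1.0 / (1.0 + Math.exp(-dot));
        }
        void train(int label, double[] x) {
            double error = label - classify(x);
            for (int i = 0; i < w.length; i++) w[i] += 0.1 * error * x[i];
        }
    }

    private final Learner[] folds;
    private long seen;    // examples processed so far
    private long correct; // held-out predictions that matched the label

    CrossFoldSketch(int numFolds, int numFeatures) {
        folds = new Learner[numFolds];
        for (int k = 0; k < numFolds; k++) folds[k] = new Learner(numFeatures);
    }

    /** Scores the example on one held-out learner and trains all the others on it. */
    void train(int label, double[] x) {
        int holdOut = (int) (seen % folds.length);
        for (int k = 0; k < folds.length; k++) {
            if (k == holdOut) {
                // This learner never trains on x, so its prediction is an out-of-sample test.
                int predicted = folds[k].classify(x) >= 0.5 ? 1 : 0;
                if (predicted == label) correct++;
            } else {
                folds[k].train(label, x);
            }
        }
        seen++;
    }

    long heldOutCount() { return seen; } // every example is scored exactly once

    double heldOutAccuracy() { return seen == 0 ? 0.0 : (double) correct / seen; }

    public static void main(String[] args) {
        Random rng = new Random(1);
        CrossFoldSketch cv = new CrossFoldSketch(5, 3);
        for (int n = 0; n < 5000; n++) {
            double x1 = rng.nextDouble();
            double x2 = rng.nextDouble();
            cv.train(x1 > x2 ? 1 : 0, new double[] {1.0, x1, x2});
        }
        System.out.println("held-out accuracy: " + cv.heldOutAccuracy());
    }
}
```

The held-out estimate is available at any point during training, without a separate evaluation pass over the data.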
And again
• AdaptiveLogisticRegression
  • 20 x CrossFoldLearner
  • evolves good learning and regularization rates
  • 100 x more work than basic learner
  • still faster than disk + encoding
A comparison
• Traditional view
  • 400 x (read + OLR)
• Revised Mahout view
  • 1 x (read + mu x 100 x OLR) x eta
  • mu = efficiency from killing losers early
  • eta = efficiency from stopping early
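The two views can be written as cost formulas. The symbols $R$ (cost of one read-and-encode pass over the data) and $C$ (cost of one OLR training pass) are notation introduced here, not taken from the slides, and the numbers in the last two lines are purely hypothetical: they assume $R = C = 1$ and no savings from either efficiency factor ($\mu = \eta = 1$).

```latex
\begin{aligned}
T_{\text{traditional}} &= 400\,(R + C) \\
T_{\text{Mahout}}      &= \eta\,(R + 100\,\mu\,C) \\[4pt]
\text{with } R = C = 1,\ \mu = \eta = 1:\quad
T_{\text{traditional}} &= 400 \cdot 2 = 800 \\
T_{\text{Mahout}}      &= 1 \cdot (1 + 100) = 101
\end{aligned}
```

Under these assumed numbers the data is read once instead of 400 times, and any value of $\mu$ or $\eta$ below 1 reduces the revised cost further.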
Deployment
• Training
  • ModelSerializer.writeBinary(..., model)
• Deployment
  • m = ModelSerializer.readBinary(...)
  • r = m.classifyScalar(featureVector)
The Upshot
• One machine can go fast
  • SITM trains on 2 billion examples in 3 hours
• Deployability pays off big
  • simple sample server farm