San Francisco Hadoop User Group Meetup Deep Learning
-
Upload
sri-ambati -
Category
Software
-
view
930 -
download
2
Transcript of San Francisco Hadoop User Group Meetup Deep Learning
Distributed Deep Learning for Classification and Regression
Problems using H2O
Hadoop User Group Meetup San Francisco 121014
Arno Candel H2Oai
Who am IPhD in Computational Physics 2005
from ETH Zurich Switzerland
6 years at SLAC - Accelerator Physics Modeling 2 years at Skytree - Machine Learning
1 year at H2Oai - Machine Learning
15 years in Supercomputing amp Modeling
Named ldquo2014 Big Data All-Starrdquo by Fortune Magazine
ArnoCandel
H2O Deep Learning ArnoCandel
OutlineIntro (5 mins)
Methods amp Implementation (10 mins)
Results amp Live Demos (20 mins)
Higgs boson classification
MNIST handwritten digits
text classification
3
H2O Deep Learning ArnoCandel
Teamwork at H2OaiJava Apache v2 Open-Source
1 Java Machine Learning in Github Join the community
4
H2O Deep Learning ArnoCandel
H2O Open-Source (Apache v2) Predictive Analytics Platform
5
H2O Deep Learning ArnoCandel 6
H2O Architecture - Designed for speed scale accuracy amp ease of use
Key technical points bull distributed JVMs + REST API bull no Java GC issues
(data in byte[] Double) bull loss-less number compression bull Hadoop integration (v1YARN) bull R package (CRAN)
Pre-built fully featured algos K-Means NB PCA CoxPH GLM RF GBM DeepLearning
H2O Deep Learning ArnoCandel
WikipediaDeep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using
architectures composed of multiple non-linear transformations
What is Deep Learning
InputImage
Output User ID
7
Example Facebook DeepFace
H2O Deep Learning ArnoCandel
What is NOT DeepLinear models are not deep (by definition)
Neural nets with 1 hidden layer are not deep (only 1 layer - no feature hierarchy)
SVMs and Kernel methods are not deep (2 layers kernel + linear)
Classification trees are not deep (operate on original input space no new features generated)
8
H2O Deep Learning ArnoCandel
1970s multi-layer feed-forward Neural Network (stochastic gradient descent with back-propagation) + distributed processing for big data (fine-grain in-memory MapReduce on distributed data) + multi-threaded speedup (async forkjoin worker threads operate at FORTRAN speeds) + smart algorithms for fast amp accurate results (automatic standardization one-hot encoding of categoricals missing value imputation weight amp bias initialization adaptive learning rate momentum dropoutl1L2 regularization grid search N-fold cross-validation checkpointing load balancing auto-tuning model averaging etc)
= powerful tool for (un)supervised machine learning on real-world data
H2O Deep Learning9
all 320 cores maxed out
H2O Deep Learning ArnoCandel
ldquofully connectedrdquo directed graph of neurons
age
income
employment
married
single
Input layerHidden layer 1
Hidden layer 2
Output layer
3x4 4x3 3x2connections
information flow
inputoutput neuronhidden neuron
4 3 2neurons 3
Example Neural Network10
H2O Deep Learning ArnoCandel
age
income
employmentyj = tanh(sumi(xiuij)+bj)
uij
xi
yj
per-class probabilities sum(pl) = 1
zk = tanh(sumj(yjvjk)+ck)
vjk
zk pl
pl = softmax(sumk(zkwkl)+dl)
wkl
softmax(xk) = exp(xk) sumk(exp(xk))
ldquoneurons activate each other via weighted sumsrdquo
Prediction Forward Propagation
activation function tanh alternative
x -gt max(0x) ldquorectifierrdquo
pl is a non-linear function of xi can approximate ANY function
with enough layers
bj ck dl bias values(indep of inputs)
11
married
single
H2O Deep Learning ArnoCandel
age
income
employment
xi
Automatic standardization of data xi mean = 0 stddev = 1
horizontalize categorical variables eg
full-time part-time none self-employed -gt
010 = part-time 000 = self-employed
Automatic initialization of weights
Poor manrsquos initialization random weights wkl
Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))
Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)
12
married
single
wkl
H2O Deep Learning ArnoCandel
Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo
Training Update Weights amp Biases
Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)
For each training row we make a prediction and compare with the actual label (supervised learning)
married108predicted actual
Objective minimize prediction error (MSE or cross-entropy)
w ltmdash w - rate partEpartw
1
13
single002
E
wrate
H2O Deep Learning ArnoCandel
Backward Propagation
partEpartwi = partEparty partypartnet partnetpartwi
= part(error(y))party part(activation(net))partnet xi
Backprop Compute partEpartwi via chain rule going backwards
wi
net = sumi(wixi) + b
xiE = error(y)
y = activation(net)
How to compute partEpartwi for wi ltmdash wi - rate partEpartwi
Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow
14
H2O Deep Learning ArnoCandel
H2O Deep Learning Architecture
K-V
K-V
HTTPD
HTTPD
nodesJVMs sync
threads async
communication
w
w w
w w w w
w1 w3 w2w4
w2+w4w1+w3
w = (w1+w2+w3+w4)4
map each node trains a copy of the weights
and biases with (some or all of) its
local data with asynchronous FJ
threads
initial model weights and biases w
updated model w
H2O atomic in-memoryK-V store
reduce model averaging
average weights and biases from all nodes
speedup is at least nodeslog(rows) arxiv12094129v3
Keep iterating over the data (ldquoepochsrdquo) score from time to time
Query amp display the model via
JSON WWW
2
2 431
1
1
1
43 2
1 2
1
i
auto-tuned (default) or user-specified number of points per MapReduce iteration
15
H2O Deep Learning ArnoCandel
Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history
Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)
RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs
16
ldquoSecretrdquo Sauce to Higher Accuracy
H2O Deep Learning ArnoCandel
Detail Adaptive Learning Rate
Compute moving average of ∆wi2 at time t for window length rho
E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2
Compute RMS of ∆wi at time t with smoothing epsilon
RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )
Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)
Adaptive acceleration momentum accumulate previous weight updates but over a window of time
RMS[∆wi]t-1
RMS[partEpartwi]t
rate(wi t) =
Do the same for partEpartwi then obtain per-weight learning rate
cf ADADELTA paper
17
H2O Deep Learning ArnoCandel
Detail Dropout Regularization18
Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations
age
income
employment
married
singleX
X
X
Testing Use all activations but reduce them by a factor p
(to ldquosimulaterdquo the missing activations during training)
cf Geoff Hintons paper
H2O Deep Learning ArnoCandel 19
Application Higgs Boson Classification
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
httparxivorgpdf14024735v2pdf
Images courtesy CERN LHC
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features (physics formulae) Train 10M rows Valid 500k Test 500k rows
H2O Deep Learning ArnoCandel 20
Live Demo Letrsquos see what Deep Learning can do with low-level features alone
Former baseline for AUC 0733 and 0816
H2O Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
H2O Deep Learning
add derived
features
Higgs Derived features are important
H2O Deep Learning ArnoCandel
MNIST digits classification
Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)
21
Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes
MNIST = Digitized handwritten digits database (Yann LeCun)
Data 28x28=784 pixels with (gray-scale) values in 0hellip255
Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo
H2O Deep Learning ArnoCandel 22
H2O Deep Learning beats MNIST
Standard 60k10k data No distortions
No convolutions No unsupervised training
No ensemble
10 hours on 10 16-core nodes
World-record 083 test set error
httplearnh2oaicontenthands-on_trainingdeep_learninghtml
H2O Deep Learning ArnoCandel
POJO Model Export for Production Scoring
23
Plain old Java code is auto-generated to take your H2O Deep Learning models into production
H2O Deep Learning ArnoCandel
Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)
24
Speedup
000
1000
2000
3000
4000
1 2 4 8 16 32 63
H2O Nodes
(4 cores per node 1 epoch per node per MapReduce)
27 mins
Training Time
0
25
50
75
100
1 2 4 8 16 32 63
H2O Nodes
in minutes
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
25
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Bag of words vector 0010000010001hellip0
vintagegold condition
Text Classification
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
26
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
MNIST Unsupervised Anomaly Detection with Deep Learning (Autoencoder)
27
The good The bad The ugly
smallmedianlarge reconstruction errorhttplearnh2oaicontenthands-on_traininganomaly_detectionhtml
H2O Deep Learning ArnoCandel 28
How well did Deep Learning do
Letrsquos see how H2O did in the past 10 minutes
Higgs Live Demo (Continued)
ltyour guessgt
reference paper results
Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN (H2O)
H2O Deep Learning ArnoCandel
H2O Steam Scoring Platform
29
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
httpserverportsteamindexhtml
Live Demo
H2O Deep Learning ArnoCandel 30
Live Demo on 10-node cluster lt10 minutes runtime for all H2O algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
Who am IPhD in Computational Physics 2005
from ETH Zurich Switzerland
6 years at SLAC - Accelerator Physics Modeling 2 years at Skytree - Machine Learning
1 year at H2Oai - Machine Learning
15 years in Supercomputing amp Modeling
Named ldquo2014 Big Data All-Starrdquo by Fortune Magazine
ArnoCandel
H2O Deep Learning ArnoCandel
OutlineIntro (5 mins)
Methods amp Implementation (10 mins)
Results amp Live Demos (20 mins)
Higgs boson classification
MNIST handwritten digits
text classification
3
H2O Deep Learning ArnoCandel
Teamwork at H2OaiJava Apache v2 Open-Source
1 Java Machine Learning in Github Join the community
4
H2O Deep Learning ArnoCandel
H2O Open-Source (Apache v2) Predictive Analytics Platform
5
H2O Deep Learning ArnoCandel 6
H2O Architecture - Designed for speed scale accuracy amp ease of use
Key technical points bull distributed JVMs + REST API bull no Java GC issues
(data in byte[] Double) bull loss-less number compression bull Hadoop integration (v1YARN) bull R package (CRAN)
Pre-built fully featured algos K-Means NB PCA CoxPH GLM RF GBM DeepLearning
H2O Deep Learning ArnoCandel
WikipediaDeep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using
architectures composed of multiple non-linear transformations
What is Deep Learning
InputImage
Output User ID
7
Example Facebook DeepFace
H2O Deep Learning ArnoCandel
What is NOT DeepLinear models are not deep (by definition)
Neural nets with 1 hidden layer are not deep (only 1 layer - no feature hierarchy)
SVMs and Kernel methods are not deep (2 layers kernel + linear)
Classification trees are not deep (operate on original input space no new features generated)
8
H2O Deep Learning ArnoCandel
1970s multi-layer feed-forward Neural Network (stochastic gradient descent with back-propagation) + distributed processing for big data (fine-grain in-memory MapReduce on distributed data) + multi-threaded speedup (async forkjoin worker threads operate at FORTRAN speeds) + smart algorithms for fast amp accurate results (automatic standardization one-hot encoding of categoricals missing value imputation weight amp bias initialization adaptive learning rate momentum dropoutl1L2 regularization grid search N-fold cross-validation checkpointing load balancing auto-tuning model averaging etc)
= powerful tool for (un)supervised machine learning on real-world data
H2O Deep Learning9
all 320 cores maxed out
H2O Deep Learning ArnoCandel
ldquofully connectedrdquo directed graph of neurons
age
income
employment
married
single
Input layerHidden layer 1
Hidden layer 2
Output layer
3x4 4x3 3x2connections
information flow
inputoutput neuronhidden neuron
4 3 2neurons 3
Example Neural Network10
H2O Deep Learning ArnoCandel
age
income
employmentyj = tanh(sumi(xiuij)+bj)
uij
xi
yj
per-class probabilities sum(pl) = 1
zk = tanh(sumj(yjvjk)+ck)
vjk
zk pl
pl = softmax(sumk(zkwkl)+dl)
wkl
softmax(xk) = exp(xk) sumk(exp(xk))
ldquoneurons activate each other via weighted sumsrdquo
Prediction Forward Propagation
activation function tanh alternative
x -gt max(0x) ldquorectifierrdquo
pl is a non-linear function of xi can approximate ANY function
with enough layers
bj ck dl bias values(indep of inputs)
11
married
single
H2O Deep Learning ArnoCandel
age
income
employment
xi
Automatic standardization of data xi mean = 0 stddev = 1
horizontalize categorical variables eg
full-time part-time none self-employed -gt
010 = part-time 000 = self-employed
Automatic initialization of weights
Poor manrsquos initialization random weights wkl
Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))
Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)
12
married
single
wkl
H2O Deep Learning ArnoCandel
Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo
Training Update Weights amp Biases
Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)
For each training row we make a prediction and compare with the actual label (supervised learning)
married108predicted actual
Objective minimize prediction error (MSE or cross-entropy)
w ltmdash w - rate partEpartw
1
13
single002
E
wrate
H2O Deep Learning ArnoCandel
Backward Propagation
partEpartwi = partEparty partypartnet partnetpartwi
= part(error(y))party part(activation(net))partnet xi
Backprop Compute partEpartwi via chain rule going backwards
wi
net = sumi(wixi) + b
xiE = error(y)
y = activation(net)
How to compute partEpartwi for wi ltmdash wi - rate partEpartwi
Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow
14
H2O Deep Learning ArnoCandel
H2O Deep Learning Architecture
K-V
K-V
HTTPD
HTTPD
nodesJVMs sync
threads async
communication
w
w w
w w w w
w1 w3 w2w4
w2+w4w1+w3
w = (w1+w2+w3+w4)4
map each node trains a copy of the weights
and biases with (some or all of) its
local data with asynchronous FJ
threads
initial model weights and biases w
updated model w
H2O atomic in-memoryK-V store
reduce model averaging
average weights and biases from all nodes
speedup is at least nodeslog(rows) arxiv12094129v3
Keep iterating over the data (ldquoepochsrdquo) score from time to time
Query amp display the model via
JSON WWW
2
2 431
1
1
1
43 2
1 2
1
i
auto-tuned (default) or user-specified number of points per MapReduce iteration
15
H2O Deep Learning ArnoCandel
Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history
Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)
RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs
16
ldquoSecretrdquo Sauce to Higher Accuracy
H2O Deep Learning ArnoCandel
Detail Adaptive Learning Rate
Compute moving average of ∆wi2 at time t for window length rho
E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2
Compute RMS of ∆wi at time t with smoothing epsilon
RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )
Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)
Adaptive acceleration momentum accumulate previous weight updates but over a window of time
RMS[∆wi]t-1
RMS[partEpartwi]t
rate(wi t) =
Do the same for partEpartwi then obtain per-weight learning rate
cf ADADELTA paper
17
H2O Deep Learning ArnoCandel
Detail Dropout Regularization18
Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations
age
income
employment
married
singleX
X
X
Testing Use all activations but reduce them by a factor p
(to ldquosimulaterdquo the missing activations during training)
cf Geoff Hintons paper
H2O Deep Learning ArnoCandel 19
Application Higgs Boson Classification
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
httparxivorgpdf14024735v2pdf
Images courtesy CERN LHC
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features (physics formulae) Train 10M rows Valid 500k Test 500k rows
H2O Deep Learning ArnoCandel 20
Live Demo Letrsquos see what Deep Learning can do with low-level features alone
Former baseline for AUC 0733 and 0816
H2O Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
H2O Deep Learning
add derived
features
Higgs Derived features are important
H2O Deep Learning ArnoCandel
MNIST digits classification
Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)
21
Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes
MNIST = Digitized handwritten digits database (Yann LeCun)
Data 28x28=784 pixels with (gray-scale) values in 0hellip255
Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo
H2O Deep Learning ArnoCandel 22
H2O Deep Learning beats MNIST
Standard 60k10k data No distortions
No convolutions No unsupervised training
No ensemble
10 hours on 10 16-core nodes
World-record 083 test set error
httplearnh2oaicontenthands-on_trainingdeep_learninghtml
H2O Deep Learning ArnoCandel
POJO Model Export for Production Scoring
23
Plain old Java code is auto-generated to take your H2O Deep Learning models into production
H2O Deep Learning ArnoCandel
Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)
24
Speedup
000
1000
2000
3000
4000
1 2 4 8 16 32 63
H2O Nodes
(4 cores per node 1 epoch per node per MapReduce)
27 mins
Training Time
0
25
50
75
100
1 2 4 8 16 32 63
H2O Nodes
in minutes
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
25
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Bag of words vector 0010000010001hellip0
vintagegold condition
Text Classification
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
26
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
MNIST Unsupervised Anomaly Detection with Deep Learning (Autoencoder)
27
The good The bad The ugly
smallmedianlarge reconstruction errorhttplearnh2oaicontenthands-on_traininganomaly_detectionhtml
H2O Deep Learning ArnoCandel 28
How well did Deep Learning do
Letrsquos see how H2O did in the past 10 minutes
Higgs Live Demo (Continued)
ltyour guessgt
reference paper results
Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN (H2O)
H2O Deep Learning ArnoCandel
H2O Steam Scoring Platform
29
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
httpserverportsteamindexhtml
Live Demo
H2O Deep Learning ArnoCandel 30
Live Demo on 10-node cluster lt10 minutes runtime for all H2O algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel
OutlineIntro (5 mins)
Methods amp Implementation (10 mins)
Results amp Live Demos (20 mins)
Higgs boson classification
MNIST handwritten digits
text classification
3
H2O Deep Learning ArnoCandel
Teamwork at H2OaiJava Apache v2 Open-Source
1 Java Machine Learning in Github Join the community
4
H2O Deep Learning ArnoCandel
H2O Open-Source (Apache v2) Predictive Analytics Platform
5
H2O Deep Learning ArnoCandel 6
H2O Architecture - Designed for speed scale accuracy amp ease of use
Key technical points bull distributed JVMs + REST API bull no Java GC issues
(data in byte[] Double) bull loss-less number compression bull Hadoop integration (v1YARN) bull R package (CRAN)
Pre-built fully featured algos K-Means NB PCA CoxPH GLM RF GBM DeepLearning
H2O Deep Learning ArnoCandel
WikipediaDeep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using
architectures composed of multiple non-linear transformations
What is Deep Learning
InputImage
Output User ID
7
Example Facebook DeepFace
H2O Deep Learning ArnoCandel
What is NOT DeepLinear models are not deep (by definition)
Neural nets with 1 hidden layer are not deep (only 1 layer - no feature hierarchy)
SVMs and Kernel methods are not deep (2 layers kernel + linear)
Classification trees are not deep (operate on original input space no new features generated)
8
H2O Deep Learning ArnoCandel
1970s multi-layer feed-forward Neural Network (stochastic gradient descent with back-propagation) + distributed processing for big data (fine-grain in-memory MapReduce on distributed data) + multi-threaded speedup (async forkjoin worker threads operate at FORTRAN speeds) + smart algorithms for fast amp accurate results (automatic standardization one-hot encoding of categoricals missing value imputation weight amp bias initialization adaptive learning rate momentum dropoutl1L2 regularization grid search N-fold cross-validation checkpointing load balancing auto-tuning model averaging etc)
= powerful tool for (un)supervised machine learning on real-world data
H2O Deep Learning9
all 320 cores maxed out
H2O Deep Learning ArnoCandel
ldquofully connectedrdquo directed graph of neurons
age
income
employment
married
single
Input layerHidden layer 1
Hidden layer 2
Output layer
3x4 4x3 3x2connections
information flow
inputoutput neuronhidden neuron
4 3 2neurons 3
Example Neural Network10
H2O Deep Learning ArnoCandel
age
income
employmentyj = tanh(sumi(xiuij)+bj)
uij
xi
yj
per-class probabilities sum(pl) = 1
zk = tanh(sumj(yjvjk)+ck)
vjk
zk pl
pl = softmax(sumk(zkwkl)+dl)
wkl
softmax(xk) = exp(xk) sumk(exp(xk))
ldquoneurons activate each other via weighted sumsrdquo
Prediction Forward Propagation
activation function tanh alternative
x -gt max(0x) ldquorectifierrdquo
pl is a non-linear function of xi can approximate ANY function
with enough layers
bj ck dl bias values(indep of inputs)
11
married
single
H2O Deep Learning ArnoCandel
age
income
employment
xi
Automatic standardization of data xi mean = 0 stddev = 1
horizontalize categorical variables eg
full-time part-time none self-employed -gt
010 = part-time 000 = self-employed
Automatic initialization of weights
Poor manrsquos initialization random weights wkl
Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))
Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)
12
married
single
wkl
H2O Deep Learning ArnoCandel
Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo
Training Update Weights amp Biases
Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)
For each training row we make a prediction and compare with the actual label (supervised learning)
married108predicted actual
Objective minimize prediction error (MSE or cross-entropy)
w ltmdash w - rate partEpartw
1
13
single002
E
wrate
H2O Deep Learning ArnoCandel
Backward Propagation
partEpartwi = partEparty partypartnet partnetpartwi
= part(error(y))party part(activation(net))partnet xi
Backprop Compute partEpartwi via chain rule going backwards
wi
net = sumi(wixi) + b
xiE = error(y)
y = activation(net)
How to compute partEpartwi for wi ltmdash wi - rate partEpartwi
Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow
14
H2O Deep Learning ArnoCandel
H2O Deep Learning Architecture
K-V
K-V
HTTPD
HTTPD
nodesJVMs sync
threads async
communication
w
w w
w w w w
w1 w3 w2w4
w2+w4w1+w3
w = (w1+w2+w3+w4)4
map each node trains a copy of the weights
and biases with (some or all of) its
local data with asynchronous FJ
threads
initial model weights and biases w
updated model w
H2O atomic in-memoryK-V store
reduce model averaging
average weights and biases from all nodes
speedup is at least nodeslog(rows) arxiv12094129v3
Keep iterating over the data (ldquoepochsrdquo) score from time to time
Query amp display the model via
JSON WWW
2
2 431
1
1
1
43 2
1 2
1
i
auto-tuned (default) or user-specified number of points per MapReduce iteration
15
H2O Deep Learning ArnoCandel
Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history
Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)
RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs
16
ldquoSecretrdquo Sauce to Higher Accuracy
H2O Deep Learning ArnoCandel
Detail Adaptive Learning Rate
Compute moving average of ∆wi2 at time t for window length rho
E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2
Compute RMS of ∆wi at time t with smoothing epsilon
RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )
Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)
Adaptive acceleration momentum accumulate previous weight updates but over a window of time
RMS[∆wi]t-1
RMS[partEpartwi]t
rate(wi t) =
Do the same for partEpartwi then obtain per-weight learning rate
cf ADADELTA paper
17
H2O Deep Learning ArnoCandel
Detail Dropout Regularization18
Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations
age
income
employment
married
singleX
X
X
Testing Use all activations but reduce them by a factor p
(to ldquosimulaterdquo the missing activations during training)
cf Geoff Hintons paper
H2O Deep Learning ArnoCandel 19
Application Higgs Boson Classification
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
httparxivorgpdf14024735v2pdf
Images courtesy CERN LHC
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features (physics formulae) Train 10M rows Valid 500k Test 500k rows
H2O Deep Learning ArnoCandel 20
Live Demo Letrsquos see what Deep Learning can do with low-level features alone
Former baseline for AUC 0733 and 0816
H2O Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
H2O Deep Learning
add derived
features
Higgs Derived features are important
H2O Deep Learning ArnoCandel
MNIST digits classification
Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)
21
Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes
MNIST = Digitized handwritten digits database (Yann LeCun)
Data 28x28=784 pixels with (gray-scale) values in 0hellip255
Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo
H2O Deep Learning ArnoCandel 22
H2O Deep Learning beats MNIST
Standard 60k10k data No distortions
No convolutions No unsupervised training
No ensemble
10 hours on 10 16-core nodes
World-record 083 test set error
httplearnh2oaicontenthands-on_trainingdeep_learninghtml
H2O Deep Learning ArnoCandel
POJO Model Export for Production Scoring
23
Plain old Java code is auto-generated to take your H2O Deep Learning models into production
H2O Deep Learning ArnoCandel
Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)
24
Speedup
000
1000
2000
3000
4000
1 2 4 8 16 32 63
H2O Nodes
(4 cores per node 1 epoch per node per MapReduce)
27 mins
Training Time
0
25
50
75
100
1 2 4 8 16 32 63
H2O Nodes
in minutes
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
25
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Bag of words vector 0010000010001hellip0
vintagegold condition
Text Classification
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
26
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
MNIST Unsupervised Anomaly Detection with Deep Learning (Autoencoder)
27
The good The bad The ugly
smallmedianlarge reconstruction errorhttplearnh2oaicontenthands-on_traininganomaly_detectionhtml
H2O Deep Learning ArnoCandel 28
How well did Deep Learning do
Letrsquos see how H2O did in the past 10 minutes
Higgs Live Demo (Continued)
ltyour guessgt
reference paper results
Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN (H2O)
H2O Deep Learning ArnoCandel
H2O Steam Scoring Platform
29
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
httpserverportsteamindexhtml
Live Demo
H2O Deep Learning ArnoCandel 30
Live Demo on 10-node cluster lt10 minutes runtime for all H2O algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel
Teamwork at H2OaiJava Apache v2 Open-Source
1 Java Machine Learning in Github Join the community
4
H2O Deep Learning ArnoCandel
H2O Open-Source (Apache v2) Predictive Analytics Platform
5
H2O Deep Learning ArnoCandel 6
H2O Architecture - Designed for speed scale accuracy amp ease of use
Key technical points bull distributed JVMs + REST API bull no Java GC issues
(data in byte[] Double) bull loss-less number compression bull Hadoop integration (v1YARN) bull R package (CRAN)
Pre-built fully featured algos K-Means NB PCA CoxPH GLM RF GBM DeepLearning
H2O Deep Learning ArnoCandel
WikipediaDeep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using
architectures composed of multiple non-linear transformations
What is Deep Learning
InputImage
Output User ID
7
Example Facebook DeepFace
H2O Deep Learning ArnoCandel
What is NOT DeepLinear models are not deep (by definition)
Neural nets with 1 hidden layer are not deep (only 1 layer - no feature hierarchy)
SVMs and Kernel methods are not deep (2 layers kernel + linear)
Classification trees are not deep (operate on original input space no new features generated)
8
H2O Deep Learning ArnoCandel
1970s multi-layer feed-forward Neural Network (stochastic gradient descent with back-propagation) + distributed processing for big data (fine-grain in-memory MapReduce on distributed data) + multi-threaded speedup (async forkjoin worker threads operate at FORTRAN speeds) + smart algorithms for fast amp accurate results (automatic standardization one-hot encoding of categoricals missing value imputation weight amp bias initialization adaptive learning rate momentum dropoutl1L2 regularization grid search N-fold cross-validation checkpointing load balancing auto-tuning model averaging etc)
= powerful tool for (un)supervised machine learning on real-world data
H2O Deep Learning9
all 320 cores maxed out
H2O Deep Learning ArnoCandel
ldquofully connectedrdquo directed graph of neurons
age
income
employment
married
single
Input layerHidden layer 1
Hidden layer 2
Output layer
3x4 4x3 3x2connections
information flow
inputoutput neuronhidden neuron
4 3 2neurons 3
Example Neural Network10
H2O Deep Learning ArnoCandel
age
income
employmentyj = tanh(sumi(xiuij)+bj)
uij
xi
yj
per-class probabilities sum(pl) = 1
zk = tanh(sumj(yjvjk)+ck)
vjk
zk pl
pl = softmax(sumk(zkwkl)+dl)
wkl
softmax(xk) = exp(xk) sumk(exp(xk))
ldquoneurons activate each other via weighted sumsrdquo
Prediction Forward Propagation
activation function tanh alternative
x -gt max(0x) ldquorectifierrdquo
pl is a non-linear function of xi can approximate ANY function
with enough layers
bj ck dl bias values(indep of inputs)
11
married
single
H2O Deep Learning ArnoCandel
age
income
employment
xi
Automatic standardization of data xi mean = 0 stddev = 1
horizontalize categorical variables eg
full-time part-time none self-employed -gt
010 = part-time 000 = self-employed
Automatic initialization of weights
Poor manrsquos initialization random weights wkl
Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))
Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)
12
married
single
wkl
H2O Deep Learning ArnoCandel
Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo
Training Update Weights amp Biases
Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)
For each training row we make a prediction and compare with the actual label (supervised learning)
married108predicted actual
Objective minimize prediction error (MSE or cross-entropy)
w ltmdash w - rate partEpartw
1
13
single002
E
wrate
H2O Deep Learning ArnoCandel
Backward Propagation
partEpartwi = partEparty partypartnet partnetpartwi
= part(error(y))party part(activation(net))partnet xi
Backprop Compute partEpartwi via chain rule going backwards
wi
net = sumi(wixi) + b
xiE = error(y)
y = activation(net)
How to compute partEpartwi for wi ltmdash wi - rate partEpartwi
Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow
14
H2O Deep Learning ArnoCandel
H2O Deep Learning Architecture
K-V
K-V
HTTPD
HTTPD
nodesJVMs sync
threads async
communication
w
w w
w w w w
w1 w3 w2w4
w2+w4w1+w3
w = (w1+w2+w3+w4)4
map each node trains a copy of the weights
and biases with (some or all of) its
local data with asynchronous FJ
threads
initial model weights and biases w
updated model w
H2O atomic in-memoryK-V store
reduce model averaging
average weights and biases from all nodes
speedup is at least nodeslog(rows) arxiv12094129v3
Keep iterating over the data (ldquoepochsrdquo) score from time to time
Query amp display the model via
JSON WWW
2
2 431
1
1
1
43 2
1 2
1
i
auto-tuned (default) or user-specified number of points per MapReduce iteration
15
H2O Deep Learning ArnoCandel
Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history
Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)
RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs
16
ldquoSecretrdquo Sauce to Higher Accuracy
H2O Deep Learning ArnoCandel
Detail Adaptive Learning Rate
Compute moving average of ∆wi2 at time t for window length rho
E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2
Compute RMS of ∆wi at time t with smoothing epsilon
RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )
Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)
Adaptive acceleration momentum accumulate previous weight updates but over a window of time
RMS[∆wi]t-1
RMS[partEpartwi]t
rate(wi t) =
Do the same for partEpartwi then obtain per-weight learning rate
cf ADADELTA paper
17
H2O Deep Learning ArnoCandel
Detail Dropout Regularization18
Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations
age
income
employment
married
singleX
X
X
Testing Use all activations but reduce them by a factor p
(to ldquosimulaterdquo the missing activations during training)
cf Geoff Hintons paper
H2O Deep Learning ArnoCandel 19
Application Higgs Boson Classification
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
httparxivorgpdf14024735v2pdf
Images courtesy CERN LHC
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features (physics formulae) Train 10M rows Valid 500k Test 500k rows
H2O Deep Learning ArnoCandel 20
Live Demo Letrsquos see what Deep Learning can do with low-level features alone
Former baseline for AUC 0733 and 0816
H2O Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
H2O Deep Learning
add derived
features
Higgs Derived features are important
H2O Deep Learning ArnoCandel
MNIST digits classification
Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)
21
Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes
MNIST = Digitized handwritten digits database (Yann LeCun)
Data 28x28=784 pixels with (gray-scale) values in 0hellip255
Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo
H2O Deep Learning ArnoCandel 22
H2O Deep Learning beats MNIST
Standard 60k10k data No distortions
No convolutions No unsupervised training
No ensemble
10 hours on 10 16-core nodes
World-record 083 test set error
httplearnh2oaicontenthands-on_trainingdeep_learninghtml
H2O Deep Learning ArnoCandel
POJO Model Export for Production Scoring
23
Plain old Java code is auto-generated to take your H2O Deep Learning models into production
H2O Deep Learning ArnoCandel
Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)
24
Speedup
000
1000
2000
3000
4000
1 2 4 8 16 32 63
H2O Nodes
(4 cores per node 1 epoch per node per MapReduce)
27 mins
Training Time
0
25
50
75
100
1 2 4 8 16 32 63
H2O Nodes
in minutes
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
25
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Bag of words vector 0010000010001hellip0
vintagegold condition
Text Classification
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
26
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
MNIST Unsupervised Anomaly Detection with Deep Learning (Autoencoder)
27
The good The bad The ugly
smallmedianlarge reconstruction errorhttplearnh2oaicontenthands-on_traininganomaly_detectionhtml
H2O Deep Learning ArnoCandel 28
How well did Deep Learning do
Letrsquos see how H2O did in the past 10 minutes
Higgs Live Demo (Continued)
ltyour guessgt
reference paper results
Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN (H2O)
H2O Deep Learning ArnoCandel
H2O Steam Scoring Platform
29
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
httpserverportsteamindexhtml
Live Demo
H2O Deep Learning ArnoCandel 30
Live Demo on 10-node cluster lt10 minutes runtime for all H2O algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel
H2O Open-Source (Apache v2) Predictive Analytics Platform
5
H2O Deep Learning ArnoCandel 6
H2O Architecture - Designed for speed scale accuracy amp ease of use
Key technical points bull distributed JVMs + REST API bull no Java GC issues
(data in byte[] Double) bull loss-less number compression bull Hadoop integration (v1YARN) bull R package (CRAN)
Pre-built fully featured algos K-Means NB PCA CoxPH GLM RF GBM DeepLearning
H2O Deep Learning ArnoCandel
WikipediaDeep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using
architectures composed of multiple non-linear transformations
What is Deep Learning
InputImage
Output User ID
7
Example Facebook DeepFace
H2O Deep Learning ArnoCandel
What is NOT DeepLinear models are not deep (by definition)
Neural nets with 1 hidden layer are not deep (only 1 layer - no feature hierarchy)
SVMs and Kernel methods are not deep (2 layers kernel + linear)
Classification trees are not deep (operate on original input space no new features generated)
8
H2O Deep Learning ArnoCandel
1970s multi-layer feed-forward Neural Network (stochastic gradient descent with back-propagation) + distributed processing for big data (fine-grain in-memory MapReduce on distributed data) + multi-threaded speedup (async forkjoin worker threads operate at FORTRAN speeds) + smart algorithms for fast amp accurate results (automatic standardization one-hot encoding of categoricals missing value imputation weight amp bias initialization adaptive learning rate momentum dropoutl1L2 regularization grid search N-fold cross-validation checkpointing load balancing auto-tuning model averaging etc)
= powerful tool for (un)supervised machine learning on real-world data
H2O Deep Learning9
all 320 cores maxed out
H2O Deep Learning ArnoCandel
ldquofully connectedrdquo directed graph of neurons
age
income
employment
married
single
Input layerHidden layer 1
Hidden layer 2
Output layer
3x4 4x3 3x2connections
information flow
inputoutput neuronhidden neuron
4 3 2neurons 3
Example Neural Network10
H2O Deep Learning ArnoCandel
age
income
employmentyj = tanh(sumi(xiuij)+bj)
uij
xi
yj
per-class probabilities sum(pl) = 1
zk = tanh(sumj(yjvjk)+ck)
vjk
zk pl
pl = softmax(sumk(zkwkl)+dl)
wkl
softmax(xk) = exp(xk) sumk(exp(xk))
ldquoneurons activate each other via weighted sumsrdquo
Prediction Forward Propagation
activation function tanh alternative
x -gt max(0x) ldquorectifierrdquo
pl is a non-linear function of xi can approximate ANY function
with enough layers
bj ck dl bias values(indep of inputs)
11
married
single
H2O Deep Learning ArnoCandel
age
income
employment
xi
Automatic standardization of data xi mean = 0 stddev = 1
horizontalize categorical variables eg
full-time part-time none self-employed -gt
010 = part-time 000 = self-employed
Automatic initialization of weights
Poor manrsquos initialization random weights wkl
Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))
Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)
12
married
single
wkl
H2O Deep Learning ArnoCandel
Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo
Training Update Weights amp Biases
Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)
For each training row we make a prediction and compare with the actual label (supervised learning)
married108predicted actual
Objective minimize prediction error (MSE or cross-entropy)
w ltmdash w - rate partEpartw
1
13
single002
E
wrate
H2O Deep Learning ArnoCandel
Backward Propagation
partEpartwi = partEparty partypartnet partnetpartwi
= part(error(y))party part(activation(net))partnet xi
Backprop Compute partEpartwi via chain rule going backwards
wi
net = sumi(wixi) + b
xiE = error(y)
y = activation(net)
How to compute partEpartwi for wi ltmdash wi - rate partEpartwi
Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow
14
H2O Deep Learning ArnoCandel
H2O Deep Learning Architecture
K-V
K-V
HTTPD
HTTPD
nodesJVMs sync
threads async
communication
w
w w
w w w w
w1 w3 w2w4
w2+w4w1+w3
w = (w1+w2+w3+w4)4
map each node trains a copy of the weights
and biases with (some or all of) its
local data with asynchronous FJ
threads
initial model weights and biases w
updated model w
H2O atomic in-memoryK-V store
reduce model averaging
average weights and biases from all nodes
speedup is at least nodeslog(rows) arxiv12094129v3
Keep iterating over the data (ldquoepochsrdquo) score from time to time
Query amp display the model via
JSON WWW
2
2 431
1
1
1
43 2
1 2
1
i
auto-tuned (default) or user-specified number of points per MapReduce iteration
15
H2O Deep Learning ArnoCandel
Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history
Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)
RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs
16
ldquoSecretrdquo Sauce to Higher Accuracy
H2O Deep Learning ArnoCandel
Detail Adaptive Learning Rate
Compute moving average of ∆wi2 at time t for window length rho
E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2
Compute RMS of ∆wi at time t with smoothing epsilon
RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )
Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)
Adaptive acceleration momentum accumulate previous weight updates but over a window of time
RMS[∆wi]t-1
RMS[partEpartwi]t
rate(wi t) =
Do the same for partEpartwi then obtain per-weight learning rate
cf ADADELTA paper
17
H2O Deep Learning ArnoCandel
Detail Dropout Regularization18
Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations
age
income
employment
married
singleX
X
X
Testing Use all activations but reduce them by a factor p
(to ldquosimulaterdquo the missing activations during training)
cf Geoff Hintons paper
H2O Deep Learning ArnoCandel 19
Application Higgs Boson Classification
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
httparxivorgpdf14024735v2pdf
Images courtesy CERN LHC
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features (physics formulae) Train 10M rows Valid 500k Test 500k rows
H2O Deep Learning ArnoCandel 20
Live Demo Letrsquos see what Deep Learning can do with low-level features alone
Former baseline for AUC 0733 and 0816
H2O Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
H2O Deep Learning
add derived
features
Higgs Derived features are important
H2O Deep Learning ArnoCandel
MNIST digits classification
Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)
21
Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes
MNIST = Digitized handwritten digits database (Yann LeCun)
Data 28x28=784 pixels with (gray-scale) values in 0hellip255
Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo
H2O Deep Learning ArnoCandel 22
H2O Deep Learning beats MNIST
Standard 60k10k data No distortions
No convolutions No unsupervised training
No ensemble
10 hours on 10 16-core nodes
World-record 083 test set error
httplearnh2oaicontenthands-on_trainingdeep_learninghtml
H2O Deep Learning ArnoCandel
POJO Model Export for Production Scoring
23
Plain old Java code is auto-generated to take your H2O Deep Learning models into production
H2O Deep Learning ArnoCandel
Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)
24
Speedup
000
1000
2000
3000
4000
1 2 4 8 16 32 63
H2O Nodes
(4 cores per node 1 epoch per node per MapReduce)
27 mins
Training Time
0
25
50
75
100
1 2 4 8 16 32 63
H2O Nodes
in minutes
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
25
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Bag of words vector 0010000010001hellip0
vintagegold condition
Text Classification
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
26
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
MNIST Unsupervised Anomaly Detection with Deep Learning (Autoencoder)
27
The good The bad The ugly
smallmedianlarge reconstruction errorhttplearnh2oaicontenthands-on_traininganomaly_detectionhtml
H2O Deep Learning ArnoCandel 28
How well did Deep Learning do
Letrsquos see how H2O did in the past 10 minutes
Higgs Live Demo (Continued)
ltyour guessgt
reference paper results
Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN (H2O)
H2O Deep Learning ArnoCandel
H2O Steam Scoring Platform
29
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
httpserverportsteamindexhtml
Live Demo
H2O Deep Learning ArnoCandel 30
Live Demo on 10-node cluster lt10 minutes runtime for all H2O algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel 6
H2O Architecture - Designed for speed scale accuracy amp ease of use
Key technical points bull distributed JVMs + REST API bull no Java GC issues
(data in byte[] Double) bull loss-less number compression bull Hadoop integration (v1YARN) bull R package (CRAN)
Pre-built fully featured algos K-Means NB PCA CoxPH GLM RF GBM DeepLearning
H2O Deep Learning ArnoCandel
WikipediaDeep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using
architectures composed of multiple non-linear transformations
What is Deep Learning
InputImage
Output User ID
7
Example Facebook DeepFace
H2O Deep Learning ArnoCandel
What is NOT DeepLinear models are not deep (by definition)
Neural nets with 1 hidden layer are not deep (only 1 layer - no feature hierarchy)
SVMs and Kernel methods are not deep (2 layers kernel + linear)
Classification trees are not deep (operate on original input space no new features generated)
8
H2O Deep Learning ArnoCandel
1970s multi-layer feed-forward Neural Network (stochastic gradient descent with back-propagation) + distributed processing for big data (fine-grain in-memory MapReduce on distributed data) + multi-threaded speedup (async forkjoin worker threads operate at FORTRAN speeds) + smart algorithms for fast amp accurate results (automatic standardization one-hot encoding of categoricals missing value imputation weight amp bias initialization adaptive learning rate momentum dropoutl1L2 regularization grid search N-fold cross-validation checkpointing load balancing auto-tuning model averaging etc)
= powerful tool for (un)supervised machine learning on real-world data
H2O Deep Learning9
all 320 cores maxed out
H2O Deep Learning ArnoCandel
ldquofully connectedrdquo directed graph of neurons
age
income
employment
married
single
Input layerHidden layer 1
Hidden layer 2
Output layer
3x4 4x3 3x2connections
information flow
inputoutput neuronhidden neuron
4 3 2neurons 3
Example Neural Network10
H2O Deep Learning ArnoCandel
age
income
employmentyj = tanh(sumi(xiuij)+bj)
uij
xi
yj
per-class probabilities sum(pl) = 1
zk = tanh(sumj(yjvjk)+ck)
vjk
zk pl
pl = softmax(sumk(zkwkl)+dl)
wkl
softmax(xk) = exp(xk) sumk(exp(xk))
ldquoneurons activate each other via weighted sumsrdquo
Prediction Forward Propagation
activation function tanh alternative
x -gt max(0x) ldquorectifierrdquo
pl is a non-linear function of xi can approximate ANY function
with enough layers
bj ck dl bias values(indep of inputs)
11
married
single
H2O Deep Learning ArnoCandel
age
income
employment
xi
Automatic standardization of data xi mean = 0 stddev = 1
horizontalize categorical variables eg
full-time part-time none self-employed -gt
010 = part-time 000 = self-employed
Automatic initialization of weights
Poor manrsquos initialization random weights wkl
Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))
Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)
12
married
single
wkl
H2O Deep Learning ArnoCandel
Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo
Training Update Weights amp Biases
Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)
For each training row we make a prediction and compare with the actual label (supervised learning)
married108predicted actual
Objective minimize prediction error (MSE or cross-entropy)
w ltmdash w - rate partEpartw
1
13
single002
E
wrate
H2O Deep Learning ArnoCandel
Backward Propagation
partEpartwi = partEparty partypartnet partnetpartwi
= part(error(y))party part(activation(net))partnet xi
Backprop Compute partEpartwi via chain rule going backwards
wi
net = sumi(wixi) + b
xiE = error(y)
y = activation(net)
How to compute partEpartwi for wi ltmdash wi - rate partEpartwi
Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow
14
H2O Deep Learning ArnoCandel
H2O Deep Learning Architecture
K-V
K-V
HTTPD
HTTPD
nodesJVMs sync
threads async
communication
w
w w
w w w w
w1 w3 w2w4
w2+w4w1+w3
w = (w1+w2+w3+w4)4
map each node trains a copy of the weights
and biases with (some or all of) its
local data with asynchronous FJ
threads
initial model weights and biases w
updated model w
H2O atomic in-memoryK-V store
reduce model averaging
average weights and biases from all nodes
speedup is at least nodeslog(rows) arxiv12094129v3
Keep iterating over the data (ldquoepochsrdquo) score from time to time
Query amp display the model via
JSON WWW
2
2 431
1
1
1
43 2
1 2
1
i
auto-tuned (default) or user-specified number of points per MapReduce iteration
15
H2O Deep Learning ArnoCandel
Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history
Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)
RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs
16
ldquoSecretrdquo Sauce to Higher Accuracy
H2O Deep Learning ArnoCandel
Detail Adaptive Learning Rate
Compute moving average of ∆wi2 at time t for window length rho
E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2
Compute RMS of ∆wi at time t with smoothing epsilon
RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )
Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)
Adaptive acceleration momentum accumulate previous weight updates but over a window of time
RMS[∆wi]t-1
RMS[partEpartwi]t
rate(wi t) =
Do the same for partEpartwi then obtain per-weight learning rate
cf ADADELTA paper
17
H2O Deep Learning ArnoCandel
Detail Dropout Regularization18
Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations
age
income
employment
married
singleX
X
X
Testing Use all activations but reduce them by a factor p
(to ldquosimulaterdquo the missing activations during training)
cf Geoff Hintons paper
H2O Deep Learning ArnoCandel 19
Application Higgs Boson Classification
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
httparxivorgpdf14024735v2pdf
Images courtesy CERN LHC
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features (physics formulae) Train 10M rows Valid 500k Test 500k rows
H2O Deep Learning ArnoCandel 20
Live Demo Letrsquos see what Deep Learning can do with low-level features alone
Former baseline for AUC 0733 and 0816
H2O Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
H2O Deep Learning
add derived
features
Higgs Derived features are important
H2O Deep Learning ArnoCandel
MNIST digits classification
Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)
21
Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes
MNIST = Digitized handwritten digits database (Yann LeCun)
Data 28x28=784 pixels with (gray-scale) values in 0hellip255
Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo
H2O Deep Learning ArnoCandel 22
H2O Deep Learning beats MNIST
Standard 60k10k data No distortions
No convolutions No unsupervised training
No ensemble
10 hours on 10 16-core nodes
World-record 083 test set error
httplearnh2oaicontenthands-on_trainingdeep_learninghtml
H2O Deep Learning ArnoCandel
POJO Model Export for Production Scoring
23
Plain old Java code is auto-generated to take your H2O Deep Learning models into production
H2O Deep Learning ArnoCandel
Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)
24
Speedup
000
1000
2000
3000
4000
1 2 4 8 16 32 63
H2O Nodes
(4 cores per node 1 epoch per node per MapReduce)
27 mins
Training Time
0
25
50
75
100
1 2 4 8 16 32 63
H2O Nodes
in minutes
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
25
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Bag of words vector 0010000010001hellip0
vintagegold condition
Text Classification
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
26
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
MNIST Unsupervised Anomaly Detection with Deep Learning (Autoencoder)
27
The good The bad The ugly
smallmedianlarge reconstruction errorhttplearnh2oaicontenthands-on_traininganomaly_detectionhtml
H2O Deep Learning ArnoCandel 28
How well did Deep Learning do
Letrsquos see how H2O did in the past 10 minutes
Higgs Live Demo (Continued)
ltyour guessgt
reference paper results
Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN (H2O)
H2O Deep Learning ArnoCandel
H2O Steam Scoring Platform
29
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
httpserverportsteamindexhtml
Live Demo
H2O Deep Learning ArnoCandel 30
Live Demo on 10-node cluster lt10 minutes runtime for all H2O algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel
WikipediaDeep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using
architectures composed of multiple non-linear transformations
What is Deep Learning
InputImage
Output User ID
7
Example Facebook DeepFace
H2O Deep Learning ArnoCandel
What is NOT DeepLinear models are not deep (by definition)
Neural nets with 1 hidden layer are not deep (only 1 layer - no feature hierarchy)
SVMs and Kernel methods are not deep (2 layers kernel + linear)
Classification trees are not deep (operate on original input space no new features generated)
8
H2O Deep Learning ArnoCandel
1970s multi-layer feed-forward Neural Network (stochastic gradient descent with back-propagation) + distributed processing for big data (fine-grain in-memory MapReduce on distributed data) + multi-threaded speedup (async forkjoin worker threads operate at FORTRAN speeds) + smart algorithms for fast amp accurate results (automatic standardization one-hot encoding of categoricals missing value imputation weight amp bias initialization adaptive learning rate momentum dropoutl1L2 regularization grid search N-fold cross-validation checkpointing load balancing auto-tuning model averaging etc)
= powerful tool for (un)supervised machine learning on real-world data
H2O Deep Learning9
all 320 cores maxed out
H2O Deep Learning ArnoCandel
ldquofully connectedrdquo directed graph of neurons
age
income
employment
married
single
Input layerHidden layer 1
Hidden layer 2
Output layer
3x4 4x3 3x2connections
information flow
inputoutput neuronhidden neuron
4 3 2neurons 3
Example Neural Network10
H2O Deep Learning ArnoCandel
age
income
employmentyj = tanh(sumi(xiuij)+bj)
uij
xi
yj
per-class probabilities sum(pl) = 1
zk = tanh(sumj(yjvjk)+ck)
vjk
zk pl
pl = softmax(sumk(zkwkl)+dl)
wkl
softmax(xk) = exp(xk) sumk(exp(xk))
ldquoneurons activate each other via weighted sumsrdquo
Prediction Forward Propagation
activation function tanh alternative
x -gt max(0x) ldquorectifierrdquo
pl is a non-linear function of xi can approximate ANY function
with enough layers
bj ck dl bias values(indep of inputs)
11
married
single
H2O Deep Learning ArnoCandel
age
income
employment
xi
Automatic standardization of data xi mean = 0 stddev = 1
horizontalize categorical variables eg
full-time part-time none self-employed -gt
010 = part-time 000 = self-employed
Automatic initialization of weights
Poor manrsquos initialization random weights wkl
Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))
Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)
12
married
single
wkl
H2O Deep Learning ArnoCandel
Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo
Training Update Weights amp Biases
Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)
For each training row we make a prediction and compare with the actual label (supervised learning)
married108predicted actual
Objective minimize prediction error (MSE or cross-entropy)
w ltmdash w - rate partEpartw
1
13
single002
E
wrate
H2O Deep Learning ArnoCandel
Backward Propagation
partEpartwi = partEparty partypartnet partnetpartwi
= part(error(y))party part(activation(net))partnet xi
Backprop Compute partEpartwi via chain rule going backwards
wi
net = sumi(wixi) + b
xiE = error(y)
y = activation(net)
How to compute partEpartwi for wi ltmdash wi - rate partEpartwi
Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow
14
H2O Deep Learning ArnoCandel
H2O Deep Learning Architecture
K-V
K-V
HTTPD
HTTPD
nodesJVMs sync
threads async
communication
w
w w
w w w w
w1 w3 w2w4
w2+w4w1+w3
w = (w1+w2+w3+w4)4
map each node trains a copy of the weights
and biases with (some or all of) its
local data with asynchronous FJ
threads
initial model weights and biases w
updated model w
H2O atomic in-memoryK-V store
reduce model averaging
average weights and biases from all nodes
speedup is at least nodeslog(rows) arxiv12094129v3
Keep iterating over the data (ldquoepochsrdquo) score from time to time
Query amp display the model via
JSON WWW
2
2 431
1
1
1
43 2
1 2
1
i
auto-tuned (default) or user-specified number of points per MapReduce iteration
15
H2O Deep Learning ArnoCandel
Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history
Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)
RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs
16
ldquoSecretrdquo Sauce to Higher Accuracy
H2O Deep Learning ArnoCandel
Detail Adaptive Learning Rate
Compute moving average of ∆wi2 at time t for window length rho
E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2
Compute RMS of ∆wi at time t with smoothing epsilon
RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )
Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)
Adaptive acceleration momentum accumulate previous weight updates but over a window of time
RMS[∆wi]t-1
RMS[partEpartwi]t
rate(wi t) =
Do the same for partEpartwi then obtain per-weight learning rate
cf ADADELTA paper
17
H2O Deep Learning ArnoCandel
Detail Dropout Regularization18
Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations
age
income
employment
married
singleX
X
X
Testing Use all activations but reduce them by a factor p
(to ldquosimulaterdquo the missing activations during training)
cf Geoff Hintons paper
H2O Deep Learning ArnoCandel 19
Application Higgs Boson Classification
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
httparxivorgpdf14024735v2pdf
Images courtesy CERN LHC
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features (physics formulae) Train 10M rows Valid 500k Test 500k rows
H2O Deep Learning ArnoCandel 20
Live Demo Letrsquos see what Deep Learning can do with low-level features alone
Former baseline for AUC 0733 and 0816
H2O Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
H2O Deep Learning
add derived
features
Higgs Derived features are important
H2O Deep Learning ArnoCandel
MNIST digits classification
Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)
21
Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes
MNIST = Digitized handwritten digits database (Yann LeCun)
Data 28x28=784 pixels with (gray-scale) values in 0hellip255
Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo
H2O Deep Learning ArnoCandel 22
H2O Deep Learning beats MNIST
Standard 60k10k data No distortions
No convolutions No unsupervised training
No ensemble
10 hours on 10 16-core nodes
World-record 083 test set error
httplearnh2oaicontenthands-on_trainingdeep_learninghtml
H2O Deep Learning ArnoCandel
POJO Model Export for Production Scoring
23
Plain old Java code is auto-generated to take your H2O Deep Learning models into production
H2O Deep Learning ArnoCandel
Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)
24
Speedup
000
1000
2000
3000
4000
1 2 4 8 16 32 63
H2O Nodes
(4 cores per node 1 epoch per node per MapReduce)
27 mins
Training Time
0
25
50
75
100
1 2 4 8 16 32 63
H2O Nodes
in minutes
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
25
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Bag of words vector 0010000010001hellip0
vintagegold condition
Text Classification
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
26
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
MNIST Unsupervised Anomaly Detection with Deep Learning (Autoencoder)
27
The good The bad The ugly
smallmedianlarge reconstruction errorhttplearnh2oaicontenthands-on_traininganomaly_detectionhtml
H2O Deep Learning ArnoCandel 28
How well did Deep Learning do
Letrsquos see how H2O did in the past 10 minutes
Higgs Live Demo (Continued)
ltyour guessgt
reference paper results
Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN (H2O)
H2O Deep Learning ArnoCandel
H2O Steam Scoring Platform
29
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
httpserverportsteamindexhtml
Live Demo
H2O Deep Learning ArnoCandel 30
Live Demo on 10-node cluster lt10 minutes runtime for all H2O algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel
What is NOT DeepLinear models are not deep (by definition)
Neural nets with 1 hidden layer are not deep (only 1 layer - no feature hierarchy)
SVMs and Kernel methods are not deep (2 layers kernel + linear)
Classification trees are not deep (operate on original input space no new features generated)
8
H2O Deep Learning ArnoCandel
1970s multi-layer feed-forward Neural Network (stochastic gradient descent with back-propagation) + distributed processing for big data (fine-grain in-memory MapReduce on distributed data) + multi-threaded speedup (async forkjoin worker threads operate at FORTRAN speeds) + smart algorithms for fast amp accurate results (automatic standardization one-hot encoding of categoricals missing value imputation weight amp bias initialization adaptive learning rate momentum dropoutl1L2 regularization grid search N-fold cross-validation checkpointing load balancing auto-tuning model averaging etc)
= powerful tool for (un)supervised machine learning on real-world data
H2O Deep Learning9
all 320 cores maxed out
H2O Deep Learning ArnoCandel
ldquofully connectedrdquo directed graph of neurons
age
income
employment
married
single
Input layerHidden layer 1
Hidden layer 2
Output layer
3x4 4x3 3x2connections
information flow
inputoutput neuronhidden neuron
4 3 2neurons 3
Example Neural Network10
H2O Deep Learning ArnoCandel
age
income
employmentyj = tanh(sumi(xiuij)+bj)
uij
xi
yj
per-class probabilities sum(pl) = 1
zk = tanh(sumj(yjvjk)+ck)
vjk
zk pl
pl = softmax(sumk(zkwkl)+dl)
wkl
softmax(xk) = exp(xk) sumk(exp(xk))
ldquoneurons activate each other via weighted sumsrdquo
Prediction Forward Propagation
activation function tanh alternative
x -gt max(0x) ldquorectifierrdquo
pl is a non-linear function of xi can approximate ANY function
with enough layers
bj ck dl bias values(indep of inputs)
11
married
single
H2O Deep Learning ArnoCandel
age
income
employment
xi
Automatic standardization of data xi mean = 0 stddev = 1
horizontalize categorical variables eg
full-time part-time none self-employed -gt
010 = part-time 000 = self-employed
Automatic initialization of weights
Poor manrsquos initialization random weights wkl
Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))
Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)
12
married
single
wkl
H2O Deep Learning ArnoCandel
Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo
Training Update Weights amp Biases
Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)
For each training row we make a prediction and compare with the actual label (supervised learning)
married108predicted actual
Objective minimize prediction error (MSE or cross-entropy)
w ltmdash w - rate partEpartw
1
13
single002
E
wrate
H2O Deep Learning ArnoCandel
Backward Propagation
partEpartwi = partEparty partypartnet partnetpartwi
= part(error(y))party part(activation(net))partnet xi
Backprop Compute partEpartwi via chain rule going backwards
wi
net = sumi(wixi) + b
xiE = error(y)
y = activation(net)
How to compute partEpartwi for wi ltmdash wi - rate partEpartwi
Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow
14
H2O Deep Learning ArnoCandel
H2O Deep Learning Architecture
K-V
K-V
HTTPD
HTTPD
nodesJVMs sync
threads async
communication
w
w w
w w w w
w1 w3 w2w4
w2+w4w1+w3
w = (w1+w2+w3+w4)4
map each node trains a copy of the weights
and biases with (some or all of) its
local data with asynchronous FJ
threads
initial model weights and biases w
updated model w
H2O atomic in-memoryK-V store
reduce model averaging
average weights and biases from all nodes
speedup is at least nodeslog(rows) arxiv12094129v3
Keep iterating over the data (ldquoepochsrdquo) score from time to time
Query amp display the model via
JSON WWW
2
2 431
1
1
1
43 2
1 2
1
i
auto-tuned (default) or user-specified number of points per MapReduce iteration
15
H2O Deep Learning ArnoCandel
Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history
Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)
RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs
16
ldquoSecretrdquo Sauce to Higher Accuracy
H2O Deep Learning ArnoCandel
Detail Adaptive Learning Rate
Compute moving average of ∆wi2 at time t for window length rho
E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2
Compute RMS of ∆wi at time t with smoothing epsilon
RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )
Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)
Adaptive acceleration momentum accumulate previous weight updates but over a window of time
RMS[∆wi]t-1
RMS[partEpartwi]t
rate(wi t) =
Do the same for partEpartwi then obtain per-weight learning rate
cf ADADELTA paper
17
H2O Deep Learning ArnoCandel
Detail Dropout Regularization18
Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations
age
income
employment
married
singleX
X
X
Testing Use all activations but reduce them by a factor p
(to ldquosimulaterdquo the missing activations during training)
cf Geoff Hintons paper
H2O Deep Learning ArnoCandel 19
Application Higgs Boson Classification
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
httparxivorgpdf14024735v2pdf
Images courtesy CERN LHC
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features (physics formulae) Train 10M rows Valid 500k Test 500k rows
H2O Deep Learning ArnoCandel 20
Live Demo Letrsquos see what Deep Learning can do with low-level features alone
Former baseline for AUC 0733 and 0816
H2O Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
H2O Deep Learning
add derived
features
Higgs Derived features are important
H2O Deep Learning ArnoCandel
MNIST digits classification
Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)
21
Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes
MNIST = Digitized handwritten digits database (Yann LeCun)
Data 28x28=784 pixels with (gray-scale) values in 0hellip255
Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo
H2O Deep Learning ArnoCandel 22
H2O Deep Learning beats MNIST
Standard 60k10k data No distortions
No convolutions No unsupervised training
No ensemble
10 hours on 10 16-core nodes
World-record 083 test set error
httplearnh2oaicontenthands-on_trainingdeep_learninghtml
H2O Deep Learning ArnoCandel
POJO Model Export for Production Scoring
23
Plain old Java code is auto-generated to take your H2O Deep Learning models into production
H2O Deep Learning ArnoCandel
Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)
24
Speedup
000
1000
2000
3000
4000
1 2 4 8 16 32 63
H2O Nodes
(4 cores per node 1 epoch per node per MapReduce)
27 mins
Training Time
0
25
50
75
100
1 2 4 8 16 32 63
H2O Nodes
in minutes
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
25
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Bag of words vector 0010000010001hellip0
vintagegold condition
Text Classification
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
26
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
MNIST Unsupervised Anomaly Detection with Deep Learning (Autoencoder)
27
The good The bad The ugly
smallmedianlarge reconstruction errorhttplearnh2oaicontenthands-on_traininganomaly_detectionhtml
H2O Deep Learning ArnoCandel 28
How well did Deep Learning do
Letrsquos see how H2O did in the past 10 minutes
Higgs Live Demo (Continued)
ltyour guessgt
reference paper results
Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN (H2O)
H2O Deep Learning ArnoCandel
H2O Steam Scoring Platform
29
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
httpserverportsteamindexhtml
Live Demo
H2O Deep Learning ArnoCandel 30
Live Demo on 10-node cluster lt10 minutes runtime for all H2O algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel
1970s multi-layer feed-forward Neural Network (stochastic gradient descent with back-propagation) + distributed processing for big data (fine-grain in-memory MapReduce on distributed data) + multi-threaded speedup (async forkjoin worker threads operate at FORTRAN speeds) + smart algorithms for fast amp accurate results (automatic standardization one-hot encoding of categoricals missing value imputation weight amp bias initialization adaptive learning rate momentum dropoutl1L2 regularization grid search N-fold cross-validation checkpointing load balancing auto-tuning model averaging etc)
= powerful tool for (un)supervised machine learning on real-world data
H2O Deep Learning9
all 320 cores maxed out
H2O Deep Learning ArnoCandel
ldquofully connectedrdquo directed graph of neurons
age
income
employment
married
single
Input layerHidden layer 1
Hidden layer 2
Output layer
3x4 4x3 3x2connections
information flow
inputoutput neuronhidden neuron
4 3 2neurons 3
Example Neural Network10
H2O Deep Learning ArnoCandel
age
income
employmentyj = tanh(sumi(xiuij)+bj)
uij
xi
yj
per-class probabilities sum(pl) = 1
zk = tanh(sumj(yjvjk)+ck)
vjk
zk pl
pl = softmax(sumk(zkwkl)+dl)
wkl
softmax(xk) = exp(xk) sumk(exp(xk))
ldquoneurons activate each other via weighted sumsrdquo
Prediction Forward Propagation
activation function tanh alternative
x -gt max(0x) ldquorectifierrdquo
pl is a non-linear function of xi can approximate ANY function
with enough layers
bj ck dl bias values(indep of inputs)
11
married
single
H2O Deep Learning ArnoCandel
age
income
employment
xi
Automatic standardization of data xi mean = 0 stddev = 1
horizontalize categorical variables eg
full-time part-time none self-employed -gt
010 = part-time 000 = self-employed
Automatic initialization of weights
Poor manrsquos initialization random weights wkl
Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))
Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)
12
married
single
wkl
H2O Deep Learning ArnoCandel
Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo
Training Update Weights amp Biases
Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)
For each training row we make a prediction and compare with the actual label (supervised learning)
married108predicted actual
Objective minimize prediction error (MSE or cross-entropy)
w ltmdash w - rate partEpartw
1
13
single002
E
wrate
H2O Deep Learning ArnoCandel
Backward Propagation
partEpartwi = partEparty partypartnet partnetpartwi
= part(error(y))party part(activation(net))partnet xi
Backprop Compute partEpartwi via chain rule going backwards
wi
net = sumi(wixi) + b
xiE = error(y)
y = activation(net)
How to compute partEpartwi for wi ltmdash wi - rate partEpartwi
Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow
14
H2O Deep Learning ArnoCandel
H2O Deep Learning Architecture
K-V
K-V
HTTPD
HTTPD
nodesJVMs sync
threads async
communication
w
w w
w w w w
w1 w3 w2w4
w2+w4w1+w3
w = (w1+w2+w3+w4)4
map each node trains a copy of the weights
and biases with (some or all of) its
local data with asynchronous FJ
threads
initial model weights and biases w
updated model w
H2O atomic in-memoryK-V store
reduce model averaging
average weights and biases from all nodes
speedup is at least nodeslog(rows) arxiv12094129v3
Keep iterating over the data (ldquoepochsrdquo) score from time to time
Query amp display the model via
JSON WWW
2
2 431
1
1
1
43 2
1 2
1
i
auto-tuned (default) or user-specified number of points per MapReduce iteration
15
H2O Deep Learning ArnoCandel
Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history
Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)
RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs
16
ldquoSecretrdquo Sauce to Higher Accuracy
H2O Deep Learning ArnoCandel
Detail Adaptive Learning Rate
Compute moving average of ∆wi2 at time t for window length rho
E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2
Compute RMS of ∆wi at time t with smoothing epsilon
RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )
Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)
Adaptive acceleration momentum accumulate previous weight updates but over a window of time
RMS[∆wi]t-1
RMS[partEpartwi]t
rate(wi t) =
Do the same for partEpartwi then obtain per-weight learning rate
cf ADADELTA paper
17
H2O Deep Learning ArnoCandel
Detail Dropout Regularization18
Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations
age
income
employment
married
singleX
X
X
Testing Use all activations but reduce them by a factor p
(to ldquosimulaterdquo the missing activations during training)
cf Geoff Hintons paper
H2O Deep Learning ArnoCandel 19
Application Higgs Boson Classification
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
httparxivorgpdf14024735v2pdf
Images courtesy CERN LHC
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features (physics formulae) Train 10M rows Valid 500k Test 500k rows
H2O Deep Learning ArnoCandel 20
Live Demo Letrsquos see what Deep Learning can do with low-level features alone
Former baseline for AUC 0733 and 0816
H2O Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
H2O Deep Learning
add derived
features
Higgs Derived features are important
H2O Deep Learning ArnoCandel
MNIST digits classification
Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)
21
Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes
MNIST = Digitized handwritten digits database (Yann LeCun)
Data 28x28=784 pixels with (gray-scale) values in 0hellip255
Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo
H2O Deep Learning ArnoCandel 22
H2O Deep Learning beats MNIST
Standard 60k10k data No distortions
No convolutions No unsupervised training
No ensemble
10 hours on 10 16-core nodes
World-record 083 test set error
httplearnh2oaicontenthands-on_trainingdeep_learninghtml
H2O Deep Learning ArnoCandel
POJO Model Export for Production Scoring
23
Plain old Java code is auto-generated to take your H2O Deep Learning models into production
H2O Deep Learning ArnoCandel
Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)
24
Speedup
000
1000
2000
3000
4000
1 2 4 8 16 32 63
H2O Nodes
(4 cores per node 1 epoch per node per MapReduce)
27 mins
Training Time
0
25
50
75
100
1 2 4 8 16 32 63
H2O Nodes
in minutes
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
25
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Bag of words vector 0010000010001hellip0
vintagegold condition
Text Classification
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
26
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
MNIST Unsupervised Anomaly Detection with Deep Learning (Autoencoder)
27
The good The bad The ugly
smallmedianlarge reconstruction errorhttplearnh2oaicontenthands-on_traininganomaly_detectionhtml
H2O Deep Learning ArnoCandel 28
How well did Deep Learning do
Letrsquos see how H2O did in the past 10 minutes
Higgs Live Demo (Continued)
ltyour guessgt
reference paper results
Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN (H2O)
H2O Deep Learning ArnoCandel
H2O Steam Scoring Platform
29
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
httpserverportsteamindexhtml
Live Demo
H2O Deep Learning ArnoCandel 30
Live Demo on 10-node cluster lt10 minutes runtime for all H2O algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel
ldquofully connectedrdquo directed graph of neurons
age
income
employment
married
single
Input layerHidden layer 1
Hidden layer 2
Output layer
3x4 4x3 3x2connections
information flow
inputoutput neuronhidden neuron
4 3 2neurons 3
Example Neural Network10
H2O Deep Learning ArnoCandel
age
income
employmentyj = tanh(sumi(xiuij)+bj)
uij
xi
yj
per-class probabilities sum(pl) = 1
zk = tanh(sumj(yjvjk)+ck)
vjk
zk pl
pl = softmax(sumk(zkwkl)+dl)
wkl
softmax(xk) = exp(xk) sumk(exp(xk))
ldquoneurons activate each other via weighted sumsrdquo
Prediction Forward Propagation
activation function tanh alternative
x -gt max(0x) ldquorectifierrdquo
pl is a non-linear function of xi can approximate ANY function
with enough layers
bj ck dl bias values(indep of inputs)
11
married
single
H2O Deep Learning ArnoCandel
age
income
employment
xi
Automatic standardization of data xi mean = 0 stddev = 1
horizontalize categorical variables eg
full-time part-time none self-employed -gt
010 = part-time 000 = self-employed
Automatic initialization of weights
Poor manrsquos initialization random weights wkl
Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))
Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)
12
married
single
wkl
H2O Deep Learning ArnoCandel
Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo
Training Update Weights amp Biases
Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)
For each training row we make a prediction and compare with the actual label (supervised learning)
married108predicted actual
Objective minimize prediction error (MSE or cross-entropy)
w ltmdash w - rate partEpartw
1
13
single002
E
wrate
H2O Deep Learning ArnoCandel
Backward Propagation
partEpartwi = partEparty partypartnet partnetpartwi
= part(error(y))party part(activation(net))partnet xi
Backprop Compute partEpartwi via chain rule going backwards
wi
net = sumi(wixi) + b
xiE = error(y)
y = activation(net)
How to compute partEpartwi for wi ltmdash wi - rate partEpartwi
Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow
14
H2O Deep Learning ArnoCandel
H2O Deep Learning Architecture
K-V
K-V
HTTPD
HTTPD
nodesJVMs sync
threads async
communication
w
w w
w w w w
w1 w3 w2w4
w2+w4w1+w3
w = (w1+w2+w3+w4)4
map each node trains a copy of the weights
and biases with (some or all of) its
local data with asynchronous FJ
threads
initial model weights and biases w
updated model w
H2O atomic in-memoryK-V store
reduce model averaging
average weights and biases from all nodes
speedup is at least nodeslog(rows) arxiv12094129v3
Keep iterating over the data (ldquoepochsrdquo) score from time to time
Query amp display the model via
JSON WWW
2
2 431
1
1
1
43 2
1 2
1
i
auto-tuned (default) or user-specified number of points per MapReduce iteration
15
H2O Deep Learning ArnoCandel
Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history
Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)
RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs
16
ldquoSecretrdquo Sauce to Higher Accuracy
H2O Deep Learning ArnoCandel
Detail Adaptive Learning Rate
Compute moving average of ∆wi2 at time t for window length rho
E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2
Compute RMS of ∆wi at time t with smoothing epsilon
RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )
Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)
Adaptive acceleration momentum accumulate previous weight updates but over a window of time
RMS[∆wi]t-1
RMS[partEpartwi]t
rate(wi t) =
Do the same for partEpartwi then obtain per-weight learning rate
cf ADADELTA paper
17
H2O Deep Learning ArnoCandel
Detail Dropout Regularization18
Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations
age
income
employment
married
singleX
X
X
Testing Use all activations but reduce them by a factor p
(to ldquosimulaterdquo the missing activations during training)
cf Geoff Hintons paper
H2O Deep Learning ArnoCandel 19
Application Higgs Boson Classification
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
httparxivorgpdf14024735v2pdf
Images courtesy CERN LHC
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features (physics formulae) Train 10M rows Valid 500k Test 500k rows
H2O Deep Learning ArnoCandel 20
Live Demo Letrsquos see what Deep Learning can do with low-level features alone
Former baseline for AUC 0733 and 0816
H2O Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
H2O Deep Learning
add derived
features
Higgs Derived features are important
H2O Deep Learning ArnoCandel
MNIST digits classification
Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)
21
Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes
MNIST = Digitized handwritten digits database (Yann LeCun)
Data 28x28=784 pixels with (gray-scale) values in 0hellip255
Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo
H2O Deep Learning ArnoCandel 22
H2O Deep Learning beats MNIST
Standard 60k10k data No distortions
No convolutions No unsupervised training
No ensemble
10 hours on 10 16-core nodes
World-record 083 test set error
httplearnh2oaicontenthands-on_trainingdeep_learninghtml
H2O Deep Learning ArnoCandel
POJO Model Export for Production Scoring
23
Plain old Java code is auto-generated to take your H2O Deep Learning models into production
H2O Deep Learning ArnoCandel
Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)
24
Speedup
000
1000
2000
3000
4000
1 2 4 8 16 32 63
H2O Nodes
(4 cores per node 1 epoch per node per MapReduce)
27 mins
Training Time
0
25
50
75
100
1 2 4 8 16 32 63
H2O Nodes
in minutes
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
25
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Bag of words vector 0010000010001hellip0
vintagegold condition
Text Classification
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
26
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
MNIST Unsupervised Anomaly Detection with Deep Learning (Autoencoder)
27
The good The bad The ugly
smallmedianlarge reconstruction errorhttplearnh2oaicontenthands-on_traininganomaly_detectionhtml
H2O Deep Learning ArnoCandel 28
How well did Deep Learning do
Letrsquos see how H2O did in the past 10 minutes
Higgs Live Demo (Continued)
ltyour guessgt
reference paper results
Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN (H2O)
H2O Deep Learning ArnoCandel
H2O Steam Scoring Platform
29
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
httpserverportsteamindexhtml
Live Demo
H2O Deep Learning ArnoCandel 30
Live Demo on 10-node cluster lt10 minutes runtime for all H2O algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel
age
income
employmentyj = tanh(sumi(xiuij)+bj)
uij
xi
yj
per-class probabilities sum(pl) = 1
zk = tanh(sumj(yjvjk)+ck)
vjk
zk pl
pl = softmax(sumk(zkwkl)+dl)
wkl
softmax(xk) = exp(xk) sumk(exp(xk))
ldquoneurons activate each other via weighted sumsrdquo
Prediction Forward Propagation
activation function tanh alternative
x -gt max(0x) ldquorectifierrdquo
pl is a non-linear function of xi can approximate ANY function
with enough layers
bj ck dl bias values(indep of inputs)
11
married
single
H2O Deep Learning ArnoCandel
age
income
employment
xi
Automatic standardization of data xi mean = 0 stddev = 1
horizontalize categorical variables eg
full-time part-time none self-employed -gt
010 = part-time 000 = self-employed
Automatic initialization of weights
Poor manrsquos initialization random weights wkl
Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))
Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)
12
married
single
wkl
H2O Deep Learning ArnoCandel
Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo
Training Update Weights amp Biases
Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)
For each training row we make a prediction and compare with the actual label (supervised learning)
married108predicted actual
Objective minimize prediction error (MSE or cross-entropy)
w ltmdash w - rate partEpartw
1
13
single002
E
wrate
H2O Deep Learning ArnoCandel
Backward Propagation
partEpartwi = partEparty partypartnet partnetpartwi
= part(error(y))party part(activation(net))partnet xi
Backprop Compute partEpartwi via chain rule going backwards
wi
net = sumi(wixi) + b
xiE = error(y)
y = activation(net)
How to compute partEpartwi for wi ltmdash wi - rate partEpartwi
Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow
14
H2O Deep Learning ArnoCandel
H2O Deep Learning Architecture
K-V
K-V
HTTPD
HTTPD
nodesJVMs sync
threads async
communication
w
w w
w w w w
w1 w3 w2w4
w2+w4w1+w3
w = (w1+w2+w3+w4)4
map each node trains a copy of the weights
and biases with (some or all of) its
local data with asynchronous FJ
threads
initial model weights and biases w
updated model w
H2O atomic in-memoryK-V store
reduce model averaging
average weights and biases from all nodes
speedup is at least nodeslog(rows) arxiv12094129v3
Keep iterating over the data (ldquoepochsrdquo) score from time to time
Query amp display the model via
JSON WWW
2
2 431
1
1
1
43 2
1 2
1
i
auto-tuned (default) or user-specified number of points per MapReduce iteration
15
H2O Deep Learning ArnoCandel
Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history
Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)
RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs
16
ldquoSecretrdquo Sauce to Higher Accuracy
H2O Deep Learning ArnoCandel
Detail Adaptive Learning Rate
Compute moving average of ∆wi2 at time t for window length rho
E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2
Compute RMS of ∆wi at time t with smoothing epsilon
RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )
Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)
Adaptive acceleration momentum accumulate previous weight updates but over a window of time
RMS[∆wi]t-1
RMS[partEpartwi]t
rate(wi t) =
Do the same for partEpartwi then obtain per-weight learning rate
cf ADADELTA paper
17
H2O Deep Learning ArnoCandel
Detail Dropout Regularization18
Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations
age
income
employment
married
singleX
X
X
Testing Use all activations but reduce them by a factor p
(to ldquosimulaterdquo the missing activations during training)
cf Geoff Hintons paper
H2O Deep Learning ArnoCandel 19
Application Higgs Boson Classification
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
httparxivorgpdf14024735v2pdf
Images courtesy CERN LHC
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features (physics formulae) Train 10M rows Valid 500k Test 500k rows
H2O Deep Learning ArnoCandel 20
Live Demo Letrsquos see what Deep Learning can do with low-level features alone
Former baseline for AUC 0733 and 0816
H2O Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
H2O Deep Learning
add derived
features
Higgs Derived features are important
H2O Deep Learning ArnoCandel
MNIST digits classification
Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)
21
Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes
MNIST = Digitized handwritten digits database (Yann LeCun)
Data 28x28=784 pixels with (gray-scale) values in 0hellip255
Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo
H2O Deep Learning ArnoCandel 22
H2O Deep Learning beats MNIST
Standard 60k10k data No distortions
No convolutions No unsupervised training
No ensemble
10 hours on 10 16-core nodes
World-record 083 test set error
httplearnh2oaicontenthands-on_trainingdeep_learninghtml
H2O Deep Learning ArnoCandel
POJO Model Export for Production Scoring
23
Plain old Java code is auto-generated to take your H2O Deep Learning models into production
H2O Deep Learning ArnoCandel
Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)
24
Speedup
000
1000
2000
3000
4000
1 2 4 8 16 32 63
H2O Nodes
(4 cores per node 1 epoch per node per MapReduce)
27 mins
Training Time
0
25
50
75
100
1 2 4 8 16 32 63
H2O Nodes
in minutes
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
25
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Bag of words vector 0010000010001hellip0
vintagegold condition
Text Classification
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
26
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
MNIST Unsupervised Anomaly Detection with Deep Learning (Autoencoder)
27
The good The bad The ugly
smallmedianlarge reconstruction errorhttplearnh2oaicontenthands-on_traininganomaly_detectionhtml
H2O Deep Learning ArnoCandel 28
How well did Deep Learning do
Letrsquos see how H2O did in the past 10 minutes
Higgs Live Demo (Continued)
ltyour guessgt
reference paper results
Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN (H2O)
H2O Deep Learning ArnoCandel
H2O Steam Scoring Platform
29
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
httpserverportsteamindexhtml
Live Demo
H2O Deep Learning ArnoCandel 30
Live Demo on 10-node cluster lt10 minutes runtime for all H2O algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel
age
income
employment
xi
Automatic standardization of data xi mean = 0 stddev = 1
horizontalize categorical variables eg
full-time part-time none self-employed -gt
010 = part-time 000 = self-employed
Automatic initialization of weights
Poor manrsquos initialization random weights wkl
Default (better) Uniform distribution in+- sqrt(6(units + units_previous_layer))
Data preparation amp InitializationNeural Networks are sensitive to numerical noise operate best in the linear regime (not saturated)
12
married
single
wkl
H2O Deep Learning ArnoCandel
Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo
Training Update Weights amp Biases
Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)
For each training row we make a prediction and compare with the actual label (supervised learning)
married108predicted actual
Objective minimize prediction error (MSE or cross-entropy)
w ltmdash w - rate partEpartw
1
13
single002
E
wrate
H2O Deep Learning ArnoCandel
Backward Propagation
partEpartwi = partEparty partypartnet partnetpartwi
= part(error(y))party part(activation(net))partnet xi
Backprop Compute partEpartwi via chain rule going backwards
wi
net = sumi(wixi) + b
xiE = error(y)
y = activation(net)
How to compute partEpartwi for wi ltmdash wi - rate partEpartwi
Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow
14
H2O Deep Learning ArnoCandel
H2O Deep Learning Architecture
K-V
K-V
HTTPD
HTTPD
nodesJVMs sync
threads async
communication
w
w w
w w w w
w1 w3 w2w4
w2+w4w1+w3
w = (w1+w2+w3+w4)4
map each node trains a copy of the weights
and biases with (some or all of) its
local data with asynchronous FJ
threads
initial model weights and biases w
updated model w
H2O atomic in-memoryK-V store
reduce model averaging
average weights and biases from all nodes
speedup is at least nodeslog(rows) arxiv12094129v3
Keep iterating over the data (ldquoepochsrdquo) score from time to time
Query amp display the model via
JSON WWW
2
2 431
1
1
1
43 2
1 2
1
i
auto-tuned (default) or user-specified number of points per MapReduce iteration
15
H2O Deep Learning ArnoCandel
Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history
Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)
RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs
16
ldquoSecretrdquo Sauce to Higher Accuracy
H2O Deep Learning ArnoCandel
Detail Adaptive Learning Rate
Compute moving average of ∆wi2 at time t for window length rho
E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2
Compute RMS of ∆wi at time t with smoothing epsilon
RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )
Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)
Adaptive acceleration momentum accumulate previous weight updates but over a window of time
RMS[∆wi]t-1
RMS[partEpartwi]t
rate(wi t) =
Do the same for partEpartwi then obtain per-weight learning rate
cf ADADELTA paper
17
H2O Deep Learning ArnoCandel
Detail Dropout Regularization18
Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations
age
income
employment
married
singleX
X
X
Testing Use all activations but reduce them by a factor p
(to ldquosimulaterdquo the missing activations during training)
cf Geoff Hintons paper
H2O Deep Learning ArnoCandel 19
Application Higgs Boson Classification
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
httparxivorgpdf14024735v2pdf
Images courtesy CERN LHC
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features (physics formulae) Train 10M rows Valid 500k Test 500k rows
H2O Deep Learning ArnoCandel 20
Live Demo Letrsquos see what Deep Learning can do with low-level features alone
Former baseline for AUC 0733 and 0816
H2O Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
H2O Deep Learning
add derived
features
Higgs Derived features are important
H2O Deep Learning ArnoCandel
MNIST digits classification
Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)
21
Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes
MNIST = Digitized handwritten digits database (Yann LeCun)
Data 28x28=784 pixels with (gray-scale) values in 0hellip255
Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo
H2O Deep Learning ArnoCandel 22
H2O Deep Learning beats MNIST
Standard 60k10k data No distortions
No convolutions No unsupervised training
No ensemble
10 hours on 10 16-core nodes
World-record 083 test set error
httplearnh2oaicontenthands-on_trainingdeep_learninghtml
H2O Deep Learning ArnoCandel
POJO Model Export for Production Scoring
23
Plain old Java code is auto-generated to take your H2O Deep Learning models into production
H2O Deep Learning ArnoCandel
Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)
24
Speedup
000
1000
2000
3000
4000
1 2 4 8 16 32 63
H2O Nodes
(4 cores per node 1 epoch per node per MapReduce)
27 mins
Training Time
0
25
50
75
100
1 2 4 8 16 32 63
H2O Nodes
in minutes
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
25
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Bag of words vector 0010000010001hellip0
vintagegold condition
Text Classification
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
26
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
MNIST Unsupervised Anomaly Detection with Deep Learning (Autoencoder)
27
The good The bad The ugly
smallmedianlarge reconstruction errorhttplearnh2oaicontenthands-on_traininganomaly_detectionhtml
H2O Deep Learning ArnoCandel 28
How well did Deep Learning do
Letrsquos see how H2O did in the past 10 minutes
Higgs Live Demo (Continued)
ltyour guessgt
reference paper results
Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN (H2O)
H2O Deep Learning ArnoCandel
H2O Steam Scoring Platform
29
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
httpserverportsteamindexhtml
Live Demo
H2O Deep Learning ArnoCandel 30
Live Demo on 10-node cluster lt10 minutes runtime for all H2O algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel
Mean Square Error = (022 + 022)2 ldquopenalize differences per-classrdquo Cross-entropy = -log(08) ldquostrongly penalize non-1-nessrdquo
Training Update Weights amp Biases
Stochastic Gradient Descent Update weights and biases via gradient of the error (via back-propagation)
For each training row we make a prediction and compare with the actual label (supervised learning)
married108predicted actual
Objective minimize prediction error (MSE or cross-entropy)
w ltmdash w - rate partEpartw
1
13
single002
E
wrate
H2O Deep Learning ArnoCandel
Backward Propagation
partEpartwi = partEparty partypartnet partnetpartwi
= part(error(y))party part(activation(net))partnet xi
Backprop Compute partEpartwi via chain rule going backwards
wi
net = sumi(wixi) + b
xiE = error(y)
y = activation(net)
How to compute partEpartwi for wi ltmdash wi - rate partEpartwi
Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow
14
H2O Deep Learning ArnoCandel
H2O Deep Learning Architecture
K-V
K-V
HTTPD
HTTPD
nodesJVMs sync
threads async
communication
w
w w
w w w w
w1 w3 w2w4
w2+w4w1+w3
w = (w1+w2+w3+w4)4
map each node trains a copy of the weights
and biases with (some or all of) its
local data with asynchronous FJ
threads
initial model weights and biases w
updated model w
H2O atomic in-memoryK-V store
reduce model averaging
average weights and biases from all nodes
speedup is at least nodeslog(rows) arxiv12094129v3
Keep iterating over the data (ldquoepochsrdquo) score from time to time
Query amp display the model via
JSON WWW
2
2 431
1
1
1
43 2
1 2
1
i
auto-tuned (default) or user-specified number of points per MapReduce iteration
15
H2O Deep Learning ArnoCandel
Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history
Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)
RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs
16
ldquoSecretrdquo Sauce to Higher Accuracy
H2O Deep Learning ArnoCandel
Detail Adaptive Learning Rate
Compute moving average of ∆wi2 at time t for window length rho
E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2
Compute RMS of ∆wi at time t with smoothing epsilon
RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )
Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)
Adaptive acceleration momentum accumulate previous weight updates but over a window of time
RMS[∆wi]t-1
RMS[partEpartwi]t
rate(wi t) =
Do the same for partEpartwi then obtain per-weight learning rate
cf ADADELTA paper
17
H2O Deep Learning ArnoCandel
Detail Dropout Regularization18
Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations
age
income
employment
married
singleX
X
X
Testing Use all activations but reduce them by a factor p
(to ldquosimulaterdquo the missing activations during training)
cf Geoff Hintons paper
H2O Deep Learning ArnoCandel 19
Application Higgs Boson Classification
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
httparxivorgpdf14024735v2pdf
Images courtesy CERN LHC
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features (physics formulae) Train 10M rows Valid 500k Test 500k rows
H2O Deep Learning ArnoCandel 20
Live Demo Letrsquos see what Deep Learning can do with low-level features alone
Former baseline for AUC 0733 and 0816
H2O Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
H2O Deep Learning
add derived
features
Higgs Derived features are important
H2O Deep Learning ArnoCandel
MNIST digits classification
Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)
21
Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes
MNIST = Digitized handwritten digits database (Yann LeCun)
Data 28x28=784 pixels with (gray-scale) values in 0hellip255
Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo
H2O Deep Learning ArnoCandel 22
H2O Deep Learning beats MNIST
Standard 60k10k data No distortions
No convolutions No unsupervised training
No ensemble
10 hours on 10 16-core nodes
World-record 083 test set error
httplearnh2oaicontenthands-on_trainingdeep_learninghtml
H2O Deep Learning ArnoCandel
POJO Model Export for Production Scoring
23
Plain old Java code is auto-generated to take your H2O Deep Learning models into production
H2O Deep Learning ArnoCandel
Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)
24
Speedup
000
1000
2000
3000
4000
1 2 4 8 16 32 63
H2O Nodes
(4 cores per node 1 epoch per node per MapReduce)
27 mins
Training Time
0
25
50
75
100
1 2 4 8 16 32 63
H2O Nodes
in minutes
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
25
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Bag of words vector 0010000010001hellip0
vintagegold condition
Text Classification
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
26
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
MNIST Unsupervised Anomaly Detection with Deep Learning (Autoencoder)
27
The good The bad The ugly
smallmedianlarge reconstruction errorhttplearnh2oaicontenthands-on_traininganomaly_detectionhtml
H2O Deep Learning ArnoCandel 28
How well did Deep Learning do
Letrsquos see how H2O did in the past 10 minutes
Higgs Live Demo (Continued)
ltyour guessgt
reference paper results
Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN (H2O)
H2O Deep Learning ArnoCandel
H2O Steam Scoring Platform
29
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
httpserverportsteamindexhtml
Live Demo
H2O Deep Learning ArnoCandel 30
Live Demo on 10-node cluster lt10 minutes runtime for all H2O algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel
Backward Propagation
partEpartwi = partEparty partypartnet partnetpartwi
= part(error(y))party part(activation(net))partnet xi
Backprop Compute partEpartwi via chain rule going backwards
wi
net = sumi(wixi) + b
xiE = error(y)
y = activation(net)
How to compute partEpartwi for wi ltmdash wi - rate partEpartwi
Naive For every i evaluate E twice at (w1hellipwiplusmn∆hellipwN)hellip Slow
14
H2O Deep Learning ArnoCandel
H2O Deep Learning Architecture
K-V
K-V
HTTPD
HTTPD
nodesJVMs sync
threads async
communication
w
w w
w w w w
w1 w3 w2w4
w2+w4w1+w3
w = (w1+w2+w3+w4)4
map each node trains a copy of the weights
and biases with (some or all of) its
local data with asynchronous FJ
threads
initial model weights and biases w
updated model w
H2O atomic in-memoryK-V store
reduce model averaging
average weights and biases from all nodes
speedup is at least nodeslog(rows) arxiv12094129v3
Keep iterating over the data (ldquoepochsrdquo) score from time to time
Query amp display the model via
JSON WWW
2
2 431
1
1
1
43 2
1 2
1
i
auto-tuned (default) or user-specified number of points per MapReduce iteration
15
H2O Deep Learning ArnoCandel
Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history
Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)
RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs
16
ldquoSecretrdquo Sauce to Higher Accuracy
H2O Deep Learning ArnoCandel
Detail Adaptive Learning Rate
Compute moving average of ∆wi2 at time t for window length rho
E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2
Compute RMS of ∆wi at time t with smoothing epsilon
RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )
Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)
Adaptive acceleration momentum accumulate previous weight updates but over a window of time
RMS[∆wi]t-1
RMS[partEpartwi]t
rate(wi t) =
Do the same for partEpartwi then obtain per-weight learning rate
cf ADADELTA paper
17
H2O Deep Learning ArnoCandel
Detail Dropout Regularization18
Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations
age
income
employment
married
singleX
X
X
Testing Use all activations but reduce them by a factor p
(to ldquosimulaterdquo the missing activations during training)
cf Geoff Hintons paper
H2O Deep Learning ArnoCandel 19
Application Higgs Boson Classification
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
httparxivorgpdf14024735v2pdf
Images courtesy CERN LHC
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features (physics formulae) Train 10M rows Valid 500k Test 500k rows
H2O Deep Learning ArnoCandel 20
Live Demo Letrsquos see what Deep Learning can do with low-level features alone
Former baseline for AUC 0733 and 0816
H2O Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
H2O Deep Learning
add derived
features
Higgs Derived features are important
H2O Deep Learning ArnoCandel
MNIST digits classification
Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)
21
Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes
MNIST = Digitized handwritten digits database (Yann LeCun)
Data 28x28=784 pixels with (gray-scale) values in 0hellip255
Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo
H2O Deep Learning ArnoCandel 22
H2O Deep Learning beats MNIST
Standard 60k10k data No distortions
No convolutions No unsupervised training
No ensemble
10 hours on 10 16-core nodes
World-record 083 test set error
httplearnh2oaicontenthands-on_trainingdeep_learninghtml
H2O Deep Learning ArnoCandel
POJO Model Export for Production Scoring
23
Plain old Java code is auto-generated to take your H2O Deep Learning models into production
H2O Deep Learning ArnoCandel
Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)
24
Speedup
000
1000
2000
3000
4000
1 2 4 8 16 32 63
H2O Nodes
(4 cores per node 1 epoch per node per MapReduce)
27 mins
Training Time
0
25
50
75
100
1 2 4 8 16 32 63
H2O Nodes
in minutes
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
25
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Bag of words vector 0010000010001hellip0
vintagegold condition
Text Classification
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
26
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
MNIST Unsupervised Anomaly Detection with Deep Learning (Autoencoder)
27
The good The bad The ugly
smallmedianlarge reconstruction errorhttplearnh2oaicontenthands-on_traininganomaly_detectionhtml
H2O Deep Learning ArnoCandel 28
How well did Deep Learning do
Letrsquos see how H2O did in the past 10 minutes
Higgs Live Demo (Continued)
ltyour guessgt
reference paper results
Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN (H2O)
H2O Deep Learning ArnoCandel
H2O Steam Scoring Platform
29
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
httpserverportsteamindexhtml
Live Demo
H2O Deep Learning ArnoCandel 30
Live Demo on 10-node cluster lt10 minutes runtime for all H2O algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel
H2O Deep Learning Architecture
K-V
K-V
HTTPD
HTTPD
nodesJVMs sync
threads async
communication
w
w w
w w w w
w1 w3 w2w4
w2+w4w1+w3
w = (w1+w2+w3+w4)4
map each node trains a copy of the weights
and biases with (some or all of) its
local data with asynchronous FJ
threads
initial model weights and biases w
updated model w
H2O atomic in-memoryK-V store
reduce model averaging
average weights and biases from all nodes
speedup is at least nodeslog(rows) arxiv12094129v3
Keep iterating over the data (ldquoepochsrdquo) score from time to time
Query amp display the model via
JSON WWW
2
2 431
1
1
1
43 2
1 2
1
i
auto-tuned (default) or user-specified number of points per MapReduce iteration
15
H2O Deep Learning ArnoCandel
Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history
Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)
RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs
16
ldquoSecretrdquo Sauce to Higher Accuracy
H2O Deep Learning ArnoCandel
Detail Adaptive Learning Rate
Compute moving average of ∆wi2 at time t for window length rho
E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2
Compute RMS of ∆wi at time t with smoothing epsilon
RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )
Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)
Adaptive acceleration momentum accumulate previous weight updates but over a window of time
RMS[∆wi]t-1
RMS[partEpartwi]t
rate(wi t) =
Do the same for partEpartwi then obtain per-weight learning rate
cf ADADELTA paper
17
H2O Deep Learning ArnoCandel
Detail Dropout Regularization18
Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations
age
income
employment
married
singleX
X
X
Testing Use all activations but reduce them by a factor p
(to ldquosimulaterdquo the missing activations during training)
cf Geoff Hintons paper
H2O Deep Learning ArnoCandel 19
Application Higgs Boson Classification
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
httparxivorgpdf14024735v2pdf
Images courtesy CERN LHC
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features (physics formulae) Train 10M rows Valid 500k Test 500k rows
H2O Deep Learning ArnoCandel 20
Live Demo Letrsquos see what Deep Learning can do with low-level features alone
Former baseline for AUC 0733 and 0816
H2O Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
H2O Deep Learning
add derived
features
Higgs Derived features are important
H2O Deep Learning ArnoCandel
MNIST digits classification
Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)
21
Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes
MNIST = Digitized handwritten digits database (Yann LeCun)
Data 28x28=784 pixels with (gray-scale) values in 0hellip255
Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo
H2O Deep Learning ArnoCandel 22
H2O Deep Learning beats MNIST
Standard 60k10k data No distortions
No convolutions No unsupervised training
No ensemble
10 hours on 10 16-core nodes
World-record 083 test set error
httplearnh2oaicontenthands-on_trainingdeep_learninghtml
H2O Deep Learning ArnoCandel
POJO Model Export for Production Scoring
23
Plain old Java code is auto-generated to take your H2O Deep Learning models into production
H2O Deep Learning ArnoCandel
Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)
24
Speedup
000
1000
2000
3000
4000
1 2 4 8 16 32 63
H2O Nodes
(4 cores per node 1 epoch per node per MapReduce)
27 mins
Training Time
0
25
50
75
100
1 2 4 8 16 32 63
H2O Nodes
in minutes
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
25
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Bag of words vector 0010000010001hellip0
vintagegold condition
Text Classification
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
26
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
MNIST Unsupervised Anomaly Detection with Deep Learning (Autoencoder)
27
The good The bad The ugly
smallmedianlarge reconstruction errorhttplearnh2oaicontenthands-on_traininganomaly_detectionhtml
H2O Deep Learning ArnoCandel 28
How well did Deep Learning do
Letrsquos see how H2O did in the past 10 minutes
Higgs Live Demo (Continued)
ltyour guessgt
reference paper results
Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN (H2O)
H2O Deep Learning ArnoCandel
H2O Steam Scoring Platform
29
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
httpserverportsteamindexhtml
Live Demo
H2O Deep Learning ArnoCandel 30
Live Demo on 10-node cluster lt10 minutes runtime for all H2O algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel
Adaptive learning rate - ADADELTA (Google)Automatically set learning rate for each neuron based on its training history
Grid Search and Checkpointing Run a grid search to scan many hyper-parameters then continue training the most promising model(s)
RegularizationL1 penalizes non-zero weights L2 penalizes large weightsDropout randomly ignore certain inputs
16
ldquoSecretrdquo Sauce to Higher Accuracy
H2O Deep Learning ArnoCandel
Detail Adaptive Learning Rate
Compute moving average of ∆wi2 at time t for window length rho
E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2
Compute RMS of ∆wi at time t with smoothing epsilon
RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )
Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)
Adaptive acceleration momentum accumulate previous weight updates but over a window of time
RMS[∆wi]t-1
RMS[partEpartwi]t
rate(wi t) =
Do the same for partEpartwi then obtain per-weight learning rate
cf ADADELTA paper
17
H2O Deep Learning ArnoCandel
Detail Dropout Regularization18
Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations
age
income
employment
married
singleX
X
X
Testing Use all activations but reduce them by a factor p
(to ldquosimulaterdquo the missing activations during training)
cf Geoff Hintons paper
H2O Deep Learning ArnoCandel 19
Application Higgs Boson Classification
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
httparxivorgpdf14024735v2pdf
Images courtesy CERN LHC
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features (physics formulae) Train 10M rows Valid 500k Test 500k rows
H2O Deep Learning ArnoCandel 20
Live Demo Letrsquos see what Deep Learning can do with low-level features alone
Former baseline for AUC 0733 and 0816
H2O Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
H2O Deep Learning
add derived
features
Higgs Derived features are important
H2O Deep Learning ArnoCandel
MNIST digits classification
Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)
21
Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes
MNIST = Digitized handwritten digits database (Yann LeCun)
Data 28x28=784 pixels with (gray-scale) values in 0hellip255
Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo
H2O Deep Learning ArnoCandel 22
H2O Deep Learning beats MNIST
Standard 60k10k data No distortions
No convolutions No unsupervised training
No ensemble
10 hours on 10 16-core nodes
World-record 083 test set error
httplearnh2oaicontenthands-on_trainingdeep_learninghtml
H2O Deep Learning ArnoCandel
POJO Model Export for Production Scoring
23
Plain old Java code is auto-generated to take your H2O Deep Learning models into production
H2O Deep Learning ArnoCandel
Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)
24
Speedup
000
1000
2000
3000
4000
1 2 4 8 16 32 63
H2O Nodes
(4 cores per node 1 epoch per node per MapReduce)
27 mins
Training Time
0
25
50
75
100
1 2 4 8 16 32 63
H2O Nodes
in minutes
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
25
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Bag of words vector 0010000010001hellip0
vintagegold condition
Text Classification
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
26
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
MNIST Unsupervised Anomaly Detection with Deep Learning (Autoencoder)
27
The good The bad The ugly
smallmedianlarge reconstruction errorhttplearnh2oaicontenthands-on_traininganomaly_detectionhtml
H2O Deep Learning ArnoCandel 28
How well did Deep Learning do
Letrsquos see how H2O did in the past 10 minutes
Higgs Live Demo (Continued)
ltyour guessgt
reference paper results
Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN (H2O)
H2O Deep Learning ArnoCandel
H2O Steam Scoring Platform
29
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
httpserverportsteamindexhtml
Live Demo
H2O Deep Learning ArnoCandel 30
Live Demo on 10-node cluster lt10 minutes runtime for all H2O algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel
Detail Adaptive Learning Rate
Compute moving average of ∆wi2 at time t for window length rho
E[∆wi2]t = rho E[∆wi2]t-1 + (1-rho) ∆wi2
Compute RMS of ∆wi at time t with smoothing epsilon
RMS[∆wi]t = sqrt( E[∆wi2]t + epsilon )
Adaptive annealing progress Gradient-dependent learning rate moving window prevents ldquofreezingrdquo (unlike ADAGRAD no window)
Adaptive acceleration momentum accumulate previous weight updates but over a window of time
RMS[∆wi]t-1
RMS[partEpartwi]t
rate(wi t) =
Do the same for partEpartwi then obtain per-weight learning rate
cf ADADELTA paper
17
H2O Deep Learning ArnoCandel
Detail Dropout Regularization18
Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations
age
income
employment
married
singleX
X
X
Testing Use all activations but reduce them by a factor p
(to ldquosimulaterdquo the missing activations during training)
cf Geoff Hintons paper
H2O Deep Learning ArnoCandel 19
Application Higgs Boson Classification
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
httparxivorgpdf14024735v2pdf
Images courtesy CERN LHC
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features (physics formulae) Train 10M rows Valid 500k Test 500k rows
H2O Deep Learning ArnoCandel 20
Live Demo Letrsquos see what Deep Learning can do with low-level features alone
Former baseline for AUC 0733 and 0816
H2O Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
H2O Deep Learning
add derived
features
Higgs Derived features are important
H2O Deep Learning ArnoCandel
MNIST digits classification
Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)
21
Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes
MNIST = Digitized handwritten digits database (Yann LeCun)
Data 28x28=784 pixels with (gray-scale) values in 0hellip255
Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo
H2O Deep Learning ArnoCandel 22
H2O Deep Learning beats MNIST
Standard 60k10k data No distortions
No convolutions No unsupervised training
No ensemble
10 hours on 10 16-core nodes
World-record 083 test set error
httplearnh2oaicontenthands-on_trainingdeep_learninghtml
H2O Deep Learning ArnoCandel
POJO Model Export for Production Scoring
23
Plain old Java code is auto-generated to take your H2O Deep Learning models into production
H2O Deep Learning ArnoCandel
Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)
24
Speedup
000
1000
2000
3000
4000
1 2 4 8 16 32 63
H2O Nodes
(4 cores per node 1 epoch per node per MapReduce)
27 mins
Training Time
0
25
50
75
100
1 2 4 8 16 32 63
H2O Nodes
in minutes
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
25
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Bag of words vector 0010000010001hellip0
vintagegold condition
Text Classification
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
26
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
MNIST Unsupervised Anomaly Detection with Deep Learning (Autoencoder)
27
The good The bad The ugly
smallmedianlarge reconstruction errorhttplearnh2oaicontenthands-on_traininganomaly_detectionhtml
H2O Deep Learning ArnoCandel 28
How well did Deep Learning do
Letrsquos see how H2O did in the past 10 minutes
Higgs Live Demo (Continued)
ltyour guessgt
reference paper results
Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN (H2O)
H2O Deep Learning ArnoCandel
H2O Steam Scoring Platform
29
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
httpserverportsteamindexhtml
Live Demo
H2O Deep Learning ArnoCandel 30
Live Demo on 10-node cluster lt10 minutes runtime for all H2O algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel
Detail Dropout Regularization18
Training For each hidden neuron for each training sample for each iteration ignore (zero out) a different random fraction p of input activations
age
income
employment
married
singleX
X
X
Testing Use all activations but reduce them by a factor p
(to ldquosimulaterdquo the missing activations during training)
cf Geoff Hintons paper
H2O Deep Learning ArnoCandel 19
Application Higgs Boson Classification
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
httparxivorgpdf14024735v2pdf
Images courtesy CERN LHC
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features (physics formulae) Train 10M rows Valid 500k Test 500k rows
H2O Deep Learning ArnoCandel 20
Live Demo Letrsquos see what Deep Learning can do with low-level features alone
Former baseline for AUC 0733 and 0816
H2O Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
H2O Deep Learning
add derived
features
Higgs Derived features are important
H2O Deep Learning ArnoCandel
MNIST digits classification
Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)
21
Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes
MNIST = Digitized handwritten digits database (Yann LeCun)
Data 28x28=784 pixels with (gray-scale) values in 0hellip255
Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo
H2O Deep Learning ArnoCandel 22
H2O Deep Learning beats MNIST
Standard 60k10k data No distortions
No convolutions No unsupervised training
No ensemble
10 hours on 10 16-core nodes
World-record 083 test set error
httplearnh2oaicontenthands-on_trainingdeep_learninghtml
H2O Deep Learning ArnoCandel
POJO Model Export for Production Scoring
23
Plain old Java code is auto-generated to take your H2O Deep Learning models into production
H2O Deep Learning ArnoCandel
Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)
24
Speedup
000
1000
2000
3000
4000
1 2 4 8 16 32 63
H2O Nodes
(4 cores per node 1 epoch per node per MapReduce)
27 mins
Training Time
0
25
50
75
100
1 2 4 8 16 32 63
H2O Nodes
in minutes
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
25
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Bag of words vector 0010000010001hellip0
vintagegold condition
Text Classification
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
26
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
MNIST Unsupervised Anomaly Detection with Deep Learning (Autoencoder)
27
The good The bad The ugly
smallmedianlarge reconstruction errorhttplearnh2oaicontenthands-on_traininganomaly_detectionhtml
H2O Deep Learning ArnoCandel 28
How well did Deep Learning do
Letrsquos see how H2O did in the past 10 minutes
Higgs Live Demo (Continued)
ltyour guessgt
reference paper results
Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN (H2O)
H2O Deep Learning ArnoCandel
H2O Steam Scoring Platform
29
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
httpserverportsteamindexhtml
Live Demo
H2O Deep Learning ArnoCandel 30
Live Demo on 10-node cluster lt10 minutes runtime for all H2O algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel 19
Application Higgs Boson Classification
Higgsvs
Background
Large Hadron Collider Largest experiment of mankind $13+ billion 168 miles long 120 MegaWatts -456F 1PBday etc Higgs boson discovery (July rsquo12) led to 2013 Nobel prize
httparxivorgpdf14024735v2pdf
Images courtesy CERN LHC
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features (physics formulae) Train 10M rows Valid 500k Test 500k rows
H2O Deep Learning ArnoCandel 20
Live Demo Letrsquos see what Deep Learning can do with low-level features alone
Former baseline for AUC 0733 and 0816
H2O Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
H2O Deep Learning
add derived
features
Higgs Derived features are important
H2O Deep Learning ArnoCandel
MNIST digits classification
Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)
21
Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes
MNIST = Digitized handwritten digits database (Yann LeCun)
Data 28x28=784 pixels with (gray-scale) values in 0hellip255
Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo
H2O Deep Learning ArnoCandel 22
H2O Deep Learning beats MNIST
Standard 60k10k data No distortions
No convolutions No unsupervised training
No ensemble
10 hours on 10 16-core nodes
World-record 083 test set error
httplearnh2oaicontenthands-on_trainingdeep_learninghtml
H2O Deep Learning ArnoCandel
POJO Model Export for Production Scoring
23
Plain old Java code is auto-generated to take your H2O Deep Learning models into production
H2O Deep Learning ArnoCandel
Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)
24
Speedup
000
1000
2000
3000
4000
1 2 4 8 16 32 63
H2O Nodes
(4 cores per node 1 epoch per node per MapReduce)
27 mins
Training Time
0
25
50
75
100
1 2 4 8 16 32 63
H2O Nodes
in minutes
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
25
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Bag of words vector 0010000010001hellip0
vintagegold condition
Text Classification
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
26
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
MNIST Unsupervised Anomaly Detection with Deep Learning (Autoencoder)
27
The good The bad The ugly
smallmedianlarge reconstruction errorhttplearnh2oaicontenthands-on_traininganomaly_detectionhtml
H2O Deep Learning ArnoCandel 28
How well did Deep Learning do
Letrsquos see how H2O did in the past 10 minutes
Higgs Live Demo (Continued)
ltyour guessgt
reference paper results
Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN (H2O)
H2O Deep Learning ArnoCandel
H2O Steam Scoring Platform
29
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
httpserverportsteamindexhtml
Live Demo
H2O Deep Learning ArnoCandel 30
Live Demo on 10-node cluster lt10 minutes runtime for all H2O algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel 20
Live Demo Letrsquos see what Deep Learning can do with low-level features alone
Former baseline for AUC 0733 and 0816
H2O Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0596 0684
Random Forest 0764 0840
Gradient Boosted Trees 0753 0839
Neural Net 1 hidden layer 0760 0830
H2O Deep Learning
add derived
features
Higgs Derived features are important
H2O Deep Learning ArnoCandel
MNIST digits classification
Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)
21
Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes
MNIST = Digitized handwritten digits database (Yann LeCun)
Data 28x28=784 pixels with (gray-scale) values in 0hellip255
Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo
H2O Deep Learning ArnoCandel 22
H2O Deep Learning beats MNIST
Standard 60k10k data No distortions
No convolutions No unsupervised training
No ensemble
10 hours on 10 16-core nodes
World-record 083 test set error
httplearnh2oaicontenthands-on_trainingdeep_learninghtml
H2O Deep Learning ArnoCandel
POJO Model Export for Production Scoring
23
Plain old Java code is auto-generated to take your H2O Deep Learning models into production
H2O Deep Learning ArnoCandel
Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)
24
Speedup
000
1000
2000
3000
4000
1 2 4 8 16 32 63
H2O Nodes
(4 cores per node 1 epoch per node per MapReduce)
27 mins
Training Time
0
25
50
75
100
1 2 4 8 16 32 63
H2O Nodes
in minutes
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
25
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Bag of words vector 0010000010001hellip0
vintagegold condition
Text Classification
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
26
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
MNIST Unsupervised Anomaly Detection with Deep Learning (Autoencoder)
27
The good The bad The ugly
smallmedianlarge reconstruction errorhttplearnh2oaicontenthands-on_traininganomaly_detectionhtml
H2O Deep Learning ArnoCandel 28
How well did Deep Learning do
Letrsquos see how H2O did in the past 10 minutes
Higgs Live Demo (Continued)
ltyour guessgt
reference paper results
Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN (H2O)
H2O Deep Learning ArnoCandel
H2O Steam Scoring Platform
29
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
httpserverportsteamindexhtml
Live Demo
H2O Deep Learning ArnoCandel 30
Live Demo on 10-node cluster lt10 minutes runtime for all H2O algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel
MNIST digits classification
Standing world record Without distortions or convolutions the best-ever published error rate on test set 083 (Microsoft)
21
Train 60000 rows 784 integer columns 10 classes Test 10000 rows 784 integer columns 10 classes
MNIST = Digitized handwritten digits database (Yann LeCun)
Data 28x28=784 pixels with (gray-scale) values in 0hellip255
Yann LeCun ldquoYet another advice dont get fooled by people who claim to have a solution to Artificial General Intelligence Ask them what error rate they get on MNIST or ImageNetrdquo
H2O Deep Learning ArnoCandel 22
H2O Deep Learning beats MNIST
Standard 60k10k data No distortions
No convolutions No unsupervised training
No ensemble
10 hours on 10 16-core nodes
World-record 083 test set error
httplearnh2oaicontenthands-on_trainingdeep_learninghtml
H2O Deep Learning ArnoCandel
POJO Model Export for Production Scoring
23
Plain old Java code is auto-generated to take your H2O Deep Learning models into production
H2O Deep Learning ArnoCandel
Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)
24
Speedup
000
1000
2000
3000
4000
1 2 4 8 16 32 63
H2O Nodes
(4 cores per node 1 epoch per node per MapReduce)
27 mins
Training Time
0
25
50
75
100
1 2 4 8 16 32 63
H2O Nodes
in minutes
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
25
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Bag of words vector 0010000010001hellip0
vintagegold condition
Text Classification
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
26
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
MNIST Unsupervised Anomaly Detection with Deep Learning (Autoencoder)
27
The good The bad The ugly
smallmedianlarge reconstruction errorhttplearnh2oaicontenthands-on_traininganomaly_detectionhtml
H2O Deep Learning ArnoCandel 28
How well did Deep Learning do
Letrsquos see how H2O did in the past 10 minutes
Higgs Live Demo (Continued)
ltyour guessgt
reference paper results
Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN (H2O)
H2O Deep Learning ArnoCandel
H2O Steam Scoring Platform
29
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
httpserverportsteamindexhtml
Live Demo
H2O Deep Learning ArnoCandel 30
Live Demo on 10-node cluster lt10 minutes runtime for all H2O algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel 22
H2O Deep Learning beats MNIST
Standard 60k10k data No distortions
No convolutions No unsupervised training
No ensemble
10 hours on 10 16-core nodes
World-record 083 test set error
httplearnh2oaicontenthands-on_trainingdeep_learninghtml
H2O Deep Learning ArnoCandel
POJO Model Export for Production Scoring
23
Plain old Java code is auto-generated to take your H2O Deep Learning models into production
H2O Deep Learning ArnoCandel
Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)
24
Speedup
000
1000
2000
3000
4000
1 2 4 8 16 32 63
H2O Nodes
(4 cores per node 1 epoch per node per MapReduce)
27 mins
Training Time
0
25
50
75
100
1 2 4 8 16 32 63
H2O Nodes
in minutes
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
25
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Bag of words vector 0010000010001hellip0
vintagegold condition
Text Classification
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
26
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
MNIST Unsupervised Anomaly Detection with Deep Learning (Autoencoder)
27
The good The bad The ugly
smallmedianlarge reconstruction errorhttplearnh2oaicontenthands-on_traininganomaly_detectionhtml
H2O Deep Learning ArnoCandel 28
How well did Deep Learning do
Letrsquos see how H2O did in the past 10 minutes
Higgs Live Demo (Continued)
ltyour guessgt
reference paper results
Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN (H2O)
H2O Deep Learning ArnoCandel
H2O Steam Scoring Platform
29
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
httpserverportsteamindexhtml
Live Demo
H2O Deep Learning ArnoCandel 30
Live Demo on 10-node cluster lt10 minutes runtime for all H2O algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel
POJO Model Export for Production Scoring
23
Plain old Java code is auto-generated to take your H2O Deep Learning models into production
H2O Deep Learning ArnoCandel
Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)
24
Speedup
000
1000
2000
3000
4000
1 2 4 8 16 32 63
H2O Nodes
(4 cores per node 1 epoch per node per MapReduce)
27 mins
Training Time
0
25
50
75
100
1 2 4 8 16 32 63
H2O Nodes
in minutes
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
25
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Bag of words vector 0010000010001hellip0
vintagegold condition
Text Classification
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
26
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
MNIST Unsupervised Anomaly Detection with Deep Learning (Autoencoder)
27
The good The bad The ugly
smallmedianlarge reconstruction errorhttplearnh2oaicontenthands-on_traininganomaly_detectionhtml
H2O Deep Learning ArnoCandel 28
How well did Deep Learning do
Letrsquos see how H2O did in the past 10 minutes
Higgs Live Demo (Continued)
ltyour guessgt
reference paper results
Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN (H2O)
H2O Deep Learning ArnoCandel
H2O Steam Scoring Platform
29
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
httpserverportsteamindexhtml
Live Demo
H2O Deep Learning ArnoCandel 30
Live Demo on 10-node cluster lt10 minutes runtime for all H2O algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel
Parallel Scalability (for 64 epochs on MNIST with ldquo087rdquo parameters)
24
Speedup
000
1000
2000
3000
4000
1 2 4 8 16 32 63
H2O Nodes
(4 cores per node 1 epoch per node per MapReduce)
27 mins
Training Time
0
25
50
75
100
1 2 4 8 16 32 63
H2O Nodes
in minutes
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
25
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Bag of words vector 0010000010001hellip0
vintagegold condition
Text Classification
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
26
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
MNIST Unsupervised Anomaly Detection with Deep Learning (Autoencoder)
27
The good The bad The ugly
smallmedianlarge reconstruction errorhttplearnh2oaicontenthands-on_traininganomaly_detectionhtml
H2O Deep Learning ArnoCandel 28
How well did Deep Learning do
Letrsquos see how H2O did in the past 10 minutes
Higgs Live Demo (Continued)
ltyour guessgt
reference paper results
Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN (H2O)
H2O Deep Learning ArnoCandel
H2O Steam Scoring Platform
29
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
httpserverportsteamindexhtml
Live Demo
H2O Deep Learning ArnoCandel 30
Live Demo on 10-node cluster lt10 minutes runtime for all H2O algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel
Goal Predict the item from sellerrsquos text description
25
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
ldquoVintage 18KT gold Rolex 2 Tone in great conditionrdquo
Data Bag of words vector 0010000010001hellip0
vintagegold condition
Text Classification
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
26
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
MNIST Unsupervised Anomaly Detection with Deep Learning (Autoencoder)
27
The good The bad The ugly
smallmedianlarge reconstruction errorhttplearnh2oaicontenthands-on_traininganomaly_detectionhtml
H2O Deep Learning ArnoCandel 28
How well did Deep Learning do
Letrsquos see how H2O did in the past 10 minutes
Higgs Live Demo (Continued)
ltyour guessgt
reference paper results
Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN (H2O)
H2O Deep Learning ArnoCandel
H2O Steam Scoring Platform
29
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
httpserverportsteamindexhtml
Live Demo
H2O Deep Learning ArnoCandel 30
Live Demo on 10-node cluster lt10 minutes runtime for all H2O algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel
Out-Of-The-Box 116 test set error after 10 epochs Predicts the correct class (out of 143) 884 of the time
26
Note 2 No tuning was done(results are for illustration only)
Train 578361 rows 8647 cols 467 classes Test 64263 rows 8647 cols 143 classes
Note 1 H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Text Classification
H2O Deep Learning ArnoCandel
MNIST Unsupervised Anomaly Detection with Deep Learning (Autoencoder)
27
The good The bad The ugly
smallmedianlarge reconstruction errorhttplearnh2oaicontenthands-on_traininganomaly_detectionhtml
H2O Deep Learning ArnoCandel 28
How well did Deep Learning do
Letrsquos see how H2O did in the past 10 minutes
Higgs Live Demo (Continued)
ltyour guessgt
reference paper results
Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN (H2O)
H2O Deep Learning ArnoCandel
H2O Steam Scoring Platform
29
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
httpserverportsteamindexhtml
Live Demo
H2O Deep Learning ArnoCandel 30
Live Demo on 10-node cluster lt10 minutes runtime for all H2O algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel
MNIST Unsupervised Anomaly Detection with Deep Learning (Autoencoder)
27
The good The bad The ugly
smallmedianlarge reconstruction errorhttplearnh2oaicontenthands-on_traininganomaly_detectionhtml
H2O Deep Learning ArnoCandel 28
How well did Deep Learning do
Letrsquos see how H2O did in the past 10 minutes
Higgs Live Demo (Continued)
ltyour guessgt
reference paper results
Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN (H2O)
H2O Deep Learning ArnoCandel
H2O Steam Scoring Platform
29
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
httpserverportsteamindexhtml
Live Demo
H2O Deep Learning ArnoCandel 30
Live Demo on 10-node cluster lt10 minutes runtime for all H2O algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel 28
How well did Deep Learning do
Letrsquos see how H2O did in the past 10 minutes
Higgs Live Demo (Continued)
ltyour guessgt
reference paper results
Any guesses for AUC on low-level features AUC=076 was the best for RFGBMNN (H2O)
H2O Deep Learning ArnoCandel
H2O Steam Scoring Platform
29
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
httpserverportsteamindexhtml
Live Demo
H2O Deep Learning ArnoCandel 30
Live Demo on 10-node cluster lt10 minutes runtime for all H2O algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel
H2O Steam Scoring Platform
29
Higgs Dataset Demo on 10-node cluster Letrsquos score all our H2O models and compare them
httpserverportsteamindexhtml
Live Demo
H2O Deep Learning ArnoCandel 30
Live Demo on 10-node cluster lt10 minutes runtime for all H2O algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel 30
Live Demo on 10-node cluster lt10 minutes runtime for all H2O algos Better than LHC baseline of AUC=073
Scoring Higgs Models in H2O Steam
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel 31
AlgorithmPaperrsquosl-l AUC
low-level H2O AUC
all featuresH2O AUC
Parameters (not heavily tuned) H2O running on 10 nodes
Generalized Linear Model - 0596 0684 default binomial
Random Forest - 0764 0840 50 trees max depth 50
Gradient Boosted Trees 073 0753 0839 50 trees max depth 15
Neural Net 1 layer 0733 0760 0830 1x300 Rectifier 100 epochs
Deep Learning 3 hidden layers 0836 0850 - 3x1000 Rectifier L2=1e-5 40 epochs
Deep Learning 4 hidden layers 0868 0869 - 4x500 Rectifier L1=L2=1e-5 300 epochs
Deep Learning 5 hidden layers 0880 0871 - 5x500 Rectifier L1=L2=1e-5
Deep Learning on low-level features alone beats everything else Prelim H2O results compare well with paperrsquos results (TMVA amp Theano)
Higgs Particle Detection with H2O
Nature paper httparxivorgpdf14024735v2pdf
HIGGS UCI Dataset 21 low-level features AND 7 high-level derived features Train 10M rows Test 500k rows
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel 32
H2O Deep Learning BookletR Vignette Explains methods amp parameters
Even more tips amp tricks in our other presentations and at H2O World next week
Grab your copy today or download at httptcokWzyFMGJ2S
or view as GitBook
httph2ogitbooksioh2o-deep-learning
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel
You can participate33
- Images Convolutional amp Pooling Layers PUB-644
- Sequences Recurrent Neural Networks PUB-1052
- Faster Training GPGPU support PUB-1013
- Pre-Training Stacked Auto-Encoders PUB-1014
- Ensembles PUB-1072
- Use H2O at Kaggle Challenges
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel
H2O Kaggle Starter Scripts34
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel
Re-Live H2O World35
httph2oaih2o-world httplearnh2oai Watch the Videos
Day 2 bull Speakers from Academia amp Industry bull Trevor Hastie (ML) bull John Chambers (S R) bull Josh Bloch (Java API) bull Many use cases from customers bull 3 Top Kaggle Contestants (Top 10)
bull 3 Panel discussions
Day 1 bull Hands-On Training bull Supervised bull Unsupervised bull Advanced Topics bull Markting Usecase
bull Product Demos bull Hacker-Fest with Cliff Click (CTO Hotspot)
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you
H2O Deep Learning ArnoCandel
Key Take-AwaysH2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data
Join our Community and Meetups httpsgithubcomh2oai h2ostream community forum wwwh2oai h2oai
36
Thank you