Scalable Bayesian Optimization using Deep Neural Networks
Jasper Snoek
with
Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Ali Patwary, Prabhat, Ryan P. Adams
Motivation
Bayesian optimization:
• Global optimization of expensive, multi-modal and noisy functions
• E.g. the hyperparameters of machine learning algorithms
• Robots, chemistry, cooking recipes, etc.
Bayesian Optimization for Hyperparameters
Instead of relying on intuition or brute-force strategies:
Perform a regression from the high-level model parameters to the error metric (e.g. classification error)
• Build a statistical model of the function, with a suitable prior – e.g. a Gaussian process
• Use the statistics to tell us:
• Where is the expected minimum of the function?
• What is the expected improvement from trying other parameters? (see the sketch below)
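As an illustration of that last bullet (my sketch, not from the talk): under a Gaussian predictive distribution with mean mu and standard deviation sigma, Expected Improvement for minimization has a standard closed form, where `f_best` is the best value observed so far:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI for minimization under a Gaussian predictive.

    mu, sigma: predictive mean and standard deviation at the candidate(s).
    f_best:    best (lowest) objective value observed so far.
    """
    sigma = np.maximum(sigma, 1e-12)   # guard against zero variance
    gamma = (f_best - mu) / sigma      # standardized improvement
    return sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))

# A point predicted slightly better than the incumbent, with some
# uncertainty, gets a positive expected improvement.
print(expected_improvement(mu=0.9, sigma=0.3, f_best=1.0))
```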
[Figure: True Function with Three Observations]
[Figure: Bayesian nonlinear regression predictive distributions, with 80%, 90%, and 95% credible intervals]
How do the predictions compare to the current best?
[Figure: predictive distribution (80/90/95% credible intervals) compared against the current best]
How do the predictions compare to the current best?
[Figure: Expected Improvement derived from the predictive distribution (80/90/95% credible intervals)]
GPs as Distributions over Functions
[Figure: samples from a GP prior and the corresponding posterior]
But the computational cost grows cubically in N!
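To make the cubic cost concrete, here is a minimal exact-GP regression sketch (illustrative, not the speaker's code): the Cholesky factorization of the N×N kernel matrix is the O(N³) step that dominates as observations accumulate.

```python
import numpy as np

def gp_posterior(X, y, X_star, noise=1e-6, lengthscale=1.0):
    """Exact GP posterior with a squared-exponential kernel.

    The Cholesky factorization of the N x N kernel matrix costs O(N^3),
    which is what limits GP-based Bayesian optimization at scale.
    """
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale**2)

    K = k(X, X) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)                              # O(N^3)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    K_star = k(X_star, X)
    mean = K_star @ alpha                                  # predictive mean
    v = np.linalg.solve(L, K_star.T)
    var = k(X_star, X_star).diagonal() - (v ** 2).sum(0)   # predictive variance
    return mean, var
```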
Having a Statistical Framework Helps
• Reason about constraints
• Gramacy et al., 2010. Gardner et al., 2014. Gelbart, Snoek & Adams, 2014. …
• Think about multi-task & transfer across related problems
• Krause & Ong, 2011. Hutter et al., 2011. Bardenet et al., 2013. Swersky, Snoek & Adams, 2013. …
• Run experiments in parallel
• Ginsbourger & Le Riche, 2010. Hutter et al., 2011. Snoek, Larochelle & Adams, 2012. Frazier et al., 2014. …
• Determine when to stop experiments early
• Swersky, Snoek & Adams, 2014. Domhan et al., 2014.
GP-Based Bayesian Optimization
• Gaussian processes scale poorly: O(N³)
• Due to having to invert the data covariance matrix
• This prevents us from…
• Running hundreds/thousands of experiments in parallel
• Sharing information across many optimizations
• Modeling every epoch of learning (early stopping)
• Having very complex constraint spaces
• Tackling high-dimensional problems
• In order to address more interesting problems, we have to scale it up
Need a Different Model
• Random Forests
• Empirical estimate of uncertainty
• Generally outperformed by neural nets
• Sparse GPs
• Scale better, but aren't actually used in practice
• Hard to get to work well; the uncertainty is not great
• Bayesian Neural Nets
• Very flexible, powerful models
• Marginalizing all the parameters is prohibitively expensive
Deep Nets for Global Optimization
• A pragmatic Bayesian deep neural net
[Figure: network diagram with Bayesian Linear Regression as the final layer]
How does this work?
Expected Improvement depends on the predictive mean and variance of the model
How does this work?
[Figure: Expected Improvement under the model's predictive distribution (80/90/95% credible intervals)]
Expected Improvement depends on the predictive mean and variance of the model
How does this work?
$m = \beta K^{-1} \Phi^\top y \in \mathbb{R}^D$
$K = \beta \Phi^\top \Phi + \alpha^2 I \in \mathbb{R}^{D \times D}$
$\phi(x)$: last hidden layer of the neural net for test data
$\Phi$: last hidden layer of the neural net for training data
$D \ll N$!
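A minimal sketch of this adaptive-basis Bayesian linear regression, following my reconstruction of the slide's (garbled) equations; `Phi` stands in for the trained network's last-hidden-layer activations, and the fixed `alpha`, `beta` values are placeholders for hyperparameters the talk later integrates out:

```python
import numpy as np

def blr_posterior(Phi, y, alpha=1.0, beta=100.0):
    """Posterior over last-layer weights given features Phi (N x D).

    Only a D x D system is solved, so the cost is O(D^3) with D << N,
    instead of the O(N^3) of an exact GP.
    """
    D = Phi.shape[1]
    K = beta * Phi.T @ Phi + alpha**2 * np.eye(D)  # K = beta Phi^T Phi + alpha^2 I
    m = beta * np.linalg.solve(K, Phi.T @ y)       # m = beta K^{-1} Phi^T y
    return m, K

def blr_predict(phi_star, m, K, beta=100.0):
    """Predictive mean and variance at test features phi_star (M x D)."""
    mean = phi_star @ m
    var = (phi_star * np.linalg.solve(K, phi_star.T).T).sum(-1) + 1.0 / beta
    return mean, var
```

These predictive means and variances are exactly what the Expected Improvement computation sketched earlier consumes.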
How does this work?
$m = \beta K^{-1} \Phi^\top y \in \mathbb{R}^D$
$K = \beta \Phi^\top \Phi + \alpha^2 I \in \mathbb{R}^{D \times D}$
[Figure: Expected Improvement under the adaptive-basis model (80/90/95% credible intervals)]
$\phi(x)$: last hidden layer of the neural net for test data
$\Phi$: last hidden layer of the neural net for training data
$D \ll N$!
How does this work?
$\eta(x) = \lambda + (x - c)^\top \Lambda (x - c)$
We set a quadratic prior: a bowl centered in the middle of the search region
How does this work?
[Figure: Expected Improvement with the quadratic prior mean (80/90/95% credible intervals)]
$\eta(x) = \lambda + (x - c)^\top \Lambda (x - c)$
We set a quadratic prior: a bowl centered in the middle of the search region
Constraints
Almost every real problem has complex constraints
• Often unknown a priori
• E.g. the training of a model diverging and producing NaNs
• We developed a principled approach to dealing with constraints
• Gelbart, Snoek & Adams. Bayesian Optimization with Unknown Constraints. UAI 2014.
• Need to scale that up as well
Constraints
Use a classification neural net and integrate out the last layer (Laplace Approximation)
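A sketch of that idea (my illustration under standard assumptions, not the speaker's code): find the MAP weights of a logistic regression on the last-layer features by Newton's method, take a Gaussian (Laplace) approximation around them, and use the probit approximation for the predictive feasibility probability.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_logistic(Phi, c, alpha=1.0, iters=50):
    """Laplace approximation for Bayesian logistic regression.

    Phi: N x D last-layer features; c: feasibility labels in {0, 1}.
    Returns the MAP weights and the Hessian of the negative log
    posterior, whose inverse is the Laplace posterior covariance.
    """
    D = Phi.shape[1]
    w = np.zeros(D)
    for _ in range(iters):                       # Newton iterations
        p = sigmoid(Phi @ w)
        g = Phi.T @ (p - c) + alpha * w          # gradient
        S = p * (1 - p)
        H = (Phi * S[:, None]).T @ Phi + alpha * np.eye(D)
        w -= np.linalg.solve(H, g)
    return w, H

def p_feasible(phi_star, w, H):
    """Predictive feasibility probability via the probit approximation."""
    mu = phi_star @ w
    var = (phi_star * np.linalg.solve(H, phi_star.T).T).sum(-1)
    kappa = 1.0 / np.sqrt(1.0 + np.pi * var / 8.0)
    return sigmoid(kappa * mu)
```

Following the constrained-optimization work cited above, this feasibility probability can then down-weight the acquisition at points likely to violate constraints.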
Parallelism
[Figure: predictive distribution and Expected Improvement with three completed observations and two pending experiments]
With 3 complete and 2 pending, what to do next?
Parallelism
[Figure: fantasized outcomes for the two pending experiments and the resulting posterior]
With 3 complete and 2 pending, what to do next?
Use posterior predictive to “fantasize” outcomes.
Parallelism
[Figure: Expected Improvement computed for each fantasized outcome]
With 3 complete and 2 pending, what to do next?
Use posterior predictive to “fantasize” outcomes.
Compute the acquisition function (EI) for each predictive fantasy.
Parallelism
[Figure: Monte Carlo average of the per-fantasy Expected Improvement curves]
With 3 complete and 2 pending, what to do next?
Use posterior predictive to “fantasize” outcomes.
Compute the acquisition function (EI) for each predictive fantasy.
Monte Carlo estimate of overall acquisition function.
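A minimal sketch of that Monte Carlo scheme (illustrative; `fit` is a placeholder for refitting whatever surrogate is in use, and `ei` is the closed-form Expected Improvement sketched earlier):

```python
import numpy as np
from scipy.stats import norm

def ei(mu, sigma, f_best):
    sigma = np.maximum(sigma, 1e-12)
    g = (f_best - mu) / sigma
    return sigma * (g * norm.cdf(g) + norm.pdf(g))

def parallel_ei(fit, X, y, X_pending, X_cand, n_fantasies=10, seed=0):
    """Average EI over fantasized outcomes of pending experiments.

    fit(X, y) -> (mean_fn, std_fn): refits the surrogate and returns
    callables giving the predictive mean/std at query points.
    """
    rng = np.random.default_rng(seed)
    mean_fn, std_fn = fit(X, y)
    mu_p, sd_p = mean_fn(X_pending), std_fn(X_pending)
    total = np.zeros(len(X_cand))
    for _ in range(n_fantasies):
        # Sample ("fantasize") outcomes for the pending points, refit on
        # real + fantasized data, and score the candidates.
        y_fant = rng.normal(mu_p, sd_p)
        m_fn, s_fn = fit(np.vstack([X, X_pending]),
                         np.concatenate([y, y_fant]))
        f_best = min(y.min(), y_fant.min())
        total += ei(m_fn(X_cand), s_fn(X_cand), f_best)
    return total / n_fantasies        # Monte Carlo estimate of parallel EI
```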
Parallelism
[Figure: sampled objective and constraint outcomes and the resulting Monte Carlo constrained Expected Improvement]
Sample outputs for both objective and constraint
Monte Carlo Constrained EI
What about all the hyperparameters of this model?
Integrate out hyperparameters of Bayesian layers
What about all the hyperparameters of this model?
Integrate out hyperparameters of Bayesian layers
Use GP Bayesian optimization for the neural net hyperparameters
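The slides don't spell out the mechanics; a sampling scheme (e.g. slice sampling) is the usual choice for integrating out such hyperparameters. As a simpler hedged stand-in, one can weight a small grid of (α², β) values by the Bayesian linear regression marginal likelihood (a standard result, e.g. Bishop, PRML, eq. 3.86) and average the resulting predictions:

```python
import numpy as np

def log_evidence(Phi, y, alpha2, beta):
    """Log marginal likelihood of Bayesian linear regression."""
    N, D = Phi.shape
    A = beta * Phi.T @ Phi + alpha2 * np.eye(D)   # posterior precision
    m = beta * np.linalg.solve(A, Phi.T @ y)      # posterior mean
    E = 0.5 * beta * np.sum((y - Phi @ m) ** 2) + 0.5 * alpha2 * m @ m
    return (0.5 * D * np.log(alpha2) + 0.5 * N * np.log(beta) - E
            - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * N * np.log(2 * np.pi))

def marginalized_mean(Phi, y, phi_star, grid):
    """Average predictive means over (alpha^2, beta), weighted by evidence."""
    logw = np.array([log_evidence(Phi, y, a2, b) for a2, b in grid])
    w = np.exp(logw - logw.max())
    w /= w.sum()                                   # normalized weights
    means = []
    for a2, b in grid:
        A = b * Phi.T @ Phi + a2 * np.eye(Phi.shape[1])
        means.append(phi_star @ (b * np.linalg.solve(A, Phi.T @ y)))
    return w @ np.array(means)
```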
Putting it all together
Backprop down to the inputs to optimize for the most promising next experiment
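As a sketch of that step (illustrative; a real implementation would backpropagate the analytic gradient of EI through the network rather than use finite differences):

```python
import numpy as np

def maximize_acquisition(acq, x0, lr=0.05, steps=200, eps=1e-5,
                         lo=-5.0, hi=5.0):
    """Gradient ascent on an acquisition function over the inputs.

    acq: callable mapping an input vector to its acquisition value.
    """
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        grad = np.zeros_like(x)
        for i in range(len(x)):             # finite-difference gradient
            d = np.zeros_like(x)
            d[i] = eps
            grad[i] = (acq(x + d) - acq(x - d)) / (2 * eps)
        x = np.clip(x + lr * grad, lo, hi)  # stay inside the search box
    return x
```

Because EI is multi-modal, restarting from several random initial points and keeping the best result is standard practice.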
How does it scale?
A collection of Bayesian optimization benchmarks (Eggensperger et al.)
How well does it optimize?
Convolutional Networks
• Notoriously hard to tune
• 14 hyperparameters with broad support
• E.g. learning rate, momentum, input dropout, dropout, weight decay, weight initialization, parameters on input transformations, etc.
• Very generic architecture
• Evaluate 40 in parallel on Intel® Xeon Phi™ coprocessors
Convolutional Networks
Achieved “state-of-the-art” within a few sequential steps
Image Caption Generation
Tune the hyperparameters of this model
• MS COCO benchmark dataset
• Each experiment takes ~26 hours
• 11 hyperparameters (including categorical)
• Approximately half of the space is invalid
• 500-800 experiments in parallel
Zaremba, Sutskever & Vinyals, 2015
Image Caption Generation
Tune the hyperparameters of this model
Zaremba, Sutskever & Vinyals, 2015
[Figure: validation BLEU-4 score (0-25) vs. iteration, up to 2500 iterations]
Image Caption Generation
Tune the hyperparameters of this model
Zaremba, Sutskever & Vinyals, 2015
[Figure: validation BLEU-4 score (0-25) vs. iteration, up to 2500 iterations]
“A person riding a wave in the ocean” “A bird sitting on top of a field”
Image Caption Generation
Tune the hyperparameters of this model
Zaremba, Sutskever & Vinyals, 2015
[Figure: validation BLEU-4 score (0-25) vs. iteration, up to 2500 iterations]
“A person riding a wave in the ocean” “A bird sitting on top of a field”
“A horse riding a horse”
Other Interesting Decisions - Neural Net Basis Functions
[Figure: predictive distributions using tanh, ReLU, and tanh + ReLU basis functions]
Thanks
Oren Rippel (MIT, Harvard)
Kevin Swersky (Toronto)
Ryan P. Adams (Harvard)
Ryan Kiros (Toronto)
Nadathur Satish, Narayanan Sundaram, Mostofa Ali Patwary (Intel Parallel Labs)
Prabhat (Lawrence Berkeley National Laboratory)