A Tutorial on Bayesian Optimization for Machine Learning

http://hips.seas.harvard.edu

Ryan P. Adams
School of Engineering and Applied Sciences

Harvard University

A Tutorial on Bayesian Optimization for Machine Learning

Presented by Jonathan Lorraine (University of Toronto CSC411/2515 Fall 2018 teaching staff)
Source - https://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Ryan_adams_140814_bayesopt_ncap.pdf

Machine Learning Meta-Challenges

‣ Increasing model complexity: more flexible models have more parameters.

‣ More sophisticated fitting procedures: non-convex optimization has many knobs to turn.

‣ Less accessible to non-experts: harder to apply complicated techniques.

‣ Results are less reproducible: too many important implementation details are missing.

Example: Deep Neural Networks

‣ Resurgent interest in large neural networks.

‣ When well-tuned, very successful for visual object identification, speech recognition, computational biology, ...

‣ Big investments by Google, Facebook, Microsoft, etc.

‣ Many choices: number of layers, weight regularization, layer size, which nonlinearity, batch size, learning rate schedule, stopping conditions.

Search for Good Hyperparameters?

‣ Define an objective function. Most often, we care about generalization performance. Use cross-validation to measure parameter quality.

‣ How do people currently search? Black magic: grid search, random search, grad student descent.

‣ Painful! Requires many training cycles. Possibly noisy.

Can We Do Better? Bayesian Optimization

‣ Build a probabilistic model for the objective. Include hierarchical structure about units, etc.

‣ Compute the posterior predictive distribution. Integrate out all the possible true functions. We use Gaussian process regression.

‣ Optimize a cheap proxy function instead. The model is much cheaper than the true objective.

The main insight: make the proxy function exploit uncertainty to balance exploration against exploitation.
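To make this loop concrete, here is a minimal Python sketch of the idea (added for this write-up, not from the original slides). The callables objective, fit_model, and acquisition are hypothetical placeholders for the expensive training-and-validation run, a GP fit returning a posterior mean and standard deviation, and an acquisition rule such as those introduced later in the tutorial.

    # Minimal sketch of the Bayesian optimization loop over a fixed candidate set.
    # objective(x): expensive evaluation (e.g., cross-validation error).
    # fit_model(X, y, candidates): returns posterior (mu, sigma) at each candidate.
    # acquisition(mu, sigma, y_best): scores candidates; higher means more promising.
    import numpy as np

    def bayes_opt_loop(objective, candidates, fit_model, acquisition,
                       n_init=3, n_iter=20, seed=0):
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(candidates), size=n_init, replace=False)
        X = [candidates[i] for i in idx]                 # initial random evaluations
        y = [objective(x) for x in X]
        for _ in range(n_iter):
            mu, sigma = fit_model(np.array(X), np.array(y), candidates)
            scores = acquisition(mu, sigma, min(y))      # cheap proxy optimization
            x_next = candidates[int(np.argmax(scores))]  # most promising candidate
            X.append(x_next)
            y.append(objective(x_next))                  # one more expensive evaluation
        best = int(np.argmin(y))
        return X[best], y[best]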

The Bayesian Optimization Idea

[Figure: a sequence of frames showing a 1-D GP posterior over the objective being updated as points are evaluated; the current best observation is marked in each frame.]

Where should we evaluate next in order to improve the most?

One idea: “expected improvement”

Today’s Topics

‣ Review of Gaussian process priors
‣ Bayesian optimization basics
‣ Managing covariances and kernel parameters
‣ Accounting for the cost of evaluation
‣ Parallelizing training
‣ Sharing information across related problems
‣ Better models for nonstationary functions
‣ Random projections for high-dimensional problems
‣ Accounting for constraints
‣ Leveraging partially-completed training runs

Presenter's note (Jonathan Lorraine): the topics crossed out on the original slide are not covered in this tutorial.

Gaussian Processes as Function Models

Nonparametric prior on functions specified in terms of a positive definite kernel.

[Figure: sample functions drawn from Gaussian process priors. Kernels can also be defined on structured objects such as strings (Haussler 1999), permutations (Kondor 2008), and proteins (Borgwardt 2007).]

Gaussian Processes

‣ A Gaussian process (GP) is a distribution on functions.

‣ Allows tractable Bayesian modeling of functions without specifying a particular finite basis.

‣ Input space (where we're optimizing): $\mathcal{X}$

‣ Model scalar functions: $f : \mathcal{X} \to \mathbb{R}$

‣ Positive definite covariance function: $C : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$

‣ Mean function: $m : \mathcal{X} \to \mathbb{R}$

Gaussian Processes

Any finite set of $N$ points $\{x_n\}_{n=1}^N$ in $\mathcal{X}$ induces a homologous $N$-dimensional Gaussian distribution on $\mathbb{R}^N$, taken to be the distribution on $\{y_n = f(x_n)\}_{n=1}^N$.

[Figure: sample functions drawn from the GP, evaluated at a finite set of input points, illustrating the induced $N$-dimensional Gaussian.]
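As a small illustration of this finite-dimensional view (a sketch added for this write-up, not from the slides), the following numpy snippet evaluates a squared-exponential covariance at a grid of inputs and draws joint samples from the induced $N$-dimensional Gaussian; the amplitude and length_scale values are arbitrary choices.

    # Evaluate a GP prior at N inputs and sample from the induced N-dimensional Gaussian.
    import numpy as np

    def sq_exp_kernel(x1, x2, amplitude=1.0, length_scale=0.5):
        """Squared-exponential covariance for 1-D inputs."""
        d = x1[:, None] - x2[None, :]
        return amplitude * np.exp(-0.5 * (d / length_scale) ** 2)

    x = np.linspace(-3.0, 3.0, 100)                     # N = 100 input locations
    K = sq_exp_kernel(x, x) + 1e-8 * np.eye(len(x))     # jitter for numerical stability
    samples = np.random.multivariate_normal(np.zeros(len(x)), K, size=5)
    # Each row of `samples` is one draw of {y_n = f(x_n)} at the chosen inputs.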

Gaussian Processes

‣ Due to the Gaussian form, there are closed-form solutions for many useful questions about finite data.

‣ Marginal likelihood:

$$\ln p(\mathbf{y} \mid X, \theta) = -\frac{N}{2}\ln 2\pi - \frac{1}{2}\ln|K_\theta| - \frac{1}{2}\mathbf{y}^{\mathsf{T}} K_\theta^{-1} \mathbf{y}$$

‣ Predictive distribution at test points $\{x_m\}_{m=1}^M$:

$$\mathbf{y}_{\text{test}} \sim \mathcal{N}(\mathbf{m}, \Sigma), \qquad \mathbf{m} = \mathbf{k}_\theta^{\mathsf{T}} K_\theta^{-1}\mathbf{y}, \qquad \Sigma = \kappa_\theta - \mathbf{k}_\theta^{\mathsf{T}} K_\theta^{-1}\mathbf{k}_\theta$$

‣ We compute these matrices from the covariance:

$$[K_\theta]_{n,n'} = C(x_n, x_{n'};\,\theta), \qquad [\mathbf{k}_\theta]_{n,m} = C(x_n, x_m;\,\theta), \qquad [\kappa_\theta]_{m,m'} = C(x_m, x_{m'};\,\theta)$$
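A minimal numpy sketch of these predictive equations (an illustration, not from the slides): it assumes a zero mean function, any covariance callable such as the sq_exp_kernel above, and a small jitter term added for numerical stability.

    # Closed-form GP predictive mean and covariance via a Cholesky factorization.
    import numpy as np

    def gp_predict(X_train, y_train, X_test, kernel, noise=1e-6):
        K = kernel(X_train, X_train) + noise * np.eye(len(X_train))  # [K_theta]
        k = kernel(X_train, X_test)                                  # [k_theta]
        kappa = kernel(X_test, X_test)                               # test-test covariance
        L = np.linalg.cholesky(K)                                    # stable solves with K^{-1}
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
        V = np.linalg.solve(L, k)
        mean = k.T @ alpha               # m = k^T K^{-1} y
        cov = kappa - V.T @ V            # Sigma = kappa - k^T K^{-1} k
        return mean, cov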

Examples of GP Covariances

Squared-Exponential:

$$C(x, x') = \theta_0 \exp\left\{-\frac{1}{2}\sum_{d=1}^{D}\left(\frac{x_d - x'_d}{\theta_d}\right)^{2}\right\}$$

Matérn:

$$C(r) = \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\sqrt{2\nu}\, r\right)^{\nu} K_\nu\!\left(\sqrt{2\nu}\, r\right)$$

“Neural Network”:

$$C(x, x') = \frac{2}{\pi}\sin^{-1}\left\{\frac{2\, x^{\mathsf{T}} \Sigma\, x'}{\sqrt{(1 + 2 x^{\mathsf{T}} \Sigma x)(1 + 2 x'^{\mathsf{T}} \Sigma x')}}\right\}$$

Periodic:

$$C(x, x') = \exp\left\{-\frac{2\sin^{2}\!\left(\frac{1}{2}(x - x')\right)}{\ell^{2}}\right\}$$
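As an illustration (a sketch, not from the slides), here are 1-D implementations of two of these covariances: the Matérn form specialized to ν = 5/2, which has a simple closed form, and the periodic covariance exactly as written above. The unit amplitude and default length scales are arbitrary assumptions.

    # Two example covariance functions for 1-D inputs (unit amplitude).
    import numpy as np

    def matern52(x1, x2, length_scale=0.5):
        """Matern covariance with nu = 5/2: (1 + s + s^2/3) exp(-s), s = sqrt(5) r / length_scale."""
        s = np.sqrt(5.0) * np.abs(x1[:, None] - x2[None, :]) / length_scale
        return (1.0 + s + s ** 2 / 3.0) * np.exp(-s)

    def periodic(x1, x2, length_scale=1.0):
        """Periodic covariance: exp(-2 sin^2((x - x') / 2) / length_scale^2)."""
        d = 0.5 * (x1[:, None] - x2[None, :])
        return np.exp(-2.0 * np.sin(d) ** 2 / length_scale ** 2)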

GPs Provide Closed-Form Predictions

[Figure: GP regression on a 1-D example, showing the closed-form predictive mean and uncertainty.]

Using Uncertainty in Optimization

‣ Find the minimum: $x^{\star} = \arg\min_{x \in \mathcal{X}} f(x)$

‣ We can evaluate the objective pointwise, but do not have an easy functional form or gradients.

‣ After performing some evaluations, the GP gives us easy closed-form marginal means and variances.

‣ Exploration: seek places with high variance.

‣ Exploitation: seek places with low mean.

‣ The acquisition function balances these for our proxy optimization to determine the next evaluation.

Closed-Form Acquisition Functions

‣ The GP posterior gives a predictive mean function $\mu(x)$ and a predictive marginal variance function $\sigma^{2}(x)$. Define

$$\gamma(x) = \frac{f(x_{\text{best}}) - \mu(x)}{\sigma(x)}$$

‣ Probability of Improvement (Kushner 1964):

$$a_{\text{PI}}(x) = \Phi(\gamma(x))$$

‣ Expected Improvement (Mockus 1978):

$$a_{\text{EI}}(x) = \sigma(x)\left(\gamma(x)\,\Phi(\gamma(x)) + \mathcal{N}(\gamma(x);\,0,1)\right)$$

‣ GP Upper Confidence Bound (Srinivas et al. 2010):

$$a_{\text{LCB}}(x) = \mu(x) - \kappa\,\sigma(x)$$
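A short numpy/scipy sketch of these three acquisition rules (added for illustration), taking the GP predictive mean mu, standard deviation sigma, and incumbent value f_best as arrays or scalars; the kappa default is an arbitrary assumption.

    # Closed-form acquisition functions for minimization, given GP predictions.
    import numpy as np
    from scipy.stats import norm

    def gamma(mu, sigma, f_best):
        return (f_best - mu) / sigma

    def prob_of_improvement(mu, sigma, f_best):
        return norm.cdf(gamma(mu, sigma, f_best))

    def expected_improvement(mu, sigma, f_best):
        g = gamma(mu, sigma, f_best)
        return sigma * (g * norm.cdf(g) + norm.pdf(g))

    def lower_confidence_bound(mu, sigma, kappa=2.0):
        # Smaller values are more promising; negate if maximizing an acquisition score.
        return mu - kappa * sigma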

Probability of Improvement

[Figure: the PI acquisition function on the running 1-D example.]

Expected Improvement

[Figure: the EI acquisition function on the running 1-D example.]

GP Upper (Lower) Confidence Bound

[Figure: the GP-LCB acquisition function on the running 1-D example.]

Distribution Over Minimum (Entropy Search)

[Figure: the posterior distribution over the location of the minimum.]

Illustrating Bayesian Optimization

[Figure: a sequence of frames showing Bayesian optimization selecting, evaluating, and incorporating new points.]

Why Doesn’t Everyone Use This?

‣ Fragility and poor default choices. Getting the function model wrong can be catastrophic.

‣ There hasn't been standard software available. It's a bit tricky to build such a system from scratch.

‣ Experiments are run sequentially. We want to take advantage of cluster computing.

‣ Limited scalability in dimensions and evaluations. We want to solve big problems.

These ideas have been around for decades. Why isn't Bayesian optimization in broader use?

Fragility and Poor Default Choices

‣ Covariance function selection: this turns out to be crucial to good performance. The default choice for regression is way too smooth. Instead: use an adaptive Matérn 5/2 kernel.

‣ Gaussian process hyperparameters: the typical empirical Bayes approach can fail horribly. Instead: use Markov chain Monte Carlo integration. Slice sampling means no additional parameters!

Ironic problem: Bayesian optimization has its own hyperparameters!

Covariance Function Choice

$$C(r) = \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\sqrt{2\nu}\, r\right)^{\nu} K_\nu\!\left(\sqrt{2\nu}\, r\right)$$

[Figure: samples from GPs with Matérn covariances of increasing smoothness, $\nu = 1/2$, $\nu = 3/2$, $\nu = 5/2$, and $\nu = \infty$ (the squared-exponential limit).]

Choosing Covariance Functions

Structured SVM for protein motif finding (Miller et al., 2012)

[Figure: classification error versus number of function evaluations for different covariance choices: Matérn 5/2 ARD, ExpQuad ISO, ExpQuad ARD, and Matérn 3/2 ARD.]

Snoek, Larochelle & RPA, NIPS 2012

MCMC for GP Hyperparameters

‣ Covariance hyperparameters are often optimized rather than marginalized, typically in the name of convenience and efficiency.

‣ Slice sampling of hyperparameters (e.g., Murray and Adams 2010) is comparably fast and easy, but accounts for uncertainty in length scale, mean, and amplitude.

‣ Integrated acquisition function:

$$a(x) = \int a(x;\,\theta)\, p(\theta \mid \{x_n, y_n\}_{n=1}^{N})\, d\theta \approx \frac{1}{K}\sum_{k=1}^{K} a(x;\,\theta^{(k)}), \qquad \theta^{(k)} \sim p(\theta \mid \{x_n, y_n\}_{n=1}^{N})$$

‣ For a theoretical discussion of the implications of inferring hyperparameters with BayesOpt, see recent work by Wang and de Freitas (http://arxiv.org/abs/1406.7758).

Snoek, Larochelle & RPA, NIPS 2012
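A minimal sketch of this Monte Carlo average (added here for illustration); how the posterior samples theta_samples are produced, e.g. by slice sampling, and what the acquisition callable looks like are left abstract.

    # Integrated acquisition: average a(x; theta) over posterior hyperparameter samples.
    import numpy as np

    def integrated_acquisition(x, theta_samples, acquisition):
        """acquisition(x, theta) evaluates a(x; theta) for one hyperparameter sample."""
        values = [acquisition(x, theta) for theta in theta_samples]
        return np.mean(values, axis=0)   # (1/K) * sum_k a(x; theta^(k))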

Integrating Out GP Hyperparameters

[Figure: GP posterior samples under three different length scales, the expected improvement computed for each length scale, and the resulting integrated expected improvement.]

MCMC for Hyperparameters

[Figure: classification error versus number of function evaluations for GP EI MCMC, GP EI Conventional, and the Tree Parzen Algorithm.]

Logistic regression for handwritten digit recognition

Snoek, Larochelle & RPA, NIPS 2012

Presenter's note (Jonathan Lorraine): the original source for these slides covers advanced topics beyond this point. If interested, please see https://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Ryan_adams_140814_bayesopt_ncap.pdf