High Dimensional Bayesian Optimisation and Bandits via Additive Models
Kirthevasan Kandasamy, Jeff Schneider, Barnabas Poczos
ICML ’15
July 8, 2015
Bandits & Optimisation

Motivating example: maximum likelihood inference in computational astrophysics.
A cosmological simulator maps parameters (e.g. the Hubble constant, the baryonic density) to an observation; the simulator is an expensive blackbox function.

Other examples:
- Hyper-parameter tuning in ML
- Optimal control strategies in Robotics
Bandits & Optimisation

f : [0, 1]^D → R is an expensive, black-box, nonconvex function.
Let x_* = argmax_x f(x).

[Figure: a one-dimensional f on [0, 1], with the maximiser x_* and the value f(x_*) marked.]

Optimisation ≅ minimise Simple Regret:

    S_T = f(x_*) − max_{t=1,...,T} f(x_t).

Bandits ≅ minimise Cumulative Regret:

    R_T = ∑_{t=1}^{T} ( f(x_*) − f(x_t) ).
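To make the two notions concrete, here is a minimal NumPy sketch (function and variable names are illustrative, not from the paper) computing both regrets from a run of T queries:

```python
import numpy as np

def regrets(f_star, f_queries):
    """Simple and cumulative regret after T queries.

    f_star    : f(x_*), the optimum value (known only in synthetic benchmarks).
    f_queries : the observed values f(x_1), ..., f(x_T).
    """
    f_queries = np.asarray(f_queries, dtype=float)
    simple = f_star - f_queries.max()          # S_T = f(x_*) - max_t f(x_t)
    cumulative = np.sum(f_star - f_queries)    # R_T = sum_t (f(x_*) - f(x_t))
    return simple, cumulative
```

Note that S_T ≤ R_T / T, so any algorithm with sublinear cumulative regret also drives the simple regret to zero.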
Gaussian Process (Bayesian) Optimisation

Model f ∼ GP(0, κ). At each step t, obtain the posterior GP, then maximise an acquisition function ϕ_t:  x_t = argmax_x ϕ_t(x).

[Figure: posterior GP on f (top) and acquisition ϕ_t (bottom) over [0, 1], with the maximiser x_t = 0.828 marked.]

GP-UCB:  ϕ_t(x) = μ_{t−1}(x) + β_t^{1/2} σ_{t−1}(x)   (Srinivas et al. 2010)

Other choices of ϕ_t: Expected Improvement (GP-EI), Thompson Sampling, etc.
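As a reference point, here is a minimal one-dimensional sketch of a single GP-UCB step, assuming an SE kernel and a plain grid search for the acquisition maximisation; the kernel parameters A and h, the noise level, and all names are illustrative choices rather than the paper's implementation:

```python
import numpy as np

def se_kernel(X1, X2, A=1.0, h=0.1):
    """Squared exponential kernel: A * exp(-||x - x'||^2 / (2 h^2))."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return A * np.exp(-sq / (2.0 * h ** 2))

def gp_posterior(X, y, Xs, noise=1e-4):
    """Posterior mean and std of f ~ GP(0, kappa) at test points Xs."""
    K = se_kernel(X, X) + noise * np.eye(len(X))
    Ks = se_kernel(X, Xs)                                 # train-test covariances
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # alpha = K^{-1} y
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = se_kernel(Xs, Xs).diagonal() - (v ** 2).sum(axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))

def gp_ucb_step(X, y, beta_t, n_grid=1000):
    """One GP-UCB step: x_t = argmax_x mu_{t-1}(x) + beta_t^{1/2} sigma_{t-1}(x)."""
    Xs = np.linspace(0.0, 1.0, n_grid)[:, None]           # grid over [0, 1]
    mu, sigma = gp_posterior(X, y, Xs)
    phi = mu + np.sqrt(beta_t) * sigma                    # UCB acquisition
    return Xs[np.argmax(phi)]
```

With the first t − 1 observations stacked in X (shape (t−1, 1)) and y, `gp_ucb_step(X, y, beta_t)` returns the next query point; Srinivas et al. (2010) give a schedule for β_t.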
Scaling to Higher Dimensions

Two Key Challenges:
- Statistical difficulty: nonparametric sample complexity is exponential in D.
- Computational difficulty: optimising ϕ_t to within ζ accuracy requires O(ζ^{−D}) effort.

Existing Work:
- (Chen et al. 2012): f depends on a small number of variables. Find the variables, then run GP-UCB.
- (Wang et al. 2013): f varies along a lower-dimensional subspace. Run GP-EI on a random subspace.
- (Djolonga et al. 2013): f varies along a lower-dimensional subspace. Find the subspace, then run GP-UCB.

All of these assume f varies only along a low-dimensional subspace and perform BO on that subspace; the assumption is too strong in realistic settings.
Additive Functions

Structural assumption:

    f(x) = f^{(1)}(x^{(1)}) + f^{(2)}(x^{(2)}) + ... + f^{(M)}(x^{(M)}),

where x^{(j)} ∈ X^{(j)} = [0, 1]^d, d ≪ D, and x^{(i)} ∩ x^{(j)} = ∅.

E.g. f(x^{{1,...,10}}) = f^{(1)}(x^{{1,3,9}}) + f^{(2)}(x^{{2,4,8}}) + f^{(3)}(x^{{5,6,10}})
(coordinate 7 does not appear). Call {X^{(j)}}_{j=1}^{M} = {(1,3,9), (2,4,8), (5,6,10)} the "decomposition".

Assume each f^{(j)} ∼ GP(0, κ^{(j)}). Then f ∼ GP(0, κ), where

    κ(x, x′) = κ^{(1)}(x^{(1)}, x^{(1)′}) + ... + κ^{(M)}(x^{(M)}, x^{(M)′}).

Given (X, Y) = {(x_i, y_i)}_{i=1}^{T} and a test point x_†,

    f^{(j)}(x_†^{(j)}) | X, Y ∼ N( μ^{(j)}, σ^{(j)2} ).
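A minimal sketch of this sum-of-kernels construction (0-indexed coordinates; A, h, and the names are illustrative):

```python
import numpy as np

def additive_se_kernel(X1, X2, groups, A=1.0, h=0.1):
    """kappa(x, x') = sum_j kappa^{(j)}(x^{(j)}, x^{(j)'}), SE components.

    groups : disjoint coordinate index tuples, e.g.
             [(0, 2, 8), (1, 3, 7), (4, 5, 9)], the 0-indexed version of
             the decomposition {(1,3,9), (2,4,8), (5,6,10)} above.
    """
    K = np.zeros((len(X1), len(X2)))
    for g in groups:
        G1, G2 = X1[:, list(g)], X2[:, list(g)]
        sq = ((G1[:, None, :] - G2[None, :, :]) ** 2).sum(-1)
        K += A * np.exp(-sq / (2.0 * h ** 2))   # one SE kernel per group
    return K
```

Since κ is a plain sum, this kernel drops straight into a standard GP posterior computation; the posterior of an individual f^{(j)} is obtained by using κ^{(j)} for the test-train cross-covariance while keeping the full additive κ for the train-train matrix.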
Outline

1. GP-UCB
2. The Add-GP-UCB algorithm
   - Bounds on S_T: exponential in D → linear in D.
   - An easy-to-optimise acquisition function.
   - Performs well even when f is not additive.
3. Experiments
4. Conclusion & some open questions
GP-UCB

    x_t = argmax_{x ∈ X}  μ_{t−1}(x) + β_t^{1/2} σ_{t−1}(x)

Squared Exponential (SE) Kernel:

    κ(x, x′) = A exp( −‖x − x′‖² / (2h²) )

Theorem (Srinivas et al. 2010)
Let f ∼ GP(0, κ). Then, w.h.p.,

    S_T ∈ O( √( D^D (log T)^D / T ) ).
GP-UCB on additive κ

If f ∼ GP(0, κ) where

    κ(x, x′) = κ^{(1)}(x^{(1)}, x^{(1)′}) + ... + κ^{(M)}(x^{(M)}, x^{(M)′}),

with each κ^{(j)} an SE kernel, it can be shown that

    S_T ∈ O( √( D² d^d (log T)^d / T ) ).

But ϕ_t = μ_{t−1} + β_t^{1/2} σ_{t−1} is still D-dimensional!
Add-GP-UCB

    ϕ_t(x) = ∑_{j=1}^{M} [ μ^{(j)}_{t−1}(x^{(j)}) + β_t^{1/2} σ^{(j)}_{t−1}(x^{(j)}) ],

where the j-th summand is ϕ^{(j)}_t(x^{(j)}). Maximise each ϕ^{(j)}_t separately.
Requires only O(poly(D) ζ^{−d}) effort (vs O(ζ^{−D}) for GP-UCB).

Theorem
Let f^{(j)} ∼ GP(0, κ^{(j)}) and f = ∑_j f^{(j)}. Then, w.h.p.,

    S_T ∈ O( √( D² d^d (log T)^d / T ) ).
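Here is a minimal sketch of the per-group maximisation, again via grid search; the `posteriors[j]` interface (returning μ^{(j)} and σ^{(j)} at d-dimensional test points) is a hypothetical stand-in for whatever GP machinery is in use:

```python
import numpy as np

def add_gp_ucb_step(posteriors, groups, D, beta_t, n_grid=50):
    """Maximise each phi^{(j)}_t over its own d coordinates, then concatenate.

    posteriors : list of callables, posteriors[j](Xs) -> (mu_j, sigma_j),
                 the posterior of f^{(j)} at d-dimensional points Xs.
    groups     : the decomposition, disjoint coordinate index tuples.
    """
    x_t = np.empty(D)
    for j, g in enumerate(groups):
        d = len(g)
        # A grid over [0, 1]^d needs O(zeta^{-d}) points, not O(zeta^{-D}).
        axes = [np.linspace(0.0, 1.0, n_grid)] * d
        Xs = np.stack(np.meshgrid(*axes), axis=-1).reshape(-1, d)
        mu_j, sigma_j = posteriors[j](Xs)
        phi_j = mu_j + np.sqrt(beta_t) * sigma_j   # phi^{(j)}_t
        x_t[list(g)] = Xs[np.argmax(phi_j)]        # fill this group's coordinates
    return x_t
```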
Summary of Theoretical Results (for SE Kernel)

GP-UCB, no assumption on f:
    S_T ∈ O( D^{D/2} (log T)^{D/2} T^{−1/2} ).
    Maximising ϕ_t: O(ζ^{−D}) effort.

GP-UCB on additive f:
    S_T ∈ O( D T^{−1/2} ).
    Maximising ϕ_t: O(ζ^{−D}) effort.

Add-GP-UCB on additive f:
    S_T ∈ O( D T^{−1/2} ).
    Maximising ϕ_t: O(poly(D) ζ^{−d}) effort.
Add-GP-UCB illustration:  f(x^{{1,2}}) = f^{(1)}(x^{{1}}) + f^{(2)}(x^{{2}})

[Figure: posterior GPs and acquisitions for the two one-dimensional groups.
Maximising ϕ^{(1)} over x^{{1}} gives x_t^{(1)} = 0.869; maximising ϕ^{(2)} over x^{{2}} gives x_t^{(2)} = 0.141; together x_t = (0.869, 0.141).]
Additive modeling in non-additive settings

- Additive models are common in high-dimensional regression,
  e.g. backfitting, MARS, COSSO, RODEO, SpAM:
  f(x^{{1,...,D}}) = f(x^{{1}}) + f(x^{{2}}) + ... + f(x^{{D}}).
- Additive models are statistically simpler =⇒ worse bias, but much better variance in the low-sample regime.
- In BO applications queries are expensive, so we usually cannot afford many of them.
- Observation: Add-GP-UCB does well even when f is not additive.
  - Better bias/variance trade-off in high-dimensional regression.
  - Easy-to-maximise acquisition function.
Unknown Kernel / Decomposition in Practice

Learn the kernel hyper-parameters and the decomposition {X^{(j)}} by maximising the GP marginal likelihood periodically.
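One simple way to search over decompositions is random restarts scored by the log marginal likelihood; the sketch below (reusing `additive_se_kernel` from the earlier sketch, with an illustrative number of trials and noise jitter, and with kernel hyper-parameter fitting omitted) is an assumed illustration of how this could look, not the paper's exact procedure:

```python
import numpy as np

def learn_decomposition(X, y, M, n_tries=20, seed=0):
    """Pick the decomposition maximising the GP marginal likelihood.

    Randomly partitions the D coordinates into M groups and scores each
    partition by log N(y | 0, K) under the additive SE kernel.
    """
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    best, best_ll = None, -np.inf
    for _ in range(n_tries):
        groups = [tuple(g) for g in np.array_split(rng.permutation(D), M)]
        K = additive_se_kernel(X, X, groups) + 1e-4 * np.eye(len(X))
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        ll = -0.5 * y @ alpha - np.log(np.diag(L)).sum()  # log-lik., up to a constant
        if ll > best_ll:
            best, best_ll = groups, ll
    return best
```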
Experiments

[Figure: regret (log-scale y-axis) vs number of queries on synthetic functions.
Add-∗ knows the true decomposition; Add-d/M uses M groups of size ≤ d.
Shown for acquisition-maximisation budgets of 1000 and 4000 DiRect evaluations.]

DiRect: Dividing Rectangles (Jones et al. 1993)
SDSS Luminous Red Galaxies

Cosmological simulator: parameters (e.g. the Hubble constant, the baryonic density) → observation.

- Task: find maximum-likelihood cosmological parameters.
- 20 dimensions, but only 9 parameters are relevant.
- Each query takes 2–5 seconds.
- Use 500 DiRect evaluations to maximise the acquisition function.
SDSS Luminous Red Galaxies

[Figure: best log-likelihood found (−10³ to −10¹, log scale) vs number of queries (0 to 400); baselines include REMBO (Wang et al. 2013).]
Viola & Jones Face Detection

A cascade of 22 weak classifiers; an image is classified negative if its score falls below the threshold at any stage (sketched below).

- Task: find optimal threshold values on a training set of 1000 images.
- 22 dimensions.
- Each query takes 30–40 seconds.
- Use 1000 DiRect evaluations to maximise the acquisition function.
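The decision rule being tuned can be pictured with a small sketch (names are illustrative; the per-stage scores come from the Viola & Jones weak classifiers):

```python
def cascade_classify(stage_scores, thresholds):
    """Viola & Jones-style cascade over 22 stages.

    An image is rejected (classified negative) as soon as its score at
    some stage falls below that stage's threshold; it is classified
    positive only if it passes every stage. BO tunes the 22 thresholds.
    """
    for score, threshold in zip(stage_scores, thresholds):
        if score < threshold:
            return 0   # rejected at this stage
    return 1           # passed all stages: face detected
```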
Viola & Jones Face Detection

[Figure: objective value (classification accuracy, roughly 65 to 95%) vs number of queries (0 to 300).]
Summary

- The additive assumption improves regret: exponential in D → linear in D.
- The acquisition function is easy to maximise.
- Even when f is not additive, Add-GP-UCB does well in practice.
- Similar results hold for Matérn kernels and in the bandit setting.

Some open questions:
- How to choose (d, M)?
- Can we generalise to other acquisition functions?

Code available: github.com/kirthevasank/add-gp-bandits

Jeff's Talk: Friday 2pm @ Van Gogh.

Thank You.