High Dimensional Bayesian Optimisation and Bandits via Additive Models
Kirthevasan Kandasamy, Jeff Schneider, Barnabas Poczos
ICML ’15
July 8, 2015
Bandits & Optimisation

Motivating example: maximum likelihood inference in computational astrophysics.
A cosmological simulator maps parameters (e.g. the Hubble constant, the baryonic density) to an observation; the simulator is an expensive blackbox function.

Other examples:
- Hyper-parameter tuning in ML
- Optimal control strategies in Robotics
Bandits & Optimisation

f : [0, 1]^D → R is an expensive, black-box, nonconvex function.
Let x_* = argmax_x f(x).

[Figure: a one-dimensional f on [0, 1], with the maximiser x_* and the value f(x_*) marked.]

Optimisation ≅ minimise Simple Regret:

    S_T = f(x_*) − max_{t=1,...,T} f(x_t).

Bandits ≅ minimise Cumulative Regret:

    R_T = ∑_{t=1}^{T} ( f(x_*) − f(x_t) ).
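To make the two notions concrete, here is a minimal NumPy sketch (function and variable names are illustrative, not from the paper) computing both regrets from a run of T queries:

```python
import numpy as np

def regrets(f_star, f_queries):
    """Simple and cumulative regret after T queries.

    f_star    : f(x_*), the optimum value (known only in synthetic benchmarks).
    f_queries : the observed values f(x_1), ..., f(x_T).
    """
    f_queries = np.asarray(f_queries, dtype=float)
    simple = f_star - f_queries.max()          # S_T = f(x_*) - max_t f(x_t)
    cumulative = np.sum(f_star - f_queries)    # R_T = sum_t (f(x_*) - f(x_t))
    return simple, cumulative
```

Note that S_T ≤ R_T / T, so any algorithm with sublinear cumulative regret also drives the simple regret to zero.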
Gaussian Process (Bayesian) Optimisation

Model f ∼ GP(0, κ). At each step t, obtain the posterior GP, then maximise an acquisition function ϕ_t:  x_t = argmax_x ϕ_t(x).

[Figure: posterior GP on f (top) and acquisition ϕ_t (bottom) over [0, 1], with the maximiser x_t = 0.828 marked.]

GP-UCB:  ϕ_t(x) = μ_{t−1}(x) + β_t^{1/2} σ_{t−1}(x)   (Srinivas et al. 2010)

Other choices of ϕ_t: Expected Improvement (GP-EI), Thompson Sampling, etc.
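As a reference point, here is a minimal one-dimensional sketch of a single GP-UCB step, assuming an SE kernel and a plain grid search for the acquisition maximisation; the kernel parameters A and h, the noise level, and all names are illustrative choices rather than the paper's implementation:

```python
import numpy as np

def se_kernel(X1, X2, A=1.0, h=0.1):
    """Squared exponential kernel: A * exp(-||x - x'||^2 / (2 h^2))."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return A * np.exp(-sq / (2.0 * h ** 2))

def gp_posterior(X, y, Xs, noise=1e-4):
    """Posterior mean and std of f ~ GP(0, kappa) at test points Xs."""
    K = se_kernel(X, X) + noise * np.eye(len(X))
    Ks = se_kernel(X, Xs)                                 # train-test covariances
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # alpha = K^{-1} y
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = se_kernel(Xs, Xs).diagonal() - (v ** 2).sum(axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))

def gp_ucb_step(X, y, beta_t, n_grid=1000):
    """One GP-UCB step: x_t = argmax_x mu_{t-1}(x) + beta_t^{1/2} sigma_{t-1}(x)."""
    Xs = np.linspace(0.0, 1.0, n_grid)[:, None]           # grid over [0, 1]
    mu, sigma = gp_posterior(X, y, Xs)
    phi = mu + np.sqrt(beta_t) * sigma                    # UCB acquisition
    return Xs[np.argmax(phi)]
```

With the first t − 1 observations stacked in X (shape (t−1, 1)) and y, `gp_ucb_step(X, y, beta_t)` returns the next query point; Srinivas et al. (2010) give a schedule for β_t.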
Scaling to Higher Dimensions

Two Key Challenges:
- Statistical difficulty: nonparametric sample complexity is exponential in D.
- Computational difficulty: optimising ϕ_t to within ζ accuracy requires O(ζ^{−D}) effort.

Existing Work:
- (Chen et al. 2012): f depends on a small number of variables. Find the variables, then run GP-UCB.
- (Wang et al. 2013): f varies along a lower-dimensional subspace. Run GP-EI on a random subspace.
- (Djolonga et al. 2013): f varies along a lower-dimensional subspace. Find the subspace, then run GP-UCB.

All of these assume f varies only along a low-dimensional subspace and perform BO on that subspace; the assumption is too strong in realistic settings.
Additive Functions

Structural assumption:

    f(x) = f^{(1)}(x^{(1)}) + f^{(2)}(x^{(2)}) + ... + f^{(M)}(x^{(M)}),

where x^{(j)} ∈ X^{(j)} = [0, 1]^d, d ≪ D, and x^{(i)} ∩ x^{(j)} = ∅.

E.g. f(x^{{1,...,10}}) = f^{(1)}(x^{{1,3,9}}) + f^{(2)}(x^{{2,4,8}}) + f^{(3)}(x^{{5,6,10}})
(coordinate 7 does not appear). Call {X^{(j)}}_{j=1}^{M} = {(1,3,9), (2,4,8), (5,6,10)} the "decomposition".

Assume each f^{(j)} ∼ GP(0, κ^{(j)}). Then f ∼ GP(0, κ), where

    κ(x, x′) = κ^{(1)}(x^{(1)}, x^{(1)′}) + ... + κ^{(M)}(x^{(M)}, x^{(M)′}).

Given (X, Y) = {(x_i, y_i)}_{i=1}^{T} and a test point x_†,

    f^{(j)}(x_†^{(j)}) | X, Y ∼ N( μ^{(j)}, σ^{(j)2} ).
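A minimal sketch of this sum-of-kernels construction (0-indexed coordinates; A, h, and the names are illustrative):

```python
import numpy as np

def additive_se_kernel(X1, X2, groups, A=1.0, h=0.1):
    """kappa(x, x') = sum_j kappa^{(j)}(x^{(j)}, x^{(j)'}), SE components.

    groups : disjoint coordinate index tuples, e.g.
             [(0, 2, 8), (1, 3, 7), (4, 5, 9)], the 0-indexed version of
             the decomposition {(1,3,9), (2,4,8), (5,6,10)} above.
    """
    K = np.zeros((len(X1), len(X2)))
    for g in groups:
        G1, G2 = X1[:, list(g)], X2[:, list(g)]
        sq = ((G1[:, None, :] - G2[None, :, :]) ** 2).sum(-1)
        K += A * np.exp(-sq / (2.0 * h ** 2))   # one SE kernel per group
    return K
```

Since κ is a plain sum, this kernel drops straight into a standard GP posterior computation; the posterior of an individual f^{(j)} is obtained by using κ^{(j)} for the test-train cross-covariance while keeping the full additive κ for the train-train matrix.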
Outline

1. GP-UCB
2. The Add-GP-UCB algorithm
   - Bounds on S_T: exponential in D → linear in D.
   - An easy-to-optimise acquisition function.
   - Performs well even when f is not additive.
3. Experiments
4. Conclusion & some open questions
GP-UCB

    x_t = argmax_{x ∈ X}  μ_{t−1}(x) + β_t^{1/2} σ_{t−1}(x)

Squared Exponential (SE) Kernel:

    κ(x, x′) = A exp( −‖x − x′‖² / (2h²) )

Theorem (Srinivas et al. 2010)
Let f ∼ GP(0, κ). Then, w.h.p.,

    S_T ∈ O( √( D^D (log T)^D / T ) ).
GP-UCB on additive κ

If f ∼ GP(0, κ) where

    κ(x, x′) = κ^{(1)}(x^{(1)}, x^{(1)′}) + ... + κ^{(M)}(x^{(M)}, x^{(M)′}),

with each κ^{(j)} an SE kernel, it can be shown that

    S_T ∈ O( √( D² d^d (log T)^d / T ) ).

But ϕ_t = μ_{t−1} + β_t^{1/2} σ_{t−1} is still D-dimensional!
Add-GP-UCB

    ϕ_t(x) = ∑_{j=1}^{M} [ μ^{(j)}_{t−1}(x^{(j)}) + β_t^{1/2} σ^{(j)}_{t−1}(x^{(j)}) ],

where the j-th summand is ϕ^{(j)}_t(x^{(j)}). Maximise each ϕ^{(j)}_t separately.
Requires only O(poly(D) ζ^{−d}) effort (vs O(ζ^{−D}) for GP-UCB).

Theorem
Let f^{(j)} ∼ GP(0, κ^{(j)}) and f = ∑_j f^{(j)}. Then, w.h.p.,

    S_T ∈ O( √( D² d^d (log T)^d / T ) ).
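Here is a minimal sketch of the per-group maximisation, again via grid search; the `posteriors[j]` interface (returning μ^{(j)} and σ^{(j)} at d-dimensional test points) is a hypothetical stand-in for whatever GP machinery is in use:

```python
import numpy as np

def add_gp_ucb_step(posteriors, groups, D, beta_t, n_grid=50):
    """Maximise each phi^{(j)}_t over its own d coordinates, then concatenate.

    posteriors : list of callables, posteriors[j](Xs) -> (mu_j, sigma_j),
                 the posterior of f^{(j)} at d-dimensional points Xs.
    groups     : the decomposition, disjoint coordinate index tuples.
    """
    x_t = np.empty(D)
    for j, g in enumerate(groups):
        d = len(g)
        # A grid over [0, 1]^d needs O(zeta^{-d}) points, not O(zeta^{-D}).
        axes = [np.linspace(0.0, 1.0, n_grid)] * d
        Xs = np.stack(np.meshgrid(*axes), axis=-1).reshape(-1, d)
        mu_j, sigma_j = posteriors[j](Xs)
        phi_j = mu_j + np.sqrt(beta_t) * sigma_j   # phi^{(j)}_t
        x_t[list(g)] = Xs[np.argmax(phi_j)]        # fill this group's coordinates
    return x_t
```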
Summary of Theoretical Results (for SE Kernel)

GP-UCB, no assumption on f:
    S_T ∈ O( D^{D/2} (log T)^{D/2} T^{−1/2} ).
    Maximising ϕ_t: O(ζ^{−D}) effort.

GP-UCB on additive f:
    S_T ∈ O( D T^{−1/2} ).
    Maximising ϕ_t: O(ζ^{−D}) effort.

Add-GP-UCB on additive f:
    S_T ∈ O( D T^{−1/2} ).
    Maximising ϕ_t: O(poly(D) ζ^{−d}) effort.
Add-GP-UCB illustration:  f(x^{{1,2}}) = f^{(1)}(x^{{1}}) + f^{(2)}(x^{{2}})

[Figure: posterior GPs and acquisitions for the two one-dimensional groups.
Maximising ϕ^{(1)} over x^{{1}} gives x_t^{(1)} = 0.869; maximising ϕ^{(2)} over x^{{2}} gives x_t^{(2)} = 0.141; together x_t = (0.869, 0.141).]
Additive modeling in non-additive settings

- Additive models are common in high-dimensional regression,
  e.g. backfitting, MARS, COSSO, RODEO, SpAM:
  f(x^{{1,...,D}}) = f(x^{{1}}) + f(x^{{2}}) + ... + f(x^{{D}}).
- Additive models are statistically simpler =⇒ worse bias, but much better variance in the low-sample regime.
- In BO applications queries are expensive, so we usually cannot afford many of them.
- Observation: Add-GP-UCB does well even when f is not additive.
  - Better bias/variance trade-off in high-dimensional regression.
  - Easy-to-maximise acquisition function.
Unknown Kernel / Decomposition in Practice

Learn the kernel hyper-parameters and the decomposition {X^{(j)}} by maximising the GP marginal likelihood periodically.
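One simple way to search over decompositions is random restarts scored by the log marginal likelihood; the sketch below (reusing `additive_se_kernel` from the earlier sketch, with an illustrative number of trials and noise jitter, and with kernel hyper-parameter fitting omitted) is an assumed illustration of how this could look, not the paper's exact procedure:

```python
import numpy as np

def learn_decomposition(X, y, M, n_tries=20, seed=0):
    """Pick the decomposition maximising the GP marginal likelihood.

    Randomly partitions the D coordinates into M groups and scores each
    partition by log N(y | 0, K) under the additive SE kernel.
    """
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    best, best_ll = None, -np.inf
    for _ in range(n_tries):
        groups = [tuple(g) for g in np.array_split(rng.permutation(D), M)]
        K = additive_se_kernel(X, X, groups) + 1e-4 * np.eye(len(X))
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        ll = -0.5 * y @ alpha - np.log(np.diag(L)).sum()  # log-lik., up to a constant
        if ll > best_ll:
            best, best_ll = groups, ll
    return best
```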
Experiments

[Figure: regret (log-scale y-axis) vs number of queries on synthetic functions.
Add-∗ knows the true decomposition; Add-d/M uses M groups of size ≤ d.
Shown for acquisition-maximisation budgets of 1000 and 4000 DiRect evaluations.]

DiRect: Dividing Rectangles (Jones et al. 1993)
SDSS Luminous Red Galaxies

Cosmological simulator: parameters (e.g. the Hubble constant, the baryonic density) → observation.

- Task: find maximum-likelihood cosmological parameters.
- 20 dimensions, but only 9 parameters are relevant.
- Each query takes 2–5 seconds.
- Use 500 DiRect evaluations to maximise the acquisition function.
SDSS Luminous Red Galaxies

[Figure: best log-likelihood found (−10³ to −10¹, log scale) vs number of queries (0 to 400); baselines include REMBO (Wang et al. 2013).]
Viola & Jones Face Detection

A cascade of 22 weak classifiers; an image is classified negative if its score falls below the threshold at any stage (sketched below).

- Task: find optimal threshold values on a training set of 1000 images.
- 22 dimensions.
- Each query takes 30–40 seconds.
- Use 1000 DiRect evaluations to maximise the acquisition function.
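The decision rule being tuned can be pictured with a small sketch (names are illustrative; the per-stage scores come from the Viola & Jones weak classifiers):

```python
def cascade_classify(stage_scores, thresholds):
    """Viola & Jones-style cascade over 22 stages.

    An image is rejected (classified negative) as soon as its score at
    some stage falls below that stage's threshold; it is classified
    positive only if it passes every stage. BO tunes the 22 thresholds.
    """
    for score, threshold in zip(stage_scores, thresholds):
        if score < threshold:
            return 0   # rejected at this stage
    return 1           # passed all stages: face detected
```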
Viola & Jones Face Detection

[Figure: objective value (classification accuracy, roughly 65 to 95%) vs number of queries (0 to 300).]
Summary

- The additive assumption improves regret: exponential in D → linear in D.
- The acquisition function is easy to maximise.
- Even when f is not additive, Add-GP-UCB does well in practice.
- Similar results hold for Matérn kernels and in the bandit setting.

Some open questions:
- How to choose (d, M)?
- Can we generalise to other acquisition functions?

Code available: github.com/kirthevasank/add-gp-bandits

Jeff's Talk: Friday 2pm @ Van Gogh.

Thank You.