Transcript of LEARNING GPs WITH BAYESIAN POSTERIOR OPTIMIZATION
(slides: chamon_asilomar2019_slides.pdf, 53 pages)

Page 1:

LEARNING GPs WITH BAYESIAN POSTERIOR OPTIMIZATION

Luiz F. O. Chamon, Santiago Paternain, and Alejandro Ribeiro

ASILOMAR 2019, November 4th, 2019

Pages 2–4:

Dealing with complexity and uncertainty

- Nonparametric methods

- Bayesian methods

[Figure: two plots of the data (x vs. y), x from 0 to 10, y from −20 to 30]

Pages 5–6:

Nonparametric Bayesian methods

- All of Bayesian inference:

  P(model) [Prior] + P(data | model) [Likelihood] → P(model | data) [Posterior]

- Parametric models are finite dimensional: P(model) = P(parameters)

- Nonparametric models are infinite dimensional


Pages 7–8:

Gaussian processes

- GPs are priors on "smooth" functions

  ✓ easy to specify: choose a covariance function (and hyperparameters)
  ✓ flexible: wide variety of covariance functions (degree of smoothness, periodicity, ...)
  ✓ tractable

- Still... which GP?

  ✗ limited access to prior knowledge
  ✗ hard to interpret hyperparameters
  ✗ misspecifying GPs can be catastrophic

[Bachoc'13, Beckers et al.'18, Zaytsev et al.'18]


Pages 9–10:

Which GP?

- Maximize the likelihood w.r.t. the hyperparameters [Stein'99, RW'06]

  – Ambiguity: multimodal likelihood, local maxima
  – Indeterminacy: different parameters, same measure

  θ⋆ = argmax_θ log P(y | X, θ)

[Figure: marginal log-likelihood surfaces over (length scale, noise σ_ε² in dB) and (output scale σ², length scale)]

- Hierarchical models [RG'02, RW'06, Gelman et al.'13]

  – Noninformative priors → improper posteriors
  – Hard to interpret, hard to set priors
  – Indeterminacy: setting one prior affects the others

  θ ∼ P
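For reference, a minimal sketch of this maximum-likelihood baseline is given below. It assumes plain NumPy/SciPy rather than the authors' code, an RBF kernel with θ = (σ², κ, σ_ε²), and hypothetical helper names; the small multistart reflects the multimodality noted above.

```python
# Sketch of type-II maximum likelihood for GP hyperparameters (the baseline above).
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(X1, X2, sigma2, kappa):
    """k(x, x') = sigma2 * exp(-kappa * ||x - x'||^2 / 2) for rows of X1, X2."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return sigma2 * np.exp(-0.5 * kappa * d2)

def neg_log_marginal_likelihood(log_theta, X, y):
    """-log P(y | X, theta), optimized in log space to keep parameters positive."""
    sigma2, kappa, noise = np.exp(log_theta)
    K = rbf_kernel(X, X, sigma2, kappa) + noise * np.eye(len(y))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(y) * np.log(2 * np.pi)

def fit_ml(X, y, starts=((0.0, 0.0, -2.0), (1.0, -2.0, 0.0))):
    """Multistart optimization; different starts can land in different local maxima."""
    best = None
    for x0 in starts:
        res = minimize(neg_log_marginal_likelihood, np.array(x0), args=(X, y))
        if best is None or res.fun < best.fun:
            best = res
    return dict(zip(("sigma2", "kappa", "noise"), np.exp(best.x)))
```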

Pages 11–16:

Bayesian posterior optimization

- Hybrid Bayesian–Optimization approach

  – Bayesian: obtain a distribution over hyperparameters instead of a point estimate
  – Optimization: minimize a risk measure instead of using Bayes' rule

  ✓ Non-convex risk measures (0-1 loss, truncated MSE, ...)
  ✓ Incorporate complex structures in the prior: maximum entropy, sparsity, moments, ...
  ✗ Non-convex, infinite dimensional optimization problem ⇒ simple, efficient solution using duality


Page 17:

Roadmap

The Bayesian part

The optimization part

Solving Bayesian posterior optimization problems

Page 18:

GP regression

- Data: (x_i, y_i) with y_i ∼ N(f(x_i), σ_ε²) for an unknown f

- Goal: determine (f | X, y)

- How? Bayes' rule and a GP prior

Page 19:

Gaussian processes

- A GP is a stochastic process whose finite dimensional marginals are jointly Gaussian

- Formally, GP(m, k) is a distribution over functions g such that [g(x_1) · · · g(x_n)] ∼ N(m, K) for all n ∈ ℕ, with

  m = [m(x_1), ..., m(x_n)]ᵀ  and  K the n × n matrix with entries K_ij = k(x_i, x_j)

- Typically, m ≡ 0
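To make the definition concrete, here is a small sketch (an assumption, not part of the slides) that draws sample paths from GP(0, k) by evaluating the finite-dimensional marginal N(0, K) on a grid, using a squared-exponential covariance:

```python
# Sketch: sample paths of a zero-mean GP via its finite-dimensional marginals.
import numpy as np

def k_rbf(a, b, sigma2=1.0, kappa=1.0):
    """k(x, x') = sigma2 * exp(-kappa * (x - x')^2 / 2) for 1-D inputs."""
    return sigma2 * np.exp(-0.5 * kappa * (a[:, None] - b[None, :]) ** 2)

x = np.linspace(0.0, 10.0, 200)                  # evaluation grid x_1, ..., x_n
K = k_rbf(x, x) + 1e-10 * np.eye(len(x))         # jitter keeps K numerically PSD
rng = np.random.default_rng(0)
g = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
# each row of g is one draw [g(x_1), ..., g(x_n)] ~ N(0, K)
```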

Page 20:

GP regression

- Data: (x_i, y_i) with y_i ∼ N(f(x_i), σ_ε²) for an unknown f

- Goal: determine (f | X, y)

- How? Bayes' rule and a GP prior:

  (f(x̄) | X, y) ∼ N(µ, Σ)

  µ = k̄ᵀ K⁻¹ y,  Σ = k(x̄, x̄) − k̄ᵀ K⁻¹ k̄

  with k̄ = [k(x̄, x_1), ..., k(x̄, x_n)]ᵀ and K the n × n matrix with entries K_ij = k(x_i, x_j)
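A minimal sketch of these posterior formulas, assuming plain NumPy, 1-D inputs, and (as on the later numerical-example slides) the noise term folded into K; the helper names are hypothetical:

```python
import numpy as np

def k_rbf(a, b, sigma2=1.0, kappa=1.0):
    return sigma2 * np.exp(-0.5 * kappa * (a[:, None] - b[None, :]) ** 2)

def gp_posterior(x_train, y_train, x_bar, sigma2=1.0, kappa=1.0, noise=0.1):
    """Posterior mean and variance of f(x_bar) given (x_train, y_train)."""
    K = k_rbf(x_train, x_train, sigma2, kappa) + noise * np.eye(len(x_train))
    k_bar = k_rbf(x_train, x_bar, sigma2, kappa)        # n x m cross-covariances
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mu = k_bar.T @ alpha                                 # k̄ᵀ K⁻¹ y
    v = np.linalg.solve(L, k_bar)
    var = np.diag(k_rbf(x_bar, x_bar, sigma2, kappa)) - np.sum(v * v, axis=0)
    return mu, var                                       # diagonal of k(x̄, x̄) − k̄ᵀ K⁻¹ k̄
```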

Pages 21–26:

Back to Ba(ye)sics

- First level: unknown function f

  P(f | X, y, θ) ∝ P(y | X, f, θ) P(f | X, θ)

  P(f | X, y) = ∫ P(f | X, y, θ) P(θ | X, y) dθ,  where P(θ | X, y) is the θ-posterior

- Second level: hyperparameters θ

  P(θ | X, y) ∝ P(y | X, θ) P(θ | X)

- Issue: choosing P(θ | X) (interpretation, indeterminacy, informativeness)
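One way to make the marginalization concrete is to discretize θ; the sketch below assumes a simple grid approximation (not the slides' method), so that P(f | X, y) becomes a finite mixture of conditional posteriors with weights w_j ∝ P(y | X, θ_j) P(θ_j | X):

```python
import numpy as np

def log_marginal_likelihood(x, y, theta):
    """log P(y | X, θ) for θ = (sigma2, kappa, noise) with an RBF kernel, 1-D inputs."""
    sigma2, kappa, noise = theta
    K = sigma2 * np.exp(-0.5 * kappa * (x[:, None] - x[None, :]) ** 2) + noise * np.eye(len(y))
    L = np.linalg.cholesky(K)
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ a - np.sum(np.log(np.diag(L))) - 0.5 * len(y) * np.log(2 * np.pi)

def theta_posterior_weights(x, y, thetas, log_prior):
    """Grid weights w_j ∝ P(y | X, θ_j) P(θ_j | X), so P(f | X, y) ≈ Σ_j w_j P(f | X, y, θ_j)."""
    logw = np.array([log_marginal_likelihood(x, y, t) for t in thetas]) + np.asarray(log_prior)
    logw -= logw.max()                                  # stabilize before exponentiating
    w = np.exp(logw)
    return w / w.sum()
```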

Page 27:

Roadmap

The Bayesian part

The optimization part

Solving Bayesian posterior optimization problems

Pages 28–29:

Statistical learning

- Statistical learning:

  φ⋆ = argmin_{φ ∈ F}  E_{(x,y)∼D} [ℓ(φ(x), y)] + R(φ)

- Empirical risk minimization:

  φ̂⋆ = argmin_{φ ∈ F}  (1/n) Σ_{i=1}^n ℓ(φ(x_i), y_i) + R(φ)

  – D is an unknown probability distribution over pairs (x, y)
  – ℓ : R² → R₊ is a loss function
  – R is a regularizer
  – F is a space of functions φ : Rᵈ → R
  – Data: (x_i, y_i) ∼ D


Page 30:

Statistical GP learning

- Statistical GP learning:

  Γ⋆ = argmin_{Γ ∈ GP}  E_{(x,y)∼D, f∼Γ} [ℓ(f(x), y)] + R(γ)

- Empirical GP-risk minimization:

  Γ̂⋆ = argmin_{Γ ∈ GP}  (1/n) Σ_{i=1}^n E_{f∼Γ} [ℓ(f(x_i), y_i)] + R(γ)

  – D is an unknown probability distribution over pairs (x, y)
  – ℓ : R² → R₊ is a loss function
  – R is a regularizer
  – GP is the "space of Gaussian processes"
  – Data: (x_i, y_i) ∼ D
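As one concrete instance (an assumption: squared loss, which the slides do not fix here), the inner expectation has a closed form, so the empirical GP-risk of a candidate GP can be read off its marginal means and variances at the data points:

```python
import numpy as np

def empirical_gp_risk_sq(mu, var, y):
    """(1/n) Σ_i E_{f∼Γ}[(f(x_i) - y_i)^2] = (1/n) Σ_i [(µ_i - y_i)^2 + Σ_ii],
    given the candidate GP's marginal means mu and variances var at the x_i."""
    return float(np.mean((mu - y) ** 2 + var))
```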

Pages 31–32:

Back to the first level

- Empirical GP-risk minimization:

  Γ̂⋆ = argmin_{Γ ∈ GP}  (1/n) Σ_{i=1}^n E_{f∼Γ} [ℓ(f(x_i), y_i)] + R(γ)

- Challenge: optimizing over GP (isomorphic to the space of positive semi-definite functions)

- Leverage the first level of the hierarchical model

  P(f | X, y) = ∫ P(f | X, y, θ) P(θ | X, y) dθ

- "Parameterize" (a subset of) GP using P(θ | X, y)

Page 33:

Bayesian posterior optimization

- Bayesian posterior optimization:

  p⋆ = argmin_{p ∈ P}  (1/n) Σ_{i=1}^n ∫ E_{f∼GP(0, k_θ)} [ℓ(f(x_i), y_i)] p(θ) dθ + R(p)

  Γ̂⋆_p = ∫ P(f | X, y, θ) p⋆(θ) dθ

- Optimization variable: p(θ) = P(θ | X, y)

- Alternative interpretation: mixture of GPs with weights p(θ)

  ✗ Non-convex, infinite dimensional optimization problem
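Restricted to a grid of hyperparameters, this objective is linear in the weights p apart from the regularizer, which is what the duality-based solution later exploits; a small sketch under that grid assumption, with a negative-entropy regularizer as an example:

```python
import numpy as np

def bpo_objective(p, risk_per_theta, lam=1e-2):
    """Discretized BPO objective: <risk, p> + λ R(p), here with R(p) = Σ_j p_j log p_j.
    risk_per_theta[j] = (1/n) Σ_i E_{f∼GP(0, k_θj)}[ℓ(f(x_i), y_i)] at grid point θ_j."""
    neg_entropy = np.sum(p * np.log(np.clip(p, 1e-12, None)))
    return float(np.dot(risk_per_theta, p) + lam * neg_entropy)
```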

Page 34:

Roadmap

The Bayesian part

The optimization part

Solving Bayesian posterior optimization problems

Pages 35–40:

Assumptions

- Bayesian posterior optimization:

  minimize_{p ∈ L¹₊}  (1/n) Σ_{i=1}^n ∫ E_{f∼GP(0, k_θ)} [ℓ(f(x_i), y_i)] p(θ) dθ + Σ_{j=1}^m λ_j ∫ h_j[p(θ), θ] dθ

  subject to  ∫ p(θ) dθ = 1

- Measure p is non-atomic and absolutely continuous

- R is a separable functional

- ℓ and h_j are (possibly non-convex) normal integrands

Pages 41–43:

Optimizing posteriors

- Strong duality (the BPO problem is an SFP [Chamon'19])

- Exchangeability of the infimum and integral operators (normal integrand + separability of L¹₊ [Rockafellar'76])

- The Lagrangian can be minimized efficiently, often in closed form (separability + Gaussian integrals)

Page 44:

Optimizing posteriors

1) ℓ̄(θ) = ∫ [ (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i) ] P(f(x̄) | θ) df,  with P(f(x̄) | θ) = N(f(x̄) | µ_θ, Σ_θ)   [Gauss–Hermite quadrature]

2) p_d(θ, µ) = argmin_{p ≥ 0}  (ℓ̄(θ) + µ) p + Σ_{j=1}^m λ_j h_j(p, θ)   [(often) closed form]

3) µ⋆ = argmax_{µ ∈ R}  L(p_d(θ, µ), µ)   [SGD or PBA]

4) p⋆(θ) = p_d(µ⋆, θ)
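A compact sketch of steps 1)–4) under simplifying assumptions: θ discretized on a grid with cell volumes dθ, squared loss, and the regularizer fixed to λ p log p so that steps 2)–3) are available in closed form; none of this is the authors' implementation.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def lbar(mu, var, y, order=20):
    """Step 1: ℓ̄(θ) = (1/n) Σ_i E_{f_i∼N(µ_i, Σ_ii)}[(f_i - y_i)^2] via Gauss–Hermite,
    given the Gaussian marginals (mu, var) of f at the data points under hyperparameters θ."""
    t, w = hermgauss(order)                               # nodes/weights for ∫ e^{-t²} g(t) dt
    f = mu[:, None] + np.sqrt(2.0 * var)[:, None] * t[None, :]
    per_point = (w * (f - y[:, None]) ** 2).sum(axis=1) / np.sqrt(np.pi)
    return per_point.mean()

def solve_bpo(lbar_grid, dtheta, lam=0.1):
    """Steps 2)-4) for h(p) = λ p log p: pointwise, p_d(θ, µ) = exp(-(ℓ̄(θ) + µ)/λ - 1);
    maximizing the dual over µ then amounts to enforcing ∫ p dθ = 1, which gives
    µ⋆ = λ (log ∫ exp(-ℓ̄/λ) dθ - 1) and p⋆(θ) ∝ exp(-ℓ̄(θ)/λ)."""
    z = np.sum(np.exp(-lbar_grid / lam) * dtheta)
    mu_star = lam * (np.log(z) - 1.0)
    p_star = np.exp(-(lbar_grid + mu_star) / lam - 1.0)
    return p_star, mu_star
```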

Pages 45–47:

Numerical examples

- GP (RBF): σ² = 1, κ = 1, and σ_ε² = 10⁻¹

  k(x, x′) = σ² exp[ −κ ‖x − x′‖² / 2 ] + σ_ε² I

[Figure: sample data (x from 0 to 10, y from −2 to 2) and the marginal log-likelihood surface over length scale (κ) and noise (σ_ε², dB)]
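A sketch reproducing this setup (an assumption about data-generation details the slides do not fully specify):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
x = np.sort(rng.uniform(0.0, 10.0, n))                          # inputs on [0, 10]
K = 1.0 * np.exp(-0.5 * 1.0 * (x[:, None] - x[None, :]) ** 2)   # σ² = 1, κ = 1
f = rng.multivariate_normal(np.zeros(n), K + 1e-10 * np.eye(n))
y = f + np.sqrt(1e-1) * rng.standard_normal(n)                  # additive noise, σ_ε² = 10⁻¹
```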

Page 48:

Numerical examples

- Loss: ℓ₂-norm

- Regularization: negative entropy

[Figure: learned hyperparameter posterior over length scale (κ) and noise (σ_ε², dB)]

Page 49:

Numerical examples

- Loss: ℓ₂-norm

- Regularization: negative entropy + L0

[Figure: learned hyperparameter posterior over length scale (κ) and noise (σ_ε², dB)]

Page 50:

Numerical examples

- Loss: ℓ₂-norm

- Regularization: negative entropy + L0 + E[κ]

[Figure: learned hyperparameter posterior over length scale (κ) and noise (σ_ε², dB)]

Page 51:

Numerical examples

- Loss: leave-one-out ℓ₂-norm

- Regularization: negative entropy + L0

[Figure: learned hyperparameter posterior over length scale (κ) and noise (σ_ε², dB)]
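The leave-one-out ℓ₂ loss can be evaluated without refitting the GP, using the standard leave-one-out identities for GP regression (e.g., Rasmussen & Williams 2006); a sketch under that assumption:

```python
import numpy as np

def loo_sq_error(K, y):
    """Mean leave-one-out squared error for GP regression: with µ_{-i} the prediction of
    y_i from the other points, y_i - µ_{-i} = [K⁻¹ y]_i / [K⁻¹]_ii (K includes the noise term)."""
    Kinv = np.linalg.inv(K)
    residual = (Kinv @ y) / np.diag(Kinv)
    return float(np.mean(residual ** 2))
```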

Page 52:

Conclusion

- Priors for nonparametric Bayesian methods are hard to specify, and learning them from data is challenging

- Bayesian posterior optimization: replace the prior with a statistical optimization problem

- Despite the non-convexity and infinite dimensionality, posterior optimization problems can be solved efficiently

Page 53:

LEARNING GPs WITH BAYESIAN POSTERIOR OPTIMIZATION

Luiz F. O. Chamon, Santiago Paternain, and Alejandro Ribeiro

ASILOMAR 2019, November 4th, 2019