
Forward and Reverse Gradient-based Hyperparameter Optimization

Luca Franceschi (1,2), Michele Donini (1), Paolo Frasconi (3) and Massimiliano Pontil (1,2)

[email protected]

(1) Istituto Italiano di Tecnologia, (2) University College London, (3) Università degli Studi di Firenze

Optimization and Big Data Summer School, Veroli, July 3-7, 2017


Overview

[Diagram: overview of the hyperparameter optimization loop. Model training fits the machine learning model (parameters) on the training set; its behavior is controlled by training algorithm hyperparameters (algorithm behavior), while capacity/design hyperparameters shape the model itself. Hyperparameter optimization adjusts both kinds of hyperparameters using the validation set.]


Gradient-based Hyperparameter Optimization
Setting

- Model parameters: w ∈ R^n; hyperparameters: λ ∈ R^m
- Training error: J; validation error: E
- "Classic" hyperparameter optimization problem:

      min_λ E(w(λ)),   where w(λ) ∈ arg min_w J(w)

- In practice J is minimized with an iterative scheme

      s_t = Φ_t(s_{t−1}, λ),   t = 1, ..., T,

  where s_t = (w_t, ...) ∈ R^d contains parameters and accessory variables, and Φ_t : R^d × R^m → R^d is a smooth mapping
- Aim: minimize the response function f(λ) = E(s_T(λ))


Gradient-based Hyperparameter Optimization
Example setting: SGD with momentum

- Model parameters: w ∈ R^n
- Accessory variables: v ∈ R^n
- Hyperparameters: λ = (η, µ)
- State: s_t = (w_t, v_t)
- Dynamical system: Φ_t(s_{t−1}, λ) defined by

      v_t = µ v_{t−1} + ∇J_t(w_{t−1})
      w_t = w_{t−1} − η v_t = w_{t−1} − η (µ v_{t−1} + ∇J_t(w_{t−1}))

- For µ = 0 this reduces to plain gradient descent (a minimal code sketch follows below)
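As a concrete illustration, here is a minimal NumPy sketch of the map Φ_t above and of the response function f(λ) = E(s_T(λ)). The quadratic stand-ins for J and E and the names `momentum_step` and `response` are illustrative assumptions, not code from the talk.

```python
import numpy as np

def train_grad(w):
    # Stand-in training gradient: grad of J(w) = 0.5*||w - 1||^2.
    return w - 1.0

def val_error(w):
    # Stand-in validation error: E(w) = 0.5*||w - 2||^2.
    return 0.5 * np.sum((w - 2.0) ** 2)

def momentum_step(state, lam):
    # One application of Phi_t: s_t = Phi_t(s_{t-1}, lambda), with s = (w, v), lambda = (eta, mu).
    w, v = state
    eta, mu = lam
    g = train_grad(w)
    v_new = mu * v + g            # v_t = mu*v_{t-1} + grad J_t(w_{t-1})
    w_new = w - eta * v_new       # w_t = w_{t-1} - eta*v_t
    return w_new, v_new

def response(lam, w0, T=100):
    # f(lambda) = E(s_T(lambda)): unroll the dynamics for T steps, then score on validation.
    state = (w0, np.zeros_like(w0))
    for _ in range(T):
        state = momentum_step(state, lam)
    return val_error(state[0])

print(response(lam=(0.1, 0.9), w0=np.zeros(3)))
```

Setting mu=0 in the example recovers plain gradient descent, as stated above.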


Reverse-mode Computation
Linked to BPTT [Werbos, 1990]

HO as a constrained optimization problem (fixed horizon T):

    min_{λ, s_1, ..., s_T}  E(s_T)
    subject to  s_t = Φ_t(s_{t−1}, λ),   t ∈ {1, ..., T}.

The Lagrangian is

    L(s, λ, α) = E(s_T) + Σ_{t=1}^T α_t (Φ_t(s_{t−1}, λ) − s_t).

Define A_t = ∂Φ_t(s_{t−1}, λ)/∂s_{t−1} and B_t = ∂Φ_t(s_{t−1}, λ)/∂λ. Then

    ∂L/∂s_t = 0   ⟹   α_T = ∇E(s_T),   α_t = α_{t+1} A_{t+1}   for 1 ≤ t ≤ T − 1,

    ∂L/∂λ = ∇E(s_T) Σ_{t=1}^T (A_{t+1} ··· A_T) B_t   ( = ∇f(λ) ),

with the convention that the product A_{t+1} ··· A_T is the identity when t = T.


Forward-mode Computation
Linked to RTRL [Williams and Zipser, 1989]

Direct calculation of the hypergradient ∇f(λ) using the chain rule:

    ∇f(λ) = ∇E(s_T) ds_T/dλ;

    ds_t/dλ = (∂Φ_t(s_{t−1}, λ)/∂s_{t−1}) ds_{t−1}/dλ + ∂Φ_t(s_{t−1}, λ)/∂λ,   t ∈ {1, ..., T}.

Define Z_0 = 0 and Z_t = ds_t/dλ; this gives the recursive equation

    Z_t = A_t Z_{t−1} + B_t,   t ∈ {1, ..., T}.

Thus ∇f(λ) = ∇E(s_T) Z_T = ∇E(s_T) ( Σ_{t=1}^T (A_{t+1} ··· A_T) B_t ).


Computational Complexity
Assuming algorithmic differentiation [Griewank and Walther, 2008]

Time complexity of Φ_t: g(d, m). Space complexity of Φ_t: h(d, m).

Reverse-HG:
    for t = 1 to T do
        s_t ← Φ_t(s_{t−1}, λ)
    end for
    α_T ← ∇E(s_T)
    g ← α_T B_T
    for t = T − 1 downto 1 do
        α_t ← α_{t+1} A_{t+1}
        g ← g + α_t B_t
    end for
    return g

    Time: C_T = O(T g(d, m)); space: C_S = O(T g(d, m)).

Forward-HG:
    Z_0 ← 0
    for t = 1 to T do
        s_t ← Φ_t(s_{t−1}, λ)
        Z_t ← A_t Z_{t−1} + B_t
    end for
    return ∇E(s_T) Z_T

    Time: C_T = O(T m g(d, m)); space: C_S = O(m h(d, m)).

(A runnable sketch of both procedures follows below.)
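To make the two procedures concrete, here is a minimal NumPy sketch of Reverse-HG and Forward-HG for a generic smooth map Φ_t with user-supplied Jacobians A_t and B_t. The scalar linear-dynamics example at the bottom is an illustrative assumption, not code from the paper; it only serves to check that the two modes agree.

```python
import numpy as np

def reverse_hg(phi, A, B, grad_E, s0, lam, T):
    # Reverse-HG: forward pass storing the whole trajectory (the source of the
    # O(T) memory cost), then backward accumulation of alpha_t B_t.
    traj = [s0]
    for _ in range(T):
        traj.append(phi(traj[-1], lam))
    alpha = grad_E(traj[T])                    # alpha_T = grad E(s_T)
    g = alpha @ B(traj[T - 1], lam)            # contribution of t = T
    for t in range(T - 1, 0, -1):              # t = T-1, ..., 1
        alpha = alpha @ A(traj[t], lam)        # alpha_t = alpha_{t+1} A_{t+1}
        g = g + alpha @ B(traj[t - 1], lam)    # accumulate alpha_t B_t
    return g

def forward_hg(phi, A, B, grad_E, s0, lam, T):
    # Forward-HG: propagate Z_t = ds_t/d(lambda) alongside the training dynamics.
    d, m = len(np.atleast_1d(s0)), len(lam)
    s, Z = s0, np.zeros((d, m))                # Z_0 = 0
    for _ in range(T):
        Z = A(s, lam) @ Z + B(s, lam)          # Z_t = A_t Z_{t-1} + B_t (A_t, B_t at s_{t-1})
        s = phi(s, lam)                        # s_t = Phi_t(s_{t-1}, lambda)
    return grad_E(s) @ Z                       # grad f(lambda) = grad E(s_T) Z_T

# Toy check: scalar dynamics s_t = a*s_{t-1} + b, lambda = (a, b), E(s_T) = 0.5*s_T^2.
phi    = lambda s, lam: np.array([lam[0] * s[0] + lam[1]])
A      = lambda s, lam: np.array([[lam[0]]])        # dPhi/ds_{t-1}, evaluated at (s, lambda)
B      = lambda s, lam: np.array([[s[0], 1.0]])     # dPhi/d(a, b),  evaluated at (s, lambda)
grad_E = lambda s: np.array([s[0]])                 # grad of 0.5*s^2

s0, lam, T = np.array([1.0]), np.array([0.9, 0.1]), 20
print(reverse_hg(phi, A, B, grad_E, s0, lam, T))
print(forward_hg(phi, A, B, grad_E, s0, lam, T))    # the two modes agree up to floating point
```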


Real-Time Hyperparameter Optimization (RTHO)

With Forward-HG:

- partial hypergradients are available for every t: ∇f_t(λ) = dE(s_t)/dλ = ∇E(s_t) Z_t;
- no need to specify the final horizon T.

[Diagram: RTHO acts on the model training dynamics (parameters) in real time, using the training set and the validation set to adjust both the algorithm-behavior hyperparameters and the capacity/design hyperparameters.]


RTHO pseudo-code

Choose a hyper-batch size ∆ and a hyper-learning rate α:

RTHO:
    Z_0 ← 0
    for t = 1, 2, ... do
        s_t ← Φ_t(s_{t−1}, λ)
        Z_t ← A_t Z_{t−1} + B_t
        if t mod ∆ = 0 then
            λ ← λ − α ∇E(s_t) Z_t
        end if
    end for

(A NumPy sketch of this loop follows below.)
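A minimal sketch of the RTHO loop in the same NumPy style as the earlier example; the callables `phi`, `A`, `B`, `grad_E` are assumed to be user-supplied, as before, and this is not the authors' implementation.

```python
import numpy as np

def rtho(phi, A, B, grad_E, s0, lam0, hyper_batch=10, hyper_lr=1e-2, steps=1000):
    # Real-time hyperparameter optimization: run the forward recursion for Z_t and
    # take a hypergradient step on lambda every `hyper_batch` iterations.
    s, lam = s0, np.array(lam0, dtype=float)
    Z = np.zeros((len(np.atleast_1d(s0)), len(lam)))   # Z_0 = 0
    for t in range(1, steps + 1):
        Z = A(s, lam) @ Z + B(s, lam)                  # Z_t = A_t Z_{t-1} + B_t
        s = phi(s, lam)                                # s_t = Phi_t(s_{t-1}, lambda)
        if t % hyper_batch == 0:
            lam = lam - hyper_lr * (grad_E(s) @ Z)     # partial hypergradient: grad E(s_t) Z_t
    return s, lam
```

With the callables from the previous sketch, `rtho(phi, A, B, grad_E, s0, lam, hyper_batch=5)` adapts λ = (a, b) while the dynamics are still running, without fixing T in advance.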


Learning Task Interactions

- Many multitask methods require that a task interaction matrix A is given as input to the learning algorithm.
- We employed the following MTL regularizer (a numerical check of the identity is sketched below):

      Ω_{A,ρ}(W) = (1/2) Σ_{t,h=1}^T A_{t,h} ‖w_t − w_h‖²  +  ρ Σ_{t=1}^T ‖w_t‖²          (1)
                 = Σ_{t,h=1}^T ⟨w_t, w_h⟩ (L_{t,h} + ρ δ_{t,h}),   with G⁻¹ = L + ρ I,     (2)

  where L is the graph Laplacian of A: L_{t,h} = δ_{t,h} d_t − A_{t,h}, d_t = Σ_{h=1}^T A_{t,h}.
- In real applications the matrix A is often unknown, and it is interesting to learn it from data.
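A quick NumPy check of the identity between (1) and (2), with random placeholder data; W stacks the task vectors w_t as rows, and all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T_tasks, n, rho = 5, 4, 0.1
W = rng.standard_normal((T_tasks, n))             # row t is the task vector w_t
A = rng.random((T_tasks, T_tasks))
A = 0.5 * (A + A.T)                               # symmetric interaction matrix
np.fill_diagonal(A, 0.0)

# Form (1): pairwise distances plus ridge term.
omega_pairwise = (0.5 * sum(A[t, h] * np.sum((W[t] - W[h]) ** 2)
                            for t in range(T_tasks) for h in range(T_tasks))
                  + rho * np.sum(W ** 2))

# Form (2): sum_{t,h} <w_t, w_h> (L_{t,h} + rho*delta_{t,h}), with L the graph Laplacian.
L = np.diag(A.sum(axis=1)) - A
omega_laplacian = np.sum((W @ W.T) * (L + rho * np.eye(T_tasks)))

print(np.isclose(omega_pairwise, omega_laplacian))   # True
```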


Output kernel learning

- Dinuzzo et al. [2011] propose a method to simultaneously learn a vector-valued function and a kernel among its components.
- ℋ is the RKHS of vector-valued functions associated with the matrix-valued kernel H = KG, where K is a p.s.d. scalar kernel and G is a symmetric p.s.d. matrix linked to A and ρ.
- Their optimization problem is

      min_{G ∈ S^m_+}  min_{g ∈ ℋ}  Σ_{i=1}^n λ ‖g(x_i) − y_i‖²  +  ‖g‖²_ℋ  +  ‖G‖²_F.


Output kernel learning

- By the representer theorem,

      g*(x) = G Σ_{i=1}^n c_i K(x, x_i).

- The optimization problem becomes (a small evaluation sketch follows below)

      min_{G ∈ S^m_+}  min_{C ∈ R^{ℓ×m}}  λ ‖Y − KCG‖²_F  +  ⟨C^T K C, G⟩_F  +  ‖G‖²_F.

- The problem is not jointly convex in G and C. The authors show that every local minimizer is also a global minimizer (the objective is an invex function).
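A small NumPy sketch evaluating this objective for given K, C and G. The data, the Gaussian kernel and all names are illustrative assumptions, and the middle term is written with G to match the notation above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 20, 3, 10.0
X = rng.standard_normal((n, 2))
Y = rng.standard_normal((n, m))
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))   # Gaussian scalar kernel (p.s.d.)
C = rng.standard_normal((n, m))
G = np.eye(m)                                                       # symmetric p.s.d. output kernel

objective = (lam * np.linalg.norm(Y - K @ C @ G, 'fro') ** 2        # data-fit term
             + np.sum((C.T @ K @ C) * G)                            # <C^T K C, G>_F, the RKHS norm of g
             + np.linalg.norm(G, 'fro') ** 2)                       # ||G||_F^2
print(objective)
```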


Output kernel learning

- Jawanpuria et al. [2015] exploit the idea of learning a sparse output kernel.
- The regularizer enforces sparsity of the output kernel, avoiding spurious relations among different tasks and leading to interpretable solutions:

      Ω(G) = ‖G‖_p^p,   p ∈ (1, 2].                                  (3)

- This regularizer can be combined with an arbitrary convex loss function in the resulting minimization problem (see the one-line sketch below).
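For reference, a one-line sketch of the regularizer in (3); here ‖G‖_p is read as the entrywise p-norm, which is an assumption on my part, chosen to match the sparsity interpretation above.

```python
import numpy as np

def output_kernel_penalty(G, p=4/3):
    # Omega(G) = ||G||_p^p, with ||.||_p taken entrywise and p in (1, 2].
    return np.sum(np.abs(G) ** p)
```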


CIFAR

- We use our method to learn the matrix A from the validation set, using Reverse-HG on the CIFAR-10 dataset.


CIFAR

- Test accuracy ± standard deviation on CIFAR-10. The third column is the p-norm used to regularize the task interaction matrix A (where applicable).

      Method         CIFAR-10         p
      STL            67.47 ± 2.78     -
      Naive MTL      69.41 ± 1.90     -
      Dinuzzo        69.96 ± 1.85     2
      Jawanpuria     70.20 ± 1.05     2
      Jawanpuria     70.96 ± 1.04     4/3
      R-HG           70.85 ± 1.87     -
      R-HG-Sparse    71.62 ± 1.34     1


BONUS: data hyper-cleaning

- Consider a dataset with label noise where, due to time or resource constraints, we can only afford to clean up a subset of the available data.
- We use the cleaned data as the validation set and assign one hyperparameter to each training example, i.e. the training loss becomes

      (1 / N_tr) Σ_{i ∈ training set} λ_i ℓ(w, x_i),   with λ_i ∈ [0, 1].

- By putting a sparsity constraint on the vector of hyperparameters, ‖λ‖₁ < R, we bring the influence of noisy examples to zero, using Forward-HG on a "corrupted" MNIST dataset (a sketch of the weighted loss follows below).
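A sketch of the weighted training objective with one hyperparameter per example. The logistic loss and the simple feasibility heuristic are illustrative assumptions; the talk only specifies the weighted loss, λ_i ∈ [0, 1] and the budget ‖λ‖₁ < R.

```python
import numpy as np

def weighted_training_loss(w, X, y, lam_weights):
    # (1/N_tr) * sum_i lambda_i * l(w, x_i): logistic loss as a stand-in for l, with y_i in {-1, +1}.
    margins = y * (X @ w)
    per_example = np.log1p(np.exp(-margins))
    return np.mean(lam_weights * per_example)

def project_weights(lam_weights, R):
    # Crude feasibility heuristic (an assumption, not the authors' method):
    # clip each lambda_i to [0, 1], then rescale if the l1 budget R is exceeded.
    lam_weights = np.clip(lam_weights, 0.0, 1.0)
    l1 = lam_weights.sum()
    return lam_weights * (R / l1) if l1 > R else lam_weights
```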


Reference for this talk

L. Franceschi, M. Donini, P. Frasconi, M. Pontil. Forward and Reverse Gradient-Based Hyperparameter Optimization. Proc. International Conference on Machine Learning, 2017. arXiv:1703.01785.


Additional references

Yoshua Bengio. Gradient-based optimization of hyperparameters. Neural Computation, 12(8):1889–1900, 2000.

Francesco Dinuzzo, Cheng S. Ong, Gianluigi Pillonetto, and Peter V. Gehler. Learning output kernels with block coordinate descent. In ICML, pages 49–56, 2011.

Andreas Griewank and Andrea Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Society for Industrial and Applied Mathematics, second edition, 2008.

Pratik Jawanpuria, Maksim Lapin, Matthias Hein, and Bernt Schiele. Efficient output kernel learning for multiple tasks. In Advances in Neural Information Processing Systems, pages 1189–1197, 2015.

Dougal Maclaurin, David Duvenaud, and Ryan P. Adams. Gradient-based hyperparameter optimization through reversible learning. In ICML, 2015.

Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In ICML, volume 48, pages 737–746, 2016.

Paul J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.

Ronald J. Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280, 1989.
