
Forward and Reverse Gradient-based Hyperparameter Optimization

Luca Franceschi (1,2), Michele Donini (1), Paolo Frasconi (3) and Massimiliano Pontil (1,2)

[email protected]

(1) Istituto Italiano di Tecnologia, (2) University College London, (3) Università degli Studi di Firenze

Optimization and Big Data Summer School, Veroli, July 3-7, 2017


Overview

[Diagram: overview of the hyperparameter optimization loop. Model training fits the machine learning model (parameters) on the training set; its behavior is controlled by training algorithm hyperparameters (algorithm behavior), while capacity/design hyperparameters shape the model itself. Hyperparameter optimization adjusts both kinds of hyperparameters using the validation set.]


Gradient-based Hyperparameter Optimization
Setting

- Model parameters: w ∈ R^n; hyperparameters: λ ∈ R^m
- Training error: J; validation error: E
- "Classic" hyperparameter optimization problem:

      min_λ E(w(λ)),   where w(λ) ∈ arg min_w J(w)

- In practice J is minimized with an iterative scheme

      s_t = Φ_t(s_{t−1}, λ),   t = 1, ..., T,

  where s_t = (w_t, ...) ∈ R^d contains parameters and accessory variables, and Φ_t : R^d × R^m → R^d is a smooth mapping
- Aim: minimize the response function f(λ) = E(s_T(λ))


Gradient-based Hyperparameter Optimization
Example setting: SGD with momentum

- Model parameters: w ∈ R^n
- Accessory variables: v ∈ R^n
- Hyperparameters: λ = (η, µ)
- State: s_t = (w_t, v_t)
- Dynamical system: Φ_t(s_{t−1}, λ) defined by

      v_t = µ v_{t−1} + ∇J_t(w_{t−1})
      w_t = w_{t−1} − η v_t = w_{t−1} − η (µ v_{t−1} + ∇J_t(w_{t−1}))

- For µ = 0 this reduces to plain gradient descent (a minimal code sketch follows below)
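As a concrete illustration, here is a minimal NumPy sketch of the map Φ_t above and of the response function f(λ) = E(s_T(λ)). The quadratic stand-ins for J and E and the names `momentum_step` and `response` are illustrative assumptions, not code from the talk.

```python
import numpy as np

def train_grad(w):
    # Stand-in training gradient: grad of J(w) = 0.5*||w - 1||^2.
    return w - 1.0

def val_error(w):
    # Stand-in validation error: E(w) = 0.5*||w - 2||^2.
    return 0.5 * np.sum((w - 2.0) ** 2)

def momentum_step(state, lam):
    # One application of Phi_t: s_t = Phi_t(s_{t-1}, lambda), with s = (w, v), lambda = (eta, mu).
    w, v = state
    eta, mu = lam
    g = train_grad(w)
    v_new = mu * v + g            # v_t = mu*v_{t-1} + grad J_t(w_{t-1})
    w_new = w - eta * v_new       # w_t = w_{t-1} - eta*v_t
    return w_new, v_new

def response(lam, w0, T=100):
    # f(lambda) = E(s_T(lambda)): unroll the dynamics for T steps, then score on validation.
    state = (w0, np.zeros_like(w0))
    for _ in range(T):
        state = momentum_step(state, lam)
    return val_error(state[0])

print(response(lam=(0.1, 0.9), w0=np.zeros(3)))
```

Setting mu=0 in the example recovers plain gradient descent, as stated above.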


Reverse-mode Computation
Linked to BPTT [Werbos, 1990]

HO as a constrained optimization problem (fixed horizon T):

    min_{λ, s_1, ..., s_T}  E(s_T)
    subject to  s_t = Φ_t(s_{t−1}, λ),   t ∈ {1, ..., T}.

The Lagrangian is

    L(s, λ, α) = E(s_T) + Σ_{t=1}^T α_t (Φ_t(s_{t−1}, λ) − s_t).

Define A_t = ∂Φ_t(s_{t−1}, λ)/∂s_{t−1} and B_t = ∂Φ_t(s_{t−1}, λ)/∂λ. Then

    ∂L/∂s_t = 0   ⟹   α_T = ∇E(s_T),   α_t = α_{t+1} A_{t+1}   for 1 ≤ t ≤ T − 1,

    ∂L/∂λ = ∇E(s_T) Σ_{t=1}^T (A_{t+1} ··· A_T) B_t   ( = ∇f(λ) ),

with the convention that the product A_{t+1} ··· A_T is the identity when t = T.


Forward-mode Computation
Linked to RTRL [Williams and Zipser, 1989]

Direct calculation of the hypergradient ∇f(λ) using the chain rule:

    ∇f(λ) = ∇E(s_T) ds_T/dλ;

    ds_t/dλ = (∂Φ_t(s_{t−1}, λ)/∂s_{t−1}) ds_{t−1}/dλ + ∂Φ_t(s_{t−1}, λ)/∂λ,   t ∈ {1, ..., T}.

Define Z_0 = 0 and Z_t = ds_t/dλ; this gives the recursive equation

    Z_t = A_t Z_{t−1} + B_t,   t ∈ {1, ..., T}.

Thus ∇f(λ) = ∇E(s_T) Z_T = ∇E(s_T) ( Σ_{t=1}^T (A_{t+1} ··· A_T) B_t ).


Computational Complexity
Assuming algorithmic differentiation [Griewank and Walther, 2008]

Time complexity of Φ_t: g(d, m). Space complexity of Φ_t: h(d, m).

Reverse-HG:
    for t = 1 to T do
        s_t ← Φ_t(s_{t−1}, λ)
    end for
    α_T ← ∇E(s_T)
    g ← α_T B_T
    for t = T − 1 downto 1 do
        α_t ← α_{t+1} A_{t+1}
        g ← g + α_t B_t
    end for
    return g

    Time: C_T = O(T g(d, m)); space: C_S = O(T g(d, m)).

Forward-HG:
    Z_0 ← 0
    for t = 1 to T do
        s_t ← Φ_t(s_{t−1}, λ)
        Z_t ← A_t Z_{t−1} + B_t
    end for
    return ∇E(s_T) Z_T

    Time: C_T = O(T m g(d, m)); space: C_S = O(m h(d, m)).

(A runnable sketch of both procedures follows below.)
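To make the two procedures concrete, here is a minimal NumPy sketch of Reverse-HG and Forward-HG for a generic smooth map Φ_t with user-supplied Jacobians A_t and B_t. The scalar linear-dynamics example at the bottom is an illustrative assumption, not code from the paper; it only serves to check that the two modes agree.

```python
import numpy as np

def reverse_hg(phi, A, B, grad_E, s0, lam, T):
    # Reverse-HG: forward pass storing the whole trajectory (the source of the
    # O(T) memory cost), then backward accumulation of alpha_t B_t.
    traj = [s0]
    for _ in range(T):
        traj.append(phi(traj[-1], lam))
    alpha = grad_E(traj[T])                    # alpha_T = grad E(s_T)
    g = alpha @ B(traj[T - 1], lam)            # contribution of t = T
    for t in range(T - 1, 0, -1):              # t = T-1, ..., 1
        alpha = alpha @ A(traj[t], lam)        # alpha_t = alpha_{t+1} A_{t+1}
        g = g + alpha @ B(traj[t - 1], lam)    # accumulate alpha_t B_t
    return g

def forward_hg(phi, A, B, grad_E, s0, lam, T):
    # Forward-HG: propagate Z_t = ds_t/d(lambda) alongside the training dynamics.
    d, m = len(np.atleast_1d(s0)), len(lam)
    s, Z = s0, np.zeros((d, m))                # Z_0 = 0
    for _ in range(T):
        Z = A(s, lam) @ Z + B(s, lam)          # Z_t = A_t Z_{t-1} + B_t (A_t, B_t at s_{t-1})
        s = phi(s, lam)                        # s_t = Phi_t(s_{t-1}, lambda)
    return grad_E(s) @ Z                       # grad f(lambda) = grad E(s_T) Z_T

# Toy check: scalar dynamics s_t = a*s_{t-1} + b, lambda = (a, b), E(s_T) = 0.5*s_T^2.
phi    = lambda s, lam: np.array([lam[0] * s[0] + lam[1]])
A      = lambda s, lam: np.array([[lam[0]]])        # dPhi/ds_{t-1}, evaluated at (s, lambda)
B      = lambda s, lam: np.array([[s[0], 1.0]])     # dPhi/d(a, b),  evaluated at (s, lambda)
grad_E = lambda s: np.array([s[0]])                 # grad of 0.5*s^2

s0, lam, T = np.array([1.0]), np.array([0.9, 0.1]), 20
print(reverse_hg(phi, A, B, grad_E, s0, lam, T))
print(forward_hg(phi, A, B, grad_E, s0, lam, T))    # the two modes agree up to floating point
```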


Real-Time Hyperparameter Optimization (RTHO)

With Forward-HG:

- partial hypergradients are available for every t: ∇f_t(λ) = dE(s_t)/dλ = ∇E(s_t) Z_t;
- no need to specify the final horizon T.

[Diagram: RTHO acts on the model training dynamics (parameters) in real time, using the training set and the validation set to adjust both the algorithm-behavior hyperparameters and the capacity/design hyperparameters.]


RTHO pseudo-code

Choose a hyper-batch size ∆ and a hyper-learning rate α:

RTHO:
    Z_0 ← 0
    for t = 1, 2, ... do
        s_t ← Φ_t(s_{t−1}, λ)
        Z_t ← A_t Z_{t−1} + B_t
        if t mod ∆ = 0 then
            λ ← λ − α ∇E(s_t) Z_t
        end if
    end for

(A NumPy sketch of this loop follows below.)
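A minimal sketch of the RTHO loop in the same NumPy style as the earlier example; the callables `phi`, `A`, `B`, `grad_E` are assumed to be user-supplied, as before, and this is not the authors' implementation.

```python
import numpy as np

def rtho(phi, A, B, grad_E, s0, lam0, hyper_batch=10, hyper_lr=1e-2, steps=1000):
    # Real-time hyperparameter optimization: run the forward recursion for Z_t and
    # take a hypergradient step on lambda every `hyper_batch` iterations.
    s, lam = s0, np.array(lam0, dtype=float)
    Z = np.zeros((len(np.atleast_1d(s0)), len(lam)))   # Z_0 = 0
    for t in range(1, steps + 1):
        Z = A(s, lam) @ Z + B(s, lam)                  # Z_t = A_t Z_{t-1} + B_t
        s = phi(s, lam)                                # s_t = Phi_t(s_{t-1}, lambda)
        if t % hyper_batch == 0:
            lam = lam - hyper_lr * (grad_E(s) @ Z)     # partial hypergradient: grad E(s_t) Z_t
    return s, lam
```

With the callables from the previous sketch, `rtho(phi, A, B, grad_E, s0, lam, hyper_batch=5)` adapts λ = (a, b) while the dynamics are still running, without fixing T in advance.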


Learning Task Interactions

- Many multitask methods require that a task interaction matrix A is given as input to the learning algorithm.
- We employed the following MTL regularizer (a numerical check of the identity is sketched below):

      Ω_{A,ρ}(W) = (1/2) Σ_{t,h=1}^T A_{t,h} ‖w_t − w_h‖²  +  ρ Σ_{t=1}^T ‖w_t‖²          (1)
                 = Σ_{t,h=1}^T ⟨w_t, w_h⟩ (L_{t,h} + ρ δ_{t,h}),   with G⁻¹ = L + ρ I,     (2)

  where L is the graph Laplacian of A: L_{t,h} = δ_{t,h} d_t − A_{t,h}, d_t = Σ_{h=1}^T A_{t,h}.
- In real applications the matrix A is often unknown, and it is interesting to learn it from data.
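A quick NumPy check of the identity between (1) and (2), with random placeholder data; W stacks the task vectors w_t as rows, and all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T_tasks, n, rho = 5, 4, 0.1
W = rng.standard_normal((T_tasks, n))             # row t is the task vector w_t
A = rng.random((T_tasks, T_tasks))
A = 0.5 * (A + A.T)                               # symmetric interaction matrix
np.fill_diagonal(A, 0.0)

# Form (1): pairwise distances plus ridge term.
omega_pairwise = (0.5 * sum(A[t, h] * np.sum((W[t] - W[h]) ** 2)
                            for t in range(T_tasks) for h in range(T_tasks))
                  + rho * np.sum(W ** 2))

# Form (2): sum_{t,h} <w_t, w_h> (L_{t,h} + rho*delta_{t,h}), with L the graph Laplacian.
L = np.diag(A.sum(axis=1)) - A
omega_laplacian = np.sum((W @ W.T) * (L + rho * np.eye(T_tasks)))

print(np.isclose(omega_pairwise, omega_laplacian))   # True
```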


Output kernel learning

- Dinuzzo et al. [2011] propose a method to simultaneously learn a vector-valued function and a kernel among its components.
- ℋ is the RKHS of vector-valued functions associated with the matrix-valued kernel H = KG, where K is a p.s.d. scalar kernel and G is a symmetric p.s.d. matrix linked to A and ρ.
- Their optimization problem is

      min_{G ∈ S^m_+}  min_{g ∈ ℋ}  Σ_{i=1}^n λ ‖g(x_i) − y_i‖²  +  ‖g‖²_ℋ  +  ‖G‖²_F.


Output kernel learning

- By the representer theorem,

      g*(x) = G Σ_{i=1}^n c_i K(x, x_i).

- The optimization problem becomes (a small evaluation sketch follows below)

      min_{G ∈ S^m_+}  min_{C ∈ R^{ℓ×m}}  λ ‖Y − KCG‖²_F  +  ⟨C^T K C, G⟩_F  +  ‖G‖²_F.

- The problem is not jointly convex in G and C. The authors show that every local minimizer is also a global minimizer (the objective is an invex function).
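A small NumPy sketch evaluating this objective for given K, C and G. The data, the Gaussian kernel and all names are illustrative assumptions, and the middle term is written with G to match the notation above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 20, 3, 10.0
X = rng.standard_normal((n, 2))
Y = rng.standard_normal((n, m))
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))   # Gaussian scalar kernel (p.s.d.)
C = rng.standard_normal((n, m))
G = np.eye(m)                                                       # symmetric p.s.d. output kernel

objective = (lam * np.linalg.norm(Y - K @ C @ G, 'fro') ** 2        # data-fit term
             + np.sum((C.T @ K @ C) * G)                            # <C^T K C, G>_F, the RKHS norm of g
             + np.linalg.norm(G, 'fro') ** 2)                       # ||G||_F^2
print(objective)
```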


Output kernel learning

- Jawanpuria et al. [2015] exploit the idea of learning a sparse output kernel.
- The regularizer enforces sparsity of the output kernel, avoiding spurious relations among different tasks and leading to interpretable solutions:

      Ω(G) = ‖G‖_p^p,   p ∈ (1, 2].                                  (3)

- This regularizer can be combined with an arbitrary convex loss function in the resulting minimization problem (see the one-line sketch below).
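For reference, a one-line sketch of the regularizer in (3); here ‖G‖_p is read as the entrywise p-norm, which is an assumption on my part, chosen to match the sparsity interpretation above.

```python
import numpy as np

def output_kernel_penalty(G, p=4/3):
    # Omega(G) = ||G||_p^p, with ||.||_p taken entrywise and p in (1, 2].
    return np.sum(np.abs(G) ** p)
```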


CIFAR

- We use our method to learn the matrix A from the validation set, using Reverse-HG on the CIFAR-10 dataset.


CIFAR

- Test accuracy ± standard deviation on CIFAR-10. The third column is the p-norm used to regularize the task interaction matrix A (where applicable).

      Method         CIFAR-10         p
      STL            67.47 ± 2.78     -
      Naive MTL      69.41 ± 1.90     -
      Dinuzzo        69.96 ± 1.85     2
      Jawanpuria     70.20 ± 1.05     2
      Jawanpuria     70.96 ± 1.04     4/3
      R-HG           70.85 ± 1.87     -
      R-HG-Sparse    71.62 ± 1.34     1


BONUS: data hyper-cleaning

- Consider a dataset with label noise where, due to time or resource constraints, we can only afford to clean up a subset of the available data.
- We use the cleaned data as the validation set and assign one hyperparameter to each training example, i.e. the training loss becomes

      (1 / N_tr) Σ_{i ∈ training set} λ_i ℓ(w, x_i),   with λ_i ∈ [0, 1].

- By putting a sparsity constraint on the vector of hyperparameters, ‖λ‖₁ < R, we bring the influence of noisy examples to zero, using Forward-HG on a "corrupted" MNIST dataset (a sketch of the weighted loss follows below).
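A sketch of the weighted training objective with one hyperparameter per example. The logistic loss and the simple feasibility heuristic are illustrative assumptions; the talk only specifies the weighted loss, λ_i ∈ [0, 1] and the budget ‖λ‖₁ < R.

```python
import numpy as np

def weighted_training_loss(w, X, y, lam_weights):
    # (1/N_tr) * sum_i lambda_i * l(w, x_i): logistic loss as a stand-in for l, with y_i in {-1, +1}.
    margins = y * (X @ w)
    per_example = np.log1p(np.exp(-margins))
    return np.mean(lam_weights * per_example)

def project_weights(lam_weights, R):
    # Crude feasibility heuristic (an assumption, not the authors' method):
    # clip each lambda_i to [0, 1], then rescale if the l1 budget R is exceeded.
    lam_weights = np.clip(lam_weights, 0.0, 1.0)
    l1 = lam_weights.sum()
    return lam_weights * (R / l1) if l1 > R else lam_weights
```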


Reference for this talk

L. Franceschi, M. Donini, P. Frasconi, M. Pontil. Forward and Reverse Gradient-Based Hyperparameter Optimization. Proc. International Conference on Machine Learning, 2017. arXiv:1703.01785.


Additional references

Yoshua Bengio. Gradient-based optimization of hyperparameters. Neural Computation, 12(8):1889–1900, 2000.

Francesco Dinuzzo, Cheng S. Ong, Gianluigi Pillonetto, and Peter V. Gehler. Learning output kernels with block coordinate descent. In ICML, pages 49–56, 2011.

Andreas Griewank and Andrea Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Society for Industrial and Applied Mathematics, second edition, 2008.

Pratik Jawanpuria, Maksim Lapin, Matthias Hein, and Bernt Schiele. Efficient output kernel learning for multiple tasks. In Advances in Neural Information Processing Systems, pages 1189–1197, 2015.

Dougal Maclaurin, David Duvenaud, and Ryan P. Adams. Gradient-based hyperparameter optimization through reversible learning. In ICML, 2015.

Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In ICML, volume 48, pages 737–746, 2016.

Paul J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.

Ronald J. Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280, 1989.
