Entropy-SGD: biasing gradient descent into wide valleys


Transcript of Entropy-SGD: biasing gradient descent into wide valleys

Page 1: Entropy-SGD: biasing gradient descent into wide valleys (vision.ucla.edu/~pratikac/pub/chaudhari.choromanska.ea.iclr17...)

Entropy-SGD: biasing gradient descent into wide valleys

Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, Riccardo Zecchina

Empirical validation

Hessian of small-LeNet at an optimum

[Figure: histograms of the Hessian eigenvalues at an optimum, in two panels. Left: eigenvalues from about -5 to 40, frequencies on a log scale up to 10^5. Right: zoom on the negative eigenvalues in [-0.5, 0.0], frequencies between 10^2 and 10^4. Annotation: short negative tail.]
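The spectrum in this figure is the set of eigenvalues of the Hessian of the training loss with respect to the weights. A minimal sketch of how such a spectrum can be computed, assuming PyTorch and using a tiny stand-in model with random data in place of the poster's small-LeNet:

    # Minimal sketch, assuming PyTorch is available. "model", the data, and the
    # sizes below are stand-ins, not the small-LeNet setup from the poster.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(10, 8), nn.Tanh(), nn.Linear(8, 2))
    x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))
    loss_fn = nn.CrossEntropyLoss()

    params = list(model.parameters())
    flat = torch.cat([p.detach().reshape(-1) for p in params])  # all 106 weights as one vector

    def loss_of(w):
        # Rebuild each layer's tensors from the flat vector so autograd can form d^2 L / dw^2.
        chunks, i = [], 0
        for p in params:
            chunks.append(w[i:i + p.numel()].view_as(p))
            i += p.numel()
        h = torch.tanh(x @ chunks[0].t() + chunks[1])   # first Linear + Tanh
        logits = h @ chunks[2].t() + chunks[3]          # second Linear
        return loss_fn(logits, y)

    H = torch.autograd.functional.hessian(loss_of, flat)   # dense 106 x 106 Hessian
    eigs = torch.linalg.eigvalsh(H)
    print(eigs.min().item(), eigs.max().item())
    # Train the model to a minimum first to reproduce the poster's observation:
    # a long positive tail and only a short negative one.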

All-CNN on CIFAR-10
[Figure: % error and cross-entropy loss vs. epochs × L for SGD and Entropy-SGD. Final error: 7.71% (SGD) vs. 7.81% (Entropy-SGD); final cross-entropy loss: 0.0353 (SGD) vs. 0.0336 (Entropy-SGD).]

PTB and char-RNN
[Figure: perplexity vs. epochs × L. char-RNN: 1.224 for Adam (test 1.226) vs. 1.213 for Entropy-Adam (test 1.217). PTB: 81.43 for SGD (test 78.6) vs. 80.116 for Entropy-SGD (test 77.656).]

Motivation

‣ What is the shape of the energy landscape?
‣ Does geometry connect to generalization (vs. the complexity of training)?
‣ Reinforce SGD with properties of the loss function

Discrete Perceptrons

‣ Local entropy counts the number of solutions in a neighborhood (see the brute-force sketch below):

      F(x, d) = \log \left| \left\{ x' : \#\mathrm{mistakes}(x') = 0,\ \| x - x' \| = d \right\} \right|

  Dense clusters of solutions provably generalize better; they are absent in the standard replica analysis, which only captures isolated solutions.
‣ Simulated annealing fails on such landscapes (slow-down); Braunstein, Zecchina '05; Baldassi et al. '16.
‣ Local entropy amplifies wide minima.
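To make the counting definition concrete, here is a small brute-force illustration; the toy binary perceptron, the random patterns, and the choice of the teacher as the reference point x are my own illustrative assumptions, not the poster's experiments.

    # Count zero-error weight vectors of a tiny +/-1 perceptron at each Hamming
    # distance d from a reference solution: the discrete local entropy F(x, d).
    # Toy sizes chosen so exhaustive enumeration (2^11 vectors) is instant.
    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    n, P = 11, 8
    teacher = rng.choice([-1, 1], size=n)          # reference solution x
    X = rng.choice([-1, 1], size=(P, n))           # P random +/-1 patterns
    y = np.sign(X @ teacher)                       # labels are realizable by construction

    def mistakes(w):
        return int(np.sum(np.sign(X @ w) != y))

    solutions = [np.array(w) for w in itertools.product([-1, 1], repeat=n)
                 if mistakes(np.array(w)) == 0]

    for d in range(n + 1):
        count = sum(1 for w in solutions if int(np.sum(w != teacher)) == d)
        if count:
            print(f"d = {d:2d}   #solutions = {count:4d}   F(x, d) = {np.log(count):.2f}")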

Modify the loss function

‣ Smooth the original loss using a convolution, the "local entropy":

      F(x, \gamma) = -\log \big( G_\gamma * e^{-f(x)} \big)

  Here f is the original loss, G_\gamma is a Gaussian kernel, and \gamma is the "scope" of the smoothing.

[Figure: one-dimensional example. The original loss f(x) is plotted for x in [-0.5, 1.5] together with F(x, 10^3) and F(x, 2 × 10^4), with the points x_candidate, x̂_F and x̂_f marked. The original global minimum, a narrow valley, is no longer the global minimum of F; the new global minimum lies in a wide valley.]
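A numerical sanity check of this definition; the one-dimensional loss, the integration grid, and the value of gamma below are toy choices of mine, not the curve from the poster's figure. Evaluating F(x, γ) by direct quadrature shows the global minimum moving from the sharp valley of f to the wide one.

    # Evaluate F(x, gamma) = -log( (G_gamma * exp(-f))(x) ) by quadrature for a
    # hand-made 1-D loss with one deep-but-narrow valley and one wide valley.
    import numpy as np

    def f(x):
        narrow = -1.2 * np.exp(-(x - 0.25) ** 2 / (2 * 0.01 ** 2))   # deep, narrow valley
        wide = -1.0 * np.exp(-(x - 1.00) ** 2 / (2 * 0.15 ** 2))     # shallower, wide valley
        return narrow + wide

    xs = np.linspace(-0.5, 1.5, 4001)      # integration grid for x'
    dx = xs[1] - xs[0]

    def local_entropy_loss(x, gamma):
        kernel = np.exp(-(x - xs) ** 2 / (2 * gamma))   # Gaussian kernel, variance gamma (the "scope")
        return -np.log(np.sum(kernel * np.exp(-f(xs))) * dx)

    grid = np.linspace(-0.5, 1.5, 801)
    F = np.array([local_entropy_loss(x, gamma=0.02) for x in grid])
    print("argmin of f:", grid[np.argmin(f(grid))])   # ~0.25, the narrow valley
    print("argmin of F:", grid[np.argmin(F)])         # ~1.0, the wide valley

With a scope of this size the narrow valley contributes little mass to the convolution, so F prefers the wide valley even though the narrow one is deeper.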

‣ The modified energy landscape is smoother by a factor of 1 / (1 + \gamma c), provided there exists c > 0 such that the Hessian eigenvalues avoid an interval around zero:

      \lambda\big(\nabla^2 f(x)\big) \notin [-2\gamma^{-1},\ c]

‣ Theorem (bound the generalization error using uniform stability; Hardt et al., '15): if f(x) is \alpha-Lipschitz and \beta-smooth, then

      \varepsilon_{\mathrm{Entropy\text{-}SGD}} \;\le\; \Big(\frac{\alpha}{T}\Big) \Big[ 1 - \frac{1}{1 + \gamma c} \Big]^{\beta} \varepsilon_{\mathrm{SGD}}

  A one-dimensional worked example of the smoothing factor follows below.
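As a quick check of the smoothing factor, here is a one-dimensional worked example of my own (a pure quadratic, so it only illustrates the positive-curvature case). Substituting f(x) = (\lambda/2) x^2 with \lambda \ge c > 0 into the definition of F and evaluating the Gaussian integral gives

    F(x, \gamma)
      = -\log \int \exp\!\Big( -\tfrac{\lambda}{2} {x'}^{2} - \tfrac{1}{2\gamma} (x - x')^{2} \Big) \, dx' + \text{const}
      = \frac{\lambda}{2 (1 + \gamma \lambda)} \, x^{2} + \text{const}'

so the curvature drops from \lambda to \lambda / (1 + \gamma\lambda): a reduction by the factor 1 / (1 + \gamma\lambda), which is at least as strong as the 1 / (1 + \gamma c) quoted above since \lambda \ge c.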

‣ Gradient of the local entropy:

      \nabla F(x, \gamma) = \gamma^{-1} \big( x - \langle x' \rangle \big)

  where \langle x' \rangle is the expected value of a local Gibbs distribution, which focuses on a neighborhood of the current point:

      \langle x' \rangle = \frac{1}{Z(x)} \int x' \, \exp\!\Big( -f(x') - \tfrac{1}{2\gamma} \| x - x' \|^{2} \Big) \, dx'

‣ Estimate the gradient using MCMC, i.e. Langevin dynamics (Baldassi et al., '15); a sketch of the resulting update follows below.
‣ Decrease the scope \gamma with training iterations.
‣ Extremely general and scalable.
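Putting these pieces together, here is a minimal sketch of the resulting update in NumPy; the toy loss, step sizes, averaging constant, number of inner steps, and the fixed scope are illustrative assumptions of mine, not the authors' hyper-parameters (the poster additionally decreases the scope during training).

    # One Entropy-SGD step: an inner SGLD chain samples the local Gibbs measure
    # around x, its running average estimates <x'>, and the outer step follows
    # grad F(x, gamma) = (x - <x'>) / gamma.
    import numpy as np

    rng = np.random.default_rng(0)

    def grad_f(x):
        # gradient of the toy loss f(x) = sum(x**4 - x**2); use a minibatch gradient in practice
        return 4 * x ** 3 - 2 * x

    def entropy_sgd_step(x, gamma=0.1, eta_inner=1e-3, eta_outer=0.05, L=20, alpha=0.75):
        xp = x.copy()      # x': the SGLD iterate
        mu = x.copy()      # running estimate of <x'>
        for _ in range(L):
            # Langevin step on the local potential f(x') + ||x' - x||^2 / (2 * gamma)
            g = grad_f(xp) + (xp - x) / gamma
            xp = xp - eta_inner * g + np.sqrt(2 * eta_inner) * rng.standard_normal(x.shape)
            mu = alpha * mu + (1 - alpha) * xp      # exponential average of the chain
        return x - eta_outer * (x - mu) / gamma     # x <- x - eta * grad F(x, gamma)

    x = rng.standard_normal(5)
    for _ in range(200):
        x = entropy_sgd_step(x)
    print(x)   # entries drift toward the minima of x**4 - x**2 at +/- 1/sqrt(2), up to SGLD noise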