Transcript of “Deep Learning via Hessian-free Optimization” (jmartens/docs/HF_talk.pdf, 2010-06-29)
Deep Learning via Hessian-free Optimization
James Martens
University of Toronto
June 29, 2010
UNIVERSITY OF TORONTO
Computer Science
James Martens (U of T) Deep Learning via HF June 29, 2010 1 / 23
Gradient descent is bad at learning deep nets
The common experience:
gradient descent gets much slower as the depth increases
large enough depth → learning slows to a crawl or even “stops” → severe under-fitting (poor performance on the training set)
“vanishing-gradients problem”: error signal decays as it is backpropagated
[Figure: deep network with the error signal flowing backwards from the output layer to the input layer]
the gradient is tiny for weights in early layers
Gradient descent is bad at deep learning (cont.)
Two hypotheses for why gradient descent fails:
increased frequency and severity of bad local minima
pathological curvature, like the type seen in the well-known Rosenbrock function:
f(x, y) = (1 − x)² + 100 (y − x²)²
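As a concrete illustration (not part of the talk), gradient descent on the Rosenbrock function shows exactly this pathology: the narrow curved valley forces tiny steps, so progress stalls long before the minimum at (1, 1). The step size and iteration count below are illustrative choices.

```python
# Gradient descent on f(x, y) = (1 - x)^2 + 100 (y - x^2)^2.

def rosenbrock(x, y):
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def rosenbrock_grad(x, y):
    dfdx = -2 * (1 - x) - 400 * x * (y - x ** 2)
    dfdy = 200 * (y - x ** 2)
    return dfdx, dfdy

x, y = -1.2, 1.0          # classic starting point
f_start = rosenbrock(x, y)
for _ in range(500):
    gx, gy = rosenbrock_grad(x, y)
    x -= 1e-3 * gx        # small step size needed for stability
    y -= 1e-3 * gy
f_end = rosenbrock(x, y)

# f decreases, but after 500 steps the iterate is still far from (1, 1):
# the low-curvature direction along the valley is traversed very slowly.
```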
Pre-training for deep auto-encoders
(from Hinton and Salakhutdinov, 2006)
Pre-training (cont.)
doesn’t generalize to all the sorts of deep architectures we might wish to train
does it get full power out of deep auto-encoders?
(from Hinton and Salakhutdinov, 2006)
Our contribution
we develop a very powerful and practical 2nd-order optimization algorithm based on the “Hessian-free” approach
we show that it can achieve significantly lower test-set reconstruction errors on the deep auto-encoder problems considered in Hinton and Salakhutdinov
no pre-training required!
using pre-training still lowers generalization error on 2 of the 3 problems
but critically there isn’t a significant benefit on the training set
our method provides a better solution to the underfitting problem in deep networks and can be applied to a much larger set of models
2nd-order optimization
If pathological curvature is the problem, this could be the solution
General framework
model the objective function by the local approximation:
f(θ + p) ≈ qθ(p) ≡ f(θ) + ∇f(θ)ᵀp + (1/2) pᵀBp
where B is a matrix which quantifies curvature
in Newton’s method, B = H or H + λI
fully optimizing qθ(p) w.r.t. p gives: p = −B⁻¹∇f(θ)
update is: θ ← θ + αp for some α ≤ 1 determined by a line search
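A worked instance of this update (not from the talk): on a badly conditioned 2-D quadratic f(θ) = (1/2) θᵀBθ, the full step p = −B⁻¹∇f(θ) lands exactly on the minimizer in one move, regardless of conditioning. The matrix B = diag(100, 1) and the starting point are illustrative.

```python
# One quadratic-model step: p = -B^{-1} grad, with B = diag(100, 1).

B = [[100.0, 0.0], [0.0, 1.0]]
theta = [1.0, 1.0]
grad = [B[0][0] * theta[0] + B[0][1] * theta[1],
        B[1][0] * theta[0] + B[1][1] * theta[1]]   # grad = B theta

# invert the 2x2 matrix explicitly
det = B[0][0] * B[1][1] - B[0][1] * B[1][0]
B_inv = [[ B[1][1] / det, -B[0][1] / det],
         [-B[1][0] / det,  B[0][0] / det]]

p = [-(B_inv[0][0] * grad[0] + B_inv[0][1] * grad[1]),
     -(B_inv[1][0] * grad[0] + B_inv[1][1] * grad[1])]
theta_new = [theta[0] + p[0], theta[1] + p[1]]   # alpha = 1

# theta_new is the exact minimizer (0, 0), however ill-conditioned B is;
# gradient descent with a single step size cannot do this.
```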
Vanishing Curvature
define the direction d by dk = δik (the unit vector along parameter i)
low reduction along d: −∇f ᵀd = −(∇f)i ≈ 0
but also low curvature: dᵀHd = Hii = ∂²f/∂θi² ≈ 0
[Figure: deep network; the backprop direction for weight i runs from the output layer back towards the input layer]
so a 2nd-order optimizer will pursue d at a reasonable rate, an elegant solution to the vanishing gradient problem of 1st-order optimizers
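A scalar toy model of this point (ours, not the talk's): suppose the error signal reaching an early weight is attenuated by a tiny factor c, so f(θ) = (cθ − 1)². The gradient shrinks like c and the curvature like c², but their ratio, the Newton step, stays sensible. The value c = 1e-6 is an illustrative choice.

```python
# Vanishing gradient AND vanishing curvature: f(theta) = (c*theta - 1)^2.

c = 1e-6
theta = 0.0

grad = 2 * c * (c * theta - 1)     # first derivative: tiny, O(c)
curv = 2 * c ** 2                  # second derivative: even tinier, O(c^2)

gd_step = -0.1 * grad              # gradient descent barely moves theta
newton_step = -grad / curv         # = (1 - c*theta)/c : a large, sensible step

# newton_step takes theta straight to the minimizer at theta = 1/c in one move,
# while gd_step is of order c and makes essentially no progress.
```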
Practical Considerations for 2nd-order optimization
Hessian size problem
for machine learning models the number of parameters N can be very large
we can’t possibly calculate or even store an N × N matrix, let alone invert one
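A back-of-the-envelope calculation (ours; the parameter count N = 10⁷ is an assumed, illustrative figure) makes the storage problem vivid:

```python
# Storage cost of an explicit dense Hessian for an assumed ten-million-parameter model.

N = 10 ** 7                                # assumed parameter count, illustrative
hessian_bytes = N * N * 8                  # dense N x N matrix of 8-byte floats
hessian_terabytes = hessian_bytes / 1e12   # 800.0 TB: hopeless to store, let alone invert
```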
Quasi-Newton Methods
non-linear conjugate gradient (NCG) - a hacked version of the quadratic optimizer linear CG
limited-memory BFGS (L-BFGS) - a low-rank Hessian approximation
approximate diagonal or block-diagonal Hessian
Unfortunately these don’t seem to resolve the deep-learning problem
Hessian-free optimization
a quasi-Newton method that uses no low-rank approximations
named “free” because we never explicitly compute B
First motivating observation
it is relatively easy to compute the matrix-vector product Hv for an arbitrary vector v
e.g. use finite differences to approximate the limit:
Hv = lim_{ε→0} (∇f(θ + εv) − ∇f(θ)) / ε
Hv is computed for the exact value of H, there is no low-rank or diagonal approximation here!
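A concrete check of the finite-difference form (our example, not the talk's): for the Rosenbrock function we can compare (∇f(θ + εv) − ∇f(θ))/ε against the exact Hessian-vector product. The test point, direction v, and ε are illustrative choices.

```python
# Finite-difference Hv for the Rosenbrock function vs. the exact product.

def grad(x, y):
    return [-2 * (1 - x) - 400 * x * (y - x ** 2), 200 * (y - x ** 2)]

def exact_Hv(x, y, v):
    # Hessian of Rosenbrock: [[2 - 400y + 1200x^2, -400x], [-400x, 200]]
    h11 = 2 - 400 * y + 1200 * x ** 2
    h12 = -400 * x
    return [h11 * v[0] + h12 * v[1], h12 * v[0] + 200 * v[1]]

x, y, v, eps = 0.5, 0.5, [1.0, -2.0], 1e-6
g0 = grad(x, y)
g1 = grad(x + eps * v[0], y + eps * v[1])
fd_Hv = [(g1[i] - g0[i]) / eps for i in range(2)]

# fd_Hv agrees with exact_Hv(x, y, v) = [502, -600] to several digits:
# only two gradient evaluations, no low-rank or diagonal approximation.
```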
Hessian-free optimization (cont.)
Second motivating observation
linear conjugate gradient (CG) minimizes positive definite quadratic cost functions using only matrix-vector products
more often seen in the context of solving large sparse systems
directly minimizes the quadratic q ≡ pᵀBp/2 + gᵀp and not the residual ‖Bp + g‖² → these are related but different!
but we actually care about the quadratic, so this is good
requires N = dim(θ) iterations to converge in general, but makes a lot of progress in far fewer iterations than that
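A minimal pure-Python sketch of linear CG (textbook algorithm, our small example): the matrix B enters only through a matvec callback, which is exactly the property HF relies on.

```python
# Linear CG: minimize q(p) = p'Bp/2 + g'p, i.e. solve B p = -g, via matvecs only.

def cg(matvec, b, iters):
    """Approximately solve B p = b; matvec(v) returns the product B v."""
    p = [0.0] * len(b)
    r = list(b)                  # residual b - B p (p = 0 initially)
    d = list(r)
    for _ in range(iters):
        rr = sum(ri * ri for ri in r)
        if rr == 0.0:            # already converged
            break
        Bd = matvec(d)
        alpha = rr / sum(di * Bdi for di, Bdi in zip(d, Bd))
        p = [pi + alpha * di for pi, di in zip(p, d)]
        r = [ri - alpha * Bdi for ri, Bdi in zip(r, Bd)]
        beta = sum(ri * ri for ri in r) / rr
        d = [ri + beta * di for ri, di in zip(r, d)]
    return p

B = [[3.0, 1.0], [1.0, 2.0]]     # small SPD example; B is never inverted or factored
g = [1.0, 1.0]
matvec = lambda v: [sum(B[i][j] * v[j] for j in range(2)) for i in range(2)]
p = cg(matvec, [-gi for gi in g], iters=2)
# After N = 2 iterations CG has converged (exactly, in exact arithmetic): p = [-0.2, -0.4].
```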
Standard Hessian-free Optimization
Pseudo-code for a simple variant of damped Hessian-free optimization:
1: for n = 1 to max-epochs do
2:    compute gradient gn = ∇f(θn)
3:    choose/adapt λn according to some heuristic
4:    define the function Bn(v) = Hv + λn v
5:    pn = CGMinimize(Bn, −gn)
6:    θn+1 = θn + pn
7: end for
In addition to choosing λn, the stopping criterion for the CG algorithm is a critical detail.
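The pseudo-code above can be turned into a runnable sketch (ours, with many simplifications): finite differences stand in for Hv, CG is truncated, λ is held fixed, and a crude acceptance test replaces the paper's heuristics. All of these are illustrative choices, not the method's actual details.

```python
# Toy damped HF loop on the Rosenbrock function.

def f(t):
    return (1 - t[0]) ** 2 + 100 * (t[1] - t[0] ** 2) ** 2

def grad(t):
    return [-2 * (1 - t[0]) - 400 * t[0] * (t[1] - t[0] ** 2),
            200 * (t[1] - t[0] ** 2)]

def cg_minimize(matvec, b, iters):
    # truncated linear CG for B p = b, using only matvecs
    p, r = [0.0, 0.0], list(b)
    d = list(r)
    for _ in range(iters):
        rr = sum(ri * ri for ri in r)
        if rr == 0.0:
            break
        Bd = matvec(d)
        dBd = sum(di * Bdi for di, Bdi in zip(d, Bd))
        if dBd <= 0.0:          # guard against negative curvature
            break
        alpha = rr / dBd
        p = [pi + alpha * di for pi, di in zip(p, d)]
        r = [ri - alpha * Bdi for ri, Bdi in zip(r, Bd)]
        beta = sum(ri * ri for ri in r) / rr
        d = [ri + beta * di for ri, di in zip(r, d)]
    return p

theta, lam, eps = [-1.2, 1.0], 10.0, 1e-6   # fixed damping, illustrative
f_start = f(theta)
for _ in range(50):                          # max-epochs
    g = grad(theta)
    def Bn(v):                               # Bn(v) = Hv + lam*v, Hv by finite differences
        gp = grad([theta[i] + eps * v[i] for i in range(2)])
        return [(gp[i] - g[i]) / eps + lam * v[i] for i in range(2)]
    p = cg_minimize(Bn, [-gi for gi in g], iters=10)
    cand = [theta[i] + p[i] for i in range(2)]
    if f(cand) < f(theta):                   # crude acceptance test
        theta = cand
f_end = f(theta)
```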
A new variant is required
the bad news: common variants of HF (e.g. Steihaug) don’t work particularly well for neural networks
there are many aspects of the algorithm that are ill-defined in the basic approach which we need to address:
how can we deal with negative curvature?
how should we choose λ?
how can we handle large data-sets?
when should we stop the CG iterations?
can CG be accelerated?
Pearlmutter’s R-operator method
finite-difference approximations are undesirable for many reasons
there is a better way to compute Hv due to Pearlmutter (1994)
similar cost to a gradient computation
for neural nets, no extra non-linear functions need to be evaluated
technique generalizes to almost any twice-differentiable function that is tractable to compute
can be automated (like automatic differentiation)
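A hand-worked instance of the idea (ours; not Pearlmutter's general algorithm): take f(θ) = (θᵀθ)²/4, whose gradient is ∇f = (θᵀθ)·θ. Applying the R-operator R{·} (the directional derivative along v, with Rθ = v) to each intermediate of the gradient computation yields Hv exactly, at roughly the cost of one more gradient-like pass. The vectors below are illustrative.

```python
# R-operator pass for f(theta) = (theta' theta)^2 / 4, grad = (theta' theta) * theta.

theta = [1.0, 2.0, -1.0]
v = [0.5, -1.0, 2.0]

# forward/gradient pass intermediates
s = sum(t * t for t in theta)                 # s = theta' theta
grad = [s * t for t in theta]                 # grad = s * theta

# R-pass: differentiate each intermediate along v (with R{theta} = v)
Rs = 2 * sum(t * vi for t, vi in zip(theta, v))       # R{theta' theta} = 2 theta' v
Hv = [Rs * t + s * vi for t, vi in zip(theta, v)]     # R{s * theta} = (Rs) theta + s v

# cross-check against the explicit Hessian H = 2 theta theta' + (theta' theta) I
Hv_explicit = [2 * sum(t * vi for t, vi in zip(theta, v)) * ti + s * vi
               for ti, vi in zip(theta, v)]
# the two agree exactly: Hv = [-4.0, -20.0, 19.0]
```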
The Gauss-Newton Matrix (G)
a well-known alternative to the Hessian that is guaranteed to be positive semi-definite - thus no negative curvature!
usually is applied to non-linear least squares problems
Schraudolph showed in 2002 that it can be generalized beyond just least squares to neural nets with “matching” loss functions and output non-linearities
e.g. logistic units with cross-entropy error
works better in practice than the Hessian or other curvature matrices (e.g. empirical Fisher)
and we can compute Gv using an algorithm similar to the one for Hv
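A scalar least-squares toy (our illustration) shows the positive semi-definiteness concretely: with f(θ) = ½ Σᵢ rᵢ(θ)² the Gauss-Newton quantity G = Σᵢ rᵢ′² is always ≥ 0, while the full Hessian H = G + Σᵢ rᵢ·rᵢ″ can go negative. The residuals r₁ = θ² − 1 and r₂ = θ − 1 are illustrative choices.

```python
# Gauss-Newton vs. Hessian on a scalar nonlinear least-squares problem, at theta = 0.

theta = 0.0
r1, r1p, r1pp = theta ** 2 - 1, 2 * theta, 2.0   # r1, r1', r1''
r2, r2p, r2pp = theta - 1, 1.0, 0.0              # r2, r2', r2''

G = r1p ** 2 + r2p ** 2                 # Gauss-Newton "matrix" (a scalar here): always >= 0
H = G + r1 * r1pp + r2 * r2pp           # full Hessian of f

# At theta = 0: G = 1 > 0 but H = 1 + (-1)*2 = -1 < 0; the negative-curvature
# term is exactly what Gauss-Newton discards.
```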
CG stopping conditions
CG is only guaranteed to converge after N (size of the parameter space) iterations −→ we can’t always run it to convergence
the standard stopping criterion used in most versions of HF is

‖r‖ < min(½, ‖g‖^{1/2}) ‖g‖,  where r = Bp + g is the “residual”
strictly speaking ‖r‖ is not the quantity that CG minimizes, nor is it the one we really care about
[Plots: the quadratic model qθ(p) and the squared residual ‖Bp + g‖² versus CG iteration]
CG stopping conditions (cont.)
we found that terminating CG once the relative per-iteration reduction rate fell below some tolerance ε worked best:

∆q / q < ε

(∆q is the change in the quadratic model averaged over some window of the last k iterations of CG)
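A minimal numpy sketch of CG on q(p) = ½ pᵀBp + gᵀp with this relative-reduction stopping rule (the tolerance, window size, and the extra Bv product used to evaluate q are illustrative simplifications, not the paper's exact settings):

```python
import numpy as np

def cg_hf(Bv, g, eps=5e-4, window=10, max_iter=250):
    """CG minimizing q(p) = 0.5 p^T B p + g^T p, stopping when the
    average relative reduction in q over the last `window` iterations
    falls below eps. Bv: callable computing the product B @ v."""
    p = np.zeros_like(g)
    r = g.copy()               # residual r = Bp + g (p = 0 initially)
    d = -r
    rr = r @ r
    q_hist = [0.0]             # q(0) = 0
    for i in range(1, max_iter + 1):
        Bd = Bv(d)
        alpha = rr / (d @ Bd)
        p = p + alpha * d
        r = r + alpha * Bd
        rr_new = r @ r
        d = -r + (rr_new / rr) * d
        rr = rr_new
        # a real implementation tracks q incrementally instead of
        # paying an extra Bv product per iteration
        q = 0.5 * (p @ Bv(p)) + g @ p
        q_hist.append(q)
        k = min(window, i)
        if q < 0 and (q - q_hist[-k - 1]) / q < k * eps:
            break
    return p
```

With a tiny tolerance the loop runs CG essentially to convergence, so the returned p solves Bp = −g; with a looser eps it exits as soon as progress per iteration stalls.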
Handling large datasets
each iteration of CG requires the evaluation of the product Bv for some v
naively this requires a pass over the training dataset
but a sufficiently large subset of the training data can capture enough useful curvature information
the required size is related to the model and qualitative aspects of the dataset, but critically not to its size
for very large datasets, mini-batches might be a tiny fraction of the whole
the gradient and line-searches can be computed using even larger mini-batches since they are needed much less often
Other enhancements
using a Levenberg-Marquardt style heuristic for adjusting the damping parameter λ
using M-preconditioned CG with the diagonal preconditioner:

M = [ diag( Σ_i ∇f_i ⊙ ∇f_i ) + λI ]^α

initializing each run of the inner CG-loop from the solution found by the previous run
carefully bounding and “back-tracking” the maximum number of CG steps to compensate for the effect of using mini-batches to compute the Bv products
(see the paper for further details)
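The preconditioner above is diagonal, so it can be stored as a vector and applied inside CG by elementwise division. A small sketch (the exponent α and the toy shapes are illustrative defaults, not prescriptions; ∇f_i are per-case gradients):

```python
import numpy as np

def diag_preconditioner(per_case_grads, lam, alpha=0.75):
    """M = [diag(sum_i grad_i * grad_i) + lam*I]^alpha as a vector.
    per_case_grads: array of shape (num_cases, num_params)."""
    m = (per_case_grads ** 2).sum(axis=0) + lam
    return m ** alpha

rng = np.random.default_rng(1)
grads = rng.normal(size=(100, 8))        # 100 cases, 8 parameters
M = diag_preconditioner(grads, lam=1.0)
z = rng.normal(size=8)
Minv_z = z / M   # applying M^{-1} in preconditioned CG is elementwise
assert M.shape == (8,) and np.all(M > 0)
```

Adding λ to the diagonal ties the preconditioner to the current damping level, and the exponent α < 1 tempers it so extreme curvature estimates don't dominate.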
Results
Experimental parameters (K = mini-batch size)
| Name   | size   | K    | encoder dims            |
|--------|--------|------|-------------------------|
| CURVES | 20000  | 5000 | 784-400-200-100-50-25-6 |
| MNIST  | 60000  | 7500 | 784-1000-500-250-30     |
| FACES  | 103500 | 5175 | 625-2000-1000-500-30    |
Deep auto-encoder experiments
used precisely the same model architectures and datasets as in Hinton and Salakhutdinov, 2006
CURVES, MNIST and FACES are all image datasets
trained with cross-entropy but performance measured with squarederror
all methods were run using GPU implementations. H&S’s pre-training plus NCG fine-tuning method was run for a lot longer
Results (cont.)
PT+NCG = pre-trained initialization with non-linear CG optimizer
RAND+HF = random initialization with our Hessian-free method
PT+HF = pre-trained initialization with our Hessian-free method
* indicates an ℓ2 prior was used
[Bar charts: training-set and test-set errors for CURVES, MNIST, MNIST*, FACES, FACES*]

|         | PT+NCG     | RAND+HF    | PT+HF      | no early stop |
|---------|------------|------------|------------|---------------|
| CURVES  | 0.74, 0.82 | 0.11, 0.20 | 0.10, 0.21 | 0.1           |
| MNIST   | 2.31, 2.72 | 1.64, 2.78 | 1.63, 2.46 | 1.4           |
| MNIST*  | 2.07, 2.61 | 1.75, 2.55 | 1.60, 2.28 |               |
| FACES   | -, 124     | 55.4, 139  | -, -       | 12.9!         |
| FACES*  | -, -       | 60.6, 122  | -, -       |               |
Our HF method is practical
error on the CURVES task versus GPU time:
[Plot: error versus time in seconds (log scale) for PT+NCG, RAND+HF and PT+HF]
Thank you for your attention