
On Stochastic Optimization Techniques

Billy Jun [email protected]    Lita Yang [email protected]    Qian Liu [email protected]

Abstract

In recent years, there has been a resurgence of interest in the use of stochastic gradient descent (SGD) methods for numerous reasons: simplicity of the method, a theoretically appealing generalization bound independent of the number of training data, online learning from streaming information, and application to large-scale learning. In this project, we explored several popular and recent SGD methods on a deep learning topology, and built on this survey to motivate a new hyperparameter-free variant that improves upon its original algorithm.

1. Introduction

Although similar to batch gradient descent (GD) methods in form, stochastic gradient descent (SGD) has many interesting and nonintuitive theoretical limits (Bottou, 2010). For example, the time it takes to reach accuracy ρ is O(1/ρ), which, unlike batch techniques, is independent of the number of training examples. Even more nonintuitive is that second-order information only improves the constant, not the asymptotic behavior. Lastly, despite its worse optimization performance on the empirical error than batch GD, SGD offers a stronger generalization bound (it is asymptotically more efficient), which makes the study of SGD very interesting.

Recently, SGD was used to very successfully train a deep neural network (DNN) from random weights with the addition of classical momentum (CM) and Nesterov's Accelerated Gradient (NAG) (Sutskever et al., 2013). In a slightly different approach, SGD was revamped by scheduling adaptive learning rates through preconditioning (Duchi et al., 2011) and by minimizing expected loss (Schaul et al., 2013). The main advantage of these adaptive techniques is that they offer stronger convergence near an optimum, which first-order SGD lacks.

However, given the numerous promising variants, it is unclear which method to use in practice. To resolve this confusion, the first part of this project trains the same architecture across several selected SGD algorithms and presents optimization properties, such as learning curves and error statistics, as a survey.

Another problem with methods such as SGD is that they have hyperparameters that depend on a particular dataset and/or architecture. This means a user has to sweep through a set of values and manually decide on a hyperparameter setting before training. Not only is this cumbersome, but it also takes a lot of time. Variance-based SGD (Schaul et al., 2013) is a hyperparameter-free method, but we found that it converges relatively slowly compared to other SGD algorithms when they are properly tuned. In the second part of the work, we incorporate the recent NAG idea into this variance-based method to accelerate convergence. The new hyperparameter-free variant showed a noticeable improvement over the original algorithm.

2. Methods

2.1. Architecture

As a first step, we wanted to choose a deep learning topology that would be a good model on which to test the SGD algorithms. It would have been ideal to implement and compare various deep learning algorithms, but due to time constraints we only implemented a convolutional neural network (CNN) with softmax activation to classify the MNIST handwritten digits dataset. Our architecture, illustrated in Figure 1, is adapted from similar work (Le et al., 2011) that compared conjugate gradient, L-BFGS and first-order stochastic gradient descent methods.

Initially, we implemented everything in Matlab with vectorization. This proved valuable in verifying the correctness of the implementation. However, due to slow runtime, the computation-heavy code was rewritten in CUDA, which resulted in a significant performance increase.


[Figure 1 omitted: diagram of the CNN architecture. Legend: convolution, max pooling, full connection, softmax activation.]

Figure 1. Illustration of the CNN architecture used in this work. We chose max pooling over uniform downsampling and mean pooling because it captures the highest activity in the respective receptive field and its backpropagation is more computationally efficient.

2.2. Stochastic gradient descent algorithms

We explored the following popular/recent algorithms from the SGD literature; a minimal Matlab sketch of a few of these update rules follows the list.

• Stochastic gradient descent (SGD1) with learning rate schedule (Le et al., 2011): \eta^{(t)} = \alpha/(\beta + t). This is popular because of its simple implementation.

• SGD with classical momentum (SGD-CM):

    v^{(t+1)} = \mu v^{(t)} - \varepsilon \nabla f(\theta^{(t)})    (1)
    \theta^{(t+1)} = \theta^{(t)} + v^{(t+1)}.    (2)

• SGD with classical momentum and Nesterov's Accelerated Gradient (SGD-CM-NAG) (Sutskever et al., 2013):

    \mu^{(t)} = \min\{1 - 2^{-1-\log_2(1+\lfloor t/250\rfloor)},\ \mu_{\max}\}    (3)
    v^{(t+1)} = \mu^{(t)} v^{(t)} - \varepsilon \nabla f(\theta^{(t)} + \mu^{(t)} v^{(t)})    (4)
    \theta^{(t+1)} = \theta^{(t)} + v^{(t+1)}.    (5)

The authors made a clear distinction between the two terms in (4), calling the first classical momentum, since NAG turned out to be another form of momentum that offers a quadratic convergence rate for convex problems, which is unprecedented among other SGD methods. We use the same terminology in this work.

• Adaptive Subgradient SGD (ADAGRAD) (Duchi et al., 2011):

    \eta_i^{(t)} = \varepsilon \Bigl(\mu + \sum_{s=0}^{t} \bigl(\nabla\theta_i^{(s)}\bigr)^2\Bigr)^{-1/2}    (6)

The original learning rate from the paper does not include the hyperparameter µ; it was added here, as in the other similar algorithms, to prevent division by zero.

• Stochastic Diagonal Levenberg-Marquardt (SGD-LM) (LeCun):

    \eta_i^{(t)} = \varepsilon \bigl(\mu + \nabla^2\theta_i^{(t)}\bigr)^{-1}    (7)

This is a popular algorithm for training convolutional neural networks, so it was included in this work.

• Local variance-based SGD (vSGD-l) (Schaul et al., 2013):

    \eta_i^{(t)} = \frac{\bigl(E[\nabla\theta_i]\bigr)^2}{\bigl(\mu + \nabla^2\theta_i^{(t)}\bigr)\, E\bigl[(\nabla\theta_i)^2\bigr]}    (8)

This algorithm is derived from the principle of minimum mean squared error (MMSE), which aims to greedily minimize the per-sample quadratic loss. It is completely hyperparameter-free (µ is just a small constant to prevent division by zero).
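To make the update rules above concrete, the following Matlab sketch shows single-sample updates for SGD-CM-NAG (3)-(5) and ADAGRAD (6) on a toy least-squares problem. It is a minimal illustration under stated assumptions (gradFn is a hypothetical per-sample gradient handle and the toy data are arbitrary), not our CUDA implementation.

% Minimal sketch (not our CUDA code) of two update rules from the list above.
A = randn(100, 10); b = randn(100, 1);                     % toy least-squares data
gradFn = @(th, idx) A(idx,:)' * (A(idx,:) * th - b(idx));  % per-sample gradient

% --- SGD-CM-NAG, equations (3)-(5) ---
theta = zeros(10, 1); v = zeros(10, 1);
eps_lr = 1e-2; mu_max = 0.99;
for t = 0:999
    i     = randi(100);                                    % draw one sample
    mu_t  = min(1 - 2^(-1 - log2(1 + floor(t/250))), mu_max);  % momentum schedule (3)
    g     = gradFn(theta + mu_t * v, i);                   % gradient at look-ahead point (4)
    v     = mu_t * v - eps_lr * g;                         % velocity update (4)
    theta = theta + v;                                     % parameter update (5)
end

% --- ADAGRAD, equation (6) ---
theta = zeros(10, 1); gradSq = zeros(10, 1);
eps_lr = 1e-2; mu = 1e-6;
for t = 0:999
    i      = randi(100);
    g      = gradFn(theta, i);
    gradSq = gradSq + g.^2;                                % accumulate squared gradients
    theta  = theta - (eps_lr ./ sqrt(mu + gradSq)) .* g;   % per-coordinate rate (6)
end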

2.3. Training

The MNIST digits dataset was first preprocessed so that each feature is a Gaussian random variable with zero mean and unit variance. Then, the images were zero-padded around the perimeter to a new dimension of 32×32 pixels. Because we were interested in measuring the raw performance of the SGD methods, techniques such as minibatching and pretraining were not used. For each algorithm, we trained for 15 epochs on an NVIDIA GeForce GTX 480 with random weight initialization (∼ N(0, 10⁻⁴)). Then, twice per epoch, we checked the mean classification error on a 20% hold-out validation set (12,000 images) and saved the weights whenever a better model was learned. This final model was then used to make predictions on the test set. Since our primary focus was on how well the optimization techniques reduce empirical error, early stopping and the natural sparsity of the CNN were the only forms of regularization used to improve test error.
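A minimal Matlab sketch of this preprocessing step (per-pixel standardization followed by zero padding to 32×32) is shown below. It is illustrative only: images is a hypothetical 28×28×N array of MNIST digits, and the small constant added to the standard deviation is a guard we introduce here.

% Sketch of the preprocessing described above (assumed 28x28xN input layout).
mu    = mean(images, 3);                      % per-pixel mean over the training set
sigma = std(images, 0, 3) + 1e-8;             % per-pixel std (small guard against division by zero)
imagesStd = bsxfun(@rdivide, bsxfun(@minus, images, mu), sigma);  % zero mean, unit variance
N = size(imagesStd, 3);
imagesPad = zeros(32, 32, N);
imagesPad(3:30, 3:30, :) = imagesStd;         % 28x28 -> 32x32 zero padding around the perimeter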

For simplicity, backpropagation of the diagonal Hessian estimates in our CUDA implementation is performed after the gradient signals are backpropagated. However, in our report we assumed that this can be performed in parallel, and thus we rescaled run-time to that of the earliest-finishing algorithm.

3. Survey of popular/recent work

First, we verified that our CUDA implementation is correct by visualizing the kernels it learned (Figure 2). Then, to make certain all optimization techniques were performing at their best, we ran through a sweep of values to find a set of good hyperparameters and chose the settings summarized in Table 1.


Figure 2. Verification of the CNN implementation was performed with SGD1 due to its simplicity and obvious correctness.

Algorithms     Parameters
SGD1           α = 0.01 × |D|, β = |D|
SGD-CM         ε = 10⁻⁴, µ = 0.99
SGD-CM-NAG     ε = 10⁻⁴, µ_max = 0.99
ADAGRAD        ε = 10⁻², µ = 10⁻⁶
SGD-LM         ε = 10⁻³, µ = 10⁻²
vSGD-l         µ = 10⁻⁶

Table 1. Results of hyperparameter selection.

As an interesting practical guide to hyperparameter selection, we found that ADAGRAD and vSGD-l are insensitive to the choice of the small constant µ, unlike SGD-LM (Figure 3). If the model is regularized, for example with an L2 norm, then division by zero is no longer an issue for SGD-LM; but this would mean that the optimal hyperparameter ε becomes a function of the regularization hyperparameters as well.

[Figure 3 omitted: two panels of softmax loss versus training time, one for SGD-LM and one for ADAGRAD, each swept over several (ε, µ) settings.]

Figure 3. Increasing µ tends to help SGD-LM (bottom) become more numerically stable, although it can partially reduce the convergence rate.

After training the same architecture with the sweep-selected hyperparameters, we obtained the learning curves in Figure 4 and their respective error statistics in Table 2. Apart from vSGD-l, which is of a different nature, SGD1 as expected had the worst empirical error. This was mainly because of the predetermined learning schedule, which allows for fast initial convergence but later suffers from a slow-down near a local optimum. This also explains why SGD1 had the best test error (Table 2); it simply overfit less. In this respect, the adaptive rates of ADAGRAD and SGD-LM led to better empirical errors. However, for purely optimizing a function, the SGD methods with momentum had the best performance. In particular, NAG, which enjoys a quadratic convergence rate for convex problems (Sutskever et al., 2013), converges increasingly well as it approaches the minimum, because the problem becomes locally more convex.

[Figure 4 omitted: validation set softmax loss versus training time for SGD1, SGD-CM, SGD-CM-NAG, ADAGRAD, SGD-LM and vSGD-l.]

Figure 4. The actual computation time for vSGD-l and SGD-LM on CUDA took longer because of the backpropagation of diagonal Hessian estimates, but in this plot we assume that this can be performed in parallel.

Algorithms     Validation set MCE   Test set MCE
SGD1           1.08%                0.91%
SGD-CM         0.84%                1.00%
SGD-CM-NAG     0.07%                1.28%
ADAGRAD        0.91%                0.97%
SGD-LM         0.99%                0.93%
vSGD-l         1.91%                1.70%

Table 2. Early stopping and the sparsity of the CNN are the only forms of regularization used to generate these results.

In the original paper (Schaul et al., 2013), it was shown that SGD1 has a limited lower bound to which the objective can be reduced, and that the vSGD methods outperform it on the MNIST dataset. However, this was because the decay rate of SGD1 was chosen to be λ = 1. With our choice of hyperparameters α and β (Table 1), the learning rate schedule becomes

    \eta^{(t)} = \frac{\alpha}{\beta + t} = \frac{0.01\,|D|}{|D| + t} = \frac{0.01}{1 + t/|D|} =: \frac{\eta_0}{1 + \lambda t}.

The decay factor λ = 1/|D| < 1 allows the learning rate to decrease slowly over the number of epochs rather than over the total number of iterations.
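As a quick worked sanity check of this schedule (an illustration we add here, not a result from the original comparison): after exactly one epoch, i.e. t = |D|, the two decay settings give very different rates,

    \eta^{(|D|)} = \frac{\eta_0}{1 + \frac{1}{|D|}\,|D|} = \frac{\eta_0}{2}
    \qquad\text{versus}\qquad
    \left.\eta^{(|D|)}\right|_{\lambda = 1} = \frac{\eta_0}{1 + |D|} \approx \frac{\eta_0}{|D|}.

With λ = 1/|D| the rate has merely halved after a full pass over the data, whereas with λ = 1 it has effectively collapsed, which is consistent with the limited lower bound reported for SGD1 in the original setting.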


This slower decay allowed SGD1 to actually converge to a smaller objective value and achieve lower mean classification error rates than vSGD-l.

Although vSGD-l has the advantage of being hyperparameter-free, it does not perform as well as the other methods when their hyperparameters are manually tuned to be optimal. We speculate that this is because vSGD-l chooses the optimal learning rate based on the minimum mean squared error (MMSE), like the Kalman filter, but the squared-loss assumption does not hold strongly for this particular problem. Furthermore, upon probing various dependencies of the learning rate computation, we found that the inverse of the memory size, which ultimately decides how much of the current gradient information gets incorporated into the current estimates of E[∇θ_i] and E[∇²θ_i], diminishes very rapidly (Figure 5).
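The fraction plotted in Figure 5 is 1/τ_i, where τ_i is the per-parameter memory size that vSGD-l adapts via τ_i ← (1 − ḡ_i²/v̄_i)·τ_i + 1 (this update reappears as line 10 of Algorithm 1 below). The Matlab sketch below is a simplified illustration, not our actual instrumentation: it assumes the gradient signal-to-noise ratio ḡ²/v̄ decays as the weights approach a noisy optimum and shows how quickly the mixing fraction 1/τ then shrinks.

% Simplified illustration of the vSGD-l memory-size dynamics (not the full algorithm).
% The fixed point of tau <- (1 - s)*tau + 1 is tau = 1/s, so as s = gbar^2/vbar decays
% near a noisy optimum, 1/tau keeps shrinking instead of leveling off.
snr0  = 0.5;                          % assumed initial signal-to-noise ratio gbar^2/vbar
tau   = 2;                            % assumed small initial memory size (illustrative)
steps = 20000;
frac  = zeros(1, steps);
for t = 1:steps
    snr     = snr0 / t;               % assumed decay of gbar^2/vbar toward the optimum
    tau     = (1 - snr) * tau + 1;    % memory-size update
    frac(t) = 1 / tau;                % fraction of the current gradient used in the averages
end
fprintf('1/tau after 500 steps: %.1e, after 20000 steps: %.1e\n', frac(500), frac(steps));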

[Figure 5 omitted: two panels plotting, for each weight index, the fraction of the current gradient used in the running estimates, after 500 iterations and after 20,000 iterations.]

Figure 5. The y-axis represents the ratio of the current gradient information used in estimating E[∇θ_i] and E[∇²θ_i]. About halfway through the first epoch, this fraction is practically 0.

4. Improving vSGD-l

In the previous section, we saw that vSGD-l converges relatively weakly compared to other techniques with careful hyperparameter selection. However, vSGD-l is appealing because it is truly hyperparameter-free. Observing that classical momentum and Nesterov's Accelerated Gradient offer quadratic convergence near an optimum (due to increased convexity), we applied this idea to accelerate vSGD-l.

As a first attempt, the original NAG formulation (Sutskever et al., 2013) with momentum schedule (3) was applied directly, without any modifications. As shown in Figure 6, NAG in its unaltered form actually degrades vSGD-l. This is because vSGD-l implements the slow-start rule (Schaul et al., 2013) so that the algorithm can better estimate the gradient and Hessian expectations initially. However, because of the hard-coded value of 250 in (3), µ(t) quickly saturates to µ_max. Consequently, the initial expectation estimates lose their accuracy due to the distortion induced by a quickly growing momentum, and the learning rates end up being suboptimal.

[Figure 6 omitted: validation set softmax loss versus training time for vSGD-l NAG with µ_max ∈ {0.9, 0.95, 0.99, 0.995}, compared against vSGD-l and SGD-CM-NAG.]

Figure 6. NAG destroys the initial local estimates of E[∇θ_i] and E[(∇θ_i)²], and results in poorer performance. For this section, our CUDA implementation resulted in NaN errors when running on the full architecture in Figure 1. All experiments are performed with 8 and 16 kernel maps in the first two convolutional layers.

To cope with this, a new momentum schedule is proposed in Algorithm 1. The two major changes are the replacement of 250 by |D|, the size of the training set, and the removal of µ_max. This choice of denominator no longer hard-codes a particular value and ensures that the rate of momentum accumulation is kept small in the first epoch, allowing vSGD-l to gather accurate statistics for good initial estimates of the expectation values. Here, we implicitly assume that there are many more than 250 training examples. If the training set is small, hyperparameter selection does not take long anyway, and other, more viable learning techniques such as SGD-CM-NAG can be applied without too much overhead. Under this scheme, it takes about 500 epochs for any training set to reach a µ value of 0.999. Practically, by then the vSGD-l learning rates have become negligible and the classical momentum term never diverges, although we have yet to prove this theoretically.
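The following Matlab snippet is a small, illustrative comparison (not taken from our training code) of the original schedule (3) and the proposed one. We assume |D| = 48,000 here, roughly the size of our 80% training split; the exact value only sets the time scale.

% Sketch comparing the two momentum schedules (illustrative only).
D       = 48000;                      % assumed training-set size (approx. 80% of MNIST)
mu_max  = 0.99;
t       = 0:15*D;                     % 15 epochs of single-sample updates
mu_orig = min(1 - 2.^(-1 - log2(1 + floor(t/250))), mu_max);   % original schedule (3)
mu_new  = 1 - 2.^(-1 - log2(1 + floor(t/D)));                  % proposed schedule (Algorithm 1, line 3)
fprintf('Half an epoch in: original mu = %.3f, proposed mu = %.3f\n', ...
        mu_orig(round(D/2)), mu_new(round(D/2)));

Halfway through the first epoch the original schedule has already saturated at µ_max = 0.99, while the proposed schedule is still at 0.5, leaving the slow-start estimates of vSGD-l undisturbed.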

We tested this idea for 15 epochs on a reduced architecture of 8 and 16 kernel maps in the first two convolutional layers, because the CUDA implementation of vSGD-l NAG resulted in a NaN error on the full-sized CNN. In summary, the new hyperparameter-free algorithm reduced the objective value of vSGD-l by about 35% (Figure 7) and decreased the validation and test errors by roughly an additional percentage point (Table 3).


Algorithm 1 vSGD-l NAG with a new momentum schedule; initialization is the same as for vSGD-l.

 1: repeat
 2:   draw a sample {x^(t), y^(t)} ∼ D
 3:   µ ← 1 − 2^(−1 − log2(1 + ⌊t/|D|⌋))
 4:   g^(t) ← ∇f(θ + µv; x^(t), y^(t))
 5:   h^(t) ← ∇²f(θ + µv; x^(t), y^(t))
 6:   for i ∈ [1, ..., d] do
 7:     ḡ_i ← (1 − τ_i⁻¹) ḡ_i + τ_i⁻¹ g_i^(t)
 8:     v̄_i ← (1 − τ_i⁻¹) v̄_i + τ_i⁻¹ (g_i^(t))²
 9:     h̄_i ← (1 − τ_i⁻¹) h̄_i + τ_i⁻¹ |h_i^(t)|
10:     τ_i ← (1 − ḡ_i²/v̄_i) τ_i + 1
11:     v_i ← µ v_i − (ḡ_i²/(h̄_i v̄_i)) g_i^(t)
12:     θ_i ← θ_i + v_i
13:   end for
14: until convergence requirement is met
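For concreteness, a compact Matlab sketch of one iteration of Algorithm 1, vectorized over all d parameters, is given below. It is a simplified illustration under stated assumptions, not our CUDA implementation: gradFn and hessFn are hypothetical handles returning the stochastic gradient and the diagonal Hessian estimate for the current sample, and tau, gbar, vbar, hbar are assumed to be initialized as in vSGD-l, with the velocity v starting at zero.

% One vSGD-l NAG iteration (Algorithm 1), vectorized over the d parameters.
mu    = 1 - 2^(-1 - log2(1 + floor(t/D)));         % line 3: momentum schedule
g     = gradFn(theta + mu*v);                      % line 4: gradient at the look-ahead point
h     = abs(hessFn(theta + mu*v));                 % line 5: |diagonal Hessian| estimate
gbar  = (1 - 1./tau).*gbar + (1./tau).*g;          % line 7: running average of gradients
vbar  = (1 - 1./tau).*vbar + (1./tau).*g.^2;       % line 8: running average of squared gradients
hbar  = (1 - 1./tau).*hbar + (1./tau).*h;          % line 9: running average of |diag Hessian|
tau   = (1 - gbar.^2./vbar).*tau + 1;              % line 10: memory-size update
v     = mu*v - (gbar.^2./(hbar.*vbar)).*g;         % line 11: NAG velocity with vSGD-l learning rate
theta = theta + v;                                 % line 12: parameter update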

[Figure 7 omitted: validation set softmax loss versus training time for vSGD-l, vSGD-l NAG (new), and SGD-CM-NAG.]

Figure 7. SGD-CM-NAG still outperforms both algorithms, but vSGD-l NAG closes the gap considerably.

Algorithms     Validation set MCE   Test set MCE
vSGD-l         3.71%                3.14%
vSGD-l NAG     2.78%                2.41%
SGD-CM-NAG     2.27%                2.13%

Table 3. Summary of error statistics on the reduced architecture (8 and 16 kernel maps).

5. Summary

In this project, a CNN-softmax topology was trained by several popular and recent SGD variants to classify the MNIST handwritten digits dataset. From this survey, it was experimentally confirmed that NAG momentum minimizes the objective best, owing to its quadratic convergence rate for convex problems; this effect shows up more dominantly as the iterates get closer to a local optimum, where the problem becomes more convex. The local variance-based SGD (vSGD-l) is hyperparameter-free, but it was found not to enjoy the same level of convergence and generalization as the other methods with manual tuning. In the second part of this work, a new momentum schedule was proposed to combine the strengths of local-variance estimation and Nesterov's acceleration, which resulted in a new hyperparameter-free algorithm that outperformed the original vSGD-l. However, there still remain some practical difficulties in using vSGD-l NAG: (1) it is not trivial to implement; (2) it requires diagonal Hessian estimates; and (3) the memory overhead is large; in particular, it needs to additionally store 5-6 times the number of parameters. In a deep learning architecture with millions of parameters, this may be impractical or may require more hardware to run.

Acknowledgments

The authors would like to thank Andrew Maas, Brody Huval and Kyle Anderson for helpful discussions and advice on function optimization techniques to consider.

References

Bottou, Léon. Large-scale machine learning with stochastic gradient descent. In Lechevallier, Yves and Saporta, Gilbert (eds.), Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT'2010), pp. 177-187, Paris, France, August 2010. Springer.

Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121-2159, July 2011. ISSN 1532-4435.

Le, Quoc, Ngiam, Jiquan, Coates, Adam, Lahiri, Abhik, Prochnow, Bobby, and Ng, Andrew. On optimization methods for deep learning. In Getoor, Lise and Scheffer, Tobias (eds.), Proceedings of the 28th International Conference on Machine Learning (ICML-11), ICML '11, pp. 265-272, New York, NY, USA, June 2011. ACM. ISBN 978-1-4503-0619-5.

LeCun, Yann. More optimization methods. p. 38. URL http://cilvr.cs.nyu.edu/diglib/lsml/lecture02-optimi.pdf.

Schaul, Tom, Zhang, Sixin, and LeCun, Yann. No more pesky learning rates. In International Conference on Machine Learning (ICML), 2013.

Sutskever, Ilya, Martens, James, Dahl, George, and Hinton, Geoffrey. On the importance of initialization and momentum in deep learning. In Dasgupta, Sanjoy and McAllester, David (eds.), Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pp. 1139-1147. JMLR Workshop and Conference Proceedings, May 2013.