
Learning Rate Dropout

Huangxing Lin1, Weihong Zeng1, Xinghao Ding1*, Yue Huang1, Chenxi Huang1, John Paisley2

1Xiamen University, China    2Columbia University, New York, NY, USA

{hxlin, zengwh}@stu.xmu.edu.cn, [email protected],

[email protected], [email protected], [email protected]

Abstract

The performance of a deep neural network is highly dependent on its training, and finding better local optimal solutions is the goal of many optimization algorithms. However, existing optimization algorithms show a preference for descent paths that converge slowly and do not seek to avoid bad local optima. In this work, we propose Learning Rate Dropout (LRD), a simple gradient descent technique for training related to coordinate descent. LRD empirically aids the optimizer in actively exploring the parameter space by randomly setting some learning rates to zero; at each iteration, only parameters whose learning rate is not 0 are updated. As the learning rates of different parameters are dropped, the optimizer samples a new loss descent path for the current update. The uncertainty of the descent path helps the model avoid saddle points and bad local minima. Experiments show that LRD is surprisingly effective in accelerating training while preventing overfitting. (Code: https://github.com/HuangxingLin123/Learning-Rate-Dropout)

1. Introduction

Deep neural networks are trained by optimizing high-dimensional, non-convex loss functions. The success of training hinges on how well we can minimize these loss functions, both in terms of the quality of the convergence points and the time it takes to find them. These loss functions are usually optimized with gradient-descent-based algorithms, which search for a descent path of the loss in parameter space according to a predefined paradigm. Existing optimization algorithms can be roughly categorized into adaptive and non-adaptive methods. Adaptive algorithms (Adam [15], AMSGrad [27], RMSprop [31], RAdam [19], etc.) tend to find paths that decrease the loss quickly, but usually converge to bad local minima. In contrast, non-adaptive algorithms (e.g., SGD with momentum) converge better, but the

Figure 1. (a) Gradient updates are trapped at a saddle point. (b) With learning rate dropout applied to training, the optimizer escapes from the saddle point more quickly. Red arrow: the update of each parameter in the current iteration. Red dot: the initial state. Yellow dots: the subsequent states. ×: randomly dropped learning rate.

training speed is slow. Helping the optimizer maintain fast training speed and good convergence is an open problem.

Implicit regularization techniques (e.g., dropout [30], weight decay [16], noisy labels [37]) are widely used to help training. The most popular is dropout, which effectively prevents feature co-adaptation (a sign of overfitting) by randomly dropping hidden units (i.e., their activations are zeroed). Dropout can be interpreted as a way of regularizing training by adding noise to the hidden units. Other methods achieve similar effects by injecting noise into gradients [24], labels [37] and activation functions [10]. The principle of these methods is to force the optimizer to randomly change the loss descent path in the high-dimensional parameter space by injecting interference into the training. The uncertainty of the loss descent path gives the model more opportunities to escape local minima and find a better result. However, many researchers have found that these regularization methods improve generalization at the cost of training time [30, 34]. This is because random noise in training may create an erroneous loss descent path (i.e., the current parameter update may increase the loss), resulting in slow convergence. In fact, what is needed is a technique that finds good local optima and converges quickly.

A possible solution is to inject uncertainty into the loss descent path while ensuring that the resulting path is correct. A correct loss descent path is one that decreases the loss.


(a) Back propagation (b) Applying dropout (c) Applying learning rate dropout

Figure 2. (a) The BP algorithm for a neural network. Black lines represent the gradient updates to each weight parameter (e.g., wqs, wsv, wvz). (b) An example of applying standard dropout; the dropped units appear in neither forward nor back propagation during training. (c) The red line indicates that the learning rate is dropped, so the corresponding weight parameter is not updated. Note that the dropped learning rate does not affect forward propagation or gradient back propagation. At each iteration, different learning rates are dropped.

On the other hand, inspired by coordinate descent algorithms [36, 25, 23], we recognize a fundamental fact: the decline of the loss does not depend solely on the simultaneous update of all weight parameters. According to coordinate descent, even if only one parameter is updated, the loss is reduced. In particular, the loss descent path is determined by all updated parameters. This means that we can sample a new descent path by randomly pausing the update of certain parameters. Based on these observations, we propose a new regularization technique, learning rate dropout (LRD), to help the optimizer randomly sample a correct loss descent path at each iteration.

Learning rate dropout provides a way to inject randomness into the loss descent path while maintaining a descent direction. The key difference from standard dropout is that we randomly drop the learning rates of the model parameters instead of the hidden units. During training, the optimizer dynamically calculates the update and allocates a learning rate for each parameter. Learning rate dropout works by randomly determining which parameters are not updated (i.e., their learning rate is set to zero) for the current training iteration. In the simplest case, the learning rate of each parameter is retained with a fixed probability p, independently of the other parameters, or dropped with probability 1 − p. A dropped learning rate is temporarily set to zero, which does not affect the updates of other parameters or subsequent updates. A neural network with n parameters has 2^n feasible descent paths. Since each path is correct in that it decreases the objective function, our learning rate dropout does not hinder training, unlike previous methods [30, 37]. However, convergence issues such as saddle points and bad local optima are avoided more easily using the proposed method. Furthermore, LRD can be trivially integrated into existing gradient-based optimization algorithms, such as Adam.

2. Related work

Noise injection: Learning rate dropout can be interpreted as a regularization technique for the loss descent path that adds noise to the learning rate. The idea of adding noise [1, 13, 9, 32] to the training of neural networks has drawn much attention. Adilova et al. [1] demonstrate that noise injection substantially improves model quality for non-linear neural networks. An [2] explored the effects of injecting noise into the inputs, outputs, and weights of multilayer feedforward neural networks. Blundell et al. [4] regularize the weights by minimizing a variational free energy. Neelakantan et al. [24] also find that adding noise to gradients can help avoid overfitting and result in lower training loss. Xie et al. [37] impose regularization within the loss layer by randomly setting labels to be incorrect. Similarly, standard dropout [30] is a way of regularizing a neural network by adding noise to its hidden units. Wan et al. [33] further proposed DropConnect, a generalization of dropout that sets a randomly selected subset of weights, rather than hidden units, to zero. On the other hand, adding noise to activation functions can prevent early saturation. Gulcehre et al. [10] found that noisy activation functions (e.g., sigmoid and tanh) are easier to optimize, and replacing the non-linearities by their noisy counterparts usually leads to better results. Chen et al. [5] propose a noisy softmax to mitigate the early saturation issue by injecting annealed noise into the softmax input. All of these methods can effectively prevent overfitting. However, they result in slower convergence.

Optimization algorithms: Deep neural networks are optimized using gradient-descent-based algorithms. Stochastic gradient descent (SGD) [28] is a widely used approach that performs well in many research fields. However, it has been empirically observed that SGD has slow convergence since it scales the gradient uniformly in all directions.


Algorithm 1 Generic framework of optimization with learning rate dropout. ◦ indicates element-wise multiplication.

Require: α: learning rate; {φt, ψt}Tt=1: functions to calculate the momentum and the adaptive rate; W0: initial parameters; f(W): stochastic objective function; p: dropout rate; AdaptiveMethod: False or True.
Ensure: WT: resulting parameters.
1: for t = 1 to T do
2:   Gt = ∇ft(Wt−1) (calculate gradients w.r.t. the stochastic objective at timestep t)
3:   Mt = φt(G1, · · · , Gt) (accumulation of past and current gradients)
4:   if AdaptiveMethod is True then
5:     Vt = ψt(G1, · · · , Gt) (accumulation of squared gradients)
6:     ΔWt = Mt / √Vt
7:   else
8:     ΔWt = Mt
9:   end if
10:  Randomly sample a learning rate dropout mask Dt with each element dij,t ∼ Bernoulli(p)
11:  At = αDt (randomly drop learning rates at timestep t)
12:  Ut = At ◦ ΔWt (calculate the update for each parameter)
13:  Wt = Wt−1 − Ut
14: end for

To address this issue, variants of SGD that adaptively rescale or average the gradient direction have achieved some success. Examples include RMSprop [31], Adadelta [38] and Adam [15]. In particular, Adam is the most popular adaptive optimization algorithm due to its rapid training speed. However, many publications [35, 27] indicate that Adam has poor convergence. Recently, AMSGrad [27] and AdaBound [22], two variants of Adam, were proposed to address the convergence issues of Adam by bounding the learning rates. Furthermore, RAdam [19] also aims to solve the convergence issue of Adam by rectifying the variance of the adaptive learning rate. While these adaptive methods often display faster progress in training, they have also been observed to fail to converge well in many cases.

3. Method description

3.1. Online optimization

We start our discussion of learning rate dropout by integrating it into an online optimization problem [40]. We provide a generic framework of optimization methods with learning rate dropout in Algorithm 1 (all multiplications are element-wise). Consider a neural network with weight parameters W, and assume W ∈ R^{h×k}. In the online setup, at each time step t, the optimization algorithm modifies the current parameters Wt−1 using the loss function over the data at time t, ft(Wt−1), and the gradient Gt = ∇ft(Wt−1). Finally, the optimizer calculates the update for Wt−1 using Gt and possibly other terms, including earlier gradients.
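To make the framework concrete, the following NumPy sketch instantiates one step of Algorithm 1 for an Adam-style choice of φ and ψ (exponential moving averages); the function name, the (1 − β) scaling, and the ε term are our own illustrative choices rather than the paper's released code.

```python
import numpy as np

def lrd_step(W, grad, M, V, lr=1e-3, beta1=0.9, beta2=0.999,
             p=0.5, adaptive=True, eps=1e-8, rng=np.random):
    """One optimization step in the spirit of Algorithm 1 (illustrative sketch)."""
    # Gradient accumulation (Algorithm 1, lines 3 and 5). These always run,
    # so no gradient information is lost when a learning rate is dropped.
    M = beta1 * M + (1.0 - beta1) * grad
    V = beta2 * V + (1.0 - beta2) * grad ** 2

    # Raw update direction (lines 4-9).
    dW = M / (np.sqrt(V) + eps) if adaptive else M

    # Learning rate dropout mask and masked learning rates (lines 10-11).
    D = rng.binomial(1, p, size=W.shape)
    A = lr * D

    # Apply the masked update (lines 12-13).
    W = W - A * dW
    return W, M, V
```

Here p = 1 recovers the underlying optimizer, while p < 1 zeroes each parameter's learning rate independently at every step.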

Algorithm 1 encapsulates many popular adaptive and non-adaptive methods through the definition of the gradient accumulation terms φ(·) and ψ(·). In particular, the adaptive methods are distinguished by the choice of ψ(·), which is absent in non-adaptive methods. Most methods contain a similar momentum component

Mt = φt(G1, · · · , Gt) = βMt−1 + ηGt,    (1)

where β > 0 is the momentum parameter and η > 0. The momentum term accumulates an exponential moving average of previous gradients to correct the current update.
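For intuition, unrolling this recursion with the convention M0 = 0 (our rewriting of Eq. (1), not an additional result from the paper) gives

Mt = η (Gt + βGt−1 + β²Gt−2 + · · · + β^{t−1}G1),

so older gradients are discounted geometrically by β.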

3.2. Learning rate dropout

During training, the optimizer assigns a learning rate α, which is typically a constant, to each parameter. When learning rate dropout is applied, in each iteration the learning rate of each parameter is kept with probability p, independently of the other parameters, or set to zero otherwise. At each time step t, a random binary mask matrix Dt ∈ {0, 1}^{h×k} is sampled to encode the learning rate information, with each element dij,t ∼ Bernoulli(p). A learning rate matrix At at time step t is then obtained by

At = αDt. (2)
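In code, Eq. (2) amounts to sampling a Bernoulli mask of the same shape as the parameters; the shapes and variable names below are our own illustrative choices.

```python
import numpy as np

h, k = 4, 3                  # example parameter shape, W in R^{h x k}
alpha, p = 0.001, 0.5        # base learning rate and keep probability

D_t = np.random.binomial(1, p, size=(h, k))   # d_ij,t ~ Bernoulli(p)
A_t = alpha * D_t                             # Eq. (2): kept entries get alpha, dropped entries get 0
```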

LRD and coordinate descent: For a parameter, a learning rate of 0 means its value is not updated. For a model, applying learning rate dropout is equivalent to uniformly sampling one of 2^n possible parameter subsets on which to perform a gradient update in the current iteration (see Figure 3). This is closely connected to coordinate descent, in which each set of a partition of the parameters is fully optimized in a cycle before returning to the first set for the next iteration. Therefore, each update of LRD still causes a decrease in the loss function according to gradient


Figure 3. Applying learning rate dropout to training. This model contains 3 parameters, so there are 8 loss descent paths to choose from in each iteration. The blue dot is the model state. The solid lines are the updates of each parameter. The dashed line is the resulting update for the model. "×" represents dropping the learning rate.

descent theory [29, 36, 25], while better avoiding saddle points. On the other hand, our LRD does not interrupt the gradient calculation of any parameter. If there is a gradient accumulation term (e.g., momentum) in the optimizer, the gradients of each parameter are stored for subsequent updates, regardless of whether the learning rate is dropped. Furthermore, our method does not slow down training as previous methods do [30, 37, 33].
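To make this concrete, here is a minimal PyTorch-style sketch of SGD with momentum plus learning rate dropout; it is our own illustration rather than the authors' released implementation. Note that the momentum buffer is always updated, and the Bernoulli mask only gates the final step.

```python
import torch

class SGDMWithLRD(torch.optim.Optimizer):
    """SGD with momentum and learning rate dropout (illustrative sketch)."""

    def __init__(self, params, lr=0.1, momentum=0.9, p=0.5):
        super().__init__(params, dict(lr=lr, momentum=momentum, p=p))

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            lr, beta, p = group['lr'], group['momentum'], group['p']
            for w in group['params']:
                if w.grad is None:
                    continue
                state = self.state[w]
                if 'momentum_buffer' not in state:
                    state['momentum_buffer'] = torch.zeros_like(w)
                buf = state['momentum_buffer']
                buf.mul_(beta).add_(w.grad)                    # momentum is always accumulated
                mask = torch.bernoulli(torch.full_like(w, p))  # per-parameter Bernoulli(p) mask
                w.add_(-lr * mask * buf)                       # only the step itself is masked
```

It is used like any optimizer: construct it from model.parameters(), call loss.backward(), then step().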

Toy example: Learning rate dropout can accelerate training and improve generalization. By adding stochasticity to the loss descent path, this technique helps the model traverse quickly through "transient" plateaus (e.g., saddle points or local minima) and gives the model more chances to find a better minimum. To verify the effectiveness of learning rate dropout, we show a toy example that visualizes the loss descent path during optimization. Consider the following nonconvex function:

f(x, y) = (1.5 − x² + xy)² + (2.25 − x² + xy²)² + (2.625 − x² + xy³)²,    (3)

where x ∈ [−4, 0], y ∈ [−2.0, 3.0]. We use the popular optimizer Adam to search for the minima of the function in the two-dimensional parameter space. For this function, the point (−0.74, 1.40) is the optimal solution. In Figure 4, we see that the convergence of Adam is sensitive to the initial point: different initializations lead to different convergence results, and Adam is easily trapped by the minimum near the initial point, even if it is a bad local minimum. In contrast, our learning rate dropout makes Adam more active. Even if the optimizer reaches a local minimum, learning rate dropout still encourages it to search for other possible paths instead of doing nothing. This example illustrates that learning rate dropout can effectively help the optimizer escape from suboptimal points and find a better result.
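The toy experiment can be reproduced along the following lines (our own reimplementation; the step count, learning rate, and starting point are illustrative choices, and we do not clip iterates to the plotted domain):

```python
import numpy as np

def f_grad(x, y):
    """Gradient of f(x, y) in Eq. (3)."""
    r1 = 1.5   - x**2 + x * y
    r2 = 2.25  - x**2 + x * y**2
    r3 = 2.625 - x**2 + x * y**3
    dfdx = 2*r1*(-2*x + y) + 2*r2*(-2*x + y**2) + 2*r3*(-2*x + y**3)
    dfdy = 2*r1*x + 2*r2*(2*x*y) + 2*r3*(3*x*y**2)
    return np.array([dfdx, dfdy])

def adam_path(w0, steps=2000, lr=0.01, b1=0.9, b2=0.999, eps=1e-8, p=1.0, seed=0):
    """Adam on f; p < 1 applies learning rate dropout to the 2D parameter."""
    rng = np.random.default_rng(seed)
    w, m, v = np.array(w0, dtype=float), np.zeros(2), np.zeros(2)
    path = [w.copy()]
    for t in range(1, steps + 1):
        g = f_grad(*w)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g**2
        m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
        mask = rng.binomial(1, p, size=2)      # all ones when p = 1 (vanilla Adam)
        w = w - lr * mask * m_hat / (np.sqrt(v_hat) + eps)
        path.append(w.copy())
    return np.array(path)

path_adam = adam_path([-3.0, -1.5], p=1.0)   # vanilla Adam
path_lrd  = adam_path([-3.0, -1.5], p=0.5)   # Adam with learning rate dropout
```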

4. Experiments

To empirically evaluate the proposed method, we perform several popular machine learning tasks, including

(a) Adam (b) Adam with learning rate dropout

Figure 4. Visualization of the loss descent paths. Learning rate dropout helps Adam escape from the local minimum. The star marks the optimal point (−0.74, 1.40).

image classification, image segmentation and object detection. Using different models and optimization algorithms, we demonstrate that learning rate dropout is a general technique for improving neural network training, not specific to any particular application domain.

4.1. Image classification

We first apply learning rate dropout to the multi-class classification problem of the MNIST, CIFAR-10 and CIFAR-100 datasets.

MNIST: The MNIST digits dataset [17] contains 60,000 training and 10,000 test images of size 28 × 28. The task is to classify the images into 10 digit classes. Following previous publications [15, 3], we train a simple fully connected neural network with two hidden layers (FCNet) on MNIST. Each hidden layer consists of 1,000 rectified linear units (ReLU). Training is performed on mini-batches of 128 images for 100 epochs over the training set. A learning rate decay scheme is not used.
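For reference, the FCNet described above could be defined as follows (a sketch; the class name and the use of nn.Sequential are our assumptions):

```python
import torch.nn as nn

class FCNet(nn.Module):
    """Two hidden layers of 1,000 ReLU units for 28x28 MNIST digits."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                     # 28 * 28 = 784 inputs
            nn.Linear(784, 1000), nn.ReLU(),
            nn.Linear(1000, 1000), nn.ReLU(),
            nn.Linear(1000, num_classes),
        )

    def forward(self, x):
        return self.net(x)
```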

CIFAR: The CIFAR-10 and CIFAR-100 datasets consist of 60,000 RGB images of size 32 × 32, drawn from 10 and 100 categories, respectively. 50,000 images are used for training and the rest for testing. In both datasets, training and testing images are uniformly distributed over all categories. To show the broad applicability of the proposed method, we use ResNet-34 [12] for CIFAR-10 and DenseNet-121 [14] for CIFAR-100. Both models are trained for 200 epochs with a mini-batch size of 128. We reduce the learning rate by a factor of 10 at the 100th and 150th epochs. The weight decay rate is set to 5e-4.
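A training configuration matching this description might look like the sketch below; we use the torchvision ResNet-34 for brevity, although a CIFAR-specific ResNet-34 stem is a common alternative, and the authors' exact script may differ.

```python
import torch
from torchvision.models import resnet34

model = resnet34(num_classes=10)                       # ResNet-34 head for CIFAR-10
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
# Reduce the learning rate by 10x at the 100th and 150th epochs (200 epochs total).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[100, 150], gamma=0.1)

for epoch in range(200):
    # ... one pass over CIFAR-10 with mini-batches of 128 images ...
    scheduler.step()
```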

For a reliable evaluation, the classifiers are trained multiple times using several optimization algorithms, including SGD-momentum (SGDM) [26], RMSprop [31], Adam [15], AMSGrad [27] and RAdam [19]. We apply learning rate dropout to each optimization algorithm to show how this technique can help with training. Unless otherwise specified, the dropout rate p (the probability of performing an individual parameter update in any


Figure 5. The learning curves for the 2-layer FCNet on MNIST. Top: Training loss. Middle: Training accuracy. Bottom: Test accuracy.

Figure 6. The learning curves for ResNet-34 on CIFAR-10. Top: Training loss. Middle: Training accuracy. Bottom: Test accuracy.

given iteration) is set to 0.5. For each optimization algorithm, the initial learning rate has the greatest impact on the ultimate solution. Therefore, we only tune the initial learning rate while retaining the default settings for the other hyper-parameters. Specifically, the initial learning rates for SGDM, RMSprop, Adam, AMSGrad, and RAdam are 0.1, 0.001, 0.001, 0.001, and 0.03, respectively.

We first show the learning curves for MNIST in Figure 5. We find that all methods show fast convergence and good generalization. However, compared with their prototypes, the optimizers that use learning rate dropout still display better performance in terms of training speed and test accuracy. We further report the results for the CIFAR datasets in Figure 6 and Figure 7. The test accuracy obtained by different methods is summarized in Table 1. The same models with and without learning rate dropout converge very differently. The optimization algorithms without learning rate dropout either train slowly or converge to poor results. In contrast, learning rate dropout benefits all optimization algorithms by improving training speed and generalization. In particular, we observe that by applying learning rate dropout to the non-adaptive method SGDM, we can achieve convergence speed comparable to other adaptive methods.


Figure 7. The learning curves for DenseNet-121 on CIFAR-100. Top: Training loss. Middle: Training accuracy. Bottom: Test accuracy.

Table 1. Test accuracy on image classification. (%)

Dataset      Model          SGDM           RMSprop        Adam           AMSGrad        RAdam
MNIST        FCNet          97.82 / 98.21  98.06 / 98.40  97.88 / 98.15  97.88 / 98.51  97.97 / 98.21
CIFAR-10     ResNet-34      95.30 / 95.54  92.71 / 93.68  93.05 / 93.77  93.31 / 93.73  94.24 / 94.61
CIFAR-100    DenseNet-121   79.09 / 79.42  70.21 / 74.02  72.55 / 74.34  73.91 / 75.23  71.81 / 73.04
Each entry reports accuracy without / with learning rate dropout (LRD).

On CIFAR-100, learning rate dropout even helps RMSprop obtain an accuracy improvement of close to 4%. Furthermore, learning rate dropout incurs negligible computational cost and requires no parameter tuning aside from the dropout rate p.

4.2. Image segmentation

We also consider the semantic segmentation task, which assigns each pixel in an image a semantic label [7, 21]. We evaluate our method on the PASCAL VOC2012 semantic segmentation dataset [8], which consists of 20 object categories and one background category. Following the conventional setting in [6, 18], the dataset is augmented with the extra annotated VOC images provided in [11], which results in 10,582, 1,449 and 1,456 images for training, validation and testing, respectively. We use the state-of-the-art segmentation model PSPNet [39] to conduct the experiments, trained with the Adam solver and an initial learning rate of 0.001. Other hyperparameters such as batch size and weight decay follow the settings in [39]. Segmentation performance is measured by the mean of class-wise intersection over union (Mean IoU) and pixel-wise accuracy (Pixel Accuracy). Results for this experiment are reported in Figure 8. With our proposed learning rate dropout, the model yields 0.688/0.921 in terms of Mean IoU and Pixel Accuracy, exceeding the 0.637/0.905 of vanilla Adam. In addition, Figure 8 shows once again that learning rate dropout leads to

Figure 8. Results for PSPNet on the VOC2012 semantic segmentation dataset. Left: Mean IoU. Right: Pixel Accuracy.

Figure 9. Results for object detection. Left: Training loss. Right: Validation loss.

faster convergence.

4.3. Object detection

Object detection is a challenging and important problem in the computer vision community. We next apply learning


(a) Training accuracy (b) Test accuracy

Figure 10. Results obtained using different dropout rates p. Top: Adam. Bottom: SGDM.

Table 2. Test accuracy on CIFAR-10 using different dropout rates p. (%)

                 Adam    SGDM
p = 1 (no LRD)   93.05   95.30
p = 0.1          93.32   94.32
p = 0.3          93.93   95.39
p = 0.5          93.77   95.54
p = 0.7          93.50   95.42
p = 0.9          93.35   95.26

rate dropout to an object detection task. We use the VOC2012 trainval and VOC2007 trainval datasets for training, and use VOC2007 test as our validation and testing data. We train a one-stage detection network, SSD [20], using Adam with a learning rate of 0.001. Other hyperparameter settings are the same as in [20]. To show the effect of learning rate dropout on training, we use the training loss and validation loss as the evaluation metrics. We run the model for 100 epochs and show the results in Figure 9. After applying learning rate dropout, the training loss and validation loss drop faster and converge to better results. This means that the detector achieves higher detection accuracy.

4.4. Effect of dropout rate

As mentioned, the hyperparameter p, which controls the expected size of the coordinate update set, is the only parameter that needs to be tuned. In this section, we explore the effect of varying p. We conduct experiments with ResNet-34 on the CIFAR-10 dataset, where p is chosen from a broad range {0.1, 0.3, 0.5, 0.7, 0.9}. We use Adam and SGDM to train the model; other hyperparameters are the same as in the previous experiments. The results are shown in Figure 10. We can see that for any p, learning rate dropout can speed

(a) Training accuracy (b) Test accuracy

Figure 11. Results on CIFAR-10 using different regularization strategies. Top: Adam. Bottom: SGDM.

up training, but the smaller the p, the faster the convergence. In addition, we report the test accuracy in Table 2. As can be seen, almost all values of p lead to an improvement in test accuracy. In practical applications, in order to balance training speed and generalization, we recommend setting p in the range of 0.3 to 0.7.

4.5. Comparison with other regularizations

We further compare our learning rate dropout with three popular regularization methods: standard dropout [30], noisy label [37] and noisy gradient [24]. Standard dropout regularizes the network on the hidden units, and we set the probability that each hidden unit is retained to 0.9. The noisy label method disturbs each training sample with probability 0.05 (i.e., a label is correct with probability 0.95); for each changed sample, the label is drawn uniformly at random from the other labels, excluding the correct one. The noisy gradient method adds Gaussian noise n ∼ N(0, 0.1) to the gradient at every time step, with the variance of the Gaussian noise gradually decaying to 0 according to the parameter setting in [24]. We apply these regularization techniques to the ResNet-34 trained on CIFAR-10. The model is trained multiple times using SGDM and Adam. For clarity, we use the terms "SD", "NL" and "NG" to denote training with standard dropout, noisy label and noisy gradient, respectively.
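As an example of how one of these baselines can be implemented, the sketch below adds annealed Gaussian noise to the gradients before the optimizer step; the schedule form follows [24], but the constants and the function name are our own.

```python
import torch

def add_gradient_noise(model, step, eta=0.1, gamma=0.55):
    """Add annealed Gaussian noise to all gradients (call after loss.backward(),
    before optimizer.step()). Variance eta / (1 + step)**gamma decays toward 0."""
    std = (eta / (1 + step) ** gamma) ** 0.5
    with torch.no_grad():
        for param in model.parameters():
            if param.grad is not None:
                param.grad.add_(torch.randn_like(param.grad) * std)
```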

We report the learning curves in Figure 11. As can be seen, our learning rate dropout can speed up training, while the other regularization methods hinder convergence. This is because the other regularization methods inject random noise into the training, which may lead to a wrong loss descent path. In contrast, our learning rate dropout always samples a correct loss descent path, so it does not worsen the


Table 3. Test accuracy on CIFAR-10 using different regularization strategies. (%)

                              Adam    SGDM
No regularization             93.05   95.30
Standard dropout (SD)         94.05   95.33
Noise gradient (NG)           93.70   95.00
Noise label (NL)              92.59   94.61
Learning rate dropout (LRD)   93.77   95.54
LRD and SD                    94.73   95.58
LRD and NG                    93.92   95.26

(a) Training accuracy (b) Test accuracy

Figure 12. LRD vs. DG on CIFAR-10 (ResNet-34 is used). Top: Adam. Bottom: SGDM.

training. We also show the test accuracy in Table 3. The results show that standard dropout and our learning rate dropout can effectively improve generalization, while the effects of noisy label and noisy gradient are disappointing. On the other hand, we find that our learning rate dropout and other regularization methods can be complementary. We use "LRD and ∗∗" to denote the simultaneous use of learning rate dropout and another regularization method. From Figure 11 and Table 3, we see that learning rate dropout can work with other regularization methods to further improve the performance of the model. This shows that learning rate dropout is not limited to being a substitute for standard dropout or other methods.

4.6. Dropout on Gt

Learning rate dropout is also equivalent to applying dropout to the update ΔWt of the parameters W at each timestep t. The idea is to inject uncertainty into the loss descent path while ensuring that the loss descends in a correct direction at each iteration. One may wonder whether there are other ways to realize this idea. For example, using dropout on the gradient Gt = ∇ft(Wt−1) is also a possible solution.

(a) Training accuracy (b) Test accuracy

Figure 13. Results obtained using different psd.

Table 4. Test accuracy on CIFAR-10 using different psd. (%)

psd     1 (no SD)   0.9     0.8     0.7     0.6
Adam    93.05       94.05   94.08   93.34   92.89

The dropout on Gt can also interfere with the loss descent path and does not lead to the wrong descent direction. Here, we compare our learning rate dropout with dropout on Gt

(DG). DG follows the setting of LRD: the gradient gij,t of a parameter wij,t−1 is retained with probability p, or is set to 0 with probability 1 − p. In these experiments, p = 0.5. We show the learning curves in Figure 12. As can be seen, DG has no positive effect on training. The causes of this may be multiple. First, the gradient accumulation terms (e.g., momentum) can greatly suppress the impact of dropping gradients on the loss descent path. In addition, dropping the gradient may slow down training due to the lack of gradient information. In contrast, our LRD only temporarily stops updating some parameters, and all gradient information is stored by the gradient accumulation terms. Therefore, in our learning rate dropout training there is no loss of gradient information.
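The distinction can be summarized in a short sketch (our own code, reusing the SGDM notation of Section 3): DG masks the gradient before it enters the momentum buffer, whereas LRD keeps the buffer intact and masks only the applied step.

```python
import torch

def sgdm_step(w, buf, lr=0.1, beta=0.9, p=0.5, mode="lrd"):
    """One SGD-momentum step contrasting DG and LRD (illustrative sketch)."""
    mask = torch.bernoulli(torch.full_like(w, p))
    g = w.grad
    if mode == "dg":
        g = mask * g                  # DG: dropped gradients never reach the buffer
    buf.mul_(beta).add_(g)            # momentum accumulation
    update = buf if mode == "dg" else mask * buf
    with torch.no_grad():
        w.sub_(lr * update)           # LRD: full buffer kept, only the step is masked
```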

5. Dropout rate of standard dropout

We have shown that LRD can speed up training and improve generalization with almost any dropout rate p (p ∈ (0, 1)). Standard dropout (SD) also has a tunable dropout rate psd. In this section, we explore the effect of varying psd. We apply SD to ResNet-34 trained on the CIFAR-10 dataset, where psd is chosen from {0.9, 0.8, 0.7, 0.6}. We show the results in Figure 13 and Table 4. As can be seen, the performance of standard dropout is sensitive to the choice of the dropout rate psd: the smaller the psd, the worse the convergence. This indicates that psd should be chosen carefully when applying SD to convolutional neural networks.

6. Conclusion

We presented learning rate dropout, a new technique for regularizing neural network training with gradient descent. This technique encourages the optimizer to actively explore the parameter space by randomly dropping some


learning rates. The uncertainty of the learning rates helps models quickly escape poor local optima and gives them more opportunities to search for better results. Experiments show the substantial ability of learning rate dropout to accelerate training and enhance generalization. In addition, this technique is found to be effective in a wide variety of application domains, including image classification, image segmentation and object detection. This shows that learning rate dropout is a general technique with great potential in practical applications.

References

[1] L. Adilova, N. Paul, and P. Schlicht. Introducing noise in decentralized training of neural networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 37–48. Springer, 2018.

[2] G. An. The effects of adding noise during backpropagation training on a generalization performance. Neural Computation, 8(3):643–674, 1996.
[3] W. An, H. Wang, Q. Sun, J. Xu, Q. Dai, and L. Zhang. A PID controller approach for stochastic optimization of deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8522–8531, 2018.
[4] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
[5] B. Chen, W. Deng, and J. Du. Noisy softmax: Improving the generalization ability of DCNN via postponing the early softmax saturation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5372–5381, 2017.
[6] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv preprint arXiv:1412.7062, 2014.
[7] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 801–818, 2018.
[8] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[9] A. Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.
[10] C. Gulcehre, M. Moczulski, M. Denil, and Y. Bengio. Noisy activation functions. In International Conference on Machine Learning, pages 3059–3068, 2016.
[11] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In 2011 International Conference on Computer Vision, pages 991–998. IEEE, 2011.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[13] G. E. Hinton and S. T. Roweis. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems, pages 857–864, 2003.
[14] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[16] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[17] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[18] G. Lin, C. Shen, A. Van Den Hengel, and I. Reid. Efficient piecewise training of deep structured models for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3194–3203, 2016.
[19] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265, 2019.
[20] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[21] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[22] L. Luo, Y. Xiong, Y. Liu, and X. Sun. Adaptive gradient methods with dynamic bound of learning rate. In ICLR, 2019.
[23] I. Necoara. Random coordinate descent algorithms for multi-agent convex optimization over networks. IEEE Transactions on Automatic Control, 58(8):2001–2012, 2013.
[24] A. Neelakantan, L. Vilnis, Q. V. Le, I. Sutskever, L. Kaiser, K. Kurach, and J. Martens. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807, 2015.
[25] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
[26] N. Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151, 1999.
[27] S. J. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. In ICLR, 2018.
[28] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
[29] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.


[30] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[31] T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.
[32] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371–3408, 2010.
[33] L. Wan, M. Zeiler, S. Zhang, Y. Le Cun, and R. Fergus. Regularization of neural networks using DropConnect. In International Conference on Machine Learning, pages 1058–1066, 2013.
[34] S. Wang and C. Manning. Fast dropout training. In International Conference on Machine Learning, pages 118–126, 2013.
[35] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pages 4148–4158, 2017.
[36] S. J. Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.
[37] L. Xie, J. Wang, Z. Wei, M. Wang, and Q. Tian. DisturbLabel: Regularizing CNN on the loss layer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4753–4762, 2016.
[38] M. D. Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
[39] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2881–2890, 2017.
[40] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.
