A Comparison of Selected Optimization Methods for Neural ...

IN DEGREE PROJECT TECHNOLOGY,FIRST CYCLE, 15 CREDITS

, STOCKHOLM SWEDEN 2020

A Comparison of Selected Optimization Methods for Neural Networks

OSKAR BONDE

LUDVIG KARLSSON

KTH ROYAL INSTITUTE OF TECHNOLOGYSCHOOL OF ENGINEERING SCIENCES

Abstract

Which numerical methods are ideal for training a neural network? In this reportfour different optimization methods are analysed and compared to each other.First, the most basic method Stochastic Gradient Descent that steps in the neg-ative gradients direction. We continue with a slightly more advanced algorithmcalled ADAM, often used in practice to train neural networks. Finally, we studytwo second order methods, the Conjugate Gradient Method which follows conju-gate directions, and L-BFGS, a Quasi-Newton method which approximates theinverse of the Hessian matrix. The methods are tasked to solve a classificationproblem with hyperspheres acting as decision boundaries and multiple differentnetwork configurations are used. Our results indicate why first order methodsare so commonly used today and that second order methods can be difficult touse effectively when the number of parameters are large.

2

Sammanfattning

Vilka numeriska metoder ar ideala for att trana ett neuralt natverk? I denna rap-port analyseras fyra olika optimeringsmetoder och jamfors med varandra. Forstden mest grundlaggande metoden Stochastic Gradient Descent som hela tidentar steg i motsatt riktning till gradienten av objektfunktionen. Vi fortsattersedan med ADAM, en lite mer avancerad algoritm som ofta anvands i prak-tiken for att trana neurala natverk. Slutligen studerar vi tva andra ordnin-gens metoder, Conjugate Gradient Method som foljer s.k. konjugerade rik-tningar, och L-BFGS, en Quasi-Newton-metod som anvander en approxima-tion av Hessianen av objektfunktionen. Metoderna har fatt uppgiften att losaett klassificeringsproblem med hypersfarer som klassificeringsgrans, dar fleraolika natverksstrukturer anvants. Vara resultat visar varfor forsta ordningensmetoder anvands sa ofta idag och att andra ordningens metoder kan vara svaraatt anvanda effektivt nar antalet parametrar ar stort.

3

Contents

1 Introduction 61.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 Neural Networks 82.1 Feedforward networks . . . . . . . . . . . . . . . . . . . . . . . . 92.2 Training of Neural Networks . . . . . . . . . . . . . . . . . . . . . 102.3 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . 112.4 Concepts from Probability Theory . . . . . . . . . . . . . . . . . 12

2.4.1 The Loss Function . . . . . . . . . . . . . . . . . . . . . . 122.4.2 Surrogate loss function . . . . . . . . . . . . . . . . . . . . 13

3 Optimization Algorithms 143.1 Batch Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143.2 First Order Methods . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.2.1 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . 153.2.2 Moment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163.2.3 ADAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.3 Second Order Methods . . . . . . . . . . . . . . . . . . . . . . . . 193.3.1 Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . 193.3.2 Conjugate Gradient Method . . . . . . . . . . . . . . . . . 203.3.3 Broyden-Fletcher-Goldfarb-Shanno Algorithm . . . . . . . 213.3.4 L-BFGS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.4 Issues in Optimization of Neural Networks . . . . . . . . . . . . . 23

4 Method 254.1 The Classification Problem . . . . . . . . . . . . . . . . . . . . . 254.2 Hyperparameter Optimization . . . . . . . . . . . . . . . . . . . . 264.3 Analysis Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

5 Results 285.1 Optimal Batch Size Hyperparameter . . . . . . . . . . . . . . . . 285.2 Shallow Network Algorithm Comparison . . . . . . . . . . . . . . 305.3 Wide Network Algorithm Comparison . . . . . . . . . . . . . . . 305.4 Deep Network Algorithm Comparison . . . . . . . . . . . . . . . 32

4

5.5 L-BFGS Failure Examples . . . . . . . . . . . . . . . . . . . . . . 345.6 Topics for Further Investigation . . . . . . . . . . . . . . . . . . . 34

6 Analysis & Discussion 366.1 Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 366.2 Network Structure . . . . . . . . . . . . . . . . . . . . . . . . . . 376.3 Algorithm Performance Analysis . . . . . . . . . . . . . . . . . . 376.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

7 Appendix 417.1 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . 417.2 Line Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437.3 Source Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437.4 Hyperparameters used . . . . . . . . . . . . . . . . . . . . . . . . 43

5

Introduction

Machine Learning is the topic of making computers adapt their behaviour sothat the actions of the computer get more accurate, where the accuracy is somekind of measure that reflects how well the actions corresponds to the correct ac-tions. A formal definition of machine learning was proposed by Tom M.Mitchell[1] and is formulated as

A computer program is said to learn from experience E on a given task T withsome measure of performance P , if its performance in the measure P increaseswith increasing experience E.

Usually the experience E is some set of data that the machine learning al-gorithm should use to learn how to perform some task T . There are a widerange of different machine learning problems, and these fall under some class ofproblems, one of which being supervised learning. In modern days, neural net-works are widely used as the computational tool for machine learning problems.The process of training neural networks is highly related to the optimizationof an objective function and thus the choice of optimization method becomes avery important step when designing the learning algorithm.

First order optimization methods uses the gradient of the objective functionto gradually decrease the value of the objective function in hope to finally reachan approximate optimum. These methods are easily the most popular ones intoday’s learning algorithms, but there is a massive amount of research going onabout the use of second order optimization methods [2]. These methods use notonly the gradient, but also the derivative of second order which in the multivari-ate case means involving the Hessian matrix. The second order methods have(in general) not yet shown to be as efficient as the first order methods; a bigreason is the computational load that comes with calculating the Hessian. Inthis report we will examine two first order optimization methods that are verypopular in practical applications: Stochastic Gradient Descent and the ADAMalgorithm. We will then compare these to two selected second order methods:the Conjugate Gradient method and the Limited-Memory Broyden-Fletcher-Goldfarb-Shanno algorithm.

6

1.1 Problem Formulation

In this report we will study a problem that falls under the category of supervisedlearning. These problems have the following properties [2]:

(i) A training set of examples with corresponding labels are given

(ii) The algorithm should, based on the training set, generalize to respondwell on all possible input

One problem that falls under this category is the problem of classification whichis the problem that will be studied in this report.

Classification is a popular class of problems in machine learning in which thegoal is to determine which class a given data sample belongs to, and there isa wide range of problems to study within this field. The problem that we aregoing to focus on is that of training a neural network to find a pre-defined clas-sification boundary. More precisely, we are going to define some boundary, e.g.a circle, and classify the points that lies within this boundary as 0s and the restas 1s. There are of course many ways of choosing the boundary, some moredifficult than others, but we have chosen to study a few versions of the followingproblem

1. An n-dimensional hyper-sphere that occupies about half of the area of thedomain [0, 1]

n

2. Points within the sphere are classified as 0s and the rest as 1s

So the learning algorithm will be designed to produce a function

f : Rn → {0, 1}. (1.1)

Figure 1.1 shows an example of the classification problem when the decisionboundary is a circle. The input data are points in the plane with labels 0 or 1.

Figure 1.1: Classification of points in the plane with a circle as decision bound-ary. Points with label 1 are plotted blue, the red points has label 0.

7

Neural Networks

A neural network is a computing system that consists of a composition of func-tions f (i) that takes some input x and returns an output y. The term neuralnetwork comes from the inspiration from biological neural networks that con-stitute the brain, and neural networks are often visualized graphically as a setof connected units called neurons. There are different kinds of neural networksthat differs in how data is handled and processed in the network. In classifica-tion problems one uses so called feedforward networks, in which the informationflows through the neural network from the input through intermediate compu-tations and finally returns an output. The neurons are often arranged in layersand by the width of a layer we refer to the number of neurons in that layer. Thedepth of the network is the number of layers. The neural network in Figure 2.1has e.g. depth 3, and the widest layer has width 4. The first layer is called in-put layer and analogously the final layer is called output layer. All intermediatelayers are called hidden layers. A neural network with at least one hidden layeris called a Deep Neural Network and the process of training such networks iscalled Deep Learning.

Figure 2.1: Neural network with one hidden layer

8

2.1 Feedforward networks

The goal of a feedforward network is to approximate some function f∗ basedon the training set of inputs x. In this report the problem is that of classifyingthe inputs to one of two categories, so the network defines a mapping y =f (x;θ) where θ is a vector of parameters which are adapted to find the bestapproximation of f∗. In this case, the target function is defined by

f∗(x(i)

)=

{0, x(i) ∈ A1, x(i) /∈ A

(2.1)

where A is the domain within the pre-defined classification boundary. Theclassification boundary in Figure 1.1 would e.g. correspond to the domain

A = {(x, y) ∈ R2 : (x− 0.5)2 + (y − 0.5)2 ≤ 0.42}. (2.2)

The network consists of layers of neurons, where each neuron is given an activationwhich is computed using the weighted output of the previous layer. Betweenthe neurons in layer i and i + 1 there are connections with associated weights,which are a part of the parameters that will be adapted during the learningprocess. One also adds a bias to the activation of each neuron, and these arethe second part of the parameters θ of the network.

If we denote the weight between neuron i in layer l + 1 and neuron j in layer lby wlij and the corresponding bias with bli, the calculation of the weighted inputto neuron i in layer l + 1 is given by

zl+1i =

∑j

wlijzlj + bli. (2.3)

The parameter vector is related to the weights and biases by

θ =(w1

11, w112, ..., w

D11, ..., w

Dmm, b

11, ..., b

Dn

)T(2.4)

where D is the depth of the network. By collecting the weights in a matrix W l

and the biases in a vector bl we can write the weighted input vector of layerl + 1 as

zl+1 = W lzl + bl. (2.5)

The final step of computing the activation is to apply an activation functionto the weighted input. In the problem encountered in this report we want theoutput to lie in [0, 1], so the activation function in the final layer will be thesigmoid

σ(x) =1

1 + e−x. (2.6)

The activation in all of the hidden layers, i.e. the layers between the input layerand the output layer will be the hyperbolic tangent

tanhx =ex − e−x

ex + e−x. (2.7)

9

Thus, the activation al in layer l will be

aD = σ(zD)

(2.8)

al = tanh(zl), l = 1, ..., D − 1, (2.9)

where D again is the depth of the network1.

2.2 Training of Neural Networks

The training of neural networks is based on adapting the parameters θ in such away that progress is made in terms of some measure of accuracy. In the problemof classification of points that we will study, a set of N training examples x(i)

with corresponding labels y(i) will be given. The network will then use this tocalculate an output y(i) ∈ [0, 1]. To measure the performance of the network oneuses a so called loss function which should achieve a high value if the networkis performing poorly, and a low value otherwise.

The natural loss function to use would be

L =1

N

N∑i=1

Li, (2.10)

where

Li =

{0, y(i) = y(i).

1, y(i) 6= y(i).(2.11)

As we will see however, this is not a good choice of loss function since it is notdifferentiable. The learning algorithms that we will use all include a calculationof the gradient of the loss function w.r.t. the parameters θ, so the function needsto be differentiable. We will instead use a so called surrogate loss function whichwill be motivated in section 2.4.2. It is given by [1]

L = − 1

N

N∑i=1

[y(i) log

(y(i))

+(

1− y(i))

log(

1− y(i))]. (2.12)

The goal of the learning process is then to minimize (2.12) with respect to theparameters θ.

1Here we mean applying the function element-wise on the vector, resulting in a new vector.

10

2.3 Backpropagation

As mentioned, all algorithms studied in this report requires a calculation ofthe gradient of the loss function. Since it is common to have a large amountof parameters in the neural network, it soon becomes infeasible to straight-forward calculate the gradient analytically. The Backpropagation Algorithm isan enormously popular algorithm for calculating the gradient and it is describedin Algorithm 1. A derivation of the algorithm can be found in the appendix,but the basic idea behind the algorithm is to use the chain rule of calculus

d

dx(f (g (x))) = f ′ (g (x)) g′ (x) . (2.13)

Since a neural network is a composition of functions f (i), one can use the chainrule to get the derivative of the loss function w.r.t. the parameters θ. This isexactly what the backpropagation does, but in a subtle manner.

Some notational notes: D is the depth of the network, the activation inlayer l is denoted al and the weighted input to layer l is denoted zl. Also, weintroduce the Hadamard Product � which is an operator on matrices or vectorsof the same shape, and is defined by{

(A�B)ij = (A)ij · (B)ij

(a� b)i = ai · bi(2.14)

where A and B are matrices while a and b are vectors. When letting somefunction operate on a vector we mean that it should be applied to each elementseparately, resulting in a new vector of the same shape.

Algorithm 1 Backpropagation [2]

Set a1 = xfor i = 2, ..., D dozl = W lal−1 + bl

end forLet δD = ∇yL� σ′

(zD)

for l = D − 1, ..., 2 do

δl =((W l+1

)Tδl+1

)� r′

(zl)

end for∂L∂bl

= δl

∂L∂W l = δl

(al−1

)T

11

The advantages that come with the Backpropagation algorithm are

(i) The calculation of the gradient requires only matrix multiplications. Thiscan be implemented efficiently in most programming languages.

(ii) The algorithm can be implemented so that we store the result from pre-vious layers. This gives a significant decrease in computational time.

2.4 Concepts from Probability Theory

To analyze and derive the behaviour of machine learning algorithms we needsome concepts from probability theory. The reason for this is that machinelearning always involves some uncertain and/or stochastic quantities. In thefollowing subsections we present some basic background theory for motivatingthe methods and results derived in this report.

2.4.1 The Loss Function

In the supervised learning case the goal is to minimize the loss function acrossthe data generating distribution p w.r.t. the parameters θ. This loss functioncan often, and always in this report, be written as an expectation value acrossthe data generating distribution p [1]

L (θ) = E(x(i),y(i))∼p

[Li

(f(x(i);θ

), y(i)

)]. (2.15)

In the classification problem f(x(i);θ

)is the predicted output of the network

and Li is the loss of each training example, i.e.

Li

(f(x(i);θ

), y(i)

)= −y(i) log

(f(x(i);θ

))−

−(

1− y(i))

log(

1− f(x(i);θ

))If the data generating distribution is known we have a pure optimization prob-lem, but in machine learning problems this is not the case. To be able to trainour model we have to convert the problem to an optimization problem we cansolve. A common approach is to replace the data generating distribution withthe empirical distribution p which is defined by the training set {(x(i), y(i)}Ni=1

[1]. One then minimizes the expectation over this distribution instead, so wereplace L in equation (2.15) by [1]

L (θ) = E(x(i),y(i))∼p

[Li

(f(x(i);θ

), y(i)

)]=

1

N

N∑i=1

Li

(f(x(i);θ

), y(i)

).

(2.16)To train the network on a classification problem we minimize (2.16) and hopethat doing so will also minimize (2.15). A lot of theoretical results have estab-lished conditions under which this will succeed [1], but this theory will not bepresented in this report.

12

2.4.2 Surrogate loss function

As mentioned in section 2.2 the 0-1 loss function is not differentiable which willcause a problem when using gradient based learning algorithms. In equation(2.12) we introduced the surrogate loss function for our classification problem.A motivation of the form of this function is provided in the following.

Let y(i) be the prediction of the neural network on the training example(x(i), y(i)

).

Since we use the sigmoid function in the output layer we know that y(i) ∈ [0, 1],so we can interpret the prediction as a conditional probability y = p

(y(i)|x(i);θ

)[1]. We can now formulate the learning problem as choosing a point estimateθ of the parameters θ s.t. y is as close to the true probability p as possible.The Maximum Likelihood Estimate of θ on the independent2 set of examples{x(i)}Ni=1 is defined as [1]

θ = arg maxθ

N∏i=1

[p(y(i)|x(i);θ

)]. (2.17)

The maximization of (2.17) is equivalent of maximizing the logarithm of theprobability, which gives [1]

θ = arg maxθ

N∑i=1

[log(p(y(i)|x(i);θ

))]. (2.18)

Dividing (2.18) byN does not change the maximum, so the problem is equivalentto

θ = arg maxθ

E [log (p (y|x;θ))] . (2.19)

In binary classification we only have two possible values, so determining theprobability P for one value determines the probability of the other as 1 − P .This finally gives

θ = arg maxθ

1

N

N∑i=1

[y(i) log

(y(i))

+(

1− y(i))

log(

1− y(i))]. (2.20)

Maximizing (2.20) is the same as minimizing (2.12).

2For independent stochastic variables X,Y : P (X = x,Y = y) = P (X = x)P (Y = y)[1].

13

Optimization Algorithms

Machine learning algorithms often1 rely on optimization of some objective func-tion, so the choice of optimization algorithm is a crucial part of the learningprocess. We will study two of the most common first order optimization algo-rithms frequently used in machine learning problems: Stochastic Gradient De-scent and ADAM. These algorithms will then be compared to two algorithmsof second order: the Conjugate Gradient Method and L-BFGS. A description ofthe mentioned algorithms are given in this section.

3.1 Batch Methods

The algorithms used to train neural networks requires computation of the gra-dient w.r.t. the parameters θ. If the number of training examples is extremelylarge it can result in a computational burden that is too heavy to handle. Dueto this issue, training algorithms that uses minibatches have become popular[2]. The idea is to simply divide the training set into batches of some size m,evaluate the gradient only at m examples at a time and update the parametersaccordingly. Most algorithms converges much faster, in the measure of totalcomputation time, when allowed to approximate the gradient rapidly on theminibatches than when calculating the exact gradient using the whole trainingset [1]. The optimal size of the minibatch depends on both the algorithm andthe problem, and we will try to determine a batch-size that is optimal for eachof the studied algorithms given our decision problem.

The selected algorithms presented in sections 3.2 and 3.3 all iteratively updatesthe parameters with an approximation of the gradient using the minibatches.We will refer to one such update as an iteration, and a single pass through thewhole training set as an epoch. In all of the studied algorithms, one iteratesuntil some stopping criterion is met. We have chosen to stop after a certainnumber of epochs for each of the presented algorithms.

1There are methods such as Support Vector Machines[1] that does not involve gradientbased optimization.

14

3.2 First Order Methods

Methods that only use the gradient or first derivative to search for the optimumare called first order methods. The most basic such algorithm is StochasticGradient Descent, which takes steps in the direction of the negative gradientof the objective function. This algorithm has shown, although it is simple, toperform reasonably well in machine learning problems [2]. A more sophisticatedfirst order method is the ADAM algorithm which is designed to be less sensitiveto noisy- and small gradients than SGD and hence we have chosen to study thisalgorithm as well [1].

3.2.1 Stochastic Gradient Descent

Gradient Descent is an algorithm for finding the minimum of an objective func-tion. It is based on the fact that the gradient of a function L always pointsin the direction of maximum increase, so by moving in the opposite directionof the gradient one can achieve an improvement in the value of the objectivefunction. In machine learning the objective function can often be written as asum of functions [1]

L =

n∑i=1

Li

E.g. in our case each term in the sum is the loss Li on each individual trainingexample. The Gradient Descent method would then calculate the gradient ofeach of the functions in the sum and evaluate all of these functions at the currentpoint [1]. When the sum involves a large number of terms, which correspondsto a large amount of data points in our case, the computational load of thismethod becomes too heavy. The Stochastic Gradient Descent algorithm solvesthis computational problem by simply choosing one of the functions in the sumrandomly. The new direction is then determined by the gradient of this one andonly function. However, using only one of the terms for the estimation of thegradient often results in unstable behaviour of the objective function [1]. It istherefore common to use a subset of functions from the sum, i.e. a minibatch,when estimating the gradient. The full minibatch-SGD algorithm is shown inAlgorithm 2.

An important parameter in SGD is the learning rate ε. This parameter de-termines the step size in each iteration; a large ε results in large steps and viceversa. If the step size is too large, we will get a very noisy behaviour of thealgorithm which can result in incapability of converging to a reasonable valueof the objective function [2]. If the step size is too small however, the algorithmmay get stuck in an area where the gradient is small, leading to the same failure.In practice it is common to let the learning rate be adaptive. A simple rule forcalculating the learning rate εk in iteration k is [1]

εk = (1− α)ε0 + αετ , α = k/τ, (3.1)

15

where ε0, ετ and τ are pre-defined parameters. The idea is to use (3.1) for calcu-lating the learning rate up until iteration τ , on each iteration coming thereafterthe learning rate is constant. There are also variants of SGD that uses linesearch to find the optimal step size ε∗ at each iteration. In this report wewanted the most basic optimization algorithm to compare with, so we havekept the learning rate constant throughout the training process.

Algorithm 2 Stochastic Gradient Descent

Initial point θ = θ0Learning rate εwhile Stopping criterion not met do

Choose a minibatch of examples {x(i), y(i)}mi=1 randomlyθ ← θ − ε

m

∑mi=1∇θLi

end while

Stochastic Gradient Descent has, although it is simple, turned out to be suc-cessful in many machine learning problems. There are some cases in which thealgorithm struggles. As mentioned, the algorithm can get stuck in areas wherethe gradient is small if the learning rate is also small [1]. Also, if the algorithmencounters an area of high curvature it can take a very large step, missing alot of information in the local area. These are some of the issues that the moresophisticated ADAM algorithm tries to overcome.

3.2.2 Moment

Since SGD does not save any information of the gradient, and calculates thegradient using only a few training examples, the step directions can oscillate alot between each iteration. Algorithms that use moment are designed to reducethis oscillatory behaviour [1].

The term moment in mathematics is built on the physical analogy but witha more abstract meaning. The moment of a function is a quantitative measureof its shape. In statistics, the first moment of a random variable is its expectedvalue and the second central moment is its variance. More formally, the n-thmoment of a random variable is defined by [4]

E [Xn] =

M∑i=1

xni P (X = xi) (3.2)

The central moment of a random variable is the moment of the random variableafter the expectation value has been subtracted, i.e. [4]

E [(X − E [X])n] =

M∑i=1

(xi − E [X])nP (X = xi) (3.3)

16

These quantities, mostly the first moment and the second central moment, areused in some optimization methods in deep learning and this is referred to as themethod of momentum. Instead of the standard update rule of the parametersthat is used in stochastic gradient descent, algorithms using the method ofmomentum use an update rule containing the moments of the objective functionin some sort. E.g. if one only uses the first moment, the update rule becomes

Initialize v0, θ0.

v ← αv − ε∇L (θn) .

θn ← θn−1 + v,

where L is the objective function, v is a exponentially decreasing average of pastgradients and α is a hyperparameter that determines the rate of the exponentialdecay.

3.2.3 ADAM

The ADAM algorithm, first presented in the article ADAM: A Method ForStochastic Optimization written by Diederik P.Kingma and Jimmy Lei Ba [5],is a widely used optimization method in machine learning [1]. The name ADAMcomes from the method used in the algorithm, namely adaptive moments. TheADAM algorithm is a first order optimization method for stochastic objectivefunctions which is based on adaptive estimates of the first- and second ordermoments. The update rule in the algorithm is based on exponentially decayingmoving averages of the gradient g and the squared gradient g� g. The movingaverages, denoted s and r, are calculated by

s = ρ1s+ (1− ρ1)g (3.4)

r = ρ2r + (1− ρ2)g � g (3.5)

and are in fact estimates of the first- and second order moments [5]. The al-gorithm introduces two new hyperparameters ρ1 and ρ2 which determines howfast the contribution of previous iterations contribute. The estimates s and ris initialized to zero and are therefore biased towards zero. To handle this bias,the authors behind the algorithm introduces the bias-correcting terms whichgives the bias-corrected estimates s and r. In their article, Kingma and Lei Bashow that these new estimates should be calculated by [5]

s =s

1− ρt1, (3.6)

r =r

1− ρt2, (3.7)

where t is the current iteration.

17

As mentioned, the ADAM algorithm is a very popular algorithm for trainingneural networks and has shown to be able to get good results on a wide varietyof problems [5]. For this reason we found it interesting to include this algorithmin our study of deep learning algorithms to get a good measure of the perfor-mance of the second order methods introduced in section 3.3.

The required parameters in the algorithm is the learning rate ε, and the twodecay rates ρ1 and ρ2. Also one needs some small constant δ to avoid division byzero through numerical errors. Values used for these parameters in this studywere ε = 0.001, ρ1 = 0.99, ρ2 = 0.999 and δ = 10−10, which are the valuesproposed in [5].

Algorithm 3 ADAM [5]

Initial point θ = θ0s← 0r ← 0t← 0while Stopping criterion not met do

Choose a minibatch of examples {x(i), y(i)}mi=1 randomlyg ← 1

m∇θ∑mi=1 Li

t← t+ 1s← ρ1s+ (1− ρ1) gr ← ρ2r + (1− ρ2) g � gs← s

1−ρt1r ← r

1−ρt2θ ← θ − ε s√

r+δ

end while

Note: by division of two vectors we mean the element-wise division. Analo-gously, when taking the square root of a vector we mean the element-wise squareroot.

18

3.3 Second Order Methods

In addition to using the gradient for finding the optimum, second order methodsuses the Hessian matrix H2 or some approximation of it to account for the cur-vature of the objective function. An algorithm that works when the objectivefunction is quadratic or convex3 is Newton’s method, but since neural networksoften are non-convex convergence is not guaranteed. Also, Newton’s methodinvolves the exact calculation of the Hessian which can be extremely heavy tocompute. If the function f depends on n variables, the Hessian will have dimen-sion n×n. In deep learning we often have millions of parameters which makes itinfeasible to calculate the Hessian exactly both in terms of computational timeand memory load. Thus, Newton’s Method is not an appropriate algorithm forneural networks. We will instead study the Conjugate Gradient Method andthe L-BFGS algorithm, both of which are second-order methods.

3.3.1 Newton’s Method

The second-order Taylor series of the objective function f around some pointx0 is given by

f (x) ≈ f (x0) + (x− x0)∇f (x0) +1

2(x− x0)

TH (x0) (x− x0) . (3.8)

Solving for the critical point of f gives [1]

x∗ = x0 −H−1∇f (x0) . (3.9)

This procedure can be performed iteratively to move towards the optimum,but if the objective function is non-convex it is not guaranteed to converge.The objective function of neural networks is usually non-convex which causesproblems for this method. Considering the problem of non-convexity and theheavy computational load, Newton’s method is a poor choice of optimizationalgorithm for neural networks. However, some of the benefits of consideringhigher derivatives could improve the optimization process and we will thereforestudy two of the methods that tries to provide the benefits of the Hessian withouthaving to calculate it explicitly. The general algorithm for Newton’s methodare shown in Algorithm 4.

Algorithm 4 Newton’s Method

Initial point x0

while Stopping criterion not met dog ← ∇fH ← ∇2fx← x−H−1g

end while

2(H(f))ij = ∂2f∂xi∂xj

.3f : X→ R is convex if f (tx1 + (1− t)x2) ≤ tf(x1) + (1− t)f(x2) ∀ t ∈ [0, 1], x1, x2 ∈ X.

19

3.3.2 Conjugate Gradient Method

A big issue with second order optimization algorithms is the computation of theHessian or its inverse because of the computational complexity, the memory us-age and the ill-conditioning of the matrix (see section 3.4 for more details aboutill-conditioning). Conjugate gradients is a second order method that avoids thecalculation of the Hessian by a strategy of following so called conjugate direc-tions [1]. In some versions of the gradient descent algorithm, one perform aline search to find the step size ε in the direction of the negative gradient. Thisresults in an optimal step size in each iteration which could lead to a fasterconvergence towards an optimum. The problem with this version of gradientdescent is that subsequent search directions becomes orthogonal to each other[1].

Let gt be the gradient at the beginning of iteration t, i.e. gt = ∇L (θt). Whenusing line search (see appendix for a more detailed description of line search) weseek to minimize the objective function in the direction of the negative gradient−gt. As the line search terminates and the optimum is found, the derivative inthis direction will be zero, i.e.

− gt · ∇L = 0 (3.10)

But the search direction in iteration t+ 1 is precisely the gradient at the pointwhere the line search in iteration t terminated. Thus, gt · gt+1 = 0, i.e. sub-sequent search directions are orthogonal. Because of the orthogonality of sub-sequent search directions, the minimum in previous gradient directions are notpreserved and thus the algorithm has to re-minimize the objective function inprevious directions. This is the problem that the conjugate gradient method isdesigned to solve.

The key idea of the algorithm is to choose a search direction that does notundo the progress made in previous directions. This is achieved by choosing thesearch direction at iteration k to be [1]

dt = −∇L+ βtdt−1. (3.11)

The parameter βt determines how much of the previous direction we add to thegradient. This parameter should be chosen so that the directions are conjugate,i.e. dTt Hdt−1 = 0. There are two popular methods for calculating an appropri-ate value of βt: the Fletcher-Reeves method and the Polak-Ribiere method [1].In this report we will be using the latter, which is given by [1]

βt =(∇L (θt)−∇L (θt−1))

T ∇L (θt)

∇L (θt−1)T ∇L (θt−1)

. (3.12)

The final step in the algorithm is to perform a line search to find the step sizeε.

20

Algorithm 5 The Conjugate Gradient Method

Initial point θ = θ0ρ0 ← 0g0 ← 0t← 1while Stopping criterion not met do

Choose a minibatch of examples {x(i), y(i)}mi=1 randomlygt ← 1

m∇θ∑mi=1 Li

Compute βt by Polak-Ribiere methodρt ← −gt + βtρt−1Line search: ε∗ ← arg minε

1m

∑mi=1 Li (θt + ερt)

θt+1 ← θt + ε∗ρtend while

3.3.3 Broyden-Fletcher-Goldfarb-Shanno Algorithm

Algorithms that seeks to achieve some of the advantages of Newtons Methodwithout the computational complexity of the Hessian are called Quasi-NewtonMethods. The main idea is to somehow approximate the Hessian in a goodway to avoid explicitly calculating all the second derivatives of the objectivefunction. Quasi-Newton methods are less common (in machine learning) thanthe first order methods mentioned at the beginning of this section because ofthe higher computational burden on second order methods. However, we willstudy one algorithm that lies in this family of algorithms: the Broyden-Fletcher-Goldfarb-Shanno algorithm [1].

Recall that the update rule in Newton’s method involves the calculation and in-version of the Hessian H. The approach of the BFGS algorithm is to iterativelycompute a matrix M t that approximates the inverse of the true Hessian of theobjective function. The matrix M t should satisfy the Quasi-Newton condition[6]

M t+1 (θt+1 − θt) = ∇L (θt+1)−∇L (θt) . (3.13)

In the BFGS algorithm the approximation of the Hessian at stage t is given bythe update rule [1]

M t+1 = M t +yty

Tt

yTt st−M tsts

Tt M

Tt

sTt M tst(3.14)

where yt = ∇f (θt+1)−∇f (θt) and st = θt+1 − θt. The full BFGS algorithmis shown in Algorithm 6.

As mentioned, the Hessian of the loss function of a neural network can beextremely large. Thus, calculating it means a huge computational load. Butnot only do we have to calculate all the second order derivatives, we also haveto invert it to apply an update rule such as the one in Newton’s method. With

21

this in consideration, one can readily realise that the BFGS algorithm can savea lot of time by instead approximating the inverse of the Hessian.

Algorithm 6 BFGS

Initial point θ0Initial guess M0

Learning rate εwhile Stopping criterion not met do

Solve: M tpt = −∇L (θt)ρt ← −gt + βtρt−1st ← ερtθt+1 ← θt + styt ← ∇L (θt+1)−∇L (θt)

M t+1 ←M t +yty

Tt

yTt st− Mtsts

Tt M

Tt

sTt Mtst

end while

3.3.4 L-BFGS

In the BFGS algorithm one has to store the previous matrix M t for the nextiteration which can be a huge restriction on the problems approachable by thealgorithm. When training neural networks one often has millions of weightsand biases which makes it infeasible to store the matrix between each itera-tion. The Limited-Memory Broyden-Fletcher-Goldfarb-Shanno algorithm ap-proximates the BFGS algorithm while requiring less computer memory whichmakes it an interesting algorithm for neural networks [1]. This is the version ofthe BFGS algorithm that we will study in this report.

As in the BFGS algorithm we define st = θt+1 − θt and yt = ∇L (θt+1) −∇L (θt), and we assume that the last m updates of these are stored. We thenstart with a initial approximate of the Hessian at iteration t by H0

t = γtI

where I is the identity matrix and γt =sTt−1yt−1

yTt−1yt−1

[1]. Also, define a sequence of

m scalars ρi as [1]

ρi =1

yTi si(3.15)

for i = k −m, ..., k. The algorithm is shown in Algorithm 7. The big improve-ment from the BFGS algorithm is that we only need to store the m last updatesof st and yt instead of storing the full matrix approximation. A disadvantagewith this approach is that the approximation of the inverse of the Hessian be-comes more inaccurate, but the hope is that updating the parameters moreoften will compensate for the added inaccuracy.

22

Algorithm 7 L-BFGS [9]

Initial point θ0t← 1while Stopping criterion not met doq ← ∇L (θt)for i = k − 1, ..., k −m doαi ← ρis

Ti q

q ← q − αiyiend forCompute γtH0

t ← γtIz ←H0

tqfor i = k −m, ..., k − 1 doβi ← ρiy

Ti z

z ← (αi − βi) siend forθt+1 ← θt − εzt← t+ 1

end while

Although it is possible to use minibatches in the L-BFGS algorithm, we en-countered some problems when doing so. The algorithm was very sensitive tochanges in the hyperparameters as well as the input data. Also, the methodhad problems converging to a reasonably low value of the objective function.Because of these issues we decided to use the whole data set in the L-BFGS, incontrast to the other methods where we used minibatches.

Since second order method in general, and L-BFGS in particular, has not beenexamined as much as first-order methods it is quite hard to determine whatreasons lie behind the trouble of minibatching in the L-BFGS algorithm. Onecould suspect that it is necessary to use the whole- or a big part of the dataset to get a reasonable approximation of the Hessian [6]. For a more thoroughstudy of minibatch methods for the L-BFGS algorithm we refer to the article AProgressive Batching L-BFGS Method for Machine Learning by Bollapragadaet al. that can be found in the references of this report [6].

3.4 Issues in Optimization of Neural Networks

When attempting to find a global minimum for the loss function the optimiza-tion algorithm often fails to find it. This is due to several problems, someof which we will go over shortly. Firstly, we’ll cover an issue caused by a ill-conditioned Hessian matrix. If a function is ill-conditioned it means that it issensitive to the input, i.e. small errors can heavily affect the result. The condi-tion number of a matrix is the fraction λi

λj, λi is the largest eigenvalue and λj is

23

the smallest eigenvalue [1]. If the condition number is large it means that thematrix is ill-conditioned.

The first-order methods don’t use the Hessian explicitly but the Taylor ex-pansion of the gradient descent step −εg is 1

2ε2gTHg − εgTg [1]. So if the

Hessian is ill-conditioned the gradient step becomes sensitive to errors. Thismeans that 1

2ε2gTHg can exceed −εgTg, making the step small even though

the gradient is large [1]. Quasi-Newton methods calculate an approximation of−H−1g so an ill-conditioned Hessian will affect the resulting step. Because wedon’t calculate the Hessian we can’t calculate the condition number.

The optimization algorithms can get stuck in a local minimum with a largeloss, however, according to Yann N. Dauphin [7], we know that local minimaare often close to the global minima. Also, the ratio of saddle points to lo-cal minima rises exponentially with the loss function’s dimension [7] so saddlepoints are a much bigger problem. Saddle points are points with a zero gradientwhich aren’t local minimums or maximums. While gradient descent is designedto move downhill second-order methods find points with zero gradients, thussaddle points are a big problem for second-order methods. Therefore we expectsecond-order methods might perform worse when there are more parameters.

A common problem with machine learning is overfitting. For neural networks itmeans that the the network has learnt too much from the noise in the data andthis knowledge doesn’t apply to new test data. This problem arises because thenetwork only has access to a finite training set instead of the the data generatingdistribution [1].

24

Method

The goal of the results is to allow us to draw meaningful conclusions about theperformance of the different algorithms. The algorithms were tested on threedifferent networks, a shallow network with 1 hidden layer and 20 neurons, awide network with 1 hidden layer and 100 neurons and lastly a network with 3hidden layers each with 20 neurons.

Also, two different classification problems were used, one in 2 dimensions andone in 4 dimensions. The 2-dimensional problem was tested on all networkstructures and the 4-dimensional problem was tested on the deep network.In the appendix we present the hyperparameters used when optimizing eachnetwork structure.

4.1 The Classification Problem

As was mentioned earlier, the algorithms will be solving a classification problemwith hyperspheres acting as boundaries. To make the problem more complex,in each problem there will be 2D hyperspheres, D is the dimension of the input.The radius is selected so that the volume of the spheres is half that of the totalspace [0, 1]D so that half of the data is classified 1 and the other 0. We choose2D hyperspheres so that there will be space between the hyperspheres for alldimensions. Figure 4.1 shows the result of a trained neural network assigned tosolve the 2-dimensional classification problem.

25

Figure 4.1: Example of the 2-dimensional classification problem, the red andblue color represents different classifications

4.2 Hyperparameter Optimization

To fairly compare the optimization algorithms the first step is to find the op-timal hyperparameters for each algorithm. For each hyperparameter we testeddifferent values to find the one that causes the best performance. For eachvalue, the network was optimized at least 5 times. Then the optimal value wasfound by comparing the average loss over time results. The hyperparametersoptimized were learning rate and batch size whose optimal value was affectedby the network structure.

Then there are hyperparameters that aren’t unique to any of the algorithms,the structure of the network, the initial size of the parameters in the network,and the size of the training data. The structures were selected so that we couldanalyse the effect of adding more layers and increasing the number of neurons.The initial weights and biases in the network were generated by a normal dis-tribution with mean 0. The initial size parameter set the standard deviation ofthis distribution, which affected the absolute size of the parameters. The initialsize was selected so that all algorithms performed reasonably when optimizingthe network. We used a large set of training data so that the possibility ofoverfitting was reduced.

4.3 Analysis Method

When comparing the algorithms to each other we measured the loss over epochand the loss over time. It’s important to note that the loss over time resultsare heavily affected by the time complexity of our implementation. An epoch

26

is when the algorithm has trained using all of the data once.

Also, in order to account for the effect of overfitting we measured loss of eachnetwork on new test data. This loss is called the real loss and is the fraction ofmissclassified data points, from a sample of 10000.

The algorithms’ performance varied for different realisations, to account forthis we took the average loss from 10 or 5 different realisations. The trainingdata was randomly generated from an equal distribution in the classificationspace. The initial parameters in the network were also randomly generated, butfrom a normal distribution with mean 0. If an algorithm failed to decrease theloss below 0.4 we would discard that result from the average and try again, butrecord how many times the algorithm failed.

27

Results

5.1 Optimal Batch Size Hyperparameter

Not all hyperparameter optimization results are presented in this report, as tonot overwhelm the reader. Only the optimal batch size for all algorithm on theshallow network as well as the optimal batch size for the first order methodson the deep network solving the 4-dimensional classification problem. This isbecause the optimal batch size remained mostly constant.

The following figures, 5.1 to 5.4, show the average loss from 5 samples of train-ing a shallow neural network on the 2-dimensional classification problem withdifferent batch sizes. The number of data points used for the 2-dimensionalclassification problem was always 2048.

The results from Figure 5.1 show how the batch size affects the result for SGDtasked to solve the 2-dimensional classification problem on a shallow network.It seems that batch size 64 was the most efficient, but only slightly. In Figure5.2 we see how different batch sizes affected the performance of the ADAM al-gorithm on the same network. Batch size 128 seems marginally better than theother sizes.

For conjugate gradient the optimal batch sizes are larger than for the first-ordermethods, see Figure 5.3. Each minibatch was run 3 times, this gives the mostconsistent result according to Jiquan Ngiam [8]. This is because the algorithmuses information from previous iterations, that information would be inaccurateif used for a different batch. From these results, it seems that a batch size of256 performs a little better than the rest.

L-BFGS will sometimes fail to find a minimum and instead the loss will gotowards a maximum. Because of this it’s difficult to obtain an average of theloss. In Figure 5.4 is the result of one run. We found that not using minibatchgave a much more consistent result, therefore we used L-BFGS without a mini-batch. However, it is possible to implement minibatch for L-BFGS [6] [8].

28

Figure 5.1: SGD average loss on ashallow network with different batchsizes, solving the 2-dimensional problem

Figure 5.2: ADAM average loss on ashallow network with different batchsizes, solving the 2-dimensional problem

Figure 5.3: CG average loss on a shal-low network with different batch sizes,solving the 2-dimensional problem

Figure 5.4: L-BFGS loss on a shallownetwork with different batch sizes, solv-ing the 2-dimensional problem

Figures 5.5 and 5.6 show how the first order methods performance is affected bythe batch size when solving the 4-dimensional classification problem on a deepnetwork. The average loss is created from 10 samples in this case. For SGD weused batch size 128, even though batch size 64 is faster it doesn’t reduce the lossas much. When ADAM was used to solve the 4-dimensional problem it usedbatch size 256, because it was fast and it reached a lower loss than other batchsizes.

29

Figure 5.5: SGD average loss on a deepnetwork with different batch sizes, solv-ing the 4-dimensional problem

Figure 5.6: ADAM average loss on adeep network with different batch sizes,solving the 4-dimensional problem

5.2 Shallow Network Algorithm Comparison

Figures 5.7-5.8 show the average loss from 10 different runs for each of thealgorithms, tasked to solve the classification problem in 2 dimensions. L-BFGSwas the only algorithm which in some cases failed to decrease the loss functionsignificantly, 9 out of 19 times it failed. The specific hyperparameters used arein the appendix Figure 7.1.

SGD ADAM CG L-BFGSReal loss 0.02242 0.01985 0.02582 0.07608

Figure 5.9: Average real loss from shallow network, solving the 2-dimensionalclassification problem

5.3 Wide Network Algorithm Comparison

Figures 5.10-5.11 show the average loss for each methods trained on the widenetwork. The L-BFGS method failed to decrease the loss 20 out of 30 times.The hyperparameters used are in Figure 7.2 in the appendix.

30

Figure 5.7: Shallow network aver-age loss over epoch, solving the 2-dimensional classification problem

Figure 5.8: Shallow network aver-age loss over seconds, solving the 2-dimensional classification problem

Figure 5.10: Wide network average lossover epoch, solving the 2-dimensionalclassification problem

Figure 5.11: Wide network average lossover seconds, solving the 2-dimensionalclassification problem


Figure 5.12: Average real loss from wide network, solving the 2-dimensionalclassification problem

31

5.4 Deep Network Algorithm Comparison

Two different problems were solved using the deep network, classification in 2,and 4 dimensions. The average loss, over 10 samples, from the 2-dimensionalproblem is shown in Figure 5.13 and 5.14. L-BFGS failed to decrease the lossfunction significantly 39 out of 49 times. The hyperparameters used are in Fig-ure 7.3 and 7.4 in the appendix.

Figure 5.13: Deep network average lossover epoch, solving the 2-dimensionalclassification problem

Figure 5.14: Deep network average lossover seconds, solving the 2-dimensionalclassification problem


Figure 5.15: Average real loss from deep network, solving the 2-dimensionalclassification problem

Figures 5.16 and 5.17 show the average loss from 5 samples of the deep net-work solving the 4-dimensional classification problem. In 4 dimensions thereare 24 = 16 hyperspheres acting as boundaries. For the first-order algorithmswe increased the batch size, to 128 for SGD and 256 for ADAM because thisincreased their performance. The number of data points was increased from2048 to 8192. Again L-BFGS was the only algorithm that failed to always de-crease the loss function significantly, this time it was 15 out of 20 runs. Thehyperparameters used when training the deep network are shown in Figures 6.3and 6.4.

32

Figure 5.16: Deep network average lossover epoch, solving the 4-dimensionalclassification problem.

Figure 5.17: Deep network average lossover seconds, solving the 4-dimensionalclassification problem.


Figure 5.18: Average real loss from deep network, solving the 4-dimensionalclassification problem

33

5.5 L-BFGS Failure Examples

Figure 5.19 and 5.20 show examples of when L-BFGS fails to reduce the networkloss.

Figure 5.19: Example of L-BFGS get-ting stuck when optimizing the shallownetwork.

Figure 5.20: Example of L-BFGS di-verging when optimizing the shallownetwork.

5.6 Topics for Further Investigation

One hyperparameter we didn’t investigate fully was the initial absolute sizeof the parameters in the network. We found that changing the size affectedsome algorithms positively and others negatively, but a full investigation of thiseffect was outside the scope of this report. In Figure 5.22 the initial size pa-rameters is 0.5 and in Figure 5.21 it’s 1. This is a topic for further investigation.

Another topic for further investigation is how the choice of activator functionaffects the performance of the different algorithms. We wanted to compare tanhto rectify, defined by

r(x) = max (0, x) . (5.1)

Even though rectify is often used in modern neural networks, [1] p. 226, itsurprisingly performed much worse than tanh on our neural network. Becauserectify performed so poorly we choose to only use tanh. Figure 5.23 shows howthe shallow network performed using the rectify activation function.

34

Figure 5.21: Deep network average losssolving the 2-dimensional classificationproblem with initial size = 1

Figure 5.22: Deep network average losssolving the 2-dimensional classificationproblem with initial size = 0.5

Figure 5.23: Shallow network with rec-tify as activator function average lossover epoch, solving the 2-dimensionalclassification problem

Figure 5.24: Shallow network with tanhas activator function average loss overepoch, solving the 2-dimensional classi-fication problem

35

Analysis & Discussion

6.1 Hyperparameters

This section analyses the optimal batch size for the different algorithms. As ex-pected, both first-order methods have a small optimal batch size, Figure 5.1, 5.2,gradient descent is almost always used with minibatch, [1] p. 279, in practiceand our results show that the performance decreases as the batch size increases.The standard error of the expected value is given by σ√

n[1] p. 278, n is the batch

size, it means that increasing the batch size yields diminishing returns becausethe time complexity of calculating the gradient is at least linearly dependent onthe batch size.

When the dimension of the classification problem increased the optimal batchsize for the first-order methods increased, Figure 5.5 5.6. As previously men-tioned, an increased batch size means that each minibatch is a more accuraterepresentation of the classification problem. The classification problem in 4 di-mensions is a lot more complex for two reasons, the larger volume of the spaceand the increase in the number of circles 4 −→ 16. Because of this, it’s morevaluable to train on an accurate representation of the classification problem.That’s why the size of the optimal batch size increased slightly.

Conjugate gradient uses minibatch with a slight modification, each batch is usedthree times before a new batch is selected and the gradient history is discarded.This works well and the batch size is almost as small as for the first-order meth-ods, even though second-order methods typically use much larger batch sizes p.279 [1], Figure 5.3. This is because conjugate gradient doesn’t attempt to usethe hessian matrix, but it still uses information from second-order derivatives.

L-BFGS on the other hand creates an approximation of the hessian matrix andthose methods typically need a batch size of 10,000 according to Deep Learningp. 279 [1]. In this implementation minibatch of any size didn’t work as wellas using the complete data set, even when the algorithm was trained on eachbatch 20 times, which should make minibatch perform well [8].

36

The learning rate had to be adjusted each time the network structure was al-tered. The learning rate was optimized the same way batch size was optimized.When the number of parameters increases the learning rate decreased. The val-ues used are in the appendix 7.4. Conjugate gradient used linesearch instead ofa fixed learning rate in this implementation, which didn’t need any adjustments.

6.2 Network Structure

By analysing the performance of the optimization methods on the different net-works we can draw some conclusions. First of all, the more parameters andhidden layers a network has the longer it takes to train the network on oneepoch and the algorithms might require more iterations to minimize the loss.But more layers and neurons allow the network to solve more complex problemsso it’s important to use a network that can solve your problem without takingtoo much time.

For the 2-dimensional classification problem it seems that the shallow networkwas best suited for solving the problem compared to the wide and deep network.Most of the methods were slower when using the wide network, both in termsof epochs and time. All methods performed approximately the same on thedeep network except for the quasi-Newton method L-BFGS which performedmuch worse. In the deep network there are 921 parameters compared to 81 inthe shallow network. In theory, this means there are much more saddle pointscompared to minimums [7], which would negatively affect the performance ofsecond-order methods.

The 4-dimensional problem is a lot more complex so the deep network wasthe only structure able to solve the problem. All of the algorithms needed moreepochs and time to decrease the loss significantly. This is in part due to theslightly increased parameter dimension but mostly due to more training dataand the fact that the problem requires a much more complex solution.

6.3 Algorithm Performance Analysis

First we’ll discuss the first-order methods. ADAM outperformed SGD both interms of epochs and seconds for all network structures. Our implementation isthe most basic version of SGD so this was expected, and other researcher haveshown ADAM to outperform SGD [5]. But ADAM did become more unstablethan SGD in the larger networks, this is probably due to our implementation.

Conjugate gradient actually performed comparably to the first-order methods.Comparing the loss to epochs CG performed better than the first-order methodswhen solving the 4-dimensional problem. But CG had much worse time com-

37

plexity than the other algorithms, this is most likely due to our implementationof line search. Using an implementation with a better line-search algorithm CGmight outperform SGD, other researcher have shown that’s possible [8].

As the results show, the L-BFGS algorithm often fails to converge to an accept-able value of the loss function. Recall that one problem with Newton’s Methodwas the non-guaranteed convergence to a minimum for non-convex functions.This problem is also present in the L-BFGS algorithm and is probably a bigreason for convergence issues. Also, since we are using a matrix it is possiblethat the ill-conditioning of this matrix contributes to the incapability of con-verging to reasonably low values of the objective function. But, because wedon’t calculate the Hessian explicitly we can’t measure the condition number.

Figure 5.19 shows an instance of the algorithm when it fails to converge. Inthis case it is probably the non-convexity of the loss function that causes theproblem, since it clearly gets stuck in a region where the value of the loss func-tion is still relatively high which could be a local minimum/maximum or asaddle point. Figure 5.20 is probably an illustration of the second problem, theill-conditioning of the matrix. Ill-conditioning causes the algorithm to be verysensitive to changes in the input, and in Figure 5.20 the loss function increasesabruptly between two successive iterations. This might be due to a large errorin the computation caused by an ill-conditioned Hessian.

When comparing the performance of L-BFGS to the other algorithms it’s im-portant to remember that the average loss is calculated from a fraction ofbest-performing optimizations. Still it didn’t perform better than the otheralgorithms. It was slower and more unstable, especially when there were moreparameters. It did seem to reach a lower loss when optimizing the wide networkbut that seems to be because of overfitting.

It might be that L-BFGS performs significantly better with a successful im-plementation of minibatch [8] [6]. This would increase the number of iterationsin an epoch and each iteration would have a lower computational load. How-ever, L-BFGS builds on information from the previous iterations, 20 in this case,and when switching batch the algorithms needs to discards that information be-cause it would be inaccurate for a different batch. Discarding this informationdecreases the accuracy of proceeding iterations. This is probably why our imple-mentation didn’t work with minibatch. Our implementation of L-BFGS mighthave performed better with linesearch instead of a fixed learning rate, manyimplementations use linesearch [9]. This is a topic for further investigation.

38

6.4 Conclusion

This report considered one type of classification problem but neural networksare applicable to a lot more problems. Also considering we used our implemen-tation of these algorithms the results from this report can’t be generalized toall neural networks.

It’s clear from the results that ADAM performed better than the other al-gorithms and the code for ADAM was much easier to implement comparedto the second-order methods. It’s common practice to use first-order methodswhen training neural networks and these results show why that is. Even thoughthere are examples of second-order methods performing better than first-ordermethods [8], first-order methods perform well and are consistent.

39

Bibliography

[1] Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning. MITPress, 2016-11-18.

[2] Stephen Marsland. Machine Learning: An Algorithmic Perspective. CRCPress, 2014-10-08.

[3] B.Karlik, A. Vehbi Olgac. Performance Analysis of Various Activation Func-tions in Generalized MLP Architectures of Neural Networks. InternationalJournal of Artificial Intelligence and Expert Systems, 2011.

[4] L.Rade, B.Westergren. Mathematics Handbook for Science and Engineering.Studentlitteratur AB, 2019-08-12

[5] Diederik P.Kingma, Jimmy Lei Ba. ADAM: A Method For Stochastic Opti-mization. International Conference on Learning Representations, 2015.

[6] Raghu Bollapragada et al. A Progressive Batching L-BFGS Method for Ma-chine Learning. https://arxiv.org/, 2018. Northwestern University.

[7] Yann N. Dauphin, Razvan Pascanu. Identifying and attacking the saddlepoint problem, 2014-06-10

[8] Quoc V. Le, Jiquan Ngiam On Optimization Methods for Deep LearningStanford University. 2011

[9] Jorge Nocedal Updating Quasi-Newton Matrices with Limited Storage Amer-ican Mathematical Society, 2018-05-30.

[10] P.Wolfe. Convergence Conditions for Ascent Methods. SIAM Review, 1969-04-02.

40

https://arxiv.org/abs/1802.05374

Appendix

7.1 Backpropagation

The purpose of this section is to give a motivation to the backpropagataionalgorithm. As described in the report, we denote the activation in layer l withal and the weighted input to layer l with zl. The depth of the network isdenoted with D. We start by defining the error in the ith neuron in layer l by[1]

δli =∂L

∂zli. (7.1)

To see how this quantity is related to the error, assume that we change the valueof zli by an amount ∆zli. The value of the loss function will then change by anamount of ∂L

∂zli∆zli = δli∆z

li [1]. If δli is large, the neuron is far from its optimum

and the change in loss function will be relatively large. If however, δli is smallthe neuron is near its optimum and the loss function will remain unchanged. Soδli is a measure of how near a neuron is from being at its optimum.

The error in the output layer δDi can easily be calculated with the chain rule ofcalculus

δDi =∂L

∂zli=∑k

∂L

∂aDk

∂aDk∂zDi

=∂L

∂aDi· σ′(zDi)

(7.2)

since the activation aDk = σ(zDk)

only depends on zDi when i = k. Using (7.2)we can then propagate backwards through the network to calculate the otherδli. These can be calculated in a similar manner as the output error [1]

δli =∂L

∂zli=∑k

∂L

∂zl+1k︸︷︷︸

= δl+1k

∂zl+1k

∂zli=∑k

δl+1k

∂

∂zli

∑j

wl+1kj tanh

(zlj)

+ bl+1j

=

(7.3)

=∑k

δl+1k wl+1

ki tanh′(zli)

=(

(W l+1)δl+1 � tanh′(zl))i. (7.4)

41

So if we start by calculating the error in the output layer we can then use (7.4)to calculate the error in all other layers. This is the reason why the algorithmis called ”backpropagation”, we propagate backwards through the network. Wenow relate the errors to the derivatives w.r.t. the weights and biases which isthe quantities that we compute the derivatives with respect to. The derivativew.r.t. the biases is straight-forward:

∂L

∂bli=∑k

∂L

∂zlk

∂zlk∂bli︸︷︷︸δik

=∂L

∂zli= δli. (7.5)

Here, δij is the Kronecker delta. The derivative w.r.t. the weights is also quitestraight-forward to derive:

∂L

∂wlij=∑k

∂L

∂zlk︸︷︷︸δlk

∂zlk∂wlij

=∑k

∂L

∂zlk

∂

∂wlij

(∑m

wlkmal−1m + blk

)= (7.6)

=∑k

∑m

δikδjmδlkal−1m = δlia

l−1j . (7.7)

Putting all of the derived results in vector notation we get [1]δD = ∇aDL� σ′

(zD)

δl =(W l+1

)Tδl+1 � tanh′

(zl) (7.8)

{∂L∂bl

= δl

∂L∂W l = δl ·

(al−1

)T (7.9)

Using (7.8)-(7.9) one can determine the gradient w.r.t. the parameters, whichis exactly what the backpropagation algorithm in Algorithm 1 does.

42

7.2 Line Search

In the Conjugate Gradient Method we use a so called Line Search to find thestep length ε∗. The procedure that is performed is simply a univariate opti-mization of some function h (ε) = f (x+ εp) where p is the search direction. Ifpossible, we solve h′ (ε) = 0 analytically but more often one settles for a valueof ε that decreases h with some fixed value. The former method is called ExactLine Search while the latter is called Inexact Line Search.

If the objective function is not quadratic, which is the case in our problems,we have to perform an Inexact Line Search to approximate the optimal stepsize. One method for doing so, and the method that has been used in thisreport, uses the Wolfe Conditions. The step size ε is said to satisfy the WolfeConditions if [10]

(i) f (x+ εp) ≤ f (x) + c1εpT∇f (x)

(ii) −pT∇f (x+ εp) ≤ −c2pT∇f (x)

where c1 and c2 are some constants between 0 and 1.

7.3 Source Code

The optimization methods studied in this report have all been implemented in-dependently by the authors and the source code can be found at

https://github.com/lkarlason/Optimization-NeuralNetworks.

Note that the main focus of this study has not been on the implementationof the optimization methods, and the authors are aware of the existance offlaws in the code.

7.4 Hyperparameters used

In this section we’ll present the hyperparameters used when optimizing eachnetwork. Some parameters are lists because different optimization algorithmsuse different parameters. SGD uses parameters in the first index, ADAM usesthe second, CG uses the third and L-BFGS uses the fourth index. The valueof the initial parameter is the standard deviation of a normal distribution withmean 0 used to set the initial weights and biases in the networks.

43

https://github.com/lkarlason/Optimization-NeuralNetworks

""" Hyperparameters """

""" SGD, ADAM, CG, L_BFGS"""

N_train = 2048 # number of training observations

iterations = [1000, 1000, 1000, 1000]

learn_rate = [0.9, 0.07, 1, 0.05]

batch_size = [64, 128, 256, N_train]

reps = [1, 1, 3, 1] # how many times the algorithm iterates

# over one batch before switching batch

features = 2 # input dimension

layer_width = [20, 1] # width of layers

N_layers = len(layer_width) # number of hidden layers -1

random_seed = 1337

N_seeds = 10 # number of samples used when calculating average

initial = 1 # the size of the initial parameters in network

Figure 7.1: Hyperparameters used for the deep network




iterations = [1500, 1500, 1500, 1500]

learn_rate = [0.7, 0.05, 1, 0.03]





layer_width = [100, 1] # width of layers


random_seed = 1337


initial = 0.5 # the size of the initial parameters in network

Figure 7.2: Hyperparameters used for the wide network

44




iterations = [1000, 1000, 1000, 1000]

learn_rate = [0.05, 0.004, 1, 0.003]





layer_width = [20, 20, 20, 1] # width of layers


random_seed = 1337



Figure 7.3: Hyperparameters used for the deep network, solving the 2-dimensional problem




iterations = [3000, 3000, 3000, 3500]

learn_rate = [0.05, 0.004, 1, 0.003]





layer_width = [20, 20, 20, 1] # width of layers


random_seed = 1337



Figure 7.4: Hyperparameters used for the deep network, solving the 4-dimensional problem

45

www.kth.se

A Comparison of Selected Optimization Methods for Neural ...

Documents

Transcript of A Comparison of Selected Optimization Methods for Neural ...