Neural Networks - Backpropagation

Source: cogsci.ucsd.edu/~desa/backpropslides.pdf


Page 1: Neural Networks - Backpropagation

We have looked at parametric learning — where we assume a parametric form (e.g. Gaussian, mixture of Gaussians), and then try to learn the best parameters (or distribution over parameters) to fit the data.

Another approach is non-parametric learning, where you try to fit a model whose structure is not fixed, but may grow in complexity to fit the data.

Neural networks are a non-parametric learning method. Feedforward neural networks learn a mapping from input to output where each "artificial neuron" is a non-linearly squashed weighted sum of the outputs from the layer below. The function that transforms the weighted summed input to the output is called the activation function.

Page 2: One-layer network

[Figure: one-layer network with input units 1 and 2 connected to output unit 3 through weights w31 and w32; unit 3 has bias b3]

y_3^p = f(x_1^p ∗ w31 + x_2^p ∗ w32 + b3) = f(net_3^p)

The function f is called the activation function. We will assume that f(x) = σ(x) = 1/(1 + e^(−x)) for the rest of these notes. This is called the sigmoid function and has several nice properties (see sigmoid.pdf).
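One of the nice properties is that the sigmoid's derivative can be written in terms of its own output: σ′(x) = σ(x)(1 − σ(x)), which is what makes the y(1 − y) factors appear in the derivations below. A minimal Python sketch (the function names are illustrative, not from the slides):

import numpy as np

def sigmoid(x):
    # Logistic activation: sigma(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(y):
    # Derivative of the sigmoid written in terms of its output y = sigma(x)
    return y * (1.0 - y)

# Quick numerical check that sigma'(x) = sigma(x) * (1 - sigma(x))
x = 0.3
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
analytic = sigmoid_deriv(sigmoid(x))
print(numeric, analytic)  # the two values should agree closely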

Page 3: One-layer network

[Figure: the same one-layer network, with input units 1 and 2, output unit 3, weights w31 and w32, and bias b3]

y_3^p = f(x_1^p ∗ w31 + x_2^p ∗ w32 + b3) = f(net_3^p)

The current standard notation is that w_ij represents the weight from unit j to unit i. b_i represents a "bias" weight, which can be considered a weight from a unit that is always 1 to the unit i.

We call the net input to a unit i net_i, which is equal to the weighted summed input (e.g. w31 ∗ x1 + w32 ∗ x2 + b3 in the network on the next page).

The output of unit i is called y_i and is obtained by applying the activation function to the net input. In other words, y_i^p = f(net_i^p) = f(b_i + ∑_j x_j^p w_ij), where the sum is taken over all inputs to that unit. The superscript p indexes the specific input pattern (and thus also its specific output and desired target patterns).
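As a concrete illustration of this notation, here is a small Python sketch (variable names and values are illustrative assumptions) computing net_3^p and y_3^p for one input pattern:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One input pattern x^p = (x_1^p, x_2^p), weights w_31 and w_32, and bias b_3
x_p = np.array([0.5, -1.0])     # x_1^p, x_2^p
w_3 = np.array([0.8, 0.2])      # w_31, w_32
b_3 = 0.1

net_3 = b_3 + np.dot(w_3, x_p)  # net_3^p = b_3 + sum_j x_j^p * w_3j
y_3 = sigmoid(net_3)            # y_3^p = f(net_3^p)
print(net_3, y_3)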

Page 4: Error function

In back-propagation learning, the network is trained to minimize an error function. Usually the squared error function is used:

E = (1/2) ∑_{p=1}^{N} (y^p − t^p)^2

(the superscript p refers to pattern p; we will drop this notation below)

Remember the gradient vector points in the direction of increasing function value. Our goal is to find the set of weights and biases to give minimal error (to minimize the squared error function). In order to do this we want to update the weights in the negative gradient direction.

The gradient descent learning rule is (move a small (η) step in the negative gradient direction):

w31(new) = w31(old) − η ∂E/∂w31

Page 5

(the summation below is over pattern presentations)

[Figure: the same one-layer network, with inputs 1 and 2, output unit 3, weights w31 and w32, and bias b3]

The gradient terms can be computed using the chain rule:

∂E/∂w31 = ∑_p (∂E/∂y_3^p) (∂y_3^p/∂w31)

        = ∑_p (∂E/∂y_3^p) (∂y_3^p/∂net_3^p) (∂net_3^p/∂w31)

        = ∑_p (y_3^p − t_3^p) y_3^p (1 − y_3^p) x_1^p

        = ∑_{p=1}^{P} −δ_3^p ∗ x_1^p

We define

δ_j^p = −∂E/∂net_j^p

The usefulness of this will become apparent on the next page. (Note that the third line in the derivation is correct; we are only rewriting it in the fourth line using our newly defined δ for speed and clarity when we go to multi-layer networks.)
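To make the derivation concrete, here is a hedged Python sketch (data, weights, and names are illustrative) that computes δ_3^p and the batch gradients for the one-layer network:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Patterns as rows: columns are x_1^p and x_2^p; t[p] is the target t_3^p
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
t = np.array([0.0, 1.0, 1.0, 1.0])

w = np.array([0.1, -0.2])   # w_31, w_32
b = 0.05                    # b_3

net = X @ w + b             # net_3^p for every pattern
y = sigmoid(net)            # y_3^p

delta = -(y - t) * y * (1 - y)          # delta_3^p = -dE/dnet_3^p
grad_w31 = np.sum(-delta * X[:, 0])     # dE/dw31 = sum_p -delta_3^p * x_1^p
grad_w32 = np.sum(-delta * X[:, 1])
grad_b3 = np.sum(-delta)                # bias treated as a weight from a constant 1
print(grad_w31, grad_w32, grad_b3)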

Page 6: Multilayer Network

[Figure: a two-layer network with input units 1 and 2, hidden units 3 and 4, and output unit 5]

In order to make notation consistent for multilayer networks (where the output of one layer becomes the input of the next layer), we will use y to refer to the inputs/outputs at the first layer: define y_1^p = x_1^p and y_2^p = x_2^p for the input units.

y_5^p = f(net_5^p)

      = f(w53 ∗ y_3^p + w54 ∗ y_4^p + b5)

      = f(w53 ∗ f(w31 ∗ y_1^p + w32 ∗ y_2^p + b3) + w54 ∗ f(w41 ∗ y_1^p + w42 ∗ y_2^p + b4) + b5)
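A small Python sketch of this forward pass through the 2-2-1 network above (the weight values and function names are illustrative assumptions):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative weights for the network with inputs 1 and 2, hidden units 3 and 4, output 5
w31, w32, b3 = 0.4, -0.7, 0.1
w41, w42, b4 = 0.3, 0.6, -0.2
w53, w54, b5 = 1.2, -0.9, 0.05

y1, y2 = 0.5, 1.0                        # input pattern (y_1^p = x_1^p, y_2^p = x_2^p)

y3 = sigmoid(w31 * y1 + w32 * y2 + b3)   # hidden unit 3
y4 = sigmoid(w41 * y1 + w42 * y2 + b4)   # hidden unit 4
y5 = sigmoid(w53 * y3 + w54 * y4 + b5)   # output unit 5
print(y3, y4, y5)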

Page 7

E = (1/2) ∑_{p=1}^{P} (y^p − t^p)^2

∂E/∂w53 = ∑_p (y_5^p − t_5^p) y_5^p (1 − y_5^p) y_3^p

(Analogous to the one-layer net above)

For the next layer,

∂E/∂w31 = ∑_p (∂E/∂y_5^p) (∂y_5^p/∂net_5^p) (∂net_5^p/∂y_3^p) (∂y_3^p/∂net_3^p) (∂net_3^p/∂w31)

        = ∑_p (y_5^p − t_5^p) y_5^p (1 − y_5^p) w53 y_3^p (1 − y_3^p) y_1^p

        = ∑_p −δ_5^p w53 f′(net_3^p) y_1^p

Page 8

        = ∑_p −δ_3^p y_1^p

The answer in the second line above is correct. The reason we take it further (in the next two lines) by using the δ's is for efficiency and clarity.

It turns out that, just as in the forward pass (where we do not compute y_5^p directly from all the weights and inputs, e.g. from y_1^p and w31, but instead use the precomputed y_3^p and y_4^p), we do not need to recompute all parts of this equation. Many parts have already been computed to get ∂E/∂w53, for example. The use of saved computation is formalized with the δ notation. The idea is that you compute the δ^p's at the top layer, and then compute the δ^p's at the next layer down using a back-propagation rule. That is, in actual computation you would not have to carry out the full chain-rule expansion shown above.

If there are multiple output units, then we must sum the back-propagated contributions from each higher-level unit k that a unit sends connections to.

Consider a situation where a unit j projects to several higher-level units (indexed by k):

Page 9

[Figure: unit j projecting to several higher-level units k through weights w_kj]

δ_j^p is computed from the higher-level δ's as follows:

δ_j^p = −∂E/∂net_j^p = ∑_k (−∂E/∂net_k^p) (∂net_k^p/∂net_j^p)

      = ∑_k (−∂E/∂net_k^p) (∂net_k^p/∂y_j^p) (∂y_j^p/∂net_j^p)

      = ∑_k δ_k^p w_kj y_j^p (1 − y_j^p)

      = f′(net_j^p) ∑_k w_kj δ_k^p

Therefore δp’s can be propagated backwards just like activations (y’s) arepropagated forward. (make sure you can see this from the above equation) (Thisis why the algorithm is called back-propagation. ) This makes the algorithm quiteefficient (you don’t have to recompute the effect of all the higher weights/units onthe error derivatives for the lower weights)

Page 10

Once the δ^p's are available (remember these represent −∂E/∂net_j^p), the batch weight update rule (accumulated for each pattern and then summed and implemented at once) is

w_ji(new) = w_ji(old) − η ∂E/∂w_ji

          = w_ji(old) − η ∑_p (∂E/∂net_j^p) (∂net_j^p/∂w_ji)

          = w_ji(old) + η ∑_p δ_j^p y_i^p

w_ji(new) = w_ji(old) + η ∑_p δ_j^p y_i^p      (1)

If you want to do online updates after each presentation (for an online algorithm), the equation is

w_ji(new) = w_ji(old) + η δ_j^p y_i^p      (2)

applied after each pattern presentation.

The basic (online) back-propagation algorithm is:

Page 11

• propagate forward activations
• compute error and compute δ^p's at the output units
• back-propagate deltas
• update weights using the online update rule (2) above
• repeat for next pattern

The basic batch back-propagation algorithm is (a code sketch of both variants follows below):

• for each pattern in the batch (inner loop):
  – propagate forward activations
  – compute error and compute δ^p's at the output units
  – back-propagate deltas
  – add the deltas to the previous deltas computed in this batch
  – repeat inner loop for next pattern (in batch)

• update weights with the summed deltas using the batch update rule (1)

• reset summed deltas and restart batch (outer loop)
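As referenced above, here is a hedged Python sketch of both variants for a single-hidden-layer sigmoid network; the network size, training data, and function names are illustrative assumptions, not part of the slides:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(X, W1, b1, W2, b2):
    # Forward pass: hidden outputs H (y_j^p) and output-unit outputs Y (y_k^p)
    H = sigmoid(X @ W1.T + b1)
    Y = sigmoid(H @ W2.T + b2)
    return H, Y

def train_backprop(X, T, n_hidden=3, eta=0.5, epochs=5000, batch=True, seed=0):
    """Basic back-propagation (batch or online) with squared error."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], T.shape[1]
    W1 = rng.normal(0, 0.5, (n_hidden, n_in)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.5, (n_out, n_hidden)); b2 = np.zeros(n_out)

    for _ in range(epochs):
        if batch:
            H, Y = forward(X, W1, b1, W2, b2)
            delta_out = -(Y - T) * Y * (1 - Y)             # delta_k^p at the outputs
            delta_hid = H * (1 - H) * (delta_out @ W2)     # back-propagated deltas
            W2 += eta * delta_out.T @ H                    # summed over patterns, rule (1)
            b2 += eta * delta_out.sum(axis=0)
            W1 += eta * delta_hid.T @ X
            b1 += eta * delta_hid.sum(axis=0)
        else:
            for p in rng.permutation(len(X)):              # online: one pattern at a time, rule (2)
                h, y = forward(X[p:p+1], W1, b1, W2, b2)
                d_out = -(y - T[p:p+1]) * y * (1 - y)
                d_hid = h * (1 - h) * (d_out @ W2)
                W2 += eta * d_out.T @ h; b2 += eta * d_out[0]
                W1 += eta * d_hid.T @ X[p:p+1]; b1 += eta * d_hid[0]
    return W1, b1, W2, b2

# Example: learn XOR
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])
W1, b1, W2, b2 = train_backprop(X, T)
print(forward(X, W1, b1, W2, b2)[1].round(2))  # should move toward the XOR targets
                                               # (some seeds may need more epochs)

The batch branch applies update rule (1) with the deltas summed over all patterns before changing any weight; the online branch applies update rule (2) after each pattern presentation.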

Page 12: Training Issues

Picking the right step size for gradient descent is tricky/impossible. There might be no globally optimal appropriate step size. One learning rate is not optimal for all starting points (or for the whole trajectory). Some ways of dealing with this are...

Gradient descent with momentum

Similar to physical momentum, this algorithm computes the new weight update as for gradient descent, but then adds in a small fraction (given by the momentum coefficient µ) of the last weight update. If the new error is too much larger than the old error, the new weights and biases are discarded and the momentum coefficient is set to 0.

∆w(t) = −η ∗ dE/dw + µ ∗ ∆w(t − 1)
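A minimal Python sketch of the momentum update on a toy quadratic error surface (the surface, the step-rejection threshold of 1.5, and all names are illustrative assumptions):

import numpy as np

def E(w):                  # a simple quadratic error surface for illustration
    return 0.5 * np.sum(w ** 2 * np.array([1.0, 10.0]))

def dE(w):
    return w * np.array([1.0, 10.0])

eta, mu = 0.05, 0.9
w = np.array([3.0, 2.0])
dw = np.zeros_like(w)
old_error = E(w)

for t in range(200):
    dw = -eta * dE(w) + mu * dw          # dw(t) = -eta * dE/dw + mu * dw(t-1)
    w_new = w + dw
    if E(w_new) > 1.5 * old_error:       # error rose too much: discard the step
        dw = np.zeros_like(w)            # and drop the accumulated momentum
    else:
        w, old_error = w_new, E(w_new)

print(w, E(w))   # w should end up close to the minimum at (0, 0)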

Variable learning rate

If the error goes down, increase the learning rate for the next step. If the error goes up too much, decrease the learning rate and discard the new weights and biases.

Page 13: Conjugate Gradient Algorithms

Most conjugate gradient algorithms use a line search algorithm.

A line search algorithm finds a local minimum in a 1-D direction (it does not just take a step downhill).

Here is a demo of a line search algorithm using golden section search to bracket a local minimum:

http://heath.cs.illinois.edu/iem/optimization/GoldenSection/

Page 14: More info on golden section search

from http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/BMVA96Tut/node17.html

“An elegant and robust method of locating a minimum in such a bracket is the Golden Section Search. This involves evaluating the function at some point x in the larger of the two intervals (a,b) or (b,c). If f(x) < f(b) then x replaces the midpoint b, and b becomes an end point. If f(x) > f(b) then b remains the midpoint with x replacing one of the end points. Either way the width of the bracketing interval will reduce and the position of the minima will be better defined (Figure 2). The procedure is then repeated until the width achieves a desired tolerance. It can be shown that if the new test point, x, is chosen to be a proportion (3 − √5)/2 (hence Golden Section) along the larger sub-interval, measured from the mid-point b, then the width of the full interval will reduce at an optimal rate [6].”
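A hedged Python sketch of golden section search over an interval [a, b] (function and variable names are illustrative). Note that 1 − (√5 − 1)/2 = (3 − √5)/2 ≈ 0.382, the proportion mentioned in the quote:

import math

def golden_section_minimize(f, a, b, tol=1e-6):
    """Locate a local minimum of a unimodal f on [a, b] by golden section search."""
    invphi = (math.sqrt(5) - 1) / 2          # about 0.618
    c = b - invphi * (b - a)                 # lower probe point
    d = a + invphi * (b - a)                 # upper probe point
    while abs(b - a) > tol:
        if f(c) < f(d):
            b = d                            # minimum lies in [a, d]
        else:
            a = c                            # minimum lies in [c, b]
        c = b - invphi * (b - a)
        d = a + invphi * (b - a)
    return (a + b) / 2

print(golden_section_minimize(lambda x: (x - 1.3) ** 2, 0.0, 5.0))  # close to 1.3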

Page 15: Line searches

Line searches can also employ derivatives to further constrain the location of a minimum, and/or fit quadratic or higher-order interpolations.

Brent's search is an efficient hybrid approach that uses quadratic fitting and the golden section search.

Page 16: Conjugate gradient algorithm

Conjugate gradient algorithms with line search can be summarized as follows (a code sketch follows the steps below).

For an initial weight vector w:

1. Find the steepest descent direction d1 = −g1 (where g1 is the gradient of E at w)

2. At step j, perform a line search in direction dj (minimize E(w + α ∗ dj) with respect to α)

3. Test for the stopping criterion

4. Compute the new gradient gj+1

5. Compute the conjugate direction dj+1 using one of several formulas

6. j = j + 1; go to step 2
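A hedged Python sketch of these steps, using the Polak-Ribière formula for step 5 and SciPy's Brent-style scalar minimizer as the line search; the toy quadratic stands in for the network error E(w), and all names are assumptions:

import numpy as np
from scipy.optimize import minimize_scalar   # Brent's method line search

def conjugate_gradient(E, gradE, w, n_steps=20, tol=1e-8):
    """Nonlinear conjugate gradient with line search (Polak-Ribiere direction)."""
    g = gradE(w)
    d = -g                                               # step 1: steepest descent direction
    for _ in range(n_steps):
        alpha = minimize_scalar(lambda a: E(w + a * d)).x  # step 2: minimize E(w + alpha*d)
        w = w + alpha * d
        g_new = gradE(w)                                 # step 4: new gradient
        if np.linalg.norm(g_new) < tol:                  # step 3: stopping criterion
            break
        beta = max(0.0, g_new @ (g_new - g) / (g @ g))   # step 5: Polak-Ribiere formula
        d = -g_new + beta * d
        g = g_new
    return w

# Toy quadratic example standing in for the network error E(w)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
E = lambda w: 0.5 * w @ A @ w - b @ w
gradE = lambda w: A @ w - b
print(conjugate_gradient(E, gradE, np.array([5.0, 5.0])))  # approaches the solution of A w = b

On a quadratic surface with exact line searches, directions chosen this way reach the minimum in at most as many steps as there are weights; on a general network error surface the method is simply an efficient descent procedure.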

Page 17: Picking the conjugate direction

The conjugate direction is picked so as to avoid undoing what you have done so far. Rather than heading off in the new steepest descent direction, you go in a direction that goes downhill but does not undo the progress you have made in the direction you searched along.

What this means is that you pick a direction so that you are still minimized with respect to the previous line search directions.

Graphically, this can be seen by considering a parallel line search performed (ideally) infinitesimally downhill (at the ending point) from the first. Your new conjugate gradient line search direction is to head towards the point on this new line that is also a minimum in the parallel search direction (this means you are not losing the "minimality" you achieved in that direction).

Page 18: Conjugate Gradient example 1

Consider starting search from the X

Page 19: Conjugate Gradient example 1

Perform a line search in the negative gradient direction

Page 20: Conjugate Gradient example 1

Stop at the minimum in that direction. The new negative gradient direction is perpendicular to this direction.

Page 21: Conjugate Gradient example 1

To find the conjugate gradient search direction, consider a path moved slightly over (on the downhill side of the minimum).

Page 22: Conjugate Gradient example 1

The conjugate gradient direction heads towards the minimum on this very close parallel path.

Page 23: Conjugate Gradient example 2

Here is an example on a more complex surface

Page 24: Conjugate Gradient example 2

Consider starting search from the X

Page 25: Conjugate Gradient example 2

Perform a line search in the negative gradient direction

Page 26: Conjugate Gradient example 2

Stop at the minimum in that direction. The new negative gradient direction is perpendicular to this direction.

Page 27: Conjugate Gradient example 2

To find the conjugate gradient search direction, consider a path moved slightly over (on the downhill side of the minimum).

Page 28: Conjugate Gradient example 2

The conjugate gradient direction heads towards the minimum on this very close parallel path.

Page 29

Where would this line search (from the X) stop?

Page 30

What is the negative gradient direction from here?

Page 31

What is the negative gradient direction from here?

Page 32

And what is the conjugate gradient search direction from here?

Page 33

And what is the conjugate gradient search direction from here?

Page 34: Overfitting

Training the model is not the only problem. You want to make sure that you don't "overfit" the training data in a way that won't generalize well to future data (just like when fitting a too-high-degree polynomial, which can go through all your points with zero error but is not likely to generalize well).

Methods for preventing overfitting

Note that for infinite amounts of data, overfitting is not a problem. Overfitting is a problem when you try to fit a model with lots of parameters and not enough data (compare to fitting polynomials of different degrees).

Weight Decay

One way to reduce overfitting is to reduce the function-fitting ability of the network. (If it can't learn too many complicated functions, it can't overfit too badly.) This is analogous to fitting lower-degree polynomials when you have less data.

Reducing the number of hidden units restricts the complexity of functions that can be learned. The problem is that, ahead of time, you don't know how complex a function you want.

Page 35

The term “regularization” refers to adding a penalty to the usual error function to encourage smoothness.

Enew = E + ν ∗ Ω

Here ν is the regularization parameter and Ω is the smoothness penalty.

Weight decay sets

Ω = (1/2) ∑_i w_i^2

Note that when you then take the partial derivative of Enew with respect to a weight, the update rule will include a term proportional to −w_i. This will encourage the weights to decay toward zero (hence the name).
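A minimal Python sketch of the resulting update (the learning rate, ν, and names are illustrative assumptions):

import numpy as np

def weight_decay_update(w, grad_E, eta=0.5, nu=0.1):
    """Gradient step on Enew = E + nu * (1/2) * sum_i w_i^2.

    dEnew/dw_i = dE/dw_i + nu * w_i, so the update gains a term -eta * nu * w_i
    that pulls every weight toward zero.
    """
    return w - eta * (grad_E + nu * w)

w = np.array([2.0, -3.0, 0.5])
grad_E = np.zeros(3)                  # even with no data gradient ...
for _ in range(100):
    w = weight_decay_update(w, grad_E)
print(w)                              # ... the weights have decayed toward zero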

Bayesian Regularization

The Bayesian neural network formalism of David MacKay and Radford Neal considers neural networks not as single networks but as distributions over weights (and biases). The output of a trained network is thus not the result of applying one set of weights but an average over the outputs from the distribution. This can be computationally expensive, but MacKay and Neal have developed

Page 36

approximations, and the approach leads to automatic regularization that is very effective.

Early Stopping

Another method of preventing overfitting is again based on the fact that units with small weights act quite linearly. If we start with small weights, the network starts with very little capacity, but as it trains it can increase its weights and make use of non-linearities. This method depends on using a "stand-in" for future data. One preselects a set to be held out (called the hold-out or validation set). Matlab does this automatically for you in the newer versions.

The idea behind early stopping is to stop adjusting the weights when the error on the validation set starts increasing.
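A hedged Python sketch of early stopping; train_step and validation_error are assumed, illustrative stand-ins for one pass of weight updates and the hold-out error:

def early_stopping_train(weights, train_step, validation_error,
                         max_epochs=1000, patience=10):
    """Stop training when the hold-out (validation) error starts increasing.

    train_step(weights)       -> new weights after one pass of updates (assumed callable)
    validation_error(weights) -> error on the held-out set (assumed callable)
    """
    best_w, best_err, bad_epochs = weights, validation_error(weights), 0
    for _ in range(max_epochs):
        weights = train_step(weights)
        err = validation_error(weights)
        if err < best_err:
            best_w, best_err, bad_epochs = weights, err, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:   # validation error has kept rising: stop
                break
    return best_w                        # weights from the best validation point

# Toy illustration: "training" walks w upward, validation error is minimal at w = 25
def train_step(w):
    return w + 1
def validation_error(w):
    return (w - 25) ** 2
print(early_stopping_train(0, train_step, validation_error))  # returns 25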

Dropout and DropConnect

It has been shown that averaging the results of many networks trained on the same task performs better than training one network. Dropout and DropConnect are efficient ways of doing this by randomly removing nodes (drop out) or weights (drop connect) during training and then down-scaling the weights during testing (with all nodes and connections present).
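A minimal Python sketch of the dropout idea (the keep probability and names are illustrative assumptions; DropConnect would mask individual weights instead of whole activations):

import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p_keep=0.5, training=True):
    """Dropout applied to a layer's activations h.

    During training each unit is kept with probability p_keep (dropped otherwise).
    At test time all units are present and the activations are scaled by p_keep,
    matching the down-scaling described above.
    """
    if training:
        mask = rng.random(h.shape) < p_keep   # randomly remove nodes
        return h * mask
    return h * p_keep                         # scale at test time

h = np.array([0.2, 0.9, 0.5, 0.7])
print(dropout_forward(h, training=True))      # some activations zeroed out
print(dropout_forward(h, training=False))     # all present, scaled by p_keep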