12-1 Variations on Backpropagation
12-2 Variations

• Heuristic Modifications
  – Momentum
  – Variable Learning Rate
• Standard Numerical Optimization
  – Conjugate Gradient
  – Newton's Method (Levenberg-Marquardt)
12-3 Performance Surface Example

Network Architecture: the 1-2-1 network, with nominal parameter values

$w^1_{1,1} = 10$, $w^1_{2,1} = 10$, $b^1_1 = -5$, $b^1_2 = 5$
$w^2_{1,1} = 1$, $w^2_{1,2} = 1$, $b^2 = -1$

[Figure: nominal function, the network response plotted for inputs from −2 to 2, with outputs ranging from 0 to 1.]
12-4 Squared Error vs. $w^1_{1,1}$ and $w^2_{1,1}$

[Figure: surface and contour plots of the squared error as a function of $w^1_{1,1}$ and $w^2_{1,1}$, each ranging from −5 to 15.]
12-5 Squared Error vs. $w^1_{1,1}$ and $b^1_1$

[Figure: surface and contour plots of the squared error as a function of $w^1_{1,1}$ (−10 to 30) and $b^1_1$ (−25 to 15).]
12-6 Squared Error vs. $b^1_1$ and $b^1_2$

[Figure: surface and contour plots of the squared error as a function of $b^1_1$ and $b^1_2$, each ranging from −10 to 10.]
12-9 Momentum

First-order (low-pass) filter:
$y(k) = \gamma\, y(k-1) + (1-\gamma)\, w(k)$, with $0 \le \gamma < 1$

Example input:
$w(k) = 1 + \sin\left(\frac{2\pi k}{16}\right)$

[Figure: filter output over 200 steps for $\gamma = 0.9$ and $\gamma = 0.98$; the larger $\gamma$ is, the more the oscillation of the input is smoothed out.]
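The filter equation can be sketched in a few lines of Python; the input sequence $w(k) = 1 + \sin(2\pi k/16)$ is the example from the slide, while the function name `smooth` is just for illustration.

```python
import math

def smooth(w_seq, gamma):
    """First-order filter: y(k) = gamma * y(k-1) + (1 - gamma) * w(k)."""
    y, out = 0.0, []
    for w in w_seq:
        y = gamma * y + (1 - gamma) * w
        out.append(y)
    return out

# the slide's example input: w(k) = 1 + sin(2*pi*k / 16)
w_seq = [1 + math.sin(2 * math.pi * k / 16) for k in range(200)]
y_90 = smooth(w_seq, gamma=0.9)
y_98 = smooth(w_seq, gamma=0.98)
```

The closer $\gamma$ is to 1, the smaller the residual oscillation in the filter output, at the cost of a slower transient.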
12-10 Momentum Backpropagation

Steepest Descent Backpropagation (SDBP):
$\Delta W^m(k) = -\alpha\, s^m (a^{m-1})^T$
$\Delta b^m(k) = -\alpha\, s^m$

Momentum Backpropagation (MOBP):
$\Delta W^m(k) = \gamma\, \Delta W^m(k-1) - (1-\gamma)\,\alpha\, s^m (a^{m-1})^T$
$\Delta b^m(k) = \gamma\, \Delta b^m(k-1) - (1-\gamma)\,\alpha\, s^m$

[Figure: MOBP trajectory with $\gamma = 0.8$ on the $w^1_{1,1}$–$w^2_{1,1}$ contour plot; the momentum term smooths out the oscillations of the SDBP trajectory.]
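As a sketch (not the book's code), the MOBP weight increment can be written with NumPy; the sensitivity vector `s` and previous-layer activation `a_prev` below are made-up values purely for illustration.

```python
import numpy as np

def mobp_step(dW_prev, s, a_prev, alpha=0.1, gamma=0.8):
    """MOBP increment: dW(k) = gamma*dW(k-1) - (1-gamma)*alpha * s (a_prev)^T."""
    return gamma * dW_prev - (1 - gamma) * alpha * np.outer(s, a_prev)

# toy sensitivities and activations (hypothetical values)
s = np.array([0.2, -0.1])
a_prev = np.array([1.0, 0.5, -0.3])

dW = np.zeros((2, 3))
for _ in range(5):                 # with repeated identical gradients the
    dW = mobp_step(dW, s, a_prev)  # increment accumulates toward -alpha*s*a^T
```

With $\gamma = 0$ the increment reduces to the plain SDBP step.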
12-11 Variable Learning Rate (VLBP)

• If the squared error (over the entire training set) increases by more than some set percentage ζ after a weight update, then the weight update is discarded, the learning rate is multiplied by some factor ρ (0 < ρ < 1), and the momentum coefficient γ (if it is used) is set to zero.
• If the squared error decreases after a weight update, then the weight update is accepted and the learning rate is multiplied by some factor η > 1. If the momentum coefficient γ has previously been set to zero, it is reset to its original value.
• If the squared error increases by less than ζ, then the weight update is accepted, but the learning rate and the momentum coefficient are unchanged.
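The three rules can be condensed into a small helper. This is a sketch: `vlbp_adjust` and its argument names are invented for illustration, with default factors taken from the chapter's example (η = 1.05, ρ = 0.7, ζ = 4%).

```python
def vlbp_adjust(err_new, err_old, alpha, gamma, gamma0,
                eta=1.05, rho=0.7, zeta=0.04):
    """One application of the VLBP rules. Returns (accept, alpha, gamma)."""
    if err_new > err_old * (1 + zeta):   # error grew more than zeta: reject step
        return False, alpha * rho, 0.0
    if err_new < err_old:                # error decreased: accept and speed up
        return True, alpha * eta, gamma0 if gamma == 0.0 else gamma
    return True, alpha, gamma            # small increase: accept, leave rates alone
```

The caller keeps the previous weights around so that a rejected update can be discarded.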
12-12 Example

$\eta = 1.05$, $\rho = 0.7$, $\zeta = 4\%$

[Figure: VLBP trajectory on the $w^1_{1,1}$–$w^2_{1,1}$ contour plot, together with the squared error and the learning rate plotted against iteration number on a log scale.]
12-13 Conjugate Gradient

1. The first search direction is steepest descent:
$p_0 = -g_0$, where $g_k \equiv \nabla F(x)\big|_{x = x_k}$

2. Take a step, choosing the learning rate $\alpha_k$ to minimize the function along the search direction:
$x_{k+1} = x_k + \alpha_k p_k$

3. Select the next search direction according to:
$p_k = -g_k + \beta_k p_{k-1}$

where

$\beta_k = \dfrac{\Delta g_{k-1}^T g_k}{\Delta g_{k-1}^T p_{k-1}}$ or $\beta_k = \dfrac{g_k^T g_k}{g_{k-1}^T g_{k-1}}$ or $\beta_k = \dfrac{\Delta g_{k-1}^T g_k}{g_{k-1}^T g_{k-1}}$
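For a quadratic function the three steps can be carried out exactly. The sketch below uses the Fletcher-Reeves form of $\beta_k$ and an exact line search (valid for quadratics); the matrix `A` is chosen arbitrarily for illustration.

```python
import numpy as np

def conjugate_gradient(A, x0):
    """Minimize F(x) = 0.5 x^T A x (A symmetric positive definite)."""
    x = np.asarray(x0, dtype=float)
    g = A @ x                 # gradient of F at x
    p = -g                    # step 1: first direction is steepest descent
    for _ in range(len(x)):
        alpha = -(g @ p) / (p @ (A @ p))   # step 2: exact line minimization
        x = x + alpha * p
        g_new = A @ x
        if np.linalg.norm(g_new) < 1e-12:
            break
        beta = (g_new @ g_new) / (g @ g)   # step 3: Fletcher-Reeves beta
        p = -g_new + beta * p
        g = g_new
    return x

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
x_min = conjugate_gradient(A, [2.0, -1.0])
```

For an n-dimensional quadratic, conjugate gradient with exact line searches reaches the minimum in at most n iterations, here two.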
12-16 Golden Section Search

τ = 0.618
Set c1 = a1 + (1−τ)(b1−a1), Fc = F(c1)
    d1 = b1 − (1−τ)(b1−a1), Fd = F(d1)
For k = 1, 2, ... repeat
    If Fc < Fd then
        Set ak+1 = ak ; bk+1 = dk ; dk+1 = ck
            ck+1 = ak+1 + (1−τ)(bk+1 − ak+1)
            Fd = Fc ; Fc = F(ck+1)
    else
        Set ak+1 = ck ; bk+1 = bk ; ck+1 = dk
            dk+1 = bk+1 − (1−τ)(bk+1 − ak+1)
            Fc = Fd ; Fd = F(dk+1)
    end
end until bk+1 − ak+1 < tol
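A direct Python transcription of the pseudocode (a sketch; the function name and tolerance are arbitrary):

```python
def golden_section_search(F, a, b, tol=1e-6):
    """Interval-reduction line search for the minimum of F on [a, b]."""
    tau = 0.618
    c = a + (1 - tau) * (b - a); Fc = F(c)
    d = b - (1 - tau) * (b - a); Fd = F(d)
    while b - a >= tol:
        if Fc < Fd:
            b, d = d, c                    # minimum lies in [a, d]
            c = a + (1 - tau) * (b - a)
            Fd, Fc = Fc, F(c)              # reuse old Fc, evaluate F once
        else:
            a, c = c, d                    # minimum lies in [c, b]
            d = b - (1 - tau) * (b - a)
            Fc, Fd = Fd, F(d)
    return 0.5 * (a + b)

x_star = golden_section_search(lambda x: (x - 2.0) ** 2, 0.0, 5.0)
```

Each pass reuses one of the two interior function values, so only one new evaluation of F is needed per iteration, and the interval shrinks by the factor τ each time.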
12-17 Conjugate Gradient BP (CGBP)

[Figure: CGBP on the $w^1_{1,1}$–$w^2_{1,1}$ contour plot (left: intermediate steps of the line search; right: the complete trajectory).]
12-18 Newton's Method

$x_{k+1} = x_k - A_k^{-1} g_k$, where $A_k \equiv \nabla^2 F(x)\big|_{x = x_k}$ and $g_k \equiv \nabla F(x)\big|_{x = x_k}$

If the performance index is a sum of squares function:
$F(x) = \sum_{i=1}^{N} v_i^2(x) = v^T(x)\, v(x)$

then the jth element of the gradient is:
$[\nabla F(x)]_j = \dfrac{\partial F(x)}{\partial x_j} = 2 \sum_{i=1}^{N} v_i(x)\, \dfrac{\partial v_i(x)}{\partial x_j}$
12-19 Matrix Form

The gradient can be written in matrix form:
$\nabla F(x) = 2 J^T(x)\, v(x)$

where J is the Jacobian matrix:

$J(x) = \begin{bmatrix}
\frac{\partial v_1}{\partial x_1} & \frac{\partial v_1}{\partial x_2} & \cdots & \frac{\partial v_1}{\partial x_n} \\
\frac{\partial v_2}{\partial x_1} & \frac{\partial v_2}{\partial x_2} & \cdots & \frac{\partial v_2}{\partial x_n} \\
\vdots & \vdots & & \vdots \\
\frac{\partial v_N}{\partial x_1} & \frac{\partial v_N}{\partial x_2} & \cdots & \frac{\partial v_N}{\partial x_n}
\end{bmatrix}$
12-20 Hessian

$[\nabla^2 F(x)]_{k,j} = \dfrac{\partial^2 F(x)}{\partial x_k\, \partial x_j} = 2 \sum_{i=1}^{N} \left\{ \dfrac{\partial v_i(x)}{\partial x_k}\, \dfrac{\partial v_i(x)}{\partial x_j} + v_i(x)\, \dfrac{\partial^2 v_i(x)}{\partial x_k\, \partial x_j} \right\}$

$\nabla^2 F(x) = 2 J^T(x)\, J(x) + 2 S(x)$, where $S(x) = \sum_{i=1}^{N} v_i(x)\, \nabla^2 v_i(x)$
12-21 Gauss-Newton Method

Approximate the Hessian matrix as:
$\nabla^2 F(x) \cong 2 J^T(x)\, J(x)$

Newton's method then becomes:
$x_{k+1} = x_k - \left[ 2 J^T(x_k)\, J(x_k) \right]^{-1} 2 J^T(x_k)\, v(x_k) = x_k - \left[ J^T(x_k)\, J(x_k) \right]^{-1} J^T(x_k)\, v(x_k)$
12-22 Levenberg-Marquardt

Gauss-Newton approximates the Hessian by:
$H = J^T J$

This matrix may be singular, but it can be made invertible as follows:
$G = H + \mu I$

If the eigenvalues and eigenvectors of H are $\{\lambda_1, \lambda_2, \ldots, \lambda_n\}$ and $\{z_1, z_2, \ldots, z_n\}$, then
$G z_i = [H + \mu I] z_i = H z_i + \mu z_i = \lambda_i z_i + \mu z_i = (\lambda_i + \mu) z_i$

so the eigenvalues of G are $\lambda_i + \mu$, and G can be made positive definite by increasing μ. The Levenberg-Marquardt iteration is:
$x_{k+1} = x_k - \left[ J^T(x_k)\, J(x_k) + \mu_k I \right]^{-1} J^T(x_k)\, v(x_k)$
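As a sketch of the update (not from the slides), here is LM applied to a small two-residual problem, the Rosenbrock function written as a sum of squares, $v_1 = 10(x_2 - x_1^2)$, $v_2 = 1 - x_1$, with the usual μ adjustment (reduce μ after a successful step, increase it after a failed one).

```python
import numpy as np

def residuals(x):
    # Rosenbrock as a sum of squares: v1 = 10(x2 - x1^2), v2 = 1 - x1
    return np.array([10.0 * (x[1] - x[0] ** 2), 1.0 - x[0]])

def jacobian(x):
    return np.array([[-20.0 * x[0], 10.0],
                     [-1.0,          0.0]])

def levenberg_marquardt(x, mu=0.01, factor=10.0, iters=200):
    v = residuals(x)
    for _ in range(iters):
        if v @ v < 1e-20:
            break
        J = jacobian(x)
        # LM step: dx = -(J^T J + mu I)^{-1} J^T v
        dx = -np.linalg.solve(J.T @ J + mu * np.eye(len(x)), J.T @ v)
        v_new = residuals(x + dx)
        if v_new @ v_new < v @ v:      # success: accept, move toward Gauss-Newton
            x, v, mu = x + dx, v_new, mu / factor
        else:                          # failure: reject, move toward steepest descent
            mu *= factor
    return x

x_min = levenberg_marquardt(np.array([-1.2, 1.0]))
```

Since a step is accepted only when the sum of squares decreases, F never increases during the iteration.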
12-23 Adjustment of $\mu_k$

As $\mu_k \to 0$, LM becomes Gauss-Newton:
$x_{k+1} = x_k - \left[ J^T(x_k)\, J(x_k) \right]^{-1} J^T(x_k)\, v(x_k)$

As $\mu_k \to \infty$, LM becomes steepest descent with a small learning rate (using $\nabla F(x) = 2 J^T(x)\, v(x)$):
$x_{k+1} \cong x_k - \dfrac{1}{\mu_k} J^T(x_k)\, v(x_k) = x_k - \dfrac{1}{2\mu_k} \nabla F(x)$

Therefore, begin with a small $\mu_k$ to use Gauss-Newton and speed convergence. If a step does not yield a smaller F(x), then repeat the step with an increased $\mu_k$ until F(x) is decreased. F(x) must decrease eventually, since we will be taking a very small step in the steepest descent direction.
12-24 Application to Multilayer Network

The performance index for the multilayer network is:
$F(x) = \sum_{q=1}^{Q} (t_q - a_q)^T (t_q - a_q) = \sum_{q=1}^{Q} e_q^T e_q = \sum_{q=1}^{Q} \sum_{j=1}^{S^M} (e_{j,q})^2 = \sum_{i=1}^{N} v_i^2$

The error vector is:
$v^T = [\, v_1 \;\; v_2 \;\cdots\; v_N \,] = [\, e_{1,1} \;\; e_{2,1} \;\cdots\; e_{S^M,1} \;\; e_{1,2} \;\cdots\; e_{S^M,Q} \,]$

The parameter vector is:
$x^T = [\, x_1 \;\; x_2 \;\cdots\; x_n \,] = [\, w^1_{1,1} \;\; w^1_{1,2} \;\cdots\; w^1_{S^1,R} \;\; b^1_1 \;\cdots\; b^1_{S^1} \;\; w^2_{1,1} \;\cdots\; b^M_{S^M} \,]$

The dimensions of the two vectors are:
$N = Q \times S^M$
$n = S^1 (R + 1) + S^2 (S^1 + 1) + \cdots + S^M (S^{M-1} + 1)$
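A quick way to check the dimension formulas (a sketch; the helper name is invented). For the chapter's 1-2-1 example network, R = 1, S¹ = 2, S² = 1:

```python
def num_parameters(R, layer_sizes):
    """n = S^1 (R+1) + S^2 (S^1+1) + ... + S^M (S^{M-1}+1)."""
    n, prev = 0, R
    for S in layer_sizes:
        n += S * (prev + 1)   # S rows of (prev inputs + 1 bias) each
        prev = S
    return n

# 1-2-1 network: n = 2*(1+1) + 1*(2+1) = 7 parameters
n = num_parameters(1, [2, 1])

# N = Q * S^M error terms, e.g. Q = 11 training points and S^M = 1 output
N = 11 * 1
```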
12-25 Jacobian Matrix

$J(x) = \begin{bmatrix}
\frac{\partial e_{1,1}}{\partial w^1_{1,1}} & \frac{\partial e_{1,1}}{\partial w^1_{1,2}} & \cdots & \frac{\partial e_{1,1}}{\partial w^1_{S^1,R}} & \frac{\partial e_{1,1}}{\partial b^1_1} & \cdots \\
\frac{\partial e_{2,1}}{\partial w^1_{1,1}} & \frac{\partial e_{2,1}}{\partial w^1_{1,2}} & \cdots & \frac{\partial e_{2,1}}{\partial w^1_{S^1,R}} & \frac{\partial e_{2,1}}{\partial b^1_1} & \cdots \\
\vdots & \vdots & & \vdots & \vdots & \\
\frac{\partial e_{S^M,1}}{\partial w^1_{1,1}} & \frac{\partial e_{S^M,1}}{\partial w^1_{1,2}} & \cdots & \frac{\partial e_{S^M,1}}{\partial w^1_{S^1,R}} & \frac{\partial e_{S^M,1}}{\partial b^1_1} & \cdots \\
\frac{\partial e_{1,2}}{\partial w^1_{1,1}} & \frac{\partial e_{1,2}}{\partial w^1_{1,2}} & \cdots & \frac{\partial e_{1,2}}{\partial w^1_{S^1,R}} & \frac{\partial e_{1,2}}{\partial b^1_1} & \cdots \\
\vdots & \vdots & & \vdots & \vdots &
\end{bmatrix}$
12-26 Computing the Jacobian

SDBP computes terms like:
$\dfrac{\partial \hat F(x)}{\partial x_l} = \dfrac{\partial e_q^T e_q}{\partial x_l}$

using the chain rule:
$\dfrac{\partial \hat F}{\partial w^m_{i,j}} = \dfrac{\partial \hat F}{\partial n^m_i} \cdot \dfrac{\partial n^m_i}{\partial w^m_{i,j}}$

where the sensitivity
$s^m_i \equiv \dfrac{\partial \hat F}{\partial n^m_i}$
is computed using backpropagation.

For the Jacobian we need to compute terms like:
$[J]_{h,l} = \dfrac{\partial v_h}{\partial x_l} = \dfrac{\partial e_{k,q}}{\partial x_l}$
12-27 Marquardt Sensitivity

If we define a Marquardt sensitivity:
$\tilde s^m_{i,h} \equiv \dfrac{\partial v_h}{\partial n^m_{i,q}} = \dfrac{\partial e_{k,q}}{\partial n^m_{i,q}}$, where $h = (q - 1) S^M + k$

we can compute the Jacobian as follows:

weight:
$[J]_{h,l} = \dfrac{\partial v_h}{\partial x_l} = \dfrac{\partial e_{k,q}}{\partial w^m_{i,j}} = \dfrac{\partial e_{k,q}}{\partial n^m_{i,q}} \cdot \dfrac{\partial n^m_{i,q}}{\partial w^m_{i,j}} = \tilde s^m_{i,h} \cdot \dfrac{\partial n^m_{i,q}}{\partial w^m_{i,j}} = \tilde s^m_{i,h}\, a^{m-1}_{j,q}$

bias:
$[J]_{h,l} = \dfrac{\partial v_h}{\partial x_l} = \dfrac{\partial e_{k,q}}{\partial b^m_i} = \dfrac{\partial e_{k,q}}{\partial n^m_{i,q}} \cdot \dfrac{\partial n^m_{i,q}}{\partial b^m_i} = \tilde s^m_{i,h} \cdot \dfrac{\partial n^m_{i,q}}{\partial b^m_i} = \tilde s^m_{i,h}$
12-28 Computing the Sensitivities

Initialization:
$\tilde s^M_{i,h} = \dfrac{\partial v_h}{\partial n^M_{i,q}} = \dfrac{\partial e_{k,q}}{\partial n^M_{i,q}} = \dfrac{\partial (t_{k,q} - a^M_{k,q})}{\partial n^M_{i,q}} = -\dfrac{\partial a^M_{k,q}}{\partial n^M_{i,q}}$

so that
$\tilde s^M_{i,h} = \begin{cases} -\dot f^M(n^M_{i,q}) & \text{for } i = k \\ 0 & \text{for } i \ne k \end{cases}$

or, in matrix form, $\tilde S^M_q = -\dot F^M(n^M_q)$.

Backpropagation:
$\tilde S^m_q = \dot F^m(n^m_q)\, (W^{m+1})^T\, \tilde S^{m+1}_q$

The individual matrices are augmented into the Marquardt sensitivities:
$\tilde S^m = \left[\, \tilde S^m_1 \;\middle|\; \tilde S^m_2 \;\middle|\; \cdots \;\middle|\; \tilde S^m_Q \,\right]$
12-29 LMBP (Levenberg-Marquardt Backpropagation)

1. Present all inputs to the network and compute the corresponding network outputs and the errors. Compute the sum of squared errors over all inputs, F(x).
2. Compute the Jacobian matrix. Calculate the sensitivities with the backpropagation algorithm, after initializing. Augment the individual matrices into the Marquardt sensitivities. Compute the elements of the Jacobian matrix.
3. Solve $\Delta x_k = -\left[ J^T(x_k)\, J(x_k) + \mu_k I \right]^{-1} J^T(x_k)\, v(x_k)$ to obtain the change in the weights.
4. Recompute the sum of squared errors with the new weights. If this new sum of squares is smaller than that computed in step 1, then divide $\mu_k$ by the factor $\vartheta$, update the weights, and go back to step 1. If the sum of squares is not reduced, then multiply $\mu_k$ by $\vartheta$ and go back to step 3.
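Putting the four steps together, here is a compact LMBP sketch for a 1-2-1 network (log-sigmoid hidden layer, linear output), fitting samples of the chapter's nominal function. All function names and the training setup are illustrative, not the book's code.

```python
import numpy as np

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

def unpack(x):
    # x = [w1 (2), b1 (2), w2 (2), b2 (1)] for a 1-2-1 network
    return x[0:2], x[2:4], x[4:6], x[6]

def forward(x, p):
    w1, b1, w2, b2 = unpack(x)
    a1 = logsig(w1 * p + b1)        # hidden layer
    return a1, w2 @ a1 + b2         # linear output layer

def residuals_and_jacobian(x, P, T):
    # steps 1-2: errors v and Jacobian J via Marquardt sensitivities
    w1, b1, w2, b2 = unpack(x)
    v, J = np.zeros(len(P)), np.zeros((len(P), len(x)))
    for h, (p, t) in enumerate(zip(P, T)):
        a1, a2 = forward(x, p)
        v[h] = t - a2
        s2 = -1.0                        # output layer is linear: -f'(n) = -1
        s1 = a1 * (1 - a1) * w2 * s2     # backpropagate through the logsig layer
        J[h] = np.concatenate([s1 * p, s1, s2 * a1, [s2]])
    return v, J

def lmbp(x, P, T, mu=0.01, factor=5.0, iters=200):
    v, J = residuals_and_jacobian(x, P, T)
    for _ in range(iters):
        if v @ v < 1e-12:
            break
        # step 3: solve for the weight change
        dx = -np.linalg.solve(J.T @ J + mu * np.eye(len(x)), J.T @ v)
        # step 4: accept the step only if the squared error decreased
        v_new, J_new = residuals_and_jacobian(x + dx, P, T)
        if v_new @ v_new < v @ v:
            x, v, J, mu = x + dx, v_new, J_new, mu / factor
        else:
            mu *= factor
    return x, v @ v

# training data: samples of the nominal function (true parameters from slide 12-3)
x_true = np.array([10.0, 10.0, -5.0, 5.0, 1.0, 1.0, -1.0])
P = np.linspace(-2, 2, 11)
T = np.array([forward(x_true, p)[1] for p in P])

x0 = np.random.default_rng(0).normal(scale=0.5, size=7)
x_fit, sse = lmbp(x0.copy(), P, T)
```

Because updates are accepted only when the sum of squares decreases, the final error can never exceed the initial one, though (as the performance surfaces earlier in the chapter show) the algorithm may still settle in a local minimum.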