12-1 Variations on Backpropagation
12-2 Variations

• Heuristic Modifications
  – Momentum
  – Variable Learning Rate
• Standard Numerical Optimization
  – Conjugate Gradient
  – Newton's Method (Levenberg-Marquardt)
12-3 Performance Surface Example

Network Architecture: the 1-2-1 network, with nominal parameter values

$w^1_{1,1} = 10$, $w^1_{2,1} = 10$, $b^1_1 = -5$, $b^1_2 = 5$
$w^2_{1,1} = 1$, $w^2_{1,2} = 1$, $b^2 = -1$

[Figure: nominal function, the network response plotted for inputs from −2 to 2, with outputs ranging from 0 to 1.]
12-4 Squared Error vs. $w^1_{1,1}$ and $w^2_{1,1}$

[Figure: surface and contour plots of the squared error as a function of $w^1_{1,1}$ and $w^2_{1,1}$, each ranging from −5 to 15.]
12-5 Squared Error vs. $w^1_{1,1}$ and $b^1_1$

[Figure: surface and contour plots of the squared error as a function of $w^1_{1,1}$ (−10 to 30) and $b^1_1$ (−25 to 15).]
12-6 Squared Error vs. $b^1_1$ and $b^1_2$

[Figure: surface and contour plots of the squared error as a function of $b^1_1$ and $b^1_2$, each ranging from −10 to 10.]
12-9 Momentum

First-order (low-pass) filter:
$y(k) = \gamma\, y(k-1) + (1-\gamma)\, w(k)$, with $0 \le \gamma < 1$

Example input:
$w(k) = 1 + \sin\left(\frac{2\pi k}{16}\right)$

[Figure: filter output over 200 steps for $\gamma = 0.9$ and $\gamma = 0.98$; the larger $\gamma$ is, the more the oscillation of the input is smoothed out.]
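The filter equation can be sketched in a few lines of Python; the input sequence $w(k) = 1 + \sin(2\pi k/16)$ is the example from the slide, while the function name `smooth` is just for illustration.

```python
import math

def smooth(w_seq, gamma):
    """First-order filter: y(k) = gamma * y(k-1) + (1 - gamma) * w(k)."""
    y, out = 0.0, []
    for w in w_seq:
        y = gamma * y + (1 - gamma) * w
        out.append(y)
    return out

# the slide's example input: w(k) = 1 + sin(2*pi*k / 16)
w_seq = [1 + math.sin(2 * math.pi * k / 16) for k in range(200)]
y_90 = smooth(w_seq, gamma=0.9)
y_98 = smooth(w_seq, gamma=0.98)
```

The closer $\gamma$ is to 1, the smaller the residual oscillation in the filter output, at the cost of a slower transient.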
12-10 Momentum Backpropagation

Steepest Descent Backpropagation (SDBP):
$\Delta W^m(k) = -\alpha\, s^m (a^{m-1})^T$
$\Delta b^m(k) = -\alpha\, s^m$

Momentum Backpropagation (MOBP):
$\Delta W^m(k) = \gamma\, \Delta W^m(k-1) - (1-\gamma)\,\alpha\, s^m (a^{m-1})^T$
$\Delta b^m(k) = \gamma\, \Delta b^m(k-1) - (1-\gamma)\,\alpha\, s^m$

[Figure: MOBP trajectory with $\gamma = 0.8$ on the $w^1_{1,1}$–$w^2_{1,1}$ contour plot; the momentum term smooths out the oscillations of the SDBP trajectory.]
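As a sketch (not the book's code), the MOBP weight increment can be written with NumPy; the sensitivity vector `s` and previous-layer activation `a_prev` below are made-up values purely for illustration.

```python
import numpy as np

def mobp_step(dW_prev, s, a_prev, alpha=0.1, gamma=0.8):
    """MOBP increment: dW(k) = gamma*dW(k-1) - (1-gamma)*alpha * s (a_prev)^T."""
    return gamma * dW_prev - (1 - gamma) * alpha * np.outer(s, a_prev)

# toy sensitivities and activations (hypothetical values)
s = np.array([0.2, -0.1])
a_prev = np.array([1.0, 0.5, -0.3])

dW = np.zeros((2, 3))
for _ in range(5):                 # with repeated identical gradients the
    dW = mobp_step(dW, s, a_prev)  # increment accumulates toward -alpha*s*a^T
```

With $\gamma = 0$ the increment reduces to the plain SDBP step.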
12-11 Variable Learning Rate (VLBP)

• If the squared error (over the entire training set) increases by more than some set percentage ζ after a weight update, then the weight update is discarded, the learning rate is multiplied by some factor ρ (0 < ρ < 1), and the momentum coefficient γ (if it is used) is set to zero.
• If the squared error decreases after a weight update, then the weight update is accepted and the learning rate is multiplied by some factor η > 1. If the momentum coefficient γ has previously been set to zero, it is reset to its original value.
• If the squared error increases by less than ζ, then the weight update is accepted, but the learning rate and the momentum coefficient are unchanged.
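The three rules can be condensed into a small helper. This is a sketch: `vlbp_adjust` and its argument names are invented for illustration, with default factors taken from the chapter's example (η = 1.05, ρ = 0.7, ζ = 4%).

```python
def vlbp_adjust(err_new, err_old, alpha, gamma, gamma0,
                eta=1.05, rho=0.7, zeta=0.04):
    """One application of the VLBP rules. Returns (accept, alpha, gamma)."""
    if err_new > err_old * (1 + zeta):   # error grew more than zeta: reject step
        return False, alpha * rho, 0.0
    if err_new < err_old:                # error decreased: accept and speed up
        return True, alpha * eta, gamma0 if gamma == 0.0 else gamma
    return True, alpha, gamma            # small increase: accept, leave rates alone
```

The caller keeps the previous weights around so that a rejected update can be discarded.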
12-12 Example

$\eta = 1.05$, $\rho = 0.7$, $\zeta = 4\%$

[Figure: VLBP trajectory on the $w^1_{1,1}$–$w^2_{1,1}$ contour plot, together with the squared error and the learning rate plotted against iteration number on a log scale.]
12-13 Conjugate Gradient

1. The first search direction is steepest descent:
$p_0 = -g_0$, where $g_k \equiv \nabla F(x)\big|_{x = x_k}$

2. Take a step, choosing the learning rate $\alpha_k$ to minimize the function along the search direction:
$x_{k+1} = x_k + \alpha_k p_k$

3. Select the next search direction according to:
$p_k = -g_k + \beta_k p_{k-1}$

where

$\beta_k = \dfrac{\Delta g_{k-1}^T g_k}{\Delta g_{k-1}^T p_{k-1}}$ or $\beta_k = \dfrac{g_k^T g_k}{g_{k-1}^T g_{k-1}}$ or $\beta_k = \dfrac{\Delta g_{k-1}^T g_k}{g_{k-1}^T g_{k-1}}$
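For a quadratic function the three steps can be carried out exactly. The sketch below uses the Fletcher-Reeves form of $\beta_k$ and an exact line search (valid for quadratics); the matrix `A` is chosen arbitrarily for illustration.

```python
import numpy as np

def conjugate_gradient(A, x0):
    """Minimize F(x) = 0.5 x^T A x (A symmetric positive definite)."""
    x = np.asarray(x0, dtype=float)
    g = A @ x                 # gradient of F at x
    p = -g                    # step 1: first direction is steepest descent
    for _ in range(len(x)):
        alpha = -(g @ p) / (p @ (A @ p))   # step 2: exact line minimization
        x = x + alpha * p
        g_new = A @ x
        if np.linalg.norm(g_new) < 1e-12:
            break
        beta = (g_new @ g_new) / (g @ g)   # step 3: Fletcher-Reeves beta
        p = -g_new + beta * p
        g = g_new
    return x

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
x_min = conjugate_gradient(A, [2.0, -1.0])
```

For an n-dimensional quadratic, conjugate gradient with exact line searches reaches the minimum in at most n iterations, here two.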
12-16 Golden Section Search

τ = 0.618
Set c1 = a1 + (1−τ)(b1−a1), Fc = F(c1)
    d1 = b1 − (1−τ)(b1−a1), Fd = F(d1)
For k = 1, 2, ... repeat
    If Fc < Fd then
        Set ak+1 = ak ; bk+1 = dk ; dk+1 = ck
            ck+1 = ak+1 + (1−τ)(bk+1 − ak+1)
            Fd = Fc ; Fc = F(ck+1)
    else
        Set ak+1 = ck ; bk+1 = bk ; ck+1 = dk
            dk+1 = bk+1 − (1−τ)(bk+1 − ak+1)
            Fc = Fd ; Fd = F(dk+1)
    end
end until bk+1 − ak+1 < tol
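A direct Python transcription of the pseudocode (a sketch; the function name and tolerance are arbitrary):

```python
def golden_section_search(F, a, b, tol=1e-6):
    """Interval-reduction line search for the minimum of F on [a, b]."""
    tau = 0.618
    c = a + (1 - tau) * (b - a); Fc = F(c)
    d = b - (1 - tau) * (b - a); Fd = F(d)
    while b - a >= tol:
        if Fc < Fd:
            b, d = d, c                    # minimum lies in [a, d]
            c = a + (1 - tau) * (b - a)
            Fd, Fc = Fc, F(c)              # reuse old Fc, evaluate F once
        else:
            a, c = c, d                    # minimum lies in [c, b]
            d = b - (1 - tau) * (b - a)
            Fc, Fd = Fd, F(d)
    return 0.5 * (a + b)

x_star = golden_section_search(lambda x: (x - 2.0) ** 2, 0.0, 5.0)
```

Each pass reuses one of the two interior function values, so only one new evaluation of F is needed per iteration, and the interval shrinks by the factor τ each time.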
12-17 Conjugate Gradient BP (CGBP)

[Figure: CGBP on the $w^1_{1,1}$–$w^2_{1,1}$ contour plot (left: intermediate steps of the line search; right: the complete trajectory).]
12-18 Newton's Method

$x_{k+1} = x_k - A_k^{-1} g_k$, where $A_k \equiv \nabla^2 F(x)\big|_{x = x_k}$ and $g_k \equiv \nabla F(x)\big|_{x = x_k}$

If the performance index is a sum of squares function:
$F(x) = \sum_{i=1}^{N} v_i^2(x) = v^T(x)\, v(x)$

then the jth element of the gradient is:
$[\nabla F(x)]_j = \dfrac{\partial F(x)}{\partial x_j} = 2 \sum_{i=1}^{N} v_i(x)\, \dfrac{\partial v_i(x)}{\partial x_j}$
12-19 Matrix Form

The gradient can be written in matrix form:
$\nabla F(x) = 2 J^T(x)\, v(x)$

where J is the Jacobian matrix:

$J(x) = \begin{bmatrix}
\frac{\partial v_1}{\partial x_1} & \frac{\partial v_1}{\partial x_2} & \cdots & \frac{\partial v_1}{\partial x_n} \\
\frac{\partial v_2}{\partial x_1} & \frac{\partial v_2}{\partial x_2} & \cdots & \frac{\partial v_2}{\partial x_n} \\
\vdots & \vdots & & \vdots \\
\frac{\partial v_N}{\partial x_1} & \frac{\partial v_N}{\partial x_2} & \cdots & \frac{\partial v_N}{\partial x_n}
\end{bmatrix}$
12-20 Hessian

$[\nabla^2 F(x)]_{k,j} = \dfrac{\partial^2 F(x)}{\partial x_k\, \partial x_j} = 2 \sum_{i=1}^{N} \left\{ \dfrac{\partial v_i(x)}{\partial x_k}\, \dfrac{\partial v_i(x)}{\partial x_j} + v_i(x)\, \dfrac{\partial^2 v_i(x)}{\partial x_k\, \partial x_j} \right\}$

$\nabla^2 F(x) = 2 J^T(x)\, J(x) + 2 S(x)$, where $S(x) = \sum_{i=1}^{N} v_i(x)\, \nabla^2 v_i(x)$
12-21 Gauss-Newton Method

Approximate the Hessian matrix as:
$\nabla^2 F(x) \cong 2 J^T(x)\, J(x)$

Newton's method then becomes:
$x_{k+1} = x_k - \left[ 2 J^T(x_k)\, J(x_k) \right]^{-1} 2 J^T(x_k)\, v(x_k) = x_k - \left[ J^T(x_k)\, J(x_k) \right]^{-1} J^T(x_k)\, v(x_k)$
12-22 Levenberg-Marquardt

Gauss-Newton approximates the Hessian by:
$H = J^T J$

This matrix may be singular, but it can be made invertible as follows:
$G = H + \mu I$

If the eigenvalues and eigenvectors of H are $\{\lambda_1, \lambda_2, \ldots, \lambda_n\}$ and $\{z_1, z_2, \ldots, z_n\}$, then
$G z_i = [H + \mu I] z_i = H z_i + \mu z_i = \lambda_i z_i + \mu z_i = (\lambda_i + \mu) z_i$

so the eigenvalues of G are $\lambda_i + \mu$, and G can be made positive definite by increasing μ. The Levenberg-Marquardt iteration is:
$x_{k+1} = x_k - \left[ J^T(x_k)\, J(x_k) + \mu_k I \right]^{-1} J^T(x_k)\, v(x_k)$
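As a sketch of the update (not from the slides), here is LM applied to a small two-residual problem, the Rosenbrock function written as a sum of squares, $v_1 = 10(x_2 - x_1^2)$, $v_2 = 1 - x_1$, with the usual μ adjustment (reduce μ after a successful step, increase it after a failed one).

```python
import numpy as np

def residuals(x):
    # Rosenbrock as a sum of squares: v1 = 10(x2 - x1^2), v2 = 1 - x1
    return np.array([10.0 * (x[1] - x[0] ** 2), 1.0 - x[0]])

def jacobian(x):
    return np.array([[-20.0 * x[0], 10.0],
                     [-1.0,          0.0]])

def levenberg_marquardt(x, mu=0.01, factor=10.0, iters=200):
    v = residuals(x)
    for _ in range(iters):
        if v @ v < 1e-20:
            break
        J = jacobian(x)
        # LM step: dx = -(J^T J + mu I)^{-1} J^T v
        dx = -np.linalg.solve(J.T @ J + mu * np.eye(len(x)), J.T @ v)
        v_new = residuals(x + dx)
        if v_new @ v_new < v @ v:      # success: accept, move toward Gauss-Newton
            x, v, mu = x + dx, v_new, mu / factor
        else:                          # failure: reject, move toward steepest descent
            mu *= factor
    return x

x_min = levenberg_marquardt(np.array([-1.2, 1.0]))
```

Since a step is accepted only when the sum of squares decreases, F never increases during the iteration.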
12-23 Adjustment of $\mu_k$

As $\mu_k \to 0$, LM becomes Gauss-Newton:
$x_{k+1} = x_k - \left[ J^T(x_k)\, J(x_k) \right]^{-1} J^T(x_k)\, v(x_k)$

As $\mu_k \to \infty$, LM becomes steepest descent with a small learning rate (using $\nabla F(x) = 2 J^T(x)\, v(x)$):
$x_{k+1} \cong x_k - \dfrac{1}{\mu_k} J^T(x_k)\, v(x_k) = x_k - \dfrac{1}{2\mu_k} \nabla F(x)$

Therefore, begin with a small $\mu_k$ to use Gauss-Newton and speed convergence. If a step does not yield a smaller F(x), then repeat the step with an increased $\mu_k$ until F(x) is decreased. F(x) must decrease eventually, since we will be taking a very small step in the steepest descent direction.
12-24 Application to Multilayer Network

The performance index for the multilayer network is:
$F(x) = \sum_{q=1}^{Q} (t_q - a_q)^T (t_q - a_q) = \sum_{q=1}^{Q} e_q^T e_q = \sum_{q=1}^{Q} \sum_{j=1}^{S^M} (e_{j,q})^2 = \sum_{i=1}^{N} v_i^2$

The error vector is:
$v^T = [\, v_1 \;\; v_2 \;\cdots\; v_N \,] = [\, e_{1,1} \;\; e_{2,1} \;\cdots\; e_{S^M,1} \;\; e_{1,2} \;\cdots\; e_{S^M,Q} \,]$

The parameter vector is:
$x^T = [\, x_1 \;\; x_2 \;\cdots\; x_n \,] = [\, w^1_{1,1} \;\; w^1_{1,2} \;\cdots\; w^1_{S^1,R} \;\; b^1_1 \;\cdots\; b^1_{S^1} \;\; w^2_{1,1} \;\cdots\; b^M_{S^M} \,]$

The dimensions of the two vectors are:
$N = Q \times S^M$
$n = S^1 (R + 1) + S^2 (S^1 + 1) + \cdots + S^M (S^{M-1} + 1)$
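A quick way to check the dimension formulas (a sketch; the helper name is invented). For the chapter's 1-2-1 example network, R = 1, S¹ = 2, S² = 1:

```python
def num_parameters(R, layer_sizes):
    """n = S^1 (R+1) + S^2 (S^1+1) + ... + S^M (S^{M-1}+1)."""
    n, prev = 0, R
    for S in layer_sizes:
        n += S * (prev + 1)   # S rows of (prev inputs + 1 bias) each
        prev = S
    return n

# 1-2-1 network: n = 2*(1+1) + 1*(2+1) = 7 parameters
n = num_parameters(1, [2, 1])

# N = Q * S^M error terms, e.g. Q = 11 training points and S^M = 1 output
N = 11 * 1
```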
12-25 Jacobian Matrix

$J(x) = \begin{bmatrix}
\frac{\partial e_{1,1}}{\partial w^1_{1,1}} & \frac{\partial e_{1,1}}{\partial w^1_{1,2}} & \cdots & \frac{\partial e_{1,1}}{\partial w^1_{S^1,R}} & \frac{\partial e_{1,1}}{\partial b^1_1} & \cdots \\
\frac{\partial e_{2,1}}{\partial w^1_{1,1}} & \frac{\partial e_{2,1}}{\partial w^1_{1,2}} & \cdots & \frac{\partial e_{2,1}}{\partial w^1_{S^1,R}} & \frac{\partial e_{2,1}}{\partial b^1_1} & \cdots \\
\vdots & \vdots & & \vdots & \vdots & \\
\frac{\partial e_{S^M,1}}{\partial w^1_{1,1}} & \frac{\partial e_{S^M,1}}{\partial w^1_{1,2}} & \cdots & \frac{\partial e_{S^M,1}}{\partial w^1_{S^1,R}} & \frac{\partial e_{S^M,1}}{\partial b^1_1} & \cdots \\
\frac{\partial e_{1,2}}{\partial w^1_{1,1}} & \frac{\partial e_{1,2}}{\partial w^1_{1,2}} & \cdots & \frac{\partial e_{1,2}}{\partial w^1_{S^1,R}} & \frac{\partial e_{1,2}}{\partial b^1_1} & \cdots \\
\vdots & \vdots & & \vdots & \vdots &
\end{bmatrix}$
12-26 Computing the Jacobian

SDBP computes terms like:
$\dfrac{\partial \hat F(x)}{\partial x_l} = \dfrac{\partial e_q^T e_q}{\partial x_l}$

using the chain rule:
$\dfrac{\partial \hat F}{\partial w^m_{i,j}} = \dfrac{\partial \hat F}{\partial n^m_i} \cdot \dfrac{\partial n^m_i}{\partial w^m_{i,j}}$

where the sensitivity
$s^m_i \equiv \dfrac{\partial \hat F}{\partial n^m_i}$
is computed using backpropagation.

For the Jacobian we need to compute terms like:
$[J]_{h,l} = \dfrac{\partial v_h}{\partial x_l} = \dfrac{\partial e_{k,q}}{\partial x_l}$
12-27 Marquardt Sensitivity

If we define a Marquardt sensitivity:
$\tilde s^m_{i,h} \equiv \dfrac{\partial v_h}{\partial n^m_{i,q}} = \dfrac{\partial e_{k,q}}{\partial n^m_{i,q}}$, where $h = (q - 1) S^M + k$

we can compute the Jacobian as follows:

weight:
$[J]_{h,l} = \dfrac{\partial v_h}{\partial x_l} = \dfrac{\partial e_{k,q}}{\partial w^m_{i,j}} = \dfrac{\partial e_{k,q}}{\partial n^m_{i,q}} \cdot \dfrac{\partial n^m_{i,q}}{\partial w^m_{i,j}} = \tilde s^m_{i,h} \cdot \dfrac{\partial n^m_{i,q}}{\partial w^m_{i,j}} = \tilde s^m_{i,h}\, a^{m-1}_{j,q}$

bias:
$[J]_{h,l} = \dfrac{\partial v_h}{\partial x_l} = \dfrac{\partial e_{k,q}}{\partial b^m_i} = \dfrac{\partial e_{k,q}}{\partial n^m_{i,q}} \cdot \dfrac{\partial n^m_{i,q}}{\partial b^m_i} = \tilde s^m_{i,h} \cdot \dfrac{\partial n^m_{i,q}}{\partial b^m_i} = \tilde s^m_{i,h}$
12-28 Computing the Sensitivities

Initialization:
$\tilde s^M_{i,h} = \dfrac{\partial v_h}{\partial n^M_{i,q}} = \dfrac{\partial e_{k,q}}{\partial n^M_{i,q}} = \dfrac{\partial (t_{k,q} - a^M_{k,q})}{\partial n^M_{i,q}} = -\dfrac{\partial a^M_{k,q}}{\partial n^M_{i,q}}$

so that
$\tilde s^M_{i,h} = \begin{cases} -\dot f^M(n^M_{i,q}) & \text{for } i = k \\ 0 & \text{for } i \ne k \end{cases}$

or, in matrix form, $\tilde S^M_q = -\dot F^M(n^M_q)$.

Backpropagation:
$\tilde S^m_q = \dot F^m(n^m_q)\, (W^{m+1})^T\, \tilde S^{m+1}_q$

The individual matrices are augmented into the Marquardt sensitivities:
$\tilde S^m = \left[\, \tilde S^m_1 \;\middle|\; \tilde S^m_2 \;\middle|\; \cdots \;\middle|\; \tilde S^m_Q \,\right]$
12-29 LMBP (Levenberg-Marquardt Backpropagation)

1. Present all inputs to the network and compute the corresponding network outputs and the errors. Compute the sum of squared errors over all inputs, F(x).
2. Compute the Jacobian matrix. Calculate the sensitivities with the backpropagation algorithm, after initializing. Augment the individual matrices into the Marquardt sensitivities. Compute the elements of the Jacobian matrix.
3. Solve $\Delta x_k = -\left[ J^T(x_k)\, J(x_k) + \mu_k I \right]^{-1} J^T(x_k)\, v(x_k)$ to obtain the change in the weights.
4. Recompute the sum of squared errors with the new weights. If this new sum of squares is smaller than that computed in step 1, then divide $\mu_k$ by the factor $\vartheta$, update the weights, and go back to step 1. If the sum of squares is not reduced, then multiply $\mu_k$ by $\vartheta$ and go back to step 3.
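Putting the four steps together, here is a compact LMBP sketch for a 1-2-1 network (log-sigmoid hidden layer, linear output), fitting samples of the chapter's nominal function. All function names and the training setup are illustrative, not the book's code.

```python
import numpy as np

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

def unpack(x):
    # x = [w1 (2), b1 (2), w2 (2), b2 (1)] for a 1-2-1 network
    return x[0:2], x[2:4], x[4:6], x[6]

def forward(x, p):
    w1, b1, w2, b2 = unpack(x)
    a1 = logsig(w1 * p + b1)        # hidden layer
    return a1, w2 @ a1 + b2         # linear output layer

def residuals_and_jacobian(x, P, T):
    # steps 1-2: errors v and Jacobian J via Marquardt sensitivities
    w1, b1, w2, b2 = unpack(x)
    v, J = np.zeros(len(P)), np.zeros((len(P), len(x)))
    for h, (p, t) in enumerate(zip(P, T)):
        a1, a2 = forward(x, p)
        v[h] = t - a2
        s2 = -1.0                        # output layer is linear: -f'(n) = -1
        s1 = a1 * (1 - a1) * w2 * s2     # backpropagate through the logsig layer
        J[h] = np.concatenate([s1 * p, s1, s2 * a1, [s2]])
    return v, J

def lmbp(x, P, T, mu=0.01, factor=5.0, iters=200):
    v, J = residuals_and_jacobian(x, P, T)
    for _ in range(iters):
        if v @ v < 1e-12:
            break
        # step 3: solve for the weight change
        dx = -np.linalg.solve(J.T @ J + mu * np.eye(len(x)), J.T @ v)
        # step 4: accept the step only if the squared error decreased
        v_new, J_new = residuals_and_jacobian(x + dx, P, T)
        if v_new @ v_new < v @ v:
            x, v, J, mu = x + dx, v_new, J_new, mu / factor
        else:
            mu *= factor
    return x, v @ v

# training data: samples of the nominal function (true parameters from slide 12-3)
x_true = np.array([10.0, 10.0, -5.0, 5.0, 1.0, 1.0, -1.0])
P = np.linspace(-2, 2, 11)
T = np.array([forward(x_true, p)[1] for p in P])

x0 = np.random.default_rng(0).normal(scale=0.5, size=7)
x_fit, sse = lmbp(x0.copy(), P, T)
```

Because updates are accepted only when the sum of squares decreases, the final error can never exceed the initial one, though (as the performance surfaces earlier in the chapter show) the algorithm may still settle in a local minimum.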