
Ch 5. Neural Networks (1/2)
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.

Summarized by M.-O. Heo
Biointelligence Laboratory, Seoul National University
http://bi.snu.ac.kr/


Contents

5.1 Feed-forward Network Functions
  5.1.1 Weight-space symmetries
5.2 Network Training
  5.2.1 Parameter optimization
  5.2.2 Local quadratic approximation
  5.2.3 Use of gradient information
  5.2.4 Gradient descent optimization
5.3 Error Backpropagation
  5.3.1 Evaluation of error-function derivatives
  5.3.2 A simple example
  5.3.3 Efficiency of backpropagation
  5.3.4 The Jacobian matrix
5.4 The Hessian Matrix
  5.4.1 Diagonal approximation
  5.4.2 Outer product approximation
  5.4.3 Inverse Hessian
  5.4.4 Finite differences
  5.4.5 Exact evaluation of the Hessian
  5.4.6 Fast multiplication by the Hessian


Feed-forward Network Functions

Goal: extend the linear model by making the basis functions $\phi_j(\mathbf{x})$ depend on parameters, and allow these parameters to be adjusted during training.

First layer (activations $a_j$, weights $w_{ji}^{(1)}$, biases $w_{j0}^{(1)}$):
$a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}$

Hidden units, where $h(\cdot)$ is a differentiable nonlinear activation function:
$z_j = h(a_j)$

Second layer:
$a_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)}$

This is the forward propagation of information through the network.
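Below is a minimal numpy sketch of this forward propagation (not from the slides); the layer sizes, the tanh hidden activation, and the linear outputs are my own choices for illustration.

```python
import numpy as np

# Sketch of forward propagation through a two-layer network.
# Shapes and activation choices (tanh hidden units, linear outputs) are assumptions.
D, M, K = 3, 4, 2                      # inputs, hidden units, outputs
rng = np.random.default_rng(0)
W1 = rng.normal(size=(M, D + 1))       # first-layer weights, last column = biases w_j0
W2 = rng.normal(size=(K, M + 1))       # second-layer weights, last column = biases w_k0

def forward(x):
    a_hidden = W1 @ np.append(x, 1.0)  # a_j = sum_i w_ji x_i + w_j0
    z = np.tanh(a_hidden)              # z_j = h(a_j)
    a_out = W2 @ np.append(z, 1.0)     # a_k = sum_j w_kj z_j + w_k0
    return a_out                       # linear output units: y_k = a_k

print(forward(rng.normal(size=D)))
```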


Feed-forward Network Functions (Cont'd)

A key difference between the perceptron and the multilayer perceptron (MLP):
- MLP: continuous sigmoidal nonlinearities in the hidden units
- Perceptron: step-function nonlinearities

Skip-layer connections
- A network with sigmoidal hidden units can always mimic skip-layer connections by using a sufficiently small first-layer weight, so that the hidden unit is effectively linear, and then compensating with a large weight value from the hidden unit to the output.


Feed-forward Network Functions (Cont'd)

The approximation properties of feed-forward networks
- Universal approximator: a two-layer network with linear outputs can uniformly approximate any continuous function on a compact input domain to arbitrary accuracy, provided the network has a sufficiently large number of hidden units.

Example: N = 50 data points, 2-layer network having 3 hidden units with tanh activation functions and linear outputs, approximating $f(x) = x^2$, $f(x) = \sin(x)$, $f(x) = |x|$, and $f(x) = H(x)$ (the Heaviside step function).
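As a rough illustration of this property (my own sketch, not from the slides), the code below fits one of these functions, $f(x) = x^2$, with a 2-layer network of 3 tanh hidden units and a linear output by minimizing the sum-of-squares error; the optimizer and sample layout are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize

# Approximate f(x) = x^2 on [-1, 1] with 3 tanh hidden units and a linear output.
N, M = 50, 3
x = np.linspace(-1.0, 1.0, N)
t = x ** 2

def unpack(p):
    w1, b1 = p[:M], p[M:2*M]               # hidden weights and biases
    w2, b2 = p[2*M:3*M], p[3*M]            # output weights and bias
    return w1, b1, w2, b2

def predict(p, x):
    w1, b1, w2, b2 = unpack(p)
    z = np.tanh(np.outer(x, w1) + b1)      # hidden activations, shape (N, M)
    return z @ w2 + b2

def sse(p):                                # sum-of-squares error
    return 0.5 * np.sum((predict(p, x) - t) ** 2)

p0 = np.random.default_rng(1).normal(size=3*M + 1)
res = minimize(sse, p0, method="BFGS")     # gradient approximated numerically
print(sse(res.x))                          # typically small after training
```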


Feed-forward Network Functions - Weight-space symmetries

Sign-flip symmetry
- If we change the sign of all of the weights and the bias feeding into a particular hidden unit, then, for a given input pattern, the sign of the activation of that hidden unit will be reversed.
- This is compensated by changing the sign of all of the weights leading out of that hidden unit.
- For M hidden units, since tanh(-a) = -tanh(a), there will be M sign-flip symmetries.
- Any given weight vector will therefore be one of a set of 2^M equivalent weight vectors.


Feed-forward Network Functions - Weight-space symmetries (Cont'd)

Interchange symmetry
- We can interchange the values of all of the weights (and the bias) leading both into and out of a particular hidden unit with the corresponding values of the weights (and bias) associated with a different hidden unit.
- This clearly leaves the network input-output mapping function unchanged.
- For M hidden units, any given weight vector will belong to a set of M! equivalent weight vectors.
- Combined with the sign-flip symmetry, the overall weight-space symmetry factor is M! 2^M.
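A quick numerical check of both symmetries (my own sketch, not from the slides): flipping the signs of the weights into and out of a hidden unit, or permuting hidden units together with their weights, leaves the network outputs unchanged. Biases are omitted purely for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)
D, M, K = 3, 4, 2
W1, W2 = rng.normal(size=(M, D)), rng.normal(size=(K, M))   # biases omitted for brevity
x = rng.normal(size=D)

def forward(W1, W2, x):
    return W2 @ np.tanh(W1 @ x)

y = forward(W1, W2, x)

# Sign-flip symmetry: negate row j of W1 and column j of W2 (tanh(-a) = -tanh(a))
j = 1
W1_flip, W2_flip = W1.copy(), W2.copy()
W1_flip[j, :] *= -1
W2_flip[:, j] *= -1
print(np.allclose(y, forward(W1_flip, W2_flip, x)))          # True

# Interchange symmetry: permute the hidden units and their weights consistently
perm = np.array([2, 0, 3, 1])
print(np.allclose(y, forward(W1[perm, :], W2[:, perm], x)))  # True
```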


Network Training

Assume t has a Gaussian distribution with an x-dependent mean:
$p(t \mid \mathbf{x}, \mathbf{w}) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})$

Likelihood function:
$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} p(t_n \mid \mathbf{x}_n, \mathbf{w}, \beta)$

Negative log-likelihood:
$\frac{\beta}{2} \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}) - t_n \}^2 - \frac{N}{2} \ln \beta + \frac{N}{2} \ln(2\pi)$

Maximizing the likelihood function is therefore equivalent to minimizing the sum-of-squares error function
$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}) - t_n \}^2$
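A small numerical illustration (not from the slides) that the negative log-likelihood differs from $\beta E(\mathbf{w})$ only by terms independent of w, so the two criteria share the same minimizer; the numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
N, beta = 5, 2.0
y = rng.normal(size=N)                     # network outputs y(x_n, w)
t = y + rng.normal(scale=0.3, size=N)      # targets t_n

# Negative log of the Gaussian likelihood prod_n N(t_n | y_n, beta^-1)
nll = 0.5 * beta * np.sum((y - t) ** 2) - 0.5 * N * np.log(beta) + 0.5 * N * np.log(2 * np.pi)

# Sum-of-squares error E(w)
E = 0.5 * np.sum((y - t) ** 2)

# nll = beta * E + const(w): same minimizer with respect to w
print(np.isclose(nll - beta * E, -0.5 * N * np.log(beta) + 0.5 * N * np.log(2 * np.pi)))
```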


Network Training (Cont'd)

The choice of output unit activation function and matching error function:
- Standard regression problems
  - Error function: sum-of-squares (negative log-likelihood)
  - Output: identity, $y_k = a_k$
- Multiple binary classification problems
  - Error function: cross-entropy error function
  - Output: logistic sigmoid, $y_k = \sigma(a_k)$ with $\sigma(a) = \frac{1}{1 + \exp(-a)}$
- Multiclass problems
  - Error function: multiclass cross-entropy error function
  - Output: softmax activation function, $y_k = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$
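A minimal sketch (not from the slides) of the three output activations and their matching error functions; the function names are my own.

```python
import numpy as np

def identity(a):                        # regression outputs: y_k = a_k
    return a

def logistic_sigmoid(a):                # binary classification outputs
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):                         # multiclass outputs
    e = np.exp(a - np.max(a))           # shift for numerical stability
    return e / e.sum()

def sum_of_squares(y, t):               # pairs with identity outputs
    return 0.5 * np.sum((y - t) ** 2)

def cross_entropy(y, t):                # pairs with sigmoid outputs, t in {0, 1}
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def multiclass_cross_entropy(y, t):     # pairs with softmax outputs, t one-hot
    return -np.sum(t * np.log(y))
```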


Network Training - Parameter optimization

The error E(w) is a smooth continuous function of w, and its smallest value occurs at a point in weight space where the gradient vanishes, $\nabla E(\mathbf{w}) = 0$.
- Global minimum
- Local minima

Most techniques involve choosing some initial value $\mathbf{w}^{(0)}$ for the weight vector and then moving through weight space in a succession of steps of the form
$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \Delta\mathbf{w}^{(\tau)}$
Many algorithms make use of gradient information.


Network Training - Local quadratic approximation

Taylor expansion of E(w) around some point $\hat{\mathbf{w}}$ in weight space:
$E(\mathbf{w}) \simeq E(\hat{\mathbf{w}}) + (\mathbf{w}-\hat{\mathbf{w}})^T \mathbf{b} + \frac{1}{2}(\mathbf{w}-\hat{\mathbf{w}})^T \mathbf{H} (\mathbf{w}-\hat{\mathbf{w}})$
where $\mathbf{b} \equiv \nabla E|_{\mathbf{w}=\hat{\mathbf{w}}}$ (gradient) and $(\mathbf{H})_{ij} \equiv \left.\frac{\partial^2 E}{\partial w_i \partial w_j}\right|_{\mathbf{w}=\hat{\mathbf{w}}}$ (Hessian matrix).

Local approximation to the gradient:
$\nabla E \simeq \mathbf{b} + \mathbf{H}(\mathbf{w}-\hat{\mathbf{w}})$

The particular case of a local quadratic approximation around a minimum $\mathbf{w}^*$, where $\nabla E = 0$:
$E(\mathbf{w}) = E(\mathbf{w}^*) + \frac{1}{2}(\mathbf{w}-\mathbf{w}^*)^T \mathbf{H} (\mathbf{w}-\mathbf{w}^*)$
Writing $\mathbf{w}-\mathbf{w}^* = \sum_i \alpha_i \mathbf{u}_i$ in terms of the eigenvectors of $\mathbf{H}$ ($\mathbf{H}\mathbf{u}_i = \lambda_i \mathbf{u}_i$) gives
$E(\mathbf{w}) = E(\mathbf{w}^*) + \frac{1}{2}\sum_i \lambda_i \alpha_i^2$
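A small numerical sketch (not from the slides) comparing E(w) with its local quadratic approximation around a point $\hat{\mathbf{w}}$, for a hand-made error surface; the surface and numbers are arbitrary.

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])        # symmetric quadratic part

def E(w):                                      # toy error surface
    return 0.5 * w @ A @ w + 0.1 * np.sum(w ** 4)

def grad_E(w):                                 # exact gradient b
    return A @ w + 0.4 * w ** 3

def hess_E(w):                                 # exact Hessian H
    return A + np.diag(1.2 * w ** 2)

w_hat = np.array([0.5, -0.3])
b, H = grad_E(w_hat), hess_E(w_hat)

w = w_hat + np.array([0.05, -0.02])            # a nearby point
dw = w - w_hat
E_quad = E(w_hat) + dw @ b + 0.5 * dw @ H @ dw
print(E(w), E_quad)                            # close for small ||w - w_hat||
```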


Network Training - Use of gradient information

In the quadratic approximation, the computational cost of locating the minimum without gradient information is O(W^3), where W is the dimensionality of w: we must perform O(W^2) function evaluations, each of which requires O(W) steps.

In an algorithm that makes use of gradient information, the cost is O(W^2): error backpropagation delivers each gradient evaluation in O(W) steps, and only O(W) gradient evaluations are needed.


Network Training - Gradient descent optimization

The weight update comprises a small step in the direction of the negative gradient,
$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\,\nabla E(\mathbf{w}^{(\tau)})$
where $\eta > 0$ is the learning rate.

- Batch methods
  - Techniques that use the whole data set at once.
  - For a sufficiently small step size, the error function decreases at each iteration unless the weight vector has arrived at a local or global minimum.
- On-line version of gradient descent (sequential or stochastic gradient descent)
  - Updates the weight vector based on one data point at a time.
  - Can handle redundancy in the data much more efficiently.
  - Offers the possibility of escaping from local minima.
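A minimal sketch (not from the slides) of batch and stochastic (sequential) gradient descent on a sum-of-squares error for a linear model; the data, learning rate, and iteration counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
t = X @ w_true + 0.1 * rng.normal(size=100)

def grad_batch(w):                      # gradient of E(w) = 0.5 * sum_n (y_n - t_n)^2
    return X.T @ (X @ w - t)

eta, w = 0.005, np.zeros(3)
for _ in range(200):                    # batch gradient descent
    w = w - eta * grad_batch(w)

w_sgd = np.zeros(3)
for _ in range(20):                     # stochastic gradient descent, one point at a time
    for n in rng.permutation(len(t)):
        grad_n = (X[n] @ w_sgd - t[n]) * X[n]
        w_sgd = w_sgd - eta * grad_n

print(w, w_sgd)                         # both approach w_true
```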


Error Backpropagation - Evaluation of error-function derivatives

Error backpropagation:
1. Apply an input vector $\mathbf{x}_n$ to the network and forward propagate through the network using
   $a_j = \sum_i w_{ji} z_i$ and $z_j = h(a_j)$
   to find the activations of all the hidden and output units.
2. Evaluate $\delta_k$ for all the output units using $\delta_k = y_k - t_k$.
3. Backpropagate the $\delta$'s using
   $\delta_j = h'(a_j) \sum_k w_{kj} \delta_k$
   to obtain $\delta_j$ for each hidden unit in the network.
4. Use $\frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i$ to evaluate the required derivatives.


Error Backpropagation - A simple example

Two-layer network with tanh hidden units,
$h(a) = \tanh(a) = \frac{e^a - e^{-a}}{e^a + e^{-a}}, \qquad h'(a) = 1 - h(a)^2$
and a sum-of-squares error for data point n,
$E_n = \frac{1}{2}\sum_{k=1}^{K}(y_k - t_k)^2$
(y_k: output of unit k, t_k: the corresponding target).

Forward propagation (with $x_0 = z_0 = 1$ absorbing the biases):
$a_j = \sum_{i=0}^{D} w_{ji}^{(1)} x_i, \qquad z_j = \tanh(a_j), \qquad y_k = \sum_{j=0}^{M} w_{kj}^{(2)} z_j$

Backpropagation:
$\delta_k = y_k - t_k, \qquad \delta_j = (1 - z_j^2)\sum_{k=1}^{K} w_{kj} \delta_k$

Derivatives:
$\frac{\partial E_n}{\partial w_{ji}^{(1)}} = \delta_j x_i, \qquad \frac{\partial E_n}{\partial w_{kj}^{(2)}} = \delta_k z_j$
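The equations above translate directly into a few lines of numpy (a sketch, not from the slides); biases are dropped for brevity and the shapes are arbitrary.

```python
import numpy as np

# Backprop for a two-layer tanh network with linear outputs and sum-of-squares error.
def backprop(W1, W2, x, t):
    # Forward pass
    a = W1 @ x                               # a_j = sum_i w_ji^(1) x_i
    z = np.tanh(a)                           # z_j = tanh(a_j)
    y = W2 @ z                               # y_k = sum_j w_kj^(2) z_j

    # Backward pass
    delta_k = y - t                          # output-unit deltas
    delta_j = (1 - z**2) * (W2.T @ delta_k)  # hidden-unit deltas

    # Derivatives of E_n
    dE_dW1 = np.outer(delta_j, x)            # dE_n / dw_ji^(1) = delta_j x_i
    dE_dW2 = np.outer(delta_k, z)            # dE_n / dw_kj^(2) = delta_k z_j
    return dE_dW1, dE_dW2

rng = np.random.default_rng(3)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x, t = rng.normal(size=3), rng.normal(size=2)
g1, g2 = backprop(W1, W2, x, t)
```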


Error Backpropagation - Efficiency of backpropagation

Computational efficiency of backpropagation: O(W).

An alternative approach for computing derivatives of the error function is to use finite differences:
$\frac{\partial E_n}{\partial w_{ji}} = \frac{E_n(w_{ji}+\epsilon) - E_n(w_{ji})}{\epsilon} + O(\epsilon)$
The accuracy of the approximation to the derivatives can be improved by making $\epsilon$ smaller, until the numerical round-off limit is reached.

Accuracy is improved by using symmetrical central differences:
$\frac{\partial E_n}{\partial w_{ji}} = \frac{E_n(w_{ji}+\epsilon) - E_n(w_{ji}-\epsilon)}{2\epsilon} + O(\epsilon^2)$
Each of the W derivatives requires separate forward propagations of cost O(W), so the finite-difference approach costs O(W^2) overall.
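A minimal gradient-check sketch (not from the slides): the backprop derivative with respect to the first-layer weights agrees with symmetrical central differences. Network shapes and epsilon are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x, t = rng.normal(size=3), rng.normal(size=2)

def error(W1_, W2_):
    y = W2_ @ np.tanh(W1_ @ x)
    return 0.5 * np.sum((y - t) ** 2)

# Backprop gradient w.r.t. the first-layer weights
a = W1 @ x; z = np.tanh(a); y = W2 @ z
delta_j = (1 - z**2) * (W2.T @ (y - t))
grad_bp = np.outer(delta_j, x)

# Central-difference gradient, one weight at a time: O(W) propagations of cost O(W)
eps = 1e-6
grad_fd = np.zeros_like(W1)
for j in range(W1.shape[0]):
    for i in range(W1.shape[1]):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[j, i] += eps; Wm[j, i] -= eps
        grad_fd[j, i] = (error(Wp, W2) - error(Wm, W2)) / (2 * eps)

print(np.allclose(grad_bp, grad_fd, atol=1e-6))
```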


Error Backpropagation - The Jacobian matrix

The technique of backpropagation can also be applied to the calculation of other derivatives.

The Jacobian matrix: its elements are given by the derivatives of the network outputs with respect to the inputs,
$J_{ki} \equiv \frac{\partial y_k}{\partial x_i}$

For example, when minimizing an error function E with respect to a parameter w of an earlier module whose outputs z feed into the network,
$\frac{\partial E}{\partial w} = \sum_{k,j} \frac{\partial E}{\partial y_k}\frac{\partial y_k}{\partial z_j}\frac{\partial z_j}{\partial w}$


Error Backpropagation - The Jacobian matrix (Cont'd)

The Jacobian gives a measure of the local sensitivity of the outputs to changes in each of the input variables:
$\Delta y_k \simeq \sum_i \frac{\partial y_k}{\partial x_i}\,\Delta x_i$
In general the network mapping is nonlinear, so the elements $J_{ki}$ are not constants; the relation above is valid provided the $|\Delta x_i|$ are small.

The evaluation of the Jacobian matrix:
$J_{ki} = \frac{\partial y_k}{\partial x_i} = \sum_j \frac{\partial y_k}{\partial a_j}\frac{\partial a_j}{\partial x_i} = \sum_j w_{ji}\,\frac{\partial y_k}{\partial a_j}$
with the backward recursion
$\frac{\partial y_k}{\partial a_j} = \sum_l \frac{\partial y_k}{\partial a_l}\frac{\partial a_l}{\partial a_j} = h'(a_j)\sum_l w_{lj}\,\frac{\partial y_k}{\partial a_l}$
started at the output units, where for a sigmoidal output activation function
$\frac{\partial y_k}{\partial a_j} = \delta_{kj}\,\sigma'(a_j)$
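A small sketch (not from the slides) evaluating the Jacobian of a two-layer tanh network with linear outputs, where the recursion collapses to J = W2 diag(h'(a)) W1, checked against central differences; sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x = rng.normal(size=3)

def forward(x):
    return W2 @ np.tanh(W1 @ x)

# Backprop-style evaluation: for linear outputs dy_k/da_j = w_kj^(2) h'(a_j),
# so the Jacobian is J = W2 @ diag(h'(a)) @ W1.
a = W1 @ x
J = W2 @ np.diag(1 - np.tanh(a) ** 2) @ W1

# Finite-difference check, one input variable at a time
eps = 1e-6
J_fd = np.stack([(forward(x + eps * e) - forward(x - eps * e)) / (2 * eps)
                 for e in np.eye(3)], axis=1)
print(np.allclose(J, J_fd, atol=1e-6))
```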


The Hessian Matrix

The Hessian matrix: the second derivatives of the error function form the elements of H,
$\frac{\partial^2 E}{\partial w_{ji}\,\partial w_{lk}}$

Important roles:
- Considering the second-order properties of the error surface.
- Forms the basis of a fast procedure for re-training a feed-forward network following a small change in the training data.
- The inverse of the Hessian is used to identify the least significant weights in a network (pruning algorithms).
- Appears in the Laplace approximation for a Bayesian neural network.

Various approximation schemes are used to evaluate the Hessian matrix for a NN.


The Hessian Matrix - Diagonal approximation

The inverse of the Hessian is often what is needed, and the inverse of a diagonal matrix is trivial to evaluate, so the diagonal approximation replaces the off-diagonal elements of the Hessian with zeros.

Consider an error function that is a sum of terms, one per data point, $E = \sum_n E_n$. The diagonal elements of the Hessian are
$\frac{\partial^2 E_n}{\partial w_{ji}^2} = \frac{\partial^2 E_n}{\partial a_j^2}\, z_i^2$
where
$\frac{\partial^2 E_n}{\partial a_j^2} = h'(a_j)^2 \sum_k \sum_{k'} w_{kj} w_{k'j}\,\frac{\partial^2 E_n}{\partial a_k \partial a_{k'}} + h''(a_j)\sum_k w_{kj}\,\frac{\partial E_n}{\partial a_k}$
Neglecting the off-diagonal elements in the double sum gives
$\frac{\partial^2 E_n}{\partial a_j^2} \simeq h'(a_j)^2 \sum_k w_{kj}^2\,\frac{\partial^2 E_n}{\partial a_k^2} + h''(a_j)\sum_k w_{kj}\,\frac{\partial E_n}{\partial a_k}$


The Hessian Matrix - Outer product approximation (Levenberg-Marquardt)

For a sum-of-squares error function $E = \frac{1}{2}\sum_{n=1}^{N}(y_n - t_n)^2$, the Hessian is
$\mathbf{H} = \nabla\nabla E = \sum_{n=1}^{N} \nabla y_n (\nabla y_n)^T + \sum_{n=1}^{N} (y_n - t_n)\,\nabla\nabla y_n$

If the outputs y_n happen to be very close to the target values t_n, the second term will be small and can be neglected. Alternatively, if the value of y_n - t_n is uncorrelated with the second-derivative term, the whole term will average to zero. This gives the outer-product (Levenberg-Marquardt) approximation
$\mathbf{H} \simeq \sum_{n=1}^{N} \mathbf{b}_n \mathbf{b}_n^T$, where $\mathbf{b}_n = \nabla y_n = \nabla a_n$ (for linear output units).
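A minimal sketch (not from the slides) of the outer-product approximation for a single-output network; the gradients b_n = grad_w y(x_n, w) are obtained here by central differences purely for brevity, and all sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
D, M, N = 2, 3, 20
X = rng.normal(size=(N, D))
w = rng.normal(size=M * D + M)          # flattened weights: W1 (M x D) then w2 (M)

def output(w, x):                       # single linear output, tanh hidden layer
    W1 = w[:M * D].reshape(M, D)
    w2 = w[M * D:]
    return w2 @ np.tanh(W1 @ x)

eps = 1e-6
H_approx = np.zeros((w.size, w.size))
for x in X:
    b = np.array([(output(w + eps * e, x) - output(w - eps * e, x)) / (2 * eps)
                  for e in np.eye(w.size)])   # b_n = grad_w y(x_n, w)
    H_approx += np.outer(b, b)                # accumulate b_n b_n^T
```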


The Hessian Matrix - Inverse Hessian

A procedure for approximating the inverse of the Hessian:
- Start from the outer-product approximation and derive a sequential procedure for building up the Hessian by including data points one at a time,
$\mathbf{H}_{L+1} = \mathbf{H}_L + \mathbf{b}_{L+1}\mathbf{b}_{L+1}^T$
- which leads to the update for the inverse
$\mathbf{H}_{L+1}^{-1} = \mathbf{H}_L^{-1} - \frac{\mathbf{H}_L^{-1}\mathbf{b}_{L+1}\mathbf{b}_{L+1}^T\mathbf{H}_L^{-1}}{1 + \mathbf{b}_{L+1}^T\mathbf{H}_L^{-1}\mathbf{b}_{L+1}}$
- The initial matrix is chosen as $\mathbf{H}_0 = \alpha\mathbf{I}$, where $\alpha$ is a small quantity; the procedure then finds the inverse of $\mathbf{H} + \alpha\mathbf{I}$, and the result is not sensitive to the precise value of $\alpha$.
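A minimal sketch (not from the slides) of this sequential build-up of the inverse, using random vectors in place of the b_n and checking against a direct inverse of alpha*I + sum_n b_n b_n^T.

```python
import numpy as np

rng = np.random.default_rng(7)
W, N, alpha = 5, 50, 1e-3
B = rng.normal(size=(N, W))             # rows play the role of b_n

H_inv = np.eye(W) / alpha               # inverse of H_0 = alpha * I
for b in B:
    Hb = H_inv @ b
    H_inv = H_inv - np.outer(Hb, Hb) / (1.0 + b @ Hb)   # rank-one inverse update

# Check against a direct inverse of alpha*I + sum_n b_n b_n^T
H_direct = alpha * np.eye(W) + B.T @ B
print(np.allclose(H_inv, np.linalg.inv(H_direct), atol=1e-6))
```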


The Hessian Matrix - Finite differences

Applying symmetrical central differences directly to the error function,
$\frac{\partial^2 E}{\partial w_{ji}\,\partial w_{lk}} = \frac{1}{4\epsilon^2}\{E(w_{ji}+\epsilon, w_{lk}+\epsilon) - E(w_{ji}+\epsilon, w_{lk}-\epsilon) - E(w_{ji}-\epsilon, w_{lk}+\epsilon) + E(w_{ji}-\epsilon, w_{lk}-\epsilon)\} + O(\epsilon^2)$
requires O(W^3) operations to evaluate the complete Hessian (W^2 elements, each needing forward propagations of cost O(W)).

Applying central differences instead to the first derivatives of the error function, which are obtained by backpropagation,
$\frac{\partial^2 E}{\partial w_{ji}\,\partial w_{lk}} = \frac{1}{2\epsilon}\left\{\frac{\partial E}{\partial w_{ji}}(w_{lk}+\epsilon) - \frac{\partial E}{\partial w_{ji}}(w_{lk}-\epsilon)\right\} + O(\epsilon^2)$
costs O(W^2).


The Hessian Matrix - Exact evaluation of the Hessian

The Hessian can also be evaluated exactly, using an extension of the technique of backpropagation used to evaluate the first derivatives. For the two-layer network and a single-data-point error $E_n$, define
$\delta_k = \frac{\partial E_n}{\partial a_k}, \qquad M_{kk'} \equiv \frac{\partial^2 E_n}{\partial a_k\,\partial a_{k'}}$

Both weights in the second layer:
$\frac{\partial^2 E_n}{\partial w_{kj}^{(2)}\,\partial w_{k'j'}^{(2)}} = z_j z_{j'} M_{kk'}$

Both weights in the first layer:
$\frac{\partial^2 E_n}{\partial w_{ji}^{(1)}\,\partial w_{j'i'}^{(1)}} = x_i x_{i'}\, h''(a_{j'})\, I_{jj'} \sum_k w_{kj'}^{(2)}\delta_k + x_i x_{i'}\, h'(a_{j'})\, h'(a_j) \sum_k \sum_{k'} w_{k'j'}^{(2)} w_{kj}^{(2)} M_{kk'}$

One weight in each layer:
$\frac{\partial^2 E_n}{\partial w_{ji}^{(1)}\,\partial w_{kj'}^{(2)}} = x_i\, h'(a_{j'}) \left\{ \delta_k I_{jj'} + z_j \sum_{k'} w_{k'j'}^{(2)} M_{kk'} \right\}$


The Hessian Matrix - Fast multiplication by the Hessian

The aim is an efficient approach to evaluating $\mathbf{v}^T\mathbf{H}$ directly, in O(W) operations:
$\mathbf{v}^T\mathbf{H} = \mathbf{v}^T\nabla(\nabla E)$, where $\nabla$ is the gradient operator in weight space.
Introduce the notation $\mathcal{R}\{\cdot\}$ to denote the operator $\mathbf{v}^T\nabla$, so that $\mathcal{R}\{\mathbf{w}\} = \mathbf{v}$.

The implementation of this algorithm involves additional variables $\mathcal{R}\{a_j\}$, $\mathcal{R}\{z_j\}$, $\mathcal{R}\{\delta_j\}$ for the hidden units and $\mathcal{R}\{\delta_k\}$, $\mathcal{R}\{y_k\}$ for the output units. For each input pattern their values can be found (here for the two-layer network with linear outputs and sum-of-squares error considered earlier):
$\mathcal{R}\{a_j\} = \sum_i v_{ji} x_i$
$\mathcal{R}\{z_j\} = h'(a_j)\,\mathcal{R}\{a_j\}$
$\mathcal{R}\{y_k\} = \sum_j w_{kj}\,\mathcal{R}\{z_j\} + \sum_j v_{kj} z_j$
$\mathcal{R}\{\delta_k\} = \mathcal{R}\{y_k\}$
$\mathcal{R}\{\delta_j\} = h''(a_j)\,\mathcal{R}\{a_j\}\sum_k w_{kj}\delta_k + h'(a_j)\sum_k v_{kj}\delta_k + h'(a_j)\sum_k w_{kj}\,\mathcal{R}\{\delta_k\}$
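A sketch (not from the slides) implementing these R-operator equations for the two-layer tanh network with linear outputs and sum-of-squares error. The final step, turning the R{delta} quantities into the elements of v^T H, follows PRML 5.4.6 and is not shown on the slide; a central difference of the backprop gradient is used as a check. All shapes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, K = 3, 4, 2
W1, W2 = rng.normal(size=(M, D)), rng.normal(size=(K, M))   # weights (biases omitted)
V1, V2 = rng.normal(size=(M, D)), rng.normal(size=(K, M))   # direction vector v
x, t = rng.normal(size=D), rng.normal(size=K)

def hvp(W1, W2, V1, V2, x, t):
    """R-operator forward/backward pass giving the blocks of v^T H for one pattern."""
    # Ordinary forward pass
    a = W1 @ x                             # a_j
    z = np.tanh(a)                         # z_j = h(a_j)
    y = W2 @ z                             # y_k (linear outputs)

    # R-forward pass
    Ra = V1 @ x                            # R{a_j} = sum_i v_ji x_i
    Rz = (1 - z ** 2) * Ra                 # R{z_j} = h'(a_j) R{a_j}
    Ry = W2 @ Rz + V2 @ z                  # R{y_k}

    # Ordinary backward pass (sum-of-squares error)
    d_k = y - t                            # delta_k
    d_j = (1 - z ** 2) * (W2.T @ d_k)      # delta_j

    # R-backward pass
    Rd_k = Ry                              # R{delta_k} = R{y_k}
    h1 = 1 - z ** 2                        # h'(a_j)
    h2 = -2.0 * z * h1                     # h''(a_j) for tanh
    Rd_j = h2 * Ra * (W2.T @ d_k) + h1 * (V2.T @ d_k) + h1 * (W2.T @ Rd_k)

    # Elements of v^T H, i.e. R{dE_n/dw} (final step from PRML 5.4.6, not on the slide)
    HV2 = np.outer(Rd_k, z) + np.outer(d_k, Rz)   # second-layer block
    HV1 = np.outer(Rd_j, x)                        # first-layer block
    return HV1, HV2

# Check against a central difference of the backprop gradient in direction v
def grads(W1_, W2_):
    a = W1_ @ x; z = np.tanh(a); y = W2_ @ z
    d_k = y - t
    d_j = (1 - z ** 2) * (W2_.T @ d_k)
    return np.outer(d_j, x), np.outer(d_k, z)

eps = 1e-6
G1p, G2p = grads(W1 + eps * V1, W2 + eps * V2)
G1m, G2m = grads(W1 - eps * V1, W2 - eps * V2)
HV1, HV2 = hvp(W1, W2, V1, V2, x, t)
print(np.allclose(HV1, (G1p - G1m) / (2 * eps), atol=1e-5),
      np.allclose(HV2, (G2p - G2m) / (2 * eps), atol=1e-5))
```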