
Chapter 2

Introduction to Neural Networks

2.1 Introduction to Artificial Neural Networks

Artificial Neural Networks (ANNs) are an attempt to model the information processing of systems of neurons. We want to use the Artificial Intelligence (AI) concept for solving problems which are hard (or impossible) to solve otherwise. Some examples of such problems are:

2.1.1 Approximation

We want to approximate an unknown function

F : R^n → R^m

by only knowing the behavior of the function for some input-output sequences.

Example: We are given a number of input-output pairs.

x   y
1   2
2   4
3   6

Find the function y = f(x) : R → R. (Could it be y = 2x?) ¤


Many signal processing problems can be transformed into the approximation problem. We will deal with an arbitrary number of inputs and outputs for any nonlinear function.

2.1.2 Association

The idea is to put a finite number of items in memory (e.g. the Latin characters) and, by presenting distorted versions, have the items restored.

Example:

(Figure: a distorted input pattern is presented and the stored item is restored; Input → Output.)

¤

2.1.3 Pattern classification

A number of inputs should be classified into categories.

Example:

(Figure: a temperature reading of 90 °C is classified as OK; 100 °C is classified as Error.)

¤

2.1.4 Prediction

Given information up to the present time, predict the behavior in the future.

2.1.5 Automatic control (Reglerteknik)

We want to simulate the behavior of a process so that we can control it to fit our purposes.

29

Page 3: Chapter 2 Introduction to Neural networktomczak/PDF/[Grbic]Neural...Chapter 2 Introduction to Neural network 2.1 Introduction to Artiflcial Neural Net-work Artiflcial Neural Networks

2.1.6 Beamforming

We want to create directional hearing by the use of multiple sensors.

The model

Regard the ANN as a mapping box. An input x gives an output y.


The box can be a feedforward (i.e. no recursions) or a recurrent (with recursion) network. The complexity of its interior can vary depending on the task.

The box has parameters (weights) which can be modified to suit different tasks.

2.2 A Neuron model

Given an input signal x, it creates a single output value y,

(Figure: a neuron with inputs x_1, …, x_n, weights w_1, …, w_n, and output y.)

where y = f(x^T w), f : R → R, f ∈ C^1 can be any function. The vector w = [w_1 w_2 · · · w_n]^T is called the weight vector of the neuron. Often w ∈ R^n.
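As a small illustration (not part of the original notes), the neuron model can be written directly in NumPy; the choice f = tanh below is just one admissible C^1 function.

import numpy as np

def neuron(x, w, f=np.tanh):
    """Single neuron: y = f(x^T w)."""
    return f(x @ w)

# Example: a 3-input neuron with arbitrary (made-up) weights.
x = np.array([1.0, -0.5, 2.0])
w = np.array([0.2, 0.4, -0.1])
print(neuron(x, w))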


2.3 A feedforward Network of Neurons (3 layers)

(Figure: a 3-layer feedforward network; input layer, first hidden layer, second hidden layer and output layer, connected by the weight matrices W^[1], W^[2], W^[3].)

⇔   y = H[ W^[3] · G[ W^[2] · F[ W^[1] · x ] ] ]

where

F = [f_1 f_2 · · · f_m]^T,   G = [g_1 g_2 · · · g_k]^T,   H = [h_1 h_2 · · · h_p]^T

and

W^[1] = [w^[1]_ij]_(m×n),   W^[2] = [w^[2]_ij]_(k×m),   W^[3] = [w^[3]_ij]_(p×k)
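A minimal sketch (mine, not from the notes) of this forward mapping in NumPy, assuming for illustration that F, G and H are all elementwise tanh and that the weight matrices are random:

import numpy as np

def forward(x, W1, W2, W3, F=np.tanh, G=np.tanh, H=np.tanh):
    """Forward pass of the 3-layer network y = H[W3 G[W2 F[W1 x]]]."""
    o1 = F(W1 @ x)      # first hidden layer,  m units
    o2 = G(W2 @ o1)     # second hidden layer, k units
    return H(W3 @ o2)   # output layer,        p units

n, m, k, p = 4, 5, 3, 2
rng = np.random.default_rng(0)
W1 = rng.standard_normal((m, n))
W2 = rng.standard_normal((k, m))
W3 = rng.standard_normal((p, k))
print(forward(rng.standard_normal(n), W1, W2, W3))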

2.4 A recurrent Network

(Figure: a recurrent network in which the outputs are fed back to the inputs through z^{-1} delay operators: "The Hopfield Network".)


2.5 Learning

Depending on the task to solve, different learning paradigms (strategies) are used.

Learning paradigms:

  Supervised: corrective learning, reinforced learning
  Unsupervised

Supervised - during learning we tell the ANN what we want as output (the desired output).

Corrective learning - the desired signals are real-valued.

Reinforced learning - the desired signals are true/false.

Unsupervised - only input signals are available to the ANN during learning (e.g. signal separation).

One drawback with learning systems is that they can be totally lost when facing a scenario which they have never encountered during training.


Chapter 3

Threshold logic

These units are neurons where the function is a shifted step function and where the inputs are Boolean numbers (1/0). All weights are equal to ±1.

(Figure: a threshold logic unit with inputs x_1, …, x_n and threshold θ; its activation is a step function that switches at θ.)

y = 1 if x^T w > θ,   y = 0 if x^T w < θ

Proposition: All logical functions can be implemented with a network of these units, since they can implement the AND, OR and NOT functions.
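As a concrete illustration of the proposition (an added sketch, not from the notes), the three basic gates can be realized by threshold units with weights ±1 and a suitable threshold θ:

import numpy as np

def tlu(x, w, theta):
    """Threshold logic unit: 1 if x^T w > theta, else 0 (Boolean inputs)."""
    return int(np.dot(x, w) > theta)

AND = lambda x1, x2: tlu([x1, x2], [1, 1], theta=1.5)
OR  = lambda x1, x2: tlu([x1, x2], [1, 1], theta=0.5)
NOT = lambda x1:     tlu([x1],     [-1],   theta=-0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b), NOT(a))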


3.1 Hadamard-Walsh transform

For bipolar coding (−1/1) the Hadamard-Walsh transform converts every Boolean function of n variables into a sum of 2^n products. Example:

x_1   x_2   x_1 OR x_2
−1    −1    −1          (f_1)
−1     1     1          (f_2)
 1    −1     1          (f_3)
 1     1     1          (f_4)

The Hadamard-Walsh matrix H_n is defined recursively as

H_n = (1/2) [  H_{n−1}   H_{n−1} ]
            [ −H_{n−1}   H_{n−1} ]

whereby

H_1 = (1/2) [  1   1 ]
            [ −1   1 ]

Each row/column of the matrix is orthogonal to the other rows/columns. We want to find the coefficients a which solve the equation

x_1 OR x_2 = (1/4)(a_1 + a_2 x_1 + a_3 x_2 + a_4 x_1 x_2)

(a sum of 2^2 = 4 products)

by using

[a_1  a_2  · · ·  a_{2^n}]^T = H_n [f_1  f_2  · · ·  f_{2^n}]^T

For x_1 OR x_2:

[a_1]         [  1   1   1   1 ] [−1]
[a_2]  = 1/4  [ −1   1  −1   1 ] [ 1]
[a_3]         [ −1  −1   1   1 ] [ 1]
[a_4]         [  1  −1  −1   1 ] [ 1]

which gives

x_1 OR x_2 = (1/4)(2 + 2x_1 + 2x_2 − 2x_1 x_2)

¤
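The coefficients can be verified numerically; the following sketch (an illustration, not part of the notes) builds H_2 from the recursion and applies it to the truth-table values f = (−1, 1, 1, 1) of x_1 OR x_2:

import numpy as np

def hadamard_walsh(n):
    """H_n from the recursion H_n = 1/2 [[H_{n-1}, H_{n-1}], [-H_{n-1}, H_{n-1}]]."""
    H = np.array([[1.0]])
    for _ in range(n):
        H = 0.5 * np.block([[H, H], [-H, H]])
    return H

f = np.array([-1, 1, 1, 1])   # x1 OR x2 in bipolar coding, rows ordered as in the table
a = hadamard_walsh(2) @ f
print(a)                      # [0.5, 0.5, 0.5, -0.5], i.e. (1/4)(2 + 2x1 + 2x2 - 2x1x2)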


Chapter 4

The Perceptron

The perceptron is a weighted threshold logic unit.

(Figure: the perceptron; inputs x_1, …, x_n, weights w_1, …, w_n, and threshold θ.)

y = 1 if x^T w > θ,   y = 0 if x^T w < θ,       x, w ∈ R^n

The perceptron can classify all input vectors belonging to linearly separable classes.

Example:

(Figure: two point sets P and N in the plane separated by a straight line.)

P and N are linearly separable! ¤


Example:

(Figure: two point sets P and N whose points are interleaved so that no straight line separates them.)

Not linearly separable! ¤

Example: Classification problem

(Figure: measurement vectors from a process plotted in the plane and classified into two classes.)

¤

4.1 Perceptron training

Assume we have sampled the sensors when the process is OK (class P) as well as when it is broken (class N). We have a number of input vectors in each class.

start:    Choose w_0 randomly, set t = 0.

test:     Select a vector x ∈ P ∪ N.
          If all x are correctly classified, stop!


          if x ∈ P and x^T w_t > 0   goto test
          if x ∈ P and x^T w_t ≤ 0   goto add
          if x ∈ N and x^T w_t < 0   goto test
          if x ∈ N and x^T w_t ≥ 0   goto subtract

add:      1. w_{t+1} = w_t + x,   or   w_{t+1} = w_t − ((x^T w_t − ε)/‖x‖_2^2) x,   ε > 0
          2. t = t + 1, goto test

subtract: 1. w_{t+1} = w_t − x,   or   w_{t+1} = w_t − ((x^T w_t + ε)/‖x‖_2^2) x,   ε > 0
          2. t = t + 1, goto test
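A sketch of this training loop in NumPy (the data below are made up, and the simple w ← w ± x update is used rather than the ε-variant):

import numpy as np

def perceptron_train(P, N, max_iter=1000, seed=0):
    """Find w with x^T w > 0 for x in P and x^T w < 0 for x in N."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(P.shape[1])          # start: choose w0 randomly
    data = [(x, +1) for x in P] + [(x, -1) for x in N]
    for _ in range(max_iter):
        errors = 0
        for x, label in data:
            if label * (x @ w) <= 0:             # wrong side (or on the boundary)
                w = w + label * x                # add for P, subtract for N
                errors += 1
        if errors == 0:                          # all x correct: stop
            break
    return w

P = np.array([[2.0, 1.0], [1.5, 2.0]])
N = np.array([[-1.0, -2.0], [-2.0, -0.5]])
w = perceptron_train(P, N)
print(w, P @ w, N @ w)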

If the two sets N and P are not perfectly linearly separable, we want to make a good approximation of the classification.

The pocket algorithm:

start:   Initialize w_0 randomly.
         Define the best weight vector w_s = w and set h_s = 0.

iterate: Update w as before.
         Count the number h of consecutive correct classifications.
         If h > h_s, set w_s = w and h_s = h.
         Continue iterating.


4.2 The Linear Neuron (LMS-algorithm)

This is a neuron with a linear function:

(Figure: a linear neuron computing y = w^T x.)

Assume we have a signal x(n) which we want to transform linearly and match a signal d(n).

(Figure: x(n) is fed through a tapped delay line (TDL) and the weights w to produce y(n), which is compared with d(n).)

We form the error signal e(n) = d(n) − y(n). We define the instantaneous error function

J = (1/2) e(n)²

We want to minimize J. We use the steepest descent approach and calculate ∇_w J:

∇_w J = ∇_w (1/2)(d(n) − w^T x(n))²

Using the chain rule, this becomes

∇_w J = (d(n) − w^T x(n)) · (−x(n)) = −e(n) · x(n)

where −x(n) is the inner derivative. Steepest descent gives w_{k+1} = w_k − α ∇_w J, i.e.

w_{k+1} = w_k + α e(n) · x(n)    (the LMS algorithm)
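A minimal LMS sketch (illustration only): the desired signal d(n) is generated from a known weight vector so that convergence toward it can be observed.

import numpy as np

rng = np.random.default_rng(1)
w_true = np.array([0.5, -0.3, 0.8])          # unknown system generating d(n)
w = np.zeros(3)
alpha = 0.05                                  # stepsize

for n in range(2000):
    x = rng.standard_normal(3)                # input vector at time n (e.g. from a tapped delay line)
    d = w_true @ x                            # desired signal
    e = d - w @ x                             # error e(n) = d(n) - y(n)
    w = w + alpha * e * x                     # LMS update
print(w)                                      # approaches w_true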


Chapter 5

Layered Networks of Perceptrons

We have seen that a single perceptron can separate two linearly separable classes. If we put several perceptrons in a multilayer network, what is the maximum number of classification regions, i.e. what is the capacity of the network?

Proposition: Let R(m, n) denote the number of regions bounded by m hyperplanes (of dimension n − 1) in an n-dimensional space (all hyperplanes passing through the origin). Then

R(m, n) = R(m − 1, n) + R(m − 1, n − 1)

where

R(1, n) = 2,   n ≥ 1
R(m, 0) = 0,   ∀ m ≥ 1

Example:

(See fig. 6.20.) R(m, n) is computed recursively:

 m \ n   0   1    2    3    4    5    6
   1     0   2    2    2    2    2    2
   2     0   2    4    4    4    4    4
   3     0   2    6    8    8    8    8
   4     0   2    8   14   16   16   16
   5     0   2   10   22   30   32   32
   6     0   2   12   32   52   62   64


¤

A formula for R(m, n):

R(m, n) = 2 ∑_{i=0}^{n−1} C(m−1, i),   where C(m−1, i) is the binomial coefficient.

For a network with three layers of units and structure n−k−l−m, that is, n input nodes, k nodes in the first hidden layer, l nodes in the second hidden layer and m output nodes, we get

max # regions = min{R(k, n), R(l, k), R(m, l)}

Example: Structure 3-4-2-3

(Figure: the 3-4-2-3 network; input layer, first hidden layer, second hidden layer, output layer.)

The maximum number of regions in the 3-dimensional input space classifiable by the network is

min{R(4, 3), R(2, 4), R(3, 2)} = min{14, 4, 6} = 4

That is, the second hidden layer limits the performance ⇒ often the hidden layers contain more units than the input and output layers.
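The recursion for R(m, n) is easy to evaluate; the sketch below (mine, not from the notes) checks it against the closed form and reproduces the 3-4-2-3 example.

from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def R(m, n):
    """Number of regions cut out of R^n by m hyperplanes through the origin."""
    if n == 0:
        return 0
    if m == 1:
        return 2
    return R(m - 1, n) + R(m - 1, n - 1)

# Closed form: R(m, n) = 2 * sum_{i=0}^{n-1} C(m-1, i)
assert all(R(m, n) == 2 * sum(comb(m - 1, i) for i in range(n))
           for m in range(1, 7) for n in range(1, 7))

# Structure 3-4-2-3: the second hidden layer is the bottleneck.
print(min(R(4, 3), R(2, 4), R(3, 2)))   # -> 4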


Chapter 6

The backpropagation algorithm

Learning of a single neuron. Consider the following model of a neuron:

(Figure: a neuron with bias, y = f(w^T x + θ).)

θ is a scalar parameter called the "bias". The bias gives the neuron the possibility of shifting the function f(·) to the left or right (positive or negative bias).

Example: f(x + θ) = sign(x + θ)

(Figure: sign(x + θ) for θ = 0 and for a nonzero θ; the step is shifted by the bias.)

We can use the same terminology as before by defining the extended input vector x = [x_1 x_2 · · · x_n 1]^T and the extended weight vector w = [w_1 w_2 · · · w_n θ]^T ⇒ y = f(w^T x).


Assume that we have a sequence of input vectors and a corresponding sequence of target (desired) scalars

(x_1, t_1), (x_2, t_2), · · · , (x_N, t_N)

We wish to find the weights of a neuron with a non-linear function f(·) so that we can minimize the squared difference between the output y_n and the target t_n, i.e.

min E_n = min (1/2)(t_n − y_n)² = min (1/2) e_n²,    n = 1, 2, · · · , N

We will use the steepest descent approach

w^(n+1) = w^(n) − α ∇_w E

We need to find ∇_w E!

∇_w E_n = ∇_w (1/2)( t_n − f( w^T x_n ) )²

with the nested structure u = w^T x_n, f(u), h(f) = t_n − f(u), g(h) = (1/2) h².

The chain rule gives ∇_w g(h(f(u))) = ∇_h g · ∇_f h · ∇_u f · ∇_w u

∇_h g = e_n
∇_f h = −1             (since h = t_n − f(u))
∇_u f = depends on the nonlinearity f(·) we choose
∇_w u = x_n

⇒ w^(n+1) = w^(n) + α e_n ∇_u f · x_n

Since f(·) is a function f : R → R, i.e. one-dimensional, we can write ∇_u f = df/du.

⇒ The neuron learning rule for a general function f(·) is

w^(n+1) = w^(n) + α e_n (df/du) x_n

where u = w^T x and α is the stepsize.

OBS! If f(u) = u, that is, a linear function with slope 1 (df/du = 1), the above algorithm becomes the LMS algorithm for a linear neuron.


6.1 The non-linearity of the neuron

Any differentiable function f(·) can be used in the neuron. We will use two different functions, the bipolar and the unipolar sigmoid function.

6.1.1 The bipolar sigmoid function

(the hyperbolic tangent)

f(u) = tanh(γu) = (1 − e^{−2γu}) / (1 + e^{−2γu})

γ is called the slope (γ ∈ R)

Example:

(Figure: tanh(γu) versus u for γ = 0.25, 1 and 2.)

Often γ = 1.

The derivative of f(u) is

df/du = γ(1 − tanh(γu)²) = γ(1 − f(u)²)


6.1.2 The unipolar sigmoid function

f(u) = 1 / (1 + e^{−γu})

Example:

(Figure: the unipolar sigmoid f(u) versus u for γ = 0.25, 1 and 2.)

The derivative of f(u) is

df/du = γ · (1/(1 + e^{−γu})) · (1 − 1/(1 + e^{−γu})) = γ f(u)(1 − f(u))

This gives the update formulas for the neuron:

1. bipolar

w^(n+1) = w^(n) + αγ e_n (1 − y_n²) x_n
y_n = f(u_n) = tanh(γ u_n) = tanh(γ w_n^T x_n)

2. unipolar

w^(n+1) = w^(n) + αγ e_n y_n (1 − y_n) x_n
y_n = f(u_n) = 1/(1 + e^{−γ u_n}) = 1/(1 + e^{−γ w_n^T x_n})
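A sketch of the bipolar update rule on a toy problem (the target neuron and all parameter values below are made up for illustration):

import numpy as np

rng = np.random.default_rng(2)
gamma, alpha = 1.0, 0.1
w = rng.standard_normal(3) * 0.1                 # extended weight vector [w1 w2 theta]

for _ in range(5000):
    x = np.append(rng.uniform(-1, 1, 2), 1.0)    # extended input [x1 x2 1]
    t = np.tanh(x[0] - 2 * x[1] + 0.5)           # target from a "true" neuron (illustration)
    y = np.tanh(gamma * (w @ x))
    e = t - y
    w = w + alpha * gamma * e * (1 - y**2) * x   # bipolar update rule
print(w)                                          # approaches [1, -2, 0.5] for gamma = 1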


6.2 The backpropagation algorithm for a 3-layer network

(Figure: a 3-layer network; the input x is mapped through the extended weight matrices W^[1], W^[2] and W^[3], producing the layer outputs o^[1], o^[2] and y.)

• W^[s] is the extended weight matrix of layer s. The last column consists of the bias terms.

Assume we have a set of input vectors x_n and a set of corresponding target vectors t_n, n = 1, 2, · · · , N (a training set)

(x_1, t_1), (x_2, t_2), · · · , (x_N, t_N)

We want to minimize the squared difference between the output y_n of the network and the target vector t_n, i.e.

min ∑_{i=1}^{P} e_i²

We first state the algorithm and then we prove it.

6.2.1 Backpropagation algorithm

Step 1. Initialize all the weight matrices. This can be done by the following rule: pick values randomly in the interval (−0.5, 0.5) and divide by the fan-in, which is the number of units feeding the layer.

Example:

W^[1] = (rand(m, n + 1) − 0.5)/(n + 1)

W^[2] = (rand(k, m + 1) − 0.5)/(m + 1)


W^[3] = (rand(p, k + 1) − 0.5)/(k + 1)

where "rand" generates values uniformly distributed in [0, 1].

Step 2. Pick an input-target pair randomly from the training set, say (x_i, t_i). Calculate the output when x_i is the input according to

y_i = H[ W^[3] · G[ W^[2] · F[ W^[1] · x_i ] ] ]

where

F = [f_1(u_1^[1]), f_2(u_2^[1]), · · · , f_m(u_m^[1])]^T
G = [g_1(u_1^[2]), g_2(u_2^[2]), · · · , g_k(u_k^[2])]^T
H = [h_1(u_1^[3]), h_2(u_2^[3]), · · · , h_p(u_p^[3])]^T

All functions are chosen in advance.

Step 3. Find the weight corrections for each layer.

First define the error vector e_i = t_i − y_i. Define the local error vectors δ^[s], s = 1, 2, 3.

δ^[3] = diag(e_i) · ∂H/∂u^[3]

ΔW^[3] = α δ^[3] (o^[2])^T,    α - stepsize

OBS! o^[2] is the extended vector (with a 1 appended)!

δ^[2] = diag( (W^[3])^T δ^[3] ) · ∂G/∂u^[2]

OBS! Here W^[3] is taken without the bias column!

ΔW^[2] = α δ^[2] (o^[1])^T

δ^[1] = diag( (W^[2])^T δ^[2] ) · ∂F/∂u^[1]    (W^[2] again without the bias column)

ΔW^[1] = α δ^[1] x^T


(The algorithm is called backpropagation since the local error propagates backwards from the output layer into the hidden layers.)

Step 4. Update the weight matrices.

W^[3] = W^[3] + ΔW^[3]
W^[2] = W^[2] + ΔW^[2]
W^[1] = W^[1] + ΔW^[1]

Return to step 2; stop when there are no more changes in ΔW^[s].

¤

The partial derivatives are

∂H/∂u^[3] = [ dh_1/du_1^[3]   dh_2/du_2^[3]   · · ·   dh_p/du_p^[3] ]^T

If we use the bipolar sigmoid function

∂H/∂u^[3] = [ (1 − tanh(u_1^[3])²)   (1 − tanh(u_2^[3])²)   · · ·   (1 − tanh(u_p^[3])²) ]^T

and the same follows for ∂G/∂u^[2] and ∂F/∂u^[1].
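The following is a compact sketch of steps 1-4 in NumPy for a small network (not the notes' own code), assuming all activation functions are the bipolar sigmoid with γ = 1; the training set is synthetic and only serves to show the flow of the algorithm.

import numpy as np

rng = np.random.default_rng(3)
n, m, k, p = 2, 4, 3, 1                       # layer sizes
alpha = 0.1

# Step 1: initialize extended weight matrices (last column = biases), scaled by fan-in.
W1 = (rng.random((m, n + 1)) - 0.5) / (n + 1)
W2 = (rng.random((k, m + 1)) - 0.5) / (m + 1)
W3 = (rng.random((p, k + 1)) - 0.5) / (k + 1)

def ext(v):
    """Append the constant 1 so the bias sits in the last weight column."""
    return np.append(v, 1.0)

# A synthetic training set (illustration only): t = tanh(x1 * x2).
X = rng.uniform(-1, 1, (200, n))
T = np.tanh(X[:, 0] * X[:, 1]).reshape(-1, 1)

for epoch in range(2000):
    i = rng.integers(len(X))                  # Step 2: pick a training pair at random
    x, t = X[i], T[i]
    o1 = np.tanh(W1 @ ext(x))                 # F
    o2 = np.tanh(W2 @ ext(o1))                # G
    y  = np.tanh(W3 @ ext(o2))                # H

    # Step 3: local errors (tanh' = 1 - f^2); W without its bias column when propagating back.
    e  = t - y
    d3 = e * (1 - y**2)
    d2 = (W3[:, :-1].T @ d3) * (1 - o2**2)
    d1 = (W2[:, :-1].T @ d2) * (1 - o1**2)

    # Step 4: update the weight matrices.
    W3 += alpha * np.outer(d3, ext(o2))
    W2 += alpha * np.outer(d2, ext(o1))
    W1 += alpha * np.outer(d1, ext(x))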

6.3 The proof of the backpropagation algorithm

We first define the error function for input-target pair i′, i.e. (x_{i′}, t_{i′}):

E_{i′} = (1/2) ∑_{j=1}^{p} (t_{ji′} − y_{ji′})² = (1/2) ∑_{j=1}^{p} e_{ji′}² = (1/2) ‖e_{i′}‖_2²

that is, we want to minimize the sum of the squared errors. For every weight w_{ji}^[s] in layer s = 1, 2, 3 we will use the steepest descent rule to update the weight

w_{ji}^[s] = w_{ji}^[s] − α ∂E_{i′}/∂w_{ji}^[s] = w_{ji}^[s] + Δw_{ji}^[s]

We need to find the partial derivative ∂E_{i′}/∂w_{ji}^[s] for every weight in every layer. First we start with the output layer (s = 3):

− ∂E_{i′}/∂w_{ji}^[3] = − (∂E_{i′}/∂u_j^[3]) · (∂u_j^[3]/∂w_{ji}^[3])


where

u_j^[3] = ∑_i w_{ji}^[3] o_i^[2]

(the sum running over the units of the second hidden layer, including the bias term).

We define the local error δ_j^[3] as

δ_j^[3] = − ∂E_{i′}/∂u_j^[3] = − (∂E_{i′}/∂e_{ji′}) · (∂e_{ji′}/∂u_j^[3]) = e_{ji′} · ∂h_j/∂u_j^[3]

And we get the update rule for all weights in the output layer (s = 3) as

w_{ji}^[3] = w_{ji}^[3] + α e_{ji′} (∂h_j/∂u_j^[3]) o_i^[2] = w_{ji}^[3] + α δ_j^[3] o_i^[2]

This is the same rule as for the single neuron, where the input o^[2] is the output from the previous layer.

We continue by deriving the update for the second hidden layer (s = 2). As before,

− ∂E_{i′}/∂w_{ji}^[2] = − (∂E_{i′}/∂u_j^[2]) · (∂u_j^[2]/∂w_{ji}^[2]) = δ_j^[2] o_i^[1]

The calculation of the local error δ_j^[2] for the second hidden layer will be different:

δ_j^[2] = − ∂E_{i′}/∂u_j^[2] = − (∂E_{i′}/∂o_j^[2]) · (∂o_j^[2]/∂u_j^[2])

Since o_j^[2] = g_j(u_j^[2]), where g_j(·) is the function associated with neuron j, we have

δ_j^[2] = (− ∂E_{i′}/∂o_j^[2]) · (∂g_j/∂u_j^[2])

To calculate −∂E_{i′}/∂o_j^[2] we use the chain rule once again to express the derivative in terms of the local error in layer 3:

−∂E_{i′}/∂o_j^[2] = − ∑_{i=1}^{p} (∂E_{i′}/∂u_i^[3]) · (∂u_i^[3]/∂o_j^[2])
                  = ∑_{i=1}^{p} (−∂E_{i′}/∂u_i^[3]) · ∂/∂o_j^[2] [ ∑_k w_{ik}^[3] o_k^[2] ]

where −∂E_{i′}/∂u_i^[3] = δ_i^[3] and ∑_k w_{ik}^[3] o_k^[2] = u_i^[3].


The sum comes from the fact that one output o_j in layer 2 is connected to all u_i in layer 3.

−∂E_{i′}/∂o_j^[2] = ∑_{i=1}^{p} δ_i^[3] w_{ij}^[3]

which gives the update rule for the second hidden layer

w_{ji}^[2] = w_{ji}^[2] + α δ_j^[2] o_i^[1]

where

δ_j^[2] = (∂g_j/∂u_j^[2]) ∑_{i=1}^{p} δ_i^[3] w_{ij}^[3]

In the same way, the update for the first hidden layer can be derived using the local error of the following layer:

w_{ji}^[1] = w_{ji}^[1] + α δ_j^[1] x_i

where

δ_j^[1] = (∂f_j/∂u_j^[1]) ∑_{i=1}^{k} δ_i^[2] w_{ij}^[2]

Writing this in matrix notation gives the algorithm stated above. ¤


6.4 The role of the hidden layer

Consider a two-layer feedforward neural network, one hidden and one output layer, without non-linearities. The output vector will

(Figure: a linear two-layer network, x → W^[1] → W^[2] → y.)

then be

y = W^[2](W^[1] x) ⇔ y = W x

which means that the network can be reduced to a single-layer network with weight matrix W = W^[2] W^[1].

⇒ Every multilayer feedforward network with linear neurons can be reduced to an equivalent single-layer network.

Proposition: A three-layer feedforward neural network with sigmoidal functions can approximate any multivariate function with a finite number of discontinuities to any desired degree of accuracy, provided we have a sufficient number of neurons. ¤

6.5 Applications of the backpropagation algorithm

The solution to a set of linear equations:

Ax = b    (A can be of any size m × k)

⇒ x_opt = A⁺ b

(in the LS sense if no exact solution exists). How can we use neural networks to find the pseudo-inverse of A? We will use a single-layer NN with linear neurons (no biases). Write the matrix A as

A^T = [a_1 a_2 · · · a_m]


(Figure: a single-layer linear network with weight matrix W mapping x to y.)

and define the columns of an identity matrix

I_{m×m} = [e_1 e_2 · · · e_m]

i.e.

e_1 = [1 0 · · · 0]^T,   e_2 = [0 1 · · · 0]^T,   . . . ,   e_m = [0 0 · · · 1]^T

then the training set is

(a_1, e_1), (a_2, e_2), · · · , (a_m, e_m)

We use the LMS algorithm to adapt the weights W to the training set, with one modification:

W^(k+1) = (1 − γ) W^(k) + α e x^T

where γ ≪ 1 is a leaky factor, α is the stepsize, x_k = input = a_k, and e is the error vector.

(Remember we have linear functions; the algorithm is the MIMO-LMS, Multiple-Input Multiple-Output LMS.) After convergence, W^T = A⁺ (the pseudo-inverse of A). ¤
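A sketch of this construction with illustrative parameter values: W is trained with the leaky MIMO-LMS rule on the pairs (a_i, e_i) and W^T is then compared with the pseudo-inverse computed by NumPy.

import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 3))               # any m x k matrix
m, k = A.shape
W = np.zeros((m, k))                          # single-layer linear network, y = W x
alpha, gamma = 0.02, 1e-4                     # stepsize and leaky factor (gamma << 1)

for it in range(30000):
    i = rng.integers(m)
    x = A[i]                                  # input  = i-th column of A^T
    d = np.eye(m)[i]                          # target = i-th column of the identity
    e = d - W @ x                             # error vector
    W = (1 - gamma) * W + alpha * np.outer(e, x)

err = np.max(np.abs(W.T - np.linalg.pinv(A)))
print(err)   # small; it shrinks further as the stepsize alpha is reduced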

6.5.1 One more example of Backpropagation

An alternative way to find the pseudo-inverse of a matrix A is to use a linear two-layer network. We create a random sequence of input vectors and use the same vectors as targets.

⇒ Training sequence: (x_1, x_1), (x_2, x_2), . . . , (x_N, x_N)

We fix the first-layer weights to A and we only update the output-layer weights W.


(Figure: a two-layer linear network with fixed first-layer weights A and adaptive output-layer weights W.)

For any input vector x_i we want x_i = W A x_i, and since backpropagation minimizes the norm

min ‖ x_i − W A x_i ‖

it follows that W = A⁺.

6.6 Application: speech synthesis (NETtalk)

People in the field of linguistics have been working with rule-based speech synthesis systems for many years. Speech synthesis is a way of converting text to speech, and a feedforward neural network, NETtalk, can be trained to solve this. It uses a 7-character sliding window to read the text and is trained with backpropagation to classify the phonemes of the text's pronunciation (see figure 9.14).

(Figure 9.14: NETtalk; 7 groups of 29 input sites read the text window "u r a l _ N e", feeding 80 hidden units and 26 output units that code the phonemes.)

Another application is the reverse problem: listening to speech and then converting it to text. Such systems use Hidden Markov Models trained with backpropagation.

6.7 Improvements to backpropagation

There are many ways to improve the convergence of backpropagation.


The main problem is the steepest descent algorithm. We know that the steepest descent direction is perpendicular to the contours of the error function.

(Figure: error-function contours; the steepest descent path compared with the closest path to the minimum.)

We also know that the more elliptic the contours are, the slower the convergence will be. The ratio of the ellipse diameters is controlled by the eigenvalues of the correlation matrix R_xx (x = input). If we decorrelate the input before the network, we will get better performance.

(λ denotes the eigenvalues of R_xx.)

Decorrelation can be done by an unsupervised algorithm (the Generalized Hebbian Algorithm) that finds the Principal Components (PCA). We will see this later in this course.

6.8 Momentum update

In order to descend faster on the error surface, we use the average of previous gradients when we update the weights:

Δw_ij^(k) = − ∂E/∂w_ij + β Δw_ij^(k−1)

where the last term is the momentum term, β < 1.
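A sketch of the momentum update on a simple quadratic error surface (illustration only; a stepsize α is included as in the earlier update rules, and the gradient of the quadratic stands in for the backpropagated ∂E/∂w_ij):

import numpy as np

def grad_E(w):
    """Gradient of an elliptic quadratic error E(w) = 0.5*(10*w[0]^2 + w[1]^2)."""
    return np.array([10.0 * w[0], w[1]])

w = np.array([1.0, 1.0])
dw = np.zeros_like(w)
alpha, beta = 0.05, 0.9                        # stepsize and momentum term (beta < 1)

for _ in range(200):
    dw = -alpha * grad_E(w) + beta * dw        # momentum update
    w = w + dw
print(w)                                       # approaches the minimum at the origin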


The backpropagation algorithm can cause the weight updates to follow a fractal pattern (fractal meaning that they follow rules which are applied statistically), see figure 8.11. ⇒ The stepsize and momentum term should decrease during training to avoid this.

6.9 Adaptive step algorithms

These algorithms modify the stepsize during training in order to speed up convergence. The idea is to increase the stepsize α if the sign of the gradient is the same for two consecutive updates, and to decrease it if the sign changes.

Example:

(Figure: an error function along one weight; the stepsize is increased while the gradient keeps its sign from the starting point and decreased when the sign changes near the minimum.)

6.10 Silva and Almeida's algorithm

α_i^(k+1) = α_i^(k) · u   if ∇_i E^(k) · ∇_i E^(k−1) > 0
α_i^(k+1) = α_i^(k) · d   if ∇_i E^(k) · ∇_i E^(k−1) < 0

where u, d are parameters (e.g. u = 1.1, d = 0.9) and the index i stands for weight no. i. ¤
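A sketch of the per-weight stepsize adaptation (the quadratic error below is only a stand-in for the network error): each weight keeps its own α_i, which is multiplied by u while its gradient component keeps its sign and by d when the sign changes.

import numpy as np

def grad_E(w):                                 # stand-in for the backpropagated gradient
    return np.array([10.0 * w[0], w[1]])

w = np.array([1.0, 1.0])
alphas = np.full(2, 0.01)                      # one stepsize per weight
u, d = 1.1, 0.9
g_prev = np.zeros(2)

for _ in range(300):
    g = grad_E(w)
    same_sign = g * g_prev > 0
    alphas = np.where(same_sign, alphas * u,
                      np.where(g * g_prev < 0, alphas * d, alphas))
    w = w - alphas * g                         # steepest descent with per-weight stepsizes
    g_prev = g
print(w)                                       # approaches the minimum at the origin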

Other similar methods are delta-bar-delta and Rprop.

6.11 Second order algorithms

These methods use Newton's algorithm instead of steepest descent:

w^(k+1) = w^(k) − [∂²E/∂w²]^{−1} ∇_w E


One can derive the updates for the network in each layer as before, but the computational complexity increases drastically in the training of the network.

Every neuron needs a matrix inversion!

There are good approximations to these algorithms, the so-called pseudo-Newton methods, e.g. Quickprop, QRprop, etc.

6.12 Relaxation methods

These methods are used to avoid calculating the gradient. They use a trial-and-error approach: change the weights in some way; if the error drops, keep the new weights, otherwise try again.

6.13 Generalization

The learning process may be viewed as a "curve-fitting" problem. The network itself may be considered simply as a nonlinear input-output mapping.

When training neural networks it is important to get a network that has good generalization. A network is said to "generalize" well when the input-output mapping computed by the network is correct (or nearly so) for test data never used in creating or training the network.

Over-fitting is one of the biggest threats to good generalization. Over-fitting occurs during training when the number of free parameters, or weights, is too large. What happens is that the error during training is driven to a very small value, but when new data is presented to the network the error is large. The reason is that the network's complexity is too large, so the entire set of training patterns is memorized and the network has enough degrees of freedom to fit the noise perfectly. Of course this means bad generalization to new test data. The figure tries to visualize this problem. One solution to this problem is to reduce the number of free parameters. This is most easily done by reducing the number of hidden neurons. Generalization is influenced by three factors:

• The size of the training set, and how representative it is of the environment of interest.


(Figure: two curves fitted through the same training data (x) and a new test point (o); x = training data, o = generalization point, curve = nonlinear mapping. Left: good fit to the training data but bad generalization. Right: worse fit but much better generalization.)

• The architecture of the neural network

• The physical complexity of the problem at hand.

We may view the issue of generalization from two different perspectives.

• The architecture of the network is fixed. We have to determine the size of the training set needed for good generalization to occur.

• The size of the training set is fixed. We have to determine the best architecture of the network for achieving good generalization.

6.13.1 The size of the training set

We may generally say that for good generalization, the number of training examples P should be larger than the ratio of the total number of free parameters (the number of weights) in the network to the mean square value of the estimation error.

6.13.2 Number of hidden neurons

Choosing the correct number of hidden neurons is an important aspect when designing a neural network. Too few hidden neurons gives us a net that has difficulty adapting to the training data, and too many neurons may result in a network with bad generalization. A theoretical and a more straightforward approach to finding the number of neurons will be discussed in the following subsections.


A theoretical approach

How many neurons should be used in relation to the number of training examples? Information theory suggests that the number of input vectors should be equal to the number of weights in the system. The number of weights, W, for a multi-layer network can be calculated by the expression:

W = (I + 1)J + (J + 1)K (6.1)

where I is the number of inputs, J the number of hidden neurons and K the number of outputs. Another point of view is the one from Widrow and Lehr (1990). They argue that the number of training examples P should be much greater than the network capacity, which is defined as the number of weights divided by the number of outputs (W/K). In their paper they also refer to Baum and Haussler's (1989) statement that the number of training examples for a multi-layer network could be calculated as follows:

P = W/ε    (6.2)

where ε is an "accuracy parameter". The accuracy parameter is the fraction of tests from the test set that are incorrectly classified. A good value for ε is 0.1, which corresponds to an accuracy level of 90%. This gives us the rule of thumb below.

A lower bound for the number of training examples P is P = W, and a realistic upper bound is P = 10W.

This is just a rule of thumb and there is actually no upper bound for P.

Example

Suppose we know the number of training examples P and we would like to calculate the number of hidden neurons. The input vector is of length 24, I = 24, and the number of training examples is 4760, P = 4760. The number of outputs is one, K = 1. This gives us the following equations:

W = (24 + 1)J + (J + 1)1 ⇔ W = 26J + 1 (6.3)

Since the rule of thumb says that,

W ≤ P ≤ 10W ⇔ (26J + 1) ≤ 4760 ≤ (260J + 10) (6.4)

we get the limits for the number of hidden neurons, J:

18 ≤ J ≤ 183 (6.5)
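The arithmetic of the example is easy to reproduce; the small helper below (not from the notes) solves W ≤ P ≤ 10W for J.

import math

def hidden_neuron_bounds(I, K, P):
    """Bounds on the hidden neurons J from W <= P <= 10W, with W = (I+1)J + (J+1)K."""
    # W = (I + 1 + K) * J + K, so invert the two inequalities for J.
    a, b = I + 1 + K, K
    J_max = math.floor((P - b) / a)             # from W <= P
    J_min = math.ceil((P - 10 * b) / (10 * a))  # from P <= 10W
    return J_min, J_max

print(hidden_neuron_bounds(I=24, K=1, P=4760))  # roughly (19, 183); the notes round the lower limit down to 18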


6.13.3 Cross validation

A standard tool in statistics known as cross-validation provides an appealing guiding principle. First, the available data set is randomly partitioned into a training set and a test set. The training set is further partitioned into two disjoint subsets:

• Estimation subset, used to select the model.

• Validation subset, used to test or validate the model.

The motivation here is to validate the model on a data set different from the one used for parameter estimation. In this way we may use the training set to assess the performance of various candidate models, and thereby choose the "best" one. There is, however, a distinct possibility that the model with the best-performing parameter values so selected may end up overfitting the validation subset. To guard against this possibility, the generalization performance of the selected model is measured on the test set, which is different from the validation subset.

Extremely low training error is not hard to achieve when training neural networks, but a low training error does not automatically mean that the training result is good. Two methods that can improve generalization, regularization and early stopping, are briefly discussed in the following subsections.

Early stopping

The training set is used to calculate the gradients and to update the network weights and biases. As the number of epochs increases, the mean square error usually decreases. It is possible for the network to end up over-fitting the training data if the training session is not stopped at the right point. The validation set is used to improve generalization. During training, the validation error is monitored. As long as the network shows no signs of over-fitting, the validation error normally decreases. The sign that the method is waiting for is when the error goes from decreasing to increasing, which means that the network starts to over-fit. The training is stopped when the error has increased for a specific number of steps, and the weights and biases at the minimum validation error are returned.


(Figure: mean-squared error versus number of epochs for the training sample and the validation sample; the early stopping point is where the validation error starts to increase.)

Regularization

There are a couple of regularization techniques to be used, which all are based on the same idea: to generate a smooth mapping that does not fit the noise in the training data. This means that over-fitting should not be a problem, even when the network has lots of hidden neurons.

The basic idea of regularization is that the squared error function, or performance function, E_mse is combined with a penalty function Ψ to get a new error function E_msereg. Here we define the penalty function as the squared norm of the weight vector of the network, but there are other definitions as well. The error function is usually given by:

E_mse = (1/N) ∑_{i=1}^{N} (e_i)²    (6.6)

where e_i is the network error and N the number of errors. The penalty function:

Ψ = (1/n) ∑_{j=1}^{n} (w_j)²    (6.7)

where w_j are the weights and n the number of weights. These two functions, together with a multiplicative constant µ, can be used to get a regularized error function:

E_msereg = µ E_mse + (1 − µ) Ψ    (6.8)

Using this error function, weights and biases are forced to small values, which smooths the network response and makes it less likely to over-fit. Using a regularization method is especially important when the amount of training data is limited.
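A minimal sketch of equations (6.6)-(6.8) as they would be evaluated in code (the sample errors, weights and µ below are arbitrary):

import numpy as np

def mse(errors):
    """E_mse, eq. (6.6): mean of the squared network errors."""
    return np.mean(np.square(errors))

def penalty(weights):
    """Psi, eq. (6.7): mean of the squared weights."""
    return np.mean(np.square(weights))

def msereg(errors, weights, mu=0.9):
    """E_msereg, eq. (6.8): mu * E_mse + (1 - mu) * Psi."""
    return mu * mse(errors) + (1 - mu) * penalty(weights)

e = np.array([0.1, -0.2, 0.05])
w = np.array([0.5, -1.2, 0.3, 0.8])
print(msereg(e, w))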
