Neural network for Neural Networks


Transcript of Neural network for Neural Networks


1

Lecture 6

Neural Networks

2

Neural network for Classification

3

Neural Network for Classification

The fundamental objective of pattern recognition is classification, and it is the most typical form of neural computation. A pattern recognition system can be treated as a two-stage device: feature extraction followed by classification. A feature is defined as a measurement taken on the input pattern that is to be classified; the classifier must map the input features onto a classification state, deciding which class category the input matches most closely.

4

Neural Network for Classification

Any given input pattern must belong to one of the classes under consideration, so a mapping from the input to the required class must exist. This mapping is a function that transforms the input pattern into the correct output class, and we will consider that our network has learnt to perform correctly if it can carry out this mapping.

5

Neural Network for Classification

6

We spoke about the limitations of the single-layer perceptron. When the problem of linear separability was well understood, it was also found that the limitation of single-layer networks can be overcome by adding more layers. For example, a two-layer network may be formed by cascading two single-layer networks. Such networks can perform more general classification, separating those points that are contained in convex open or closed regions.

Neural Network for Classification


7

Convex regions

A convex region is one in which any two points in the region can be joined by a straight line that does not leave the region.

8

To understand the convexity limitation, consider a simple two-layer network with two inputs going to two neurons in the first layer, both feeding a single neuron in layer 2. The output neuron's threshold is set to 0.75 and its weights are both set to 0.5.

Neural Network for Classification
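A minimal numerical sketch of this construction may help. The two layer-1 separating lines below are illustrative choices of mine; the lecture fixes only the output neuron's threshold of 0.75 and weights of 0.5:

```python
import numpy as np

def step(x, threshold):
    """Threshold unit: outputs 1 when the weighted sum exceeds the threshold."""
    return (x > threshold).astype(float)

def two_layer_net(p):
    """p is a point (x, y) in the plane.
    Layer 1: two threshold neurons, each dividing the plane by a line
    (these particular lines are illustrative; the lecture leaves them open)."""
    h1 = step(np.dot([1.0, 0.0], p), 0.0)   # outputs 1 to the right of the line x = 0
    h2 = step(np.dot([0.0, 1.0], p), 0.0)   # outputs 1 above the line y = 0
    # Layer 2: weights 0.5, 0.5 and threshold 0.75 -> logical AND of h1 and h2
    return step(0.5 * h1 + 0.5 * h2, 0.75)

print(two_layer_net(np.array([1.0, 1.0])))   # 1.0: inside the convex region
print(two_layer_net(np.array([-1.0, 1.0])))  # 0.0: outside the region
```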

9

In this case an output of one (1) is required from both layer 1 neurons to exceed the threshold and produce a one (1) on the output. Thus the output neuron performs a logical AND function. Each neuron in layer 1 subdivides the OXY plane, producing an output of one (1) on one side of its line.

Neural Network for Classification

10

The result of this double subdivision is that the layer 2 neuron outputs a one (1) only over a V-shaped region.

Neural Network for Classification

11

Similarly, three neurons used in layer 1 further subdivide the plane, creating for example a triangle-shaped region. By including enough neurons in layer 1, a convex region of any desired shape can be formed. Layer 2, of course, is not limited to the AND function; it can produce many other functions if the weights and threshold are suitably chosen.

Neural Network for Classification

12

A three-layer network is still more general; its classification capability is limited only by the number of neurons and weights. There are no convexity constraints: the layer 3 neuron receives as input a group of convex polygons, and their logical combination need not be convex.

Neural Network for Classification


13

A concave decision region formed by the intersection of two convex regions: the logical operation A AND NOT B.

Neural Network for Classification

[Figure: a three-layer network (layers 1-3) with inputs x1, x2 and output y; triangle A and triangle B combine into the non-convex region A AND NOT B.]

14

Neural Network for Classification

Any continuous function of n variables can be computed using only linear summations and nonlinear but continuously increasing functions of only one variable. It effectively states that a three-layer perceptron with n(2n+1) nodes using continuously increasing non-linearities can compute any continuous function of n variables. A three-layer perceptron could then be used to create any continuous likelihood function required in a classifier.

Existence (Kolmogorov) Theorem

15

Conclusion:

To create an arbitrarily complex shape (decision region), we never need more than three layers in the network.

This gives a limit on the number of layers, but it does not define:

• how many elements are necessary to create the network (in total and in particular layers),

• how these elements should be connected,

• what the weight values should be.

Neural Network for Classification

16

Inconsistency in nomenclature

What is a layer?

• some authors refer to the number of layers of variable weights

• some authors describe the number of layers of nodes

Usually, the nodes in the first layer, the input layer, merely distribute the inputs to subsequent layers and do not perform any operations (summation or thresholding); n.b. some authors miss out these nodes.

Network structure

17

What is a network layer?

A layer is the part of the network structure which contains active elements performing some operation. A multilayer network receives a number of inputs. These are distributed by a layer of input nodes that do not perform any operation; the inputs are then passed along the first layer of adaptive weights to a layer of perceptron-like units, which do sum and threshold their inputs. This layer is able to produce classification lines in pattern space.

Network structure

18

The output from this layer is then passed to another layer, and the output of this layer forms convex regions in pattern space. A further layer is able to define any arbitrary shape in pattern space.

Neural Network for Classification


19

Neural networks and their corresponding

decision regions

Neural Network for Classification

20

Neural Network for Classification

21

A multilayer perceptron is fault-tolerant, since it is a distributed parallel processing element, with each node contributing to the final output. If a node or its weights are lost or damaged, recall is impaired in quality, but the distributed nature of the information means that the damage has to be extensive before the network's response degrades badly. The network demonstrates graceful degradation rather than catastrophic failure.

Neural Network for Classification

22

Backpropagation Algorithm

23

Backpropagation

For many years there was no theoretically sound algorithm for training multilayer artificial neural networks. Since single-layer networks proved severely limited in what they could represent, the entire field went into virtual eclipse. In 1986 Rumelhart and McClelland suggested a new learning rule, known as the backpropagation rule, which is today used in many practical applications, for example in solving optimization problems.

24

Backpropagation

The invention of the backpropagation algorithm has

played a large part in the resurgence of interest in

artificial neural networks.

Backpropagation is a systematic method for training

multilayer artificial neural networks.


25

Backpropagation

It is a rule for changing the weights T_{ij} between network elements. The algorithm is based on the idea of minimizing the sum of squared errors by use of the gradient descent method.

26

Backpropagation

Assumptions:

• the net has a regular, multilayer structure

• the first layer is the input layer

• the last layer is the output layer

• the layers between are hidden layers

• feedforward propagation only

27

Backpropagation

[Figure: a multilayer feedforward network with an N-element input layer, hidden layers 1, 2, 3, ..., R (elements 1, 2, ..., K each), and an M-element output layer; the net input enters on the left and the net output leaves on the right.]

28

Backpropagation

To define the state of the j-th neuron in layer n we calculate the weighted sum of its M inputs

E_j^n = \sum_{i=1}^{M} T_{j,i}^n U_i^{n-1}    (1)

where
E_j^n – weighted input sum of the j-th neuron in layer n,
T_{j,i}^n – connection weight between the i-th neuron in layer n-1 and the j-th neuron in layer n,
U_i^{n-1} – output signal of the i-th neuron in layer n-1.

29

Backpropagation

[Figure: neurons 1, ..., M of layer n-1, with outputs U_1^{n-1}, ..., U_M^{n-1}, feed neuron j of layer n through the weights T_{j,1}^n, ..., T_{j,M}^n, giving E_j^n = \sum_{i=1}^{M} T_{j,i}^n U_i^{n-1} and the output U_j^n.]

30

Backpropagation

The output signal of the j-th neuron in layer n is defined by

U_j^n = f(E_j^n)    (2)

where f is the neuron transfer function

U_j^n = f(E_j^n) = \frac{1}{1 + \exp(-(E_j^n - \Theta_j^n)/\Theta_0)}    (3)

f – sigmoid function, the output values ∈ (0;1),
\Theta_j^n – the threshold, \Theta_0 – slope.
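As a small illustration of (1)-(3), the forward pass of one layer can be written directly; the array layout of U, T and Θ below is my own assumption:

```python
import numpy as np

def layer_forward(U_prev, T, theta, theta0=1.0):
    """Forward pass of one layer, following (1)-(3).
    U_prev : outputs of layer n-1, shape (M,)
    T      : weights T[j, i] from neuron i in layer n-1 to neuron j, shape (J, M)
    theta  : per-neuron thresholds Theta_j^n, shape (J,)
    theta0 : slope parameter Theta_0
    """
    E = T @ U_prev                                   # (1) weighted input sums
    U = 1.0 / (1.0 + np.exp(-(E - theta) / theta0))  # (2)-(3) sigmoid outputs
    return U

U = layer_forward(np.array([0.2, 0.7]),
                  np.array([[0.5, -0.3], [0.1, 0.8]]),
                  theta=np.zeros(2))
```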


31

Backpropagation

The global error D is a differentiable function of the weights

D = \frac{1}{2} \sum_{j=1}^{M} (U_j^* - U_j^{out})^2    (4)

where
U_j^* – desired output (a target) of the j-th neuron in the output layer,
U_j^{out} – actual output of the j-th neuron in the output layer.

32

Backpropagation

Aim: minimization of the global error D by modifying the network's weights.

Goal: define the rules of weight adaptation that minimize the global error.

Solution: the gradient descent method

\Delta T_{j,i}^n = -\eta \frac{\partial D}{\partial T_{j,i}^n}    (5)

where \eta is the learning coefficient.

33

Backpropagation

Each weight is changed according to the value and direction of the negative gradient on the error hypersurface D(T). The partial derivative in (5) can be calculated by use of the chain rule.

34

Backpropagation

We define the change in error as a function of the change in the net input to the j-th neuron as

\frac{\partial D}{\partial T_{j,i}^n} = \frac{\partial D}{\partial E_j^n} \cdot \frac{\partial E_j^n}{\partial T_{j,i}^n}    (6)

d_j^n = -\frac{\partial D}{\partial E_j^n}

35

Backpropagation

\frac{\partial E_j^n}{\partial T_{j,i}^n} = \frac{\partial}{\partial T_{j,i}^n} \sum_{k=1}^{M} T_{j,k}^n U_k^{n-1} = \sum_{k=1}^{M} \frac{\partial T_{j,k}^n}{\partial T_{j,i}^n} U_k^{n-1} = U_i^{n-1}    (7)

because

\frac{\partial T_{j,k}^n}{\partial T_{j,i}^n} = \begin{cases} 0 & \text{for } k \neq i \\ 1 & \text{for } k = i \end{cases}

36

Backpropagation

Finally, we get

\Delta T_{j,i}^n = \eta\, d_j^n U_i^{n-1}    (8)

For the elements in the output layer

\Delta T_{j,i}^{out} = \eta\, d_j^{out} U_i^{out-1}    (9)

A decrease of D means changes proportional to d_j^{out} U_i^{out-1}.


37

Backpropagation

[Figure: the last hidden layer (elements 1, ..., N, outputs U_1^{out-1}, ..., U_N^{out-1}) feeds the output layer (elements 1, ..., M) through the weights T_{1,1}^{out}, ..., T_{M,N}^{out}; for example E_1^{out} = \sum_{i=1}^{N} T_{1,i}^{out} U_i^{out-1}.]

38

Backpropagation

Now we need to know what d_j^{out} is for each of the output units:

d_j^{out} = -\frac{\partial D}{\partial E_j^{out}} = -\frac{\partial D}{\partial U_j^{out}} \cdot \frac{\partial U_j^{out}}{\partial E_j^{out}}    (10)

where

\frac{\partial U_j^{out}}{\partial E_j^{out}} = f'(E_j^{out})

39

Backpropagation

Differentiating D with respect to U_j^{out} gives

\frac{\partial D}{\partial U_j^{out}} = -(U_j^* - U_j^{out})    (11)

thus

d_j^{out} = (U_j^* - U_j^{out}) \cdot f'(E_j^{out})    (12)

40

Backpropagation

This is used for the output units to modify the weights between the neurons of the output layer and the neurons of the last hidden layer (since the target and the output are both available):

T_{j,i}^{out}(t) = T_{j,i}^{out}(t-1) + \tau \Delta T_{j,i}^{out}(t)    (13)

where \tau is a learning coefficient.

41

Backpropagation

The formula (9) can be rewritten to avoid oscillations:

\Delta T_{j,i}^{out}(t) = \alpha \Delta T_{j,i}^{out}(t-1) + (1-\alpha)\, d_j^{out} f(E_i^{out-1})    (14)

where E_i^{out-1} is the weighted input sum of the i-th element of the last hidden layer, and \alpha (the so-called smoothing parameter) is a constant value defining the effect of the previous weight modification on the actual modification.

42

Backpropagation

This method works very well for the output layer elements because we have access to both the output signal and the target signal. However, for the elements located in the hidden layers it unfortunately does not work directly.


43

Backpropagation

If the j-th neuron does not belong to the output layer but is an element of a hidden layer s, then its local weight modification is calculated from

d_j^s = -\frac{\partial D}{\partial E_j^s}    (15)

Because

\frac{\partial D}{\partial E_j^n} = \frac{\partial D}{\partial U_j^n} \cdot \frac{\partial U_j^n}{\partial E_j^n} = \frac{\partial D}{\partial U_j^n} f'(E_j^n)    (16)

44

Backpropagation

also

\frac{\partial D}{\partial U_j^n} = \sum_{k=1}^{N} \frac{\partial D}{\partial E_k^{n+1}} \cdot \frac{\partial E_k^{n+1}}{\partial U_j^n} = \sum_{k=1}^{N} \frac{\partial D}{\partial E_k^{n+1}} \cdot \frac{\partial}{\partial U_j^n} \sum_{i} T_{k,i}^{n+1} U_i^n = \sum_{k=1}^{N} \frac{\partial D}{\partial E_k^{n+1}} T_{k,j}^{n+1} = -\sum_{k=1}^{N} d_k^{n+1} T_{k,j}^{n+1}    (17)

then

d_j^n = -\frac{\partial D}{\partial E_j^n} = f'(E_j^n) \sum_{k=1}^{N} d_k^{n+1} T_{k,j}^{n+1}    (18)

45

Backpropagation

This modifies the weight between the i-th element of layer n-1 and the j-th element of layer n:

T_{j,i}^n(t) = T_{j,i}^n(t-1) + \tau \Delta T_{j,i}^n(t)    (19)

\Delta T_{j,i}^n(t) = \alpha \Delta T_{j,i}^n(t-1) + (1-\alpha)\, d_j^n f(E_i^{n-1})    (20)

The iterative calculations correct the weights "backwards", towards the input layer.

46

Backpropagation

This learning procedure is repeated for each learning pattern until the network generates the correct answer for every input signal (pattern) – of course, to within the defined accuracy.

47

Backpropagation

Example:

If

f(E) = \frac{1}{1 + \exp(-kE)}, \quad f(E) \in (0;1), \; k > 0

then when k \to \infty, f(E) tends to the step function, and

U_j^{out} = f(E_j^{out}) = \frac{1}{1 + \exp(-k E_j^{out})}

48

Backpropagation

Calculating the derivative f'(E) for the j-th element:

f'(E) = \frac{k e^{-kE}}{(1 + e^{-kE})^2} = k f(E)[1 - f(E)] = k U_j^{out} (1 - U_j^{out})

(this derivative simplifies the calculations)


49

Backpropagation

For the elements in the output layer

d_j^{out} = (U_j^* - U_j^{out}) \cdot f'(E_j^{out}) = k \cdot U_j^{out} (1 - U_j^{out}) \cdot (U_j^* - U_j^{out})

For the elements in the hidden layers

d_j^n = k \cdot U_j^n (1 - U_j^n) \cdot \sum_{k=1}^{N} d_k^{n+1} T_{k,j}^{n+1}
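Putting the pieces together, here is a compact sketch of one backpropagation training step for a network with one hidden layer, using the sigmoid above with slope k and the plain update (8) without the smoothing term; the array shapes and the toy usage are my assumptions:

```python
import numpy as np

def sigmoid(E, k=1.0):
    return 1.0 / (1.0 + np.exp(-k * E))

def backprop_step(x, target, T1, T2, eta=0.5, k=1.0):
    """One training step for a two-weight-layer network.
    T1: hidden weights, shape (H, N); T2: output weights, shape (M, H)."""
    # forward pass, (1)-(2)
    U1 = sigmoid(T1 @ x, k)                  # hidden outputs
    U2 = sigmoid(T2 @ U1, k)                 # network outputs
    # local errors (slide 49): output layer, then hidden layer via (18)
    d2 = k * U2 * (1 - U2) * (target - U2)   # d_j^out
    d1 = k * U1 * (1 - U1) * (T2.T @ d2)     # d_j^n
    # weight changes proportional to d_j * U_i^{n-1}, (8)
    T2 = T2 + eta * np.outer(d2, U1)
    T1 = T1 + eta * np.outer(d1, x)
    return T1, T2, U2

# toy usage: repeatedly present one pattern until the output approaches the target
rng = np.random.default_rng(0)
T1, T2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
for _ in range(1000):
    T1, T2, y = backprop_step(np.array([1.0, 0.0]), np.array([1.0]), T1, T2)
```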

50

Recurrent Multilayer Perceptron

(RMLP)

51

RMLP

A recurrent neural network architecture consists of a standard Multi-Layer Perceptron (MLP) plus added feedback loops.

52

RMLP

Assumption: only one input node (input signal x(t)), one output node (output signal y(t)), and one hidden layer.

The network input signal is composed of the input x, its delayed values, and the externally fed back output signal y(t) with some unit time delays.

[Figure: the input element x(t) and the output element y(t) each pass through delay units that feed the hidden layer.]

53

RMLP

Structure of the RMLP network

54

RMLP is a dynamic network with delayed input and output signals. The further analysis is performed for only one input node (signal x(t)), one output node (signal y(t)) and one hidden layer. The network realizes the function

y(t+1) = f(x(t), x(t-1), ..., x(t-(N-1)), y(t-1), y(t-2), ..., y(t-P))

where
N-1 – number of delayed input signals,
P – number of delayed output signals,
K – number of neurons in the hidden layer.

RMLP


55

The vector input signal x applied to the network input at time t:

x(t) = [1, x(t), x(t-1), ..., x(t-(N-1)), y(t-P), y(t-P+1), ..., y(t-1)]

Denote (1-4):

u_i = \sum_{j=0}^{N+P} w_{ij}^{(1)} x_j – weighted input signal of the i-th neuron of the hidden layer,

g = \sum_{i=0}^{K} w_i^{(2)} f(u_i) – weighted input signal of the output neuron,

v_i = f(u_i) – output signal of the i-th neuron of the hidden layer,

y = f(g) – output signal of the output neuron.

RMLP
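A sketch of the forward computation (1-4) for a single time step; how the delayed samples are buffered and passed in is my own choice:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def rmlp_step(x_hist, y_hist, W1, W2):
    """One RMLP time step, following formulas (1-4).
    x_hist : [x(t), x(t-1), ..., x(t-(N-1))]   (newest first)
    y_hist : [y(t-P), y(t-P+1), ..., y(t-1)]   (oldest first)
    W1     : hidden-layer weights, shape (K, 1+N+P); column 0 is the bias weight
    W2     : output weights, shape (1+K,); entry 0 is the bias weight
    """
    x_vec = np.concatenate(([1.0], x_hist, y_hist))  # network input vector x(t)
    v = sigmoid(W1 @ x_vec)                 # (1),(3): hidden sums u_i, outputs v_i
    g = W2 @ np.concatenate(([1.0], v))     # (2): weighted input of the output neuron
    return sigmoid(g)                       # (4): network output y(t+1)
```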

56

RMLP Learning Algorithm

The gradient learning algorithm is used. The gradient of an error function with respect to each of the network's weights is calculated. For the RMLP network the error function can be defined by

E(t) = \frac{1}{2} [y(t) - d(t)]^2

Differentiating with respect to each weight w_\alpha^{(2)} in the second layer:

RMLP

57

RMLP

\frac{\partial E(t)}{\partial w_\alpha^{(2)}} = [y(t) - d(t)] \frac{dy(t)}{dw_\alpha^{(2)}} = [y(t) - d(t)] \frac{df(g(t))}{dg(t)} \frac{d}{dw_\alpha^{(2)}} \left( \sum_{i=0}^{K} w_i^{(2)} v_i(t) \right) = [y(t) - d(t)] \frac{df(g(t))}{dg(t)} \left( v_\alpha(t) + \sum_{i=0}^{K} w_i^{(2)} \frac{dv_i(t)}{dw_\alpha^{(2)}} \right)

because

\frac{dw_i^{(2)}}{dw_\alpha^{(2)}} = \begin{cases} 1 & \text{for } i = \alpha \\ 0 & \text{for } i \neq \alpha \end{cases}

58

RMLP

\frac{dv_i(t)}{dw_\alpha^{(2)}} = \frac{df(u_i(t))}{du_i(t)} \sum_{j=0}^{N+P} w_{ij}^{(1)} \frac{dx_j}{dw_\alpha^{(2)}} = \frac{df(u_i(t))}{du_i(t)} \sum_{j=1}^{P} w_{i,j+N}^{(1)} \frac{dy(t-P+j-1)}{dw_\alpha^{(2)}}

(only the delayed outputs y(·) among the inputs x_j depend on the weights; the external inputs have zero derivative)

59

RMLP

Next,

\frac{dy(t)}{dw_\alpha^{(2)}} = \frac{df(g(t))}{dg(t)} \left( v_\alpha(t) + \sum_{i=0}^{K} w_i^{(2)} \frac{df(u_i(t))}{du_i(t)} \sum_{j=1}^{P} w_{i,j+N}^{(1)} \frac{dy(t-P+j-1)}{dw_\alpha^{(2)}} \right)

This formula enables calculating dy(t)/dw_\alpha^{(2)} in epoch t on the basis of its previous values.

60

RMLP

Using the gradient descent method, the weight change in the output layer is defined by

\Delta w_\alpha^{(2)} = -\eta [y(t) - d(t)] \frac{dy(t)}{dw_\alpha^{(2)}}    (5)

The weight change in the hidden layer can be calculated similarly, after calculating the derivative of the signal y(t) with respect to the weights w_{\alpha,\beta}^{(1)} of the hidden layer,


61

RMLP

\frac{dy(t)}{dw_{\alpha,\beta}^{(1)}} = \frac{df(g(t))}{dg(t)} \sum_{i=0}^{K} w_i^{(2)} \frac{df(u_i(t))}{du_i(t)} \left( \delta_{i\alpha} x_\beta(t) + \sum_{j=1}^{P} w_{i,j+N}^{(1)} \frac{dy(t-P+j-1)}{dw_{\alpha,\beta}^{(1)}} \right)

and the formula for the weight change in the hidden layer:

\Delta w_{\alpha,\beta}^{(1)} = -\eta [y(t) - d(t)] \frac{dy(t)}{dw_{\alpha,\beta}^{(1)}}    (6)

62

RMLP

Summary of the learning algorithm:

1. Initialize the weight vectors in the hidden layer and the output layer.

2. Evaluate the neuron states in epoch (t) with input x, using formulas (1-4).

3. Calculate the values of dy(t)/dw_\alpha^{(2)} and dy(t)/dw_{\alpha,\beta}^{(1)} for every \alpha, \beta.

4. Update the weights according to (5) and (6).

5. Go to step 2.

63

Elman network

64

Elman Network

An Elman network is an MLP with a single hidden layer which, in addition, contains connections from the hidden layer's neurons to context units. The context units store the output values of the hidden neurons in one time unit, and these values are fed as additional inputs to the hidden neurons in the next time unit.

65

The Elman Network – SRN (Simple Recurrent Network) – is a modification of the Multilayer Perceptron.

Elman Network

66

[Figure: the structure of an Elman Network, with a context layer feeding back into the hidden layer.]
N – number of external inputs
K – number of neurons in the hidden layer
M – number of neurons in the output layer

Elman Network


67

Vector input signal:

x(t) = [1, x_1^{(1)}(t), ..., x_N^{(1)}(t), x_{N+1}^{(1)}(t), ..., x_{N+K}^{(1)}(t)] = [1, x_1(t), ..., x_N(t), y_1(t-1), ..., y_K(t-1)]

The elements denoted j = N+1, ..., N+K come from the hidden layer from the previous epoch.

Elman Network

68

Notation

))((2 tgfy ii = output signal of the i-th neuron of

the output layer

∑=

=K

jjiji tvwtg

0

)2( )()( weighted input signal of the i-th

neuron of the output layer

))((1 tufv ii = output signal of the i-th neuron of a

hidden layer

∑+

=

=KN

jjiji txwtu

0

)1( )()( weighted input signal of the i-th

neuron in a hidden layer

Elman Network
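A sketch of one Elman forward step in this notation; the handling of the context vector and the choice of sigmoids for f1 and f2 are my assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elman_step(x_ext, v_prev, W1, W2):
    """One Elman time step.
    x_ext  : external inputs x_1(t)..x_N(t), shape (N,)
    v_prev : hidden outputs from the previous epoch (the context layer), shape (K,)
    W1     : hidden weights, shape (K, 1+N+K); W2 : output weights, shape (M, 1+K)
    """
    x_vec = np.concatenate(([1.0], x_ext, v_prev))  # input vector with context
    v = sigmoid(W1 @ x_vec)                         # hidden outputs v_i = f1(u_i(t))
    y = sigmoid(W2 @ np.concatenate(([1.0], v)))    # outputs y_i = f2(g_i(t))
    return y, v                                     # v becomes the next context

# usage: carry v across time steps
# y, v = elman_step(x_t, v, W1, W2)
```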

69

Elman learning algorithm

The network learns according to the steepest descent algorithm. Similarly to a feedforward network, the gradient of the cost function is calculated with respect to every network weight. For the Elman network the cost function can be defined by

E(t) = \frac{1}{2} \sum_{i=1}^{M} [y_i(t) - d_i(t)]^2 = \frac{1}{2} \sum_{i=1}^{M} [e_i(t)]^2

Differentiating this function with respect to any output layer weight w_{\alpha\beta}^{(2)} we obtain

Elman Network

70

\nabla E(t) = \frac{\partial E(t)}{\partial w_{\alpha\beta}^{(2)}} = \sum_{i=1}^{M} [y_i(t) - d_i(t)] \frac{df_2(g_i(t))}{dg_i(t)} \frac{d}{dw_{\alpha\beta}^{(2)}} \left( \sum_{j=0}^{K} w_{ij}^{(2)} v_j(t) \right) = \sum_{i=1}^{M} [y_i(t) - d_i(t)] \frac{df_2(g_i(t))}{dg_i(t)} \sum_{j=0}^{K} \left( \frac{dw_{ij}^{(2)}}{dw_{\alpha\beta}^{(2)}} v_j(t) + w_{ij}^{(2)} \frac{dv_j(t)}{dw_{\alpha\beta}^{(2)}} \right)

Because the connections between the hidden layer and the output layer are unidirectional,

\frac{dv_j(t)}{dw_{\alpha\beta}^{(2)}} = 0

Elman Network

71

which yields

\nabla E(t) = \frac{\partial E(t)}{\partial w_{\alpha\beta}^{(2)}} = \sum_{i=1}^{M} [y_i(t) - d_i(t)] \frac{df_2(g_i(t))}{dg_i(t)} \sum_{j=0}^{K} \frac{dw_{ij}^{(2)}}{dw_{\alpha\beta}^{(2)}} v_j(t) = [y_\alpha(t) - d_\alpha(t)] \frac{df_2(g_\alpha(t))}{dg_\alpha(t)} v_\beta(t)    (7)

According to the method of steepest descent, the weight changes in the output layer are defined by

w_{\alpha\beta}^{(2)}(t+1) = w_{\alpha\beta}^{(2)}(t) - \eta \nabla E(t)    (8)

Elman Network

72

Weight changes in the hidden layer are more complicated because of the feedback:

\nabla E(t) = \frac{\partial E(t)}{\partial w_{\alpha\beta}^{(1)}} = \sum_{i=1}^{M} [y_i(t) - d_i(t)] \frac{df_2(g_i(t))}{dg_i(t)} \frac{d}{dw_{\alpha\beta}^{(1)}} \left( \sum_{j=0}^{K} w_{ij}^{(2)} v_j(t) \right) = \sum_{i=1}^{M} [y_i(t) - d_i(t)] \frac{df_2(g_i(t))}{dg_i(t)} \sum_{j=0}^{K} w_{ij}^{(2)} \frac{dv_j(t)}{dw_{\alpha\beta}^{(1)}}

\frac{dv_j(t)}{dw_{\alpha\beta}^{(1)}} = \frac{df_1(u_j)}{du_j} \frac{d}{dw_{\alpha\beta}^{(1)}} \left( \sum_{m=0}^{N+K} w_{jm}^{(1)} x_m(t) \right) = \frac{df_1(u_j)}{du_j} \left( \delta_{j\alpha} x_\beta(t) + \sum_{m=0}^{N+K} w_{jm}^{(1)} \frac{dx_m(t)}{dw_{\alpha\beta}^{(1)}} \right)    (9)

Elman Network


73

and next

\frac{dv_j(t)}{dw_{\alpha\beta}^{(1)}} = \frac{df_1(u_j)}{du_j} \left( \delta_{j\alpha} x_\beta(t) + \sum_{m=N+1}^{N+K} w_{jm}^{(1)} \frac{dv_{m-N}(t-1)}{dw_{\alpha\beta}^{(1)}} \right) = \frac{df_1(u_j)}{du_j} \left( \delta_{j\alpha} x_\beta(t) + \sum_{m=1}^{K} w_{j,m+N}^{(1)} \frac{dv_m(t-1)}{dw_{\alpha\beta}^{(1)}} \right)    (10)

The last formula (10) allows calculating the derivatives of the cost function with respect to the weights of the hidden layer at moment t. It is a recurrent formula, defining the derivative at moment t in dependence on the derivative at moment t-1.

Elman Network

74

Assuming the initial values at the moment t = 0

\frac{dv_1(0)}{dw_{\alpha\beta}^{(1)}} = \frac{dv_2(0)}{dw_{\alpha\beta}^{(1)}} = ... = \frac{dv_K(0)}{dw_{\alpha\beta}^{(1)}} = 0

and basing on the steepest descent method, the weight change in the hidden layer is defined by

w_{\alpha\beta}^{(1)}(t+1) = w_{\alpha\beta}^{(1)}(t) - \eta \nabla E(t)    (11)

Elman Network

75

Elman Network

76

Real Time Recurrent Network (RTRN)

77

RTRN Network

The RTRN Network (Real Time Recurrent Network) is a network for real-time signal processing.

[Figure: network structure, with a context layer of K elements.]
N – number of inputs
K – number of neurons in the hidden layer
M – number of output neurons (taken from the K hidden elements)

78

u_i(t) = \sum_{j=0}^{N+K} w_{ij}^{(1)} x_j(t) – weighted input signal of the i-th neuron in the hidden layer    (12)

y_i = f(u_i(t)) – output signal of the i-th neuron in the hidden layer    (13)

The vector input signal x(t) and the vector y(t-1), delayed one cycle, create the network excitation signal

[1, x_1(t), x_2(t), ..., x_N(t), y_1(t-1), ..., y_K(t-1)]    (14)

RTRN Network
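A sketch of one RTRN time step under (12)-(14); reading the first M of the K neuron outputs as the network outputs, and the sigmoid for f, are my assumptions:

```python
import numpy as np

def rtrn_step(x_t, y_prev, W1, M):
    """One RTRN time step.
    x_t    : external inputs, shape (N,)
    y_prev : all K neuron outputs y(t-1), shape (K,)
    W1     : weights, shape (K, 1+N+K)
    """
    x_vec = np.concatenate(([1.0], x_t, y_prev))  # excitation signal, (14)
    u = W1 @ x_vec                                # (12): weighted input sums
    y = 1.0 / (1.0 + np.exp(-u))                  # (13): sigmoid assumed for f
    return y, y[:M]                               # full state, network outputs
```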


79

This is a special case of the Elman network with constant output weights w^{(2)}, defined by

w_{ij}^{(2)} = \delta_{ij} = \begin{cases} 1 & \text{for } i = j \\ 0 & \text{for } i \neq j \end{cases}

Modifying Elman's algorithm, one obtains a new learning algorithm for this network.

RTRN Network

80

RTRN training algorithm

1. Initialization of the weights (usually drawn from the uniform distribution on [-1;1]).

2. Calculation of the neurons' states for t = 0, 1, ... (signals u_i and y_i, formulas (12) and (13)), and next of the vector excitation signal (14).

3. Calculation of the derivatives dy_j(t)/dw_{\alpha\beta}, starting from dy_j/dw_{\alpha\beta} = 0, according to

\frac{dy_j(t)}{dw_{\alpha\beta}} = \frac{df(u_j)}{du_j} \left( \delta_{j\alpha} x_\beta(t) + \sum_{m=1}^{K} w_{j,m+N} \frac{dy_m(t-1)}{dw_{\alpha\beta}} \right)

RTRN Network

81

4. Update of the weights according to the steepest descent algorithm:

w_{\alpha\beta}(t+1) = w_{\alpha\beta}(t) - \eta \sum_{j=1}^{M} [y_j(t) - d_j(t)] \frac{dy_j(t)}{dw_{\alpha\beta}}

for \alpha = 1, 2, ..., K and \beta = 0, 1, 2, ..., N+K.

5. Go to step 2.

RTRN training algorithm

82

Radial Networks

83

Radial networks

For special purposes we sometimes use radial neurons, i.e. neurons with a Radial Basis Function (RBF). They have a non-typical aggregation method for the input data, they use a non-typical transfer function (the Gauss function), and they are trained in a special way. Data aggregation consists of calculating the distance between an input signal (vector X) and the centroid of a certain set T, established in the learning process.

Network structure

84

Sigmoid and radial neurons

A sigmoidal neuron represents, in multidimensional space, a hyperplane separating the space into two categories (Fig. A). A radial neuron represents a hypersphere, performing circular separation around a central point (Fig. B).

Network structure


85

A radial basis function network (RBF network) uses radial basis functions as activation functions and typically has three layers: an input layer, a hidden layer with a non-linear RBF activation function, and a linear output layer.

Functions that depend only on the distance from a center vector are radially symmetric about that vector, hence the name radial basis function. In the basic form, all inputs are connected to each hidden neuron. The norm is typically taken to be the Euclidean distance and the radial basis function is commonly taken to be Gaussian.

Radial basis function network

86

The output of the network is then a scalar function of the input vector, given by

y(x) = \sum_{i=1}^{N} a_i \rho(\lVert x - c_i \rVert)

where N is the number of neurons in the hidden layer, c_i is the center vector for neuron i, and a_i is the weight of neuron i in the linear output neuron. Functions that depend only on the distance from a center vector are radially symmetric about that vector. The radial basis function \rho is commonly taken to be Gaussian.

Radial basis function network
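A sketch of this output computation with Gaussian basis functions; the width parameter beta is my addition (the source fixes only the general Gaussian form):

```python
import numpy as np

def rbf_output(x, centers, a, beta=1.0):
    """RBF network output: a weighted sum of Gaussians of the Euclidean
    distance between the input x and each center c_i.
    centers : shape (N, D); a : linear output weights, shape (N,)
    beta    : assumed Gaussian width parameter
    """
    dist2 = np.sum((centers - x) ** 2, axis=1)   # squared distances ||x - c_i||^2
    return a @ np.exp(-beta * dist2)             # Gaussian radial basis functions

y = rbf_output(np.array([0.5, 0.5]),
               centers=np.array([[0.0, 0.0], [1.0, 1.0]]),
               a=np.array([1.0, -1.0]))
```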