
INTRODUCTION TO NEURAL NETWORKS

Complex computations: Mach’s Bands

Observe the transitions among the bands

Complex computations: Mach’s Bands

From: R. Pierantoni, La trottola di Prometeo, Laterza (1996)

Complex computations: Mach's Bands

Observe the transitions among the bands

[Figure: "Stimulus" and "Percept" intensity profiles across the bands]

Complex computations: Mach’s Bands

A simple model of the retina neuron

[Figure: Linear light-to-potential transducer. Potential (mV) vs. incident intensity (photons/s); diagram: Light -> neuron -> Potential]

Neuron transduction

[Figure: input profile in photons/s and resulting output profile in mV across a row of 19 neurons]

Adding lateral inhibition

[Figure: input profile in photons/s across the row of 19 neurons, with lateral inhibitory connections between neighbors]

Each neuron inhibits its neighbors by 10% of its own uninhibited potential.

Example (flat region at 160): 160 - 0.1*160 - 0.1*160 = 128 mV

Adding lateral inhibition

[Figure: input profile (photons/s) and inhibited output profile (mV) across the row of 19 neurons]

Each neuron inhibits its neighbors by 10% of its own uninhibited potential. At the step between the 160 and 40 regions:

160 - 0.1*160 - 0.1*160 = 128   (bright side, far from the edge)
160 - 0.1*160 - 0.1*40  = 140   (bright side, at the edge)
40  - 0.1*160 - 0.1*40  = 20    (dark side, at the edge)
40  - 0.1*40  - 0.1*40  = 32    (dark side, far from the edge)
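The following short sketch (my own code, not part of the original slides) reproduces this lateral-inhibition computation on a 19-neuron row with a 160/40 step input:

# A minimal sketch of the lateral-inhibition rule described above: each neuron
# loses 10% of each neighbor's uninhibited potential, producing the Mach-band
# overshoot and undershoot at the edge.
import numpy as np

def lateral_inhibition(potential, k=0.1):
    """Subtract k times each neighbor's uninhibited potential."""
    left = np.roll(potential, 1)
    right = np.roll(potential, -1)
    left[0] = potential[0]       # edge neurons: treat the missing neighbor
    right[-1] = potential[-1]    # as having the same potential
    return potential - k * left - k * right

# Step stimulus: 160 mV on the left half, 40 mV on the right half
p = np.array([160.0] * 10 + [40.0] * 9)
print(lateral_inhibition(p))
# -> 128 ... 128, 140, 20, 32 ... 32  (overshoot/undershoot at the edge)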

Many identical computing units, each one performing very simple operations, can perform very complex computations when they are widely and specifically connected.

The “knowledge” is stored in the topology and in the strength of the synapses


Model Neuron: McCulloch and Pitts

A neuron is a computational unit that:

1) performs the weighted sum of the input signals, computing the activation signal (a);

2) transforms the activation signal through a transfer function g, computing the output z.

a = \sum_{i=1}^{d} w_i x_i - \theta, \qquad z = g(a)

w_i: synaptic weights; \theta: activation threshold

Transfer functions

[Figure: sigmoid transfer function, ranging from 0 to 1 over a in [-10, 10]]

g(a) = \frac{1}{1 + e^{-a}}

Usually, NON-linear functions are adopted.

Non linearity

[Figure: sigmoid transfer function over a in [-10, 10]]

The same variation in the input can produce very different variations in the transferred signal.
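A tiny numerical illustration of this point (my addition, using the sigmoid defined above):

# The same input step of 1 changes the sigmoid output a lot around a = 0
# and almost not at all in the saturated region.
import math

def g(a):
    return 1.0 / (1.0 + math.exp(-a))

print(g(1) - g(0))   # ~0.23   large change near the center
print(g(9) - g(8))   # ~0.0002 negligible change in the saturated region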

Artificial Neural Networks

W_ij: synaptic weights

[Diagram: neuron i receiving its inputs through the weights W_ij]

a_i = \sum_{j=1}^{d} w_{ji} x_j - \theta_i, \qquad z_i = g(a_i)

The threshold can be implicitly taken into account by adding an extra neuron, always activated (output fixed to 1) and connected to the current neuron with weight equal to -\theta.
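In formulas (my addition, using the same notation): with an extra input x_0 fixed to 1 and weight w_{0i} = -\theta_i, the threshold is absorbed into the sum:

a_i = \sum_{j=1}^{d} w_{ji} x_j - \theta_i = \sum_{j=0}^{d} w_{ji} x_j, \qquad x_0 = 1,\; w_{0i} = -\theta_i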

Topology of artificial neural networks

The topology of the connections among the neurons defines the network class. We will take into consideration only the feed-forward architectures, where the neurons are organized into hierarchical layers and the signal flows in just one direction.

Perceptrons: 2 layers, Input and Output

z_j = g\left( \sum_i w_{ij} x_i - \theta_j \right)

Neural networks and logical operators

[Diagram: inputs 1 and 2 feeding output neuron 3]

OR: w13 = 0.5, w23 = 0.5, θ3 = 0.25

x1 x2 | a3    z3
1  0  |  0.25  1
0  1  |  0.25  1
1  1  |  0.75  1
0  0  | -0.25  0

[Diagram: inputs 1 and 2 feeding output neuron 3]

AND: w13 = 0.5, w23 = 0.5, θ3 = 0.75

x1 x2 | a3    z3
1  0  | -0.25  0
0  1  | -0.25  0
1  1  |  0.25  1
0  0  | -0.75  0

Neural networks and logical operators

[Diagram: inputs 1 and 2 feeding output neuron 3]

NOT(1): w13 = -0.5, w23 = 0.1, θ3 = -0.25

x1 x2 | a3    z3
1  0  | -0.25  0
0  1  |  0.35  1
1  1  | -0.15  0
0  0  |  0.25  1
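The following sketch (my own code, using a step transfer function as in the linear-separability discussion later) checks the three weight settings above:

# Check of the OR, AND and NOT(1) perceptrons with a step transfer function.
def perceptron(x, w, theta):
    a = sum(wi * xi for wi, xi in zip(w, x)) - theta
    return 1 if a >= 0 else 0

gates = {
    "OR":     ([0.5, 0.5], 0.25),
    "AND":    ([0.5, 0.5], 0.75),
    "NOT(1)": ([-0.5, 0.1], -0.25),
}
for name, (w, theta) in gates.items():
    print(name, [perceptron((x1, x2), w, theta) for x1 in (0, 1) for x2 in (0, 1)])
# OR     [0, 1, 1, 1]   for inputs (0,0), (0,1), (1,0), (1,1)
# AND    [0, 0, 0, 1]
# NOT(1) [1, 1, 0, 0]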


Supervised artificial neural networks

Feed-forward artificial neural networks can be trained starting from examples with known solutions.

Error function: given a set of examples x^i with known desired outputs d^i, and given a network with parameters w, the squared error is computed from the output of the network z (j runs over the output neurons):

E = \frac{1}{2} \sum_{i,j} \left[ z_j(x^i, w) - d_j^i \right]^2

The training procedure consists in finding the parameters w that minimize the error: iterative minimization algorithms are adopted. However, they do NOT guarantee reaching the global minimum.

Training a perceptron

We consider a differentiable transfer function

g(a) = \frac{1}{1 + e^{-a}}, \qquad g'(a) = \frac{e^{-a}}{(1 + e^{-a})^2} = g(a)\,[1 - g(a)]

Given some initial parameters w:

[Diagram: a perceptron with inputs x_1, x_2 and output neurons z_1, z_2, with desired outputs d^i]

z_j = g(a_j), \qquad a_j = \sum_l w_{lj}\, x_l - \theta_j, \qquad E = \frac{1}{2} \sum_{i,j} \left[ z_j(x^i, w) - d_j^i \right]^2

\frac{\partial E}{\partial w_{lj}} = \sum_i \frac{\partial E}{\partial z_j(x^i, w)} \cdot \frac{\partial z_j(x^i, w)}{\partial a_j(x^i, w)} \cdot \frac{\partial a_j(x^i, w)}{\partial w_{lj}}

with

\frac{\partial E}{\partial z_j(x^i, w)} = z_j(x^i, w) - d_j^i, \qquad \frac{\partial z_j(x^i, w)}{\partial a_j(x^i, w)} = g'(a_j^i), \qquad \frac{\partial a_j(x^i, w)}{\partial w_{lj}} = x_l^i

Deviation: \delta_j^i \equiv \left[ z_j(x^i, w) - d_j^i \right] g'(a_j^i)

Then

\frac{\partial E}{\partial w_{lj}} = \sum_i \left[ z_j(x^i, w) - d_j^i \right] g'(a_j^i)\, x_l^i = \sum_i \delta_j^i\, x_l^i

Using the gradient, we can update the weights with the "steepest descent" procedure:

w_{lj} \leftarrow w_{lj} - \eta \frac{\partial E}{\partial w_{lj}}

\eta is the learning rate. Too low: slow training. Too high: the minima can be missed.

Convergence: \frac{\partial E}{\partial w_{lj}} \rightarrow 0

Training a perceptron

Steepest descent finds the LOCAL minimum of a function by always pointing in the direction that leads downhill.

Objective function f: R^n -> R, with f(x) of class C^2.

Gradient of f: the vector containing all the first-order partial derivatives.

Gradient

Given a function f(x,y) and a level curve f(x,y) = c, the gradient of f is:

\nabla f = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right)

Consider two points of the curve, (x, y) and (x + \epsilon_x, y + \epsilon_y), for small \epsilon.

The gradient is locally perpendicular to the level curves.

[Diagram: level curve f(x,y) = c, the points (x, y) and (x + \epsilon_x, y + \epsilon_y), and the vectors \vec{\epsilon} and grad(f)]

f(x + \epsilon_x, y + \epsilon_y) \simeq f(x, y) + \frac{\partial f}{\partial x}\Big|_{(x,y)} \epsilon_x + \frac{\partial f}{\partial y}\Big|_{(x,y)} \epsilon_y = f(x, y) + \nabla f\big|_{(x,y)}^T\, \vec{\epsilon}

The local perpendicular to a curve: Gradient

Since both points, (x, y) and (x + \epsilon_x, y + \epsilon_y), satisfy the curve equation:

c \simeq c + \nabla f\big|_{(x,y)}^T\, \vec{\epsilon} \quad \Rightarrow \quad \nabla f\big|_{(x,y)}^T\, \vec{\epsilon} = 0

The gradient is perpendicular to \vec{\epsilon}. For small \vec{\epsilon}, \vec{\epsilon} is parallel to the curve and, as a consequence, the gradient is perpendicular to the curve.

The gradient points towards the direction of maximum increase of f.

Steepest descent finds the LOCAL minimum of a function by always pointing in the direction that leads downhill.

Example: OR

[Diagram: inputs 1 and 2 feeding output neuron 3]

w13 = 0, w23 = 0, θ3 = 0, η = 2

Training examples:

x1 x2 d | a     z    E     dE/dw13 dE/dw23 dE/dθ3
1  0  1 | 0     0.5  0.125 -0.125   0       0.125
0  1  1 | 0     0.5  0.125  0      -0.125   0.125
0  0  0 | 0     0.5  0.125  0       0      -0.125
0  0  0 | 0     0.5  0.125  0       0      -0.125
Total   |            0.5   -0.125  -0.125   0

g(a) = \frac{1}{1 + e^{-a}}, \qquad g'(a) = g(a)\,[1 - g(a)] = z\,(1 - z)

\frac{\partial E}{\partial w_{lj}} = \sum_i \left[ z_j(x^i) - d_j^i \right] g'(a_j^i)\, x_l^i = \sum_i \delta_j^i\, x_l^i
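The following sketch (my own code, not part of the slides) implements this update with η = 2 and reproduces the weights shown in the step tables that follow:

# One steepest-descent step for the OR perceptron of this example.
import math

def g(a):
    return 1.0 / (1.0 + math.exp(-a))

# Training set used on the slides: (1,0)->1, (0,1)->1 and (0,0)->0 twice.
examples = [((1, 0), 1), ((0, 1), 1), ((0, 0), 0), ((0, 0), 0)]

def train_step(w13, w23, theta3, eta=2.0):
    gw13 = gw23 = gtheta = 0.0
    for (x1, x2), d in examples:
        a = w13 * x1 + w23 * x2 - theta3
        z = g(a)
        delta = (z - d) * z * (1 - z)   # deviation: delta = (z - d) g'(a)
        gw13 += delta * x1              # dE/dw13
        gw23 += delta * x2              # dE/dw23
        gtheta += -delta                # dE/dtheta3 (a depends on -theta3)
    return w13 - eta * gw13, w23 - eta * gw23, theta3 - eta * gtheta

w13 = w23 = theta3 = 0.0
for step in range(3):
    w13, w23, theta3 = train_step(w13, w23, theta3)
    print(step + 1, round(w13, 3), round(w23, 3), round(theta3, 3))
# 1 0.25  0.25  0.0
# 2 0.466 0.466 0.069
# 3 0.659 0.659 0.164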

Example: OR, Step 1

[Diagram: inputs 1 and 2 feeding output neuron 3]

w13 = 0.25, w23 = 0.25, θ3 = 0, η = 2

Training examples:

x1 x2 d | a     z     E     dE/dw13 dE/dw23 dE/dθ3
1  0  1 | 0.25  0.56  0.096 -0.108   0       0.108
0  1  1 | 0.25  0.56  0.096  0      -0.108   0.108
0  0  0 | 0     0.5   0.125  0       0      -0.125
0  0  0 | 0     0.5   0.125  0       0      -0.125
Total   |             0.442 -0.108  -0.108  -0.035

Example: OR, Step 2

[Diagram: inputs 1 and 2 feeding output neuron 3]

w13 = 0.466, w23 = 0.466, θ3 = 0.069, η = 2

Training examples:

x1 x2 d | a      z     E     dE/dw13 dE/dw23 dE/dθ3
1  0  1 |  0.397 0.598 0.081 -0.097   0       0.097
0  1  1 |  0.397 0.598 0.081  0      -0.097   0.097
0  0  0 | -0.069 0.483 0.117  0       0      -0.121
0  0  0 | -0.069 0.483 0.117  0       0      -0.121
Total   |              0.395 -0.097  -0.097  -0.048

Example: OR, Step 3

[Diagram: inputs 1 and 2 feeding output neuron 3]

w13 = 0.659, w23 = 0.659, θ3 = 0.164, η = 2

Training examples:

x1 x2 d | a      z     E     dE/dw13 dE/dw23 dE/dθ3
1  0  1 |  0.494 0.621 0.072 -0.089   0       0.089
0  1  1 |  0.494 0.621 0.072  0      -0.089   0.089
0  0  0 | -0.164 0.459 0.105  0       0      -0.114
0  0  0 | -0.164 0.459 0.105  0       0      -0.114
Total   |              0.354 -0.089  -0.089  -0.05

Generalization

[Diagram: inputs 1 and 2 feeding output neuron 3]

w13 = 0.659, w23 = 0.659, θ3 = 0.164, η = 2

And what happens for the input (1,1)?

x1 x2 d | a     z
1  1  1 | 1.153 0.760

The network generalized the rules learned from known examples
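A quick check of this value (my addition, reusing g from the training sketch above):

# Generalization check with the step-3 parameters:
a = 0.659 + 0.659 - 0.164   # activation for input (1, 1)
print(round(g(a), 2))       # ~0.76, i.e. close to the OR output 1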

Linear separability

Given a step-like transfer function, the output neuron of a perceptron is activated if the activation is positive:

a \geq 0 \quad \Longleftrightarrow \quad \sum_{i=1}^{d} w_i x_i - \theta \geq 0

The input space is then divided into two regions.

If the requested mapping cannot be separated by a hyperplane, the perceptron is insufficient.

Linear separability

[Figure: input planes for AND, OR and NOT(1), each separable by a line, and for XOR, which is not]

The XOR problem cannot be solved with a perceptron.

Multi-layer feed-forward neural networks

Neurons are organized into hierarchical layers

Each layer receives its inputs from the previous one and transmits its output to the next one.

[Diagram: input layer x, hidden layer with weights w^1_{ij}, output layer with weights w^2_{ij}]

z^1_j = g\left( \sum_i w^1_{ij} x_i - \theta^1_j \right), \qquad z^2_j = g\left( \sum_i w^2_{ij} z^1_i - \theta^2_j \right)

XOR

[Diagram: inputs x_1, x_2, hidden neurons 1^(1) and 2^(1), output neuron 1^(2), with weights w^1_{11}, w^1_{21}, w^1_{12}, w^1_{22}, w^2_{11}, w^2_{21}]

w^1_{11} = 0.7, w^1_{21} = 0.7, θ^1_1 = 0.5
w^1_{12} = 0.3, w^1_{22} = 0.3, θ^1_2 = 0.5
w^2_{11} = 0.7, w^2_{21} = -0.7, θ^2_1 = 0.5

x1 x2 | a^1_1  z^1_1 | a^1_2  z^1_2 | a^2_1  z^2_1
0  0  | -0.5   0     | -0.5   0     | -0.5   0
1  0  |  0.2   1     | -0.2   0     |  0.2   1
0  1  |  0.2   1     | -0.2   0     |  0.2   1
1  1  |  0.9   1     |  0.1   1     | -0.5   0

The hidden layer maps the input into a new representation that is linearly separable.

Input | Desired output | Activation of hidden neurons
0 0   | 0              | 0 0
1 0   | 1              | 1 0
0 1   | 1              | 1 0
1 1   | 0              | 1 1
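A minimal sketch of this XOR network (my own code, using a step transfer function):

def step(a):
    return 1 if a >= 0 else 0

def xor_net(x1, x2):
    # Hidden layer
    z1_1 = step(0.7 * x1 + 0.7 * x2 - 0.5)   # active when at least one input is on
    z1_2 = step(0.3 * x1 + 0.3 * x2 - 0.5)   # active only when both inputs are on
    # Output layer
    return step(0.7 * z1_1 - 0.7 * z1_2 - 0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))
# 0 0 -> 0, 0 1 -> 1, 1 0 -> 1, 1 1 -> 0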


Training of multilayer network: Back-propagation

[Diagram: two-layer network with weights w^1_{ij} (input to hidden) and w^2_{ij} (hidden to output)]

For layer 2, the perceptron formula holds, upon the substitution x_l \rightarrow z^1_{l}:

\frac{\partial E}{\partial w^2_{lj}} = \sum_i \left[ z_j(x^i, w) - d_j^i \right] g'(a^2_{j,i})\, z^1_{l,i} = \sum_i \delta^2_{j,i}\, z^1_{l,i}

For layer 1:

\frac{\partial E}{\partial w^1_{lj}} = \sum_i \frac{\partial E}{\partial a^1_{j,i}} \cdot \frac{\partial a^1_{j,i}}{\partial w^1_{lj}} = \sum_i \delta^1_{j,i}\, x^i_l

\frac{\partial E}{\partial a^1_{j,i}} = \sum_k \frac{\partial E}{\partial a^2_{k,i}} \cdot \frac{\partial a^2_{k,i}}{\partial a^1_{j,i}} = \sum_k \delta^2_{k,i}\, \frac{\partial a^2_{k,i}}{\partial a^1_{j,i}}

Since a^2_{k,i} = \sum_m w^2_{mk}\, g(a^1_{m,i}) - \theta^2_k, we have \frac{\partial a^2_{k,i}}{\partial a^1_{j,i}} = g'(a^1_{j,i})\, w^2_{jk}.

Defining \delta^1_{j,i} \equiv \frac{\partial E}{\partial a^1_{j,i}}:

\delta^1_{j,i} = g'(a^1_{j,i}) \sum_k \delta^2_{k,i}\, w^2_{jk}

Training of multilayer network: Back-propagation

1) Compute z_l for each example (feed-forward step);

2) Compute the deviation on the output layer, \delta^2_l;

3) Compute the deviation on the hidden layer, \delta^1_j;

4) Compute the gradient of the error with respect to the weights;

5) Update the weights with the steepest-descent method (a sketch in code follows below).

[Diagram: the network, from Input to Output]

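A minimal back-propagation sketch (my own code, not from the slides; the array shapes and the XOR usage at the end are illustrative assumptions), following the five steps above:

import numpy as np

def g(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, d, w1, th1, w2, th2, eta=0.5):
    """One steepest-descent update on a batch of examples x with targets d."""
    # 1) Feed-forward
    a1 = x @ w1 - th1          # hidden activations, shape (n_examples, n_hidden)
    z1 = g(a1)
    a2 = z1 @ w2 - th2         # output activations
    z2 = g(a2)
    # 2) Deviation on the output layer: delta2 = (z2 - d) g'(a2)
    delta2 = (z2 - d) * z2 * (1.0 - z2)
    # 3) Deviation on the hidden layer: delta1 = g'(a1) * sum_k delta2_k w2_jk
    delta1 = z1 * (1.0 - z1) * (delta2 @ w2.T)
    # 4) Gradients of E with respect to the weights and the thresholds
    gw2, gth2 = z1.T @ delta2, -delta2.sum(axis=0)
    gw1, gth1 = x.T @ delta1, -delta1.sum(axis=0)
    # 5) Steepest-descent update
    return w1 - eta * gw1, th1 - eta * gth1, w2 - eta * gw2, th2 - eta * gth2

# Usage sketch: a 2-2-1 network trained on XOR from a random initialization.
rng = np.random.default_rng(0)
x = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
d = np.array([[0.], [1.], [1.], [0.]])
w1, th1 = rng.normal(size=(2, 2)), np.zeros(2)
w2, th2 = rng.normal(size=(2, 1)), np.zeros(1)
for _ in range(20000):
    w1, th1, w2, th2 = backprop_step(x, d, w1, th1, w2, th2)
print(g(g(x @ w1 - th1) @ w2 - th2).round(2))
# often close to [[0], [1], [1], [0]]; as noted earlier, reaching the global
# minimum is not guaranteed, so some initializations may fail.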

What does a neural network learn?

Consider the ideal case, consisting of a continuous set of examples x, each one appearing with frequency P(x). The desired solutions d are associated to the input with probability P(d | x).

E = \frac{1}{2} \sum_j \int \left[ z_j(x, w) - d_j \right]^2 P(d_j | x)\, P(x)\, \mathrm{d}d_j\, \mathrm{d}x

Training, after convergence (functional derivative):

\frac{\delta E}{\delta z_j(x, w)} = 0

0 = \int \left[ z_j(x', w) - d_j \right] P(d_j | x')\, P(x')\, \delta(x - x')\, \mathrm{d}x'\, \mathrm{d}d_j \quad \Longrightarrow \quad z_j(x, w) = \int d_j\, P(d_j | x)\, \mathrm{d}d_j

The activation state of the j-th output neuron is equal to the average of the solutions associated to the input x in the training set.
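A tiny numerical check of this statement (my addition): if the same input appears with several different targets, the output value that minimizes the squared error is their average.

import numpy as np

targets = np.array([0.0, 1.0, 1.0])               # targets seen for one input
z = np.linspace(0, 1, 1001)                       # candidate output values
errors = 0.5 * ((z[:, None] - targets) ** 2).sum(axis=1)
print(z[errors.argmin()], targets.mean())         # both ~0.667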

Neural Networks for classification and regression

Networks can be used for classification or for regression

In regression: desired outputs are real numbers.
In classification: desired outputs are 0 or 1.

Error function:

E = \frac{1}{2} \sum_{i,j} \left[ z_j(x^i, w) - y_j^i \right]^2

Increasing the number of hidden neurons increases the number of parameters and therefore increases the risk of overfitting the learning data.

Neural Networks and overfitting

1) Be sure that the number of parameters is far lower than the number of points to learn

(What is the number of parameters of a network with n inputs, k outputs and r hidden neurons?)

2) Use regularizers (if possible). E.g.

E = \frac{1}{2} \sum_{i,j} \left[ z_j(x^i, w) - y_j^i \right]^2 + \lambda \sum_{k,ij} \left( w^k_{ij} \right)^2

Many other formulations are possible.
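As a remark (my addition, using the \lambda written above): the regularizer only adds a term proportional to the weight itself to each gradient component, so the steepest-descent update also shrinks the weights ("weight decay"):

\frac{\partial E}{\partial w^k_{ij}} = \left( \text{data term} \right) + 2 \lambda\, w^k_{ij}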

3) Always use an independent test set to decide when to stop the training [EARLY STOPPING]

(and then validate the method on a third independent set)

[Figure: training and test error during training; the test error has a minimum at the iteration where the training should be stopped ("Stop the training at this iteration")]
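A minimal early-stopping sketch (my own code; update and test_error are hypothetical callables standing for one training step and the error on the independent test set):

def train_with_early_stopping(params, update, test_error, patience=20):
    """Keep the parameters with the lowest test-set error; stop when it no longer improves."""
    best_params, best_err, bad_steps = params, test_error(params), 0
    while bad_steps < patience:
        params = update(params)        # one steepest-descent step
        err = test_error(params)       # error on the independent test set
        if err < best_err:
            best_params, best_err, bad_steps = params, err, 0
        else:
            bad_steps += 1             # no improvement on the test set
    return best_params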

Can we add more layers?

Back-propagation is not suitable for training networks as the number of layers increases: DEEP LEARNING PROCEDURES ARE NEEDED.

Stuttgart Neural Network Simulator (SNNS): http://www.ra.cs.uni-tuebingen.de/SNNS/

OpenNN: http://www.opennn.net/

THEANO: http://deeplearning.net/software/theano/

More on: https://grey.colorado.edu/emergent/index.php/Comparison_of_Neural_Network_Simulators