INTRODUCTION TO NEURAL NETWORKS
Complex computations: Mach Bands
Observe the transitions among the bands.
Complex computations: Mach Bands
From: R. Pierantoni, La trottola di Prometeo, Laterza (1996)
Complex computations: Mach Bands
Observe the transitions among the bands.
[Figure: intensity profiles of the stimulus and of the percept]
Complex computations: Mach Bands
A simple model of the retina neuron: a linear light-to-potential transducer.
[Figure: potential (mV, 0-250) vs. incident intensity (photons/s, 0-100), a straight line]
Neuron transduction
[Figure: a step-like input light profile (photons/s, 0-100) across neurons 1-19, and the corresponding linearly transduced potentials (mV, 0-200)]
Adding lateral inhibition
[Figure: the same step-like input profile (photons/s) across neurons 1-19, now processed with lateral inhibition]
Each neuron inhibits its neighbors by 10% of its own uninhibited potential. For an interior neuron in the bright region:
160 - 0.1*160 - 0.1*160 = 128
Adding lateral inhibition
[Figure: input profile (photons/s) and inhibited potentials (mV) across neurons 1-19; the output overshoots at the bright edge and undershoots at the dark edge]
Each neuron inhibits its neighbors by 10% of its own uninhibited potential:
160 - 0.1*160 - 0.1*40 = 140 (bright neuron at the edge)
40 - 0.1*160 - 0.1*40 = 20 (dark neuron at the edge)
40 - 0.1*40 - 0.1*40 = 32 (dark neuron, interior)
160 - 0.1*160 - 0.1*160 = 128 (bright neuron, interior)
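This computation is easy to verify in code. Below is a minimal sketch (not from the slides) of the lateral-inhibition model; the treatment of the edge neurons, clamped to their own value where a neighbor is missing, is our assumption:

```python
# Minimal sketch of the lateral-inhibition model described above.
# Each neuron's potential is reduced by 10% of the *uninhibited*
# potential of each of its two neighbors. Edge handling (a missing
# neighbor counts as the neuron itself) is an assumption of ours.

def lateral_inhibition(potentials, k=0.1):
    n = len(potentials)
    out = []
    for i, p in enumerate(potentials):
        left = potentials[i - 1] if i > 0 else p
        right = potentials[i + 1] if i < n - 1 else p
        out.append(p - k * left - k * right)
    return out

# A step of intensity, as in the slides: dark band at 40, bright at 160.
profile = [40] * 9 + [160] * 10
print(lateral_inhibition(profile))
# Interior neurons give 32 (dark) and 128 (bright); the neurons at the
# step give 20 and 140: the over/undershoot that produces the Mach bands.
```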
Many identical computing units, each one performing very simple operations, can perform very complex computations when they are widely and specifically connected.
The “knowledge” is stored in the topology and in the strength of the synapses
Complex computations: Mach Bands
Model Neuron: McCulloch and Pitts
A neuron is a computational unit that:
1) performs the weighted sum of the input signals, computing the activation signal a;
2) transforms the activation signal through a transfer function g, computing the output z.

a = \sum_{i=1}^{d} w_i x_i - \theta, \qquad z = g(a)

w: synaptic weights; \theta: activation threshold
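As a concrete illustration, here is a minimal sketch of this model neuron in Python; the function names are ours, and the sigmoid of the next slide is used as transfer function g:

```python
import math

# A McCulloch-Pitts-style neuron as defined above:
# a = sum_i w_i x_i - theta, z = g(a).

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def neuron(x, w, theta, g=sigmoid):
    a = sum(wi * xi for wi, xi in zip(w, x)) - theta  # activation
    return g(a)                                        # output z

print(neuron([1.0, 0.0], w=[0.5, 0.5], theta=0.25))   # ~0.56
```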
Transfer functions
[Figure: sigmoid transfer function, rising from 0 to 1 over a in (-10, 10)]

g(a) = \frac{1}{1 + e^{-a}}

Usually, NON-linear functions are adopted.
Non linearity
[Figure: sigmoid transfer function between -10 and 10]
The same variation in the input can give very different variations in the transferred signal, depending on the operating point, as the sketch below shows.
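A small numerical sketch of this point, using the sigmoid above, applying the same input variation at different operating points:

```python
import math

def g(a):
    return 1.0 / (1.0 + math.exp(-a))

# The same input variation (delta a = 1) at different operating points:
for a in (-6.0, 0.0, 6.0):
    print(a, round(g(a + 1.0) - g(a), 4))
# Near a = 0 the output changes by ~0.23; in the tails it barely moves:
# the response depends on the operating point.
```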
Artificial Neural Networks
W_ij: synaptic weights
Neuron i:

a = \sum_{i=1}^{d} w_i x_i - \theta, \qquad z = g(a)

The threshold can be implicitly considered by adding an extra neuron, always activated, connected to the current neuron with weight equal to -\theta.
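A one-line check of this threshold trick (the values are illustrative):

```python
# Append a constant input of 1 (the "always-activated extra neuron")
# with weight -theta: the activation is unchanged.

x, w, theta = [1.0, 0.0], [0.5, 0.5], 0.25

a_explicit = sum(wi * xi for wi, xi in zip(w, x)) - theta
a_implicit = sum(wi * xi for wi, xi in zip(w + [-theta], x + [1.0]))
print(a_explicit == a_implicit)  # True
```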
Topology of artificial neural networks
The topology of the connections among the neurons defines the network class. We will take into consideration only the feed-forward architectures, where the neurons are organized into hierarchical layers and the signal flows in just one direction.
Perceptrons: 2 layers, Input and Output (w_ij: weight from input i to output j)

z_j = g\left( \sum_i w_{ij} x_i - \theta_j \right)
Neural networks and logical operators
[Diagram: input neurons 1 and 2 feeding output neuron 3]

OR: w13 = 0.5, w23 = 0.5, θ3 = 0.25
x1 x2 | a3    | z3
1  0  |  0.25 | 1
0  1  |  0.25 | 1
1  1  |  0.75 | 1
0  0  | -0.25 | 0

AND: w13 = 0.5, w23 = 0.5, θ3 = 0.75
x1 x2 | a3    | z3
1  0  | -0.25 | 0
0  1  | -0.25 | 0
1  1  |  0.25 | 1
0  0  | -0.75 | 0
Neural networks and logical operators
[Diagram: input neurons 1 and 2 feeding output neuron 3]

NOT(1): w13 = -0.5, w23 = 0.1, θ3 = -0.25
x1 x2 | a3    | z3
1  0  | -0.25 | 0
0  1  |  0.35 | 1
1  1  | -0.15 | 0
0  0  |  0.25 | 1
Neural networks and logical operators
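The three gates can be verified with a few lines of Python, assuming a step transfer function (z = 1 if a > 0, else 0) and the weights and thresholds given above:

```python
# The three gates from the slides, with a step transfer function.

def gate(x1, x2, w13, w23, theta3):
    a3 = w13 * x1 + w23 * x2 - theta3
    return 1 if a3 > 0 else 0

print("x1 x2 OR AND NOT(1)")
for x1 in (0, 1):
    for x2 in (0, 1):
        OR   = gate(x1, x2, 0.5, 0.5, 0.25)
        AND  = gate(x1, x2, 0.5, 0.5, 0.75)
        NOT1 = gate(x1, x2, -0.5, 0.1, -0.25)
        print(x1, x2, OR, AND, NOT1)
```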
Supervised artificial neural networks
Feed-forward artificial neural networks can be trained starting from examples with known solution.
Error function: given a set of examples x^i with known desired outputs d^i, and given a network with parameters w, the square error is computed starting from the output of the network z (j sums over the output neurons):

E = \frac{1}{2} \sum_{i,j} \left( z_j(x^i, w) - d_j^i \right)^2

The training procedure consists in finding the parameters w that minimize the error: iterative minimization algorithms are adopted. However, they do NOT guarantee reaching the global minimum.
Training a perceptron
We consider a differentiable transfer function:

g(a) = \frac{1}{1 + e^{-a}}, \qquad g'(a) = \frac{e^{-a}}{(1 + e^{-a})^2} = g(a)\,(1 - g(a))
Given some initial parameters w, for a network with

z_j = g(a_j), \qquad a_j = \sum_l w_{lj} x_l - \theta_j, \qquad E = \frac{1}{2} \sum_{i,j} \left( z_j(x^i, w) - d_j^i \right)^2

[Diagram: a two-input, two-output perceptron: x1, x2 feeding z1, z2]

the chain rule gives:

\frac{\partial E}{\partial w_{lj}} = \sum_i \frac{\partial E}{\partial z_j(x^i,w)} \cdot \frac{\partial z_j(x^i,w)}{\partial a_j(x^i,w)} \cdot \frac{\partial a_j(x^i,w)}{\partial w_{lj}}

with

\frac{\partial E}{\partial z_j(x^i,w)} = z_j(x^i,w) - d_j^i, \qquad \frac{\partial z_j(x^i,w)}{\partial a_j(x^i,w)} = g'(a_j^i), \qquad \frac{\partial a_j(x^i,w)}{\partial w_{lj}} = x_l^i

Defining the deviation \delta_j^i = \left( z_j(x^i,w) - d_j^i \right) g'(a_j^i), then:

\frac{\partial E}{\partial w_{lj}} = \sum_i \left( z_j(x^i,w) - d_j^i \right) g'(a_j^i)\, x_l^i = \sum_i \delta_j^i\, x_l^i
Using the gradient, we can update the weights with the "steepest descent" procedure:

w_{lj} \leftarrow w_{lj} - \varepsilon \frac{\partial E}{\partial w_{lj}}

\varepsilon is the learning rate. Too low: slow training. Too high: the minima can be lost.

Convergence: \frac{\partial E}{\partial w_{lj}} \to 0
Training a perceptron
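A minimal numerical illustration of the role of the learning rate (not from the slides), using steepest descent on the toy function f(w) = w^2, whose gradient is 2w:

```python
# Steepest descent on f(w) = w^2, minimum at w = 0; gradient is 2w.

def descend(eps, w=1.0, steps=20):
    for _ in range(steps):
        w = w - eps * 2 * w
    return w

print(descend(0.01))  # too low: ~0.67, still far from 0 (slow training)
print(descend(0.4))   # converges quickly toward the minimum at 0
print(descend(1.1))   # too high: ~38, diverges -- the minimum is lost
```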
Steepest descent finds the LOCAL minimum of a function by always pointing in the direction that leads downhill.
Objective function f: \mathbb{R}^n \to \mathbb{R}, with f(x) of class C^2.
Gradient of f: the vector containing all the first-order partial derivatives,

\nabla f = \left( \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \right)
Gradient
Given a function f(x,y) and a level curve f(x,y) = c, the gradient of f is:

\nabla f = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right)

Consider 2 points of the curve, (x, y) and (x + \varepsilon_x, y + \varepsilon_y), for small \varepsilon. To first order:

f(x + \varepsilon_x, y + \varepsilon_y) = f(x,y) + \frac{\partial f}{\partial x} \varepsilon_x + \frac{\partial f}{\partial y} \varepsilon_y = f(x,y) + \nabla f\big|_{(x,y)}^T \varepsilon

Since both points satisfy the curve equation f = c:

c = c + \nabla f\big|_{(x,y)}^T \varepsilon \quad \Rightarrow \quad \nabla f\big|_{(x,y)}^T \varepsilon = 0

The gradient is perpendicular to \varepsilon. For small \varepsilon, \varepsilon is parallel to the curve and, by consequence, the gradient is locally perpendicular to level curves. The gradient points towards the direction of maximum increase of f.
[Diagram: level curves, with grad(f) perpendicular to the displacement \varepsilon along the curve]
Steepest descent finds the LOCAL minimum of a function by always pointing in the direction that leads downhill.
Example: OR
[Diagram: input neurons 1 and 2 feeding output neuron 3]
w13 = 0, w23 = 0, θ3 = 0, ε = 2

Training examples:
x1 x2 d | a     | z     | E     | ∂E/∂w13 | ∂E/∂w23 | ∂E/∂θ3
1  0  1 |  0    | 0.5   | 0.125 | -0.125  |  0      |  0.125
0  1  1 |  0    | 0.5   | 0.125 |  0      | -0.125  |  0.125
0  0  0 |  0    | 0.5   | 0.125 |  0      |  0      | -0.125
0  0  0 |  0    | 0.5   | 0.125 |  0      |  0      | -0.125
Totals: E = 0.5, ∂E/∂w13 = -0.125, ∂E/∂w23 = -0.125, ∂E/∂θ3 = 0
g(a) = \frac{1}{1 + e^{-a}}, \qquad g'(a) = g(a)(1 - g(a)) = z(1 - z)

\frac{\partial E}{\partial w_{lj}} = \sum_i \left( z_j(x^i) - d_j^i \right) g'(a_j^i)\, x_l^i = \sum_i \delta_j^i\, x_l^i
Example: OR, Step 1
w13 = 0.25, w23 = 0.25, θ3 = 0, ε = 2

x1 x2 d | a     | z     | E     | ∂E/∂w13 | ∂E/∂w23 | ∂E/∂θ3
1  0  1 |  0.25 | 0.56  | 0.096 | -0.108  |  0      |  0.108
0  1  1 |  0.25 | 0.56  | 0.096 |  0      | -0.108  |  0.108
0  0  0 |  0    | 0.5   | 0.125 |  0      |  0      | -0.125
0  0  0 |  0    | 0.5   | 0.125 |  0      |  0      | -0.125
Totals: E = 0.442, ∂E/∂w13 = -0.108, ∂E/∂w23 = -0.108, ∂E/∂θ3 = -0.035
Example: OR, Step 2
w13 = 0.466, w23 = 0.466, θ3 = 0.069, ε = 2

x1 x2 d | a      | z     | E     | ∂E/∂w13 | ∂E/∂w23 | ∂E/∂θ3
1  0  1 |  0.397 | 0.598 | 0.081 | -0.097  |  0      |  0.097
0  1  1 |  0.397 | 0.598 | 0.081 |  0      | -0.097  |  0.097
0  0  0 | -0.069 | 0.483 | 0.117 |  0      |  0      | -0.121
0  0  0 | -0.069 | 0.483 | 0.117 |  0      |  0      | -0.121
Totals: E = 0.395, ∂E/∂w13 = -0.097, ∂E/∂w23 = -0.097, ∂E/∂θ3 = -0.048
Example: OR, Step 3
w13 = 0.659, w23 = 0.659, θ3 = 0.164, ε = 2

x1 x2 d | a      | z     | E     | ∂E/∂w13 | ∂E/∂w23 | ∂E/∂θ3
1  0  1 |  0.494 | 0.621 | 0.072 | -0.089  |  0      |  0.089
0  1  1 |  0.494 | 0.621 | 0.072 |  0      | -0.089  |  0.089
0  0  0 | -0.164 | 0.459 | 0.105 |  0      |  0      | -0.114
0  0  0 | -0.164 | 0.459 | 0.105 |  0      |  0      | -0.114
Totals: E = 0.354, ∂E/∂w13 = -0.089, ∂E/∂w23 = -0.089, ∂E/∂θ3 = -0.05
Generalization
w13 = 0.659, w23 = 0.659, θ3 = 0.164, ε = 2

And what happens for the input (1,1), never seen during training?
x1 x2 d | a     | z
1  1  1 | 1.153 | 0.760

The network generalized the rules learned from known examples (a sketch reproducing this training run follows below).
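The whole OR training run can be reproduced with a short script; the following sketch follows the formulas above (ε = 2, parameters initialized to 0) and prints the same trajectory as the tables. Variable names are ours:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

examples = [((1, 0), 1), ((0, 1), 1), ((0, 0), 0), ((0, 0), 0)]
w13 = w23 = theta3 = 0.0
eps = 2.0

for step in range(3):
    gw13 = gw23 = gtheta = 0.0
    for (x1, x2), d in examples:
        a = w13 * x1 + w23 * x2 - theta3
        z = sigmoid(a)
        delta = (z - d) * z * (1 - z)   # (z - d) g'(a)
        gw13 += delta * x1
        gw23 += delta * x2
        gtheta += -delta                # dE/dtheta = -sum_i delta_i
    w13 -= eps * gw13
    w23 -= eps * gw23
    theta3 -= eps * gtheta
    print(step + 1, round(w13, 3), round(w23, 3), round(theta3, 3))
# Step 1: 0.25 0.25 0.0 ; Step 2: ~0.466 ~0.466 ~0.069 ; Step 3: ~0.659 ~0.659 ~0.164

# Generalization to the unseen input (1, 1):
a = w13 + w23 - theta3
print(round(a, 3), round(sigmoid(a), 3))  # ~1.153, ~0.760
```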
Linear separability
Given a step-like transfer function, the output neuron of a perceptron is activated if the activation is positive:

a > 0 \quad \Leftrightarrow \quad \sum_{i=1}^{d} w_i x_i - \theta > 0

The input space is then divided into two regions by the hyperplane \sum_i w_i x_i = \theta. If the requested mapping cannot be separated by a hyperplane, the perceptron is insufficient.
Linear separability
[Figure: AND, OR and NOT(1) are linearly separable in the input plane; XOR is not]
The XOR problem cannot be solved with a perceptron.
Multi-layer feed-forward neural networks
Neurons are organized into hierarchical layers
Each layer receives its inputs from the previous one and transmits its output to the next one (w1ij and w2ij are the weights of layers 1 and 2):

z^1_j = g\left( \sum_i w^1_{ij} x_i - \theta^1_j \right), \qquad z^2_j = g\left( \sum_i w^2_{ij} z^1_i - \theta^2_j \right)
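A minimal sketch of this two-layer forward pass; the function names and the example weights are ours:

```python
import math

def g(a):
    return 1.0 / (1.0 + math.exp(-a))

def layer(x, w, theta):
    # w[i][j]: weight from unit i of the previous layer to unit j
    return [g(sum(w[i][j] * x[i] for i in range(len(x))) - theta[j])
            for j in range(len(theta))]

def forward(x, w1, theta1, w2, theta2):
    z1 = layer(x, w1, theta1)      # hidden layer: z1_j = g(a1_j)
    return layer(z1, w2, theta2)   # output layer: z2_j = g(a2_j)

# Example: 2 inputs, 2 hidden units, 1 output (illustrative weights).
print(forward([1.0, 0.0],
              w1=[[0.7, 0.3], [0.7, 0.3]], theta1=[0.5, 0.5],
              w2=[[0.7], [-0.7]], theta2=[0.5]))
```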
[Diagram: inputs 1, 2 feeding hidden neurons (1,1), (1,2), feeding output neuron (2,1)]
XOR: w^1_11 = 0.7, w^1_21 = 0.7, θ^1_1 = 0.5; w^1_12 = 0.3, w^1_22 = 0.3, θ^1_2 = 0.5; w^2_11 = 0.7, w^2_21 = -0.7, θ^2_1 = 0.5

Input x1 = 0, x2 = 0:
a^1_1 = -0.5, z^1_1 = 0; a^1_2 = -0.5, z^1_2 = 0; a^2_1 = -0.5, z^2_1 = 0
Input x1 = 1, x2 = 0 (same XOR weights):
a^1_1 = 0.2, z^1_1 = 1; a^1_2 = -0.2, z^1_2 = 0; a^2_1 = 0.2, z^2_1 = 1
Input x1 = 0, x2 = 1 (same XOR weights):
a^1_1 = 0.2, z^1_1 = 1; a^1_2 = -0.2, z^1_2 = 0; a^2_1 = 0.2, z^2_1 = 1
Input x1 = 1, x2 = 1 (same XOR weights):
a^1_1 = 0.9, z^1_1 = 1; a^1_2 = 0.1, z^1_2 = 1; a^2_1 = -0.5, z^2_1 = 0
The hidden layer maps the input into a new representation that is linearly separable (see the sketch below):

Input | Desired output | Activation of hidden neurons
0 0   | 0              | 0 0
1 0   | 1              | 1 0
0 1   | 1              | 1 0
1 1   | 0              | 1 1
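A sketch of this XOR network with a step transfer function and the weights of the slides; the comments mark what each hidden unit computes:

```python
# The XOR network from the slides, with a step transfer function.

def step(a):
    return 1 if a > 0 else 0

def xor_net(x1, x2):
    z1_1 = step(0.7 * x1 + 0.7 * x2 - 0.5)      # behaves like OR(x1, x2)
    z1_2 = step(0.3 * x1 + 0.3 * x2 - 0.5)      # behaves like AND(x1, x2)
    z2_1 = step(0.7 * z1_1 - 0.7 * z1_2 - 0.5)  # OR and not AND = XOR
    return (z1_1, z1_2), z2_1

for x1 in (0, 1):
    for x2 in (0, 1):
        hidden, out = xor_net(x1, x2)
        print(x1, x2, hidden, out)
# The hidden pair (z1_1, z1_2) is linearly separable
# even though the raw inputs for XOR are not.
```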
Training of multilayer network: Back-propagation
For layer 2, the perceptron formula holds, upon the substitution x → z^1:

\frac{\partial E}{\partial w^2_{lj}} = \sum_i \left( z_j(x^i,w) - d_j^i \right) g'(a^{2,i}_j)\, z^{1,i}_l = \sum_i \delta^{2,i}_j\, z^{1,i}_l
For layer 1:

\frac{\partial E}{\partial w^1_{lj}} = \sum_i \frac{\partial E}{\partial a^{1,i}_j} \cdot \frac{\partial a^{1,i}_j}{\partial w^1_{lj}} = \sum_i \delta^{1,i}_j\, x^i_l

defining the deviation \delta^{1,i}_j = \frac{\partial E}{\partial a^{1,i}_j}. By the chain rule through the second layer:

\frac{\partial E}{\partial a^{1,i}_j} = \sum_k \frac{\partial E}{\partial a^{2,i}_k} \cdot \frac{\partial a^{2,i}_k}{\partial a^{1,i}_j} = \sum_k \delta^{2,i}_k \frac{\partial a^{2,i}_k}{\partial a^{1,i}_j}

Since a^2_k = \sum_m w^2_{mk}\, g(a^1_m) - \theta^2_k:

\frac{\partial a^2_k}{\partial a^1_j} = g'(a^1_j)\, w^2_{jk}

and therefore:

\delta^{1,i}_j = g'(a^{1,i}_j) \sum_k \delta^{2,i}_k\, w^2_{jk}

The deviations are propagated backwards through the weights: hence the name "back-propagation".
Training of multilayer network: Back-propagation
1) Compute z_l for each example (feed-forward step);
2) Compute the deviation on the output layer, δ^2_l;
3) Compute the deviation on the hidden layer, δ^1_j;
4) Compute the gradient of the error with respect to the weights;
5) Update the weights with the steepest-descent method (a compact sketch of this loop follows below).
[Diagram: the signal flows from Input to Output; the deviations propagate backwards]
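A compact sketch of one back-propagation step for the two-layer network above, with sigmoid units and explicit thresholds; all names are ours:

```python
import math

def g(a):
    return 1.0 / (1.0 + math.exp(-a))

def backprop_step(x, d, w1, th1, w2, th2, eps):
    # 1) Feed-forward step.
    a1 = [sum(w1[i][j] * x[i] for i in range(len(x))) - th1[j]
          for j in range(len(th1))]
    z1 = [g(a) for a in a1]
    a2 = [sum(w2[i][j] * z1[i] for i in range(len(z1))) - th2[j]
          for j in range(len(th2))]
    z2 = [g(a) for a in a2]
    # 2) Deviation on the output layer: delta2_k = (z2_k - d_k) g'(a2_k).
    d2 = [(z2[k] - d[k]) * z2[k] * (1 - z2[k]) for k in range(len(z2))]
    # 3) Deviation on the hidden layer: delta1_j = g'(a1_j) sum_k delta2_k w2_jk.
    d1 = [z1[j] * (1 - z1[j]) * sum(d2[k] * w2[j][k] for k in range(len(d2)))
          for j in range(len(z1))]
    # 4) Gradient of the error, and 5) steepest-descent update.
    for i in range(len(x)):
        for j in range(len(z1)):
            w1[i][j] -= eps * d1[j] * x[i]
    for j in range(len(z1)):
        th1[j] += eps * d1[j]              # dE/dtheta1_j = -delta1_j
        for k in range(len(z2)):
            w2[j][k] -= eps * d2[k] * z1[j]
    for k in range(len(z2)):
        th2[k] += eps * d2[k]              # dE/dtheta2_k = -delta2_k
    return z2
```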
What does a neural network learn?
Consider the ideal case, consisting of a continuous set of examples x, each one represented with frequency P(x). The desired solutions d are associated to the input with probability P(d | x):

E = \frac{1}{2} \sum_j \int dx\, P(x) \int dd\, P(d|x) \left( z_j(x,w) - d_j \right)^2

After convergence of the training, the functional derivative vanishes:

\frac{\delta E}{\delta z_j(x,w)} = 0 \quad \Rightarrow \quad 0 = \int dd\, P(d|x) \left( z_j(x,w) - d_j \right) \quad \Rightarrow \quad z_j(x,w) = \int dd\, P(d|x)\, d_j

The activation state of the j-th output neuron is equal to the average of the solutions associated to the input x in the training set.
Neural Networks for classification and regression
Networks can be used for classification or for regression
In regression: desired outputs are real numbers. In classification: desired outputs are 0 or 1.
Error function:

E = \frac{1}{2} \sum_{i,j} \left( z_j(x^i,w) - y_j^i \right)^2

Increasing the hidden neurons increases the number of parameters, and then increases the risk of overfitting the learning data.
Neural Networks and overfitting
1) Be sure that the number of parameters is far lower than the number of points to learn
(What is the number of parameters of a network with n inputs, k outputs and r hidden neurons?)
2) Use regularizers (if possible), e.g.:

E = \frac{1}{2} \sum_{i,j} \left( z_j(x^i,w) - y_j^i \right)^2 + k \sum_{ij} \left( w_{ij} \right)^2

Many other formulations are possible.
3) Always use an independent test set for deciding when to stop the training [EARLY STOPPING] (and then validate the method on a third independent set); a sketch follows below.
[Figure: training and test error vs. training iteration; stop the training at the iteration where the test error is minimal]
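A minimal sketch of the early-stopping logic; train_step and compute_error are hypothetical helpers, not defined in the slides:

```python
# train_step and compute_error are assumed helpers: one steepest-descent
# step returning the updated weights, and the error on a given data set.

def train_with_early_stopping(train_step, compute_error, weights,
                              train_set, test_set, max_iters=1000):
    best_weights, best_err = weights, float("inf")
    for _ in range(max_iters):
        weights = train_step(weights, train_set)
        err = compute_error(weights, test_set)     # independent test set
        if err < best_err:                          # keep the weights from
            best_err, best_weights = err, weights   # the best iteration
    return best_weights  # then validate on a third, independent set
```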
Can we add more layers?
Back-propagation is not suitable for training the networks as the number of layers increases: DEEP LEARNING PROCEDURES ARE NEEDED.
Stuttgart Neural Network Simulator: http://www.ra.cs.uni-tuebingen.de/SNNS/
OpenNN: http://www.opennn.net/
THEANO: http://deeplearning.net/software/theano/
More on: https://grey.colorado.edu/emergent/index.php/Comparison_of_Neural_Network_Simulators