Artificial Neural Networks
Lehrstuhl für Informatik 2
Gabriella Kókai: Machine Learning
Overview
- Motivation & Goals
- Perceptron Learning
- Gradient Algorithms & the Δ-Rule
- Multi-Layer Nets
- The Backpropagation Algorithm
- Example Application: Recognition of Faces
- More Network Architectures
- Application Areas of ANNs
Model: The Brain
A complex learning system built from simple learning units: the neurons.
A network of $\sim 10^{10}$ neurons, where each neuron has $\sim 10^4$ connections.
Transmission time of a neuron: $\sim 10^{-3}$ sec (speed versus flexibility).
Observation: face recognition time $\approx 10^{-1}$ sec $\Rightarrow$ massive parallelism.
Goals of ANNs
- Learning instead of programming
- Learning complex functions with simple learning units
- Parallel computation (e.g. layer model)
- The network parameters shall be found automatically by a learning algorithm
- An ANN is a black box.

[Figure: an ANN as a black box mapping an input vector to an output vector.]
When are ANNs used?
- Input instances are described as a vector of discrete or real values.
- The output of the target function is a single value or a vector of discrete or real-valued attributes.
- The input data contains noise.
- The target function is unknown or difficult to describe.
The Perceptron (as a NN Unit) (1/2)
A linear unit with a threshold:

$$o(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} w_i x_i > \Theta \\ -1 & \text{otherwise} \end{cases}$$

[Figure: a perceptron with inputs $x_1, \ldots, x_n$, weights $w_1, \ldots, w_n$ and threshold $\Theta$.]
The Perceptron (as a NN Unit) (2/2)
The threshold $\theta$ can be turned into an ordinary weight by adding a constant input $x_0 = 1$ with weight $w_0 = -\theta$:

$$w_1 x_1 + \ldots + w_n x_n > \theta \;\Longleftrightarrow\; w_0 x_0 + w_1 x_1 + \ldots + w_n x_n > 0$$

$$o(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i > 0 \\ -1 & \text{otherwise} \end{cases}$$

[Figure: the same perceptron with the additional constant input $x_0$ carrying the threshold.]
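To make the bias trick concrete, here is a minimal Python sketch of such a threshold unit; the AND-like weights and the threshold $\theta = 1.5$ are illustrative assumptions, not values from the slides.

```python
# A minimal sketch of a perceptron unit with the bias trick (x0 = 1, w0 = -theta).
# The weights and inputs here are illustrative, not from the slides.

def perceptron_output(weights, inputs):
    """Threshold unit: returns 1 if w . x > 0, else -1 (inputs[0] is the constant 1)."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return 1 if net > 0 else -1

# Example: an AND-like unit with threshold theta = 1.5, i.e. w0 = -1.5.
weights = [-1.5, 1.0, 1.0]          # [w0, w1, w2]
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), perceptron_output(weights, [1, x1, x2]))
```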
Geometrical Classification (Decision Surface)
A perceptron can classify only linearly separable training data. $\Rightarrow$ We need networks of these units.

[Figure: two scatter plots in the $(x_1, x_2)$ plane: linearly separable data (e.g. the OR function, together with a perceptron that realises it) and not linearly separable data (e.g. the XOR function).]
The Perceptron Learning Rule (1/2)
Training a perceptron = learning the best hypothesis, i.e. the one that classifies all training examples correctly.
A hypothesis = a vector of weights.
The Perceptron Learning Rule (2/2)
Idea:
1. Initialise the weights with random values.
2. Apply the perceptron iteratively to each training example and modify the weights according to the learning rule
$$\Delta w_i = \eta\,(t - o)\,x_i, \qquad w_i \leftarrow w_i + \Delta w_i$$
where $t$ is the target output, $o$ the actual output and $\eta$ the learning rate.
3. Step 2 is repeated for all training examples until all of them are correctly classified.
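A minimal sketch of this procedure in Python, assuming the AND function as training data and $\eta = 0.1$ (both illustrative choices):

```python
import random

# A sketch of the perceptron learning rule from this slide:
#   delta_w_i = eta * (t - o) * x_i,  w_i <- w_i + delta_w_i
# The data set (the AND function, with a constant bias input x0 = 1)
# and eta = 0.1 are illustrative choices.

def output(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

data = [([1, 0, 0], -1), ([1, 0, 1], -1), ([1, 1, 0], -1), ([1, 1, 1], 1)]
eta = 0.1
w = [random.uniform(-0.05, 0.05) for _ in range(3)]  # step 1: random init

while any(output(w, x) != t for x, t in data):       # step 3: repeat until correct
    for x, t in data:                                # step 2: apply rule per example
        o = output(w, x)
        w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]

print("learned weights:", w)
```

Since AND is linearly separable, the loop is guaranteed to terminate for a small enough $\eta$.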
The Perceptron Learning Rule: Convergence
The perceptron learning rule converges if:
- the training examples are linearly separable, and
- $\eta$ is chosen small enough (e.g. 0.1).
Intuitive explanation:
- If $t = o$: $\Delta w_i = 0$, nothing changes (ok).
- If $t = 1$ and $o = -1$: $\Delta w_i > 0$ for $x_i > 0$ and $\Delta w_i < 0$ for $x_i < 0$, so $\vec{w} \cdot \vec{x}$ grows and the output is pushed towards $+1$ (symmetrically for $t = -1$, $o = 1$).
The Gradient Descent Algorithm & the Δ-Rule (1/5)
Better: the Δ-rule converges even if the training examples are not linearly separable.
Idea: Use the gradient descent algorithm to search for the best hypothesis in the hypothesis space. The best hypothesis is the one that minimises the squared error.
Basis of the backpropagation algorithm.
The Gradient Descent Algorithm & the Δ-Rule (2/5)
For reasons of continuity, the Δ-learning rule is applied to a linear unit instead of the perceptron.
Linear unit:
$$o(\vec{x}) = \vec{w} \cdot \vec{x}$$
The squared error to be minimised:
$$E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$
where $D$ is the set of training examples, $t_d$ the target output of example $d$ and $o_d$ the computed output for example $d$.
The Gradient Descent Algorithm & the Δ-Rule (3/5)
Geometric interpretation: the error function forms a surface over the hypothesis space (e.g. a 2-dimensional weight space).

[Figure: error surface over the 2-dimensional weight space, with the gradient descent path towards the minimum.]
The Gradient Descent Algorithm & the Δ-Rule (4/5)
Gradient:
$$\nabla E(\vec{w}) = \left[ \frac{\partial E}{\partial w_0}, \ldots, \frac{\partial E}{\partial w_n} \right]$$
Learning rule:
$$\Delta \vec{w} = -\eta\, \nabla E(\vec{w}), \qquad \Delta w_i = -\eta\, \frac{\partial E}{\partial w_i}$$
Derivation:
$$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 = \frac{1}{2} \sum_{d \in D} 2\,(t_d - o_d)\, \frac{\partial}{\partial w_i} (t_d - o_d) = \sum_{d \in D} (t_d - o_d)\, \frac{\partial}{\partial w_i} \left(t_d - \vec{w} \cdot \vec{x}_d\right) = -\sum_{d \in D} (t_d - o_d)\, x_{i,d}$$
and hence $\Delta w_i = \eta \sum_{d \in D} (t_d - o_d)\, x_{i,d}$.
The Gradient Descent Algorithm & the Δ-Rule (5/5)
Standard method: do until the termination criterion is satisfied:
1. Initialise each $\Delta w_i = 0$.
2. For each $(\vec{x}, t) \in D$: compute $o$; for each $w_i$: $\Delta w_i \leftarrow \Delta w_i + \eta\,(t - o)\,x_i$.
3. For each $w_i$: $w_i \leftarrow w_i + \Delta w_i$.
The Δ-Rule
Stochastic method: do until the termination criterion is satisfied:
For each $d = (\vec{x}, t) \in D$: compute $o_d$; for each $w_i$:
$$w_i \leftarrow w_i + \eta\,(t_d - o_d)\,x_{i,d} \qquad \text{(the Δ-rule)}$$
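The two methods differ only in when the weights change: the standard method accumulates $\Delta w$ over all of $D$ before updating, while the stochastic method updates after every single example. A sketch in Python, assuming a toy noisy linear data set and $\eta = 0.05$ (both illustrative):

```python
import random

# A sketch contrasting the two methods on a linear unit o(x) = w . x.
# The toy data (a noisy linear target t = 2x + 1) is an illustrative assumption.

data = [([1.0, x], 2.0 * x + 1.0 + random.gauss(0, 0.1))
        for x in [0.0, 0.5, 1.0, 1.5, 2.0]]
eta = 0.05

def batch_epoch(w):
    """Standard method: accumulate delta_w over all examples, then update."""
    dw = [0.0] * len(w)
    for x, t in data:
        o = sum(wi * xi for wi, xi in zip(w, x))
        dw = [dwi + eta * (t - o) * xi for dwi, xi in zip(dw, x)]
    return [wi + dwi for wi, dwi in zip(w, dw)]

def stochastic_epoch(w):
    """Delta rule: update w immediately after each example."""
    for x, t in data:
        o = sum(wi * xi for wi, xi in zip(w, x))
        w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w

w_b = w_s = [0.0, 0.0]
for _ in range(200):
    w_b, w_s = batch_epoch(w_b), stochastic_epoch(w_s)
print("batch:", w_b, "stochastic:", w_s)   # both approach w ~ [1.0, 2.0]
```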
Remarks
Advantages of the stochastic approximation of the gradient:
- quicker convergence (incremental update of the weights),
- less likely to get stuck in a local minimum.
Remarks
Single perceptrons learn only linearly separable training data. $\Rightarrow$ We need multi-layer networks of several 'neurons'.
Example: the XOR problem:

[Figure: a two-layer network of threshold units (thresholds 0.5, weights ±1) next to the XOR scatter plot in the $(x_1, x_2)$ plane, which is not linearly separable.]
XOR-Function
[Figure: the XOR network evaluated on all four input combinations (0,0), (0,1), (1,0) and (1,1); the output unit fires exactly when the two inputs differ.]
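The following Python sketch verifies one such network; the exact wiring (hidden units for "$x_1$ and not $x_2$" and "$x_2$ and not $x_1$", ORed together) is an assumption consistent with the slide's thresholds of 0.5 and weights of ±1:

```python
# A sketch of a two-layer threshold network for XOR. The exact wiring is an
# assumption consistent with the slide's hints (thresholds 0.5, weights +/-1).

def step(net, theta=0.5):
    return 1 if net > theta else 0

def xor_net(x1, x2):
    h1 = step(1.0 * x1 - 1.0 * x2)    # fires for: x1 and not x2
    h2 = step(-1.0 * x1 + 1.0 * x2)   # fires for: x2 and not x1
    return step(1.0 * h1 + 1.0 * h2)  # fires for: h1 or h2

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "->", xor_net(x1, x2))   # prints 0, 1, 1, 0
```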
Supervised Learning Backpropagation NN
Since 1985 the BP algorithm has become one of the most widespread and successful learning algorithms for NNs.
Idea: The minimum of the error function is searched for by descending in the direction of the gradient.
The vector of weights which minimises the error in the network is seen as the solution of the learning problem.
So the gradient of the error function must exist for all points inside the weight space $\Rightarrow$ $o(\vec{w}, \vec{x})$ must be differentiable.
Learning in Backpropagation Networks
The sigmoid unit:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Properties of the sigmoid unit:
$$\frac{d\sigma(x)}{dx} = \sigma(x)\,(1 - \sigma(x))$$
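A short Python sketch checking this derivative identity numerically (the test point $x = 0.7$ is an arbitrary choice):

```python
import math

# The sigmoid unit from this slide, with a numerical check of the identity
# d(sigma)/dx = sigma(x) * (1 - sigma(x)).

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x, h = 0.7, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)   # central difference
analytic = sigmoid(x) * (1 - sigmoid(x))
print(numeric, analytic)   # both ~0.2217
```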
Definitions used by the BP Algorithm
[Figure: a feedforward network with input units, hidden units and output units; backpropagation sends the error terms from the output layer back towards the inputs.]

- $x_{j,i}$: input from node $i$ to unit $j$
- $w_{j,i}$: weight of the $i$th input to unit $j$
- outputs: the set of output units
- $o_i$: output of unit $i$
- $t_i$: target output of unit $i$
- $\delta_n$: error term of unit $n$, defined as
$$\delta_n = -\frac{\partial E}{\partial net_n}$$
The Backpropagation Algorithm
Initialise all weights to small random numbers.
Until the termination criterion is satisfied do:
For each training example do:
1. Compute the network's output.
2. For each output unit $k$: $\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)$
3. For each hidden unit $h$: $\delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{k,h}\, \delta_k$
4. Update each network weight: $w_{j,i} \leftarrow w_{j,i} + \Delta w_{j,i}$, where $\Delta w_{j,i} = \eta\, \delta_j\, x_{j,i}$
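A compact sketch of these four steps in Python using NumPy; the network size, the bias handling via a constant input $x_0 = 1$, the training task (XOR), $\eta$ and the epoch count are all illustrative assumptions:

```python
import numpy as np

# A compact sketch of the BP algorithm above: one hidden layer of sigmoid
# units, trained on XOR as an illustrative task. Sizes, eta, the bias trick
# (constant x0 = 1) and the epoch count are assumptions, not slide values.

rng = np.random.default_rng(0)
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], float)  # x0 = 1
T = np.array([[0.0], [1.0], [1.0], [0.0]])

W1 = rng.uniform(-0.5, 0.5, (3, 3))   # weights from 3 inputs to 3 hidden units
W2 = rng.uniform(-0.5, 0.5, (4, 1))   # weights to 1 output unit (+ hidden bias)
eta = 0.5
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(20000):
    for x, t in zip(X, T):
        h = np.concatenate(([1.0], sigmoid(x @ W1)))        # 1. forward pass
        o = sigmoid(h @ W2)
        delta_o = o * (1 - o) * (t - o)                     # 2. output error term
        delta_h = h[1:] * (1 - h[1:]) * (W2[1:] @ delta_o)  # 3. hidden error terms
        W2 += eta * np.outer(h, delta_o)                    # 4. weight updates
        W1 += eta * np.outer(x, delta_h)

for x in X:
    print(x[1:], sigmoid(np.concatenate(([1.0], sigmoid(x @ W1))) @ W2))
# the outputs typically approach 0, 1, 1, 0 after training
```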
Derivation of the BP Algorithm
For each training example $d$:
$$\Delta w_{j,i} = -\eta\, \frac{\partial E_d}{\partial w_{j,i}}, \qquad w_{j,i} \leftarrow w_{j,i} + \Delta w_{j,i}$$
where
$$E_d(\vec{w}) = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$$
Applying the chain rule:
$$\frac{\partial E_d}{\partial w_{j,i}} = \frac{\partial E_d}{\partial net_j}\, \frac{\partial net_j}{\partial w_{j,i}} = \frac{\partial E_d}{\partial net_j}\, x_{j,i}$$
with $net_j = \sum_i w_{j,i}\, x_{j,i}$ (the weighted sum of the inputs of unit $j$).

[Figure: the position of weight $w_{j,i}$ between input units, hidden units and output units.]
Derivation of the BP Algorithm
Output layer:
$$\frac{\partial E_d}{\partial net_j} = \frac{\partial E_d}{\partial o_j}\, \frac{\partial o_j}{\partial net_j}$$
with
$$\frac{\partial E_d}{\partial o_j} = \ldots = -(t_j - o_j), \qquad \frac{\partial o_j}{\partial net_j} = \frac{\partial \sigma(net_j)}{\partial net_j} = o_j (1 - o_j)$$
using
$$\frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x))$$
and therefore
$$\Delta w_{j,i} = \eta\,(t_j - o_j)\, o_j (1 - o_j)\, x_{j,i}$$
Hidden layer:
$$\frac{\partial E_d}{\partial net_j} = \sum_{k \in Downstream(j)} \frac{\partial E_d}{\partial net_k}\, \frac{\partial net_k}{\partial net_j} = \ldots = -o_j (1 - o_j) \sum_{k \in Downstream(j)} w_{k,j}\, \delta_k$$
Downstream(j): the set of units whose immediate inputs include the output of unit $j$.
And therefore $\Delta w_{j,i} = \eta\, \delta_j\, x_{j,i}$.
Derivation of the BP Algorithm (Explanation)
Why $\frac{\partial E_d}{\partial o_j} = -(t_j - o_j)$:
$$\frac{\partial E_d}{\partial o_j} = \frac{\partial}{\partial o_j} \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2 = \frac{\partial}{\partial o_j} \frac{1}{2} (t_j - o_j)^2 = \frac{1}{2} \cdot 2\,(t_j - o_j)\, \frac{\partial (t_j - o_j)}{\partial o_j} = -(t_j - o_j)$$
(only the summand for $k = j$ depends on $o_j$, so all other summands vanish).

Why $\delta_j = o_j (1 - o_j) \sum_{k \in Downstream(j)} \delta_k\, w_{k,j}$, given $\delta_n = -\frac{\partial E}{\partial net_n}$:
$$-\frac{\partial E_d}{\partial net_j} = -\sum_{k \in Downstream(j)} \frac{\partial E_d}{\partial net_k}\, \frac{\partial net_k}{\partial net_j} = \sum_{k \in Downstream(j)} \delta_k\, \frac{\partial net_k}{\partial net_j} = \sum_{k \in Downstream(j)} \delta_k\, \frac{\partial net_k}{\partial o_j}\, \frac{\partial o_j}{\partial net_j}$$
$$= \sum_{k \in Downstream(j)} \delta_k\, w_{k,j}\, \frac{\partial o_j}{\partial net_j} = \sum_{k \in Downstream(j)} \delta_k\, w_{k,j}\, o_j (1 - o_j)$$
using $net_k = \sum_i w_{k,i}\, x_{k,i}$, so that $\frac{\partial net_k}{\partial o_j} = w_{k,j}$.
Convergence of the BP Algorithm
Generalisation to arbitrary acyclic directed network architectures is simple.
In practice it works well, but it sometimes gets stuck in a local (not global) minimum $\Rightarrow$ introduction of a momentum term ("escape routes"):
$$\Delta w_{i,j}(n) = \eta\, \delta_j\, x_{i,j} + \alpha\, \Delta w_{i,j}(n-1)$$
Disadvantage: global minima can be skipped over by this "jumping"!
Training can take thousands of iterations $\Rightarrow$ slow (accelerated by momentum).
Over-fitting versus adaptability of the NN.
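A toy Python sketch of the momentum update; $\eta$, $\alpha$ and the per-step gradient values are illustrative:

```python
# A sketch of the momentum term from the slide:
#   delta_w(n) = eta * delta_j * x_ij + alpha * delta_w(n-1)
# eta, alpha and the per-step gradient terms are illustrative toy values.

eta, alpha = 0.1, 0.9
prev_dw = 0.0
for grad in [1.0, 1.0, 1.0, 0.2, 0.2]:    # delta_j * x_ij per iteration
    dw = eta * grad + alpha * prev_dw     # momentum keeps part of the last update
    prev_dw = dw
    print(round(dw, 4))                   # updates build up in a consistent direction
```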
Example: Recognition of Faces
Given: 32 photos of each of 20 persons, in different poses:
Direction of view: right, left, up or straight.
With and without sunglasses.
Expression: happy, sad, neutral...
Example: Recognition of Faces
Goal: Classification of the photos with respect to the direction of view.
Preparation of the input:
- Rastering the photos (down-sampling) $\Rightarrow$ acceleration of the learning process
- Input vector = the grayscale values of the 30 × 32 pixels
- Output vector = (left, straight, right, up); solution = max(left, straight, right, up)
  e.g. o = (0.9, 0.1, 0.1, 0.1) $\Rightarrow$ looking to the left
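A small Python sketch of this 1-of-4 output coding; the tuple of direction names is only an illustrative encoding:

```python
# A sketch of the 1-of-4 output coding described above: the predicted direction
# is the component with the maximal activation. The names are illustrative.

directions = ("left", "straight", "right", "up")

def decode(output_vector):
    return directions[max(range(4), key=lambda i: output_vector[i])]

print(decode((0.9, 0.1, 0.1, 0.1)))   # -> "left", as in the slide's example
```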
Recognition of the direction of view
Recurrent Neural Networks
They are directed cyclic networks "with memory":
- Outputs at time t = inputs at time t+1; the cycles allow results to be fed back into the network.
- (+) They are more expressive than acyclic networks.
- (−) Training of recurrent networks is expensive. In some cases recurrent networks can be trained using a variant of the backpropagation algorithm.
Example: forecast of the next stock market price y(t+1), based on the current indicator x(t) and the previous indicator x(t−1).
Recurrent NNs
[Figure: three architectures side by side: a feedforward network computing y(t+1) from x(t) and x(t−1); a recurrent network with a context unit c(t) fed back as input; and the same recurrent network unfolded in time over x(t), x(t−1), x(t−2) with contexts c(t), c(t−1), c(t−2).]