University of Kurdistan Artificial Intelligence Methods (AIM) Lecturer: Kaveh Mollazade, Ph.D....
-
Upload
charlene-felicia-lang -
Category
Documents
-
view
221 -
download
0
Transcript of University of Kurdistan Artificial Intelligence Methods (AIM) Lecturer: Kaveh Mollazade, Ph.D....
University of Kurdistan
Artificial Intelligence Methods (AIM)
Lecturer:
Kaveh Mollazade, Ph.D.
Department of Biosystems Engineering, Faculty of Agriculture, University of Kurdistan, Sanandaj, IRAN.
Lecture 2: Artificial Neural Networks (ANNs)
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade1
Contents
• This lecture will cover:
– Definitions of ANNs
– Biological inspiration
– ANNs vs. biological neural networks
– Perceprton training
– Gradient descent algorithm
– Multilayer neural networks (MLPs)
– Backpropagation algorithm
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade2
Definition
DARPA Neural Network Study (1988, AFCEA International Press, p. 60):
An artificial neural network is a system composed of many simple processing elements
operating in parallel whose function is determined by network structure, connection
strengths, and the processing performed at computing elements or nodes.
Haykin, S. (1994), Neural Networks: A Comprehensive Foundation, NY: Macmillan, p. 2:
An artificial neural network is a massively parallel distributed processor that has a natural
propensity for storing experiential knowledge and making it available for use.
An artificial neural network is an oversimplified representation of the neuron
interconnections in the human brain.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade3
Biological inspiration
• Animals are able to react adaptively to changes in their external and internal
environment, and they use their nervous system to perform these behaviors.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade4
Biological inspiration
• An appropriate model/simulation of the nervous system should be able to produce
similar responses and behaviors in artificial systems.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade5
Biological inspiration
• The nervous system is build by relatively simple units, the neurons, so copying their
behavior and functionality should be the solution.
• Neuron is a simple computing element.
• Each individual neuron is not intelligent.
• Brain is composed of massively parallel network
of neurons.
• Brain is intelligent.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade6
Biological neuron
• Soma or "cell body" is the bulbous end of a neuron, containing the cell nucleus. The word
"soma" comes from the Greek σῶμα, meaning "body". There are many different specialized
types of neurons, and their sizes vary from as small as about 5 micrometers to over 10
millimeter for some of the smallest and largest neurons of invertebrates, respectively.
Dendrite
Soma
Nucleus
Axon terminal
Axon
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade7
Biological neuron
• Dendrites (also dendron) are the branched projections of a neuron that act to propagate the
electrochemical stimulation received from other neural cells to the cell body, or soma, of the
neuron from which the dendrites project. Electrical stimulation is transmitted onto dendrites by
upstream neurons (usually their axons) via synapses which are located at various points
throughout the dendritic tree.
Dendrite
Soma
Nucleus
Axon terminal
Axon
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade8
Biological neuron
• Axon also known as a nerve fiber, is a long, slender projection of a nerve cell, or neuron, that
typically conducts electrical impulses away from the neuron's cell body. The function of the axon
is to transmit information to different neurons, muscles and glands.
Dendrite
Soma
Nucleus
Axon terminal
Axon
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade9
Biological neuron
• Axon terminals (also called synaptic boutons) are distal terminations of the branches of an
axon. Neurons are interconnected in complex arrangements, and use electrochemical signals
and neurotransmitter chemicals to transmit impulses from one neuron to the next; axon
terminals are separated from neighboring neurons by a small gap called a synapse, across
which impulses are sent.
Dendrite
Soma
Nucleus
Axon terminal
Axon
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade10
Biological neuron
• The spikes travelling along the axon of the
pre-synaptic neuron trigger the release of
neurotransmitter substances at the synapse.
• The neurotransmitters cause excitation or
inhibition in the dendrite of the post-synaptic
neuron.
• The integration of the excitatory and
inhibitory signals may produce spikes in the
post-synaptic neuron.
• The contribution of the signals depends on
the strength of the synaptic connection.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade11
Artificial neuron vs. biological neuron
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade12
Artificial neuron vs. biological neuron
Biological Neural Network Artificial Neural Network Soma
Dendrite
Axon
Synapse
Neuron
Input
Output
Weight
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade13
Artificial neurons: The McCullogh-Pitts model
• Neurons work by processing information. They receive and provide information in
form of spikes.
Inputs
Outputw2
w1
w3
wn
wn-1
..
.
x1
x2
x3
…
xn-1
xn
y)(;
1
zHyxwzn
iii
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade14
Artificial neurons: The McCullogh-Pitts model
Spikes are interpreted as spike rates.
Synaptic strength are translated as synaptic weights.
Excitation means positive product between the incoming spike rate and the
corresponding synaptic weight.
Inhibition means negative product between the incoming spike rate and the
corresponding synaptic weight.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade15
Artificial neurons: The McCullogh-Pitts model
• Nonlinear generalization of the McCullogh-Pitts neuron:
),( wxfy
y is the neuron’s output, x is the vector of inputs, and w is the vector of synaptic weights.
axwTey
1
1Sigmoidal neuron: Gaussian neuron:
2
2
2
||||
a
wx
ey
• Examples:
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade16
Artificial neural networks (ANNs)
Inputs
Output
• Each of the perceptrons receives inputs, processes inputs and delivers a single
output. The input can be raw input data or the output of other perceptrons. The
output can be the final result (e.g. 1 means yes, 0 means no) or it can be inputs to
other perceptrons.
• An ANN is composed of processing elements called perceptrons that are linked
together according to a specific network architecture.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade17
Learning in biological systems
• Learning = learning by adaptation
• The young animal learns that the green fruits are sour, while the yellowish/reddish
ones are sweet. The learning happens by adapting the fruit picking behavior.
• At the neural level the learning happens by changing of the synaptic strengths,
eliminating some synapses, and building new ones.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade18
Learning as optimization
• The objective of adapting the responses on the basis of the information received
from the environment is to achieve a better state. E.g., the animal likes to eat many
energy rich, juicy fruits that make its stomach full, and makes it feel happy.
• In other words, the objective of learning in biological organisms is to optimize the
amount of available resources, happiness, or in general to achieve a closer to optimal
state.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade19
Learning in biological neural networks
• Hebbian theory is a theory in neuroscience that proposes an explanation for the
adaptation of neurons in the brain during the learning process.
“When an axon of cell A is near
enough to excite a cell B and
repeatedly or persistently takes part in
firing it, some growth process or
metabolic change takes place in one
or both cells such that A’s efficiency,
as one of the cells firing B, is
increased “ (D. O. Hebb, 1949).
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade20
Learning in biological neural networks
• The learning rules of Hebb:
• Synchronous activation increases the synaptic strength;
• Asynchronous activation decreases the synaptic strength.
• These rules fit with energy minimization principles.
• Maintaining synaptic strength needs energy, it should be maintained at those places
where it is needed, and it shouldn’t be maintained at places where it’s not needed.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade21
Learning principle for artificial neural networks
• ENERGY MINIMIZATION
We need an appropriate definition of energy for artificial neural networks, and having
that we can use mathematical optimization techniques to find how to change the
weights of the synaptic connections between neurons.
• ENERGY = measure of task performance error
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade22
Perceptrons
• A perceptron takes a vector of real-valued inputs, calculates a linear combination of
these inputs, then outputs
1 if the result is greater than some threshold
–1 otherwise.
• Given real-valued inputs x1 through xn, the output o(x1, …, xn) computed by the
perceptron is:
o(x1, …, xn) = 1 if w0 + w1x1 + … + wnxn > 0
o(x1, …, xn) = -1 otherwise
where wi is a real-valued constant, or weight.
• Notice the quantify (-w0) is a threshold that the weighted combination of inputs w1x1
+ … + wnxn must surpass in order for perceptron to output 1.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade23
Perceptrons
• To simplify notation, we imagine an additional constant input x0 = 1, allowing us to
write the above inequality as:
• Learning a perceptron involves choosing values for the weights w0, w1,…, wn.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade24
Representation power of perceptrons
• We can view the perceptron as representing a hyperplane decision surface in the
n-dimensional space of instances (i.e. points). The perceptron outputs 1 for instances
lying on one side of the hyperplane and outputs –1 for instances lying on the other
side. The equation for this decision hyperplane is:
Some sets of positive and negative
examples cannot be separated by any
hyperplane. Those that can be separated
are called linearly separated set of
examples.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade25
Example: Represent boolean functions by single perceptrons
• AND function
x1 x2 output
0 0 -10 1 -11 0 -11 1 1
<Training examples>Decision hyperplane :
w0 + w1 x1 + w2 x2 = 0-0.8 + 0.5 x1 + 0.5 x2 = 0
x1 x2 Swixi output
0 0 -0.8 -10 1 -0.3 -11 0 -0.3 -11 1 0.2 1
<Test Results>
-
-
-
+
x1
x2
-0.8 + 0.5 x1 + 0.5 x2 = 0
-
-
-
+
x1
x2
-0.8 + 0.5 x1 + 0.5 x2 = 0
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade26
Example: Represent boolean functions by single perceptrons
• OR function
x1 x2 o utput
0 0 - 1
0 1 11 0 11 1 1
<Training e x ample s >
Decision hyperplane :w0 + w1 x1 + w2 x2 = 0-0.3 + 0.5 x1 + 0.5 x2 = 0
x1 x2 Swixi output
0 0 -0.3 -10 1 0.2 11 0 0.2 11 1 0.7 1
<Test Results>
-
+
+
+
x1
x2
-0.3 + 0.5 x1 + 0.5 x2 = 0
-
+
+
+
x1
x2
-0.3 + 0.5 x1 + 0.5 x2 = 0
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade27
Example: Represent boolean functions by single perceptrons
• XOR function
It’s impossible to implement the XOR function by a single perceptron.
x1 x2 output0 0 -10 1 11 0 11 1 -1
<Trai ni ng exampl es>
-
+
+
-
x1
x2
-
+
+
-
x1
x2
A two-layer network of perceptrons can represent XOR function.
Refer to this equation,
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade28
Perceptron training rule
• Although we are interested in learning networks of many interconnected units, let us
begin by understanding how to learn the weights for a single perceptron.
• Here learning is to determine a weight vector that causes the perceptron to produce
the correct +1 or –1 for each of the given training examples.
• Several algorithms are known to solve this learning problem. Here we consider two:
the perceptron rule and the delta rule.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade29
Perceptron training rule (cont…)
• One way to learn an acceptable weight vector is to begin with random weights, then
iteratively apply the perceptron to each training example, modifying the perceptron
weights whenever it misclassifies an example. This process is repeated, iterating
through the training examples as many as times needed until the perceptron
classifies all training examples correctly.
• Weights are modified at each step according to the perceptron training rule, which
revises the weight wi associated with input xi according to the rule.
t is target output value for the current training example o is perceptron output is small constant (e.g., 0.1) called learning rate
wi wi + wi
where wi = (t – o) xi
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade30
Perceptron training rule (cont…)
• The role of the learning rate is to moderate the degree to which weights are
changed at each step. It is usually set to some small value (e.g. 0.1) and is
sometimes made to decrease as the number of weight-tuning iterations increases.
• We can prove that the algorithm will converge
If training data is linearly separable
and sufficiently small.
• If the data is not linearly separable, convergence is not assured.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade31
Gradient descent and the delta rule
• Although the perceptron rule finds a successful weight vector when the training
examples are linearly separable, it can fail to converge if the examples are not
linearly separatable. A second training rule, called the delta rule, is designed to
overcome this difficulty.
• The key idea of delta rule: to use gradient descent to search the space of possible
weight vector to find the weights that best fit the training examples. This rule is
important because it provides the basis for the backpropagration algorithm, which
can learn networks with many interconnected units.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade32
Gradient descent and the delta rule (cont…)
• The delta training rule:
Considering the task of training an unthresholded perceptron, that is a linear unit, for
which the output o is given by:
o = w0 + w1x1 + ··· + wnxn (1)
• Thus, a linear unit corresponds to the first stage of a perceptron, without the
threhold.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade33
Gradient descent and the delta rule (cont…)
where D is set of training examples, td is the target output for the training example d
and od is the output of the linear unit for the training example d.
• In order to derive a weight learning rule for linear units, let specify a measure for the
training error of a weight vector, relative to the training examples. The training error
can be computed as the following squared error:
(2)
• Here we characterize E as a function of weight vector because the linear unit
output O depends on this weight vector.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade34
Gradient descent and the delta rule (cont…)
• To understand the gradient descent algorithm, it is helpful to visualize the entire
space of possible weight vectors and their associated E values
Here the axes wo,w1 represents possible values for
the two weights of a simple linear unit. The wo,w1
plane represents the entire hypothesis space.
The vertical axis indicates the error E relative to
some fixed set of training examples. The error
surface shown in the figure summarizes the
desirability of every weight vector in the hypothesis
space.
• For linear units, this error surface must be parabolic with a single global minimum. And we
desire a weight vector with this minimum.
• How can we calculate the direction of steepest descent along the error surface? This direction
can be found by computing the derivative of E w.r.t. each component of the vector w.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade35
Derivation of the gradient descent rule
• This vector derivative is called the gradient of E with respect to the vector <w0,
…,wn>, written E .
(3)
Notice E is itself a vector, whose components are the partial derivatives of E with respect to
each of the wi. When interpreted as a vector in weight space, the gradient specifies the
direction that produces the steepest increase in E. The negative of this vector therefore gives
the direction of steepest decrease.
• Since the gradient specifies the direction of steepest increase of E, the training rule
for gradient descent is: w w + w
(4)where
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade36
Derivation of the gradient descent rule (cont…)
• Here is a positive constant called the learning rate, which determines the step
size in the gradient descent search. The negative sign is present because we want to
move the weight vector in the direction that decreases E. This training rule can also
be written in its component form:
wi wi + wi
(5)where
which makes it clear that steepest descent is achieved by altering each component
wi of weight vector in proportion to E/wi.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade37
Derivation of the gradient descent rule (cont…)
• The vector of E/wi derivatives that
form the gradient can be obtained by
differentiating E from Equation (2), as: (6)
where xid denotes the single input
component xi for the training example d.
We now have an equation that gives
E/wi in terms of the linear unit inputs xid,
output od and the target value td associated
with the training example. Substituting
Equation (6) into Equation (5) yields the
weight update rule for gradient descent.
(7)
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade38
Summary of gradient descent algorithm
• The gradient descent algorithm for training linear units is as follows:
1. Pick an initial random weight vector.
2. Apply the linear unit to all training examples
3. Compute wi for each weight according to Equation (7).
4. Update each weight wi by adding wi.
5. Repeat the steps 2 to 4 until reach the termination condition.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade39
Some additional notes
• Because the error surface contains only a single global minimum, this algorithm
will converge to a weight vector with minimum error, regardless of whether the
training examples are linearly separable, given a sufficiently small is used.
• If is too large, the gradient descent search runs the risk of overstepping the
minimum in the error surface rather than settling into it. For this reason, one
common modification to the algorithm is to gradually reduce the value of as the
number of gradient descent steps grows.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade40
Stochastic approximation to gradient descent
• The key practical difficulties in applying gradient descent are:
Converging to a local minimum can sometimes be quite slow (i.e., it can require many
thousands of steps).
If there are multiple local minima in the error surface, then there is no guarantee that the
procedure will find the global minimum.
• One common variation on gradient descent intended to alleviate these difficulties is
called incremental gradient descent (or stochastic gradient descent).
• In standard gradient descent, the error is summed over all examples before
upgrading weights, whereas in stochastic gradient descent weights are updated upon
examining each training example.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade41
Stochastic approximation to gradient descent (cont…)
• Stochastic gradient descent (i.e.
incremental mode) can sometimes
avoid falling into local minima
because it uses the various
gradient of E rather than overall
gradient of E to guide its search.
• Both stochastic and standard gradient descent methods are commonly used in practice.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade42
Multilayer neural networks
• Single perceptrons can only express linear decision surfaces. In contrast, the kind of
multilayer networks learned by the backpropagation algorithm are capable of
expressing a rich variety of nonlinear decision surfaces.
• The MLP networks:
Overcome problem of perceptrons.
A number of layers (1,2,3) are used.
Sigmoid (logistic, squashing function) unit
is used for threshold.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade43
The sigmoid threshold unit
• Like the perceptron, the sigmoid unit
first computes a linear combination of
its inputs, then applies a threshold to
the result. In the case of sigmoid unit,
however, the threshold output is a
continuous function of its input.
• Interesting property of sigmoid function:
• Output ranges between 0 and 1, increasing monotonically with its input.
• Multilayer networks of sigmoid units Backpropagation
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade44
The backpropagation (BP) algorithm
• The BP algorithm learns the weights for a multilayer network, given a network with a
fixed set of units and interconnections. It employs a gradient descent to attempt to
minimize the squared error between the network output values and the target values
for these outputs.
• Because we are considering networks with multiple output units rather than single
units as before, we begin by redefining E to sum the errors over all of the network
output units:
E(w) = ½ (tkd – okd)2
(8)
d D koutputs
where outputs is the set of output units in the network, and tkd and okd are respectively the target
and output values associated with the kth output unit and training example d.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade45
The backpropagation (BP) algorithm (cont…)
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade46
The backpropagation (BP) algorithm (cont…)
• In the BP algorithm, step1 propagates the input forward through the network. And
the steps 2, 3 and 4 propagates the errors backward through the network.
• The main loop of BP repeatedly iterates over the training examples. For each
training example, it applies the ANN to the example, calculates the error of the
network output for this example, computes the gradient with respect to the error on
the example, then updates all weights in the network. This gradient descent step is
iterated until ANN performs acceptably well.
• A variety of termination conditions can be used to halt the procedure.
One may choose to halt after a fixed number of iterations through the loop, or
Once the error on the training examples falls below some threshold, or
Once the error on a separate validation set of examples meets some criteria.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade47
Adding momentum
• Because BP is a widely used algorithm, many variations have been developed. The most
common is to alter the weight-update rule in Step 4 in the algorithm by making the weight
update on the nth iteration depend partially on the update that occurred during the (n -1)th
iteration, as follows:
Here wi,j(n) is the weight update performed during the n-th iteration through the main loop of the algorithm.
- n-th iteration update depend on (n-1)th iteration
- : constant between 0 and 1 is called the momentum.
• Role of momentum term:
- keep the ball rolling through small local minima in the error surface.
- Gradually increase the step size of the search in regions where the gradient is
unchanging, thereby speeding convergence.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade48
Remarks on the backpropagation algorithm
• Convergence and Local Minima
Gradient descent to some local minimum
Perhaps not global minimum...
• Heuristics to alleviate the problem of local minima
Add momentum
Use stochastic gradient descent rather than true gradient descent.
Train multiple nets with different initial weights using the same data.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade49
Generalization, Overfitting, and Stopping Criterion
• Termination condition:
Until the error E falls below some predetermined threshold.
This is a poor strategy.
• Overfitting problem:
Backpropagation is susceptible
to overfitting the training examples
at the cost of decreasing
generalization accuracy over
other unseen examples.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade50
Techniques to overcome overfitting problem
1. Weight decay:
Decrease each weight by some small factor during each iteration. The motivation for this
approach is to keep weight values small.
2. Cross validation:
a set of validation data in addition to the training data. The algorithm monitors the error w.r.t.
this validation data while using the training set to drive the gradient descent search.
How many weight-tuning iterations should the algorithm perform? It should use the number of
iterations that produces the lowest error over the validation set.
Two copies of the weights are kept: one copy for training and a separate copy of the best weights
thus far, measured by their error over the validation set.
Once the trained weights reach a higher error over the validation set than the stored weights, training
is terminated and the stored weights are returned.
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade51
Expressive capabilities of ANNs
• Boolean functions:
Every boolean function can be represented by network with two layers of units
where the number of hidden units required grows exponentially.
• Continuous functions:
Every bounded continuous function can be approximated with arbitrarily small
error, by network with two layers of units [Cybenko 1989; Hornik et al. 1989].
• Arbitrary functions:
Any function can be approximated to arbitrary accuracy by a network with three
layers of units [Cybenko 1988].
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade52
Other types of ANNs
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazade53
Summary
• We have learned:
– Similarities between biological neural networks and ANNs
– Percepron is the base processing element in ANNs.
– How to use gradient descent algorithm for perceptron learning.
– The structure of multilayer neural networks (MLPs)
– Principles of backpropagation algorithm
– Some fine notes about the appropriate ANNs models
Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan
http://agri.uok.ac.ir/kmollazadeKurdistan natureWinter