University of Kurdistan Artificial Intelligence Methods (AIM) Lecturer: Kaveh Mollazade, Ph.D....

University of Kurdistan

Artificial Intelligence Methods (AIM)

Lecturer:

Kaveh Mollazade, Ph.D.

Department of Biosystems Engineering, Faculty of Agriculture, University of Kurdistan, Sanandaj, IRAN.

Lecture 2: Artificial Neural Networks (ANNs)

Artificial Intelligence Methods – Department of Biosystems Engineering – University of Kurdistan

http://agri.uok.ac.ir/kmollazade1

Contents

• This lecture will cover:

– Definitions of ANNs

– Biological inspiration

– ANNs vs. biological neural networks

– Perceprton training

– Gradient descent algorithm

– Multilayer neural networks (MLPs)

– Backpropagation algorithm



Definition

DARPA Neural Network Study (1988, AFCEA International Press, p. 60):

An artificial neural network is a system composed of many simple processing elements

operating in parallel whose function is determined by network structure, connection

strengths, and the processing performed at computing elements or nodes.

Haykin, S. (1994), Neural Networks: A Comprehensive Foundation, NY: Macmillan, p. 2:

An artificial neural network is a massively parallel distributed processor that has a natural

propensity for storing experiential knowledge and making it available for use.

An artificial neural network is an oversimplified representation of the neuron

interconnections in the human brain.



Biological inspiration

• Animals are able to react adaptively to changes in their external and internal

environment, and they use their nervous system to perform these behaviors.




• An appropriate model/simulation of the nervous system should be able to produce

similar responses and behaviors in artificial systems.




• The nervous system is build by relatively simple units, the neurons, so copying their

behavior and functionality should be the solution.

• Neuron is a simple computing element.

• Each individual neuron is not intelligent.

• Brain is composed of massively parallel network

of neurons.

• Brain is intelligent.



Biological neuron

• Soma or "cell body" is the bulbous end of a neuron, containing the cell nucleus. The word

"soma" comes from the Greek σῶμα, meaning "body". There are many different specialized

types of neurons, and their sizes vary from as small as about 5 micrometers to over 10

millimeter for some of the smallest and largest neurons of invertebrates, respectively.

Dendrite

Soma

Nucleus

Axon terminal

Axon



Biological neuron

• Dendrites (also dendron) are the branched projections of a neuron that act to propagate the

electrochemical stimulation received from other neural cells to the cell body, or soma, of the

neuron from which the dendrites project. Electrical stimulation is transmitted onto dendrites by

upstream neurons (usually their axons) via synapses which are located at various points

throughout the dendritic tree.

Dendrite

Soma

Nucleus

Axon terminal

Axon



Biological neuron

• Axon also known as a nerve fiber, is a long, slender projection of a nerve cell, or neuron, that

typically conducts electrical impulses away from the neuron's cell body. The function of the axon

is to transmit information to different neurons, muscles and glands.

Dendrite

Soma

Nucleus

Axon terminal

Axon



Biological neuron

• Axon terminals (also called synaptic boutons) are distal terminations of the branches of an

axon. Neurons are interconnected in complex arrangements, and use electrochemical signals

and neurotransmitter chemicals to transmit impulses from one neuron to the next; axon

terminals are separated from neighboring neurons by a small gap called a synapse, across

which impulses are sent.

Dendrite

Soma

Nucleus

Axon terminal

Axon



Biological neuron

• The spikes travelling along the axon of the

pre-synaptic neuron trigger the release of

neurotransmitter substances at the synapse.

• The neurotransmitters cause excitation or

inhibition in the dendrite of the post-synaptic

neuron.

• The integration of the excitatory and

inhibitory signals may produce spikes in the

post-synaptic neuron.

• The contribution of the signals depends on

the strength of the synaptic connection.



Artificial neuron vs. biological neuron



Artificial neuron vs. biological neuron

Biological Neural Network Artificial Neural Network Soma

Dendrite

Axon

Synapse

Neuron

Input

Output

Weight



Artificial neurons: The McCullogh-Pitts model

• Neurons work by processing information. They receive and provide information in

form of spikes.

Inputs

Outputw2

w1

w3

wn

wn-1

..

.

x1

x2

x3

…

xn-1

xn

y)(;

1

zHyxwzn

iii




Spikes are interpreted as spike rates.

Synaptic strength are translated as synaptic weights.

Excitation means positive product between the incoming spike rate and the

corresponding synaptic weight.

Inhibition means negative product between the incoming spike rate and the

corresponding synaptic weight.




• Nonlinear generalization of the McCullogh-Pitts neuron:

),( wxfy

y is the neuron’s output, x is the vector of inputs, and w is the vector of synaptic weights.

axwTey

1

1Sigmoidal neuron: Gaussian neuron:

2

2

2

||||

a

wx

ey

• Examples:



Artificial neural networks (ANNs)

Inputs

Output

• Each of the perceptrons receives inputs, processes inputs and delivers a single

output. The input can be raw input data or the output of other perceptrons. The

output can be the final result (e.g. 1 means yes, 0 means no) or it can be inputs to

other perceptrons.

• An ANN is composed of processing elements called perceptrons that are linked

together according to a specific network architecture.



Learning in biological systems

• Learning = learning by adaptation

• The young animal learns that the green fruits are sour, while the yellowish/reddish

ones are sweet. The learning happens by adapting the fruit picking behavior.

• At the neural level the learning happens by changing of the synaptic strengths,

eliminating some synapses, and building new ones.



Learning as optimization

• The objective of adapting the responses on the basis of the information received

from the environment is to achieve a better state. E.g., the animal likes to eat many

energy rich, juicy fruits that make its stomach full, and makes it feel happy.

• In other words, the objective of learning in biological organisms is to optimize the

amount of available resources, happiness, or in general to achieve a closer to optimal

state.



Learning in biological neural networks

• Hebbian theory is a theory in neuroscience that proposes an explanation for the

adaptation of neurons in the brain during the learning process.

“When an axon of cell A is near

enough to excite a cell B and

repeatedly or persistently takes part in

firing it, some growth process or

metabolic change takes place in one

or both cells such that A’s efficiency,

as one of the cells firing B, is

increased “ (D. O. Hebb, 1949).



Learning in biological neural networks

• The learning rules of Hebb:

• Synchronous activation increases the synaptic strength;

• Asynchronous activation decreases the synaptic strength.

• These rules fit with energy minimization principles.

• Maintaining synaptic strength needs energy, it should be maintained at those places

where it is needed, and it shouldn’t be maintained at places where it’s not needed.



Learning principle for artificial neural networks

• ENERGY MINIMIZATION

We need an appropriate definition of energy for artificial neural networks, and having

that we can use mathematical optimization techniques to find how to change the

weights of the synaptic connections between neurons.

• ENERGY = measure of task performance error



Perceptrons

• A perceptron takes a vector of real-valued inputs, calculates a linear combination of

these inputs, then outputs

1 if the result is greater than some threshold

–1 otherwise.

• Given real-valued inputs x1 through xn, the output o(x1, …, xn) computed by the

perceptron is:

o(x1, …, xn) = 1 if w0 + w1x1 + … + wnxn > 0

o(x1, …, xn) = -1 otherwise

where wi is a real-valued constant, or weight.

• Notice the quantify (-w0) is a threshold that the weighted combination of inputs w1x1

+ … + wnxn must surpass in order for perceptron to output 1.



Perceptrons

• To simplify notation, we imagine an additional constant input x0 = 1, allowing us to

write the above inequality as:

• Learning a perceptron involves choosing values for the weights w0, w1,…, wn.



Representation power of perceptrons

• We can view the perceptron as representing a hyperplane decision surface in the

n-dimensional space of instances (i.e. points). The perceptron outputs 1 for instances

lying on one side of the hyperplane and outputs –1 for instances lying on the other

side. The equation for this decision hyperplane is:

Some sets of positive and negative

examples cannot be separated by any

hyperplane. Those that can be separated

are called linearly separated set of

examples.



Example: Represent boolean functions by single perceptrons

• AND function

x1 x2 output

0 0 -10 1 -11 0 -11 1 1

<Training examples>Decision hyperplane :

w0 + w1 x1 + w2 x2 = 0-0.8 + 0.5 x1 + 0.5 x2 = 0

x1 x2 Swixi output

0 0 -0.8 -10 1 -0.3 -11 0 -0.3 -11 1 0.2 1

<Test Results>

-

-

-

+

x1

x2

-0.8 + 0.5 x1 + 0.5 x2 = 0

-

-

-

+

x1

x2

-0.8 + 0.5 x1 + 0.5 x2 = 0




• OR function

x1 x2 o utput

0 0 - 1

0 1 11 0 11 1 1

<Training e x ample s >

Decision hyperplane :w0 + w1 x1 + w2 x2 = 0-0.3 + 0.5 x1 + 0.5 x2 = 0

x1 x2 Swixi output

0 0 -0.3 -10 1 0.2 11 0 0.2 11 1 0.7 1

<Test Results>

-

+

+

+

x1

x2

-0.3 + 0.5 x1 + 0.5 x2 = 0

-

+

+

+

x1

x2

-0.3 + 0.5 x1 + 0.5 x2 = 0




• XOR function

It’s impossible to implement the XOR function by a single perceptron.

x1 x2 output0 0 -10 1 11 0 11 1 -1

<Trai ni ng exampl es>

-

+

+

-

x1

x2

-

+

+

-

x1

x2

A two-layer network of perceptrons can represent XOR function.

Refer to this equation,



Perceptron training rule

• Although we are interested in learning networks of many interconnected units, let us

begin by understanding how to learn the weights for a single perceptron.

• Here learning is to determine a weight vector that causes the perceptron to produce

the correct +1 or –1 for each of the given training examples.

• Several algorithms are known to solve this learning problem. Here we consider two:

the perceptron rule and the delta rule.



Perceptron training rule (cont…)

• One way to learn an acceptable weight vector is to begin with random weights, then

iteratively apply the perceptron to each training example, modifying the perceptron

weights whenever it misclassifies an example. This process is repeated, iterating

through the training examples as many as times needed until the perceptron

classifies all training examples correctly.

• Weights are modified at each step according to the perceptron training rule, which

revises the weight wi associated with input xi according to the rule.

t is target output value for the current training example o is perceptron output is small constant (e.g., 0.1) called learning rate

wi wi + wi

where wi = (t – o) xi



Perceptron training rule (cont…)

• The role of the learning rate is to moderate the degree to which weights are

changed at each step. It is usually set to some small value (e.g. 0.1) and is

sometimes made to decrease as the number of weight-tuning iterations increases.

• We can prove that the algorithm will converge

If training data is linearly separable

and sufficiently small.

• If the data is not linearly separable, convergence is not assured.



Gradient descent and the delta rule

• Although the perceptron rule finds a successful weight vector when the training

examples are linearly separable, it can fail to converge if the examples are not

linearly separatable. A second training rule, called the delta rule, is designed to

overcome this difficulty.

• The key idea of delta rule: to use gradient descent to search the space of possible

weight vector to find the weights that best fit the training examples. This rule is

important because it provides the basis for the backpropagration algorithm, which

can learn networks with many interconnected units.



Gradient descent and the delta rule (cont…)

• The delta training rule:

Considering the task of training an unthresholded perceptron, that is a linear unit, for

which the output o is given by:

o = w0 + w1x1 + ··· + wnxn (1)

• Thus, a linear unit corresponds to the first stage of a perceptron, without the

threhold.




where D is set of training examples, td is the target output for the training example d

and od is the output of the linear unit for the training example d.

• In order to derive a weight learning rule for linear units, let specify a measure for the

training error of a weight vector, relative to the training examples. The training error

can be computed as the following squared error:

(2)

• Here we characterize E as a function of weight vector because the linear unit

output O depends on this weight vector.




• To understand the gradient descent algorithm, it is helpful to visualize the entire

space of possible weight vectors and their associated E values

Here the axes wo,w1 represents possible values for

the two weights of a simple linear unit. The wo,w1

plane represents the entire hypothesis space.

The vertical axis indicates the error E relative to

some fixed set of training examples. The error

surface shown in the figure summarizes the

desirability of every weight vector in the hypothesis

space.

• For linear units, this error surface must be parabolic with a single global minimum. And we

desire a weight vector with this minimum.

• How can we calculate the direction of steepest descent along the error surface? This direction

can be found by computing the derivative of E w.r.t. each component of the vector w.



Derivation of the gradient descent rule

• This vector derivative is called the gradient of E with respect to the vector <w0,

…,wn>, written E .

(3)

Notice E is itself a vector, whose components are the partial derivatives of E with respect to

each of the wi. When interpreted as a vector in weight space, the gradient specifies the

direction that produces the steepest increase in E. The negative of this vector therefore gives

the direction of steepest decrease.

• Since the gradient specifies the direction of steepest increase of E, the training rule

for gradient descent is: w w + w

(4)where



Derivation of the gradient descent rule (cont…)

• Here is a positive constant called the learning rate, which determines the step

size in the gradient descent search. The negative sign is present because we want to

move the weight vector in the direction that decreases E. This training rule can also

be written in its component form:

wi wi + wi

(5)where

which makes it clear that steepest descent is achieved by altering each component

wi of weight vector in proportion to E/wi.



Derivation of the gradient descent rule (cont…)

• The vector of E/wi derivatives that

form the gradient can be obtained by

differentiating E from Equation (2), as: (6)

where xid denotes the single input

component xi for the training example d.

We now have an equation that gives

E/wi in terms of the linear unit inputs xid,

output od and the target value td associated

with the training example. Substituting

Equation (6) into Equation (5) yields the

weight update rule for gradient descent.

(7)



Summary of gradient descent algorithm

• The gradient descent algorithm for training linear units is as follows:

1. Pick an initial random weight vector.

2. Apply the linear unit to all training examples

3. Compute wi for each weight according to Equation (7).

4. Update each weight wi by adding wi.

5. Repeat the steps 2 to 4 until reach the termination condition.



Some additional notes

• Because the error surface contains only a single global minimum, this algorithm

will converge to a weight vector with minimum error, regardless of whether the

training examples are linearly separable, given a sufficiently small is used.

• If is too large, the gradient descent search runs the risk of overstepping the

minimum in the error surface rather than settling into it. For this reason, one

common modification to the algorithm is to gradually reduce the value of as the

number of gradient descent steps grows.



Stochastic approximation to gradient descent

• The key practical difficulties in applying gradient descent are:

Converging to a local minimum can sometimes be quite slow (i.e., it can require many

thousands of steps).

If there are multiple local minima in the error surface, then there is no guarantee that the

procedure will find the global minimum.

• One common variation on gradient descent intended to alleviate these difficulties is

called incremental gradient descent (or stochastic gradient descent).

• In standard gradient descent, the error is summed over all examples before

upgrading weights, whereas in stochastic gradient descent weights are updated upon

examining each training example.



Stochastic approximation to gradient descent (cont…)

• Stochastic gradient descent (i.e.

incremental mode) can sometimes

avoid falling into local minima

because it uses the various

gradient of E rather than overall

gradient of E to guide its search.

• Both stochastic and standard gradient descent methods are commonly used in practice.



Multilayer neural networks

• Single perceptrons can only express linear decision surfaces. In contrast, the kind of

multilayer networks learned by the backpropagation algorithm are capable of

expressing a rich variety of nonlinear decision surfaces.

• The MLP networks:

Overcome problem of perceptrons.

A number of layers (1,2,3) are used.

Sigmoid (logistic, squashing function) unit

is used for threshold.



The sigmoid threshold unit

• Like the perceptron, the sigmoid unit

first computes a linear combination of

its inputs, then applies a threshold to

the result. In the case of sigmoid unit,

however, the threshold output is a

continuous function of its input.

• Interesting property of sigmoid function:

• Output ranges between 0 and 1, increasing monotonically with its input.

• Multilayer networks of sigmoid units Backpropagation



The backpropagation (BP) algorithm

• The BP algorithm learns the weights for a multilayer network, given a network with a

fixed set of units and interconnections. It employs a gradient descent to attempt to

minimize the squared error between the network output values and the target values

for these outputs.

• Because we are considering networks with multiple output units rather than single

units as before, we begin by redefining E to sum the errors over all of the network

output units:

E(w) = ½ (tkd – okd)2

(8)

d D koutputs

where outputs is the set of output units in the network, and tkd and okd are respectively the target

and output values associated with the kth output unit and training example d.



The backpropagation (BP) algorithm (cont…)



The backpropagation (BP) algorithm (cont…)

• In the BP algorithm, step1 propagates the input forward through the network. And

the steps 2, 3 and 4 propagates the errors backward through the network.

• The main loop of BP repeatedly iterates over the training examples. For each

training example, it applies the ANN to the example, calculates the error of the

network output for this example, computes the gradient with respect to the error on

the example, then updates all weights in the network. This gradient descent step is

iterated until ANN performs acceptably well.

• A variety of termination conditions can be used to halt the procedure.

One may choose to halt after a fixed number of iterations through the loop, or

Once the error on the training examples falls below some threshold, or

Once the error on a separate validation set of examples meets some criteria.



Adding momentum

• Because BP is a widely used algorithm, many variations have been developed. The most

common is to alter the weight-update rule in Step 4 in the algorithm by making the weight

update on the nth iteration depend partially on the update that occurred during the (n -1)th

iteration, as follows:

Here wi,j(n) is the weight update performed during the n-th iteration through the main loop of the algorithm.

- n-th iteration update depend on (n-1)th iteration

- : constant between 0 and 1 is called the momentum.

• Role of momentum term:

- keep the ball rolling through small local minima in the error surface.

- Gradually increase the step size of the search in regions where the gradient is

unchanging, thereby speeding convergence.



Remarks on the backpropagation algorithm

• Convergence and Local Minima

Gradient descent to some local minimum

Perhaps not global minimum...

• Heuristics to alleviate the problem of local minima

Add momentum

Use stochastic gradient descent rather than true gradient descent.

Train multiple nets with different initial weights using the same data.



Generalization, Overfitting, and Stopping Criterion

• Termination condition:

Until the error E falls below some predetermined threshold.

This is a poor strategy.

• Overfitting problem:

Backpropagation is susceptible

to overfitting the training examples

at the cost of decreasing

generalization accuracy over

other unseen examples.



Techniques to overcome overfitting problem

1. Weight decay:

Decrease each weight by some small factor during each iteration. The motivation for this

approach is to keep weight values small.

2. Cross validation:

a set of validation data in addition to the training data. The algorithm monitors the error w.r.t.

this validation data while using the training set to drive the gradient descent search.

How many weight-tuning iterations should the algorithm perform? It should use the number of

iterations that produces the lowest error over the validation set.

Two copies of the weights are kept: one copy for training and a separate copy of the best weights

thus far, measured by their error over the validation set.

Once the trained weights reach a higher error over the validation set than the stored weights, training

is terminated and the stored weights are returned.



Expressive capabilities of ANNs

• Boolean functions:

Every boolean function can be represented by network with two layers of units

where the number of hidden units required grows exponentially.

• Continuous functions:

Every bounded continuous function can be approximated with arbitrarily small

error, by network with two layers of units [Cybenko 1989; Hornik et al. 1989].

• Arbitrary functions:

Any function can be approximated to arbitrary accuracy by a network with three

layers of units [Cybenko 1988].



Other types of ANNs



Summary

• We have learned:

– Similarities between biological neural networks and ANNs

– Percepron is the base processing element in ANNs.

– How to use gradient descent algorithm for perceptron learning.

– The structure of multilayer neural networks (MLPs)

– Principles of backpropagation algorithm

– Some fine notes about the appropriate ANNs models


http://agri.uok.ac.ir/kmollazadeKurdistan natureWinter

University of Kurdistan Artificial Intelligence Methods (AIM) Lecturer: Kaveh Mollazade, Ph.D....

Documents

Transcript of University of Kurdistan Artificial Intelligence Methods (AIM) Lecturer: Kaveh Mollazade, Ph.D....