
7/23/2019 Artificial Neural Networks an Overview

http://slidepdf.com/reader/full/artificial-neural-networks-an-overview 1/24

Network: Computation in Neural Systems, March–December 2011; 22(1–4): 208–230

Artificial Neural Networks: An Overview

EVANGELIA MICHELI-TZANAKOU

Department of Biomedical Engineering, Rutgers University, Piscataway,

 New Jersey, U.S.A.

(Received 31 October 2011; accepted 1 November 2011)

Reprinted with permission from the Encyclopedia of Biomaterials and Biomedical

Engineering, 2nd ed. Gary E. Wnek and Gary L. Bowlin, editors. New York: Informa Healthcare. p. 119–132.

Keywords:  attention, auditory system

Introduction

Neural networks have been a much publicized topic of research in recent years and

are now beginning to be used in a wide range of subject areas. One of the strands of 

interest in neural networks is to explore possible models of biological computation.

Human brains contain about 1.5 × 10^11 neurons of various types, with each receiving signals through 10 to 10^4 synapses. The response of a neuron happens in about 1 to 10 msec. Yet, one can recognize an old friend's face and call him in about 0.1 sec. This is a complex pattern recognition task that must be performed in a highly parallel way, because the recognition occurs in about 100 to 1000 steps.

This indicates that highly parallel systems can perform pattern recognition tasks

more rapidly than current conventional sequential computers. As yet, very-large-scale integration (VLSI) technology, which is essentially a planar implementation with at most two or three layers of cross-connections, is far from achieving these parallel connections, which require 3-D interconnections.

Correspondence: Evangelia Micheli-Tzanakou, Department of Biomedical Engineering, Rutgers University,

Piscataway, New Jersey, USA. E-mail: claire.bonnett@informa.com


 Artificial Neural Networks

Although neural networks were originally intended to mimic a task-specific subsystem of a mammalian or human brain, recent research has mostly concentrated on ANNs, which are only vaguely related to the biological system. Neural networks are specified by (1) the net topology; (2) node characteristics; and (3) training or learning rules.

Topological considerations of Artificial Neural Networks (ANNs) for different purposes are provided in review papers (Lippmann 1987; Hush and Horne 1993). Because our interest in neural networks here is in classification only, the feed-forward multilayer perceptron topology is considered, leaving the feedback connections to the references.

The topology describes the connections, with the number of layers and the units in each layer for feed-forward networks. Node functions are usually non-linear in the middle layers but can be linear or non-linear for output layer nodes. However, all of the units in the input layer are linear and have fan-out connections from the input to the next layer.

Each output y_i is weighted by w_ij and summed at the linear combiner, represented by a small circle in Figure 1. The linear combiner thresholds its inputs before it sends them to the node function φ_j. The unit functions are non-linear, monotonically increasing, and bounded functions, as shown at the right in Figure 1.
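The behavior of a single unit can be sketched in a few lines. The sketch below assumes a sigmoid as the bounded, monotonically increasing node function; the function name and all values are illustrative, not from the paper:

```python
import math

def node_output(inputs, weights, bias):
    """One unit of Figure 1: a linear combiner followed by a sigmoid
    node function (non-linear, monotonically increasing, bounded)."""
    # Weighted sum of the incoming signals plus the threshold/bias term.
    x = bias + sum(w * y for w, y in zip(weights, inputs))
    # Sigmoid squashes the combiner output into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))
```

A hard-limiter unit would instead return 1.0 when x > 0 and 0.0 otherwise.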

Usage of Neural Networks

One use of a neural network is classification. For this purpose, each input pattern is forced adaptively to output the pattern indicators that are part of the training data; the training set consists of the input covariate x and the corresponding class labels.

Feed-forward networks, sometimes called multilayer perceptrons (MLP), are trained adaptively to transform a set of input signals, X, into a set of output signals, G.

Feedback networks begin with the initial activity state of a feedback system. After state

transitions have taken place the asymptotic final state is identified as the outcome of 

the computation. One use of the feedback networks is in the case of associative

memories: upon being presented with a pattern near a prototype X it should output pattern X′, acting as an autoassociative memory or content-addressable memory by which the desired output is completed to become X.

Figure 1. (I) The linear combiner output x_j = Σ_{i=1}^n y_i w_ij is input to the node function φ_j, giving the output y_j. (II) Possible node functions: hard limiter (a), threshold (b), and sigmoid (c).

In all cases the network learns or is trained by the repeated presentation of 

patterns with known required outputs (or pattern indicators). Supervised neural

networks find a mapping f: X → G for a given set of input and output pairs.

Other Neural Networks

The other branch of the neural network family is unsupervised learning, that is, clustering. The class information is not known or is irrelevant. The networks find groups of similar input patterns.

The neighboring code vectors in a neural network compete in their activities by 

means of mutual lateral interactions and develop adaptively into specific detectors of 

different signal patterns. Examples are the Self-Organizing Map (Kohonen 1990) and the Adaptive Resonance Theory (ART) networks (Carpenter and Grossberg 1987). The ART is different from other unsupervised learning networks in that it develops new clusters by itself. The network develops a new code vector if sufficiently different patterns exist. Thus, the ART is truly adaptive, whereas other

methods require the number of clusters to be specified in advance.

Feed-Forward Networks

In feed-forward networks the signal flows only in the forward direction; no feedback

exists for any node. This is perhaps best illustrated graphically in Figure 2. This is

the simplest topology and is good enough for most practical classification problems

(Ripley 1993).

Figure 2. A generic feed-forward network with a single hidden layer. For bias terms the constant inputs with 1 are shown and the weights of the constant inputs are the bias values,

which will be learned as training proceeds.


The general definition allows more than one hidden layer and also allows "skip layer" connections from input to output. With this skip layer, one can write a general expression for a network output y_k with one hidden layer,

y_k = φ_k( b_k + Σ_{i→k} w_ik x_i + Σ_{j→k} w_jk φ_j( b_j + Σ_{i→j} w_ij x_i ) ),   (1)

where the b_j and b_k represent the thresholds for each unit in the jth hidden layer and the output layer, which is the kth layer. Because the threshold values b_j, b_k are to be adaptive, it is useful to treat each threshold as a weight on a constant input of value 1, as in Figure 2. The function φ(·) is almost inevitably taken to be a linear, sigmoidal (φ(x) = e^x/(1 + e^x)), or threshold function (φ(x) = I(x > 0)).
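Equation 1 can be sketched directly as code. The sketch below assumes sigmoid node functions throughout and a single output unit; the function names and the network shape are illustrative assumptions, not from the paper:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_skip, w_in, w_out, b_hidden, b_out):
    """Network output y_k of Equation 1 for a single output unit:
    a hidden-layer term plus a 'skip layer' term direct from the inputs.

    x         : input vector
    w_skip[i] : direct input-to-output weights (the skip-layer connections)
    w_in[j]   : input weights of hidden unit j
    w_out[j]  : weight from hidden unit j to the output
    b_hidden  : hidden-unit thresholds b_j;  b_out : output threshold b_k
    """
    hidden = [sigmoid(b_j + sum(w * xi for w, xi in zip(row, x)))
              for b_j, row in zip(b_hidden, w_in)]
    net = (b_out
           + sum(w * xi for w, xi in zip(w_skip, x))     # skip-layer sum over i -> k
           + sum(w * h for w, h in zip(w_out, hidden)))  # hidden sum over j -> k
    return sigmoid(net)
```

Setting every skip-layer weight to zero recovers the plain one-hidden-layer perceptron of Figure 2.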

Rumelhart et al. (1986a) demonstrated that feed-forward multilayer perceptron networks can learn using gradient values obtained by an algorithm called error backpropagation.

This contribution was a remarkable advance: in 1969, Minsky and Papert (1969) had claimed that the non-linear boundary required for the XOR problem could be obtained using a multilayer perceptron, but the learning method was unknown at the time.

Rosenblatt (1959) introduced the one-layer single perceptron learning method called the perceptron convergence procedure, and research on the single perceptron was active until the counterexample of the XOR problem, which the single perceptron could not solve, was introduced.

In multilayer network learning the usual objective or error function to be

minimized has the form of a squared error

E(w) = Σ_{p=1}^{P} [ t^p − f(x^p; w) ]²   (2)

that is to be minimized with respect to w, the weights in the network. Here p represents the pattern index, p = 1, 2, …, P, and t^p is the target or desired value when x^p is the input to the network. Clearly, this minimization can be obtained by

any number of unconstrained optimization algorithms; gradient methods or

stochastic optimization are possible candidates.

The updating of weights has the form of a steepest descent method,

w_ij ← w_ij − η ∂E/∂w_ij,   (3)

where the gradient value ∂E/∂w_ij is calculated for each pattern being presented; the error term E(w) in the on-line learning is not the summation of the squared error for all P patterns.

Note that the gradient points in the direction of maximum increasing error. In order to minimize the error it is necessary to multiply the gradient vector by −1 and by a learning rate η.

The updating method (Equation 3) has a constant learning rate η for all weights and is independent of time. The original method of steepest descent has a time-dependent parameter η_k; hence, η_k must be calculated as iterations progress.
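A minimal sketch of the constant-learning-rate update of Equation 3, applied to a toy one-dimensional error E(w) = w²; the toy function and the value of η are illustrative assumptions:

```python
def sgd_step(weights, grads, eta=0.1):
    """One steepest-descent update (Equation 3): w <- w - eta * dE/dw,
    with the same constant learning rate eta for every weight."""
    return [w - eta * g for w, g in zip(weights, grads)]

# Toy usage: minimize E(w) = w**2, whose gradient is dE/dw = 2*w.
w = [1.0]
for _ in range(100):
    w = sgd_step(w, [2.0 * w[0]])
# Each step multiplies w by (1 - 2*eta), so w shrinks toward the minimizer 0.
```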


Error Backpropagation

Backpropagation was first discussed by Bryson and Ho (1969), later researched by Werbos (1974) and Parker (1985), and rediscovered and popularized by Rumelhart et al. (1986a, 1986b). Each pattern is presented to the network and the input x_j and output y_j are calculated as in Figure 3. The partial derivative of the error function with respect to the weights is

∇E(t) = [ ∂E(t)/∂w_1(t), …, ∂E(t)/∂w_n(t) ],   (4)

where n is the number of weights and t is the time index representing the instance

of the input pattern presented to the network.

The former indexing is for on-line learning, in which the gradient term of each weight does not accumulate; this is a simplified version of the gradient method that makes use of the gradient information of all training data. In other words, there are two ways to update the weights using Equation 4.

w_ij^(p) ← w_ij^(p) − η (∂E/∂w_ij)^(p)     temporal learning   (5)

w_ij ← w_ij − η Σ_p (∂E/∂w_ij)^(p)     epoch learning   (6)

One way is to sum all the P  patterns to get the sum of the derivatives in Equation 6

and the other way (Equation 5) is to update the weights for each input and output

pair temporally without summation of the derivatives. The temporal learning, also

called on-line learning (Equation 5), is simple to implement in a VLSI chip, because

it does not require the summation logic and storing of each weight, whereas the

epoch learning in Equation 6 does. However, the temporal learning is an asymptotic

approximation version of the epoch learning, which is based on minimizing an

objective function (Equation 2).
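The difference between the temporal update (Equation 5) and the epoch update (Equation 6) can be sketched on a toy one-weight least-squares problem; the model, data, and learning rate below are illustrative assumptions, not from the paper:

```python
def grad_p(w, x, t):
    """dE_p/dw for one pattern's squared error E_p = (t - w*x)**2."""
    return -2.0 * x * (t - w * x)

def epoch_step(w, data, eta):
    """Epoch learning (Equation 6): sum the per-pattern gradients
    over all P patterns, then update the weight once."""
    return w - eta * sum(grad_p(w, x, t) for x, t in data)

def temporal_pass(w, data, eta):
    """Temporal (on-line) learning (Equation 5): update the weight
    immediately after each pattern, with no gradient accumulation."""
    for x, t in data:
        w = w - eta * grad_p(w, x, t)
    return w
```

For data generated by t = 2x both variants converge to w ≈ 2, but the temporal version needs no summation logic, which is the VLSI-friendliness the text refers to.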

With the help of Figure 3, the first derivatives of  E  with respect to a specific weight

w jk  can be expanded by the chain rule:

∂E/∂w_jk = (∂E/∂x_k)(∂x_k/∂w_jk) = (∂E/∂x_k) y_j = φ′_k(x_k) (∂E/∂y_k) y_j   (7)

Figure 3. Error backpropagation. The δ_j for weight w_ij is obtained; the δ_k's are then propagated backward.


or

∂E/∂w_jk = (∂φ_k(x_k)/∂x_k)(∂E/∂y_k) y_j = δ_k y_j.   (8)

For output units, ∂E/∂y_k is readily available, i.e., 2(y_k − t^p), where y_k and t^p are the network output and the desired target value for input pattern x^p. The φ′_k(x_k) is straightforward for the linear and logistic non-linear node functions; the hard limiter, on the other hand, is not differentiable.

For the linear node function,

φ′(x) = 1 with y = φ(x) = x,

whereas for the logistic unit the first-order derivative becomes

φ′(x) = y(1 − y)   (9)

when

φ(x) = e^x / (1 + e^x).   (10)
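Equations 9 and 10 are easy to verify numerically: the logistic derivative can be computed from the unit's output y alone. A small sketch (the finite-difference check is only an illustration):

```python
import math

def logistic(x):
    """Equation 10, written equivalently as 1/(1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def logistic_deriv(x):
    """Equation 9: phi'(x) = y(1 - y), using only the output y."""
    y = logistic(x)
    return y * (1.0 - y)
```

Comparing `logistic_deriv` against a central finite difference confirms the identity; this reuse of y is what makes the backward pass cheap for logistic units.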

The derivative can be written in the form

∂E/∂w_ij = Σ_p y_i^p δ_j^p,   (11)

which has become known as the  generalized delta rule.

The δ's in the generalized delta rule (Equation 11) for output nodes, therefore, become

δ_k = 2 y_k(1 − y_k)(y_k − t^p) for a logistic output unit   (12a)

and

δ_k = 2(y_k − t^p) for a linear output unit.   (12b)

The interesting point in the backpropagation algorithm is that the δ's can be computed from output to input through the hidden layers across the network; the δ's for units in earlier layers can be obtained by summing the δ's in the higher layers. As illustrated in Figure 3, the δ_j are obtained as

δ_j = φ′_j(x_j) ∂E/∂y_j = φ′_j(x_j) Σ_{j→k} (∂E/∂x_k)(∂x_k/∂y_j) = φ′_j(x_j) Σ_{j→k} w_jk δ_k   (13)

The δ_k's are available from the output nodes. As the updating or learning progresses backward, the previous (or higher) δ_k are weighted by the weights w_jk and summed to give the δ_j. Because Equation 13 for δ_j only contains terms at higher-layer units, it can be calculated backward from the output to the input of the network.


Madaline Rule III for multilayer networks with a sigmoid function. Widrow took an

independent path in learning as early as the 1960s (Widrow 1962; Widrow and Lehr

1990). After some 30 years of research in adaptive filtering, Widrow and colleagues

returned to neural network research (Widrow and Lehr 1990) and extended

Madaline I with the goal of developing a new technique that could adapt multiplelayers of adaptive elements using the simpler hard-limiting quantizer. The result was

Madaline Rule II (or simply MRII), a multilayer linear combiner with a hard-

limiting quantizer.

Andes (unpublished, 1988) modified MRII by replacing the hard-limiting quantizer in the Madaline (i.e., a single-layer linear combiner with a hard-limiting quantizer) with a sigmoid function, resulting in MRIII. It was proven later that MRIII is, in essence, equivalent to backpropagation. The important difference from the

gradient-based backpropagation method is that the derivative of the sigmoid

function is not required in this realization; thus, the analog implementation becomes

feasible with this MRIII multilayer learning rule.

A comment on the terminology backpropagation. The term backpropagation has been used differently from what it should mean. To obtain the partial derivatives

of the error function (at the system output node) with respect to the weights of the

units lower than the output unit, the terms in the output unit are propagated

backward, as in Equation 13. However, the network (actually the weights) learns (or

weights are updated) using the pseudo steepest descent method (Equation 3); it is

pseudo because a constant term η is used, whereas the steepest descent method requires an optimal learning rate for each weight and time instance, i.e., η_ij(k). The error backpropagation is indeed used to find the necessary gradient values in the updating rule. Thus, it is not a good idea to call backpropagation a learning

method: the learning method is a simple version of the steepest descent method,

which is one of the classical minimizer finding algorithms. Backpropagation is an

algorithm used to find the gradient of   E   in a feed-forward multilayer perceptron

network.

Optimization machines with feed-forward multilayer perceptrons. Optimization in multilayer perceptron structures can be easily realized by gradient-based optimization methods with the help of backpropagation. In the multilayer perceptron

structure, the functions can be minimized/maximized via any gradient-based

unconstrained optimization algorithm, such as Newton’s method or the steepest

descent method.

The optimization machine is functionally described as depicted in Figure 4 and

consists of two parts: gradient calculation and weight (or parameter) updating.

The gradient of   E   of the multilayer perceptron network is obtained by error

backpropagation. If this gradient is used in an on-line fashion with the constant

learning rate, as in Equation 3, then this structure is the neural network used earlier

(Rumelhart et al. 1986b). This on-line learning structure possesses a desirable

feature in VLSI implementation of the algorithm because it is temporal: no

summation over all the patterns is required but the weights are updated as the individual pattern is presented to the network. It requires little memory but


The other branch in Figure 4 illustrates unconstrained optimization of the non-

linear function. The optimization machine obtains the gradient information as

before, but various and well-developed unconstrained optimizations can be used to

find the optimizer. The unconstrained non-linear minimization is divided basically 

into two categories: gradient methods and stochastic optimization. The gradient

methods are deterministic and use the gradient information to find the direction for the minimizer. Stochastic optimization methods such as the ALgorithm Of

Pattern Extraction (ALOPEX) are discussed in another section of this chapter

and in other references (Micheli-Tzanakou 1995; Micheli-Tzanakou et al. 1995;

Zahner and Micheli-Tzanakou 2000).

Comparisons of ALOPEX with backpropagation are reported by Zahner and

Micheli-Tzanakou (2000).

Justification for gradient methods for non-linear function approximation. Getting stuck in local minima is a well-known problem for gradient methods. However, the number of weights (the dimensionality of the weight space in neural networks) is usually much larger than the dimensionality of the input space X ⊂ R^p that is searched in the optimization. The redundant degrees of freedom used to find a better minimizer justify the gradient methods used in neural networks.

Another justification for use of the gradient method in optimization may result

from the approximation of highly nonlinear functions using Taylor expansion

(Ripley 1993), where the first- and second-order approximations, i.e., a quadratic

approximation to the non-linear function, are used. The quadratic function in a

covariate x  has a unique minimum or maximum.
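The point about quadratic approximations can be made concrete in one dimension: a Newton step jumps to the minimizer of the local quadratic (second-order Taylor) model, so on a genuinely quadratic function it converges in a single step. A sketch with an illustrative function E(x) = (x − 3)²:

```python
def newton_step(x, grad, hess):
    """One Newton step: minimizing the quadratic model
    E(x + d) ~ E(x) + grad*d + 0.5*hess*d**2 gives d = -grad/hess."""
    return x - grad / hess

# E(x) = (x - 3)**2 has gradient 2*(x - 3) and constant Hessian 2;
# from x = 10, a single Newton step lands exactly on the minimizer x = 3.
x_min = newton_step(10.0, 2.0 * (10.0 - 3.0), 2.0)
```

On a non-quadratic function the step only reaches the minimizer of the local model, which is why the approximation argument in the text matters.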

Training Methods for Feed-Forward Networks

Two basic methods exist to train the feed-forward networks: gradient-based learning

and stochastic learning. Training or learning is essentially an unconstrained

optimization problem. A number of optimization algorithms can be applied to the

function approximated by the network in a structured way defined by the network 

topology.

In gradient-based methods, the most popular learning involves the steepest

descent/ascent method with the error backpropagation algorithm to obtain the

required gradient of the minimizing/maximizing error function with respect to

the weights in the network (Rumelhart et al. 1986a, 1986b). Another method using the gradient information is Newton's method, which is basically used for

Figure 4. Functional diagram for an optimization machine.


the zero finding of the first derivative of a function; hence, Newton’s method

is valid.

All the deterministic (as opposed to stochastic) minimization techniques are based on one or both of the steepest descent and Newton's methods. The objective function to be optimized is usually limited to a certain class in network optimization. The square of the error ‖t − ŷ‖² and the information-theoretic measure of the Kullback–Leibler distance are objective functions used in feed-forward networks. This is because of the limitation in calculating the gradient values of the network utilized by the error backpropagation algorithm.

The recommended method of optimization according to Broyden, Fletcher, Goldfarb, and Shanno (BFGS) is the well-known Hessian matrix update in Newton's method of unconstrained optimization (Peressini et al. 1988), which requires gradient values. For the optimization machine in Figure 4, the feed-forward network with backpropagation provides the gradients and the Hessian approximation is obtained using the BFGS method.

The other branch of unconstrained non-linear multivariate minimization is grouped into the so-called stochastic optimization. The

representative algorithms are simulated annealing (Kirkpatrick et al. 1983),

Boltzmann machine learning (Hinton and Sejnowski 1986), and ALOPEX 

(Tzanakou and Harth 1973; Harth and Tzanakou 1974). Simulated annealing

(Kirkpatrick et al. 1983) has been successfully used in combinatorial optimization

problems such as the traveling salesman problem, VLSI wiring, and VLSI

placement problems. An application of feed-forward network learning was reported

(Engel 1988), with the weights being constrained to be integers or discrete values

rather than a continuum of the weight space.

Boltzmann machine learning, by Hinton and Sejnowski (1986), is similar to simulated annealing except that the acceptance of randomly chosen weights is possible even when the energy state is increased. In simulated annealing, the weights yielding a decreased energy state are always accepted, but in the Boltzmann machine a probability is used in accepting the increased energy states.

The simulated annealing and the Boltzmann machine learning (a general form of 

Hopfield network (Hopfield and Tank 1985) for the associative memory application) algorithms are mainly for combinatorial optimization problems with binary states of the units and the weights. Extension from binary to M-ary in the states of the weights has been reported for classification problems (Hopfield and Tank 1985) in simulated annealing training of the feed-forward perceptrons.

The ALOPEX was originally used for construction of visual receptive fields

(Tzanakou and Harth 1973; Harth and Tzanakou 1974; Tzanakou et al. 1979;

Micheli-Tzanakou 1984) but with some modifications it was later applied to

the learning of any type of network not restricted to multilayer perceptrons. The

ALOPEX is a random walk process in each parameter in which the direction of 

the constant jump is decided by the correlation between the weight changes and the

energy changes.

The ALOPEX

The ALOPEX process is an optimization procedure that has been successfully


applied. Originally developed for receptive field mapping in the visual pathway of frogs, ALOPEX's usefulness, combined with

its flexible form, increased the scope of its applications to a wide range of 

optimization problems. Since its development by Tzanakou and Harth in 1973

(Tzanakou and Harth 1973; Harth and Tzanakou 1974), ALOPEX has been

applied to real-time noise reduction (Ciaccio and Tzanakou 1990) and pattern recognition (Dasey and Micheli-Tzanakou 1989), and in multilayer neural network

training, among other applications.

Optimization procedures in general attempt to maximize or minimize a function F(·). The function F(·) is called the cost function and its value depends on many parameters or variables. When the number of parameters, N, is large, finding the set that corresponds to the optimal solution is exceedingly difficult. If N were small, then

one could perform an exhaustive search of the entire parameter space to find the

"best" solution. As  N  increases, intelligent algorithms are needed to quickly locate

the solution. Only an exhaustive search can guarantee that a global optimum is

found; however, near-optimal solutions are acceptable because of the tremendous speed improvement over exhaustive search methods.

Backpropagation, described earlier, being a gradient descent method, often gets

stuck in local extrema of the cost function. The local stopping points often represent

unsatisfactory convergence points. Techniques have been developed to avoid the

problem of local extrema, with simulated annealing (Kirkpatrick et al. 1983) being

the most common. Simulated annealing incorporates random noise, which acts to

dislodge the process from local extrema. It is crucial to the convergence of the

process that the random noise be reduced as the system approaches the global

optimum. If the noise is too large, the system will never converge and can be

mistakenly dislodged from the global solution.

The ALOPEX is another process that incorporates a stochastic element to avoid local extrema in search of the global optimum of the cost function. The cost function or response is problem dependent and is generally a function of a large number of parameters. The ALOPEX iteratively updates all parameters simultaneously based on the cross-correlation of the local changes ΔX_i and the global response change ΔR, plus additive noise. The cross-correlation term ΔX_i ΔR helps the process move in a direction that improves the response (Equation 14). Table I illustrates how this can be used to find a global maximum of R.

All parameters  X i   are changed simultaneously in each iteration according to

X_i(n) = X_i(n − 1) + γ ΔX_i(n) ΔR(n) + r_i(n).   (14)

The basic concept is that this cross-correlation provides a direction of movement for the next iteration. For example, take the case where X_i ↓ and R ↑. This means that the parameter X_i decreased in the previous iteration, and the response increased for that iteration. The product ΔX_i ΔR is a negative number, and thus X_i would be

Table I. Combinations of parameter changes and their contribution to the ALOPEX convergence

ΔX_i        ΔR         ΔX_i ΔR
X_i ↑ (+)   R ↑ (+)    +
X_i ↑ (+)   R ↓ (−)    −
X_i ↓ (−)   R ↑ (+)    −
X_i ↓ (−)   R ↓ (−)    +


decreased again in the next iteration. This makes perfect sense because a decrease in X_i produced a higher response; if one is looking for the global maximum then X_i should be decreased again. Once X_i is decreased and R also decreases, then ΔX_i ΔR is now positive and X_i increases.

These movements are only tendencies because the process includes a random component r_i(n), which will act to move the weights unpredictably, avoiding local

extrema of the response. The stochastic element of the algorithm helps avoid local

extrema at the expense of slightly longer convergence or learning periods.

The general ALOPEX updating equation (Equation 14) is explained as follows. X_i(n) are the parameters to be updated, n is the iteration number, and R(·) is the cost function, of which the best solution in terms of X_i is sought. Gamma, γ, is a scaling constant, r_i(n) is a random number from a gaussian distribution whose mean and standard deviation are varied, and ΔX_i(n) and ΔR(n) are found using

ΔX_i(n) = X_i(n − 1) − X_i(n − 2)   (15)

ΔR(n) = R(n − 1) − R(n − 2).   (16)
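Equations 14–16 can be put together into a minimal ALOPEX sketch. Everything here (the values of γ and σ, the iteration count, and the fixed noise schedule) is an illustrative assumption; the process described in the text additionally anneals σ and adjusts γ as the run converges:

```python
import random

def alopex_maximize(f, x0, gamma=0.1, sigma=0.3, iters=3000, seed=0):
    """Minimal ALOPEX sketch: every parameter moves by
    gamma * dX_i * dR (Equations 14-15, the local-change /
    response-change correlation) plus additive gaussian noise."""
    rng = random.Random(seed)
    x_prev2 = list(x0)
    # Bootstrap a second past point so dX and dR are defined.
    x_prev1 = [x + rng.gauss(0.0, sigma) for x in x_prev2]
    r_prev2, r_prev1 = f(x_prev2), f(x_prev1)
    best_x, best_r = (list(x_prev1), r_prev1) if r_prev1 > r_prev2 else (list(x_prev2), r_prev2)
    for _ in range(iters):
        dr = r_prev1 - r_prev2                           # Equation 16
        x_new = [x1 + gamma * (x1 - x2) * dr + rng.gauss(0.0, sigma)
                 for x1, x2 in zip(x_prev1, x_prev2)]    # Equations 14 and 15
        r_new = f(x_new)
        if r_new > best_r:                               # track the best response seen
            best_x, best_r = list(x_new), r_new
        x_prev2, x_prev1 = x_prev1, x_new
        r_prev2, r_prev1 = r_prev1, r_new
    return best_x, best_r
```

The ΔX_i ΔR term reproduces the sign logic of Table I: a move that raised the response tends to be repeated, one that lowered it tends to be reversed, and the gaussian noise keeps the walk from locking into a local extremum.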

The calculation of R(·) is problem dependent and can be easily modified to fit

many applications. This flexibility was demonstrated in the early studies of Harth

and Tzanakou (1974). In mapping receptive fields, no a priori knowledge or

assumptions were made about the calculation of the cost function; instead, a

response was measured with microelectrodes inserted in the brain. Using action

potentials as a measure of the response (Harth and Tzanakou 1974; Tzanakou et al.

1979; Micheli-Tzanakou 1984), receptive fields could be determined using the

ALOPEX process to iteratively modify the stimulus pattern until it produced

the largest response. This response was obtained from the stimulated neurons. (This was the first computer-to-brain interfacing ever used with electrodes inserted

in an animal brain to measure neuronal activity and use it as feedback to the

computer.)

Because of its stochastic nature, efficient convergence depends on the proper control of both the additive noise and the gain factor γ. Initially, all parameters X_i are random and the additive noise has a gaussian distribution with mean 0 and standard deviation σ, initially large. The standard deviation decreases as the process converges to ensure a stable stopping point (Figure 6). Conversely, γ increases with iterations (Figure 5). As the process converges, ΔR becomes smaller and smaller, requiring an increase in γ to compensate.

Additional constraints include a maximal change permitted for X_i for a single

iteration. This bounded step size prevents the algorithm from drastic changes from

one iteration to the next.

These drastic changes often lead to long periods of oscillation, during which the

algorithm fails to converge.

Multilayer Perceptron (MLP) Network Training with ALOPEX

An MLP can also be trained for pattern recognition using ALOPEX. A response is calculated for the jth input pattern based on the observed and desired outputs,

R_j(n) = −‖O^des − O^obs(n)‖²,   (17)


where O_k^obs and O_k^des (the observed and desired outputs, respectively) are vectors corresponding to O_k for all k. The total response for iteration n is the sum of all individual template responses, R_j(n):

R(n) = Σ_{j=1}^{m} R_j(n)    (18)

In Equation 18, m is the number of templates used as inputs. ALOPEX iteratively updates the weights using both the global response information and the local weight histories, according to the following equations:

W_ij(n) = W_ij(n−1) + γ ΔW_ij(n) ΔR(n) + r_i(n)    (19)

W_jk(n) = W_jk(n−1) + γ ΔW_jk(n) ΔR(n) + r_i(n)    (20)

where γ is an arbitrary scaling factor, r_i(n) is the additive gaussian noise, ΔW represents the local weight change, and ΔR represents the global response information. These values are calculated by

ΔW_ij(n) = W_ij(n−1) − W_ij(n−2)    (21)

[Figure 5. Gamma vs. iterations.]

[Figure 6. Gaussian noise standard deviation vs. iterations.]


ΔW_jk(n) = W_jk(n−1) − W_jk(n−2)    (22)

ΔR(n) = R(n−1) − R(n−2)    (23)
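In code, Equations 19 through 23 amount to one line per weight. The helper below, with its list-of-lists weight matrices, is a sketch of my own and is applied identically to either layer's weights:

```python
import random

def alopex_weight_update(w1, w2, r1, r2, gamma, sigma, rng):
    """One ALOPEX iteration for a weight matrix.

    w1, w2 : weights at iterations n-1 and n-2
    r1, r2 : global responses at iterations n-1 and n-2
    Implements W(n) = W(n-1) + gamma * dW * dR + noise (Eqs. 19-20),
    with dW = W(n-1) - W(n-2) (Eqs. 21-22) and dR = R(n-1) - R(n-2) (Eq. 23).
    """
    dr = r1 - r2
    return [[wp + gamma * (wp - wp2) * dr + rng.gauss(0.0, sigma)
             for wp, wp2 in zip(row1, row2)]
            for row1, row2 in zip(w1, w2)]

# Deterministic check with the noise turned off:
rng = random.Random(0)
w_new = alopex_weight_update([[1.0]], [[0.5]], 2.0, 1.0, gamma=0.1,
                             sigma=0.0, rng=rng)
```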

After the network was trained, it was tested for correct recognition using

incomplete or noisy input patterns. The results demonstrate the robustness of the

system to noise-corrupted data. Regardless of which training procedure was used,

backpropagation or ALOPEX, the recognition ability of the system was the same.

The only difference was in how the response grew with iterations.

The neural network’s robustness is derived from its parallel architecture and

depends on the network topology rather than the learning scheme used to train. The

network used was a three-layer feed-forward network with 35 input nodes, 20 hidden

nodes, and 10 output nodes. The network’s recognition ability was tested with noisy 

input patterns. Each 5 × 7 digit of the training set was subjected to gaussian noise of varying strength and tested for correct recognition. The single-value response feedback, as opposed to the extensive error-propagation schemes of other neural network training algorithms, makes ALOPEX suitable for fast VLSI implementation.

In our laboratory, an analog VLSI chip was designed to implement the ALOPEX 

algorithm, as described in another section. Making full use of the algorithm’s

tolerance to noise, an analog design was chosen. Analog designs offer larger and

faster implementations than digital designs.

Issues in Neural Networks

Universal approximation.   More than 30 years ago, Minsky and Papert (1969) reported that simple two-layer networks (no hidden layers) cannot represent non-linearly separable functions (e.g., the XOR problem), but that a multilayer neural network could do the job. Many results on the capability of the multilayer perceptron have been reported. Some theoretical analyses of the network's capability as a universal approximator are listed below and are extensively discussed by Hornik et al. (1989).

Kolmogorov (1957) tried to answer the question of Hilbert’s 13th problem,

i.e., the multivariate function approximation by a superposition of the functions

of one variable. The superposition theory sets the upper limit of the number of 

hidden units to (2nþ   1) units, where   n   is the dimensionality of the multivariate

function to be approximated. However, the functional units in the network are

different for the different functions to be approximated, while one would like to find

an adaptive method to approximate the function from the given training data at

hand. Thus, Kolmogorov’s superposition theory says nothing about the capability of 

a multilayer network or which method is best.

More general views have been reported. Le Cun (1987) and Lapedes and Farber

(1988) demonstrated that monotone squashing functions can be used in the two

hidden layers to approximate the functions. Fourier series expansion of a function

was realized by a single-layer network by Gallant and White (1988) with cosine

functions in the units. Further related results using the sigmoidal or logistic units are

demonstrated by Hecht-Nielsen (1989). Hornik et al. (1989) presented a general approximation theory of one-hidden-layer networks using arbitrary squashing functions, provided that sufficiently many hidden units are available. However, how many hidden units are needed to attain a given degree of approximation is not considered (Hornik et al. 1989).

The number of hidden units obviously depends on the characteristics of the

training data set, i.e., the underlying function to be estimated. It is intuitive to say that the more complicated the function to be learned, the more hidden units are required. For the number of hidden units, Baum and Haussler (1989) limit the size of general networks (not necessarily feed-forward multilayer perceptrons) by relating it to the size of the training sample. The authors analytically demonstrated that if the size of the sample is N, and we want to correctly classify future observations with at least a fraction (1 − ε/2), then the size of the sample has a lower bound given by

N ≥ (W/ε) log(N/ε),

where W is the number of weights and N is the number of nodes in the network. This, however, does not apply to the interesting feed-forward neural networks, and the given bound is not useful for most applications.

There seems to be no rule of thumb for the number of hidden units (Ripley 

1993). Finding the size of the hidden units can usually be performed by cross-

validation or any other resampling methods. The usual starting value for the size is

suggested to be about the average of the number of the input and output nodes

(Zahner and Micheli-Tzanakou 2000). Failure in learning can be attributed to three

main reasons (Hornik et al. 1989):

.   Inadequate learning;
.   Inadequate number of hidden units; or
.   The presence of a stochastic rather than a deterministic relation between input and target in the training data, i.e., noisy training data.
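Both sizing rules above are easy to compute. The fixed-point solver is one way of handling the implicit bound N ≥ (W/ε) log(N/ε); note that the text does not fix the base of the logarithm, so the natural log used here is an assumption:

```python
import math

def heuristic_hidden_size(n_inputs, n_outputs):
    """Suggested starting point for the hidden-layer size:
    roughly the average of the input and output node counts."""
    return round((n_inputs + n_outputs) / 2)

def sample_size_bound(n_weights, eps, start=10.0, max_iter=100):
    """Solve N = (W/eps) * log(N/eps) by fixed-point iteration
    (natural log assumed) and round up to an integer sample size."""
    n = float(start)
    for _ in range(max_iter):
        n_next = (n_weights / eps) * math.log(n / eps)
        if abs(n_next - n) < 0.5:   # converged to within half a sample
            n = n_next
            break
        n = n_next
    return math.ceil(n)
```

For the 35-input, 10-output digit recognizer described earlier, the heuristic suggests about 22 hidden units, close to the 20 actually used.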

The VLSI implementation of ALOPEX 

Artificial neural networks have existed for many years, yet because of 

recent advances in technology they are again receiving much attention. Major

obstacles in ANN, such as a lack of effective learning algorithms, have been

overcome in recent years. Training algorithms have advanced considerably, and

now VLSI technology may provide the means for building superior networks. In

hardware, the networks have much greater speed, allowing for much larger

architectures.

The tremendous advancement in technology during the past decades, particularly 

in VLSI technology, renewed interest in ANNs. Hardware implementations of 

neural networks are motivated by a dramatic increase in speed over software models.

The emergence of VLSI technology has and will continue to lead neural network 

research in new directions. The past few years saw a considerable advancement in

VLSI. Chips are now smaller, faster, contain larger memories, and are becoming cheaper and more reliable to fabricate. Neural network architectures are varied, with over 50 different types explored in the literature; hardware realizations may be electronic, optical, or electro-optical in design. A major problem in hardware realizations is

often not a result of the network architecture, but the physical realities of the

hardware design. Optical computers, while they may eventually become commercially available, suffer far greater problems than do VLSI circuits. Thus, for the immediate and near future, neural network hardware designs will be dominated by VLSI.

Much debate exists as to whether digital or analog VLSI design is better suited for

neural network applications. In general, digital designs are easier to implement and

better understood. In digital designs, computational accuracy is only limited by the

chosen word length. Although analog VLSI circuits are less accurate, they are

smaller, faster, and consume less power than digital circuits (Mead and Ismail

1989). For these reasons, applications that do not require great computational

accuracy are dominated by analog designs.

Learning algorithms, especially backpropagation, require high precision and

accuracy in modifying the weights of the network, which led some to believe that analog circuits are not well suited for implementing learning algorithms

(Ramacher and Ruckert 1991). Analog circuits can achieve high precision at the cost

of increasing the circuit size. Analog circuits with high precision (8 bits) tend to

be equally large compared with their digital counterparts (Graf and Jackel 1989).

Thus, high-precision analog circuits lose their size advantage over digital circuits.

Analog circuits are of greater interest in applications requiring only moderate

precision.

Early studies demonstrate that analog circuits can realize learning algorithms,

provided that the algorithm is tolerant to hardware imperfections such as low

precision and inherent noise. Macq et al. present a fully analog implementation of a

Kohonen map, one type of neural network, with on-chip learning (Macq et al.

1993). Because analog circuits have been proven capable of the computational

accuracy necessary for weight modification, they should continue to be the choice of 

neural network research.

Analog circuits are far superior to digital circuits in size, speed, and power

consumption, and these areas constrain most neural network applications. To

achieve greater network performance, the size of the network must be increased.

The ability to implement larger, faster networks is the major motivation for

hardware implementation, and analog circuits are superior in these areas. Power

consumption is also a major concern as networks become larger (Andreou 1991).

As the number of transistors per chip increases, power consumption becomes a

major limitation. Analog circuits dissipate less power than digital circuits, thus

permitting larger implementations.

Besides its universality to a wide variety of optimization procedures, the nature of 

the ALOPEX algorithm makes it suitable for VLSI implementation. The ALOPEX 

is a biologically influenced optimization procedure that uses a single-value global

response feedback to guide weight movements toward their optimum. This single-

value feedback, as opposed to the extensive error propagation schemes of other

neural network training algorithms, makes ALOPEX suitable for fast VLSI

implementation.

Recently, a digital VLSI approach to implementing the ALOPEX algorithm was undertaken by Pandya et al. (1990). Results of their study indicated the feasibility of a single-instruction, multiple-data (SIMD) architecture. A simulation of the design was carried out in software, and good convergence for a 4 × 4 processor array was demonstrated.

The importance of VLSI to neural networks has been demonstrated. For neural

networks to achieve greater abilities, larger and faster networks must be built. In

addition to size and speed advantages, other factors, including cost and reliability,make VLSI implementations the current trend in neural network research. The

design of a fast analog optimization algorithm, ALOPEX, is discussed below.

Let us assume that we have an array of 64 pixels that we call   I i (n), where   n

represents the iteration. The additive gaussian white noise is denoted by   r i (n) and

R j (n) is the response (or cost function) of the   j th template at iteration  n.

Let us assume that there are 16 templates to choose from, each with 64 pixels.

The ALOPEX process is run on each of them with the objective being to recognize

(converge to) an input pattern. As a result of its iterative behavior, if allowed to run long enough, ALOPEX will eventually converge to each of the templates. However, a ''match'' can be found by choosing the template that took the least amount of time to converge.

By convergence we mean finding either the global maximum or the minimum of 

the response function. This response function can be calculated many different

ways, depending on the application. To allow this chip to be general enough to

handle many applications, the response is computed off the chip. A Programmable

Read Only Memory (PROM) is used to compute the response based on the error

between the input,   I i (n), and the template. The PROM enables the response

function to be changed to meet the needs of the application.

Although the chip design is limited to only 64 ALOPEX subunits, the parallel

nature of ALOPEX enables many chips to be wired together for larger applications.

Parallel implementations are made easy because each subunit receives a single global

response feedback, which governs its behavior. Backpropagation, on the other hand,

requires dense interconnections and communication between each node. This

flexibility is a tremendous advantage when it comes to hardwired implementations.

Originally, the ALOPEX chip was designed using digital VLSI techniques. Digital

circuitry was chosen over analog because it is easier to test and design. Floating

point arithmetic was used to ensure a high degree of accuracy. The digital design

consisted of shift registers, floating point adders, and floating point multipliers.

However, after having done much work toward the digital design it was abandoned

in favor of an analog design. The performance of the digital design was estimated

and was much slower than that of an analog design. The chip area of the digital

design was much larger than that of an analog design would be. Also, the ALOPEX 

algorithm would be tolerant of analog imperfections because of its stochastic nature.

For these reasons, it seemed clear that a larger, faster network could be designed

with analog circuitry.

The analog design required components similar to the digital design to

implement the algorithm. The main requirements were an adder, a multiplier, a

difference amplifier, a sample and hold mechanism, and a multiplexing scheme.

These cells each perform a specific function and are wired together in a way that

implements the ALOPEX process.

The chip is organized into 64 ALOPEX subunits, one for each pixel in the input image. They are stacked vertically, wiring by abutment. Each subunit is composed of the following cells: a group selector, demultiplexer, follower aggregator, multiplier, transconductance amplifier, multiplexor, and another group selector.

The gaussian white noise required for the ALOPEX process is added to the input

before it reaches the chip. This will allow precise control of the noise, which is very 

important in controlling the stability of the algorithm. If there is too much noise, the system will not converge. If there is too little noise, the system will get stuck in local

minima of the cost function. By controlling the noise during execution, using a

method similar to simulated annealing (Kirkpatrick et al. 1983) where the noise

decays with time, the convergence time can be improved (Dasey and Micheli-

Tzanakou 1989). By having direct control of the added noise, the component and

functional testing can be performed with no noise added, greatly simplifying the

testing.

The addition, multiplication, and subtraction required by the ALOPEX 

algorithm are performed by the follower aggregator, the Gilbert multiplier (Mead

and Ismail 1989), and the transconductance amplifier, respectively. The follower aggregation circuit computes the weighted average of its inputs. By weighting the

inputs equally, the circuit computes the average of the two inputs. The average was

chosen instead of the sum because the circuit is more robust, in that the output

never has to exceed the supply voltage. A straight summer is more difficult to design

because voltages greater than the supply voltage could be required. The output

of the follower aggregator is sent to the multiplier, where a C-switch acts as a

sample and hold, to store the value of the previous iteration. The difference

between these signals is one input to the multiplier. The previous two responses,

calculated off chip, are the other two inputs. The output of the multiplier

is controlled by the control signal γ. The error signal is computed by a

transconductance amplifier.

In designing the chip, much effort was made in making it controllable and

testable, while making the chip general enough that it could be used in a wide variety 

of applications. This is why the gaussian white noise is added off chip and also why 

the error signal is taken off chip for the computation of the response. This not only 

allows the response function to be changed to meet the requirements of the specific

application, but also provides the operator with accessible test points.

Although backpropagation is the most widely used software tool for training

neural networks, it is less suitable for VLSI hardware implementation than

ALOPEX for many reasons. Whereas backpropagation converges quickly as a

result of its gradient descent method, it can often get stuck in local extrema.

The ALOPEX tends to avoid local extrema by incorporating the random noise

component at the expense of slightly longer convergence times.

The major differences arise when hardware implementation is discussed.

Backpropagation is computationally taxing because of the error computation

required for each node in the network. Each error is a function of many parameters

(i.e., all the weights of the following layer). In hardware, very complex intercon-

nections between all nodes are required to compute this error.

The ALOPEX is ideal for VLSI implementation for a couple of reasons. First, the

algorithm is tolerant to small amounts of noise; in fact, noise is incorporated to help

convergence. Second, all parameters change based on their local history and asingle-value global response feedback. This single-value feedback is much simpler to


Neuromorphic Models

Since the era of the vacuum tube, a multitude of neuronal models composed of 

discrete components and off-the-shelf Integrated Circuits (ICs) have been

published. Similar efforts in custom VLSI, however, are far fewer in number. A

good introduction to a number of neuronal attributes was presented by Linares-Barranco et al. (1991). Complementary Metal Oxide Semiconductor (CMOS)-

compatible circuits for approximating a number of mathematical models of cell

behavior are described. In its simplest form, this model represents the cell

membrane potential in the axon hillock as nothing more than a linear combination

of an arbitrary number,   n,  of dendritic inputs,   X,  each of which is weighted by a

unique multiplier, W, summed, and processed by a non-linear range-limiting operator, f. The mathematical equation for this relationship is

Y_k = f( Σ_{i=1}^{n} W_i X_i ) = f{S_k},    (24)

This relationship is realized in a CMOS circuit. This circuit is totally static and

makes no provision for time courses of changes in input or output signals or intra-

cellular relationships. An operational transconductance amplifier (OTA), as

described by Wolpert and Micheli-Tzanakou (1986), is used in lieu of operational

amplifiers for this and most other VLSI neural network applications. Highly 

compatible with CMOS circuit technology, it is structurally simple and compact,

realizable with only nine transistors, and provides reasonable performance. The only 

consideration it warrants is that its transfer function is a transconductance. As such,

operations performed on its output signals must be oriented to its current, rather

than its voltage. When driving high load impedances, as is usually the case with

CMOS circuitry, this is only a minor inconvenience, necessitating buffering for

lower load impedances. In fact, under some circumstances, such as when algebraic

summation is being performed, a current output may actually be an advantage,

allowing output nodes to be simply tied together.

The non-linear range-limiting operator,   f , mentioned earlier, is necessitated by 

the observation that, for a given biological neuron, there are limits on the strength of 

the electrochemical gradients the cell’s ionic pumps can generate. This imposes

limits on how positive and negative cell membrane potential may go. Because a

neuron may receive inputs from many other neurons, there is no such limit on the

aggregate input voltage applied. As a result, an activation function, a non-linearity of 

the relationship between aggregate input potential and output potential of a neuron,

must be imposed. This is typically done in one of three ways: the binary hard-

limiter, which assumes one of only two possible states, active or inactive; the linear-

graded threshold, which assumes a linear continuum of active states between its

minimal and maximal values; and the sigmoid, which assumes a sigmoidal

distribution of values between its negative minimal and positive maximal output

values.

Which type of activation function is employed depends on the type of artificial

neuron and network in which it is implemented. In networks where cell outputs are

all-or-none, such as the McCulloch and Pitts models (McCulloch and Pitts 1943), the binary threshold model is used. In networks where neurons sum graded excitatory and inhibitory inputs, such as Hopfield networks (Hopfield and Tank 1985),

the linear threshold model is used. In networks where a synaptic connection must

be both excitatory and inhibitory, depending on the level of activity, the

sigmoid threshold is used. In either of the latter two activation functions, the

slope of the overall characteristic can be varied to suit the sensitivity of the cell in question.
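The three activation functions, applied to the weighted sum of Equation 24, can be sketched directly. The 0-to-1 output ranges and the slope convention are one common choice, not the transfer curves of any particular chip:

```python
import math

def hard_limiter(s, threshold=0.0):
    """Binary output: active or inactive (McCulloch-Pitts style)."""
    return 1.0 if s >= threshold else 0.0

def linear_graded(s, lo=0.0, hi=1.0):
    """Linear continuum of active states, clipped at min/max values."""
    return max(lo, min(hi, s))

def sigmoid(s, slope=1.0):
    """Smooth saturation; `slope` sets the sensitivity of the cell."""
    return 1.0 / (1.0 + math.exp(-slope * s))

def neuron_output(weights, inputs, f=sigmoid):
    """Equation 24: Y_k = f(sum_i W_i * X_i)."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return f(s)
```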

Another well-developed implementation of individual artificial nerve cells is that

by Wolpert and Micheli-Tzanakou (1986, 1995). Whereas most neuromorphic

models are based on the Hodgkin-Huxley equations, this one uses a sequencer to

synthesize the action potential in three distinct phases. It also employs a different

formulation for cell membrane and threshold potentials known as an integrate-

and-fire model, presented and implemented in discrete components by French and

Stein in 1970 (French and Stein 1970). It makes use of the aforementioned leaky 

integrator and provides off-chip control over the response and persistence of stimuli

assimilated into the membrane potential. The model affords similar controls over the resting level and time constant of the cell threshold potential and allows

refraction to be recreated. This organization also affords control over the shape,

resting level, and duration of the action potential and produces a Transistor-

Transistor Logic (TTL)-compatible pulse in parallel with the action potential.

These controls, all of which are continuously and precisely adjustable, make this

model ideal for replicating the behavior of a wide variety of individual nerve cells,

and it has been successfully applied as such.
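A minimal discrete-time sketch of the integrate-and-fire behavior just described: leaky integration toward rest, a threshold comparison, and a fixed refractory hold. All constants are arbitrary; the actual circuit is analog, with these quantities continuously adjustable off-chip:

```python
def integrate_and_fire(stimulus, leak=0.9, gain=1.0, threshold=1.0,
                       refractory=3):
    """Leaky integrate-and-fire membrane (French & Stein style sketch).

    The membrane potential leaks toward rest, accumulates weighted
    input, and emits a spike (1) when it crosses threshold, after
    which the cell is held refractory for a fixed number of steps.
    """
    v, spikes, hold = 0.0, [], 0
    for x in stimulus:
        if hold > 0:               # refractory: membrane clamped at rest
            hold -= 1
            spikes.append(0)
            v = 0.0
            continue
        v = leak * v + gain * x    # leaky integration of the stimulus
        if v >= threshold:
            spikes.append(1)       # action potential
            v = 0.0
            hold = refractory
        else:
            spikes.append(0)
    return spikes
```

Driving it with a constant sub-threshold input produces a regular spike train whose rate is set by the leak, gain, and refractory period, the same quantities the chip exposes through its bias nodes.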

The Wolpert and Micheli-Tzanakou model is organized around three critical

nodes, the somatic potential, the axonal potential, and the threshold potential. Each

of these nodes is biased off-chip with an R-C network so that its resting level and

time constant are independently and continuously controllable. Stimuli to the cell

are buffered and standardized by truncation into 10-msec impulses. Synaptic weight

inputs on the excitatory and inhibitory pathways allow this value to be increased or

decreased from off-chip. The impulses are then applied to a somatic potential by a

push-pull Metal-Oxide-Semiconductor Field-Effect Transistor (MOSFET) stage

and compared with threshold potential by an OTA acting as a conventional voltage

comparator. When threshold is exceeded, an action potential is synthesized and

outputted. This waveform is then binarized and buffered to form a binary-

compatible output pulse. At the same time, threshold is elevated to form the

refractory period. The circuit consists of approximately 130 transistors plus a few

on-chip and discrete resistors and capacitors, and was implemented in a conventional CMOS technology, requiring a single-ended DC supply of 4–10 V and occupying 0.6 mm² of chip area.

With its critical nodes bonded out off-chip, the rate of operation of Wolpert and Micheli-Tzanakou's neuromime may be accelerated several orders of magnitude beyond a biologically compatible time frame. This model was first implemented in 1986

(Wolpert and Micheli-Tzanakou 1986) and was intended as a flexible and accurate

esthetic, rather than a mathematical model of cell behavior. Since then, it has been

used to successfully recreate a number of networks from well-documented

biological sources. Waveforms obtained in these recreations have demonstrated a

striking similarity to intracellular recordings taken from their biological counterparts. It has also been successfully applied to problems in robotics.


 Neurological Process Modeling 

Lateral inhibition is the process in which a cell containing some level of information

encoded as its output level acts to inhibit and is inhibited by a similar adjoining cell.

For many years, this process has been observed with striking regularity in both 1-

and 2-D arrays of sensory receptors in a variety of systems and a variety of organisms. In numerous morphological, mathematical, and circuit studies, it has

been identified as a key image preprocessing step that optimizes a sensory image to

facilitate fast and accurate recognition in subsequent operations. Lateral inhibition

accomplishes this by amplifying differences, enhancing image contrast, lending

definition to its outward shape, and isolating the image from its background.

Whereas a digital computer would accomplish this process one pixel at a time,

biological systems manage it in a manner that is both immediate and simultaneous.
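The core operation, each cell subtracting a fraction of its neighbors' activity, fits in a few lines. This 1-D feed-forward version with an arbitrary inhibition strength is only a cartoon of the biological circuit, but it reproduces the Mach-band overshoot and undershoot at an edge:

```python
def lateral_inhibition(image, strength=0.5):
    """Feed-forward lateral inhibition on a 1-D receptor array.

    Each cell's output is its own activity minus a fraction of the
    average activity of its immediate neighbours, which amplifies
    differences at edges and enhances contrast.
    """
    n = len(image)
    out = []
    for i in range(n):
        left = image[i - 1] if i > 0 else image[i]    # clamp at borders
        right = image[i + 1] if i < n - 1 else image[i]
        out.append(image[i] - strength * 0.5 * (left + right))
    return out
```

Applied to a step edge, the dark side dips below its surround and the bright side rises above it, the undershoot/overshoot pair seen as Mach bands.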

Laterally inhibited behavior has been observed in pairs of cells implemented in

hardware and software models by many researchers, but in dedicated VLSI by only 

a few. Among them, Nabet and Pinter of Drexel University and Darling of the University of Washington have extensively studied the stability and effectiveness of 

both pairs and linear strings of mutually inhibiting cells in CMOS VLSI and

obtained results that correlate well with biological data (Nabet and Pinter 1991).

This line of work has been explored in two dimensions in another series of VLSI-

based models by Wolpert and Micheli-Tzanakou (1993). Arrays of mutually 

inhibiting cells that inhibit via continuously active connections and cells that inhibit

by dynamic or strobed controls both offered stable and variable control over the

degree of inhibition. Arrays of hexagonally interconnected cells were more stable

than the square array, which tended to ‘‘checkerboard’’ when significant levels of 

inhibition were attempted. Feedback inhibition, where one array is used to store both the initial and the inhibited images, was as effective, but less convenient to

access than feed-forward inhibition, where separate input and inhibited images are

maintained.

Characterization of lateral inhibition in the context of a more specific biological

model has been pursued in another noteworthy effort by Andreou et al. (1991).

Multiple facets of cell-cell interactions, including both mutual inhibition and

leakage of information between adjoining cells, were implemented in VLSI as a

model of early visual processing in the mammalian retina (Andreou et al. 1991).

There, adjacent cells on the photoreceptor layer intercommunicate through gap

junctions, where their cell membrane potentials couple through a resistive path.

Simultaneously, optical information from the photoreceptor cells is downloaded to

corresponding cells of the horizontal layer, which have mutually inhibitory 

connections. One-dimensional arrays, and, subsequently, 2-D models of these

relationships were implemented in analog VLSI and tested. Although little

numerical data were published from these arrays, the 2-D array produced a

number of optical effects associated with the human visual system, including Mach

bands, simultaneous contrast enhancement, and the Hermann-Hering illusion, all of 

which are indicative of the real-time image processing known to occur in the

mammalian retina.

Finally, the definitive VLSI implementation of a 2-D array is the well-known

silicon retina devised by Mead and described in his text, ''Analog VLSI and Neural Systems'' (Mead 1989). Mead presents a comprehensive treasury of analog VLSI techniques for building neural networks in VLSI. The book then goes on to discuss several applications of analog

ANNs, culminating in an auditory model of the cochlea and a visual model of the

retina.

Articles of further interest

Biopotential Amplifiers; Biosensors; Compression of Digital Biomedical Signals; Control of 

 Movement; Digital Biomedical Signal Acquisition and Processing: Basic Topics; Digital 

Biomedical Signal Processing: Advanced Topics; Excitable Tissue, Electrical Stimulation

Of; Modeling, Biomedical 

Declaration of interest:  The authors report no conflicts of interest. The authors

alone are responsible for the content and writing of the paper.

Note

[1] A comment on the terminology backpropagation is given in the next section. There,

backpropagation is interpreted as a method of finding the gradient values of a feed-

forward multilayer perceptron network rather than as a learning method. A

pseudo-steepest descent method is the learning mechanism used in the network.

References

Andreou AG, Boahen KA, Pouliquen PO, Pavasovic A, Jenkins RE, Strohbehn K. 1991. Current-mode subthreshold MOS circuits for analog VLSI neural systems. IEEE Trans. Neural Netw. 2(2):205–213.

Baum E, Haussler D. 1989. What size net gives valid generalization? Neural Comput. 1:151–160.

Bryson AE, Ho YC. 1969. Applied optimal control. New York: Blaisdell.

Carpenter GA, Grossberg S. 1987. ART 2: Self-organization of stable category recognition codes for analog input patterns. Appl. Optics 26:4919–4930.

Ciaccio E, Tzanakou E. 1990. The ALOPEX process: Application to real-time reduction of motion artifact. Annual International Conference of IEEE-EMBS, Philadelphia, 12(3):1417–1418.

Dasey TJ, Micheli-Tzanakou E. 1989. A pattern recognition application of the ALOPEX process with hexagonal arrays. International Joint Conference on Neural Networks, Washington, DC, II:119–125.

Engel J. 1988. Teaching feed-forward neural networks by simulated annealing. Complex Syst.

2:641–648.

French AS, Stein RB. 1970. A flexible neuronal analog using integrated circuits. IEEE Trans. Biomed.

Eng. 17:248–253.

Gallant AR, White H. 1988. There exists a neural network that does not make avoidable mistakes. IEEE Second International Conference on Neural Networks, San Diego, CA, I:657–664.

Graf HP, Jackel LD. 1989. Analog electronic neural network circuits. IEEE Circuits and Devices

Magazine July:44–55.

Harth E, Tzanakou E. 1974. ALOPEX: A stochastic method for determining visual receptive fields. Vis. Res. 14:1475–1482.

Hecht-Nielsen R. 1989. Theory of the backpropagation neural network. Proceedings of the International Joint Conference on Neural Networks, Washington, DC, I:593–605.


Hinton GE, Sejnowski TJ. 1986. Learning and relearning in Boltzmann machines. In: Rumelhart DE,

McClelland JL, the PDP Research Group, editors. Parallel distributed processing: Explorations in the

microstructure of cognition I: Foundations. Chap. 7. Cambridge, MA: MIT Press.

Hopfield JJ, Tank DW. 1985. Neural computation of decisions in optimization problems. Biol.

Cybernet. 52:141–152.

Hornik K, Stinchcombe M, White H. 1989. Multilayer feedforward networks are universal approximators. Neural Netw. 2:359–366.

Hush D, Horne B. 1993. Progress in supervised neural networks. IEEE Signal Processing Magazine January:8–39.

Kirkpatrick S, Gelatt Jr CD, Vecchi MP. 1983. Optimization by simulated annealing. Science

220(4598):671–680.

Kohonen T. 1990. The self-organizing map. Proc. IEEE 78(9):1464–1480.

Kolmogorov AN. 1957. On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. Doklady Akad. Nauk SSSR 114:953–956.

Lapedes A, Farber R. 1988. How neural networks work. Technical report. Los Alamos, NM: Los Alamos National Laboratory.

Le Cun Y. 1987. Modèles connexionnistes de l'apprentissage [Connectionist models of learning]. Ph.D. thesis, Université Pierre et Marie Curie, Paris, France.

Linares-Barranco B, Sanchez-Sinencio E, Rodriguez-Vazquez A, Huertas JL. 1991. A CMOS implementation of the FitzHugh-Nagumo neuron model. IEEE J. Solid-State Circuits 26:956–965.

Lippmann RP. 1987. An introduction to computing with neural nets. IEEE ASSP Magazine April:4–22.

Macq D, Verleysen M, Jespers P, Legat J. 1993. Analog implementation of a Kohonen map with on-chip learning. IEEE Trans. Neural Netw. 4(3):456–461.

McCulloch WS, Pitts W. 1943. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5:115–133.

Mead CA. 1989. Analog VLSI and neural systems. Boston: Addison-Wesley.

Mead C, Ismail M, editors. 1989. Analog VLSI Implementation of Neural Systems. Boston: Kluwer

Academic.

Micheli-Tzanakou E. 1984. Non-linear characteristics in the frog’s visual system. Biol. Cybernet.

51:53–63.

Micheli-Tzanakou E. 1995. Neural networks in biomedical signal processing. In: Bronzino J, editor. The biomedical engineering handbook. 2nd ed. Chap. 60. Boca Raton, FL: CRC Press. pp 917–931.

Micheli-Tzanakou E, Uyeda E, Sharma A, Ramanujan KS, Dong J. 1995. Face recognition: Comparison of neural network algorithms. Simulation 64:37–51.

Minsky M, Papert S. 1969. Perceptrons: An introduction to computational geometry. Cambridge, MA:

MIT Press.

Nabet B, Pinter RB. 1991. Sensory neural networks: Lateral inhibition. Boca Raton, FL: CRC Press.

Pandya AS, Shandar R, Freytag L. 1990. An SIMD architecture for the ALOPEX neural network. In: Parallel Architectures for Image Processing, SPIE, Portland, Oregon, 1246:275–287.

Parker DB. 1985. Learning logic. Technical report, Center for Computational Research in Economics and Management Science, MIT, Cambridge, MA.

Peressini AL, Sullivan FE, Uhl Jr JJ. 1988. The mathematics of nonlinear programming. New York: Springer-Verlag.

Ramacher U, Ruckert U, editors. 1991. VLSI Design of Neural Networks. Boston: Kluwer Academic.

Ripley BD. 1993. Statistical aspects of neural networks. In: Barndorff-Nielsen OE, Cox DR, Jensen JL, Kendall WS, editors. Chaos and networks: Statistical and probabilistic aspects. London: Chapman & Hall.

Rosenblatt F. 1959. Principles of neurodynamics. New York: Spartan Books.

Rumelhart DE, Hinton GE, Williams RJ. 1986a. Learning internal representations by error propagation. In: Rumelhart DE, McClelland JL, the PDP Research Group, editors. Parallel distributed processing: Explorations in the microstructure of cognition, I: Foundations. Chap. 8. Cambridge, MA: MIT Press.

Rumelhart DE, Hinton GE, Williams RJ. 1986b. Learning representations by back-propagating errors. Nature 323:533–536.

Tzanakou E, Harth E. 1973. Determination of visual receptive fields by stochastic methods. Biophys. J. 15:42a.


Tzanakou E, Michalak R, Harth E. 1979. The ALOPEX process: Visual receptive fields by response feedback. Biol. Cybernet. 35:161–174.

Werbos PJ. 1974. Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. thesis, Harvard University, Cambridge, MA.

Widrow B. 1962. Generalization and information storage in networks of Adaline ‘‘neurons.’’ In: Yovitz M, Jacobi G, Goldstein G, editors. Self-organizing systems. Washington, DC: Spartan Books. pp 435–461.

Widrow B, Lehr M. 1990. 30 years of adaptive neural networks: Perceptron, Madaline, and backpropagation. Proc. IEEE 78(9):1415–1442.

Wolpert S, Micheli-Tzanakou E. 1986. An integrated circuit realization of a neuronal model. Proceedings of the IEEE Northeast Bioengineering Conference, New Haven, CT, March, pp 13–14.

Wolpert S, Micheli-Tzanakou E. 1993. Silicon models of lateral inhibition. IEEE Trans. Neural Netw.

4(6):955–961.

Wolpert S, Micheli-Tzanakou E. 1995. A neuromime in VLSI. IEEE Trans. Neural Netw.

6(6):1560–1561.

Zahner D, Micheli-Tzanakou E. 2000. Alopex and back-propagation. In: Micheli-Tzanakou E, editor. Supervised and unsupervised pattern recognition: Feature extraction and computational intelligence. Chap. 2. Boca Raton, FL: CRC Press. pp 61–78.

