Dynamically Reconfigurable Neuron Architecture for the Implementation of Self-
Organizing Learning Array
Janusz A. Starzyk, Yongtao Guo, and Zhineng Zhu
School of Electrical Engineering & Computer Science
Ohio University, Athens, OH 45701
{starzyk, gyt, zhineng}@bobcat.ent.ohiou.edu
Abstract: In this paper, we describe a new dynamically reconfigurable neuron hardware architecture based on a modified Xilinx PicoBlaze microcontroller and the self-organizing learning array (SOLAR) algorithm reported earlier. The architecture aims to use hundreds of traditional reconfigurable field programmable gate arrays (FPGAs) to build the SOLAR learning machine. SOLAR has many advantages over traditional neural network hardware implementations. Neurons are optimized for area and speed, and the whole system is dynamically self-reconfigurable at runtime. The system architecture is expandable to a large multiple-chip system.
1. Introduction
The immense computing power of the brain is believed to be the result of the parallel and distributed computing performed by approximately $10^{11}$ neurons, each with an average of $10^3$-$10^4$ connections. To prototype networks that mimic biological neurons in hardware, FPGA technology can be used, with its flexible and programmable interconnect and computing resources.
During the last decade, a number of attempts have been made to apply reconfigurable hardware to neural networks [1,2]. However, these implementations are
focused on densely interconnected, weight dominated
architectures which are not easily expandable to multiple
FPGAs with identical cell architecture. Due to the
additional hardware cost of adjusting interconnection
weights of traditional neural networks, often their
connections and functionalities are fixed and predefined by
off-line simulation. The gradient method is difficult to implement in hardware, and therefore it is hard to expand such networks to a large-scale system. One exception is FPGA-based arrays of identical cells implementing evolutionary computing and genetic algorithms [3,4,5]. New FPGA structures adopting pulse density [6], pulse position, and time-difference simultaneous perturbation have been presented to avoid hardware gradient evaluation in NNs. Also,
recently, cellular neural networks used in image
processing [7] and their hardware implementation have
attracted a lot of attention.
In this paper, we present a dynamically reconfigurable
(i.e. during runtime), data-based and expandable neuron
architecture to implement a self-organizing learning array
(SOLAR). The neuron’s functionality is represented in the
form of flexible connections and organization of an array
of evolvable signal processing blocks. Each neuron can
implement various arithmetic and logic functions. The
neurons can self-reconfigure and use local interconnect for
maximum performance.
The rest of the paper is organized as follows. Section 2 reviews our previous work on SOLAR. Section 3 deals with the architecture of the dynamically reconfigurable neuron. Section 4 describes the routing and addressing scheme. Section 5 reports the simulation and experimental results. Section 6 presents the 3D SOLAR expansion. Finally, Section 7 concludes this paper.
2. Previous Work
Self-organization is important in artificial neural
networks (ANNs) and machine learning. Previously, a
self-organizing learning algorithm that combines neural
networks and information theory was presented in [8].
This entropy-based neural network learning algorithm was
simulated on standard benchmarks and proved to be
advantageous over many existing neural networks and
machine learning algorithms in a wide range of technical
applications [9], in particular while dealing with noisy or
incomplete data. Based on this algorithm, we proposed the SOLAR system, which differs from classical ANNs in the way it is organized and how it learns. SOLAR self-
organizes its hardware resources to perform classification
and recognition tasks. Similar to the structure of cellular
neural networks, SOLAR has a fixed array of elemental
processing units acting as single neurons, and
programmable interconnections between them. Initially,
SOLAR neurons are randomly connected to previously
generated neurons. They learn adaptively using both
primary inputs and inputs from other neurons. Controlled
by signals from other neurons, they perform basic
transformations of their input signals. A neuron's parameters and connections are reconfigured as a result of training; effectively, SOLAR's structure self-organizes, establishing its final wiring.
Earlier work mainly focused on SOLAR algorithm
simulation, and its hardware implementation was limited
to a single FPGA chip with the cooperation from software
[10]. This hardware implementation explored the
adaptability of SOLAR to FPGA implementation,
although the cellular neuron architecture was not fully explored, since a shared block memory was used for all neurons. In the SOLAR simulation, we adopted a feed-forward network structure for its stability and fast learning. The SOLAR architecture is divided into three main layers, as shown in Fig. 1: input, processing, and output layers.

Fig. 1 Basic SOLAR structure
The interconnections are randomly initialized at the
beginning of the network training. The basic building
blocks of the architecture are the small neurons, trained
using the entropy-based algorithm. The simplified learning
algorithm is introduced as follows:
Let $S = \{(\nu, c)\}$ be a training set of $N$ vectors, where $\nu \in \mathbb{R}^n$ is a feature vector and $c \in Z$ is its class label from an index set $Z$. A classifier is a mapping $C: \mathbb{R}^n \rightarrow Z$, which assigns a class label in $Z$ to each vector in $\mathbb{R}^n$. A training pair $(\nu, c) \in S$ is misclassified if $C(\nu) \neq c$. The performance measure of the classifier is the probability of error, i.e. the fraction of the training set that it misclassifies. Our goal is to minimize this misclassification probability, and it is achieved through a simple threshold search based on a maximum information index calculated from estimated subspace probability and local class probabilities. The set of training data is searched sequentially class by class to establish the optimum thresholds, which best separate the signals of various training classes. A partition quality is measured by the entropy-based information index defined by:

$$I = 1 - \frac{\Delta E}{E_{\max}}$$

where

$$\Delta E = -\sum_{s}\sum_{c} P_{sc}\log(P_{sc}) + \sum_{s} P_{s}\log(P_{s})$$

and

$$E_{\max} = -\sum_{c} P_{c}\log(P_{c}),$$

where $P_c$, $P_s$, $P_{sc}$ represent the probabilities of each class, attribute probability, and joint probability,
respectively. The summation is performed with respect to
all classes and subspaces. The information index should
be maximized to provide an optimum separation of the
input training data. The entropy-based information index is also used to evolve the neural interconnect structure and to select the neuron's transformation function.
Neuron learning is realized by maximizing the information index. If a neuron reaches a specified information index threshold, it becomes a voting neuron. Voting neurons' outputs are weighted together to solve a given problem.
3. Neuron Architecture
The hardware architecture presented in this paper is based on identical neuron modules. The Constant (K) Coded Programmable State Machine (KCPSM) [11], an 8-bit microcontroller developed by Xilinx, has been modified and embedded into the neuron module. A block-level diagram of the single-neuron architecture is shown in Fig. 2. Each neuron module contains circuits that can be reprogrammed dynamically and that execute its program without affecting other neurons' operations. Assembly code implements the neuron's functions, and its operation is controlled by locally stored parameter values.

Fig. 2 Single neuron's schematic (KCPSM-based neural controller with dual-port program memory, a 30-input MUX, and instruction<15:0>, in_port<7:0>, out_port<7:0>, port_id<7:0>, addr_bus<7:0>, and data_bus<15:0> signals)
After translation from assembly to binary code, the binary code is written into the dual-port memory by calling Peripheral Component Interconnect (PCI) functions. Dynamic reprogramming is implemented with a dual-port 256x16-bit memory: the KCPSM reads the current program on one port while the other port can be used to store the new program. The two ports of the dual-port Random Access Memory (RAM) operate independently, and programming takes place over a bus shared among all neurons. Therefore, the self-reconfiguration process affects only the current neuron. The configuration time and contents can be controlled by software outside the chip or by configuration data located in the system's distributed memory cells. The neuron inputs are obtained either from the primary inputs or from other neurons' outputs via a 30-to-1 multiplexer (MUX). The selection signals are determined by the content of the programming dual-port memory via the execution of the programming commands for the particular neuron.
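A minimal behavioral sketch of this mechanism (ours, in Python rather than VHDL) models the two independent ports: the controller keeps fetching instructions from one port while the host stores a new program through the other, so only this neuron is affected by the update:

```python
class DualPortProgramRAM:
    """Behavioral model of the 256x16-bit dual-port program memory:
    port A is read by the neural controller every cycle, while port B
    is written over the shared programming bus (e.g. by the PCI host)."""
    def __init__(self):
        self.mem = [0] * 256                    # 256 words of 16 bits

    def read_port_a(self, pc):
        return self.mem[pc & 0xFF]              # instruction fetch

    def write_port_b(self, addr, word):
        self.mem[addr & 0xFF] = word & 0xFFFF   # program update

ram = DualPortProgramRAM()
ram.write_port_b(0x00, 0x1234)                  # hypothetical opcode word
assert ram.read_port_a(0x00) == 0x1234          # visible on the read port
```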
The single-neuron architecture is expanded to multiple neurons in a single FPGA chip. In this work, we use a Nallatech board with a Xilinx Virtex XCV800 FPGA [12]. It can contain an array of up to 28 neurons organized as shown in Fig. 3 (the 480 Virtex XCV1000 FPGAs we are using to build the 3D learning machine will contain 64 neurons on each chip).

Fig. 3 Array neurons' organization
These neurons are fully interconnected via the connection bus and a 30-to-1 MUX, thus forming a local cluster of neurons. This full interconnection mimics the dense connection scheme among neighboring neurons. A 3D expansion of these chips provides the sparse connections between remote neurons, and these inter-chip connections scale linearly with the number of neurons added. Neuron connections are decided by each neuron based on its learning results. The programming contents can be dynamically updated via the configuration bus or set locally by a neuron. The configuration bus used to configure every single neuron is divided into 16-bit data, 8-bit address, and 5-bit neuron-selection buses. To demonstrate the functionality of the 28-neuron cluster, we integrated a PCI interface controller to transfer data and configuration to the neurons via the PCI bus.
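As an illustration of this bus split, the sketch below packs one configuration transaction from the three fields; the bit ordering is our assumption, since the paper specifies only the field widths:

```python
def pack_config_word(neuron_sel, addr, data):
    """Pack one configuration-bus transaction: a 5-bit neuron-selection
    field, an 8-bit program address, and a 16-bit data word (29 bits)."""
    assert 0 <= neuron_sel < 2**5 and 0 <= addr < 2**8 and 0 <= data < 2**16
    return (neuron_sel << 24) | (addr << 16) | data

# e.g. write instruction word 0x1234 to address 0x10 of neuron 5:
word = pack_config_word(neuron_sel=5, addr=0x10, data=0x1234)
```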
4. Routing Scheme
A traditional neural network system may consist of many sub-networks, which take up a lot of resources in a hardware implementation. If the resources used to implement a single sub-network can be reused to implement multiple sub-networks sequentially, then many resources will be saved. This, of course, comes at the cost of additional processing time for sequential operations of the neural network blocks; however, this is not a problem, since a neural network response can be obtained in a fraction of the time required by a practical application. For instance, in the NTSC video standard, one full frame is displayed about every 1/30 of a second; in the PAL video standard, images are displayed at 25 frames per second, leaving more than 40 ms for recognition of each frame. A practical time limit for recognition may be even larger.
Since different networks may have different connections, a new dynamically reconfigurable routing and addressing scheme is proposed. This routing and addressing scheme is suitable for general artificial neural networks, whether their connections are local or global. Fig. 4 gives an example where 4 sub-networks are replaced by a single sub-network based on the proposed routing and addressing scheme.
Fig. 4 One network replacing multiple networks
Fig. 5 gives an example of a single sub-network with
more detailed interconnect structure.
Fig. 5 Structure of a single network
It consists of an array of 4x4 neurons. The column order of the neurons in this single network is not natural: the original order of 1, 2, 3 and 4 is replaced by 1, 4, 2 and 3. This kind of routing scheme takes advantage of the reconfigurable resources in the FPGA, which can easily be configured into a long shift register.
Scheme of the routing channel
The neural network structure shown in Fig. 5 has two basic elements: the routing channel and the neuron cell. The routing channel is implemented by RAM in the Configurable Logic Blocks (CLBs) of the FPGA chip. These RAM storage cells are configured into a closed-loop shift register, shifting data among neurons. This shift register creates a routing channel carrying the neurons' input data and their operation results. Each sub-network has its unique connections, and information about these connections is stored in individual neurons. The input data for a specific neuron in a sub-network appears on the neuron's input port after a fixed number of clock cycles. The task of addressing these data is therefore equivalent to counting the clock cycles and comparing the count against the pre-stored addresses (pre-stored numbers of clock cycles). When the clock-cycle count matches one of its pre-stored values, the neuron reads data from the routing channel. So, each neuron requires some memory to store the addresses of the neurons which connect to it, as illustrated in the sketch below.
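A minimal sketch of this addressing idea (our illustration): the neuron watches the routing channel, counts clock cycles, and latches a value whenever the count matches one of its pre-stored addresses:

```python
def latch_inputs(channel_stream, stored_addresses):
    """channel_stream yields the word on the routing channel at each
    clock cycle; stored_addresses holds the cycle counts at which this
    neuron's inputs pass by. Returns the latched input values."""
    inputs = []
    for cycle, word in enumerate(channel_stream):
        if cycle in stored_addresses:     # address match: read the channel
            inputs.append(word)
    return inputs

# a neuron whose two inputs arrive at cycles 3 and 7:
assert latch_inputs(range(10), stored_addresses={3, 7}) == [3, 7]
```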
For example, consider an original network system that consists of 2 sub-networks with a size of 4 by 5 neurons each (Fig. 6). Assume that this network system is implemented by a single sub-network, and consider the neuron in row 3 and column 4 for analysis. Assume that this neuron has 4 inputs in the first sub-network of the original network system, and 5 inputs in the second sub-network. When these 2 sub-networks are replaced by a single sub-network, this neuron will have two groups of addresses: the first group of 4 addresses storing the numbers of clock cycles corresponding to the 4 inputs in the first sub-network, and the other group of 5 addresses storing the numbers of clock cycles corresponding to the 5 inputs in the second sub-network. Output data of each neuron is handled in a similar way.
Fig. 6 Hardware reuse to implement 2 sub-networks.
In order to transfer data from one sub-network to the other, the same routing channel is used. This dynamic reconfigurability and data transfer between sub-networks can be explained as follows.
The first column of the routing channel is connected to the system's input, so the system data is applied directly to the routing channel. For the network in Fig. 5, after the first row finishes processing its input data, the routing channel corresponding to the first column of neurons stores the results of those neurons' operations. These results are shifted into the routing channel, mixed with the input signal values already in the channel, and connected to the neurons in the second column, then to the third and fourth columns. After the neurons in the fourth column have processed their input data, the routing channel connected to the fourth column stores the outputs of the first sub-network. By now, the first sub-network of the original network system has finished its operations, but at this time it is not prepared to receive the next set of input data, because this single network will change its connections to implement the original second sub-network. So, the outputs of the first sub-network, which are stored in the fourth column of the routing channel, are shifted back to the routing channel in the first column, which now represents the first column of the second sub-network of the original network system. The data is processed in the same way as described above. If the original network system has four sub-networks, then the data is iterated four times before this single sub-network receives the next set of input data (a feedback network implemented in this iterative scheme would require more than 4 iterations). Changing connections for different networks is implemented by comparing clock-cycle counts against different groups of pre-stored addresses, as sketched below.
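In code, this amounts to keeping one address group per sub-network and selecting the active group by the current pass through the shared hardware (our sketch; the address values are invented for illustration):

```python
# Stored-address groups for the row-3, column-4 neuron of the example:
address_groups = [
    {12, 20, 33, 41},       # 4 input addresses for the first sub-network
    {7, 15, 22, 38, 44},    # 5 input addresses for the second sub-network
]

def active_addresses(pass_index):
    """Changing connections between sub-networks reduces to comparing
    the cycle counter against a different group of stored addresses."""
    return address_groups[pass_index % len(address_groups)]
```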
Structure of routing channel
In SOLAR's hardware implementation, the neuron function is implemented by the PicoBlaze [11]. The PicoBlaze is a micro-core with 8-bit parallel input and output. A diagram of the routing channel connected to the input port of a PicoBlaze is given in Fig. 7.
Fig. 7 Routing channel with a microcontroller
The RAM in each CLB is configured into a shift register, with a length configurable from 1 to 16; here we choose 8. Eight such shift registers connect to the corresponding lines of the PicoBlaze's input bus to provide the input data in parallel.
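A behavioral model of this arrangement (our Python sketch, not the VHDL) treats each bus line as an 8-deep shift register whose output taps together form the parallel word seen by the PicoBlaze:

```python
from collections import deque

class RoutingChannelSlice:
    """Eight 8-deep shift registers, one per bit line; their output
    taps together form the 8-bit word on the PicoBlaze input port."""
    def __init__(self, depth=8):
        self.regs = [deque([0] * depth, maxlen=depth) for _ in range(8)]

    def shift_in(self, word):
        for bit in range(8):              # one clock: shift every bit line
            self.regs[bit].appendleft((word >> bit) & 1)

    def in_port(self):
        return sum(self.regs[bit][-1] << bit for bit in range(8))

ch = RoutingChannelSlice()
for w in [0xA5] + [0] * 7:
    ch.shift_in(w)
assert ch.in_port() == 0xA5               # the word emerges 8 clocks later
```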
In the routing scheme presented in this paper, neurons are directly connected to the local routing channels and do not need direct physical connections among themselves, whether local or global. The connections among the neurons are set up by stored addresses. For a network with random connections, the stored addresses can be generated on-line randomly. For other types of learning networks with fixed connections, the stored addresses are based on the specified network's interconnect structure. The main advantage of this kind of routing and addressing scheme is that it makes on-line learning and reconfiguration easy to implement. For on-line network reconfiguration, we do not need to change the physical connections between network components; we only need to change the pre-stored communication channel addresses according to the learning rules.
At present, the implemented neural network consists of 28 fully interconnected neurons on a single chip. In a neural network system composed of many chips, some neurons in one chip need to be connected to neurons in other chips. In order to transfer data from one chip to the others, the same routing-channel concept as shown in Fig. 5 is used. This routing channel is laid around the perimeter of each chip. Channels in different chips can be connected serially to form a long chain of shift registers. Neurons in one chip connect to this routing channel and output their operation results; these results are shifted along the channel and picked up by neurons in other chips. This way, remote connections between neurons in different chips are set up. If neurons are connected to neurons in the same neighboring row or column of a sub-network, then those neurons can be directly connected without passing through the routing channel (as illustrated in Fig. 3). One advantage of this scheme is that, with the introduction of the routing channel, the neurons can be connected in 3 dimensions: the neurons output their results to the routing channel, and those results can then be shifted to the routing channels in different layers of the neural system.
5. Simulation and Results
In this section, results from prototyping a simple SOLAR architecture on a single Virtex FPGA chip are discussed. The neurons' self-organizing learning process can be configured during chip programming or dynamically updated while running. To demonstrate this process, a simple example is given to illustrate the implementation steps of the SOLAR algorithm; the implementation of the whole SOLAR array is organized in a similar way. In this simulation example, six of the twenty-eight neurons are configured as shown in Fig. 8.
The initial connections are shown as solid lines. Every neuron simply adds its two inputs: for instance, neuron 1 adds inputs 1 and 2, neuron 3 adds input 2 and the output of neuron 1, neuron 5 adds the outputs of neurons 2 and 3, etc. Later, we dynamically reconfigure the connections of neurons 3 and 5, so that neuron 3 takes its inputs from the outputs of neurons 1 and 2, and neuron 5 from the outputs of neurons 3 and 4, as shown by the dotted lines. The results read out from the chip via the PCI bus are shown in the Matlab console. In the inset Matlab command console in Fig. 8, the "initial" values show the primary input values (6 and 2) and the neuron outputs for the two rows of neurons, while the "updated" values show the inputs and neuron outputs after the dynamic reconfiguration step. We developed Matlab DLLs to implement the I/O functions, including read and write. The results from the real experiment correspond to the VHDL simulation results shown in Fig. 9. In this plot, "Enable_bus", representing the neuron-selection signal, selects a particular neuron to be configured. Once the configuration process for all neurons is over, the outputs from the neurons are stable and ready to be read out. This way we can update any neuron's configuration information without affecting the other neurons. In this example, we only update the connections of neurons 3 and 4, represented by "Enable_bus" contents 4 and 5. The simulation results after updating correspond to the real experiment read back from the hardware as the "updated" values in Fig. 8.
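For reference, the behavior of this example can be reproduced in software. The sketch below assumes every neuron outputs the sum of its two inputs, with primary inputs 6 and 2 as in Fig. 8; the connections of neurons 2, 4, and 6, which the text leaves unstated, are our assumptions:

```python
def evaluate(conn, primary):
    """Evaluate a feed-forward net where each neuron sums two sources
    (a source is a primary input name or an earlier neuron's id)."""
    out = {}
    for n in sorted(conn):                       # feed-forward order 1..6
        a, b = (out.get(s, primary.get(s)) for s in conn[n])
        out[n] = a + b
    return out

primary = {"in1": 6, "in2": 2}
initial = {1: ("in1", "in2"), 2: ("in1", "in2"),   # neuron 2 assumed
           3: ("in2", 1), 4: (1, 2),              # neuron 4 assumed
           5: (2, 3), 6: (3, 4)}                  # neuron 6 assumed
updated = {**initial, 3: (1, 2), 5: (3, 4)}       # dynamic reconfiguration
print(evaluate(initial, primary))
print(evaluate(updated, primary))
```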
Fig. 9 VHDL simulation for partial configuration

Fig. 8 Experimental network and result (neurons 1, 3, 5 in the top row and 2, 4, 6 in the bottom row; the inset Matlab console lists the "initial" and "updated" neuron outputs)
The design has been described in VHDL and synthesized using Xilinx XST, with the Xilinx Alliance tools used for place and route. The implementation results are summarized in Fig. 10.
They show that this neural network architecture realizes a maximum parallel instruction throughput of 23.16 x 28 MIPS (about 648 MIPS) with 28 fully connected neurons. The neuron count is limited to 28, since there are only 28 BRAM modules on the chip, although other resources remain unused. If we further lower the neuron's program memory size, we can put more neurons on a single chip to overcome the BRAM bottleneck. Due to the parallel processing in hardware and the distributed network memory, the speed improvement of the FPGA implementation over software simulation is significant.
The mapping result is shown in Fig. 11. We can see how the neurons are distributed inside the chip after the mapping process. Every single neuron occupies a compact and concentrated logic area. The compactness follows from the structure-oriented hardware design: for instance, we adopt LUTs and a dedicated multiplexer module, in addition to the optimized PicoBlaze structure, to build every single neuron.
6. 3D SOLAR
The proposed neuron structure can be expanded to hundreds of FPGA chips to form a large 3D architecture. This architecture is based on many printed circuit boards (PCBs), with 4 interconnected Virtex XCV1000 FPGA chips on each PCB. Four PCBs are interconnected vertically through connectors on the upper and lower sides of each board to build the rack SOLAR architecture shown on the left side of Fig. 12. The rack SOLAR is expanded to a rack array, as shown on the right side of Fig. 12, through the flat connectors on the left and right sides of each board. In this rack array, one PCB in a specific rack is chosen as the first board, from which all signals are broadcast to the other boards in the whole rack array: identical configuration bits are sent to every PCB, and the FPGA chips are programmed through the JTAG port in a daisy chain or through Slave Serial mode in parallel. These broadcast signals contain the configuration signals and data signals, and they are buffered to assure signal integrity. The chips can communicate with each other through the data bus via various connectors. All of the chips have the identical neuron distribution shown in Fig. 3, and every neuron can be dynamically updated without affecting other neurons. Finally, a 3D SOLAR system containing 5x5 SOLAR racks, with close to 400 million gates, will be built as shown in Fig. 12. The 3D SOLAR learning machine will have a tremendous self-organizing learning ability based on numerous computing cells (neurons) working in parallel to process the incoming data.
Fig. 10 Implementation reports
Fig. 11 Mapping result
Fig. 12 RACK and CUBE SOLAR
7. Conclusions
This paper has described the architecture and chip implementation of an array of neurons aimed at implementing the SOLAR architecture. The self-organizing learning is based on a new machine-learning algorithm that combines knowledge from NNs and information theory. SOLAR represents a new idea in the hardware design of neural networks. It is a modular and expandable system, and it defines a new breed of dynamically reconfigurable architectures that can reconfigure themselves based on information contained in the input data. The presented architecture is a novel dynamically reconfigurable (via dual-port memory) neural network implementation that uses a simple general-purpose processor (KCPSM) architecture. Firstly, it has a regular, expandable, parallel architecture; therefore, its speed and learning abilities can be greatly improved compared to software simulation. Secondly, it has a data-driven self-organizing learning structure based on a new self-organizing learning algorithm. Furthermore, design flexibility is attained by exploiting the features of the self-reconfigurable neuron units. Finally, hardware reconfigurability is achieved in this self-organizing learning array by employing reconfigurable routing modules. According to the implementation results, this neuron architecture realizes a maximum parallel instruction throughput of 648 MIPS with 28 fully connected neurons, and system performance increases as more neurons are connected. The PicoBlaze for Virtex-II series FPGAs reaches performance levels of up to 55 MIPS; with up to 336 PicoBlaze nodes on a chip, SOLAR would reach a performance of about 18.5 GIPS on a single chip, so its parallel processing ability can be further improved. With the popular PowerPC on Virtex-II Pro FPGAs used as the main CPU, the PicoBlaze neurons can be used as slave peripherals to further improve the throughput for a system-on-a-chip implementation of neural networks. Hence it can be of practical use for embedded hardware applications in signal processing, wireless communications, multimedia systems, data networks, and so forth.
In our design, the number of neurons per chip is limited by the number of BRAMs, so the total logic utilization is comparatively low. The remaining LUT RAM will be used for the communication between neurons on other chips and to reuse the same PicoBlaze for several neurons, in order to fully utilize the hardware resources and accommodate more neurons in a single FPGA chip. In this respect, our design is device-dependent, since we need to carefully arrange the additional resources for different FPGAs, and each device type has to be individually optimized. In scaling our design to a 3D system, this will be necessary work in order to accommodate more neurons in a single selected FPGA chip.
References
[1] M. Misra, "Parallel Environment for Implementing Neural Networks," Neural Computing Surveys, Vol. 1, pp. 48-60, 1997.
[2] M. Glesner and W. Pochmuller, An Overview of Neural Networks in VLSI, Chapman & Hall, London, 1994.
[3] G. Tempesti, D. Mange, A. Stauffer, and C. Teuscher, "The BioWall: an Electronic Tissue for Prototyping Bio-Inspired Systems," Proc. NASA/DoD Conf. on Evolvable Hardware, Los Alamitos, Calif., pp. 221-230, 2002.
[4] L. Sekanina, "Towards Evolvable IP Cores for FPGAs," Proc. NASA/DoD Conf. on Evolvable Hardware, Los Alamitos, US, pp. 145-154, 2003.
[5] H. de Garis and M. Korkin, "The CAM-Brain Machine for Real Time Robot Control," Neurocomputing, Elsevier, Vol. 42, Issues 1-4, Feb. 2002.
[6] Y. Maeda and T. Tada, "FPGA Implementation of a Pulse Density Neural Network with Learning Ability Using Simultaneous Perturbation," IEEE Transactions on Neural Networks, Vol. 14, Issue 3, May 2003.
[7] T. Roska et al., "The Computational Infrastructure of Analogic CNN Computing - Part I: The CNN-UM Prototyping System," IEEE Trans. Circuits and Systems I, Vol. 46, pp. 261-268, 1999.
[8] J. A. Starzyk and T.-H. Liu, "Design of a Self-Organizing Learning Array System," IEEE Int. Symposium on Circuits and Systems (ISCAS), Bangkok, Thailand, May 2003.
[9] J. A. Starzyk and Z. Zhu, "Software Simulation of a Self-Organizing Learning Array System," Proc. 6th IASTED Int. Conf. on Artificial Intelligence & Soft Computing (ASC 2002), Canada, 2002.
[10] J. A. Starzyk and Y. Guo, "Dynamically Self-Reconfigurable Machine Learning Structure for FPGA Implementation," Proc. Int. Conf. on Engineering of Reconfigurable Systems and Algorithms (ERSA), Las Vegas, Nevada, USA, June 2003.
[11] K. Chapman, "8-Bit Microcontroller for Virtex Devices," Xilinx XAPP213, online: http://www.xilinx.com/xapp, Oct. 2000.
[12] Nallatech Ltd., "Ballynuey 2 VIRTEX PCI Card User Guide," 1993-1999.