
A Case Study for the Implementation of a Stochastic Bit Stream Neuron; The Choice Between Electrical and Optical Interconnects

M.A.Hands, W.Peiffer, H.Thienpont, A.Kirk, Lab for Photonic Computing and Perception, Dept of Applied Physics, TW-TONA, Vrije Universiteit Brussel, B-1050, Brussel, BELGIUM.

T.J.Hall, Dept of Electronic and Electrical Engineering, King's College, University of London, Strand, London WC2R 2LS.


Contact : Email - MAH@IF'G.PH.KCL.AC.UK


Abstract


A neural network architecture is described which uses stochastic processing techniques to perform the weighted-input multiplication, summation and thresholding processes of a neuron using the optimal amount of hardware. It will be argued that the advantage of this approach is that it will allow large neural networks to be fabricated with relatively small amounts of hardware. The architecture allows a choice to be made between the speed and accuracy of processing, as well as a choice of hardware. Implementations of a bit stream neuron using electronic, optoelectronic and optical hardware are developed and their capabilities are compared based on speed of processing and network size. The aim of this study is to investigate the capabilities of optical logic in distributed processing systems, together with the use of optical thyristors as logic elements. It is shown that optical processing and optical interconnection allow a simplification of the processing sequence and allow the parallelism of distributed systems to be utilised.


1.0 Introduction

Table 1: Accuracy of bit streams of different lengths.

Bit stream length (bits)    Error       Accuracy (%)
16                          1/8         87.5
1024                        1/64        98.44
4096                        1/128       99.22
16384                       1/256       99.61
262,144                     1/1024      99.90
4,194,304                   1/4096      99.976
10^10                       1/65536     99.9985

The complexities associated with the fabrication of distributed processing systems, in particular neural network processing systems, stem from the fact that these architectures distribute the processing between a number of cells in their system structure. Each cell performs localised processing of the inputs it receives and provides output to other cells. In a neural network, the cells or neurons may receive excitation from many inputs (1-1000, or more), each of which has to be multiplied by a weight which signifies the importance of that input to the neuron's decision. Conventional analog and digital electronic hardware implementations of neural architectures often use much of the available hardware to implement the calculation of the product of the weights and inputs. The philosophy behind our research is to simplify the hardware required to perform the computations by simplifying the form of the data which it has to process.

1.1 Simplified Data; Simplified Hardware.

Consider that the data for each input and weight is coded as a stream of bits (1's and 0's) in time. The coding we will consider in this report is a linear relationship between the occurrence probability of a 1 in a bit stream and the analogue value that is being represented. In fact, the occurrence probability for the neuron we will consider in this report is given by $p = (v + 1)/2$, where v is the normalised real value being represented, which exists in the range [-1 .. 1]. The occurrence probabilities of 0, 0.5 and 1 correspond to the real values of -1, 0 and 1, respectively.

Bit streams with lengths from 16 bits to 100,000 bits, or more, are used to represent the values with a varying amount of accuracy. The advantage of using this representation as opposed to using the weighted-binary form is that components as simple as AND-gates or XNOR-gates[1] can be used to perform computations such as multiplication and addition. The main disadvantage is that to obtain a high degree of accuracy a long bit stream is required. Table (1) shows the accuracies for a number of different length bit streams where the occurrence probability is linearly proportional to the value being represented. Neural network systems handle noisy data well. The neuron presented in this report will be considered to require bit streams with lengths of 10,000 bits, which will provide sufficient accuracy for correct neuron operation. In fact, it is likely, for neural network applications, that bit stream lengths of 1000 bits and below will allow correct neuron operation, but this is dependent on the neural structure that is being considered. It should be noted that the search for the minimum bit stream length required for correct network operation could be included as part of the network learning algorithm.
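The coding and the XNOR multiplication described above can be illustrated with a short simulation. This is a minimal sketch, not the authors' implementation: the function names, the software random number generator and the example values 0.5 and -0.6 are assumptions made for illustration. It relies on the fact that, for two independent streams with occurrence probabilities (a + 1)/2 and (b + 1)/2, the XNOR output has occurrence probability (ab + 1)/2.

```python
import random

def encode(v, n_bits, rng):
    """Encode v in [-1, 1] as a bit stream with P(bit = 1) = (v + 1) / 2."""
    p = (v + 1.0) / 2.0
    return [1 if rng.random() < p else 0 for _ in range(n_bits)]

def decode(bits):
    """Estimate the represented value from the fraction of 1s in the stream."""
    return 2.0 * sum(bits) / len(bits) - 1.0

def xnor_multiply(a_bits, b_bits):
    """Bit-wise XNOR; for independent streams this multiplies the coded values."""
    return [1 - (a ^ b) for a, b in zip(a_bits, b_bits)]

def demo(n_bits=10_000, seed=0):
    """Multiply two illustrative values (hypothetical choices: 0.5 and -0.6)."""
    rng = random.Random(seed)
    x = encode(0.5, n_bits, rng)
    w = encode(-0.6, n_bits, rng)
    return decode(xnor_multiply(x, w))  # a noisy estimate of 0.5 * -0.6

print(demo())
```

Shortening `n_bits` makes the estimate noisier, which is the speed versus accuracy trade-off discussed in the text.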

1.2 The Structure of a Bit Stream Neuron

A neuron is an element which can determine whether the excitation it is receiving falls into the characteristic it has been trained to identify. Three distinct processes are required by a neuron. Firstly, the inputs to the neuron are multiplied by a weighting factor which signifies the importance of that particular input in determining the neuron's output. Secondly, all the weighted-input values are summed and compared to a threshold value. An excitation greater than the threshold value indicates that the input resembles the trained characteristic. The requirement of the third process is to translate the excess excitation into an output value which signifies the probability that the characteristic has been identified.

Figure 1: Schematic of the functional components of a Bit Stream Neuron. (Legend: I - Stochastic Input Bit, a single bit in the sequence generated to represent this input; W - Stochastic Weight Bit, a single bit in the sequence generated to represent this weight. The input bit and weight bit are multiplied by an XNOR gate; all signals are 1 bit wide.)

Figure 2: Digital Electronic Implementation of Summation and Thresholding using a Counter. (Labels: counter preloaded with the maximum possible count of the counter minus the neuron's Threshold Value; n inputs and weights over n time slices; single output bit every n cycles.)

0-8186-7101-7/95 $4.00 © 1995 IEEE

Figure 1 shows a schematic of a bit stream neural element. The temporal bit streams representing the weights and inputs of the neuron are presented to it and the corresponding weight and input bit pairs are multiplied using an XNOR gate. In this example, each input and weight is considered to have its own individual XNOR gate. Each temporal bit of the bit streams from the spatial input channels gives an indication of the total input probability of the neuron. The multiplication of the weight bit streams with the input bit streams results in a spatial bit stream (across the output plane of the XNOR gates) which represents the excitation probability of the neuron.

1.2.1 Thresholding

For the neuron to assess the level of excitation encoded by the spatial output of the XNOR plane, it must determine what proportion of the channels contain bits set to 1. Therefore, a spatial summation is required. In the digital electronic implementation of this architecture[2], the summation is performed by a counter (Figure (2)) which counts the output of a single XNOR gate. This gate is provided with n input and weight bits over n cycles, one input and weight pair being processed each cycle and providing its contribution to the counter value. In this implementation example, time multiplexing of the inputs and weights to a neuron is necessary to reduce the amount of hardware used by the system. The threshold value for the activation of the neuron, below which it is not to provide an output, has been previously determined by the learning algorithm of the network, and a value of the Maximum Possible Count of the counter minus the neuron's Threshold Value is loaded into the counter before the counting begins. If the counter overflows when it counts the XNORed inputs, the neuron will fire, resulting in a single output bit for that particular time-slice of the processing of the bit streams.

An added complication in understanding the thresholding process is that the Threshold Value is not a constant value for all time-slices of the processing. In order to achieve both Sigmoidal and Linear Activation Functions, the threshold value is chosen according to a threshold probability distribution. The choice of the threshold probability distribution determines the shape of the activation function. Choosing a distribution like that shown in Figure 3(a), a sigmoid translation of the input probability to the output probability (Figure 3(b)) is achieved as a consequence of the Central Limit Theorem[4]. This threshold distribution has a central probability, representing the threshold level, and a variance from this central probability, which allows the threshold to take into account the stochastic nature of the inputs. If the threshold level is chosen such that each threshold value in the allowed range has an equal probability of being chosen then a linear activation function is realised (Figure 3(b,c)). It is also advantageous to perform the translation of the activation function along the x-axis. This can be accomplished by directly altering the threshold value during the generation process, i.e. the central threshold probability could be altered.

Figure 3: Realisation of the Sigmoid and Linear Activation Functions. (Panels: Threshold Distribution, plotted against Count of Inputs; Translation of Input to Output, plotted against Input Probability Pi.)
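The counter-based thresholding and the effect of the threshold distribution can be sketched in software. The sketch below is illustrative only and rests on several assumptions that are not taken from the paper: the maximum count is set equal to the number of inputs, overflow is modelled as the counter exceeding that maximum, the peaked distribution is a rounded Gaussian with a hypothetical spread of n/10, and all function names are invented for the example.

```python
import random

def fire(xnor_bits, threshold, max_count):
    """Preload the counter with max_count - threshold, count the 1s, fire on overflow."""
    counter = (max_count - threshold) + sum(xnor_bits)
    return 1 if counter > max_count else 0

def uniform_threshold(n_inputs):
    """Every threshold in 0 .. n_inputs - 1 equally likely (linear activation)."""
    return lambda rng: rng.randrange(n_inputs)

def peaked_threshold(n_inputs):
    """Thresholds clustered around n_inputs / 2 (sigmoid-like activation)."""
    centre, spread = n_inputs / 2.0, n_inputs / 10.0  # spread is an assumption
    def draw(rng):
        t = round(rng.gauss(centre, spread))
        return min(max(t, 0), n_inputs - 1)
    return draw

def output_probability(p_in, n_inputs, draw_threshold, slices, rng):
    """Fraction of time slices in which the neuron fires for excitation probability p_in."""
    fired = 0
    for _ in range(slices):
        bits = [1 if rng.random() < p_in else 0 for _ in range(n_inputs)]
        fired += fire(bits, draw_threshold(rng), n_inputs)
    return fired / slices

rng = random.Random(1)
for p in (0.2, 0.5, 0.8):
    lin = output_probability(p, 20, uniform_threshold(20), 5000, rng)
    sig = output_probability(p, 20, peaked_threshold(20), 5000, rng)
    print(p, lin, sig)
```

With the uniform distribution the firing probability tracks the input probability, while the peaked distribution compresses low inputs towards 0 and high inputs towards 1. Shifting `centre` moves the activation function along the x-axis.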

1.2.2 The Electronic Implementation of the Neural System

A simple architecture[2] for the implementation of a bit stream neural network has been researched by John Shawe-Taylor and co-workers. Their research is directed towards implementing this architecture in digital electronics, making use of Field Programmable Gate Array [FPGA] technology. Due to the limits on the number of pins in an electronic package, the architecture of the electronic implementation makes use of time-multiplexing to a great extent. The layout of a group of m neurons within a chip is shown in Figure(4). At the input side of the neurons there is an m-bit long shift register which is consecutively loaded with the weight bit of each of the m neurons corresponding to the particular input they are currently expecting. The input is then presented and the XNOR performed independently for each neuron's excitation. The counter is then clocked with the result of the XNOR multiplication and the procedure repeats until all the inputs and weights have been considered. The outputs of the neurons are then clocked into an output shift register which is used to shift out the output of each of the neurons in this layer.

Figure 4: Structure of a Layer of m Electronic Bit Stream Neurons

The use of time-multiplexing in this implementation reduces the pin usage of the chip but increases the processing times of the neurons. The processing time of a layer of m neurons of this electronic neural system can be stated as

Processing Time = B * m * I * C

where
B - Number of bits in the bit stream.
m - Number of neurons in a layer.
I - Maximum number of inputs per neuron.
C - Clock period.

The processing time is scaled by the number of neurons in a layer and the number of inputs to the neurons. These factors are only due to the movement of the weight bits to the correct neurons. Consider a network processing bit streams 10,000 bits long with a clock period of 20ns. Without any time-multiplexing, the processing time would be 200µs. For a network of 10 neurons with 10 inputs per neuron there is a delay of 20ms. Decreasing the clock period can improve the situation, but making large networks from multi-chip digital components will still be limited by the time-multiplexing delays. A realistic option for the fabrication of larger networks would be to section the neurons of the architecture into small groups where each group receives its weight inputs via a separate pin on the chip. There are many applications which can benefit from the versatility of this architecture and its fabrication in digital electronic technologies, but producing large networks working at layer processing speeds of less than 40ms will be difficult unless the addressing time delays of the weight signals to the neurons are overcome.
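The figures quoted above follow directly from the processing time expression. The snippet below is a small worked check; the function names are hypothetical, and the last calculation (the largest product m * I that fits in a 40ms budget) is an added illustration rather than a result stated in the paper.

```python
def processing_time(bits, neurons, inputs, clock_period_s):
    """Layer processing time in seconds: B * m * I * C."""
    return bits * neurons * inputs * clock_period_s

def max_neuron_input_product(budget_s, bits, clock_period_s):
    """Largest m * I that keeps B * m * I * C within the time budget."""
    return budget_s / (bits * clock_period_s)

B, C = 10_000, 20e-9  # bit stream length and clock period from the text

no_multiplexing = processing_time(B, 1, 1, C)    # about 2e-4 s, i.e. 200 microseconds
small_layer = processing_time(B, 10, 10, C)      # about 2e-2 s, i.e. 20 ms
budget_limit = max_neuron_input_product(40e-3, B, C)

print(no_multiplexing, small_layer, budget_limit)
```

Under these numbers a 40ms budget admits a product m * I of roughly 200, which shows how quickly the time-multiplexing delay constrains the network size.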

1.3 Application Requirements and the Limits of Electronic Technology

A good overview of the requirements of neural network systems for demanding real-time processing applications such as RADAR pulse identification and vision processing is given in [4]. Considering these and other applications, the goals for the development of hardware neural systems can be stated as:

(1) A modular network system with 1000 neurons per module, each with up to 100 inputs per neuron.

(2) Speed of processing to match real-time application. In image processing applications, a reaction within ~40ms is required; for RADAR pulse identification 50µs is required.

(3) Low power consumption - dependent on fan-out requirements of processing architecture and processing speed.

(4) Global interconnectivity of neurons in system. Realistic systems fabricated in electronic hardware limit the connectivity to reduce wiring costs.

(5) Hardware to implement learning.

(6) Compactly constructed and cheap to produce.

The bit stream architecture discussed in this paper lends itself to layered networks as the output of one layer is easily pipelined to the next. However, the philosophy of bit stream processing can be applied to other structures of network. Fabrication of a large electronic bit stream neural system with the requirements as stated above would be difficult even in wafer scale circuits, even though the neuron structure is extremely simple.

Electronic implementations of neural systems have been produced which can provide 100 inputs per neuron and process with speeds of a few microseconds, but only a few neurons (50-250) are realisable[4]. Large sequential networks have been fabricated in electronics[5]; however, if we are to meet the requirements for large networks and quick processing speeds we must implement cells which process concurrently. RAM based neural networks[5] and Wafer Scale Integration (WSI) of electronic systems have been investigated[6,7] and networks containing approximately 500 parallel processing neurons can be fabricated per wafer. Neural circuits can take advantage of WSI if they are fabricated in autonomous cells. Fabrication errors are likely in the circuit growth and a few cells may not function, but the loss of one cell will not significantly affect the performance of a neural system. Regular circuitry also reduces the interconnection lengths of the wiring between cells and therefore different propagation delays for different connections will not be experienced.

Connection lengths are an important factor in circuit fabrication, especially now that device and connect dimensions are being reduced to the sub-micron level. Reducing these dimensions not only allows more devices to be fabricated on a chip but allows faster switching of devices. Transistor switching times are being reduced below the effective RC time of the wiring. To take advantage of the increase in switching speed of the active elements the wire lengths need to be short. Short wiring will have a large effect on increasing the packing density of the circuitry, as typically a higher percentage of the wafer surface is consumed by the wiring than by the devices themselves. Another major limitation of the electronic hardware is seen to be the number of pins on the electronic packaging.

A much simpler approach to interconnect large cellular processing systems is to use optical signals between cells. Optical transceivers fabricated in GaAs material can be flip-chip bonded to the surface of the silicon wafer in the vicinity of the circuitry they will serve. Cells on other wafers or even on the same wafer can communicate via an optical link. Our research aims to use optoelectronic devices not only to act as the communicators but as the logic of the system. The following sections detail the investigations that have been performed into the fabrication of the bit stream neural architecture using optical thyristors as the main logic element of the system.

1.4 Optical Logic

Before we consider the functional blocks of the system we need to consider which optoelectronic device we are going to investigate as the logical component of the system. The requirements of smart optoelectronic devices to be used as the logical element of the bit stream neural processing system are:

(a) Logical functions like AND, OR, NOT, XNOR etc., should be possible.

(b) The fabrication of arrays of devices should be possible.

(c) It must detect optical input signals and require low optical energies (optical sensitivity in the fJ range).

(d) It must have a fast switching speed (10MHz+ range).

(e) The pixel size should be small (< 100µm * 100µm).

In studying the merits of a range of optical switching or modulating devices, the optical thyristor device[8] was chosen. Our studies in using this element in the neural network architecture have benefited from the logical functionality of the device, as well as its differential behaviour. It is our aim to perform further studies in implementing stochastic optical bit stream neural architectures and to compare the merits of using the optical thyristor, Self Electro-optic Effect Devices (FET-SEED) and Ferro-Electric Liquid Crystal (FELC) devices. An optical thyristor switching speed of 50MHz has been achieved[8] with a light detection sensitivity of 15aJ/µm². This is an impressive performance from a device which is in its infancy of development. A higher operating speed of 155MHz has been achieved by the FET-SEED device[9], but with a significantly lower sensitivity. FELC devices offer approximately 20kHz switching speeds, operate at low power and can be produced as large arrays of pixels.


1.5 The Optical Thyristor Differential Pair - The Logical Element

The differential pair configuration of an optical thyristor uses the competition for common current to aid the switching of the device. The device consists of a layered PnpN structure of AlGaAs and GaAs. The mesa structure is shown in Figure(5). A light signal enters the structure through the window in the top N layer and carriers are generated when the light is absorbed. Each mesa in the differential pair can receive an optical input. The difference in the amount of light that the individual thyristor elements of the pair receive determines which device will turn-on. The thyristors also receive an electrical switching signal VAK which can place the device into one of four states; RESET, RECEIVE, WAIT and EMIT. The RESET signal clears the memory of the device from the most recent output state. This signal is required because the thyristor which turns-on during the competition process stores carriers in its central np regions and the memory of the state of the pair would otherwise remain for several microseconds (the devices are bistable). The WAIT signal uses this memory to hold the state of the thyristors without emitting light. The RECEIVE state leaves the thyristors electrically unbiased and allows each of the pair to accumulate charge carriers in its central regions. The EMIT signal forces a competition for the current flow through the common connections of the device. The device with the most absorbed charge will win and emit light.

The optical inputs to the thyristor pair do not only represent the data signals but also bias signals which set the device functionality. Devices can be fabricated so that the absorption and emission of the data light signals are on the same side of the plane, whilst the bias light is applied from the other side of the plane. This can be appreciated from the schematic of Figure(6) where the bias light is applied from the backside and is masked so that only one element of the pair sees this signal. The other element of the pair is imaged with two optical signals representing the inputs I1 and I2. I1 and I2 are considered to have equal intensities. To facilitate an OR operation the bias light should have an intensity which is half the intensity of the inputs I1 or I2. Thus, if I1 or I2 is a digital optical input with a HIGH intensity representing logic 1 then thyristor A will win. For an AND operation the bias signal should be set at a level close to one and a half times the intensity of the input signals. Thus, thyristor A will only win when I1 AND I2 are HIGH. With the correct biasing and choice of output from a differential pair the functions of AND, OR, NAND, NOR and NOT can be performed. Electrical biasing of the thyristors and self biasing (asymmetric devices) are currently under investigation.
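The bias-dependent logic of the differential pair can be modelled as a simple comparison of optical energies. The following is an idealised sketch under stated assumptions: intensities are expressed in units of one HIGH input, the pair is treated as a noiseless comparator, and the function names and the use of thyristor B's output for the inverted functions are illustrative choices rather than details confirmed by the paper.

```python
OR_BIAS = 0.5    # half the intensity of one HIGH input
AND_BIAS = 1.5   # one and a half times the intensity of one HIGH input

def differential_pair(inputs, bias):
    """Return (A_on, B_on): A sums the data inputs, B sees only the bias light."""
    a_wins = sum(inputs) > bias
    return (1, 0) if a_wins else (0, 1)

def logic_or(i1, i2):
    return differential_pair([i1, i2], OR_BIAS)[0]

def logic_and(i1, i2):
    return differential_pair([i1, i2], AND_BIAS)[0]

def logic_nor(i1, i2):
    return differential_pair([i1, i2], OR_BIAS)[1]   # complementary output of the pair

def logic_nand(i1, i2):
    return differential_pair([i1, i2], AND_BIAS)[1]

def logic_not(i1):
    return differential_pair([i1], OR_BIAS)[1]

for a in (0, 1):
    for b in (0, 1):
        print(a, b, logic_or(a, b), logic_and(a, b), logic_nor(a, b), logic_nand(a, b))
```

Because exactly one element of the pair wins each competition, the two outputs are always complementary, which is what makes the inverted functions available from the same hardware.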

1.6 Expected Performance of the Thyristor in a Processing System

The best reported figures for the device performance of the thyristor are that it can switch at a speed of 50MHz with an optical sensitivity of 15aJ/µm²[8,10]. The speed of operation at this sensitivity is hindered by the coupling efficiency of light from one element to the other, i.e. by the losses of the optical system. The light output of these devices is like that of an LED, but future designs of the element, incorporating a Bragg reflector in the structure and microlenses over the optical window, are attempts to concentrate the light output into a beam with a divergence angle of approximately 7 degrees. The turn-off process of the thyristor does not rely on slow carrier recombination as does an LED; instead, excess carriers are quickly extracted from the device with the application of a reverse bias across the device.

Investigations of the thyristor element and arrays in processing systems are being performed to develop an efficient optical system for their interconnection[11]. The optical system that is being developed is shown in Figure(7). This system uses microlenses, GRIN lenses and amplitude beam splitters to interconnect arrays of thyristor differential pairs. The results of the investigations[12] show that this set-up will allow arrays of 1000-2000 elements (thyristor size 5µm * 10µm) on a pitch of 2mm square to be imaged via a GRIN lens based system. Besides the dimensions of the GRIN lens limiting the number of elements in an array, the power dissipation limit of the GaAs material (10Wcm^-2) places a physical limitation on array size. Table (2) details the expected limits on array size with respect to the limitations imposed by the usable GRIN lens image area and the heat dissipation of GaAs. However, most of the heat dissipation occurs when the thyristor elements are emitting light, which is usually only a small period of the switching cycle of the thyristor. If the thyristor is emitting for one quarter of the switching cycle then four times as many thyristor elements will be allowed within the given heat limit.

Figure 5: The GaAs/AlGaAs Mesa Structure of a PnpN Photothyristor. (Labels: Optical Input, Top Contacts, Cathode K, Anode A, Bottom Contact.)

Figure 6: Optical biasing of a differential thyristor pair

Figure 7: Optical Imaging System developed for the Thyristor Logic Planes. (Labels: Input and Output, 50/50 Amplitude Beam Splitter, GRIN Lens, microlens array over thyristors.)

Table 2 : Limits Placed on Array Size due to GRIN Lens Diameter and Heat Dissipation of Imaged Area.

A lower operating current allows an increase in the number of elements in the array, but the required increase in device resistance should be made considering the effect that this has on the speed of the thyristor elements; the device speed is RC limited. In the near future, it is predicted that the improvements in the coupling efficiency of the optical system will improve the operating speed of the transcription of data between elements in a system to above 100MHz. Efforts to increase the switching speed are being made without adversely affecting the achievable sensitivity of the device. The external emission efficiency is currently about 0.2% and a major thrust of the device development is to improve this efficiency to approximately 2%. A cycle time of 1ns for devices operating at 0.1mA/0.3mW is seen as the goal of the device development.

The expected data transcription speed performance of the optical system shown in Figure(9) depends greatly on the efficiency of the imaging system. A system has been constructed with an imaging efficiency of 0.0072%, thus the emission time of a thyristor of 5µm * 10µm would need to be 4ns (3mW/channel) (see ref [12] for details of calculations). It has been predicted that improvements in the emission efficiency and the optical system would allow emission times of ~40ps to be achieved for an optical system without a 50/50 beam splitter present. Each beam splitter placed in the optical path reduces the efficiency by 50% and therefore increases the required emission time of the transmitting thyristors by a factor of 2. With emission times of 40ps, it is not the optical system which is limiting the speed of operation, but the switching speed of the thyristor and specifically the reset time of the device (turn-off RC time). The electronic circuitry required to bias the thyristor elements and to provide the switching signal is also a limiting factor for the switching speed of these elements. The complexity of the switching signal waveform can be as simple as a sine wave or as complicated as a four level square wave (corresponding to the Reset, Receive, Wait and Emit stages of the switching cycle). It is feasible to produce a sine wave switching signal at 1GHz and above; more complex waveforms can be produced but at lower frequencies.
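The effect of beam splitters on the required emission time is a simple doubling per splitter. The helper below is a hypothetical illustration that assumes ideal 50/50 splitters and the ~40ps splitter-free emission time quoted above; it is not a model of the full optical system.

```python
def emission_time_ps(base_ps, beam_splitters):
    """Required emission time when each 50/50 splitter halves the efficiency."""
    return base_ps * 2 ** beam_splitters

for k in range(4):
    print(k, emission_time_ps(40, k))
```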

In this report we will consider that the thyristor performance is such that a turn-ON time of 200ps can be achieved and that the turn-OFF time is 700ps. (40ps emit and receive times will be considered for an optical interconnection system without beam splitters present.) Now that the functionality of the thyristor smart pixel device has been explained, let us investigate the applicability of its use in the neural processing system.

2.0 Towards an Optical Bit Stream Neural Network

The optical implementation of this neuron requires three distinct functions to be realised, either optically or as a mixture of optical and electronic technology. Firstly, the bit streams which represent the input and weight data must be generated. The corresponding input and weight bits must be multiplied using an XNOR-gate and then summed. The design of the optical system can take advantage of planes of arrays of optoelectronic elements which can process the spatially distributed channels of weights and inputs simultaneously. Therefore, the input and weight bit streams for each weight and input should be generated simultaneously, each channel requiring its own bit stream generation hardware and XNOR hardware. The spatially distributed results of the XNOR multiplication have then to be summed. A statistical method to achieve a summation of the spatial channels will be described during this section.

2.1 Stochastic Sequence Generators

The problem of converting binary n-bit values into stochastic bit streams can be simplified if the problem is split into two parts. Firstly, bit streams with occurrence probabilities of 0.5 are produced and then they are modulated in such a way that they are encoded with the correct occurrence probability. The modulation bits of the binary value set the corresponding modulators (MOD) to act as either an AND-gate (modulation bit = 0) or an OR-gate (modulation bit = 1). The inputs to these gates consist of a bit stream with an occurrence probability of 0.5 and the output of the previous modulator. The initial modulator input is set to zero. The most significant bit of the binary value is connected to the modulator which provides the bit stream output to the processing structure. Each AND configured modulator produces a bit with the probability of $p_{in}/2$ and each OR configured modulator with $p_{in}/2 + 1/2$, where $p_{in}$ is the occurrence probability at the output of the previous modulator. The combined effect of the cascade of the modulator modules is to produce bit streams with the occurrence probability corresponding to the binary value of the modulation bits. This effect can be appreciated from the truth tables in Figure(8). An electronic implementation of this structure is straightforward but uses a lot of hardware, especially as each input and weight of the system requires a separate modulation unit. An optoelectronic implementation of this module could be fabricated using planes of the thyristor devices which would utilise the ability of these devices to be configured either as an AND-gate or as an OR-gate, depending on the bias of the device pair. The optical set-up for this module is shown in Figure(9).

Figure 8: [Truth tables of the modulator; recoverable labels: OR: $P_{out} = 0.5 + 0.5 P_{in}$; AND: $P_{out} = 0.5 P_{in}$.]

[Figure 9 labels: array of 0.5 bit stream inputs; AND / OR of the 0.5 bit stream and plane (2)'s output; bit stream with desired bit probability, valid output after n cycles.]

Figure 9: Optoelectronic Implementation of the Modulation Module

It consists of a 50/50 cube beam splitter and two opposing thyristor planes. An array of incident optical bit streams with probability p = 0.5 is sent to plane (1), together with the stored data from plane (2); for the initial modulation step, plane (2) contains zero data. The thyristors in plane (1) are biased by modulation bits which are stored in a shift register (one per data channel) and shifted out to bias each thyristor as an AND or OR gate. Each bit in the shifted sequence corresponds to one step in the modulation procedure. Thus, for an n-bit digital value to be converted into one stochastic bit of a bit stream, n shifts of the shift register, which contains the n modulation bits, must be performed. The expected speed performance of this modulator structure has been estimated as 1440ps per modulation step. The main disadvantage of this modulation system is that it requires many modulation steps before it provides an output. Other modulation schemes are being considered which show promise but require further investigation. It is expected that the modulation module will operate with a pipeline delay of 1520ps.

The problem of producing bit streams with an occurrence probability of 0.5 has been investigated and options which are considered feasible for use in this processing system are detailed in Section 3.

2.2 Multiplication of Corresponding Weights and Inputs using XNOR Logic

The simultaneous bit-wise XNOR multiplication of spatially parallel channels of weight and input bit streams is the next function that has to be performed. The implementation of bit-wise logic operations such as AND, OR, NOT and XOR using the optical thyristor devices has already been demonstrated in a design for a programmable logic array[11]. The schematic of an optimised design for the implementation of the XNOR operation is shown in Figure(10).

[Figure 10 labels: OR plane; AND plane; output $Q = AB + \overline{A}\,\overline{B}$ to buffer plane.]

Figure 10: XNOR Processing Module using Thyristors


The processing time of the XNOR process has been estimated as 1640ps. The processing of this module occurs in three steps: the input to A and B, the movement of this data to the AND and OR processing planes, and the output of this data to the next module.
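The reason an XNOR gate acts as a multiplier can be checked numerically. The sketch below assumes the bipolar mapping v = 2p - 1 between a stream's occurrence probability p and the value it represents, which is the usual convention when XNOR is used for stochastic multiplication, and it assumes the two streams are statistically independent. The helper names and the example probabilities are illustrative only.

```python
import random

def bit_stream(p, length, rng):
    """Independent bits, each 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def xnor_multiply(a, b):
    """Bit-wise XNOR of two equal-length streams."""
    return [1 - (x ^ y) for x, y in zip(a, b)]

def bipolar(stream):
    """Value represented by a stream under the assumed mapping v = 2p - 1."""
    return 2.0 * sum(stream) / len(stream) - 1.0

rng = random.Random(7)
n = 50000
inp = bit_stream(0.80, n, rng)   # represents +0.6
wgt = bit_stream(0.25, n, rng)   # represents -0.5
product = bipolar(xnor_multiply(inp, wgt))
print(round(product, 3))         # close to 0.6 * -0.5 = -0.3
```

For independent streams the XNOR output is 1 with probability p_a p_b + (1 - p_a)(1 - p_b), and substituting v = 2p - 1 gives exactly v_a v_b.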

2.3 Spatial Summation of Weighted-Input, Thresholding and Realisation of Neural Transfer Function

The final component of a bit stream neuron must perform the summation of the weighted-inputs and compare this excitation with a threshold value. The threshold value should be generated probabilistically to allow the linear and sigmoid transfer functions to be realised.

[Figure 11 labels: output of XNOR module; spatial summation "percolation"; TOP CONTACT; BOTTOM CONTACT.]

Figure 11 : Percolation System for the Spatial Summation of Incoming Optical Signals

Here we propose a statistical method of comparing the number of weighted-inputs which are HIGH with a threshold level. Consider that the array of optical outputs carrying the true signals of the XNOR plane is imaged onto a network of photosensitive cells: Figure(11). Each cell in the network receives the output of one element of the array and, when illuminated, allows current to flow in any direction across its domain, i.e. bi-directional current flow to and from its nearest neighbours. If the illuminated cells in the matrix form a connected path from the TOP electrode to the BOTTOM electrode, current driven by the bias voltage flows along that path and the resistance measured across the network changes from HIGH (no path connecting the electrodes) to LOW. Detecting whether or not there is current flow between the electrodes provides a statistical estimator as to whether the number of illuminated cells in the network is greater than a threshold level. The statistics of the probability of the network having a connection path between the electrodes are based on the theory of percolation systems[13]. The statistics of these systems depend strongly on the structure of the matrix of cells and the allowed connectivity of a cell with its neighbours.

A 16 x 16 matrix of square cells (each cell connected to its nearest neighbours above, below and to its sides) has been simulated. The results shown in Figure(12) are for bit stream lengths of 2048, 16384 and 65535 bits. The curves represent the translation of the average occurrence probability of the input bit streams to the probability of the percolation network having a connection path between its electrodes. The threshold value of the percolation is the value of average input probability which causes the system to percolate with a probability of 0.5, i.e. the output probability at threshold is 0.5. Input probabilities below and above the threshold cause percolation of the system with a probability ranging from 0 to 1, following a sigmoid curve as a consequence of the statistical nature of the summation process. The curves become smoother with longer bit stream lengths, as is expected (less noisy data into the system, less noisy data out).

[Figure 12 panels: Percolation 2048; Percolation 16384; Percolation 65535. Horizontal axis: input probability.]

Figure 12 : Translation of Input Probability to Output Probability of the Percolation System for a Square Matrix

The threshold probability at which the simulated matrix "percolates" is 0.59, which is due to the structure of the matrix; if a triangular matrix is used, a percolation probability of 0.5 would be realised[13] (for an un-directed percolation system). Directed percolation systems are those which restrict the direction of flow of the carriers from one cell to another. Directed bonds between cells of the system decrease the probability of the system percolating. Other cell structures have percolation probabilities of higher and lower values than 0.5. For example, a honeycomb structure has a percolation probability of 0.7. If a directed percolation structure is designed with the correct bonding directionality, a honeycomb network could be designed which allows a percolation probability of 0.5. Therefore, there is some compromise which can be taken into account when designing the optoelectronic device structure of the percolation cells.
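The percolation test described in this section can be sketched as a small Monte Carlo simulation. This is not the authors' simulator: it assumes that every cell is illuminated independently with the same probability p, that the TOP and BOTTOM electrodes contact the whole first and last rows, and that current passes only between 4-connected illuminated neighbours. The function names and the trial count are illustrative.

```python
import random

def percolates(grid):
    """True if ON cells connect the top row to the bottom row."""
    rows, cols = len(grid), len(grid[0])
    stack = [(0, c) for c in range(cols) if grid[0][c]]
    seen = set(stack)
    while stack:
        r, c = stack.pop()
        if r == rows - 1:
            return True
        for rr, cc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= rr < rows and 0 <= cc < cols
                    and grid[rr][cc] and (rr, cc) not in seen):
                seen.add((rr, cc))
                stack.append((rr, cc))
    return False

def percolation_probability(p, rng, size=16, trials=400):
    """Fraction of random size x size matrices that percolate."""
    hits = 0
    for _ in range(trials):
        grid = [[rng.random() < p for _ in range(size)] for _ in range(size)]
        hits += percolates(grid)
    return hits / trials

rng = random.Random(3)
curve = {p: percolation_probability(p, rng) for p in (0.3, 0.45, 0.59, 0.75, 0.9)}
for p, q in curve.items():
    print(p, q)
```

Each trial corresponds to one bit period of the neuron, so the number of trials plays the role of the bit stream length: more trials give a smoother estimate of the sigmoid-like transfer curve.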

2.3.1 Adding "Noise" to the Summation Process to Alter the Threshold Level

Consider that the cell receives the input from the XNOR plane. The cell could either detect this signal and use it in the percolation process, or it could be forced ON by a noise signal with a certain probability, or forced OFF by a noise signal with a certain probability ("ZERO" noise). This addition of noise can be used to artificially alter the percolation threshold of the detector matrix. Figure(13) shows the simulated effect of the cells in the network receiving a probabilistic signal which turns a cell OFF with a certain probability. It can be seen that the threshold of the percolation process is shifted up the x-axis, such that a higher average probability of the inputs to the cells is required for the matrix to percolate. Turning the cells ON artificially with a given probability has the effect of translating the threshold probability towards the left.
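The direction and size of the threshold shift can be estimated with a short calculation. The sketch assumes that the noise acts on each cell independently of its input, so that a cell with input probability p is ON with probability p(1 - q_off) under "ZERO" noise and with probability p + (1 - p)q_on under turn-ON noise. The value 0.59 is the square-matrix threshold quoted in the text; the function names and the noise level 0.2 are hypothetical.

```python
P_C = 0.59  # percolation threshold of the square matrix, as quoted in the text

def threshold_with_zero_noise(q_off, p_c=P_C):
    """Input probability needed to percolate when cells are forced OFF
    with probability q_off: solve p * (1 - q_off) = p_c for p."""
    return p_c / (1.0 - q_off)

def threshold_with_on_noise(q_on, p_c=P_C):
    """Input probability needed to percolate when cells are forced ON
    with probability q_on: solve p + (1 - p) * q_on = p_c for p."""
    return (p_c - q_on) / (1.0 - q_on)

print(threshold_with_zero_noise(0.2))  # above 0.59: threshold moves right
print(threshold_with_on_noise(0.2))    # below 0.59: threshold moves left
```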

Buffering of the input signal using the set-up shown in Figure(14) allows the noise to be applied as desired. Both increasing and decreasing the threshold level are facilitated using one thyristor pair which is supplied with two bias signals. In Case 1, the bias signals are such as to allow the buffering and inversion of the true input from the XNOR element. In Case 2, the bias signals cause thyristor B to always win, and in Case 3 thyristor A always wins. The biasing of the thyristors can be provided by optical signals; alternatively, the U/4 bias for thyristor A could be derived from asymmetry of the thyristor structure, whilst the bias of thyristor B could be supplied as an optical or electrical signal: Table(3).

2.3.2 Using the Percolation System to Derive Data for Network Learning

The percolation matrix can also be used to derive the dependency of a neuron's output on a particular input, which can be used by a gradient based learning algorithm. If the detector cell in the matrix relating to an input is turned OFF and, as a result, the current across the percolation array changes from conducting to non-conducting, the output is seen to be dependent on that input. Using this technique, a stochastic dependency estimator is determined and can be used by a learning rule to train the neuron[14].
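The dependency test can be illustrated on a small matrix. The sketch below is an assumption-laden software analogue, not the learning rule of [14]: it treats the matrix as a grid of 0/1 cells with 4-neighbour conduction between the top and bottom rows, and the names `conducts` and `depends_on` are hypothetical.

```python
def conducts(grid):
    """True if ON cells join the top row to the bottom row."""
    rows, cols = len(grid), len(grid[0])
    stack = [(0, c) for c in range(cols) if grid[0][c]]
    seen = set(stack)
    while stack:
        r, c = stack.pop()
        if r == rows - 1:
            return True
        for rr, cc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= rr < rows and 0 <= cc < cols
                    and grid[rr][cc] and (rr, cc) not in seen):
                seen.add((rr, cc))
                stack.append((rr, cc))
    return False

def depends_on(grid, row, col):
    """True if turning the given cell OFF changes the array from
    conducting to non-conducting."""
    if not grid[row][col] or not conducts(grid):
        return False
    trial = [list(r) for r in grid]
    trial[row][col] = 0
    return not conducts(trial)

grid = [
    [0, 1, 0],
    [0, 1, 1],
    [0, 1, 1],
]
print(depends_on(grid, 0, 1))  # the only ON cell in the top row
print(depends_on(grid, 2, 1))  # bypassed through the right-hand column
```

Averaging the result of `depends_on` for one cell over the bits of a stream would give a stochastic estimate of how strongly the output depends on that input.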

[Figure 13 panel: Percolation 16384.]

Figure 13: Effect of Adding "ZERO" noise to the Spatial Input of the Percolation System

[Figure 14 labels: input from XNOR plane; thyristor pair in plane B; output to percolation plane.]

Figure 14: Buffering System to Impose Thresholding Noise on the Data sent to the Percolation Plane.

          Light INPUT   Bias A   Bias B        OUTPUT   FUNCTION
CASE 1    0             U/4      (U+U/4)/2     1        Thyristor B wins
          U             U/4      (U+U/4)/2     0        Thyristor A wins
CASE 2    0             U/4      > (U+U/4)     1        Thyristor B wins
          U             U/4      > (U+U/4)     1        Thyristor B wins
CASE 3    0             U/4      0             0        Thyristor A wins
          U             U/4      0             0        Thyristor A wins

Table 3 : Truth Table for the Functionality of the Thyristor Devices in the Buffer Plane.

3.0 Production of Parallel Channels of 0.5 Bit Streams

This section describes a number of approaches for the implementation of a multi-channel bit stream generator where each channel's bit stream occurrence probability is 0.5. The integration of these methods into the processing system is considered and a comparison of the advantages of each method is made.


3.1 Linear Feedback Shift Registers

A widely used technique for producing digital noise is based on a shift register. The outputs of certain stages of the register are fed back to the input of the register through an XOR gate (mod 2 addition). Careful choice of which stages are fed back results in a maximum length sequence of ($2^m - 1$) states being cycled[15,16], where m is the number of stages. The output is not truly random and thus these generators are called Pseudo Random Bit Generators (PRBG). The electronic implementation of the bit stream architecture uses long registers to produce the 0.5 bit streams that the network requires for the modulator units. Investigations of its use[17] have shown that taps of every other stage can be considered as independent generators to the extent that the sequences are almost uncorrelated, although they are fully overlapping. This can provide sufficiently independent sequences for the modulator units of the system, provided that the clocking of the random bit generator occurs in the opposite direction to the cascaded output of the modulator stream.
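A minimal linear feedback shift register can be written as follows. The 4-stage register with feedback from stages 4 and 3 is chosen only because it is small enough to inspect by hand; it is not the register used in the electronic implementation, and the function names are illustrative.

```python
from itertools import islice

def lfsr_states(taps, seed, m):
    """Yield successive states of an m-stage shift register.
    taps are 1-indexed stages whose outputs are XORed and fed back
    into stage 1; stage k is stored in bit k-1 of the state."""
    state = seed
    mask = (1 << m) - 1
    while True:
        yield state
        feedback = 0
        for t in taps:
            feedback ^= (state >> (t - 1)) & 1
        state = ((state << 1) | feedback) & mask

def sequence_period(taps, seed, m):
    """Number of clock cycles before the starting state recurs."""
    gen = lfsr_states(taps, seed, m)
    first = next(gen)
    for count, state in enumerate(gen, start=1):
        if state == first:
            return count

states = list(islice(lfsr_states((4, 3), 0b0001, 4), 15))
stream = [s & 1 for s in states]       # bit stream taken from stage 1
print(sequence_period((4, 3), 0b0001, 4), stream)
```

With these taps the register visits all 2^4 - 1 = 15 non-zero states, and the stream taken from any stage contains 8 ones and 7 zeros per period, which is as close to an occurrence probability of 0.5 as a sequence of odd length allows.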

3.2 Cellular Automata Systems

A cellular automata system consists of cells which are connected with their local neighbourhood. Each cell's output is determined by the state of the cells in its neighbourhood as well as its current state. A cell requires two functional blocks: a memory block and a logic block, which defines the relationship between the current state of the cell, its inputs and the next state (the update rule of the cell). Each cell is thus a state machine. A matrix system of cells can be produced which uses the interaction of the individual cells' state machines to produce 0.5 bit streams. If half of the states in the cell state machine produce a 1 output whilst the other states produce a 0 output, then the cell will produce an output with a probability of 0.5. This is true only if all states of the cell are equally probable at any moment in time. This requirement is difficult to implement, but update rules and cellular automata systems have been investigated[18] which produce 0.5 parallel bit streams with good statistical qualities. Update rules which have been reported as having good properties for cellular random generators are A xor B xor C; A xor (B or C) and A xor (B or (not C)). Wolfram[18] has investigated the statistical properties of cellular systems using these update rules and concludes that the update rule A xor (B or C) gives the best random output. A number of articles[19,20,21] report the research performed on the development of class 3 cellular automata, in which the cell update rule is defined by its current state and the states of two of its neighbours. The design procedure for producing a 2-D cellular automata with a cell output probability of 0.5 is detailed in [21]. The investigations into these cellular systems have shown that systems can be produced where the bit output of a cell is uncorrelated with other bits from the same cell and with those of other cells. Visual inspection of the state of the system cells against time shows seemingly random output, with none of the regular patterns that are apparent in the state-time diagrams of linear feedback shift registers and 1-D cellular automata systems.
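The update rule A xor (B or C) can be tried out on a one-dimensional ring of cells. The sketch assumes that A is the left neighbour, B the cell itself and C the right neighbour, and that the ring wraps around at its ends; the text does not fix these details, so they are choices made for the example, as are the function names.

```python
def step(cells):
    """Apply new = left XOR (self OR right) to every cell of a ring."""
    n = len(cells)
    return [cells[i - 1] ^ (cells[i] | cells[(i + 1) % n]) for i in range(n)]

def centre_bits(width, steps):
    """Bit stream read from the centre cell, starting from a single ON cell."""
    cells = [0] * width
    centre = width // 2
    cells[centre] = 1
    out = []
    for _ in range(steps):
        out.append(cells[centre])
        cells = step(cells)
    return out

print(centre_bits(101, 16))
```

Each cell needs only one bit of memory and a two-gate logic block, which is the kind of structure that maps onto one pixel of a processing plane.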

An electronic implementation of such a 2-D cellular system uses a flip-flop (as the memory element) and a 3-input XOR gate (which performs modulo 2 addition) with inputs from its upper and left cells and the output of its flip-flop. The output of each cell could be electrically connected to an optical emitter device to provide the optical output to an optical modulator system. It is envisaged that arrays of 1000 cells could be fabricated per cm2. In order to increase the density of channels, optical implementations of a cellular automata random bit generator have been considered. A direct implementation of the electronic structure using the thyristors has been investigated and has the form of a 3-D optoelectronic circuit with a shift variant fan-out grating. This set-up requires the interconnection of many planes of devices.

The update rule of the automata can be simplified if the class of the cell is increased, i.e. if an increased number of neighbours provide input to a cell. It is comparatively easier to fan out a signal from one cell to many others in optical automata systems than to interconnect the cells of an electronic automata. A simply structured optical cellular automata using a fan-out of 5 is being investigated as an implementation of the random number generator, but its statistical independence has not yet been proved. Such an optically connected cellular automata will allow the density of cells to match that of the other processing planes of the neural architecture.

3.3 Fibre Speckle

Another method to produce parallel channels of 0.5 bit streams is to use the speckle pattern of a multimode step index fibre to illuminate an array of differential pairs of optical thyristors (Figure 15). The speckle pattern features well-known statistical properties, i.e. it is a gamma distribution of which the degrees of freedom equal the number of speckle cells per thyristor[24]. Each of the thyristors in the differential pair is subjected to the same light distribution, hence both optical thyristors have an equal chance of switching on (if one thyristor switches on, the other is prohibited from doing so), thus generating a logical 1 or a logical 0. By subjecting the fibre to a vibratory motion (ultra-sound or turbulent air flow) and sampling the time-varying speckle pattern (≈1MHz), a binary sequence with probability 0.5 of a bit being set to 1 is generated. The fibre might be replaced by a waveguide in the cross section of which the refractive index can be modulated at megahertz frequencies (>10MHz) by a randomly driven acousto-optic modulator. In that case the speckle pattern could be sampled more often without introducing time correlations between consecutive bits, hence enabling the system to be operated at higher clock frequencies.

The number of speckle cells emitted from a large sized multimode fibre can be estimated from

$$N = \frac{V^2}{2} = 2\pi^2\left(\frac{r\,\mathrm{NA}}{\lambda}\right)^2, \qquad V = \frac{2\pi r\,\mathrm{NA}}{\lambda}$$

where V is called the V-Number of the fibre and is a figure of merit of the fibre. Thus for a fibre with radius r = 250µm and NA = 0.4, coupling light of λ = 780nm, approximately 324444 modes will be present. For the distribution of speckle cells to be considered gaussian, the number of cells incident on the detector area of the thyristor pair windows must be of the order 10 or more. A pair of thyristors, each with an optical window of 10µm*5µm, is considered to consume approximately 40µm*50µm of area in total, allowing for the area used for electrical connects etc. Therefore, 400 speckle cells need to be imaged onto the 40µm*50µm area, which allows approximately 800 channels to be implemented.

[Figure 15 labels: laser light, λ = 780nm; multimode fibre, NA = 0.4, core radius (r) = 250µm; 10µm*5µm thyristor mesas using a total area of 40µm*50µm for a pair and connects.]

Figure 15: Imaging of Fibre Speckle onto Detector Cells
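The channel estimate can be reproduced numerically. The sketch uses the standard step-index relations V = 2πr·NA/λ and N ≈ V²/2, which give the mode count quoted in the text; the variable names are illustrative, and the scaling from 10 cells per window to 400 cells per element simply follows the ratio of the two areas.

```python
import math

def v_number(core_radius_m, numerical_aperture, wavelength_m):
    """V-number of a step index fibre."""
    return 2.0 * math.pi * core_radius_m * numerical_aperture / wavelength_m

def mode_count(v):
    """Approximate number of guided modes (speckle cells) for large V."""
    return v * v / 2.0

v = v_number(250e-6, 0.4, 780e-9)
modes = mode_count(v)

cells_per_window = 10                  # needed for near-gaussian statistics
window_area = 10e-6 * 5e-6             # one thyristor window, m^2
element_area = 40e-6 * 50e-6           # thyristor pair plus connects, m^2
cells_per_element = cells_per_window * element_area / window_area
channels = modes / cells_per_element

print(round(v, 1), round(modes), round(cells_per_element), round(channels))
```

The result is a little over 800 channels, in line with the figure of approximately 800 given in the text.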

The gaussian distribution of speckle cells will have a mean value and variance $\sigma^2$. Since both sides of the pair are subjected to the same normally distributed intensity, the intensity difference which ultimately determines the result of the thyristors' competition is also normally distributed, with average 0 and variance $2\sigma^2$. The standard deviation $\sqrt{2}\,\sigma$ must be sufficiently large in order to overcome the eventual pair asymmetry due to production inaccuracies. In the case of a gamma distribution the asymmetry requirement can be stated as (with M the degrees of freedom):

This condition can be fulfilled by appropriately choosing the irradiance in the fibre. In this set-up the bias can be used to control the switching probability of the thyristor pair. Variations in bias and element performance across an array are likely using such a set-up.
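The effect of pair asymmetry on the output probability can be explored with a small Monte Carlo model. This is a sketch under several assumptions that are not taken from the text: the two thyristors receive independent gamma-distributed intensities with M degrees of freedom and equal mean, the thyristor receiving the larger effective intensity wins, thyristor A winning is read as a logical 1, and the asymmetry is modelled as a simple gain factor on thyristor A.

```python
import random

def pair_output(m, rng, gain_a=1.0, mean=1.0):
    """Return 1 if thyristor A wins the competition, else 0."""
    # gammavariate(alpha, beta) has mean alpha * beta
    a = gain_a * rng.gammavariate(m, mean / m)
    b = rng.gammavariate(m, mean / m)
    return 1 if a > b else 0

def one_probability(m, gain_a, samples, seed):
    """Estimated probability of a logical 1 from the pair."""
    rng = random.Random(seed)
    return sum(pair_output(m, rng, gain_a) for _ in range(samples)) / samples

balanced = one_probability(10, 1.0, 20000, seed=11)   # symmetric pair
skewed = one_probability(10, 1.2, 20000, seed=11)     # 20% asymmetry
print(balanced, skewed)
```

In this model a larger M narrows the intensity distribution relative to its mean, so a given asymmetry pulls the output probability further from 0.5.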

4.0 Discussion

This study assumes that the thyristor windows are 10µm*5µm each. There are two thyristors in an element and each element uses a substantial area for its electrical contacts. The total area of an element is approximately 40µm*50µm, which would allow approximately 50,000 elements to be fabricated per cm2. Considering thyristor switching times to be as stated in the report, the expected pipelined processing time of the neural architecture investigated so far would be approximately 2ns, representing a processing time of 20µs for a bit stream of 10,000 bits. For the currently achieved 50MHz thyristor switching rate, the processing time would be 400µs. In the design of the remaining modules of the neuron, the aim will be for these units to have processing times similar to those of the modules discussed (~2ns).
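The density and timing figures above follow from simple arithmetic, sketched here for the 2ns pipelined case. The function names are illustrative, and the calculation assumes that one bit of the stream is processed per pipeline step.

```python
UM2_PER_CM2 = 1e8  # square micrometres in one square centimetre

def elements_per_cm2(width_um, height_um):
    """Number of elements of the given footprint that fit in 1 cm^2."""
    return UM2_PER_CM2 / (width_um * height_um)

def stream_time_s(step_time_s, stream_length):
    """Time to process a whole bit stream at one bit per pipeline step."""
    return step_time_s * stream_length

density = elements_per_cm2(40, 50)
total_time = stream_time_s(2e-9, 10_000)
print(density, total_time)
```

This gives 50,000 elements per cm2 and 20µs for a 10,000-bit stream, so the processing time trades off directly against the bit stream length and therefore against accuracy.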

Limitations on the size of thyristor arrays can be seen as:

(a) Heat dissipation limits of wafer material.
(b) Requirement for extra electronic circuitry to serve and aid the elements to perform their processing.
(c) Complexity of electrical switching signal and maximum switching frequency of device (limited by turn-off time).
(d) Limitations of imaging area of optical system.
(e) Divergence angle of the emitted light from a thyristor.

One must also consider the method of biasing which is to be employed: electrical or optical. Placing electronic circuitry at each pixel will reduce the amount of space available for thyristor elements and require an electrical addressing scheme for external signals to set the bias level. It has been estimated that having local biasing electronics would increase the size of a pixel element to approximately 100µm*100µm. This would allow 10,000 elements in 1cm2. If application electronics is required then a further increase in pixel size will occur. The philosophy that architecture design should follow is to keep the pixels simple and design a processor architecture which can utilise pipelined optoelectronic processing arrays.

The GRIN lens system discussed in the report offers a cheap and compact modular platform which is easily scaled, as the lenses and beam splitters are easily stacked. The uniform imaging area of the GRIN lenses has been investigated and approximately 40% of the diameter has been considered to image uniformly. For an array of 10,000 elements per cm2, approximately one sixth of the array would be imaged uniformly using a 1cm diameter GRIN. The space which is not used for imaging could be used to house application and switching electronics. Other optical systems will be considered for the implementation of this neural architecture so that the uniform imaging area is increased.
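The "one sixth" estimate can be checked with a short calculation. The sketch assumes that the uniformly imaged region and the array have the same shape, so that the usable fraction of the area is the square of the usable fraction of the diameter; the resulting element count is derived here for illustration and is not a figure from the report.

```python
def elements_per_cm2(pixel_w_um, pixel_h_um):
    """Number of pixels of the given footprint that fit in 1 cm^2."""
    return 1e8 / (pixel_w_um * pixel_h_um)

def uniform_area_fraction(uniform_diameter_ratio):
    """Area fraction imaged uniformly, given the diameter fraction."""
    return uniform_diameter_ratio ** 2

density = elements_per_cm2(100, 100)     # pixels with local bias electronics
fraction = uniform_area_fraction(0.4)    # 40% of the GRIN diameter
uniform_elements = density * fraction
print(density, round(fraction, 2), round(uniform_elements))
```

A fraction of 0.16 is close to one sixth, which leaves most of the array area free for application and switching electronics.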

The promise of using optical interconnects and optical logic elements in the architecture described in this report suggests that fast and large parallel artificial neural systems could be constructed. Architectures, such as bit stream architectures, which can perform the processing in a pipelined manner offer a step forward to the realisation of 3-dimensional optoelectronic/electronic processing circuitry. The optoelectronic devices developed for such SMART PIXEL processing are in their infancy of development but show great promise. Moderate investment in this technology will be rewarding. The reality of 3-dimensional optically interconnected circuitry will allow scaling of processor architectures beyond the limits of a single wafer, as well as increased density of processing channels. The availability of large 2-D arrays of processing elements will give current and new parallel processing algorithms a realistic hardware platform that will allow parallel processing to enter the real world.

4.1 Acknowledgements

The investigation of the optical and optoelectronic implementation of the neural system was funded by the HC&M grant ERBCHRXCT930215 "Optoelectronic Hybrid Technologies". M.Hands is a visiting researcher at the Vrije Universiteit Brussel from King's College London.

4.2 References

[1] Gaines.B.R ; "Stochastic Computing Systems", Advances in Information Systems Science, 2, pp 37-172, 1969.

[2] Shawe-Taylor.J, Jeavons.P, Van Daalen.M ; "Probabilistic bit stream neural chip: Theory", Connection Science, 3(3), pp 317-328, 1991.

[3] Lindgren.B.W ; "Statistical Theory", Macmillan, New York, 1976.

[4] Zurada.J.M ; "Introduction to Artificial Neural Systems", Chapter 9, WEST, ISBN 0-314-93391-3, 1992.

[5] Hui, Morgan, Gurney, Bolouri ; "A Cascadable 2048-Neuron VLSI Artificial Neural Network with On-Board Learning", Artificial Neural Networks II, Elsevier Science Publishers, pp 647-651, 1992.

[6] Bolouri, Morgan, Peacock ; "A RAM-Based Neural Network Architecture for Wafer Scale Integration", 7th IEEE Int.Conf. on WSI, San Francisco, Jan 1995.

[7] Yasunaga.M, et al ; "Design, Fabrication and Evaluation of a 5-inch Wafer Scale Neural Network LSI composed of 576 Digital Neurons", Int.Conf. on Neural Networks, Vol II, pp 527-535, June 1992.

[8] Heremans.P, Knupfer.B, Kuijk.M, Vounckx.R ; "Cascadable thyristor optoelectronic switch operating at 50Mbit/sec with 7.2fJ external optical input energy", OSA Topical Meeting on Optical Computing, March 1995.

[9] Lentine.A, et al ; "4x4 arrays of FET-SEED embedded control 2x1 optoelectronic switching nodes with electrical fan-out", IEEE Phot.Techn.Lettrs 6, pp1126, 1994.

[10] Kuijk.M, Knupfer.B, Heremans.P, Vounckx.R, Borghs.G ; "Down-scaling differential pairs of depleted optical thyristors", submitted to IEEE Phot.Techn.Lettrs, Jan 1995.

[11] Kirk.A, Thienpont.H ; "An optoelectronic programmable logic array which employs diffractive interconnects", SPIE Symposium Holographic + Diffractive Optics, San Jose, USA, Feb 1995.

[12] Kirk.A, Thienpont.H ; "Interconnection issues for VSTEP optoelectronic information processing systems", SPIE Symposium Optical Interconnects III, San Jose, USA, Feb 1995.

[13] Stauffer.D, Aharony.A ; "Introduction to Percolation Theory", Taylor + Francis, London, 1992.

[14] Shawe-Taylor.J, Van Daalen.M, Zhao.J ; "Learning in feedforward bit stream neural networks", Internal Publication of the Dept. of Computer Science, Royal Holloway, University of London, Egham, Surrey, UK.

[15] Watson.E.J ; "Primitive Polynomials (mod 2)", Math. Comp. Vol. 16, pp368-369, 1962.

[16] Zierler.N ; "Primitive Trinomials whose degree is a Mersenne exponent", Inform. Contr. Vol. 15, pp 67-69, 1969.

[17] Van Daalen.M, Jeavons.P, Shawe-Taylor.J, Cohen.D ; "Device for generating binary sequences for stochastic computing", Electronic Letters, Vol 29, No 1, pp80-81, Jan 1993.

[18] Wolfram.S ; "Random sequence generation by cellular automata", Advances in Applied Mathematics 7, pp123-169, 1986.

[19] Compagner.A, Hoogland.A ; "Maximum length sequences, cellular automata and random numbers", Journal of Computational Physics 71, pp391-428, 1987.

[20] Yarmolik.V.N, Murashko.I.A ; "Pseudo random sequence generator construction using cellular automata", Automatic Control and Computer Science Vol 27, No 3, pp9-13, 1993.

[21] Chowdhury.I, et al ; "A class of two-dimensional cellular automata and their applications in random pattern testing", Journal of Electronic Testing : Theory and Applications 5, pp67-82, 1994.

[22] Kirk.A, Thienpont.H ; "Programmable logic array with differential pairs of PnpN photothyristors : an experimental assessment", Int.Conf. on Optical Computing, PDP14, Technical Digest, 1994.

[23] Thienpont.H, et al ; "Optical Thyristor Based Subsystems for Digital Parallel Processing", submitted to MPPOI, 1995.

[24] Lalanne.P, et al ; "2D generation of random numbers by multimode fibre speckle for silicon arrays of processing elements", Optics Communications 76, pp387-394, 1990.
