Networks on chips — using DQQB (Distributed Queue Quad-Bus) on multi-CPU chips



Nonlinear Analysis 71 (2009) e657–e660


Faisal Qureshi
Devry University, 1515 Jefferson Davis Hwy, Apt 413, Arlington, United States
E-mail address: [email protected]


Keywords: Distributed queue dual bus; Bandwidth balancing

Abstract

This paper discusses using a new protocol based on DQDB (Distributed Queue Dual Bus), called DQQB, or Distributed Queue Quad-Bus. The idea is to modify the backplane bus of current architecture by utilizing four busses in the backplane. One bus going in the positive y direction and one going in the negative y direction would allow multiple CPUs on the y-axis to talk to various components by requesting slots and sending data packets (in the reverse direction) reliably. The other two busses are in the positive x direction for upstream and the negative x direction for downstream. The system is inherently stochastic as the data packets may incur bit-flips due to various types of interference, but by applying network protocols for reliable data transfer, we can reduce the probability of error. Also, by redesigning the architecture to account for DQQB, we can dramatically increase our system's throughput.

© 2009 Published by Elsevier Ltd

1. Background

As more and more transistors are used on chips and the hardware becomes smaller, the density of the chips increases. The problem faced with such a design is how to synchronize reliable data transfers based on a single clock. Large-scale network architecture can be applied to micro-scale systems to allow for synchronous communication. A key component that will determine the energy use, performance, efficiency, and reliability will be the interconnections between the various components of the system. Again, we can mimic macro-scale data communication systems for this purpose.

The solution would be to mimic macro-sized communication systems in hardware and use known network protocols, or an extrapolation of them. The design would have to be locally synchronous and globally asynchronous: a distributed system in which each component runs on its own internal clock rather than being limited to a single global clock. With distributed data communication between components, data transfers would be executed on demand. However, an issue arises for unreliable or corrupted data transfers, and therefore one requires a stochastic model. For this reason, large-scale network protocols best suit this system. Part of applying network protocols to micro-scale systems would involve a protocol stack to reduce the probability of errors such as bit-flips in the data. Wires will be the primary source of errors and will not provide reliable data at a fixed delay. The non-deterministic aspect is that errors can occur over different wires at different times. These errors can be due to radiation, noise, or interference. In order to provide good quality of service, the physical layer will have to look to the higher layers to solve this problem.

Currently, systems on chips use a shared medium for communication between components. This is done via a backplane bus. The bus has several bus masters, bus slaves, and an arbiter; the arbiter decides who gets to send data and when. For example, if a processor wishes to access the shared bus, it must make a request to the arbiter, which will then allow the processor to communicate some time later. This, however, is not as efficient when many components make requests to the arbiter simultaneously. What can be done is to queue the many reservations while waiting for the bus slaves to respond; this way, the bandwidth of the shared medium, the backplane bus, is better utilized and the throughput increases. Another method that improves bandwidth utilization is to have the bus divide large data transfers into smaller packets. The main drawback of having a shared medium is that it consumes a lot of energy, as each data transfer is a broadcast, and the bus is also a bottleneck for performance.
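As a rough sketch of the request-grant arbitration just described, the following Python model queues requests for a single shared bus and grants them one at a time. The class BusArbiter and its method names are hypothetical and exist only for this illustration; it is a minimal sketch, not the arbiter of any particular chip.

from collections import deque

class BusArbiter:
    """Toy model of a single-bus arbiter that queues requests from bus masters."""

    def __init__(self):
        self.pending = deque()   # requests waiting for the shared backplane bus
        self.busy = False        # whether a transfer is currently using the bus

    def request(self, master, slave, payload):
        # A bus master asks for the shared medium; the request is queued
        # rather than served immediately.
        self.pending.append((master, slave, payload))

    def grant_next(self):
        # Grant the bus to the oldest queued request, one transfer at a time.
        if self.busy or not self.pending:
            return None
        master, slave, payload = self.pending.popleft()
        self.busy = True
        return master, slave, payload

    def transfer_done(self):
        # Called when the slave has responded, freeing the bus for the next grant.
        self.busy = False

# Every transfer crosses the one shared bus, so simultaneous requests simply
# wait in the queue -- the bottleneck that the DQQB design tries to remove.
arbiter = BusArbiter()
arbiter.request("CPU1", "MEM", b"\x01\x02")
arbiter.request("CPU2", "MEM", b"\x03\x04")
print(arbiter.grant_next())   # CPU1's transfer is granted first
arbiter.transfer_done()
print(arbiter.grant_next())   # only now does CPU2 get the bus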

2. The DQDB protocol

The DQDB (Distributed Queue Dual Bus) protocol is a network protocol that resolves contention on a shared medium. It uses TDM with reservations and maintains queues for data transfer at each node. The protocol is based on two busses: one bus goes upstream and the other downstream. On the basis of TDM, a node can reserve a slot on the upstream or downstream bus, where an empty slot for data transfer will arrive for that node on the bus going in the opposite direction.

Each slot has a start of slot, a busy bit, and a reservation bit. If the slot is taken, the busy bit is 1, and it is 0 otherwise. The reservation bit is 1 if a node wants to transmit as this empty slot passes by, which in turn allows a slot on the other bus to remain unused until it reaches the node that made the reservation. Each node has a queue for data and two counters: a 'before' counter and an 'after' counter. If a reservation is made before node X has made its own, then as the reservation slot passes node X it will increment its 'before' counter. Once node X has made a reservation and another reservation slot passes by, node X increments its 'after' counter. It is important to note that this system remains fair even for large data transfers.
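A minimal sketch of this counter mechanism, for a single node and assuming one reservation bit per slot, could look as follows in Python (DQDBNode and its method names are invented for this illustration):

class DQDBNode:
    """Single-node sketch of the 'before'/'after' counter rule described above."""

    def __init__(self, name):
        self.name = name
        self.before = 0           # reservations queued ahead of our own request
        self.after = 0            # reservations queued behind our own request
        self.has_request = False  # whether this node has data waiting to send

    def queue_packet(self):
        # The node wants to transmit and raises a reservation on the reverse bus.
        self.has_request = True

    def see_reservation(self):
        # A reservation bit from another node passes on the reverse bus.
        if self.has_request:
            self.after += 1
        else:
            self.before += 1

    def see_empty_slot(self):
        # An empty slot passes on the forward bus.
        if self.before > 0:
            self.before -= 1          # let it go to an earlier reservation
            return False
        if self.has_request:
            self.has_request = False  # use the slot ourselves
            self.before, self.after = self.after, 0
            return True               # transmitted in this slot
        return False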

3. Using DQQB on chips

We can extrapolate on the DQDB protocol and apply it to chips. The extrapolation is implementing a DQQB (Distributed Queue Quad-Bus) protocol. To apply this to systems on chips, the backplane bus would need to be redesigned so that there are a total of four busses: one communicating in the positive x direction, one in the negative x direction, one in the positive y direction, and one in the negative y direction. The components can be laid out so that we have multiple CPUs on the y-axis, and other components such as controllers and memory chips on the x-axis. At the center of this design there would be a cache that can be used by components on either axis; this cache can serve as an instruction cache as well as a memory cache for components on both axes (Fig. 1).

The CPUs on the y-axis can send data upstream and downstream on their respective busses, and the components on the x-axis can do the same on theirs. Thus, the CPUs only contend with each other for slots on the y-axis busses, and similarly for the x-axis components. By doing this we increase the system throughput tremendously: first, by having multiple fully functional CPUs, and second, by reducing contention on the mediums. The driving force behind this design is the fact that the system is asynchronous and the data transfers are not dependent on a single clock.

An example of the functionality and sequence of data transfers on BUS1U would be as follows:

(a) Packet arrives at CPU3
(b) CPU3 makes reservation
(c) CPU2's C(before) = 1
(d) CPU1's C(before) = 1
(e) Packet arrives at CPU1
(f) CPU1 makes reservation
(g) CPU2's C(before) = 2
(h) CPU3's C(after) = 1
(i) Packet arrives at CPU2
(j) CPU2 makes reservation
(k) CPU1's C(after) = 1
(l) CPU3's C(after) = 2
(m) Empty slot goes by
(n) CPU1's C(before) = 0
(o) CPU2's C(before) = 1
(p) CPU3 transmits and C(before) = C(after), C(after) = 0
(q) Empty slot goes by
(r) CPU3's C(before) = 1
(s) CPU2's C(before) = 0
(t) CPU1 transmits and C(before) = C(after), C(after) = 0
(u) Empty slot goes by
(v) CPU3's C(before) = 0
(w) CPU2 transmits and C(before) = C(after), C(after) = 0
(x) CPU1's C(before) = 0.

It is important to note that reservations can be made on BUS1D, going in the opposite direction. A similar scenario would occur simultaneously on BUS2U and BUS2D.
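To check that the counter rules reproduce the sequence above, the following Python sketch replays steps (a) through (x) for BUS1U. It follows the trace literally: every node sees every other node's reservation, and for each empty slot the waiting nodes with a nonzero 'before' counter decrement it while the node whose 'before' counter is zero transmits. The class Node and the function empty_slot are names used only for this illustration, and the physical direction of slot propagation is abstracted away.

class Node:
    """One CPU on BUS1U, using the 'before'/'after' counters from Section 2."""

    def __init__(self, name):
        self.name = name
        self.before = 0
        self.after = 0
        self.waiting = False   # True once a packet is queued and a reservation made

    def reserve(self, others):
        # Queue a packet and broadcast a reservation to the other nodes.
        self.waiting = True
        for other in others:
            if other.waiting:
                other.after += 1
            else:
                other.before += 1


def empty_slot(nodes):
    # Nodes with before > 0 let the slot pass (decrementing); the node with
    # before == 0 and a queued packet fills it and folds 'after' into 'before'.
    transmitted = None
    for node in nodes:
        if node.waiting and node.before == 0:
            node.waiting = False
            node.before, node.after = node.after, 0
            transmitted = node.name
        elif node.before > 0:
            node.before -= 1
    return transmitted


cpu1, cpu2, cpu3 = Node("CPU1"), Node("CPU2"), Node("CPU3")
nodes = [cpu1, cpu2, cpu3]

# Packets arrive at CPU3, CPU1, then CPU2, exactly as in steps (a)-(l).
cpu3.reserve([cpu1, cpu2])
cpu1.reserve([cpu2, cpu3])
cpu2.reserve([cpu1, cpu3])

# Three empty slots then serve the nodes in reservation order, as in (m)-(x).
print([empty_slot(nodes) for _ in range(3)])   # ['CPU3', 'CPU1', 'CPU2']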


Fig. 1.

4. Analysis

The throughput of the entire system increases tremendously due to multiple communication lines and not being dependent on a clock or a single-bus arbiter. Although the possibilities of errors are inherently stochastic, the implementation of layered protocols that apply to the transferred data packets greatly reduces the probability of error. Errors would generally occur at the physical layer, or more specifically, at the wires; the causes would be noise, radiation, and interference. However, the fact that the data are sent in packets means that we can use generic network error detection and correction algorithms, and ensure quality of service and reliable data transfer at the data link layer as well as the transport layer. An important issue to resolve will be how to avoid corruption of the packets on the medium (wires), since the system is globally asynchronous and there are no fixed delays for data transfer. Although using a shared medium, where data transfers occur according to a single clock, would reduce this source of error, it is at the cost of overall performance.
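The paper leaves the choice of error-detection code open. As one concrete possibility at the data link layer, a CRC-32 checksum could be appended to each packet so that bit-flips picked up on the wires are detected and the packet retransmitted. The sketch below uses Python's standard zlib.crc32; the frame and check helpers are hypothetical, and only the detection step is shown, not correction or retransmission.

import zlib

def frame(payload: bytes) -> bytes:
    # Append a CRC-32 checksum so the receiver can detect bit-flips.
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check(packet: bytes) -> bool:
    # Recompute the checksum at the receiving end and compare.
    payload, received = packet[:-4], packet[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == received

pkt = frame(b"\x10\x20\x30\x40")
corrupted = bytes([pkt[0] ^ 0x01]) + pkt[1:]   # a single bit-flip on the wire
print(check(pkt), check(corrupted))            # True False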

System throughput (based on Fig. 1):

packetSize = 32
X = (number of nodes that transmit data) * packetSize
X(BUS1U) = 5 * 32
X(BUS1D) = 5 * 32
X(BUS2U) = 5 * 32
X(BUS2D) = 5 * 32
NodesOnBus = 5
BUSXX_Throughput = X / (X + NodesOnBus) ≈ 97%
System_Throughput = 291%, greater than a single-bus architecture that uses TDM
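The per-bus arithmetic can be checked directly. This short sketch reproduces only the 97% per-bus value under the stated assumptions; it does not attempt to rederive the 291% system-level comparison.

packet_size = 32          # bits per packet, as in the example above
nodes_on_bus = 5          # transmitting nodes on each of the four busses

x = nodes_on_bus * packet_size            # data carried per bus: 5 * 32 = 160
per_bus = x / (x + nodes_on_bus)          # 160 / 165
print(f"per-bus throughput: {per_bus:.0%}")   # about 97%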

Note: as packet sizes and/or the number of nodes increase, this value increases.

Using bandwidth balancing: We want to use the available bandwidth as efficiently as possible and stay "fair", so that no components are "hogging" the available bandwidth. To this end, the means would be to use bandwidth balancing. The idea of bandwidth balancing is that, out of a certain number of available slots, not all of them are used. In the end, the system will stabilize and all communicating components will receive an equal share of the available bandwidth. This fraction would be some α < 1. Because it is better in terms of speed and energy consumption, the faster the system converges, the better, and the smaller the α chosen, the faster the system stabilizes. As an example, let us say that there is a processor and a memory controller that wish to communicate at the same time. Our α = 0.5.

Processor                        Memory controller
0.5 * 1      = 0.5               0.5 * 0.5     = 0.25
0.5 * 0.75   = 0.375             0.5 * 0.625   = 0.3125
0.5 * 0.6875 = 0.34375           0.5 * 0.65625 = 0.328125
0.5 * 0.67   = 0.33              0.5 * 0.67    = 0.33

The system converges after four rounds.
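The convergence can be replayed numerically. The sketch below assumes the update order implied by the table (each side claims α times the bandwidth the other leaves unused, with the memory controller reacting to the processor's newly claimed share within the same round); both shares approach α/(1 + α) = 1/3.

alpha = 0.5
proc, mem = 0.0, 0.0   # current bandwidth shares of processor and memory controller

for round_no in range(1, 5):
    # Each component claims alpha times what the other currently leaves unused.
    proc = alpha * (1 - mem)
    mem = alpha * (1 - proc)
    print(round_no, round(proc, 6), round(mem, 6))
# After four rounds both shares sit near 0.33, matching the table above.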

5. Conclusion

Due to the asynchronous design, we can extrapolate on this implementation by adding more busses to reduce contention on the available mediums while increasing the number of processors that the system can use. Our processing power can be increased by orders of magnitude. Thus, the only limiting components that we face for our overall system speed will be the communication mediums themselves and how much RAM we have available to us. Finally, layered network protocols and structure greatly reduce our probability of error and maintain the stability of the system.