2-Level FIFO Architecture Design for Switch Fabrics in Network-on-Chip

Po-Tsang Huang and Wei Hwang
Department of Electronics Engineering & Institute of Electronics, and Microelectronics and Information Systems Research Center, National Chiao-Tung University, HsinChu 300, Taiwan
bug.ee91g@nctu.edu.tw, hwang@mail.nctu.edu.tw

Abstract-The network-on-chip (NoC) architecture provides an integrated solution for system-on-chip (SoC) design. The buffer architecture and sizes, however, dominate the performance of the NoC and influence the design of the arbiters in the switch fabrics. A 2-level FIFO architecture is proposed. It simplifies the design of the arbitration algorithm and achieves better performance than other buffer architectures without increasing the buffer sizes. The concepts of a shared memory mechanism and multiple accesses for the buffers are developed. The FIFO architecture is implemented and simulated in TSMC 0.13um CMOS technology with HSPICE and Verilog. The operating frequency of the 2-level FIFO reaches 400MHz.

I. INTRODUCTION

System-on-Chip (SoC) designs provide an integrated solution to challenging design problems. However, modern SoC designs face a number of challenges caused by the scale and complexity of the designs. The required on-chip communication bandwidth is growing beyond that provided by standard on-chip buses. The shared bus architecture becomes a limiting factor for integration as the number of processor elements increases. In deep sub-micron (DSM) technology, the interconnect delay across the chip exceeds the average clock period of the IP blocks [1].

Several new architectures for on-chip communication have been proposed for next-generation SoCs, such as multi-layer on-chip shared buses [2], crossbar switching systems [3] and network-on-chip (NoC). However, the multi-layer on-chip bus architecture is a revised version of the traditional on-chip buses, and the crossbar switching system and network-on-chip can be combined with the concept of globally asynchronous locally synchronous (GALS) system architecture. Nevertheless, the centralized crossbar switching system is challenged by complex wire routing, which introduces larger power consumption and interconnect delay. The mesh network-on-chip architecture shown in Fig. 1 is based on a homogeneous and scalable switch fabric network, which considers all the requirements of on-chip communications and traffic. In Fig. 1, PE represents the processor element units and S represents the switch fabrics.

Figure 1. Mesh Network-on-chip Architecture

In general, a switch fabric as in Fig. 2(a) consists of a switching unit that connects the inputs and outputs, a buffer unit, and an arbiter that steers the switching unit. The complexity of the switching unit mainly depends on the implemented arbitration algorithm. While the switching unit is an important part of the switch fabric, the buffers significantly affect the overall performance and the arbitration algorithm [4]. If the design lacks sufficient buffer space, the buffers may fill up too fast, while over-provisioning of buffers is clearly a waste of scarce area resources [5]. Thus, the 2-level FIFO architecture with a centralized FIFO is proposed, as shown in Fig. 2(b). It provides a shared memory mechanism that lets the channels share the centralized FIFO with sufficient buffer space.

Figure 2. (a) Conventional Switch Fabric (b) Switch Fabric with Centralized FIFO

0-7803-9390-2/06/$20.00 ©2006 IEEE 4863 ISCAS 2006



Figure 3. (a) Head-of-line Blocking (b) Buffer Position

II. BUFFER LOCATION

In a switch fabric, the buffers can be placed either before or after the switch [6]. However, there is a distinction between input buffers and output buffers. If a data word is delayed in a switch fabric with input buffers, it stalls all data words arriving on the same input. None of them can be processed until the first one has been forwarded successfully. With output buffers, the situation is different because the switching is performed before the buffering. If a switch fabric cannot send the data over one of its outputs, the buffers at that output fill up. There is, however, no immediate influence on the inputs. The successive data words can still be received. An architectural disadvantage of output buffering is that, in one cycle, data from multiple input ports may have to be written to the same output port. Nevertheless, multiple buffers can be implemented in parallel at the output to deal with this disadvantage. Whether output buffers or input buffers are used, they introduce the head-of-line blocking problem, which stalls the input data as shown in Fig. 3(a).

Fig. 3(b) shows the input buffers, middle buffers and output buffers in the switch fabrics. In middle buffering, the buffers are placed in the middle of the switch. The middle buffer architectures have O(N^2) buffer blocks for an N-port switch fabric, while input and output buffering have only O(N) buffer blocks. The middle buffer architecture, however, can reduce the effects of head-of-line blocking with multiple channels in the switching. It is a trade-off between traffic problems and buffer sizes. Consequently, both output buffering and middle buffering aim for the best FIFO utilization.

III. 2-LEVEL FIFO ARCHITECTURE

The 2-level FIFO architecture shown in Fig. 4 is proposed to reduce the head-of-line blocking problems without increasing the FIFO sizes. It creates virtual channels from the inputs to the outputs by multiple accesses of the output buffers. The 2-level FIFO architecture provides a shared memory mechanism for the output buffers, which share the memory elements of the centralized level-2 FIFO. Hence, the inputs can send data to the same output buffer in the same time slot over parallel virtual channels. Through the virtual channels, the multiple-access output buffers serve a purpose similar to that of the middle buffers. Although the middle buffer gives the best solution for the communication traffic, as discussed in Section II, it needs N physical channels and memory elements. The middle buffer, in addition, still needs an N-to-1 multiplexer to select the suitable data in the time slot.

The operation of the 2-level FIFO architecture is briefly described as follows. When the input packets arrive at the 2-level FIFO architecture, the header decoder first de-multiplexes the input data according to the header information. The pointer scheduler schedules the address pointers for the output buffers and sends the de-multiplexed data into the centralized level-2 FIFO. Accordingly, the address pointers record the top addresses of the output buffers. The centralized level-2 FIFO is implemented by 5R/5W register files. When acknowledge signals are asserted from the next stage, the distributed level-1 FIFO transfers the output data. Moreover, the pointer scheduler calculates the address pointers that indicate the bottom of each output buffer and passes them to the centralized level-2 FIFO. The centralized level-2 FIFO then delivers the correct data to the level-1 FIFO.

Figure 4. 2-Level FIFO Architecture

IV. FUNCTION BLOCK OF 2-LEVEL FIFO ARCHITECTURE

The details of each function block of the 2-level FIFO architecture in Fig. 4 are explained in this section.
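Before the individual blocks are examined, the overall flow described in Section III can be illustrated with a small behavioral sketch. It is not the authors' implementation: the class name `TwoLevelFifo`, its method names, and the single shared word budget are assumptions made for illustration. Pointer handling and memory borrowing are abstracted away, and the order of the arrival list stands in for the arbiter's priorities. The level-1 depth of two registers and the 32-word capacity are taken from Sections IV.E and V.

```python
from collections import deque

NUM_PORTS = 5      # mesh switch fabric: four directions plus the local processor element
LEVEL1_DEPTH = 2   # two registers per output in the distributed level-1 FIFO


class TwoLevelFifo:
    """Behavioral sketch only; pointers and borrowing are reduced to one word budget."""

    def __init__(self, level2_words=32):
        self.free_words = level2_words                      # free words in the shared level-2 memory
        self.level2 = [deque() for _ in range(NUM_PORTS)]   # per-output buffers in level-2
        self.level1 = [deque() for _ in range(NUM_PORTS)]   # per-output prefetch registers

    def write_slot(self, arrivals):
        """Write up to five (output_port, word) pairs in one time slot.

        Several pairs may name the same output port (multiple accesses).
        Returns the pairs that were blocked because level-2 is full.
        """
        assert len(arrivals) <= NUM_PORTS
        blocked = []
        for out, word in arrivals:
            if self.free_words > 0:
                self.level2[out].append(word)
                self.free_words -= 1
            else:
                blocked.append((out, word))
        return blocked

    def prefetch(self):
        """Move words from level-2 into level-1 while the level-1 registers have room."""
        for out in range(NUM_PORTS):
            while self.level2[out] and len(self.level1[out]) < LEVEL1_DEPTH:
                self.level1[out].append(self.level2[out].popleft())
                self.free_words += 1

    def read_slot(self, acks):
        """Send one word on every output whose next stage asserted acknowledge."""
        sent = {}
        for out in acks:
            if self.level1[out]:
                sent[out] = self.level1[out].popleft()
        return sent
```

In this sketch, three inputs can write to output 2 in the same slot with `write_slot([(2, "a"), (2, "b"), (2, "c")])`; a following `prefetch()` fills the two level-1 registers of output 2 and leaves the third word in level-2.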


A. Header Decoder

The packets delivered from the processor elements contain headers and payloads. The headers describe the paths that the packets will go through. The header information depends on the routing algorithm and the network-on-chip architecture. For a mesh network-on-chip architecture, there are five inputs and five outputs in a switch fabric. Each packet has four directions to select from, so the header needs two bits to indicate the switching path. The clockwise header decoder in the 2-level FIFO architecture is shown in Fig. 5. When a packet enters from the south port, the numbers of the output ports are those in the row for south. The numbering follows the clockwise direction, starting according to the input port through which the packet enters.

Figure 5. Clockwise Header Decoder

B. Arbiter

The arbiter decides the order of the multiple accesses in the same time slot. When more than one packet at different input ports requires the same output port at the same time, the arbiter has to assign priorities to the packets. The pointer scheduler then uses the priorities to calculate the pointers and the addresses for the output buffers. The arbitration algorithm, however, depends on the buffer sizes. When the buffer size is not large enough, the arbitration algorithm becomes more complex in order to reduce the traffic problems. The 2-level FIFO architecture can simplify the arbitration algorithm because of the shared memory mechanism and the multiple accesses for the output buffers. To avoid starvation of lower priority packets and to ensure the transmission speed of high priority packets, the 2-level FIFO architecture uses the time division multiple access (TDMA) arbitration algorithm. It is implemented by a counter that passes the priorities on to the successive input ports. The priority decides the packet's position in the output buffer in a time slot.

C. Pointer Scheduler

The pointer scheduler is the kernel block of the 2-level FIFO architecture. It calculates the addresses for the centralized level-2 FIFO and sends the corresponding data. The pointer scheduler records the address pointers, which include the stack pointers (SP) and frame pointers (FP) for the output buffers. The stack pointer is defined as the top address of the output buffer, and the frame pointer points to the bottom of the output buffer. Thus, the pointer scheduler has five stack pointers and five frame pointers to indicate the locations of the output buffers in the centralized level-2 FIFO. Initially, the centralized level-2 FIFO is divided into five sections, one for each output buffer. Each section serves as a local circle for its output buffer. The stack and frame pointers, hence, act as circular pointers in their local circle. The shared memory mechanism is asserted when the memory elements in a local circle are full. The full circle has to borrow memory elements from other local circles that have empty memory elements. Because the maximum number of packets an output buffer receives is four, the depth of one memory element is defined as four data words.

Depending on the spin of the stack pointers and frame pointers, however, there are two modes for the shared memory mechanism of the 2-level FIFO architecture, as shown in Fig. 6. The boundary pointer (BP) is defined as the boundary address of a local circle, and the length parameter records the number of memory elements in a local circle. The difference between the mode1 and mode2 shared memory mechanisms is the position of the boundary pointer in the borrowed circle, drawn as the small circles in Fig. 6. The mode1 shared memory mechanism occurs when the boundary pointer lies within the empty memory elements. The dotted lines and the black lines represent the empty and full memory elements respectively. The gray lines, moreover, indicate the memory elements borrowed for the full local circle. The left of Fig. 6(a) shows the linear representation of the centralized level-2 FIFO. When section1 is full, one memory element of section2 moves to section1. Hence, BP2 is changed to the top of the borrowed memory element. When the borrowed memory element joins circle1, as shown on the right of Fig. 6(a), a link pair, consisting of a link pointer and a return pointer, is needed to link the discontinuous output buffer.

Figure 6. (a) Mode1 Shared Memory Mechanism (b) Mode2 Shared Memory Mechanism
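The mode1 borrowing step can be illustrated at the level of occupancy counts. The sketch below is a simplification under stated assumptions, not the pointer scheduler itself: the class `SharedSections` and its methods are hypothetical names, addresses, boundary pointers and link pairs are not modeled, and the rule of asking the next sections in order for a free memory element is an assumption, since the lender selection is not specified in the text. Only the length bookkeeping (the lender's circle shrinks by one, the borrower's grows by one) follows the description above.

```python
class SharedSections:
    """Occupancy-level sketch of the shared memory mechanism.

    Addresses, boundary pointers (BP) and link pairs are deliberately not modeled.
    """

    def __init__(self, lengths):
        self.length = list(lengths)      # memory elements owned by each local circle
        self.used = [0] * len(lengths)   # memory elements currently full in each circle

    def _find_lender(self, section):
        """Assumed policy: take the first following circle that has an empty element."""
        n = len(self.length)
        for step in range(1, n):
            candidate = (section + step) % n
            if self.used[candidate] < self.length[candidate]:
                return candidate
        return None

    def push(self, section):
        """Fill one memory element of `section`, borrowing one if its circle is full.

        Returns the index of the lending circle, or None if nothing was borrowed.
        Raises OverflowError when every local circle is full.
        """
        lender = None
        if self.used[section] == self.length[section]:
            lender = self._find_lender(section)
            if lender is None:
                raise OverflowError("all local circles are full")
            self.length[lender] -= 1     # the lender's circle shrinks by one element
            self.length[section] += 1    # the borrower's circle grows by one element
        self.used[section] += 1
        return lender

    def pop(self, section):
        """Release one full memory element of `section`."""
        if self.used[section] == 0:
            raise IndexError("local circle is empty")
        self.used[section] -= 1
```

With five circles of two elements each, the third `push(0)` finds circle 0 full and borrows from circle 1, after which the lengths are `[3, 1, 2, 2, 2]`; the total number of memory elements stays constant.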


Because link pairs connect the discontinuous segments of an output buffer, the number of link pairs influences the stalling of the inputs. The mode2 shared memory mechanism is similar to mode1, but mode2 occurs when the boundary pointer is between the stack pointer and the frame pointer. First, the borrowed circle has to be spun so that the boundary pointer and the frame pointer indicate the same address. The spin vector is the distance between the frame pointer and the boundary pointer. For the full circle, nevertheless, the function is the same as in the mode1 shared memory mechanism.

D. Centralized Level-2 FIFO

The centralized level-2 FIFO is implemented by the 5R/5W register file. It is designed to provide the abstraction of a large FIFO with multiple accesses for the output buffers. The register file is designed with a double-edge-triggered memory architecture. When the clock is asserted, the register file performs the read operation; when the clock is de-asserted, it performs the write operation. In addition, the pointer scheduler feeds the read addresses from the frame pointers and the write addresses from the stack pointers. No matter what the total number of multiple accesses for the output buffers is, the maximum number of addresses and data sent from the pointer scheduler in the same time slot is five.

E. Distributed Level-1 FIFO

The distributed level-1 FIFO is composed of shift registers. The purpose of the distributed level-1 FIFO is to pre-fetch the packets from the centralized level-2 FIFO. The memory elements of the distributed level-1 FIFO, hence, are required to be small and fast. The shift register of each output consists of two registers. One register is designed so that the next stage can get the packets quickly, and the other one receives the packets from the centralized level-2 FIFO.

V. SIMULATION RESULTS

The 2-level FIFO architecture is implemented in TSMC 0.13um CMOS technology. The register file is simulated by HSPICE, and the timing information was extracted to create a timing model for the Verilog simulation. The frequency of the register file is 400MHz. The applied input packet probabilities are divided into high (75%), medium (50%) and low (25%). The probability can be described as the flow rate of the channels. Fig. 7 shows the comparison between input buffers, output buffers, middle buffers and the 2-level FIFO architecture with the same number of memory elements, about 32 words. The performance is defined as the inverse of the number of blocked data in a million test packets, and it is normalized to the input buffer at the high packet probability. From Fig. 7, the 2-level FIFO architecture has better performance than the other buffer architectures. As the input packet probability increases, the advantage of the 2-level FIFO becomes more noticeable. In addition, the performance of the input buffer can be improved with an advanced arbitration algorithm. The advanced arbitration algorithm, however, is too complex to be implemented.

Figure 7. Performance vs. Buffer Architecture (input buffer, output buffer, middle buffer, 2-level FIFO)

VI. CONCLUSIONS

The 2-level FIFO architecture provides a good solution for the switch fabrics in network-on-chip. The 2-level FIFO architecture embodies the concepts of a shared memory mechanism and multiple accesses for the output buffers. It realizes a novel buffer architecture with small memory elements and high performance. It also reduces the head-of-line blocking problems. Furthermore, the architecture simplifies the arbitration algorithm for the switch fabrics. The 2-level FIFO architecture has better performance than other buffer architectures no matter which packet probability applies. The 2-level FIFO architecture can be very useful as an alternative design for improving the performance of switch fabrics in network-on-chip.

ACKNOWLEDGMENT

The work is supported by National Science Council, R.O.C., under the project NSC 92-2220-E-009-011, NSC 93-2220-E-009-024 and TSMC grant. This work is also supported by DoE/DoIT 94-EC-17-A-01-S1-034.

REFERENCES

[1] V. Chandra, A. Xu, H. Schmit, L. Pileggi, "An interconnect channel ...", 2004
[2]-[5] [entries illegible in the source]
[6] H. Zimmer, S. Zink, T. Hollstein, M. Glesner, "Buffer-Architecture Exploration for Routers in a Hierarchical Network-on-Chip", IEEE International Parallel and Distributed Processing Symposium, pp. 171a - 171a, 2005
