Advanced Design Issues for OASIS Network-on-Chip Architecture
Kenichi Mori, Adam Esch, Abderazek Ben Abdallah, Kenichi Kuroda
The University of Aizu, Graduate School of Computer Science and Engineering,
Aizu-Wakamatsu 965-8580, Fukushima, Japan
Abstract—Network-on-Chip (NoC) architectures provide an efficient way of realizing on-chip interconnections and largely alleviate the limitations of bus-based solutions. NoC has emerged as a solution to the problems exhibited by the shared-bus communication approach in System-on-Chip (SoC) implementations, including lack of scalability, clock skew, lack of support for concurrent communication, and power consumption. The communication behavior of this paradigm is affected by architectural parameters such as topology, routing, and buffer size. In this paper, we propose advanced optimization techniques for OASIS NoC, a NoC we previously designed. We describe the architecture and the novel optimization techniques in detail. Hardware complexity and preliminary performance results are also given.
Index words: Network-on-chip design; Optimization; Parallel; Flow control; Round robin.
I. INTRODUCTION
Current Systems-on-Chip (SoC) execute applications that demand extensive amounts of parallel processing. Networks-on-Chip (NoC) [1], [2] provide a good way of realizing interconnections on silicon and largely alleviate the limitations of bus-based solutions.
Deep sub-micron processing technologies have enabled the implementation of new application-specific embedded architectures that integrate multiple software-programmable processors and dedicated hardware components onto a single chip. Recently, these application-specific architectures have emerged as key design solutions for today's nanoelectronic design problems, driven by emerging applications in the areas of (1) wireless communication, (2) broadband/distributed networking, (3) distributed computing, and (4) multimedia computing.
NoC is becoming an attractive option for solving bus-based problems. It is a scalable architectural platform with huge potential to handle increasing complexity, and it can easily provide reconfigurability. In NoC architectures, processors are connected via a packet-switched communication network on a single chip, similar to the way computers are connected to the Internet. The packet-switched network routes information between network clients (e.g. processors, memories, and custom logic devices).
Packet switching supports asynchronous transfer of information. It also provides extremely high bandwidth by distributing the propagation delay across multiple switches, effectively pipelining the signal transmission. In addition, NoC offers several promising features. First, it transmits packets instead of words; dedicated address lines, like those in bus systems, are not necessary, since the destination address of each packet is carried inside the packet itself. Second, transmissions can proceed in parallel if the network provides more than one transmission channel between a sender and a receiver. Thus, unlike bus-based systems-on-chip, NoC offers theoretically infinite scalability, straightforward IP core reuse, and a higher level of parallelism.
This paper presents advanced optimization techniques for, and an evaluation of, a complexity-effective Network-on-Chip (ONoC) based on our previously designed OASIS Network-on-Chip [1], [4].
Fig. 1. (a) Circuit-switching model and (b) packet-switching model.
Fig. 2. 4x4-mesh topology NoC.
2010 International Conference on Broadband, Wireless Computing,
Communication and Applications
978-0-7695-4236-2/10 $26.00 © 2010 IEEE
II. ON-CHIP INTERCONNECTION OVERVIEW
The NoC interconnection paradigm is characterized by its topology, protocol, and flow control. On-chip interconnections use a layered approach, in which protocol functions are described as operating on data units at different levels of abstraction. NoC paradigms are characterized in the same way as parallel-machine interconnections: by protocol, topology, flow control, and so on. Several additional decisions must be made when designing a NoC, including the communication protocol, switching method, and network topology. Figure 1 shows a typical NoC paradigm and a point-to-point network. Figure 2 shows a 4x4-mesh topology NoC architecture.
Various types of interconnect architectures for Multicore System-on-Chip (MCSoC) architectures have been proposed. Most of them borrow ideas from parallel computing while considering different constraints such as power and complexity. Common aims are low latency and high throughput. The latter depends on the flow control mechanism, which deals with the allocation of channel and buffer resources to packets as they traverse their paths [3].
There is a large range of protocols to select from for NoCs. Switching and routing styles are possible choices for NoC protocols, and they can be distinguished by their flow control methodologies. When these switching techniques are implemented in on-chip networks, they behave differently in performance and require different hardware resources. In circuit switching, physical paths from source nodes to destination nodes are reserved prior to the transmission of data: resources are held for the duration of the connection, and the path is set up before communication begins. Advantages of circuit switching include guaranteed connection latency and quality of service. The disadvantage is that, regardless of the amount of data streamed over a connection, its resources remain occupied, causing low utilization. In packet switching, packets travel through routing nodes on the network. Each packet is independent, so packets can be accepted by any node at any time. Packet switching does not block communication nodes, and more than one connection can be used at a time, which increases resource utilization [4]. Networks with many cores can be configured, and are scalable, because of packet switching. However, streaming performance over packet-switched connections is weaker, and packet transmission latency may increase when routing nodes become congested. Within packet switching, Store and Forward (SF), Wormhole Switching (WS), and Virtual Cut-Through (VCT) can be chosen as switching methods [5]. These switching methods transmit flits (flow control units), which are the parts of individual packets.
Store and Forward - In this switching method, each router needs buffers that can store all of the flits of a single packet. Before a router starts transmitting flits to the next router, it must store the entire packet in its buffers. If a packet is split into many flits, many buffers are needed, and there may be a long delay between the first and last flits. However, since only a small number of routers are occupied by a single transmission, SF remains useful in many communication situations.
Wormhole - Each router can transmit flits one after another without waiting to store entire packets. Because buffers do not have to be as large, design complexity and latency decrease; however, routers along the path may stay occupied and block other transmissions.
Virtual Cut-Through - VCT is an intermediate forwarding method that has properties of both SF and WS. It forwards flits one after another, but still needs enough buffer space to store an entire blocked packet. When blocking happens, the flits are stored in the router next to the blocked one. Buffer size is large, but forwarding latency is small, and few routers stay occupied when blocking happens.
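Under a no-contention assumption, the latency contrast between store-and-forward and the two cut-through-style methods can be sketched with the textbook formulas: SF pays the packet serialization delay at every hop, while WS/VCT pay it only once. This is a minimal Python sketch; the function names and the unit per-hop router delay are our illustrative assumptions, not OASIS parameters.

```python
def store_and_forward_latency(hops, packet_flits, t_router=1):
    # Each router must receive the whole packet before forwarding it,
    # so the serialization delay (packet_flits cycles) is paid per hop.
    return hops * (t_router + packet_flits)

def cut_through_latency(hops, packet_flits, t_router=1):
    # Wormhole and VCT forward the head flit immediately, so the
    # serialization delay is paid only once (pipelined transmission).
    return hops * t_router + packet_flits
```

For a 4-hop path and an 8-flit packet, cut-through transmission finishes in roughly a third of the store-and-forward time, which is why WS and VCT dominate on-chip designs.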
Power consumption greatly depends on buffer size, though it is
expected that the forwarding methods with many buffers obtain the
highest throughputs. These trade-offs are described in the
simulation section. There is an important discussion when one is
selected among these methods of forwarding and data is transmitted.
How flits are forwarded and through what route they are forwarded
through are problems of routing. There are two types of NoC
routing: static and dynamic. Dynamic routing involves selecting
another route when the prior route is occupied or blocked. When
using dynamic routing, it is necessary to account for deadlocks and
livelocks. In Static routing, routing paths are determined by
source routers by use of a routing table. All routes are fixed. If
blocking happens, static routing waits to release the router
because deadlocks cannot happen.
A. Control Flow
Flits must be transmitted so that no flits are dropped, or else some kind of resend protocol is required. Ideally, a flow control scheme has low power consumption and a computation time that does not depend on the state of the network. ON/OFF, credit-based, handshaking, and ACK/NACK are commonly used flow control schemes in NoC and are explained in this paper.
ON/OFF flow control [6], [7] can manage data flow from upstream routers while issuing a minimal number of control signals, because it has only two states: ON and OFF. The scheme uses a threshold on the number of free buffers in the downstream routers to decide the state of the control signal. When the number of free buffers falls below the threshold, downstream routers emit an "OFF" signal to upstream routers, stopping the flow of flits. When the downstream routers have forwarded flits to other nodes and the number of free buffers rises back above the threshold, they emit an "ON" signal to upstream routers, restarting the flow of flits.
Since only the ON/OFF signal needs to be sent to switch states, the computation time is low.
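The ON/OFF behavior described above can be modeled with a small Python sketch; the class name, buffer size, and threshold below are hypothetical choices for illustration, not values from the ONoC design.

```python
class OnOffReceiver:
    """Downstream node in ON/OFF flow control (illustrative sketch).

    The state goes OFF when free buffer slots fall below `threshold`,
    and back ON when draining flits raises the free count again.
    """
    def __init__(self, buffer_size=4, threshold=2):
        self.buffer_size = buffer_size
        self.threshold = threshold
        self.occupied = 0
        self.state = "ON"

    def _update_state(self):
        free = self.buffer_size - self.occupied
        # In hardware, only a state *change* generates a signal upstream.
        self.state = "OFF" if free < self.threshold else "ON"

    def accept_flit(self):
        # Receive one flit from the upstream router.
        assert self.occupied < self.buffer_size, "buffer overflow"
        self.occupied += 1
        self._update_state()

    def drain_flit(self):
        # Forward one stored flit to the next node, freeing a slot.
        assert self.occupied > 0, "buffer underflow"
        self.occupied -= 1
        self._update_state()
```

With a buffer of four and a threshold of two, the third accepted flit drops the free count to one and turns the signal OFF; draining one flit turns it ON again.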
Figure 3 shows one transmission example with ON/OFF flow control.
In credit-based flow control [1], [7], [8], [9], upstream nodes keep a count of the number of empty downstream buffers; we call this count CN (credit number). Each time an upstream node sends a flit to the downstream buffers, CN is decremented by one. When the downstream buffers forward flits to other nodes, they also send a credit control signal to the upstream routers; when an upstream router receives this signal, the CN associated with that path is incremented accordingly. Figure 4 illustrates the data flow with an example transmission. In this example, node2 is initially blocked, and CN is decremented. Then node2 starts sending flits and emits credit signals to node1, which receives them and restarts sending flits to node2.
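The credit counter described above reduces to a few lines; this is an illustrative Python sketch (class and method names are ours), showing that the sender stalls exactly when CN reaches zero and resumes on the next credit.

```python
class CreditSender:
    """Upstream node tracking downstream free buffers via a credit
    counter (CN), as in credit-based flow control."""
    def __init__(self, credits=4):
        self.cn = credits         # CN: known free downstream slots

    def try_send_flit(self):
        if self.cn == 0:
            return False          # downstream buffers full: stall
        self.cn -= 1              # one slot is now presumed occupied
        return True

    def receive_credit(self):
        self.cn += 1              # downstream forwarded a flit onward
```

With two initial credits, two flits go out, the third attempt stalls, and one returning credit signal lets the sender resume, matching the node1/node2 exchange in Fig. 4.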
Handshaking flow control [10], illustrated in Fig. 5, needs two control signals: a valid signal and an acknowledge signal. The upstream nodes must maintain the valid signal corresponding to each flit until they receive an acknowledge signal. In Fig. 5, node1 sends a valid signal to node2; node2 stores some flits and sends an acknowledge signal back to node1; node1 receives the acknowledge signal and clears the corresponding valid signal. If node2 has no free buffers to store flits when it receives the val3 signal, it sends a non-acknowledge signal to node1, and node1 resends the val3 signal until it receives an acknowledge.
The schemes above send signals from the downstream buffers to the upstream buffers to decide whether or not to send flits. ACK/NACK flow control [6], [7], in contrast, does not need to wait for and evaluate signals from downstream buffers, which reduces the latency needed to issue control signals. In this flow control model, as flits are sent from source to destination, a copy is kept in each node's buffers so that dropped flits can be resent if necessary. An "ACK" signal is sent from a downstream node when a flit is received correctly; when the upstream node receives the "ACK" signal, it deletes its copy from its buffers. If the downstream node cannot receive the flit, or receives it corrupted, it sends a "NACK" signal to the upstream node, which rewinds its output queue and starts resending a copy of the corrupted flit. Figure 6 shows one example of this flow control.
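The keep-a-copy-until-ACK discipline can be sketched as follows. This is an illustrative model, not the router RTL: `channel` stands in for one hop's transmit-and-respond behavior and returns True for ACK or False for NACK, and the per-hop copy buffers are collapsed into a single retransmission loop.

```python
def acknack_send(flits, channel):
    """Send flits over `channel` (flit -> True for ACK, False for NACK),
    keeping a local copy of each flit until it is acknowledged."""
    delivered = []
    for flit in flits:
        while True:
            ack = channel(flit)        # transmit and wait for ACK/NACK
            if ack:
                delivered.append(flit) # ACK: safe to drop the local copy
                break
            # NACK: rewind and resend the same flit from the kept copy
    return delivered
```

A channel that NACKs the first attempt of one flit still yields in-order, loss-free delivery, at the cost of the extra retransmission traffic noted for OASIS later in the paper.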
III. OPTIMIZED NOC DESIGN
We previously designed a NoC named OASIS, which has a 4x4 mesh topology, wormhole switching, a First-Come, First-Served (FCFS) scheduler, and a re-transmission flow control scheme similar to ACK/NACK flow control, except that only the source and destination cores manage the flow control. In this section, we describe the advantages and drawbacks of OASIS NoC.
FCFS scheduling is a very simple algorithm to implement in hardware, and it exhibits low power consumption and low latency. The re-transmission flow control employed by OASIS is performed in the Network Interface (NI): source nodes send a checksum, and the NIs of the destination nodes confirm the correctness. If flits have been corrupted or dropped, they must be re-transmitted; in that case, new flits are resent from the source PEs. Using this technique, OASIS can transmit data correctly in noisy environments. However, OASIS has some drawbacks. FCFS leads to unbalanced network utilization, because some transmissions are served by the scheduler while others are always stalled; this may increase latency. The re-transmission technique must resend flits whenever errors occur, which can increase network utilization and raise the possibility of a bottleneck.
Fig. 3. ON/OFF flow control.
Fig. 4. Credit-based flow control.
Fig. 5. Handshaking flow control.
Fig. 6. ACK/NACK flow control.
Fig. 7. One flit structure. Payload is 8-bit.
A. Overview of the ONoC Architecture
We have optimized ONoC in areas including switching, forwarding, routing, and topology, and have added functionality to the design by implementing stall-go flow control and round-robin scheduling [11], [12]. The new algorithms are intended to optimize and enhance performance for general applications; they do not target specific applications or platforms.
Fig. 9. State machine of Stall-go flow control.
B. Switching
Wormhole switching and virtual cut-through forwarding are both employed in ONoC; which one is used in a given instance depends on the degree of packet fragmentation. In OASIS, each router has buffers that can store four flits. When a packet is divided into more than four flits, ONoC chooses wormhole switching; when a packet is divided into four or fewer flits, ONoC chooses virtual cut-through. In other words, when the buffer size is greater than or equal to the number of flits, virtual cut-through is used, and when the buffer size is less than the number of flits, wormhole switching is used.
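The selection rule above amounts to a single comparison. This sketch uses the four-flit buffer of OASIS as the default; the function name is ours, and the equality case follows the "greater than or equal" clause (VCT).

```python
def choose_switching(packet_flits, buffer_depth=4):
    """Pick the forwarding method: VCT when every flit of the packet
    fits in one router's buffer, wormhole switching otherwise."""
    return "VCT" if packet_flits <= buffer_depth else "wormhole"
```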
C. ONoC buffer design consideration
ONoC has buffers that store multiple flits at each input port. Buffer size is an important NoC parameter that affects both area utilization and throughput. An increased number of occupied routers increases the probability of a communication stall. In wormhole switching, the buffer size may be too small to store all the flits of a packet; virtual cut-through, on the other hand, occupies fewer routers when there is a blockage, increasing throughput. There is also a trade-off between power consumption and throughput, so the buffer depth of ONoC is varied between 4, 8, 16, and 32 in the simulations.
D. Routing and topology
ONoC employs static XY routing in a distributed manner: each flit carries the destination address, and each router compares its current address with the destination address to select the output port direction. Figure 7 shows the flit structure; it supports network sizes up to 7x7.
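The per-router decision just described is dimension-ordered: resolve the X offset first, then Y. A minimal Python sketch follows; the (x, y) coordinate convention and the assumption that +y maps to the "north" port are ours for illustration.

```python
def xy_route(cur, dst):
    """Dimension-ordered XY routing: correct the X coordinate first,
    then Y. `cur` and `dst` are (x, y) mesh positions; the return
    value is the selected output port."""
    cx, cy = cur
    dx, dy = dst
    if dx > cx:
        return "east"
    if dx < cx:
        return "west"
    if dy > cy:
        return "north"   # assumption: +y is the north direction
    if dy < cy:
        return "south"
    return "local"       # arrived at the destination router
```

Because every packet between a given pair of nodes takes the same X-then-Y path, the routing is deterministic and deadlock-free on a mesh, matching the static-routing discussion in Section II.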
E. Avoiding buffer overflow
The buffer-overflow avoidance technique used here efficiently prevents buffers from overflowing. Data transfer is controlled by signals indicating the buffer condition. Without the stall-go function, receiving cores need to judge whether packets have been dropped; if so, the sender must resend the dropped packets in response to a receive-request signal from the master cores. Adding the stall-go function introduces communication blocking, but it can reduce latency. Figure 8 shows a simplified block diagram of the technique, and Fig. 9 illustrates its state machine. State "Go" indicates that the receiving FIFO can store two or more flits; state "Sent" means it can store one flit; state "Stop" means it cannot store any flits. There is no state transition from "Stop" to "Sent", because once the transition to "Stop" happens, the priority of this port is lowered in the sending router.
F. Scheduling
ONoC adopts a wormhole-like switching technique. The inputs and outputs of each router cover five directions: north, east, south, west, and local. A router may receive multiple inputs, but only one input stream is processed and routed at a time. Our early design adopted FCFS scheduling. With FCFS, if a message is divided into five packets (for example), the communication blocks all routers on the path between the source and destination nodes until the transmission is finished. Because routers are blocked, other transmissions cannot be served, increasing transmission latency. To solve this problem, ONoC employs round-robin scheduling in the packet transmit layer. Thus, each communication sees a partially fixed transmit latency.
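A round-robin arbiter of the kind cited in [11], [12] can be sketched as follows: the search for the next grant starts just after the previously granted port, so every requesting port is served in bounded time. The function name and boolean request-vector representation are our illustrative choices.

```python
def round_robin_grant(requests, last_granted):
    """Grant one requesting port, scanning from the port just after
    `last_granted` so that service rotates fairly.
    `requests` is a list of booleans, one per input port."""
    n = len(requests)
    for offset in range(1, n + 1):
        port = (last_granted + offset) % n
        if requests[port]:
            return port
    return None  # no port is requesting
```

Starting the scan after the last grant is what bounds each port's waiting time to one full rotation, which is the source of the "partially fixed" transmit latency mentioned above.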
IV. SIMULATION RESULTS
ONoC is designed in Verilog HDL, synthesized with Altera CAD tools, and simulated with ModelSim [13]. The target of this prototype is a hardware-based JPEG codec [14]. Figure 10 illustrates the block diagram of the JPEG2000 codec [15], [16]. We used this task diagram to implement JPEG2000 on ONoC with random mapping, as shown in Fig. 11; however, we focused only on the data transfer. We evaluated the hardware complexity and data transfer cycles of ONoC. Table 1 shows the simulation parameters. Figure 12 shows the results of a simulation in which 120,015 bytes of input data were tested (Baboon, 200x200). In this example, the flit data size is 8 bits, because the minimal data transmission size in preprocessing is 8 bits [17]. As a result, the total transmission time decreased by 12.9% at buffer depth 16 (for example); on average, the total transmission time is reduced by 19.5%.
V. HARDWARE DESIGN RESULTS
As mentioned in Section III, buffer size is one of the most important parameters in a NoC architecture. If the buffer size is small and the network is congested, many flits cannot start transmission; hence, large buffers achieve high throughput. We analyzed the relationship between buffer size and area utilization. The hardware complexity comparison between ONoC and OASIS NoC is shown in Table 2. We estimated the area utilization over various buffer sizes and observed that large buffer sizes dramatically increase area utilization. On average, the ONoC area is about 4.38% larger than that of OASIS; similarly, power is about 0.11% higher, and speed decreases by only about 9.46%.
TABLE I
SIMULATION PARAMETERS.

ONoC Parameter     | Value
Network Size       | 3x3 mesh
Buffer Depth       | 4, 8, 16 and 32
Flit Size          | 20 bits (header: 12 bits, payload: 8 bits)
Forwarding         | Wormhole-switching-like
Scheduling         | Round-robin
Control Flow       | Stall-go
Routing            | Static X-Y
Target Application | JPEG codec
Target Device      | Altera Stratix III
Input Data Size    | 120,015 bytes (200x200)
TABLE II
OASIS AND ONOC HARDWARE PERFORMANCE. BD: BUFFER DEPTH.

BD | Arch. | Area (ALUTs)  | Power (mW) | Speed (MHz)
 4 | ONoC  | 5,485 (5%)    | 649.17     | 185.87
 4 | OASIS | 5,282 (5%)    | 649.03     | 207.90
 8 | ONoC  | 8,269 (7%)    | 660.02     | 186.60
 8 | OASIS | 7,890 (7%)    | 659.31     | 195.05
16 | ONoC  | 10,538 (9%)   | 682.80     | 161.26
16 | OASIS | 10,279 (9%)   | 681.63     | 177.43
32 | ONoC  | 17,416 (15%)  | 716.87     | 153.96
32 | OASIS | 16,569 (15%)  | 716.02     | 172.38
VI. CONCLUSION
In this paper, we presented the architecture of, and advanced optimization techniques for, flit flow control and scheduling. The ONoC architecture is designed in Verilog HDL and synthesized and simulated with commercial CAD tools. ONoC reduces the total transmission time without significant hardware cost: on average, the total transmission time is decreased by 19.5% with 4.38% extra area utilization.
REFERENCES
[1] A. Ben Abdallah and M. Sowa, "Basic Network-on-Chip Interconnection for Future Gigascale MCSoCs Applications: Communication and Computation Orthogonalization", In Proceedings of Tunisia-Japan Symposium on Society, Science and Technology (TJASSST), Dec. 4-9, 2006.
Fig. 11. JPEG codec tasks are mapped on ONoC.
Fig. 12. OASIS and ONoC simulation results: total transmission time and number of dropped packets.
[2] Moraes, F.G; et al., ”HERMES: an Infrastructure for Low Area
Overhead Packet-Switching Networks on Chip”, Integration, the VLSI
Journal, vol. 38-1, 2004, pp. 69-93.
[3] L. Benini et al., "Networks on chips: A new SoC paradigm", IEEE Computer, vol. 35, no. 1, Jan. 2002, pp. 70-78.
[4] Kenichi Mori, A. Ben Abdallah, Kenichi Kuroda, "Design and Evaluation of a Complexity Effective Network-on-Chip Architecture on FPGA", The 19th Intelligent System Symposium (FAN 2009), Sep. 2009, pp. 318-321.
[5] M. S. Rasmussen, "Network-on-Chip in Digital Hearing Aids", Informatics and Mathematical Modelling, Technical University of Denmark, DTU, Richard Petersens Plads, Building 321, DK-2800 Kgs. Lyngby, IMM-Thesis-2006-76, 2006.
[6] A. Pullini, F. Angiolini, D. Bertozzi, and L. Benini, ”Fault
tolerance overhead in network-on-chip flow control schemes”, In
Proceedings of 18th Annu. Symp. Integr. Circuits and Syst. Des.
(SBCCI), 2005, pp. 224-229.
[7] William J. Dally and Brian Towles, "Principles and Practices of Interconnection Networks", Morgan Kaufmann, 2004, chap. 13.
[8] A. Mello, L. Tedesco, N. Calazans, and F. Moraes., ”Virtual
channels in networks on chip: Implementation and evaluation on
hermes NoC”, In SBCCI’05: In Proceedings of the 18th annual
symposium on Integrated circuits and system design, New York, NY,
USA, 2005. ACM Press, pp.178-183.
[9] E. Bolotin et al., ”Automatic hardware-efficient SoC
integration by QoS network on chip”, in 11th IEEE International
Conference on Electronics, Circuits and Systems, 2004.
[10] C. A. Zeferino and A. A. Susin., ”SoCIN: A Parametric and
Scalable Network-on-Chip”, In Proceedings of the 16th Symposium on
Integrated Circuits and Systems Design (SBCCI03), 2003, pp.
34-43.
[11] M. Weber., ”Arbiters: Design ideas and coding styles”,
Synopsys Users Group 2001 Proceedings, Boston, 2001.
[12] E.S. Shin et al., ”Round-robin Arbiter Design and Generation”,
In International Symposium on System Synthesis, Kyoto, Japan, 2002,
pp. 243-248.
[13] Rickard Holsmark, ”Modeling and Prototyping of a Network on
Chip”, Master Thesis, Electronics, Sweden Jonkoping University,
2002.
[14] J. Rosethal, "JPEG Image Compression Using an FPGA", Master of Science in Electrical and Computer Engineering, Dec. 2006.
[15] M. J. Gormish, D. Lee, M. W. Marcellin, ”JPEG-2000: Overview,
Architecture and Applications”, in Proceedings IEEE Int.
Conference, Image Processing (ICIP2000), Sep. 2000, pp.
29-32.
[16] Zhang, H. and Fritts, J., ”EBCOT coprocessing architecture for
JPEG2000”, SPIE Electronic Imaging - Video Communications and Image
Processing 2004, San Jose, CA, pp. 1333-1340.
[17] A. N. Skordas, C. A. Christopoulos, T. Ebrahimi, ”JPEG2000:
The Upcoming Still Image Compression Standard”, In Proceedings of
the 11th Portuguese Conference on Pattern Recognition, Portugal,
May 11- 12 2000, pp. 359-366.