Advanced Design Issues for OASIS Network-on-Chip Architecture
Kenichi Mori, Adam Esch, Abderazek Ben Abdallah, Kenichi Kuroda
The University of Aizu, Graduate School of Computer Science and Engineering,
Aizu-Wakamatsu 965-8580, Fukushima, Japan
Abstract—Network-on-Chip (NoC) architectures provide an efficient way of realizing on-chip interconnections and largely alleviate the limitations of bus-based solutions. NoC has emerged as a solution to the problems exhibited by the shared-bus communication approach in System-on-Chip (SoC) implementations, including lack of scalability, clock skew, lack of support for concurrent communication, and power consumption. The communication behavior of this paradigm is affected by architectural parameters such as topology, routing, and buffer size. In this paper, we propose advanced optimization techniques for OASIS NoC, a NoC we previously designed. We describe the architecture and the novel optimization techniques in detail. Hardware complexity and preliminary performance results are also given.
Index words: Network-on-chip design; Optimization; Parallel; Flow control; Round robin.
I. INTRODUCTION
Current Systems-on-Chip (SoC) execute applications that demand extensive amounts of parallel processing. Networks-on-Chip (NoC) [1], [2] provide a good way of realizing interconnections on silicon and largely alleviate the limitations of bus-based solutions.
Deep sub-micron processing technologies have enabled the implementation of new application-specific embedded architectures that integrate multiple software-programmable processors and dedicated hardware components onto a single chip. Recently, these application-specific architectures have emerged as key design solutions for today's nanoelectronic design problems, driven by emerging applications in the areas of (1) wireless communication, (2) broadband/distributed networking, (3) distributed computing, and (4) multimedia computing.
NoC is becoming an attractive option for solving bus-based problems. It is a scalable architectural platform with huge potential to handle increasing complexity, and it can easily provide reconfigurability. In NoC architectures, processors are connected via a packet-switched communication network on a single chip, similar to the way computers are connected to the Internet. The packet-switched network routes information between network clients (e.g. processors, memories, and custom logic devices).
Packet switching supports asynchronous transfer of information. It also provides extremely high bandwidth by distributing the propagation delay across multiple switches, effectively pipelining the signal transmission. In addition, NoC offers several promising features. First, it transmits packets instead of words; dedicated address lines, like those in bus systems, are not necessary, since the destination address of each packet is carried inside the packet itself. Second, transmissions can proceed in parallel if the network provides more than one transmission channel between a sender and a receiver. Thus, unlike bus-based systems-on-chip, NoC offers theoretically infinite scalability, straightforward IP core reuse, and a higher level of parallelism.
This paper presents advanced optimization techniques for, and an evaluation of, a complexity-effective Network-on-Chip (ONoC) based on our previously designed OASIS Network-on-Chip [1], [4].
Fig. 1. (a) Circuit-switching model and (b) packet-switching model.
Fig. 2. 4x4-mesh topology NoC.
2010 International Conference on Broadband, Wireless Computing,
Communication and Applications
978-0-7695-4236-2/10 $26.00 © 2010 IEEE
II. ON-CHIP INTERCONNECTION OVERVIEW
The NoC interconnection paradigm is characterized by its topology, protocol, and flow control. On-chip interconnections use a layered approach, in which protocol functions are described as operating on data units at different levels of abstraction. NoC paradigms are characterized in the same way as parallel-machine interconnections: by protocol, topology, flow control, and so on. Several additional decisions must be made when designing a NoC, including the communication protocol, switching method, and network topology. Figure 1 shows a typical NoC paradigm and a point-to-point network. Figure 2 shows a 4x4-mesh topology NoC architecture.
Various types of interconnect architectures for Multicore System-on-Chip (MCSoC) architectures have been proposed. Most of them borrow ideas from parallel computing while considering different constraints such as power and complexity. Common aims are low latency and high throughput. The latter depends on the flow control mechanism, which deals with the allocation of channel and buffer resources to packets as they traverse their paths [3].
There is a large range of protocols to select from for NoCs. Switching and routing styles are possible choices for NoC protocols, and they can be distinguished by their flow control methodologies. When these switching techniques are implemented in on-chip networks, they behave differently in performance and require different hardware resources. In circuit switching, physical paths from source nodes to destination nodes are reserved prior to the transmission of data: resources are held for the duration of the connection, and the path is set up before communication begins. Advantages of circuit switching include guaranteed connection latency and quality of service. The disadvantage is that, regardless of the amount of data streamed over a connection, its resources remain occupied, causing low utilization. In packet switching, packets travel through routing nodes on the network. Each packet is independent, so packets can be accepted by any node at any time. Packet switching does not block communication nodes, and more than one connection can be used at a time, which increases resource utilization [4]. Networks with many cores can be configured, and are scalable, because of packet switching. However, streaming performance over packet-switched connections is weaker, and packet transmission latency may increase when routing nodes become congested. Within packet switching, Store and Forward (SF), Wormhole Switching (WS), and Virtual Cut-Through (VCT) can be chosen as switching methods [5]. These switching methods transmit flits (flow control units), which are the parts of individual packets.
Store and Forward - In this switching method, each router needs buffers that can store all of the flits of a single packet. Before a router starts transmitting flits to the next router, it must store the entire packet in its buffers. If a packet is split into many flits, many buffers are needed, and there may be a long delay between the first and last flits. However, since only a small number of routers are occupied by a single transmission, SF remains useful in many communication situations.
Wormhole - Each router can transmit flits one after another without waiting to store entire packets. Because buffers do not have to be as large, design complexity and latency decrease; however, routers along the path may stay occupied and block other transmissions.
Virtual Cut-Through - VCT is an intermediate forwarding method that has properties of both SF and WS. It forwards flits one after another, but still needs enough buffer space to store an entire blocked packet. When blocking happens, the flits are stored in the router next to the blocked one. Buffer size is large, but forwarding latency is small, and few routers stay occupied when blocking happens.
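Under a no-contention assumption, the latency contrast between store-and-forward and the two cut-through-style methods can be sketched with the textbook formulas: SF pays the packet serialization delay at every hop, while WS/VCT pay it only once. This is a minimal Python sketch; the function names and the unit per-hop router delay are our illustrative assumptions, not OASIS parameters.

```python
def store_and_forward_latency(hops, packet_flits, t_router=1):
    # Each router must receive the whole packet before forwarding it,
    # so the serialization delay (packet_flits cycles) is paid per hop.
    return hops * (t_router + packet_flits)

def cut_through_latency(hops, packet_flits, t_router=1):
    # Wormhole and VCT forward the head flit immediately, so the
    # serialization delay is paid only once (pipelined transmission).
    return hops * t_router + packet_flits
```

For a 4-hop path and an 8-flit packet, cut-through transmission finishes in roughly a third of the store-and-forward time, which is why WS and VCT dominate on-chip designs.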
Power consumption greatly depends on buffer size, though it is
expected that the forwarding methods with many buffers obtain the
highest throughputs. These trade-offs are described in the
simulation section. There is an important discussion when one is
selected among these methods of forwarding and data is transmitted.
How flits are forwarded and through what route they are forwarded
through are problems of routing. There are two types of NoC
routing: static and dynamic. Dynamic routing involves selecting
another route when the prior route is occupied or blocked. When
using dynamic routing, it is necessary to account for deadlocks and
livelocks. In Static routing, routing paths are determined by
source routers by use of a routing table. All routes are fixed. If
blocking happens, static routing waits to release the router
because deadlocks cannot happen.
A. Control Flow
Flits must be transmitted so that no flits are dropped, or else some kind of resend protocol is required. Ideally, a flow control scheme has low power consumption and a computation time that does not depend on the state of the network. ON/OFF, credit-based, handshaking, and ACK/NACK are commonly used flow control schemes in NoC and are explained in this paper.
ON/OFF flow control [6], [7] can manage data flow from upstream routers while issuing a minimal number of control signals, because it has only two states: ON and OFF. The scheme uses a threshold on the number of free buffers in the downstream routers to decide the state of the control signal. When the number of free buffers falls below the threshold, downstream routers emit an "OFF" signal to upstream routers, stopping the flow of flits. When the downstream routers have forwarded flits to other nodes and the number of free buffers rises back above the threshold, they emit an "ON" signal to upstream routers, restarting the flow of flits.
Since only the ON/OFF signal needs to be sent to switch states, the computation time is low.
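The ON/OFF behavior described above can be modeled with a small Python sketch; the class name, buffer size, and threshold below are hypothetical choices for illustration, not values from the ONoC design.

```python
class OnOffReceiver:
    """Downstream node in ON/OFF flow control (illustrative sketch).

    The state goes OFF when free buffer slots fall below `threshold`,
    and back ON when draining flits raises the free count again.
    """
    def __init__(self, buffer_size=4, threshold=2):
        self.buffer_size = buffer_size
        self.threshold = threshold
        self.occupied = 0
        self.state = "ON"

    def _update_state(self):
        free = self.buffer_size - self.occupied
        # In hardware, only a state *change* generates a signal upstream.
        self.state = "OFF" if free < self.threshold else "ON"

    def accept_flit(self):
        # Receive one flit from the upstream router.
        assert self.occupied < self.buffer_size, "buffer overflow"
        self.occupied += 1
        self._update_state()

    def drain_flit(self):
        # Forward one stored flit to the next node, freeing a slot.
        assert self.occupied > 0, "buffer underflow"
        self.occupied -= 1
        self._update_state()
```

With a buffer of four and a threshold of two, the third accepted flit drops the free count to one and turns the signal OFF; draining one flit turns it ON again.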
Figure 3 shows one transmission example with ON/OFF flow control.
In credit-based flow control [1], [7], [8], [9], upstream nodes keep a count of the number of empty downstream buffers; we call this count CN (credit number). Each time an upstream node sends a flit to the downstream buffers, CN is decremented by one. When the downstream buffers forward flits to other nodes, they also send a credit control signal to the upstream routers; when an upstream router receives this signal, the CN associated with that path is incremented accordingly. Figure 4 illustrates the data flow with an example transmission. In this example, node2 is initially blocked, and CN is decremented. Then node2 starts sending flits and emits credit signals to node1, which receives them and restarts sending flits to node2.
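The credit counter described above reduces to a few lines; this is an illustrative Python sketch (class and method names are ours), showing that the sender stalls exactly when CN reaches zero and resumes on the next credit.

```python
class CreditSender:
    """Upstream node tracking downstream free buffers via a credit
    counter (CN), as in credit-based flow control."""
    def __init__(self, credits=4):
        self.cn = credits         # CN: known free downstream slots

    def try_send_flit(self):
        if self.cn == 0:
            return False          # downstream buffers full: stall
        self.cn -= 1              # one slot is now presumed occupied
        return True

    def receive_credit(self):
        self.cn += 1              # downstream forwarded a flit onward
```

With two initial credits, two flits go out, the third attempt stalls, and one returning credit signal lets the sender resume, matching the node1/node2 exchange in Fig. 4.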
Handshaking flow control [10], illustrated in Fig. 5, needs two control signals: a valid signal and an acknowledge signal. The upstream nodes must maintain the valid signal corresponding to each flit until they receive an acknowledge signal. In Fig. 5, node1 sends a valid signal to node2; node2 stores some flits and sends an acknowledge signal back to node1; node1 receives the acknowledge signal and clears the corresponding valid signal. If node2 has no free buffers to store flits when it receives the val3 signal, it sends a non-acknowledge signal to node1, and node1 resends the val3 signal until it receives an acknowledge.
The schemes above send signals from the downstream buffers to the upstream buffers to decide whether or not to send flits. ACK/NACK flow control [6], [7], in contrast, does not need to wait for and evaluate signals from downstream buffers, which reduces the latency needed to issue control signals. In this flow control model, as flits are sent from source to destination, a copy is kept in each node's buffers so that dropped flits can be resent if necessary. An "ACK" signal is sent from a downstream node when a flit is received correctly; when the upstream node receives the "ACK" signal, it deletes its copy from its buffers. If the downstream node cannot receive the flit, or receives it corrupted, it sends a "NACK" signal to the upstream node, which rewinds its output queue and starts resending a copy of the corrupted flit. Figure 6 shows one example of this flow control.
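The keep-a-copy-until-ACK discipline can be sketched as follows. This is an illustrative model, not the router RTL: `channel` stands in for one hop's transmit-and-respond behavior and returns True for ACK or False for NACK, and the per-hop copy buffers are collapsed into a single retransmission loop.

```python
def acknack_send(flits, channel):
    """Send flits over `channel` (flit -> True for ACK, False for NACK),
    keeping a local copy of each flit until it is acknowledged."""
    delivered = []
    for flit in flits:
        while True:
            ack = channel(flit)        # transmit and wait for ACK/NACK
            if ack:
                delivered.append(flit) # ACK: safe to drop the local copy
                break
            # NACK: rewind and resend the same flit from the kept copy
    return delivered
```

A channel that NACKs the first attempt of one flit still yields in-order, loss-free delivery, at the cost of the extra retransmission traffic noted for OASIS later in the paper.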
III. OPTIMIZED NOC DESIGN
We previously designed a NoC named OASIS, which has a 4x4 mesh topology, wormhole switching, a First-Come, First-Served (FCFS) scheduler, and a re-transmission flow control scheme similar to ACK/NACK flow control, except that only the source and destination cores manage the flow control. In this section, we describe the advantages and drawbacks of OASIS NoC.
FCFS scheduling is a very simple algorithm to implement in hardware, and it exhibits low power consumption and low latency. The re-transmission flow control employed by OASIS is performed in the Network Interface (NI): source nodes send a checksum, and the NIs of the destination nodes confirm the correctness. If flits have been corrupted or dropped, they must be re-transmitted; in that case, new flits are resent from the source PEs. Using this technique, OASIS can transmit data correctly in noisy environments. However, OASIS has some drawbacks. FCFS leads to unbalanced network utilization, because some transmissions are served by the scheduler while others are always stalled; this may increase latency. The re-transmission technique must resend flits whenever errors occur, which can increase network utilization and raise the possibility of a bottleneck.
Fig. 3. ON/OFF flow control.
Fig. 4. Credit-based flow control.
Fig. 5. Handshaking flow control.
Fig. 6. ACK/NACK flow control.
Fig. 7. One flit structure. Payload is 8-bit.
A. Overview of the ONoC Architecture
We have optimized ONoC in areas including switching, forwarding, routing, and topology, and have added functionality to the design by implementing stall-go flow control and round-robin scheduling [11], [12]. The new algorithms are intended to optimize and enhance performance for general applications; they do not target specific applications or platforms.
Fig. 9. State machine of Stall-go flow control.
B. Switching
Wormhole switching and virtual cut-through forwarding are both employed in ONoC; which one is used in a given instance depends on the degree of packet fragmentation. In OASIS, each router has buffers that can store four flits. When a packet is divided into more than four flits, ONoC chooses wormhole switching; when a packet is divided into four or fewer flits, ONoC chooses virtual cut-through. In other words, when the buffer size is greater than or equal to the number of flits, virtual cut-through is used, and when the buffer size is less than the number of flits, wormhole switching is used.
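The selection rule above amounts to a single comparison. This sketch uses the four-flit buffer of OASIS as the default; the function name is ours, and the equality case follows the "greater than or equal" clause (VCT).

```python
def choose_switching(packet_flits, buffer_depth=4):
    """Pick the forwarding method: VCT when every flit of the packet
    fits in one router's buffer, wormhole switching otherwise."""
    return "VCT" if packet_flits <= buffer_depth else "wormhole"
```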
C. ONoC buffer design consideration
ONoC has buffers that store multiple flits at each input port. Buffer size is an important NoC parameter that affects both area utilization and throughput. An increased number of occupied routers increases the probability of a communication stall. In wormhole switching, the buffer size may be too small to store all the flits of a packet; virtual cut-through, on the other hand, occupies fewer routers when there is a blockage, increasing throughput. There is also a trade-off between power consumption and throughput, so the buffer depth of ONoC is varied between 4, 8, 16, and 32 in the simulations.
D. Routing and topology
ONoC employs static XY routing in a distributed manner: each flit carries the destination address, and each router compares its current address with the destination address to select the output port direction. Figure 7 shows the flit structure; it supports network sizes up to 7x7.
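The per-router decision just described is dimension-ordered: resolve the X offset first, then Y. A minimal Python sketch follows; the (x, y) coordinate convention and the assumption that +y maps to the "north" port are ours for illustration.

```python
def xy_route(cur, dst):
    """Dimension-ordered XY routing: correct the X coordinate first,
    then Y. `cur` and `dst` are (x, y) mesh positions; the return
    value is the selected output port."""
    cx, cy = cur
    dx, dy = dst
    if dx > cx:
        return "east"
    if dx < cx:
        return "west"
    if dy > cy:
        return "north"   # assumption: +y is the north direction
    if dy < cy:
        return "south"
    return "local"       # arrived at the destination router
```

Because every packet between a given pair of nodes takes the same X-then-Y path, the routing is deterministic and deadlock-free on a mesh, matching the static-routing discussion in Section II.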
E. Avoiding buffer overflow
The buffer-overflow avoidance technique used here efficiently prevents buffers from overflowing. Data transfer is controlled by signals indicating the buffer condition. Without the stall-go function, receiving cores need to judge whether packets have been dropped; if so, the sender must resend the dropped packets in response to a receive-request signal from the master cores. Adding the stall-go function introduces communication blocking, but it can reduce latency. Figure 8 shows a simplified block diagram of the technique, and Fig. 9 illustrates its state machine. State "Go" indicates that the receiving FIFO can store two or more flits; state "Sent" means it can store one flit; state "Stop" means it cannot store any flits. There is no state transition from "Stop" to "Sent", because once the transition to "Stop" happens, the priority of this port is lowered in the sending router.
F. Scheduling
ONoC adopts a wormhole-like switching technique. The inputs and outputs of each router cover five directions: north, east, south, west, and local. A router may receive multiple inputs, but only one input stream is processed and routed at a time. Our early design adopted FCFS scheduling. With FCFS, if a message is divided into five packets (for example), the communication blocks all routers on the path between the source and destination nodes until the transmission is finished. Because routers are blocked, other transmissions cannot be served, increasing transmission latency. To solve this problem, ONoC employs round-robin scheduling in the packet transmit layer. Thus, each communication sees a partially fixed transmit latency.
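A round-robin arbiter of the kind cited in [11], [12] can be sketched as follows: the search for the next grant starts just after the previously granted port, so every requesting port is served in bounded time. The function name and boolean request-vector representation are our illustrative choices.

```python
def round_robin_grant(requests, last_granted):
    """Grant one requesting port, scanning from the port just after
    `last_granted` so that service rotates fairly.
    `requests` is a list of booleans, one per input port."""
    n = len(requests)
    for offset in range(1, n + 1):
        port = (last_granted + offset) % n
        if requests[port]:
            return port
    return None  # no port is requesting
```

Starting the scan after the last grant is what bounds each port's waiting time to one full rotation, which is the source of the "partially fixed" transmit latency mentioned above.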
IV. SIMULATION RESULTS
ONoC is designed in Verilog HDL, synthesized with Altera CAD tools, and simulated with ModelSim [13]. The target of this prototype is a hardware-based JPEG codec [14]. Figure 10 illustrates the block diagram of the JPEG2000 codec [15], [16]. We used this task diagram to implement JPEG2000 on ONoC with random mapping, as shown in Fig. 11; however, we focused only on the data transfer. We evaluated the hardware complexity and data transfer cycles of ONoC. Table 1 shows the simulation parameters. Figure 12 shows the results of a simulation in which 120,015 bytes of input data were tested (Baboon, 200x200). In this example, the flit data size is 8 bits, because the minimal data transmission size in preprocessing is 8 bits [17]. As a result, the total transmission time decreased by 12.9% at buffer depth 16 (for example); on average, the total transmission time is reduced by 19.5%.
V. HARDWARE DESIGN RESULTS
As mentioned in Section III, buffer size is one of the most important parameters in a NoC architecture. If the buffer size is small and the network is congested, many flits cannot start transmission; hence, large buffers achieve high throughput. We analyzed the relationship between buffer size and area utilization. The hardware complexity comparison between ONoC and OASIS NoC is shown in Table 2. We estimated the area utilization over various buffer sizes and observed that large buffer sizes dramatically increase area utilization. On average, the ONoC area is about 4.38% larger than that of OASIS; similarly, power is about 0.11% higher, and speed decreases by only about 9.46%.
TABLE I
SIMULATION PARAMETERS.

ONoC Parameter     | Value
Network Size       | 3x3 mesh
Buffer Depth       | 4, 8, 16 and 32
Flit Size          | 20 bits (header: 12 bits, payload: 8 bits)
Forwarding         | Wormhole-switching-like
Scheduling         | Round-robin
Control Flow       | Stall-go
Routing            | Static X-Y
Target Application | JPEG codec
Target Device      | Altera Stratix III
Input Data Size    | 120,015 bytes (200x200)
TABLE II
OASIS AND ONOC HARDWARE PERFORMANCE. BD: BUFFER DEPTH.

BD | Arch. | Area (ALUTs)  | Power (mW) | Speed (MHz)
 4 | ONoC  | 5,485 (5%)    | 649.17     | 185.87
 4 | OASIS | 5,282 (5%)    | 649.03     | 207.90
 8 | ONoC  | 8,269 (7%)    | 660.02     | 186.60
 8 | OASIS | 7,890 (7%)    | 659.31     | 195.05
16 | ONoC  | 10,538 (9%)   | 682.80     | 161.26
16 | OASIS | 10,279 (9%)   | 681.63     | 177.43
32 | ONoC  | 17,416 (15%)  | 716.87     | 153.96
32 | OASIS | 16,569 (15%)  | 716.02     | 172.38
VI. CONCLUSION
In this paper, we presented the architecture of, and advanced optimization techniques for, flit flow control and scheduling. The ONoC architecture is designed in Verilog HDL and synthesized and simulated with commercial CAD tools. ONoC reduces the total transmission time without significant hardware cost: on average, the total transmission time is decreased by 19.5% with 4.38% extra area utilization.
REFERENCES
[1] A. Ben Abdallah and M. Sowa, "Basic Network-on-Chip Interconnection for Future Gigascale MCSoCs Applications: Communication and Computation Orthogonalization", In Proceedings of Tunisia-Japan Symposium on Society, Science and Technology (TJASSST), Dec. 4-9, 2006.
Fig. 11. JPEG codec tasks are mapped on ONoC.
Fig. 12. OASIS and ONoC simulation results: total transmission time and number of dropped packets.
[2] Moraes, F.G; et al., ”HERMES: an Infrastructure for Low Area
Overhead Packet-Switching Networks on Chip”, Integration, the VLSI
Journal, vol. 38-1, 2004, pp. 69-93.
[3] L. Benini et al., "Networks on chips: A new SoC paradigm", IEEE Computer, vol. 35, no. 1, Jan. 2002, pp. 70-78.
[4] Kenichi Mori, A. Ben Abdallah, Kenichi Kuroda, "Design and Evaluation of a Complexity Effective Network-on-Chip Architecture on FPGA", The 19th Intelligent System Symposium (FAN 2009), Sep. 2009, pp. 318-321.
[5] M. S. Rasmussen, "Network-on-Chip in Digital Hearing Aids", Informatics and Mathematical Modelling, Technical University of Denmark, DTU, Richard Petersens Plads, Building 321, DK-2800 Kgs. Lyngby, IMM-Thesis-2006-76, 2006.
[6] A. Pullini, F. Angiolini, D. Bertozzi, and L. Benini, ”Fault
tolerance overhead in network-on-chip flow control schemes”, In
Proceedings of 18th Annu. Symp. Integr. Circuits and Syst. Des.
(SBCCI), 2005, pp. 224-229.
[7] William J. Dally and Brian Towles, "Principles and Practices of Interconnection Networks", Morgan Kaufmann, 2004, chap. 13.
[8] A. Mello, L. Tedesco, N. Calazans, and F. Moraes., ”Virtual
channels in networks on chip: Implementation and evaluation on
hermes NoC”, In SBCCI’05: In Proceedings of the 18th annual
symposium on Integrated circuits and system design, New York, NY,
USA, 2005. ACM Press, pp.178-183.
[9] E. Bolotin et al., ”Automatic hardware-efficient SoC
integration by QoS network on chip”, in 11th IEEE International
Conference on Electronics, Circuits and Systems, 2004.
[10] C. A. Zeferino and A. A. Susin., ”SoCIN: A Parametric and
Scalable Network-on-Chip”, In Proceedings of the 16th Symposium on
Integrated Circuits and Systems Design (SBCCI03), 2003, pp.
34-43.
[11] M. Weber., ”Arbiters: Design ideas and coding styles”,
Synopsys Users Group 2001 Proceedings, Boston, 2001.
[12] E.S. Shin et al., ”Round-robin Arbiter Design and Generation”,
In International Symposium on System Synthesis, Kyoto, Japan, 2002,
pp. 243-248.
[13] Rickard Holsmark, ”Modeling and Prototyping of a Network on
Chip”, Master Thesis, Electronics, Sweden Jonkoping University,
2002.
[14] J. Rosethal, "JPEG Image Compression Using an FPGA", Master of Science in Electrical and Computer Engineering, Dec. 2006.
[15] M. J. Gormish, D. Lee, M. W. Marcellin, ”JPEG-2000: Overview,
Architecture and Applications”, in Proceedings IEEE Int.
Conference, Image Processing (ICIP2000), Sep. 2000, pp.
29-32.
[16] Zhang, H. and Fritts, J., ”EBCOT coprocessing architecture for
JPEG2000”, SPIE Electronic Imaging - Video Communications and Image
Processing 2004, San Jose, CA, pp. 1333-1340.
[17] A. N. Skordas, C. A. Christopoulos, T. Ebrahimi, ”JPEG2000:
The Upcoming Still Image Compression Standard”, In Proceedings of
the 11th Portuguese Conference on Pattern Recognition, Portugal,
May 11- 12 2000, pp. 359-366.