Best of Both Worlds: A Bus-Enhanced Network on-Chip (BENoC)
Ran Manevich, Isask’har (Zigi) Walter, Israel Cidon, and Avinoam Kolodny
Technion – Israel Institute of Technology
May, 2009
2
Network on-Chip: the Good News
Interconnect for SoCs, CMPs and FPGAs
  Multi-hop, packet-based communication
  Efficient resource sharing
Scalable performance and efficiency in
  Power
  Area
  Design productivity
System Bus
3
Network on-Chip: the Bad News
Increased and hard-to-predict latency due to multi-hop and sharing
  Time-critical signals
Broadcast? Multicast?
  No easy solutions
  Slow (10s of cycles)
I wish I had a bus at hand...
4
Solution: Bus-Enhanced NoC (BENoC)
Bus re-introduced as a NoC “add-on”
Use NoC for data
  Optimized for high bandwidth
Use bus for short meta-data
  Low bandwidth, low latency
  Broadcast, multicast
Overhead should be justified!
[Figure: NoC with routers (R) and modules]
5
Related Work
In-band support of time-critical communication, and in-band multicast/broadcast
  Complex router implementation
  Suffer from multi-hop latency
Existing Bus-NoC hybrids
  Form a topological hierarchy
  Typically, bus used for local communication
[Figure: modules and routers (R)]
6
BENoC Services
Fast unicast and multicast signaling
  CMP cache example
Anycast
  Find resources that fulfill certain conditions
  E.g., “Looking for an idle DSP” or “Where are the 5 closest multipliers?”
Convergecast
  Efficient collection of feedback back to the initiator
  Barrier synchronization, …
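The anycast and convergecast services listed on this slide can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the `Module` record, the grid coordinates, the Manhattan-distance metric and the function names are hypothetical, and the slides do not say how queries are matched or how feedback is combined.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Module:
    name: str
    kind: str    # hypothetical resource class, e.g. "DSP" or "multiplier"
    x: int       # hypothetical grid position
    y: int
    idle: bool

def anycast(modules, predicate):
    """Broadcast a query; every module that satisfies it answers."""
    return [m for m in modules if predicate(m)]

def closest(modules, origin, kind, k):
    """Anycast for `kind`, then keep the k responders nearest to `origin`."""
    hits = anycast(modules, lambda m: m.kind == kind)
    def distance(m):
        return abs(m.x - origin[0]) + abs(m.y - origin[1])
    return sorted(hits, key=distance)[:k]

def convergecast(responses, combine):
    """Fold the responders' feedback into one value for the initiator."""
    result = None
    for response in responses:
        result = response if result is None else combine(result, response)
    return result

modules = [
    Module("dsp0", "DSP", 0, 0, idle=False),
    Module("dsp1", "DSP", 3, 1, idle=True),
    Module("mul0", "multiplier", 1, 1, idle=True),
    Module("mul1", "multiplier", 5, 5, idle=True),
]

# "Looking for an idle DSP"
idle_dsps = anycast(modules, lambda m: m.kind == "DSP" and m.idle)
# "Where are the closest multipliers?" (k=1 here)
nearest_mul = closest(modules, origin=(0, 0), kind="multiplier", k=1)
# Barrier-style feedback: have all modules reported idle?
all_idle = convergecast([m.idle for m in modules], lambda a, b: a and b)
```

The sketch only models who answers a query and how the answers are merged; it says nothing about timing on the bus.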
7
Additional BENoC Applications
NoC control
  Router configuration
    E.g., routing table configuration
  Adapt NoC routing for load balancing
  Fault discovery and recovery
System control
  Power management
  Resource load balancing
Debug
8
Outline
  Introduction
  MetaBus architecture
  MetaBus latency and energy analysis
  CMP cache use case
9
Conventional System Buses
Figure is copied from “Amba Specifications Rev 2.0” - http://www.arm.com/products/solutions/AMBA_Spec.html
Bandwidth optimized
Poor scalability
Not suitable for tasks in BENoC
10
MetaBus Design Requirements
Low area, low power
Low bandwidth
Low latency
Simple
Versatile
Scalable
Multicast and broadcast support
Acknowledgement
[Figure: NoC routers (R) and modules, with the “MetaBus”]
11
MetaBus Architecture
Many possible implementations
Example: tree topology with distributed arbitration
[Figure: tree of bus stations under a root, with Module#1 to Module#9]
12
Data Path
[Figure: tree of bus stations under a root, with Module#1 to Module#9; legend: data to root, data to receivers]
13
Example: Broadcast of Two Words
[Figure: tree of bus stations under a root, with Module#1 to Module#9]
Address word propagates to the root
Data word 1 and data word 2 propagate to the modules
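The up-then-down data path of a broadcast can be sketched in Python as follows. This is a minimal illustration, not the MetaBus design: the three-module tree, the station names and the rule that the sender is excluded from the receivers are all assumptions, and the sketch routes every word the same way rather than treating the address word separately.

```python
class Node:
    """A module (leaf), bus station, or root in a hypothetical MetaBus tree."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

def path_to_root(node):
    """Names visited while a word climbs from `node` to the root."""
    names = [node.name]
    while node.parent is not None:
        node = node.parent
        names.append(node.name)
    return names

def modules_below(node):
    """Names of all leaf modules in the subtree rooted at `node`."""
    if not node.children:
        return [node.name]
    found = []
    for child in node.children:
        found.extend(modules_below(child))
    return found

def broadcast(source, root, words):
    """For each word, record the upward path and the modules it reaches."""
    receivers = [m for m in modules_below(root) if m != source.name]
    up = path_to_root(source)
    return [(word, up, receivers) for word in words]

# Hypothetical topology, much smaller than the nine-module figure.
root = Node("Root")
station_a = Node("BusStation A", root)
station_b = Node("BusStation B", root)
m1 = Node("Module#1", station_a)
m2 = Node("Module#2", station_a)
m3 = Node("Module#3", station_b)

trace = broadcast(m1, root, ["data word 1", "data word 2"])
```

Each entry of `trace` lists one word, the path it takes to the root, and the modules that receive it on the way down.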
14
Distributed Arbitration Mechanism
[Figure: Module#1 to Module#3 and bus stations under a root; legend: bus request, bus grant]
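One way to picture distributed arbitration in a tree is as two phases: requests are merged on the way up, and a single grant is steered back down. The Python sketch below assumes a fixed-priority choice at each bus station (first listed child wins) and a hypothetical three-module tree; the slides do not specify the actual arbitration policy.

```python
# Hypothetical tree: each key is a root or bus station, each value its children.
TREE = {
    "Root": ["BusStation A", "BusStation B"],
    "BusStation A": ["Module#1", "Module#2"],
    "BusStation B": ["Module#3"],
}

def collect_requests(node, requesting, pending):
    """Upward phase: a station raises a request if any of its children does."""
    children = TREE.get(node)
    if children is None:                      # leaf module
        pending[node] = node in requesting
    else:
        # Visit every child (no short-circuit) so each node gets recorded.
        flags = [collect_requests(c, requesting, pending) for c in children]
        pending[node] = any(flags)
    return pending[node]

def propagate_grant(node, pending):
    """Downward phase: each station passes the grant to one requesting child."""
    children = TREE.get(node)
    if children is None:
        return node
    for child in children:                    # fixed priority, an assumption
        if pending[child]:
            return propagate_grant(child, pending)
    return None

def arbitrate(requesting):
    """Return the module that wins the bus, or None if nobody requested it."""
    pending = {}
    if not collect_requests("Root", requesting, pending):
        return None
    return propagate_grant("Root", pending)

winner = arbitrate({"Module#2", "Module#3"})
```

With fixed priority the same module always wins a tie; a fairer policy such as round-robin would replace only the loop in `propagate_grant`.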
15
Masking Saves Power
Unicast from Module#3 to Module#5
[Figure: Module#1 to Module#9, BusStation 1 to BusStation 5 and a root, with Mask1 to Mask5; mask bits shown: 1 0, 1 0 1, 10101]
Address word propagates to the root
Data word 1 propagates to the modules
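The power saving from masking can be illustrated by counting the downward links that are driven with and without it. The Python sketch below is an assumption-laden illustration: the tree is hypothetical (not the topology of the figure), and it assumes that masking means only the links on the path from the root to the destination are driven. It does not reproduce the mask encoding shown on the slide.

```python
# Hypothetical tree, given as child -> parent.
PARENT = {
    "BusStation 1": "Root",
    "BusStation 2": "Root",
    "Module#1": "BusStation 1",
    "Module#2": "BusStation 1",
    "Module#3": "BusStation 1",
    "Module#4": "BusStation 2",
    "Module#5": "BusStation 2",
}

def down_path(dest):
    """Links from the root down to `dest`, as (parent, child) pairs."""
    links = []
    node = dest
    while node in PARENT:
        links.append((PARENT[node], node))
        node = PARENT[node]
    return list(reversed(links))

def driven_links(dest, masking):
    """Downward links that toggle when a word is delivered to `dest`."""
    if masking:
        return down_path(dest)                 # only the path to the destination
    return [(parent, child) for child, parent in PARENT.items()]  # whole tree

unmasked = driven_links("Module#5", masking=False)
masked = driven_links("Module#5", masking=True)
```

In this toy tree a unicast to Module#5 drives 7 downward links without masking and 2 with it.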
16
(Binary) Bus Station
17
MetaBus Floorplan – An Example
64 modules, balanced binary MetaBus
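The size of such a tree follows from simple arithmetic, sketched here in Python. The sketch assumes the modules are the leaves of a full binary tree, that every bus station (the root included) has exactly two children, and that the module count is a power of two; the function name is hypothetical.

```python
def balanced_binary_metabus(num_modules):
    """Return (levels, stations) for a full binary tree with `num_modules` leaves."""
    if num_modules < 2 or num_modules & (num_modules - 1):
        raise ValueError("sketch expects a power-of-two module count >= 2")
    levels = num_modules.bit_length() - 1   # links between a module and the root
    stations = num_modules - 1              # internal nodes, root included
    return levels, stations

levels, stations = balanced_binary_metabus(64)
worst_case_links = 2 * levels               # up to the root, then back down
```

Under these assumptions, 64 modules give 6 levels, 63 bus stations, and at most 12 link traversals for a transaction that passes through the root.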
18
Outline
  Introduction
  MetaBus architecture
  MetaBus latency and energy analysis
  CMP cache use case
19
Analysis Highlights 1/4
NoC Broadcast+Unicast Energy/Transaction:
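[Equations not recoverable from the transcript. Surviving symbols: E_{NoC broadcast} with V^2, N_{flits}, K, C_{NL}, C_{ND}; E_{NoC unicast} with V^2, N_{flits}, n, L_W, C_{NL}, C_{ND} and the constants 1 and 2.]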
20
Analysis Highlights 2/4
MetaBus Broadcast and Unicast Energy/Transaction:
[Equations not recoverable from the transcript. Surviving symbols: E_{MetaBus broadcast} and E_{MetaBus unicast} with V^2, N_{flits}, n, B_D, B_R, C_{BL}, C_{BD,up}, C_{BD,down}.]
21
Analysis Highlights 3/4
NoC unicast and broadcast latency:
[Equations not recoverable from the transcript. Surviving symbols: T_{NoC unicast} with n, N_{CiR}, T_{Nclk}, N_{flits}; T_{NoC broadcast} with n, T_{Nclk}, N_{flits}.]
22
Analysis Highlights 4/4
MetaBus unicast and broadcast latency:
[Equation not recoverable from the transcript. Surviving symbols: T_{MetaBus} with N_{flits}, B_D, B_R, R_{BL}, C_{BL}, C_{BD,up}, C_{BD,down} and the constants 1.5, 0.7 and 0.4.]
23
Results - Energy Consumption
Energy consumption for broadcast and unicast transactions of 3 data words
[Plot: energy per transaction [nJ] (0 to 3.5) vs. number of modules (0 to 40); curves: MetaBus Broadcast, Network Broadcast, MetaBus Unicast, Network Unicast]
Bus and NoC unicast and broadcast energy per transaction
Parameters: 10×10 mm chip, 64-module mesh, 1 GHz NoC clock, speed-optimized bus, @ 0.18 µm
24
Results - Latencies
Latencies of broadcast and unicast transactions of 3 data words in a system with a frequency and a speed optimized MetaBus
[Plot: latency [ns] (0 to 120) vs. number of modules (0 to 40); curves: MetaBus, Network Broadcast, Network Unicast]
Figure 9: Bus and NoC broadcast latencies
Parameters: 10×10 mm chip, 64-module mesh, 1 GHz NoC clock, speed-optimized bus, @ 0.18 µm
25
Outline
  Introduction
  MetaBus architecture
  MetaBus latency and energy analysis
  CMP cache use case
26
Dynamic Non-Uniform Cache Access
Split large cache into independent smaller banks
  Non-uniform cache access time (NUCA)
Cache lines are moved to shorten access time
  Dynamic NUCA
Before fetching a line into its L1$, a CPU needs to find the L2 cache bank storing the line
[Figure: CMP (Chip Multi-Processor) with CPUs, each with an L1$, and L2$ banks]
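The line-search step can be sketched in Python to show why a broadcast medium helps. The sketch is illustrative only: the bank contents, the line addresses and the sequential-probing baseline are assumptions, and the slides do not state which search scheme the NoC-only system uses.

```python
def find_by_broadcast(banks, line):
    """One bus broadcast: every bank checks its own contents, the holder answers.

    Returns (bank_id or None, number of queries sent).
    """
    holders = [bank_id for bank_id, lines in banks.items() if line in lines]
    return (holders[0] if holders else None), 1

def find_by_probing(banks, line, probe_order):
    """Hypothetical baseline: unicast probes to one bank at a time."""
    for sent, bank_id in enumerate(probe_order, start=1):
        if line in banks[bank_id]:
            return bank_id, sent
    return None, len(probe_order)

# Hypothetical L2 state: four banks, each holding a set of line addresses.
banks = {
    0: {0x100, 0x140},
    1: {0x200},
    2: {0x2C0, 0x300},
    3: set(),
}

bank_bus, queries_bus = find_by_broadcast(banks, 0x2C0)
bank_noc, queries_noc = find_by_probing(banks, 0x2C0, probe_order=[0, 1, 2, 3])
```

Both searches find the same bank; the broadcast needs one query while the probing baseline needs three in this example.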
27
Simulation Setup
16 processors, 64 L2 cache banks
PARSEC and SPLASH-2 benchmarks
Vanilla wormhole NoC
Simulation accounts for bus latency, arbitration time, etc.
28
Simulation Results
Performance improvement in BENoC compared to a NoC-based CMP
(a) average read transaction latency; (b) application speed
29
Summary
Current NoCs are largely distributed
  Borrowing concepts from off-chip networks
On-chip environment provides an opportunity
Enhancing the network with a bus gives the best of both worlds
Advanced services are easily supported
  Anycast, management and control
Cost effective
  Power and performance
  Analysis and simulation
30
Thank you!
Questions?
Bus-Enhanced NoC
QNoC Research Group