Cost Effective Centralized Adaptive Routing for Networks on Chip
May 2, 2011 1
A Cost Effective Centralized Adaptive Routing for Networks on Chip
Ran Manevich*, Israel Cidon*, Avinoam Kolodny*, Isask’har (Zigi) Walter* and Shmuel Wimer#
*Technion – Israel Institute of Technology
#Bar-Ilan University
QNoC Research Group
May 2, 2011 2
Networks-on-Chip (NoCs)
May 2, 2011 3
Global traffic information is essential to make the right decision!
May 2, 2011 4
Adaptive Routing in NoCs – Local vs. Global Information
2D Mesh NoC
Legend: Low Congestion, Medium Congestion, High Congestion
A packet routed from the upper-left to the bottom-right corner utilizing local congestion information.
The same packet routed using global information.
I CAN MAKE IT!!!
Source
Destination
May 2, 2011 5
Route Selection - ATDOR
ATDOR - Adaptive Toggle Dimension Ordered Routing
Keep it simple! Centralized selection: XY or YX.
Routing tables in sources. One bit per destination.
The option with the less congested bottleneck link is preferred.
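The selection described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `dor_path`, `bottleneck` and `preferred_route`, the (x, y) coordinate convention, and the tie-break in favor of XY are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Node = Tuple[int, int]      # (x, y) mesh coordinates (assumed convention)
Link = Tuple[Node, Node]    # directed link between adjacent routers


def dor_path(src: Node, dst: Node, x_first: bool) -> List[Link]:
    """Links visited by dimension-ordered routing: XY if x_first, else YX."""
    links: List[Link] = []
    cur = src
    for axis in ((0, 1) if x_first else (1, 0)):
        while cur[axis] != dst[axis]:
            step = 1 if dst[axis] > cur[axis] else -1
            nxt = (cur[0] + step, cur[1]) if axis == 0 else (cur[0], cur[1] + step)
            links.append((cur, nxt))
            cur = nxt
    return links


def bottleneck(path: List[Link], load: Dict[Link, float]) -> float:
    """Load of the most congested link on the path (0.0 for an empty path)."""
    return max((load.get(link, 0.0) for link in path), default=0.0)


def preferred_route(src: Node, dst: Node, load: Dict[Link, float]) -> str:
    """Pick the option whose bottleneck link is less congested (ties go to XY, an assumption)."""
    xy = bottleneck(dor_path(src, dst, True), load)
    yx = bottleneck(dor_path(src, dst, False), load)
    return "XY" if xy <= yx else "YX"
```

Because the result is a choice between two options, it fits the one bit per destination that the source routing table stores.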
May 2, 2011 6
ATDOR Illustration 1
Five identical flows, 100 MB/s each.
Links modeled as M/M/1 queues. Delay of a single link:
$D_{LINK} = \frac{Traffic}{Capacity - Traffic}$
Link capacity is 210 MB/s.
Initial routing - XY
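A small numeric check of the link model, assuming the per-link delay has the M/M/1 form Traffic / (Capacity − Traffic) and that a link loaded at or above its capacity has unbounded delay. The function name `link_delay` is hypothetical, and the slide does not state how this ratio is scaled to the nanosecond figures reported later.

```python
import math


def link_delay(traffic: float, capacity: float) -> float:
    """M/M/1-style link delay; infinite once the offered traffic reaches capacity."""
    if traffic >= capacity:
        return math.inf
    return traffic / (capacity - traffic)


# Values from the slide: 100 MB/s flows on 210 MB/s links.
one_flow = link_delay(100, 210)     # a single flow on the link
two_flows = link_delay(200, 210)    # two flows share the link
three_flows = link_delay(300, 210)  # three flows exceed the capacity
```

With these numbers a link can carry two of the flows but not three, which is why a routing that stacks three flows on one link yields an infinite average delay.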
May 2, 2011 7
Centralized Routing – How?
• Option 1 – Continuous calculation of optimal routing for the active sessions:
Achievable load balancing
Speed and computation complexity
System complexity
May 2, 2011 8
Centralized Routing – How?
• Option 2 – Iterative serial selection between XY and YX for all source-destination pairs, based on traffic load measurements:
Achievable load balancing
Speed and computation complexity
System complexity
May 2, 2011 9
ATDOR Illustration 1
Step #   Re-Routed Flow   Average Delay
1        1->15            ∞
2        2->8             37 ns
3        2->15            22 ns
May 2, 2011 10
What did we just see?
For each flow we:
1. Calculated the better route.
2. Updated routing table of the source.
3. Waited for the update to take effect and measured global traffic load.
Performing steps 1-3 for each flow is slow and not scalable.
Steps 2 and 3 are unified for all destinations of a single source:
Achievable load balancing
Speed and computation complexity
Scalability
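One way to picture the unified scheme is the sketch below: routes are calculated for every destination of a source, written to that source's table in a single update, and the load is measured once per source rather than once per flow. The function `control_iteration` and its `select` and `measure` callbacks are hypothetical placeholders, and the order of sources is an assumption.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Flow = Tuple[int, int]   # (source, destination)
Loads = Dict[str, float]


def control_iteration(
    flows: Iterable[Flow],
    routes: Dict[Flow, str],
    select: Callable[[Flow, str, Loads], str],
    measure: Callable[[Dict[Flow, str]], Loads],
) -> Dict[Flow, str]:
    """One pass over all sources; table update and measurement happen once per source."""
    by_source: Dict[int, List[int]] = {}
    for src, dst in flows:
        by_source.setdefault(src, []).append(dst)

    for src in sorted(by_source):
        loads = measure(routes)  # step 3: global load after the previous update
        updates = {
            (src, dst): select((src, dst), routes[(src, dst)], loads)  # step 1
            for dst in by_source[src]
        }
        routes.update(updates)   # step 2: one table update for the whole source
    return routes
```

The number of measurement waits then grows with the number of sources instead of the number of source-destination pairs.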
May 2, 2011 11
Back to Illustration 1
Step #   Re-Routed Flows   Average Delay
1        1->15             ∞
2        2->8, 2->15       22 ns
3        4->15             22 ns
4        1->15             22 ns
5        2->8, 2->15       ∞
May 2, 2011 12
Problem #1
Changing routing may increase congestion and cause fluctuations.
Solution: Change routing only if the alternative is better by the margin α, 0 < α < 1:
if (Current Route = XY)
$$NextRoute = \begin{cases} YX & \text{if } MAX[Load_{YX}] \le \alpha \cdot MAX[Load_{XY}] \\ XY & \text{if } MAX[Load_{YX}] > \alpha \cdot MAX[Load_{XY}] \end{cases}$$
elseif (Current Route = YX)
$$NextRoute = \begin{cases} XY & \text{if } MAX[Load_{XY}] \le \alpha \cdot MAX[Load_{YX}] \\ YX & \text{if } MAX[Load_{XY}] > \alpha \cdot MAX[Load_{YX}] \end{cases}$$
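The margin rule can be written as a short function. This is a sketch under stated assumptions: the comparison operator in the switching branch is taken to be "less than or equal" (the complement of the ">" branch), and the name `next_route` and its argument names are made up for the example.

```python
def next_route(current: str, max_load_xy: float, max_load_yx: float, alpha: float) -> str:
    """Toggle the route only if the alternative's bottleneck is better by the margin alpha."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    if current == "XY":
        return "YX" if max_load_yx <= alpha * max_load_xy else "XY"
    if current == "YX":
        return "XY" if max_load_xy <= alpha * max_load_yx else "YX"
    raise ValueError("current must be 'XY' or 'YX'")
```

For example, with α = 3/4 a flow on XY whose alternative bottleneck is only 5% lighter stays on XY, which damps the back-and-forth toggling seen in the illustration.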
May 2, 2011 13
ATDOR Illustration 2
Step #   Re-Routed Flows         Average Delay
1        1->14, 1->15, 1->16     ∞
2        1->14, 1->15, 1->16     ∞
3        1->14, 1->15, 1->16
May 2, 2011 14
Problem #2
Coupling among flows sharing the same source.
Solution: Re-routing counters $C_{I,J}$ count routing changes of flows from source I to destination J ($F_{I,J}$). When $C_{I,J}$ reaches a limit $L_{I,J}$, routing of $F_{I,J}$ is locked.
A possible definition of the limits $L_{I,J}$:
$L_{I,J} = (I + J) \bmod 3$
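A minimal sketch of the re-routing counters, assuming the limit is L(I, J) = (I + J) mod 3 as suggested on the slide. The class `RerouteCounters` and its method names are hypothetical.

```python
from typing import Dict, Tuple


def reroute_limit(i: int, j: int) -> int:
    """Limit on routing changes for the flow from source i to destination j."""
    return (i + j) % 3


class RerouteCounters:
    """Counts routing changes per flow and locks a flow once its limit is reached."""

    def __init__(self) -> None:
        self._count: Dict[Tuple[int, int], int] = {}

    def changes_left(self, i: int, j: int) -> int:
        return max(reroute_limit(i, j) - self._count.get((i, j), 0), 0)

    def locked(self, i: int, j: int) -> bool:
        return self.changes_left(i, j) == 0

    def record_change(self, i: int, j: int) -> bool:
        """Register a route change; returns False (and does nothing) if the flow is locked."""
        if self.locked(i, j):
            return False
        self._count[(i, j)] = self._count.get((i, j), 0) + 1
        return True
```

Flows from the same source get different limits (for source 1, destinations 14, 15 and 16 get 0, 1 and 2), so they stop toggling at different times and the coupling between them is broken.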
May 2, 2011 15
Back to Illustration 2
Routing changes left per flow, with $L_{I,J} = (I + J) \bmod 3$:
Flow            Panel 1   Panel 2   Panel 3
1->16           2         1         0
1->15           1         0         0
1->14           0         0         0
Average Delay   ∞         73 ns     22 ns
May 2, 2011 16
Bring it all together
Routing changes left per flow, with $L_{I,J} = (I + J) \bmod 3$:
Flow    Panel 1   Panel 2   Panel 3   Panel 4   Panel 5
1->15   1         0         0         0         0
2->8    1         1         0         0         0
2->15   2         2         1         1         0
4->15   1         1         1         0         0
Average Delay values shown, in slide order: ∞, 22 ns, 22 ns, 14 ns
May 2, 2011 17
Centralized Adaptive Routing for NoCs - Architecture
Traffic load measurements aggregation into Traffic Load Maps.
Routing control.
Local traffic load measurements inside the routers.
May 2, 2011 18
Load Measurements Aggregation
An illustration of aggregation of load values in a 4X4 2D mesh.
A congestion value is written to each traffic load map every clock cycle.
May 2, 2011 19
ATDOR – Route Selection Circuit
• Combinatorial pipelined implementation.
Result every ATDOR clock cycle.
Maximally loaded links of the two alternatives are compared. Next route:
if (Current Route = XY)
$$NextRoute = \begin{cases} YX & \text{if } MAX[Load_{YX}] \le \alpha \cdot MAX[Load_{XY}] \\ XY & \text{if } MAX[Load_{YX}] > \alpha \cdot MAX[Load_{XY}] \end{cases}$$
elseif (Current Route = YX)
$$NextRoute = \begin{cases} XY & \text{if } MAX[Load_{XY}] \le \alpha \cdot MAX[Load_{YX}] \\ YX & \text{if } MAX[Load_{XY}] > \alpha \cdot MAX[Load_{YX}] \end{cases}$$
$0 < \alpha < 1$
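The comparison against α can be done with integers only when α is a fraction with a power-of-two denominator, such as the values 15/16 and 3/4 used in the results. The sketch below illustrates that idea; it is an assumption about how such a comparison might be arranged, not a description of the actual circuit, and `switch_route` with its `num` and `shift` parameters is hypothetical.

```python
def switch_route(max_alt: int, max_cur: int, num: int = 15, shift: int = 4) -> bool:
    """True if max_alt <= (num / 2**shift) * max_cur, evaluated without division.

    max_alt: load of the maximally loaded link on the alternative route
    max_cur: load of the maximally loaded link on the current route
    """
    if not 0 < num < (1 << shift):
        raise ValueError("num / 2**shift must lie strictly between 0 and 1")
    return (max_alt << shift) <= num * max_cur
```

Multiplying both sides by the denominator turns the scaling by α into a shift and a small constant multiplication, which avoids fractional arithmetic.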
May 2, 2011 20
Hardware Requirements
The whole mechanism was implemented on an xc5vlx50t Virtex-5 FPGA.
Estimated area for the 45 nm technology node.
Per-router hardware overheads in % for a NoC with typical size (50 KGates) virtual channel routers.
May 2, 2011 21
Average Packet Delay – Uniform Traffic
• Average delay vs. average link load, normalized to link capacity. 8X8 2D Mesh. Uniform traffic pattern.
May 2, 2011 22
Average Packet Delay – Transpose Traffic
• Average delay vs. average link load, normalized to link capacity. 8X8 2D Mesh. Transpose traffic pattern.
May 2, 2011 23
Average Packet Delay – Hotspot Traffic
• Average delay vs. average link load, normalized to link capacity. 8X8 2D Mesh. Traffic pattern with 4 hotspots.
May 2, 2011 24
Control Iteration Duration
• Number of re-routed flows vs. time.
• 8X8 2D Mesh, ATDOR clock of 100 MHz.
Panels: α = 15/16, α = 3/4
May 2, 2011 25
CMP DNUCA - Architecture
• 8X8 CMP DNUCA (Dynamic Non-Uniform Cache Architecture) with 8 CPUs and 56 cache banks:
May 2, 2011 26
CMP DNUCA – Saturation Throughput
• Saturation throughput - SPLASH-2 and PARSEC benchmarks on 8X8 CMP DNUCA with 8 CPUs and 56 cache banks:
May 2, 2011 27
Conclusions
• Centralized adaptive routing is feasible for NoCs.
ATDOR: Centralized selection between XY and YX for each source-destination pair.
Hardware overhead: <4% of a typical 8X8 NoC.
Average saturation throughput improvement:
Workload                          Vs. RCA   Vs. O1TURN
Synthetic Patterns                12.1%     19.3%
SPLASH-2 and PARSEC Benchmarks    12.8%     22.8%