A Cost Effective Centralized Adaptive Routing for Networks on Chip
Ran Manevich, Israel Cidon, Avinoam Kolodny, Isask’har (Zigi) Walter and Shmuel Wimer
Technion – Israel Institute of Technology
QNoC Research Group
Adaptive Routing in NoCs – Local vs. Global Information
2D Mesh NoC
[Figure labels: Source, Destination, "I CAN MAKE IT!!!". Legend: Low Congestion, Medium Congestion, High Congestion.]
A packet routed from the upper left to the bottom right corner utilizing local congestion information.
The same packet routed using global information.
Global traffic information is essential to make the right decision!
Route Selection - ATDOR
ATDOR - Adaptive Toggle Dimension Ordered Routing
Keep it simple!
Centralized selection: XY or YX.
Routing tables in sources. One bit per destination.
The option with the less congested bottleneck link is preferred.
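The two candidate routes and the one-bit-per-destination table can be pictured with a small Python sketch. It is an illustration only, under assumptions the slides do not state: nodes are addressed by (x, y) coordinates, bit value 0 stands for XY and 1 for YX, and the names `xy_path`, `yx_path` and `SourceRoutingTable` are hypothetical.

```python
from typing import List, Tuple

Node = Tuple[int, int]      # (x, y) mesh coordinates (assumed addressing)
Link = Tuple[Node, Node]    # directed link between neighbouring routers


def _walk(start: Node, axis: int, target: int) -> List[Link]:
    """Collect the links crossed while moving along one axis (0 = x, 1 = y)."""
    links, cur = [], list(start)
    step = 1 if target > cur[axis] else -1
    while cur[axis] != target:
        nxt = list(cur)
        nxt[axis] += step
        links.append((tuple(cur), tuple(nxt)))
        cur = nxt
    return links


def xy_path(src: Node, dst: Node) -> List[Link]:
    """Dimension-ordered route: travel along X first, then along Y."""
    return _walk(src, 0, dst[0]) + _walk((dst[0], src[1]), 1, dst[1])


def yx_path(src: Node, dst: Node) -> List[Link]:
    """Dimension-ordered route: travel along Y first, then along X."""
    return _walk(src, 1, dst[1]) + _walk((src[0], dst[1]), 0, dst[0])


class SourceRoutingTable:
    """One bit per destination: 0 selects XY, 1 selects YX (assumed encoding)."""

    def __init__(self) -> None:
        self.bits = 0

    def set_route(self, dest_index: int, use_yx: bool) -> None:
        if use_yx:
            self.bits |= 1 << dest_index
        else:
            self.bits &= ~(1 << dest_index)

    def route(self, dest_index: int) -> str:
        return "YX" if (self.bits >> dest_index) & 1 else "XY"
```

Both paths have the same length, so the choice between them only changes which links carry the flow.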
ATDOR Illustration 1
Five identical flows, 100 MB/s each.
Links modeled as M/M/1 queues. Delay of a single link:
$D_{LINK} = \frac{Traffic}{Capacity - Traffic}$
Link capacity is 210 MB/s.
Initial routing - XY
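As a quick check of the numbers on this slide, the sketch below evaluates a single-link delay term for one, two and three 100 MB/s flows sharing a 210 MB/s link. It assumes the delay expression reads Traffic / (Capacity - Traffic); the function name `link_delay` is hypothetical, and the result is left unitless because the slide does not state the scaling.

```python
import math


def link_delay(traffic: float, capacity: float) -> float:
    """M/M/1-style delay term, assumed form: Traffic / (Capacity - Traffic)."""
    if traffic >= capacity:
        return math.inf  # offered load at or above capacity: the queue never drains
    return traffic / (capacity - traffic)


CAPACITY = 210.0  # MB/s, from the slide
FLOW = 100.0      # MB/s per flow, from the slide

for n_flows in (1, 2, 3):
    print(n_flows, link_delay(n_flows * FLOW, CAPACITY))
```

Two flows still fit on one link (200 < 210), while a third flow saturates it, which is why an unbalanced routing yields an infinite average delay.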
Centralized Routing – How?
Option 1 – Continuous calculation of optimal routing for the active sessions:
Achievable load balancing
Speed and computation complexity
System complexity
Centralized Routing – How?
Option 2 – Iterative serial selection based on traffic load measurements between XY and YX for all source-destination pairs:
Achievable load balancing
Speed and computation complexity
System complexity
ATDOR illustration 1
Step #   Re-Routed Flow   Average Delay
1        1->15            ∞
2        2->8             37 ns
3        2->15            22 ns
What did we just see?
For each flow we:
1. Calculated the better route.
2. Updated routing table of the source.
3. Waited for the update to take effect and measured global traffic load.
Performing steps 1-3 for each flow is slow and not scalable.
Steps 2 and 3 are unified for all destinations of a single source:
Achievable load balancing
Speed and computation complexity
Scalability
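A rough count shows why unifying the table update and the measurement per source matters. The sketch assumes that every source-destination pair is active and that each routing-table update is followed by exactly one wait-and-measure period; `update_rounds` is a hypothetical name.

```python
from typing import Tuple


def update_rounds(num_nodes: int) -> Tuple[int, int]:
    """Return (rounds if done per flow, rounds if done per source)."""
    per_flow = num_nodes * (num_nodes - 1)  # one round per source-destination pair
    per_source = num_nodes                  # one round covers all destinations of a source
    return per_flow, per_source


print(update_rounds(16))  # 4X4 mesh
print(update_rounds(64))  # 8X8 mesh
```

Under these assumptions an 8X8 mesh needs 4032 rounds per control iteration when handled flow by flow, against 64 when handled source by source.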
Back to illustration 1
Step #   Re-Routed Flows   Average Delay
1        1->15             ∞
2        2->8, 2->15       22 ns
3        4->15             22 ns
4        1->15             22 ns
5        2->8, 2->15       ∞
Problem #1
Changing routing may increase congestion and cause fluctuations.
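Solution: Change routing only if the alternative is better by the margin α, 0 < α < 1:
If Current Route = XY:
$$NextRoute = \begin{cases} YX & \text{if } MAX[Load_{YX}] \le \alpha \cdot MAX[Load_{XY}] \\ XY & \text{if } MAX[Load_{YX}] > \alpha \cdot MAX[Load_{XY}] \end{cases}$$
Else if Current Route = YX:
$$NextRoute = \begin{cases} XY & \text{if } MAX[Load_{XY}] \le \alpha \cdot MAX[Load_{YX}] \\ YX & \text{if } MAX[Load_{XY}] > \alpha \cdot MAX[Load_{YX}] \end{cases}$$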
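The margin rule can be written as a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name `next_route` and the default α = 0.75 are assumptions, and the two arguments are the bottleneck (maximum) link loads of the XY and YX alternatives.

```python
def next_route(current: str, max_load_xy: float, max_load_yx: float,
               alpha: float = 0.75) -> str:
    """Toggle only when the alternative's bottleneck is at most alpha times the current one."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if current == "XY":
        return "YX" if max_load_yx <= alpha * max_load_xy else "XY"
    if current == "YX":
        return "XY" if max_load_xy <= alpha * max_load_yx else "YX"
    raise ValueError("current must be 'XY' or 'YX'")


# The alternative is better, but not by the margin: stay on XY.
print(next_route("XY", max_load_xy=100.0, max_load_yx=90.0))
# The alternative is better by more than the margin: switch to YX.
print(next_route("XY", max_load_xy=100.0, max_load_yx=70.0))
```

Because α < 1, a flow that has just switched does not switch back unless the loads shift again by at least the same margin, which damps the fluctuations described above.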
ATDOR illustration 2
Step #   Re-Routed Flows        Average Delay
1        1->14, 1->15, 1->16    ∞
2        1->14, 1->15, 1->16    ∞
3        1->14, 1->15, 1->16    (not given)
Problem #2
Coupling among flows sharing the same source.
Solution: Re-Routing counters
$C_{I,J}$ count routing changes of flows from source I to destination J ($F_{I,J}$). When $C_{I,J}$ reaches a limit $L_{I,J}$, routing of $F_{I,J}$ is locked. A possible definition of limits $L_{I,J}$:
$L_{I,J} = (I + J) \bmod 3$
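The counter mechanism can be sketched as follows, assuming the limit reads (I + J) mod 3 and that I and J are the integer node indices used in the flow labels. The class name `RerouteCounters` and its methods are hypothetical.

```python
from typing import Dict, Tuple


class RerouteCounters:
    """Per-flow count of routing changes; a flow is locked once it reaches its limit."""

    def __init__(self) -> None:
        self.changes: Dict[Tuple[int, int], int] = {}

    @staticmethod
    def limit(src: int, dst: int) -> int:
        return (src + dst) % 3  # assumed definition of the limit

    def changes_left(self, src: int, dst: int) -> int:
        return self.limit(src, dst) - self.changes.get((src, dst), 0)

    def try_toggle(self, src: int, dst: int) -> bool:
        """Record one routing change if the flow is not locked; report whether it was allowed."""
        if self.changes_left(src, dst) <= 0:
            return False
        self.changes[(src, dst)] = self.changes.get((src, dst), 0) + 1
        return True


counters = RerouteCounters()
for dst in (14, 15, 16):
    print(f"1->{dst}: changes left = {counters.changes_left(1, dst)}")
```

Flows from the same source get different limits (0, 1 and 2 for 1->14, 1->15 and 1->16), so they stop toggling at different steps and no longer move together.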
Back to illustration 2
$L_{I,J} = (I + J) \bmod 3$
R. Changes Left per flow, frame by frame:
Flows           Frame 1   Frame 2   Frame 3
1->16           2         1         0
1->15           1         0         0
1->14           0         0         0
Average Delay   ∞         73 ns     22 ns
Bring it all together
$L_{I,J} = (I + J) \bmod 3$
R. Changes Left per flow, frame by frame:
Flows           Frame 1   Frame 2       Frame 3   Frame 4   Frame 5
1->15           1         0             0         0         0
2->8            1         1             0         0         0
2->15           2         2             1         1         0
4->15           1         1             1         0         0
Average Delay   ∞         (not given)   22 ns     22 ns     14 ns
Centralized Adaptive Routing for NoCs - Architecture
Traffic load measurements aggregation into Traffic Load Maps.
Routing control.
Local traffic load measurements inside the routers.
Load Measurements Aggregation
An illustration of aggregation of load values in a 4X4 2D mesh.
A congestion value is written to each traffic load map every clock cycle.
ATDOR – Route Selection Circuit
Combinatorial pipelined implementation.
Result every ATDOR clock cycle.
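Maximally loaded links of the two alternatives are compared. Next route, with 0 < α < 1:
If Current Route = XY:
$$NextRoute = \begin{cases} YX & \text{if } MAX[Load_{YX}] \le \alpha \cdot MAX[Load_{XY}] \\ XY & \text{if } MAX[Load_{YX}] > \alpha \cdot MAX[Load_{XY}] \end{cases}$$
Else if Current Route = YX:
$$NextRoute = \begin{cases} XY & \text{if } MAX[Load_{XY}] \le \alpha \cdot MAX[Load_{YX}] \\ YX & \text{if } MAX[Load_{XY}] > \alpha \cdot MAX[Load_{YX}] \end{cases}$$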
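A software analogue of this comparison is sketched below. It does not describe the actual circuit: it assumes integer load values stored in a dictionary keyed by link, takes the links of each alternative as plain lists, and expresses α as a fraction (15/16 here) so that the test needs no division. All names are hypothetical.

```python
from typing import Dict, Hashable, Iterable


def bottleneck(links: Iterable[Hashable], load_map: Dict[Hashable, int]) -> int:
    """Load of the maximally loaded link on a path (0 for an empty path)."""
    return max((load_map.get(link, 0) for link in links), default=0)


def select_route(current: str,
                 xy_links: Iterable[Hashable],
                 yx_links: Iterable[Hashable],
                 load_map: Dict[Hashable, int],
                 alpha_num: int = 15,
                 alpha_den: int = 16) -> str:
    """Compare the two bottlenecks and apply the margin alpha = alpha_num / alpha_den."""
    if current not in ("XY", "YX"):
        raise ValueError("current must be 'XY' or 'YX'")
    max_xy = bottleneck(xy_links, load_map)
    max_yx = bottleneck(yx_links, load_map)
    if current == "XY":
        cur, alt, alt_name = max_xy, max_yx, "YX"
    else:
        cur, alt, alt_name = max_yx, max_xy, "XY"
    # alt <= alpha * cur, cross-multiplied to stay in integers.
    # Note: with the "<=" form, two idle paths (both bottlenecks 0) select the alternative.
    return alt_name if alt * alpha_den <= alpha_num * cur else current


loads = {"a": 40, "b": 90, "c": 50, "d": 60}
print(select_route("XY", ["a", "b"], ["c", "d"], loads))
```

With a power-of-two denominator the multiplications reduce to shifts and one subtraction, which is one plausible reason for values such as 15/16 and 3/4.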
Hardware Requirements
The whole mechanism was implemented on an xc5vlx50t Virtex-5 FPGA.
Estimated area for 45nm technology node.
Per-Router hardware overheads in % for a NoC with typical size (50 KGates) virtual channel routers.
Average Packet Delay – Uniform Traffic
Average delay vs. average load in links normalized to links capacity. 8X8 2D Mesh. Uniform traffic pattern.
Average Packet Delay – Transpose Traffic
Average delay vs. average load in links normalized to links capacity. 8X8 2D Mesh. Transpose traffic pattern.
Average Packet Delay – Hotspot Traffic
Average delay vs. average load in links normalized to links capacity. 8X8 2D Mesh. 4 Hotspots traffic pattern.
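The slides do not define the three synthetic patterns, so the sketch below uses commonly seen definitions as an assumption: uniform picks any other node at random, transpose sends (x, y) to (y, x), and hotspot sends a fraction of the packets to a fixed set of nodes. The hotspot positions and the `hot_fraction` parameter are hypothetical.

```python
import random
from typing import Sequence, Tuple

Node = Tuple[int, int]  # (x, y) coordinates in a size-by-size mesh


def uniform_dest(src: Node, size: int, rng: random.Random) -> Node:
    """Uniform traffic: any node other than the source, chosen at random."""
    while True:
        dst = (rng.randrange(size), rng.randrange(size))
        if dst != src:
            return dst


def transpose_dest(src: Node) -> Node:
    """Transpose traffic (one common definition): (x, y) sends to (y, x)."""
    x, y = src
    return (y, x)


def hotspot_dest(src: Node, size: int, hotspots: Sequence[Node],
                 hot_fraction: float, rng: random.Random) -> Node:
    """Hotspot traffic: a share of packets targets the hotspots, the rest is uniform."""
    candidates = [h for h in hotspots if h != src]
    if candidates and rng.random() < hot_fraction:
        return rng.choice(candidates)
    return uniform_dest(src, size, rng)


rng = random.Random(0)
print(transpose_dest((2, 5)))
print(hotspot_dest((0, 0), 8, [(1, 1), (1, 6), (6, 1), (6, 6)], 0.2, rng))
```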
Control Iteration Duration
Number of re-routed flows vs. time. 8X8 2D Mesh, ATDOR clock of 100 MHz.
Panels: α = 15/16 and α = 3/4.
CMP DNUCA - Architecture
8X8 CMP DNUCA (Dynamic Non Uniform Cache Array) with 8 CPUs and 56 cache banks:
CMP DNUCA – Saturation Throughput
Saturation throughput - Splash 2 and Parsec benchmarks on 8X8 CMP DNUCA with 8 CPUs and 56 cache banks:
Conclusions
Centralized adaptive routing is feasible for NoCs.
ATDOR: Centralized selection between XY and YX for each source-destination pair.
Hardware overhead: <4% of an 8X8 typical NoC.
Average saturation throughput improvement:
                                 Vs. RCA   Vs. O1TURN
Synthetic Patterns               12.1%     19.3%
Splash 2 and Parsec Benchmarks   12.8%     22.8%