Prediction Router:
Yet another low-latency on-chip router architecture
Hiroki Matsutani (Keio Univ., Japan), Michihiro Koibuchi (NII, Japan),
Hideharu Amano (Keio Univ., Japan), Tsutomu Yoshinaga (UEC, Japan)
Why is a low-latency router needed?
• Tile architecture
  – Many cores (e.g., processors & caches)
  – On-chip interconnection network: a packet-switched network of routers [Dally, DAC’01]
[Figure: 16-core tile architecture; a 4x4 grid of routers, each attached to a core]
The on-chip router affects the performance and cost of the chip.
System             | Topology          | Routing                | Switching           | Flow ctrl
MIT RAW            | 2-D mesh (32bit)  | XY DOR                 | WH, no VC           | Credit
UPMC SPIN          | Fat tree (32bit)  | Up*/down*              | WH, no VC           | Credit
QuickSilver ACM    | H-Tree (32bit)    | Up*/down*              | 1-flit, no VC       | Credit
UMass Amherst aSOC | 2-D mesh          | Shortest-path          | Pipelined CS, no VC | Timeslot
Sun T1             | Crossbar (128bit) | -                      | -                   | Handshake
Cell BE EIB        | Ring (128bit)     | Shortest-path          | Pipelined CS, no VC | Credit
TRIPS (operand)    | 2-D mesh (109bit) | YX DOR                 | 1-flit, no VC       | On/off
TRIPS (on-chip)    | 2-D mesh (128bit) | YX DOR                 | WH, 4 VCs           | Credit
Intel SCC          | 2-D torus (32bit) | XY,YX DOR, odd-even TM | WH, no VC           | Stall/go
TILE64 iMesh       | 2-D mesh (32bit)  | XY DOR                 | WH, no VC           | Credit
Intel 80-core NoC  | 2-D mesh (32bit)  | Source routing         | WH, 2 lanes         | On/off
As the number of cores increases (e.g., to 64 or more), the number of hops increases, and communication latency becomes a crucial problem.
Low-latency router architectures have therefore been studied extensively.
Outline: Prediction router for low-latency NoC
• Existing low-latency routers
  – Speculative router
  – Look-ahead router
  – Bypassing router
• Prediction router
  – Architecture and the prediction algorithms
• Hit rate analysis
• Evaluations
  – Hit rate, gate count, and energy consumption
  – Case study 1: 2-D mesh (small core size)
  – Case study 2: 2-D mesh (large core size)
  – Case study 3: Fat tree network
Wormhole router: Hardware structure
[Figure: wormhole router; five input FIFOs (X+, X-, Y+, Y-, CORE) feed a 5x5 crossbar, with an arbiter granting the output ports X+, X-, Y+, Y-, CORE]
Routing, arbitration, and switch traversal are performed in a pipelined manner:
1) selecting an output channel (routing)
2) arbitration for the selected output channel (GRANT)
3) sending the packet to the output channel
• At least 3 cycles for traversing a router
  – RC (Routing computation)
  – VSA (Virtual channel & switch allocations)
  – ST (Switch traversal)
• A packet transfer from router (a) to router (c)
[Pipeline chart: at each of routers A, B, and C, the HEAD flit takes RC, VSA, and ST (one cycle each), while DATA 1-3 take ST only; 12 cycles elapse in total.]
At least 12 cycles are needed to transfer a packet from router (a) to router (c).
[Pipeline chart: in the speculative router, VA & SA are speculatively performed in parallel as a single SA stage.]
To perform RC and VSA in parallel, look-ahead routing is used.
Pipeline structure: 3-cycle router
Speculative router: VA/SA in parallel [Peh,HPCA’01]
• At least 3 cycles for traversing a router
  – NRC (Next routing computation)
  – VSA (Virtual channel & switch allocations)
  – ST (Switch traversal)
[Pipeline chart: the HEAD flit takes NRC, VSA, and ST at router A; at routers B and C, NRC overlaps with VSA, so only VSA and ST appear on the critical path. DATA flits take ST only.]
VSA can be performed without waiting for NRC.
NRC is the routing computation for the next hop: the output port of router (i+1) is selected by router (i).
Look-ahead router: RC/VA in parallel
• At least 2 cycles for traversing a router
  – NRC + VSA (Next routing computation / arbitrations)
  – ST (Switch traversal)
[Pipeline chart: at routers A, B, and C, the HEAD flit takes NRC+VSA and then ST (two cycles per hop); DATA flits take ST only; 9 cycles elapse in total.]
There is no dependency between NRC & VSA, so NRC & VSA are performed in parallel. This is a typical example of a 2-cycle router.
Look-ahead router: RC/VA in parallel
At least 9 cycles are needed to transfer a packet from router (a) to router (c).
Packing NRC, VSA, and ST into a single stage would harm the operating frequency [Dally’s book, 2004].
Bypassing router: skip some stages
• Bypassing between intermediate nodes
  – E.g., Express VCs [Kumar, ISCA’07]
[Figure: virtual bypassing paths from SRC to DST; bypassed intermediate routers take 1 cycle instead of 3.]
Bypassing router: skip some stages
• Bypassing between intermediate nodes
  – E.g., Express VCs [Kumar, ISCA’07]
• Pipeline bypassing utilizing the regularity of DOR
  – E.g., Mad postman [Izu, PDP’94]
• Pipeline stages on frequently used paths are skipped
  – E.g., Dynamic fast path [Park, HOTI’07]
• Pipeline stages on user-specified paths are skipped
  – E.g., Preferred path [Michelogiannakis, NOCS’07]
  – E.g., DBP [Koibuchi, NOCS’08]
We propose a low-latency router based on multiple predictors
Prediction router for 1-cycle transfer
• Each input channel has predictors
• When an input channel is idle:
  – Predict an output port to be used (RC pre-execution)
  – Arbitrate to use the predicted port (SA pre-execution)
[Pipeline chart: baseline, in which the HEAD flit takes RC, VSA, and ST at routers A, B, and C; 12 cycles elapse.]
RC & VSA are skipped if the prediction hits → 1-cycle transfer.
E.g., we can expect a 1.6-cycle average transfer if 70% of predictions hit.
[Yoshinaga, IWIA’06][Yoshinaga, IWIA’07]
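The 1.6-cycle figure follows from a simple weighted average of the hit and miss cases; a minimal sketch (illustrative, using the hit/miss cycle counts from the slide):

```python
def expected_hop_latency(hit_rate, hit_cycles=1, miss_cycles=3):
    """Average cycles for a head flit to traverse one router:
    a hit skips RC and VSA (1 cycle), a miss falls back to the
    full 3-stage pipeline."""
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles

print(expected_hop_latency(0.7))  # 0.7*1 + 0.3*3 = 1.6 cycles per hop
```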
Prediction router for 1-cycle transfer
[Pipeline chart, progressive builds: the prediction misses at router A (the HEAD flit takes RC, VSA, and ST there), then hits at routers B and C (the HEAD flit takes ST only); DATA flits always take ST only.]
RC & VSA are skipped if the prediction hits → 1-cycle transfer.
E.g., we can expect a 1.6-cycle average transfer if 70% of predictions hit.
[Yoshinaga, IWIA’06][Yoshinaga, IWIA’07]
Prediction router: Prediction algorithms
• An efficient predictor is the key
• A single predictor isn’t enough for applications with different traffic patterns
• Prediction router
  – Multiple predictors for each input channel
  – Select one of them in response to a given network environment
[Figure: routers with predictors A, B, and C attached to each input channel]
[Yoshinaga, IWIA’06][Yoshinaga, IWIA’07]
1. Random
2. Static Straight (SS): an output channel on the same dimension is selected (exploiting the regularity of DOR)
3. Custom: the user can specify which output channel is accelerated
4. Latest Port (LP): the previously used output channel is selected
5. Finite Context Method (FCM): the most frequently appearing pattern of an n-context sequence (n = 0, 1, 2, …) [Burtscher, TC’02]
6. Sampled Pattern Match (SPM): pattern matching using a record table [Jacquet, TIT’02]
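As an illustration of how lightweight two of these predictors can be, here is a rough software sketch of LP and a first-order FCM (my own rendering of the idea, not the hardware design in the paper):

```python
from collections import Counter, defaultdict

class LatestPort:
    """LP: predict the output port used by the previous packet."""
    def __init__(self):
        self.last = None
    def predict(self):
        return self.last
    def update(self, port):
        self.last = port

class FCM1:
    """First-order FCM: for the current port (the context), predict
    the port that has most frequently followed it so far."""
    def __init__(self):
        self.table = defaultdict(Counter)  # context port -> next-port counts
        self.context = None
    def predict(self):
        counts = self.table.get(self.context)
        return counts.most_common(1)[0][0] if counts else None
    def update(self, port):
        if self.context is not None:
            self.table[self.context][port] += 1
        self.context = port

# An alternating X+/Y- stream defeats LP but is learned by FCM.
fcm, lp = FCM1(), LatestPort()
for port in ["X+", "Y-", "X+", "Y-", "X+"]:
    fcm.update(port)
    lp.update(port)
print(lp.predict())   # X+ (just repeats the last port)
print(fcm.predict())  # Y- (learned that Y- follows X+)
```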
[Figure: prediction router datapath: input FIFOs with predictors A, B, C, an arbiter, and a 5x5 crossbar; the predicted crossbar port X+ is reserved.]
1-cycle transfer using the reserved crossbar port when the prediction hits.
Basic operation @ correct prediction:
  Idle state: output port X+ is predicted, and the crossbar port is reserved
  1st cycle: the incoming flit is transferred to X+ without RC and VSA; RC is performed in parallel and confirms the prediction is correct
  2nd cycle: the next flit is transferred to X+ without RC and VSA
[Figure: prediction router datapath; the flit sent to the reserved port X+ turns out to be a dead flit, and a KILL signal is asserted.]
Even with a miss prediction, a flit is transferred in 3 cycles, as in the original router.
Basic operation @ miss prediction:
  Idle state: output port X+ is predicted, and the crossbar port is reserved
  1st cycle: the incoming flit is transferred to X+ without RC and VSA; RC is performed in parallel and finds the prediction wrong (X- is correct)
  The KILL signal to X+ is asserted
  2nd/3rd cycle: the dead flit is removed, and the flit is retransmitted to the correct port
  (Miss predictions cost extra energy for retransmission)
Prediction hit rate analysis
• Formulas to calculate the prediction hit rates on
  – 2-D torus (Random, LP, SS, FCM, and SPM)
  – 2-D mesh (Random, LP, SS, FCM, and SPM)
  – Fat tree (Random and LRU)
• Purpose: to forecast which prediction algorithm suits a given network environment without simulations
• The accuracy of the analytical model is confirmed through simulations
(Derivation of the formulas is omitted in this talk; see Section 4 of our paper for more detail)
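The derivation is in the paper, but the flavor of such an analysis can be checked numerically. As a hypothetical illustration (not the paper's model), the SS predictor's hit condition under XY dimension-order routing is simply "the packet continues straight", which a short Monte Carlo estimate can count:

```python
import random

def ss_hit_rate(k, trials=20000, seed=0):
    """Estimate the SS hit rate on a k x k mesh with XY routing and
    uniform random traffic: the fraction of router traversals in
    which the input and output directions coincide."""
    rng = random.Random(seed)
    hits = total = 0
    for _ in range(trials):
        sx, sy = rng.randrange(k), rng.randrange(k)
        dx, dy = rng.randrange(k), rng.randrange(k)
        # Hop directions under XY routing: all X hops, then all Y hops.
        dirs = [("X", 1 if dx > sx else -1)] * abs(dx - sx)
        dirs += [("Y", 1 if dy > sy else -1)] * abs(dy - sy)
        # At each intermediate router, SS hits iff the packet goes straight.
        for prev, cur in zip(dirs, dirs[1:]):
            total += 1
            hits += (prev == cur)
    return hits / total

print(ss_hit_rate(8))  # longer straight runs on bigger meshes -> higher rate
```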
Evaluation items
Evaluation items: hit rate / communication latency (flit-level network simulation), area (gate count), and energy consumption [pJ/bit]
[Figure: CAD flow: Verilog designs synthesized with Design Compiler (Fujitsu 65nm library), placed & routed with Astro, simulated with NC-Verilog (SAIF/SDF), and analyzed with Power Compiler]
Table 1: Router & network parameters
  Packet length: 4 flits (1 flit = 64 bit)
  Switching technique: wormhole
  Channel buffer size: 4 flits / VC
  Number of VCs: 1 or 2
  Cycles / hop (miss): 3 stages
  Cycles / hop (hit): 1 stage
  (Topology and traffic are mentioned later)

Table 2: Process library
  CMOS process: 65nm
  Core voltage: 1.20V
  Temperature: 25C

Table 3: CAD tools used
  Design Compiler: 2006.06
  Astro: 2007.03
3 case studies of prediction router
  Case studies 1 & 2: 2-D mesh network
  Case study 3: Fat tree network
• The most popular network topology (e.g., MIT’s RAW [Taylor, ISCA’04] and Intel’s 80-core [Vangal, ISSCC’07])
• Dimension-order routing (XY routing)
Here, we show the results of case studies 1 and 2 together.
Case study 1: Zero-load comm.latency
[Graph: communication latency [cycles] vs. network size (k-ary 2-mesh) for the original router, the Pred router (SS), and the Pred router (100% hit); uniform random traffic on 4x4 to 16x16 meshes]
(*) 1-cycle transfer for a correct prediction, 3-cycle for a wrong prediction
Latency is reduced by 35.8% for 8x8 cores and by 48.2% for 16x16 cores (simulation results; the analytical model shows the same result).
More latency is reduced (48% for k=16) as the network size increases.
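The trend that larger meshes benefit more can be reproduced with a back-of-the-envelope zero-load model; a sketch under assumed parameters (uniform traffic, 4-flit packets, average Manhattan distance on a k-ary 2-mesh), not the paper's exact simulator:

```python
def zero_load_latency(k, cycles_per_hop, packet_flits=4):
    """Zero-load latency on a k-ary 2-mesh under uniform traffic:
    average hop count times per-hop router delay, plus serialization
    of the remaining flits."""
    avg_hops = 2 * (k * k - 1) / (3 * k)  # mean Manhattan distance
    return avg_hops * cycles_per_hop + (packet_flits - 1)

for k in (4, 8, 16):
    base = zero_load_latency(k, 3.0)     # original 3-cycle router
    pred = zero_load_latency(k, 1.0)     # prediction router, 100% hit
    print(k, round(1 - pred / base, 3))  # latency reduction grows with k
```

The fixed serialization term matters less as the hop count grows, which is why the relative saving increases with network size.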
Case study 2: Hit rate @ 8x8 mesh
• SS: go straight → efficient for long straight communication
• LP: the last used port → efficient for short repeated communication
• FCM: frequently used pattern → an all-rounder!
[Graph: prediction hit rate [%] for 7 NAS Parallel Benchmark programs and 4 synthesized traffic patterns]
However, the effective bypassing policy depends on traffic patterns…
• Existing bypassing routers use only a static or a single bypassing policy
• The prediction router supports multiple predictors, which can be switched in a cycle, to accelerate a wider range of applications
Case study 2: Area & Energy
• Area (gate count): original router vs. Pred router (SS+LP) vs. Pred router (SS+LP+FCM)
  – Verilog-HDL designs synthesized with a 65nm library
  – Router area [kilo gates]: 6.4-15.9% increase, depending on the type and number of predictors
  – SS+LP is lightweight (small overhead); FCM is an all-rounder but requires counters
• Energy consumption: original router vs. Pred router (70% hit) vs. Pred router (100% hit)
  – Flit switching energy [pJ/bit]: miss predictions consume power; 9.5% increase if the hit rate is 70%
Latency is reduced by 35.8-48.2% with reasonable area/energy overheads.
This energy estimation is pessimistic:
  1. More energy is consumed in links, so the effect of the router energy overhead is reduced
  2. Applications finish earlier, so more energy is saved
3 case studies of prediction router
  Case studies 1 & 2: 2-D mesh network
  Case study 3: Fat tree network
Case study 3: Fat tree network
[Figure: fat tree with up/down links]
1. LRU algorithm: the LRU output port is selected for upward transfers
2. LRU + LP algorithm: in addition, LP is used for downward transfers
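A minimal sketch of the LRU up-port choice (my illustration; in a fat tree, any up port reaches the same set of destinations, so the predictor can rotate through them):

```python
class LRUUpPort:
    """Reserve the least-recently-used upward port; any up port is a
    valid choice for an upward transfer in a fat tree."""
    def __init__(self, up_ports):
        self.order = list(up_ports)   # front = least recently used
    def predict(self):
        return self.order[0]
    def update(self, used_port):
        self.order.remove(used_port)
        self.order.append(used_port)  # most recently used to the back

lru = LRUUpPort(["up0", "up1"])
print(lru.predict())   # up0
lru.update("up0")
print(lru.predict())   # up1
```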
Case study 3: Fat tree network
• Comm. latency @ uniform traffic: original router vs. Pred router (LRU) vs. Pred router (LRU + LP)
[Graph: communication latency [cycles] vs. network size (number of cores)]
Latency is reduced by 30.7% at 256 cores, with a small area overhead (7.8%).
Summary of the prediction router
• Prediction router for low-latency NoCs
  – Multiple predictors, which can be switched in a cycle
  – Architecture and six prediction algorithms
  – Analytical model of prediction hit rates
• Evaluations of the prediction router
  – Case study 1: 2-D mesh (small core size)
  – Case study 2: 2-D mesh (large core size)
  – Case study 3: Fat tree network
• Results (from the three case studies)
  1. The prediction router can be applied to various NoCs
  2. Communication latency is reduced with small overheads: latency reduction up to 48%, area overhead 6.4% (SS+LP), energy overhead 9.5% (worst case) (from case studies 1 & 2)
  3. The prediction router with multiple predictors can accelerate a wider range of applications
Thank you
for your attention
It would be very helpful if you would speak slowly. Thank you in advance.
[Figure: prediction router datapath with predictors A, B, C per input channel, KILL signals, an arbiter, and a 5x5 crossbar]
Prediction router: New modifications
• Predictors for each input channel
• KILL mechanism to remove dead flits
• Two-level arbiter
  – “Reservation” has higher priority
  – “Tentative reservation” by the pre-execution of VSA
Currently, the critical path is related to the arbiter.
Prediction router: Predictor selection
• Static scheme: a predictor is selected by the user per application
  – Configuration table (e.g., Application 1 → Predictor B, Application 2 → Predictor A, Application 3 → Predictor C)
  – Simple, but pre-analysis is needed
• Dynamic scheme: a predictor is adaptively selected
  – Each predictor counts up when it hits (e.g., Predictor A: 100, Predictor B: 80, Predictor C: 120)
  – A predictor is selected every n cycles (e.g., n = 10,000)
  – Flexible, but consumes more energy
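The dynamic scheme can be sketched as hit counters plus a periodic switch (an illustration of the idea, with a hypothetical constant predictor standing in for the real ones):

```python
class Const:
    """Hypothetical stand-in predictor that always guesses one port."""
    def __init__(self, port):
        self.port = port
    def predict(self):
        return self.port
    def update(self, port):
        pass

class DynamicSelector:
    """Run all predictors in parallel, count their hits, and every
    `window` observations switch to the best-scoring one."""
    def __init__(self, predictors, window=10_000):
        self.predictors = predictors          # name -> predictor
        self.hits = dict.fromkeys(predictors, 0)
        self.window, self.count = window, 0
        self.active = next(iter(predictors))  # initial choice
    def predict(self):
        return self.predictors[self.active].predict()
    def observe(self, actual_port):
        for name, p in self.predictors.items():
            self.hits[name] += (p.predict() == actual_port)
            p.update(actual_port)
        self.count += 1
        if self.count % self.window == 0:
            self.active = max(self.hits, key=self.hits.get)
            self.hits = dict.fromkeys(self.hits, 0)

sel = DynamicSelector({"A": Const("X+"), "B": Const("Y+")}, window=4)
for _ in range(4):
    sel.observe("Y+")  # predictor B keeps hitting
print(sel.active)      # B
```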
Case study 1: Router critical path
• RC: routing computation; VSA: arbitration; ST: switch traversal
[Graph: stage delay [FO4s] for the original router vs. the Pred router (SS)]
The critical path delay increases by 6.2% compared with the original router.
(ST can occur in any of these stages of the prediction router)
Case study 2: Hit rate @ 8x8 mesh
• SS: go straight → efficient for long straight communication
• LP: the last used port → efficient for short repeated communication
• FCM: frequently used pattern → an all-rounder!
• Custom: user-specified path → efficient for simple communication
[Graph: prediction hit rate [%] for 7 NAS Parallel Benchmark programs and 4 synthesized traffic patterns]
Case study 4: Spidergon network
• Spidergon topology [Coppola, ISSOC’04]
  – Ring + across links
  – Each router has 3 ports
  – Mesh-like 2-D layout
  – Across-first routing
• Hit rate @ uniform traffic
  – SS: go straight; LP: the last used port; FCM: the most frequently used port
[Graph: prediction hit rate [%] vs. network size (number of cores)]
Hit rates of SS & FCM are almost the same.
A high hit rate is achieved (80% for 64 cores; 94% for 256 cores).
4 case studies of prediction router
  Case studies 1 & 2: 2-D mesh network
  Case study 3: Fat tree network
  Case study 4: Spidergon network