Prediction Router:
![Page 1: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/1.jpg)
Prediction Router: Yet another low-latency on-chip router architecture
Hiroki Matsutani (Keio Univ., Japan), Michihiro Koibuchi (NII, Japan),
Hideharu Amano (Keio Univ., Japan), Tsutomu Yoshinaga (UEC, Japan)
![Page 2: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/2.jpg)
Why is a low-latency router needed?
• Tile architecture
– Many cores (e.g., processors & caches)
– On-chip interconnection network: packet-switched network [Dally, DAC’01]
[Figure: 16-core tile architecture; each tile consists of a core and a router]
The on-chip router affects the performance and cost of the chip
![Page 3: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/3.jpg)
Why is a low-latency router needed?

| System | Topology | Routing | Switching | Flow ctrl |
|---|---|---|---|---|
| MIT RAW | 2D mesh (32bit) | XY DOR | WH, no VC | Credit |
| UPMC SPIN | Fat Tree (32bit) | Up*/down* | WH, no VC | Credit |
| QuickSilver ACM | H-Tree (32bit) | Up*/down* | 1-flit, no VC | Credit |
| UMass Amherst aSOC | 2D mesh | Shortest-path | Pipelined CS, no VC | Timeslot |
| Sun T1 | Crossbar (128bit) | - | - | Handshake |
| Cell BE EIB | Ring (128bit) | Shortest-path | Pipelined CS, no VC | Credit |
| TRIPS (operand) | 2D mesh (109bit) | YX DOR | 1-flit, no VC | On/off |
| TRIPS (on-chip) | 2D mesh (128bit) | YX DOR | WH, 4 VCs | Credit |
| Intel SCC | 2D torus (32bit) | XY,YX DOR, odd-even TM | WH, no VC | Stall/go |
| TILE64 iMesh | 2D mesh (32bit) | XY DOR | WH, no VC | Credit |
| Intel 80-core NoC | 2-D mesh (32bit) | Source routing | WH, 2 lanes | On/off |

• The number of cores increases (e.g., 64 cores or more?)
• The number of hops increases, so communication latency becomes a crucial problem
• Low-latency router architectures have been extensively studied
![Page 4: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/4.jpg)
Outline: Prediction router for low-latency NoC
• Existing low-latency routers
– Speculative router
– Look-ahead router
– Bypassing router
• Prediction router
– Architecture and the prediction algorithms
• Hit rate analysis
• Evaluations
– Hit rate, gate count, and energy consumption
– Case study 1: 2-D mesh (small core size)
– Case study 2: 2-D mesh (large core size)
– Case study 3: Fat tree network
![Page 5: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/5.jpg)
Wormhole router: Hardware structure
[Figure: five input ports (X+, X-, Y+, Y-, CORE), each with a FIFO, feed a 5x5 crossbar controlled by an arbiter (GRANT signals); five output ports (X+, X-, Y+, Y-, CORE)]
Routing, arbitration, and switch traversal are performed in a pipelined manner:
1) selecting an output channel
2) arbitration for the selected output channel
3) sending the packet to the output channel
![Page 6: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/6.jpg)
Pipeline structure: 3-cycle router
Speculative router: VA/SA in parallel [Peh, HPCA’01]
• At least 3 cycles for traversing a router
– RC (Routing computation)
– VSA (Virtual channel & switch allocations); VA & SA are speculatively performed in parallel
– ST (Switch traversal)
• A packet transfer from router (a) to router (c)
[Timing diagram: HEAD, DATA 1, DATA 2, and DATA 3 flits pass through RC, VSA (SA for data flits), and ST at routers A, B, and C over elapsed cycles 1 to 12]
At least 12 cycles for transferring a packet from router (a) to router (c)
To perform RC and VSA in parallel, look-ahead routing is used
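The 12-cycle figure on this slide follows from simple pipeline arithmetic. The sketch below is a minimal illustration, assuming zero load, no link delay, and body flits that trail the head flit by one cycle each; the function name and parameters are illustrative and do not come from the talk.

```python
def zero_load_packet_latency(routers: int, cycles_per_router: int, flits: int) -> int:
    """Cycles until the last flit leaves the last router, with no contention.

    Assumption (not from the slides): links add no extra cycles.
    """
    if routers < 1 or cycles_per_router < 1 or flits < 1:
        raise ValueError("all arguments must be positive")
    head_latency = routers * cycles_per_router  # head flit passes every pipeline stage
    body_latency = flits - 1                    # remaining flits follow one cycle apart
    return head_latency + body_latency


if __name__ == "__main__":
    # 4-flit packet (HEAD + DATA 1..3) through routers A, B, and C
    print(zero_load_packet_latency(routers=3, cycles_per_router=3, flits=4))
    print(zero_load_packet_latency(routers=3, cycles_per_router=2, flits=4))
```

Under these assumptions the same function gives 12 cycles for the 3-cycle router and 9 cycles for a 2-cycle router.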
![Page 7: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/7.jpg)
Look-ahead router: RC/VA in parallel
• At least 3 cycles for traversing a router
– NRC (Next routing computation)
– VSA (Virtual channel & switch allocations)
– ST (Switch traversal)
• Routing computation for the next hop: the output port of router (i+1) is selected by router i
• VSA can be performed without waiting for NRC
[Timing diagram: HEAD, DATA 1, DATA 2, and DATA 3 flits pass through NRC, VSA (SA for data flits), and ST at routers A, B, and C over elapsed cycles 1 to 12]
![Page 8: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/8.jpg)
Look-ahead router: RC/VA in parallel
Typical example of a 2-cycle router [Dally’s book, 2004]
• At least 2 cycles for traversing a router
– NRC + VSA (Next routing computation / arbitrations)
– ST (Switch traversal)
• There is no dependency between NRC & VSA, so NRC & VSA are performed in parallel
[Timing diagram: HEAD, DATA 1, DATA 2, and DATA 3 flits pass through NRC+VSA and ST at routers A, B, and C over elapsed cycles 1 to 9]
At least 9 cycles for transferring a packet from router (a) to router (c)
Packing NRC, VSA, and ST into a single stage would harm the operating frequency
![Page 9: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/9.jpg)
Bypassing router: skip some stages
• Bypassing between intermediate nodes
– E.g., Express VCs [Kumar, ISCA’07]
[Figure: virtual bypassing paths from SRC to DST; ordinary routers take 3 cycles, bypassed routers take 1 cycle]
![Page 10: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/10.jpg)
Bypassing router: skip some stages
• Bypassing between intermediate nodes
– E.g., Express VCs [Kumar, ISCA’07]
• Pipeline bypassing utilizing the regularity of DOR
– E.g., Mad postman [Izu, PDP’94]
• Pipeline stages on frequently used paths are skipped
– E.g., Dynamic fast path [Park, HOTI’07]
• Pipeline stages on user-specific paths are skipped
– E.g., Preferred path [Michelogiannakis, NOCS’07]
– E.g., DBP [Koibuchi, NOCS’08]
We propose a low-latency router based on multiple predictors
![Page 11: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/11.jpg)
![Page 12: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/12.jpg)
Prediction router for 1-cycle transfer [Yoshinaga, IWIA’06] [Yoshinaga, IWIA’07]
• Each input channel has predictors
• When an input channel is idle,
– Predict an output port to be used (RC pre-execution)
– Arbitrate for the predicted port (SA pre-execution)
• RC & VSA are skipped if the prediction hits, giving a 1-cycle transfer
[Timing diagram: baseline 3-cycle transfer; HEAD, DATA 1, DATA 2, and DATA 3 flits pass through RC, VSA, and ST at routers A, B, and C over elapsed cycles 1 to 12]
E.g., we can expect a 1.6-cycle transfer if 70% of predictions hit (0.7 × 1 cycle + 0.3 × 3 cycles)
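The 1.6-cycle estimate is a weighted average of the hit and miss cases. A minimal sketch, assuming a hit costs 1 cycle and a miss costs the full 3-cycle pipeline (the function name is illustrative):

```python
def expected_cycles_per_router(hit_rate: float, hit_cycles: int = 1, miss_cycles: int = 3) -> float:
    """Average router traversal time for a given prediction hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


if __name__ == "__main__":
    # 70% hit rate, as in the slide's example
    print(round(expected_cycles_per_router(0.7), 2))
```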
![Page 13: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/13.jpg)
Prediction router for 1-cycle transfer: miss at router A
[Timing diagram: the prediction misses at router A, so the HEAD flit takes RC, VSA, and ST there; routers B and C are still shown with the 3-cycle pipeline]
![Page 14: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/14.jpg)
Prediction router for 1-cycle transfer: miss at router A, hit at router B
[Timing diagram: miss at router A (RC, VSA, ST); hit at router B (ST only); router C is still shown with the 3-cycle pipeline]
![Page 15: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/15.jpg)
Prediction router for 1-cycle transfer: miss at router A, hits at routers B and C
[Timing diagram: miss at router A (RC, VSA, ST); hits at routers B and C (ST only), so the packet arrives earlier than in the 12-cycle baseline]
![Page 16: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/16.jpg)
Prediction router: Prediction algorithms [Yoshinaga, IWIA’06] [Yoshinaga, IWIA’07]
• An efficient predictor is key
• A single predictor isn’t enough for applications with different traffic patterns
• Prediction router
– Multiple predictors (A, B, C) for each input channel
– Select one of them in response to a given network environment
Prediction algorithms:
1. Random
2. Static Straight (SS): An output channel on the same dimension is selected (exploiting the regularity of DOR)
3. Custom: The user can specify which output channel is accelerated
4. Latest Port (LP): The previously used output channel is selected
5. Finite Context Method (FCM) [Burtscher, TC’02]: The most frequently appearing pattern of an n-context sequence (n = 0, 1, 2, …) is selected
6. Sampled Pattern Match (SPM) [Jacquet, TIT’02]: Pattern matching using a record table
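To make three of these algorithms concrete, here is a rough software sketch of SS, LP, and FCM as per-input-channel predictors. It is not the hardware design from the talk: the class names, the `predict`/`update` interface, and the convention that a flit arriving on input port X- goes straight to output port X+ are all assumptions made for illustration.

```python
from collections import Counter, defaultdict, deque

# Assumed convention: a flit arriving on input X- continues straight to output X+.
STRAIGHT_OUTPUT = {"X-": "X+", "X+": "X-", "Y-": "Y+", "Y+": "Y-"}


class StaticStraight:
    """SS: always predict the output on the same dimension as the input."""
    def __init__(self, input_port):
        self.input_port = input_port

    def predict(self):
        return STRAIGHT_OUTPUT.get(self.input_port)  # None for the CORE input

    def update(self, actual_output):
        pass  # static: nothing to learn


class LatestPort:
    """LP: predict the output port used by the previous packet."""
    def __init__(self):
        self.last = None

    def predict(self):
        return self.last

    def update(self, actual_output):
        self.last = actual_output


class FiniteContext:
    """FCM: predict the port seen most often after the last n outputs."""
    def __init__(self, n=1):
        self.history = deque(maxlen=n)
        self.table = defaultdict(Counter)

    def predict(self):
        counts = self.table.get(tuple(self.history))
        return counts.most_common(1)[0][0] if counts else None

    def update(self, actual_output):
        self.table[tuple(self.history)][actual_output] += 1
        self.history.append(actual_output)


def hit_rate(predictor, outputs):
    """Fraction of packets whose output port was predicted correctly."""
    hits = 0
    for actual in outputs:
        hits += predictor.predict() == actual
        predictor.update(actual)
    return hits / len(outputs)


if __name__ == "__main__":
    trace = ["X+", "Y+"] * 4  # hypothetical alternating traffic on one input channel
    for p in (StaticStraight("X-"), LatestPort(), FiniteContext(n=1)):
        print(type(p).__name__, hit_rate(p, trace))
```

On the alternating trace, LP never hits while FCM learns the pattern after a short warm-up, which illustrates why a single predictor does not suit every traffic pattern.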
![Page 17: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/17.jpg)
Basic operation @ Correct prediction
[Figure: router with predictors (A, B, C) on each input channel, input FIFOs (X+, X-, Y+, Y-, CORE), an arbiter, and a 5x5 crossbar; the crossbar port to X+ is reserved]
Idle state: Output port X+ is selected and reserved
1st cycle: The incoming flit is transferred to X+ without RC and VSA
1st cycle: RC is also performed; the prediction is correct!
2nd cycle: The next flit is transferred to X+ without RC and VSA
1-cycle transfer using the reserved crossbar port when the prediction hits
![Page 18: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/18.jpg)
Basic operation @ Misprediction
[Figure: same router structure; a dead flit is sent to X+ and a KILL signal is asserted]
Idle state: Output port X+ is selected and reserved
1st cycle: The incoming flit is transferred to X+ without RC and VSA
1st cycle: RC is also performed; the prediction is wrong! (X- is correct)
The kill signal to X+ is asserted
2nd/3rd cycle: The dead flit is removed; the flit is retransmitted to the correct port
Even with a misprediction, a flit is transferred in 3 cycles, as in the original router
A misprediction consumes more energy because of the retransmission
![Page 19: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/19.jpg)
![Page 20: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/20.jpg)
Prediction hit rate analysis
• Formulas to calculate the prediction hit rates on
– 2-D torus (Random, LP, SS, FCM, and SPM)
– 2-D mesh (Random, LP, SS, FCM, and SPM)
– Fat tree (Random and LRU)
• Purpose: to forecast which prediction algorithm suits a given network environment without simulations
• The accuracy of the analytical model is confirmed through simulations
The derivation of the formulas is omitted in this talk (see “Section 4” of our paper for more detail)
![Page 21: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/21.jpg)
![Page 22: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/22.jpg)
Evaluation items
• Hit rate / Comm. latency (how many cycles?): flit-level network simulation
• Area (gate count)
• Energy cons. [pJ / bit]
Tools: Design Compiler (synthesis) with the Fujitsu 65nm library, Astro (place & route), NC-Verilog (simulation), Power Compiler (SAIF, SDF)

Table 1: Router & network parameters
| Packet length | 4-flit (1-flit: 64 bit) |
| Switching technique | wormhole |
| Channel buffer size | 4-flit / VC |
| Number of VCs | 1 or 2 VCs |
| Cycle / hop (miss) | 3 stage |
| Cycle / hop (hit) | 1 stage |
(*) Topology and traffic are mentioned later

Table 2: Process library
| CMOS process | 65nm |
| Core voltage | 1.20V |
| Temperature | 25C |

Table 3: CAD tools used
| Design compiler | 2006.06 |
| Astro | 2007.03 |
![Page 23: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/23.jpg)
3 case studies of prediction router
• Case studies 1 & 2: 2-D mesh network
– The most popular network topology: MIT’s RAW [Taylor, ISCA’04], Intel’s 80-core [Vangal, ISSCC’07]
– Dimension-order routing (XY routing)
• Case study 3: Fat tree network
Here, we show the results of case studies 1 and 2 together
![Page 24: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/24.jpg)
Case study 1: Zero-load comm. latency
Uniform random traffic on 4x4 to 16x16 meshes
[Plot: communication latency [cycles] vs. network size (k-ary 2-mesh) for the original router, Pred router (SS), and Pred router (100% hit); simulation results, and the analytical model also shows the same result]
(*) 1-cycle transfer for correct prediction, 3-cycle for wrong prediction
• 35.8% reduced for 8x8 cores
• 48.2% reduced for 16x16 cores
More latency is reduced (48% for k=16) as the network size increases
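The trend that larger meshes benefit more can be illustrated with a rough brute-force sketch. It is not the simulator or analytical model from the talk and is not expected to reproduce the reported percentages. Assumptions: XY routing, only head-flit router cycles are counted (no link or body-flit cycles), the SS predictor hits only when a packet goes straight through a router, and it misses at the source router (input from the core), at the turn, and at the destination router (output to the core). All names are illustrative.

```python
from itertools import product

# Assumed convention: leaving via output X+ means arriving on the next router's input X-,
# and the straight-through output for an input port is its opposite.
OPPOSITE = {"X-": "X+", "X+": "X-", "Y-": "Y+", "Y+": "Y-"}


def xy_outputs(src, dst):
    """Output ports taken hop by hop under XY dimension-order routing."""
    (sx, sy), (dx, dy) = src, dst
    x_dir = "X+" if dx > sx else "X-"
    y_dir = "Y+" if dy > sy else "Y-"
    return [x_dir] * abs(dx - sx) + [y_dir] * abs(dy - sy)


def router_cycles(src, dst, use_ss, hit=1, miss=3):
    """Sum of router traversal cycles for the head flit along the path."""
    outputs = xy_outputs(src, dst) + ["CORE"]  # the last router ejects to the core
    in_port = "CORE"                           # the first router receives from the core
    total = 0
    for out in outputs:
        ss_hit = use_ss and OPPOSITE.get(in_port) == out
        total += hit if ss_hit else miss
        in_port = OPPOSITE.get(out)
    return total


def average_reduction(k):
    """Relative reduction in router cycles over all source/destination pairs of a k x k mesh."""
    nodes = list(product(range(k), repeat=2))
    base = pred = 0
    for src in nodes:
        for dst in nodes:
            if src != dst:
                base += router_cycles(src, dst, use_ss=False)
                pred += router_cycles(src, dst, use_ss=True)
    return 1 - pred / base


if __name__ == "__main__":
    for k in (4, 8, 16):
        print(k, round(average_reduction(k), 3))
```

Longer paths contain a larger share of straight-through routers, so the reduction grows with k under these assumptions.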
![Page 25: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/25.jpg)
Case study 2: Hit rate @ 8x8 mesh
• SS: go straight
• LP: the last one
• FCM: frequently used pattern
[Plot: prediction hit rate [%] for 7 NAS parallel benchmark programs and 4 synthesized traffics; annotation: “Efficient for long straight comm.”]
![Page 26: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/26.jpg)
Case study 2: Hit rate @ 8x8 mesh
[Plot: same hit rate chart; annotations: “Efficient for long straight comm.” and “Efficient for short repeated comm.”]
![Page 27: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/27.jpg)
Case study 2: Hit rate @ 8x8 mesh
[Plot: same hit rate chart; annotations: “Efficient for long straight comm.”, “Efficient for short repeated comm.”, and “All-arounder!”]
• Existing bypassing routers use
– Only a static or a single bypassing policy
However, the effective bypassing policy depends on traffic patterns…
• Prediction router supports
– Multiple predictors, which can be switched in a cycle
– To accelerate a wider range of applications
![Page 28: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/28.jpg)
Case study 2: Area & Energy
• Area (gate count): Verilog-HDL designs, synthesized with a 65nm library
– Original router
– Pred router (SS + LP): light-weight (small overhead)
– Pred router (SS+LP+FCM): FCM is an all-arounder, but requires counters
[Plot: router area [kilo gates]]
6.4 - 15.9% increased, depending on the type and number of predictors
• Energy consumption
![Page 29: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/29.jpg)
Case study 2: Area & Energy
• Area (gate count): 6.4 - 15.9% increased, depending on the type and number of predictors
– Original router
– Pred router (SS + LP)
– Pred router (SS+LP+FCM)
• Energy consumption
– Original router
– Pred router (70% hit)
– Pred router (100% hit)
[Plots: router area [kilo gates]; flit switching energy [pJ / bit]]
Misprediction consumes power; energy is increased by 9.5% if the hit rate is 70%
This estimation is pessimistic:
1. More energy is consumed in links, so the effect of the router energy overhead is reduced
2. The application will finish earlier, so more energy is saved
Latency is reduced by 35.8%-48.2% with reasonable area/energy overheads
![Page 30: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/30.jpg)
![Page 31: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/31.jpg)
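Case study 3: Fat tree network
[Figure: fat tree with upward (Up) and downward (Down) transfers]
1. LRU algorithm: The LRU (least recently used) output port is selected for upward transfer
2. LRU + LP algorithm: In addition, LP is used for downward transfer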
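A minimal sketch of the LRU idea for upward transfers, assuming the router keeps its upward output ports in least-recently-used order; the class name, port labels, and interface are hypothetical and only illustrate the selection rule.

```python
class LRUPortPredictor:
    """Predict the least recently used upward output port."""
    def __init__(self, up_ports):
        self.order = list(up_ports)  # front of the list = least recently used

    def predict(self):
        return self.order[0]

    def update(self, used_port):
        # Move the port that was just used to the most-recently-used end.
        self.order.remove(used_port)
        self.order.append(used_port)


if __name__ == "__main__":
    predictor = LRUPortPredictor(["UP0", "UP1"])  # hypothetical port names
    for _ in range(4):
        port = predictor.predict()
        print(port)
        predictor.update(port)
```

If upward packets simply take the predicted port, this degenerates to a round-robin over the upward ports, which spreads upward traffic across them.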
![Page 32: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/32.jpg)
Case study 3: Fat tree network
• Comm. latency @ uniform traffic
– Original router
– Pred router (LRU)
– Pred router (LRU + LP)
[Plot: communication latency [cycles] vs. network size (# of cores)]
Latency 30.7% reduced @ 256-core; small area overhead (7.8%)
![Page 33: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/33.jpg)
Summary of the prediction router
• Prediction router for low-latency NoCs
– Multiple predictors, which can be switched in a cycle
– Architecture and six prediction algorithms
– Analytical model of prediction hit rates
• Evaluations of prediction router
– Case study 1: 2-D mesh (small core size)
– Case study 2: 2-D mesh (large core size)
– Case study 3: Fat tree network
• Results (from three case studies)
1. Prediction router can be applied to various NoCs
2. Communication latency is reduced with small overheads
3. Prediction router with multiple predictors can accelerate a wider range of applications
From case studies 1 & 2:
– Area overhead: 6.4% (SS+LP)
– Energy overhead: 9.5% (worst)
– Latency reduction: up to 48%
![Page 34: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/34.jpg)
Thank you for your attention
It would be very helpful if you would speak slowly. Thank you in advance.
![Page 35: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/35.jpg)
Prediction router: New modifications
[Figure: router with predictors (A, B, C) on each input channel, input FIFOs (X+, X-, Y+, Y-, CORE), an arbiter, a 5x5 crossbar, and KILL signals]
• Predictors for each input channel
• Kill mechanism to remove dead flits
• Two-level arbiter
– “Reservation”: higher priority
– “Tentative reservation” by the pre-execution of VSA
Currently, the critical path is related to the arbiter
![Page 36: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/36.jpg)
Prediction router: Predictor selection
• Static scheme
– A predictor is selected by the user per application
– Configuration table: Application 1: Predictor B; Application 2: Predictor A; Application 3: Predictor C; …
– Simple, but pre-analysis is needed
• Dynamic scheme
– A predictor is adaptively selected
– Count up if each predictor hits (e.g., Predictor A: 100, Predictor B: 80, Predictor C: 120)
– A predictor is selected every n cycles (e.g., n = 10,000)
– Flexible, but consumes more energy
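The dynamic scheme can be sketched in software as a set of hit counters that are compared every n cycles. This is only an illustration: the class names and interface are hypothetical, and resetting the counters after each selection is an assumption that the slides do not state.

```python
class FixedPort:
    """Trivial stand-in predictor that always predicts the same output port."""
    def __init__(self, port):
        self.port = port

    def predict(self):
        return self.port

    def update(self, actual_output):
        pass


class DynamicSelector:
    """Score every predictor on each packet; re-select the best one every `interval` cycles."""
    def __init__(self, predictors, interval=10_000):
        self.predictors = predictors                  # name -> predictor object
        self.interval = interval
        self.hits = dict.fromkeys(predictors, 0)
        self.active = next(iter(predictors))          # start with the first predictor
        self.cycle = 0

    def predict(self):
        return self.predictors[self.active].predict()

    def observe(self, actual_output):
        # Count up the counter of each predictor that would have hit.
        for name, predictor in self.predictors.items():
            if predictor.predict() == actual_output:
                self.hits[name] += 1
            predictor.update(actual_output)

    def tick(self):
        self.cycle += 1
        if self.cycle % self.interval == 0:
            self.active = max(self.hits, key=self.hits.get)
            self.hits = dict.fromkeys(self.hits, 0)   # assumption: counters restart


if __name__ == "__main__":
    selector = DynamicSelector({"A": FixedPort("X+"), "B": FixedPort("Y+")}, interval=4)
    for actual in ["Y+", "Y+", "X+", "Y+"]:
        selector.observe(actual)
        selector.tick()
    print(selector.active)
```

Scoring all predictors on every packet is what makes the scheme flexible, and it is also a plausible source of the extra energy noted on the slide.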
![Page 37: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/37.jpg)
Case study 1: Router critical path
• RC: Routing comp.
• VSA: Arbitration
• ST: Switch traversal
[Plot: stage delay [FO4s] of each pipeline stage for the original router and Pred router (SS)]
The critical path delay is increased by 6.2% compared with the original router
ST can occur in these stages of the prediction router
![Page 38: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/38.jpg)
Case study 2: Hit rate @ 8x8 mesh
• SS: go straight
• LP: the last one
• FCM: frequently used pattern
• Custom: user-specific path
[Plot: prediction hit rate [%] for 7 NAS parallel benchmark programs and 4 synthesized traffics; annotations: “Efficient for long straight comm.”, “Efficient for short repeated comm.”, “All-arounder!”, and “Efficient for simple comm.”]
![Page 39: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/39.jpg)
![Page 40: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/40.jpg)
Case study 4: Spidergon network
• Spidergon topology [Coppola, ISSOC’04]
– Ring + across links
– Each router has 3 ports
– Mesh-like 2-D layout
– Across first routing
• Hit rate @ Uniform
– SS: Go straight
– LP: Last used one
– FCM: Frequently used one
[Plot: prediction hit rate [%] vs. network size (# of cores)]
Hit rates of SS & FCM are almost the same
A high hit rate is achieved (80% for 64 cores; 94% for 256 cores)
![Page 41: Prediction Router:](https://reader035.fdocuments.net/reader035/viewer/2022070405/56813fd6550346895daabaf9/html5/thumbnails/41.jpg)
4 case studies of prediction router
• Case studies 1 & 2: 2-D mesh network
• Case study 3: Fat tree network
• Case study 4: Spidergon network