Prediction Router:


Page 1: Prediction Router:

Prediction Router: Yet another low-latency on-chip router architecture

Hiroki Matsutani (Keio Univ., Japan)
Michihiro Koibuchi (NII, Japan)
Hideharu Amano (Keio Univ., Japan)
Tsutomu Yoshinaga (UEC, Japan)

Page 2: Prediction Router:

Why is a low-latency router needed?

• Tile architecture
  – Many cores (e.g., processors & caches)
  – On-chip interconnection network: packet switched network [Dally, DAC’01]

[Figure: 16-core tile architecture; each tile consists of a core and a router]

The on-chip router affects the performance and cost of the chip.

Page 3: Prediction Router:

System               Topology            Routing                  Switching            Flow ctrl
MIT RAW              2D mesh (32bit)     XY DOR                   WH, no VC            Credit
UPMC SPIN            Fat Tree (32bit)    Up*/down*                WH, no VC            Credit
QuickSilver ACM      H-Tree (32bit)      Up*/down*                1-flit, no VC        Credit
UMass Amherst aSOC   2D mesh             Shortest-path            Pipelined CS, no VC  Timeslot
Sun T1               Crossbar (128bit)   -                        -                    Handshake
Cell BE EIB          Ring (128bit)       Shortest-path            Pipelined CS, no VC  Credit
TRIPS (operand)      2D mesh (109bit)    YX DOR                   1-flit, no VC        On/off
TRIPS (on-chip)      2D mesh (128bit)    YX DOR                   WH, 4 VCs            Credit
Intel SCC            2D torus (32bit)    XY,YX DOR, odd-even TM   WH, no VC            Stall/go
TILE64 iMesh         2D mesh (32bit)     XY DOR                   WH, no VC            Credit
Intel 80-core NoC    2-D mesh (32bit)    Source routing           WH, 2 lanes          On/off

• The number of cores is increasing (e.g., 64 cores or more?)
• The number of hops increases, so communication latency becomes a crucial problem
• Low-latency router architectures have been extensively studied

Page 4: Prediction Router:

Outline: Prediction router for low-latency NoC

• Existing low-latency routers
  – Speculative router
  – Look-ahead router
  – Bypassing router
• Prediction router
  – Architecture and the prediction algorithms
• Hit rate analysis
• Evaluations
  – Hit rate, gate count, and energy consumption
  – Case study 1: 2-D mesh (small core size)
  – Case study 2: 2-D mesh (large core size)
  – Case study 3: Fat tree network

Page 5: Prediction Router:

Wormhole router: Hardware structure

[Figure: 5x5 crossbar with an arbiter; input ports X+, X-, Y+, Y-, and CORE, each with a FIFO; output ports X+, X-, Y+, Y-, and CORE; the arbiter issues GRANT]

Routing, arbitration, & switch traversal are performed in a pipelined manner:
1) selecting an output channel
2) arbitration for the selected output channel
3) sending the packet to the output channel

Page 6: Prediction Router:

Pipeline structure: 3-cycle router
Speculative router: VA/SA in parallel [Peh,HPCA’01]

• At least 3 cycles for traversing a router
  – RC (Routing computation)
  – VSA (Virtual channel & switch allocations)
  – ST (Switch traversal)
• A packet transfer from router (a) to router (c)

[Figure: timing chart, elapsed time 1 to 12 cycles, for the HEAD, DATA 1, DATA 2, and DATA 3 flits at Router A, Router B, and Router C; the head flit goes through RC, VSA, and ST at each router, and each data flit goes through SA and ST]

VA & SA are speculatively performed in parallel.
At least 12 cycles for transferring a packet from router (a) to router (c).
To perform RC and VSA in parallel, look-ahead routing is used.

Page 7: Prediction Router:

Look-ahead router: RC/VA in parallel

• At least 3 cycles for traversing a router
  – NRC (Next routing computation)
  – VSA (Virtual channel & switch allocations)
  – ST (Switch traversal)

[Figure: timing chart, elapsed time 1 to 12 cycles, for the HEAD, DATA 1, DATA 2, and DATA 3 flits at Router A, Router B, and Router C; the head flit goes through NRC, VSA, and ST, and each data flit goes through SA and ST]

• NRC is the routing computation for the next hop: the output port of router (i+1) is selected by router i
• VSA can be performed w/o waiting for NRC

Page 8: Prediction Router:

Look-ahead router: RC/VA in parallel

• At least 2 cycles for traversing a router
  – NRC + VSA (Next routing computation / arbitrations)
  – ST (Switch traversal)

[Figure: timing chart, elapsed time 1 to 9 cycles, for the HEAD, DATA 1, DATA 2, and DATA 3 flits at Router A, Router B, and Router C; NRC and VSA share one stage, followed by ST]

• No dependency between NRC & VSA, so NRC & VSA are performed in parallel
• Typical example of a 2-cycle router
• At least 9 cycles for transferring a packet from router (a) to router (c)
• Packing NRC, VSA, and ST into a single stage would harm the operating frequency

[Dally’s book, 2004]
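The 12-cycle and 9-cycle figures quoted for the 3-cycle and 2-cycle routers follow from simple pipeline arithmetic. The sketch below is illustrative only: the function name `packet_latency` is hypothetical, and it assumes a 4-flit packet (head plus three data flits, as in the timing charts), no contention, and no extra link delay.

```python
def packet_latency(num_routers: int, cycles_per_router: int, packet_flits: int) -> int:
    """Zero-load latency in cycles for one wormhole-switched packet.

    Assumptions of this sketch (not taken from the slides): no contention,
    no separate link-traversal cycle, and data flits follow the head flit
    one cycle apart.
    """
    if num_routers < 1 or cycles_per_router < 1 or packet_flits < 1:
        raise ValueError("all arguments must be positive")
    head_latency = num_routers * cycles_per_router  # head flit crosses every router
    tail_delay = packet_flits - 1                    # remaining flits drain one per cycle
    return head_latency + tail_delay


if __name__ == "__main__":
    print(packet_latency(3, 3, 4))  # 3-cycle router, routers (a) to (c)
    print(packet_latency(3, 2, 4))  # 2-cycle router, routers (a) to (c)
```

With three routers and a 4-flit packet, the two calls give 12 and 9 cycles, matching the numbers on the slides.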

Page 9: Prediction Router:

Bypassing router: skip some stages

• Bypassing between intermediate nodes
  – E.g., Express VCs [Kumar, ISCA’07]

[Figure: virtual bypassing paths between SRC and DST; routers that are not bypassed take 3 cycles, bypassed routers take 1 cycle]

Page 10: Prediction Router:

Bypassing router: skip some stages

• Bypassing between intermediate nodes
  – E.g., Express VCs [Kumar, ISCA’07]
• Pipeline bypassing utilizing the regularity of DOR
  – E.g., Mad postman [Izu, PDP’94]
• Pipeline stages on frequently used paths are skipped
  – E.g., Dynamic fast path [Park, HOTI’07]
• Pipeline stages on user-specific paths are skipped
  – E.g., Preferred path [Michelogiannakis, NOCS’07]
  – E.g., DBP [Koibuchi, NOCS’08]

[Figure: virtual bypassing paths between SRC and DST; routers that are not bypassed take 3 cycles, bypassed routers take 1 cycle]

We propose a low-latency router based on multiple predictors.


Page 12: Prediction Router:

Prediction router for 1-cycle transfer [Yoshinaga,IWIA’06][Yoshinaga,IWIA’07]

• Each input channel has predictors
• When an input channel is idle,
  – Predict an output port to be used (RC pre-execution)
  – Arbitration to use the predicted port (SA pre-execution)

[Figure: timing chart of the original router, elapsed time 1 to 12 cycles, for the HEAD, DATA 1, DATA 2, and DATA 3 flits at Router A, Router B, and Router C; the head flit goes through RC, VSA, and ST at each router]

RC & VSA are skipped if the prediction hits, giving a 1-cycle transfer.
E.g., we can expect a 1.6-cycle transfer if 70% of predictions hit.
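The 1.6-cycle figure is the expected per-hop latency when a hit costs 1 cycle and a miss costs 3 cycles. A minimal sketch of that arithmetic follows; the function name and default arguments are illustrative, not part of the original design.

```python
def expected_cycles_per_hop(hit_rate: float,
                            hit_cycles: int = 1,
                            miss_cycles: int = 3) -> float:
    """Average cycles to traverse one router for a given prediction hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


if __name__ == "__main__":
    # 0.7 * 1 + 0.3 * 3 = 1.6 cycles per hop
    print(round(expected_cycles_per_hop(0.7), 2))
```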

Page 13: Prediction Router:

Prediction router for 1-cycle transfer [Yoshinaga,IWIA’06][Yoshinaga,IWIA’07]

• Each input channel has predictors
• When an input channel is idle,
  – Predict an output port to be used (RC pre-execution)
  – Arbitration to use the predicted port (SA pre-execution)

[Figure: timing chart, elapsed time 1 to 12 cycles, for the HEAD, DATA 1, DATA 2, and DATA 3 flits; the prediction at Router A is marked MISS, so the head flit goes through RC, VSA, and ST there; Router B and Router C are still shown with the original pipeline]

RC & VSA are skipped if the prediction hits, giving a 1-cycle transfer.
E.g., we can expect a 1.6-cycle transfer if 70% of predictions hit.

Page 14: Prediction Router:

Prediction router for 1-cycle transfer [Yoshinaga,IWIA’06][Yoshinaga,IWIA’07]

• Each input channel has predictors
• When an input channel is idle,
  – Predict an output port to be used (RC pre-execution)
  – Arbitration to use the predicted port (SA pre-execution)

[Figure: timing chart, elapsed time 1 to 12 cycles, for the HEAD, DATA 1, DATA 2, and DATA 3 flits; the first router is marked MISS (RC, VSA, ST), the second router is marked HIT (ST only), and Router C is still shown with the original pipeline]

RC & VSA are skipped if the prediction hits, giving a 1-cycle transfer.
E.g., we can expect a 1.6-cycle transfer if 70% of predictions hit.

Page 15: Prediction Router:

Prediction router for 1-cycle transfer [Yoshinaga,IWIA’06][Yoshinaga,IWIA’07]

• Each input channel has predictors
• When an input channel is idle,
  – Predict an output port to be used (RC pre-execution)
  – Arbitration to use the predicted port (SA pre-execution)

[Figure: timing chart, elapsed time 1 to 12 cycles, for the HEAD, DATA 1, DATA 2, and DATA 3 flits; the three routers are marked MISS, HIT, and HIT, so the head flit goes through RC, VSA, and ST at the first router and ST only at the other two]

RC & VSA are skipped if the prediction hits, giving a 1-cycle transfer.
E.g., we can expect a 1.6-cycle transfer if 70% of predictions hit.
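To make the MISS / HIT / HIT case concrete, the following sketch adds up per-router cycles for a sequence of prediction outcomes. The names are hypothetical, and the model assumes 1 cycle per hit, 3 cycles per miss, a 4-flit packet, and data flits that trail the head flit one cycle apart.

```python
from typing import Sequence

HIT_CYCLES = 1   # RC and VSA are skipped; only ST is performed
MISS_CYCLES = 3  # falls back to RC, VSA, ST as in the original router


def head_flit_latency(outcomes: Sequence[bool]) -> int:
    """Cycles for the head flit to cross the routers; True means a prediction hit."""
    return sum(HIT_CYCLES if hit else MISS_CYCLES for hit in outcomes)


def predicted_packet_latency(outcomes: Sequence[bool], packet_flits: int = 4) -> int:
    """Head-flit latency plus one cycle for each trailing flit (simplifying assumption)."""
    return head_flit_latency(outcomes) + (packet_flits - 1)


if __name__ == "__main__":
    print(predicted_packet_latency([False, True, True]))    # MISS, HIT, HIT
    print(predicted_packet_latency([False, False, False]))  # every prediction misses
```

Under these assumptions the MISS, HIT, HIT sequence takes 8 cycles, and three misses fall back to the 12 cycles of the original 3-cycle router.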

Page 16: Prediction Router:

Prediction router: Prediction algorithms [Yoshinaga,IWIA’06][Yoshinaga,IWIA’07]

• An efficient predictor is key
• A single predictor isn’t enough for applications with different traffic patterns
• Prediction router
  – Multiple predictors (A, B, C) for each input channel
  – Select one of them in response to a given network environment

Prediction algorithms:
1. Random
2. Static Straight (SS): An output channel on the same dimension is selected (exploiting the regularity of DOR)
3. Custom: User can specify which output channel is accelerated
4. Latest Port (LP): Previously used output channel is selected
5. Finite Context Method (FCM) [Burtscher, TC’02]: The most frequently appearing pattern of an n-context sequence (n = 0,1,2,…)
6. Sampled Pattern Match (SPM) [Jacquet, TIT’02]: Pattern matching using a record table
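As a rough software illustration of three of these predictors (SS, LP, and FCM), the sketch below models each one as an object with `predict` and `update` methods. All class names, the port labels, and the convention that "straight" means leaving through the port opposite the input port are assumptions made for this example; the actual hardware predictors may be organized differently.

```python
from collections import Counter, defaultdict, deque
from typing import List, Optional

# Assumption: an input port is named after the side it sits on, so going
# straight means leaving through the opposite side.
OPPOSITE = {"X+": "X-", "X-": "X+", "Y+": "Y-", "Y-": "Y+"}


class StaticStraight:
    """SS: always predict the output channel on the same dimension."""

    def __init__(self, input_port: str) -> None:
        self.input_port = input_port

    def predict(self) -> Optional[str]:
        return OPPOSITE.get(self.input_port)  # None for the CORE port

    def update(self, actual: str) -> None:
        pass  # static predictor, nothing to learn


class LatestPort:
    """LP: predict the previously used output channel."""

    def __init__(self) -> None:
        self.last: Optional[str] = None

    def predict(self) -> Optional[str]:
        return self.last

    def update(self, actual: str) -> None:
        self.last = actual


class FiniteContext:
    """FCM: predict the output seen most often after the last n outputs."""

    def __init__(self, n: int = 1) -> None:
        self.history: deque = deque(maxlen=n)
        self.table = defaultdict(Counter)

    def predict(self) -> Optional[str]:
        counts = self.table.get(tuple(self.history))
        return counts.most_common(1)[0][0] if counts else None

    def update(self, actual: str) -> None:
        self.table[tuple(self.history)][actual] += 1
        self.history.append(actual)


def hit_rate(predictor, trace: List[str]) -> float:
    """Fraction of output ports in the trace that the predictor guessed correctly."""
    hits = 0
    for actual in trace:
        hits += predictor.predict() == actual
        predictor.update(actual)
    return hits / len(trace)


if __name__ == "__main__":
    trace = ["X+", "X+", "X+", "Y+", "X+", "X+", "X+", "Y+"]  # made-up example trace
    print(hit_rate(StaticStraight("X-"), trace))
    print(hit_rate(LatestPort(), trace))
    print(hit_rate(FiniteContext(n=0), trace))
```

On this made-up trace the mostly straight traffic favors SS, which is consistent with the idea that no single predictor suits every traffic pattern.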

Page 17: Prediction Router:

Basic operation @ Correct prediction

[Figure: prediction router with a 5x5 XBAR, an ARBITER, a FIFO and predictors (A, B, C) on the input channel, and input/output ports X+, X-, Y+, Y-, and CORE; the crossbar is reserved for the predicted port]

• Idle state: Output port X+ is selected and reserved
• 1st cycle: Incoming flit is transferred to X+ without RC and VSA
• 1st cycle: RC is performed; the prediction is correct!
• 2nd cycle: Next flit is transferred to X+ without RC and VSA

1-cycle transfer using the reserved crossbar port when the prediction hits.

Page 18: Prediction Router:

Basic operation @ Miss prediction

[Figure: prediction router with a 5x5 XBAR, an ARBITER, a FIFO and predictors (A, B, C) on the input channel, and input/output ports X+, X-, Y+, Y-, and CORE; a dead flit leaves on X+ and a KILL signal is asserted]

• Idle state: Output port X+ is selected and reserved
• 1st cycle: Incoming flit is transferred to X+ without RC and VSA
• 1st cycle: RC is performed; the prediction is wrong! (X- is correct)
• Kill signal to X+ is asserted
• 2nd/3rd cycle: Dead flit is removed; retransmission to the correct port

Even with a misprediction, a flit is transferred in 3 cycles, as in the original router.
More energy is consumed for retransmission.


Page 20: Prediction Router:

Prediction hit rate analysis

• Formulas to calculate the prediction hit rates on
  – 2-D torus (Random, LP, SS, FCM, and SPM)
  – 2-D mesh (Random, LP, SS, FCM, and SPM)
  – Fat tree (Random and LRU)
  – To forecast which prediction algorithm is suited for a given network environment w/o simulations
• Accuracy of the analytical model is confirmed through simulations

Derivation of the formulas is omitted in this talk (see “Section 4” of our paper for more detail).


Page 22: Prediction Router:

Evaluation items

• Hit rate / Comm. latency: flit-level net simulation (how many cycles, given a miss or a hit at each hop?)
• Area (gate count): Design compiler (synthesis) with Fujitsu 65nm library; Astro (place & route)
• Energy cons. [pJ / bit]: NC-Verilog (simulation) and Power compiler, using SAIF and SDF

Table 1: Router & network parameters
Packet length         4-flit (1-flit: 64 bit)
Switching technique   wormhole
Channel buffer size   4-flit / VC
Number of VCs         1 or 2 VCs
Cycle / hop (miss)    3 stage
Cycle / hop (hit)     1 stage
* Topology and traffic are mentioned later

Table 2: Process library
CMOS process   65nm
Core voltage   1.20V
Temperature    25C

Table 3: CAD tools used
Design compiler   2006.06
Astro             2007.03

Page 23: Prediction Router:

3 case studies of prediction router

• Case study 1 & 2: 2-D mesh network
  – The most popular network topology: MIT’s RAW [Taylor,ISCA’04], Intel’s 80-core [Vangal,ISSCC’07]
  – Dimension-order routing (XY routing)
• Case study 3: Fat tree network

[Figure: evaluation flow for hit rate / comm. latency (flit-level net simulation), area (Design compiler, Astro), and energy (NC-Verilog, Power compiler)]

Here, we show the results of case studies 1 and 2 together.

Page 24: Prediction Router:

Case study 1: Zero-load comm. latency

Uniform random traffic on 4x4 to 16x16 meshes

[Figure: simulation results; x-axis: Network size (k-ary 2-mesh); y-axis: Comm. latency [cycles]; series: Original router, Pred router (SS), Pred router (100% hit)]
(*) 1-cycle transfer for correct prediction, 3-cycle for wrong prediction

• 35.8% reduced for 8x8 cores
• 48.2% reduced for 16x16 cores
• The analytical model also shows the same result

More latency is reduced (48% for k=16) as the network size increases.
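Why the benefit grows with network size can be seen from a back-of-the-envelope model. The sketch below is not the authors' simulator or analytical model and does not attempt to reproduce the 35.8% and 48.2% figures; the function names are hypothetical, and it assumes uniform random traffic over all source/destination pairs (including a node sending to itself), one router per hop plus the source router, a 4-flit packet, and no link delay.

```python
def average_hops(k: int) -> float:
    """Average Manhattan distance between two uniformly chosen nodes of a k x k mesh."""
    # X and Y coordinates are independent, so average one dimension and double it.
    per_dim = sum(abs(a - b) for a in range(k) for b in range(k)) / (k * k)
    return 2 * per_dim


def zero_load_latency(k: int, hit_rate: float, packet_flits: int = 4) -> float:
    """Zero-load latency with 1 cycle per hit and 3 cycles per miss at each router."""
    cycles_per_router = hit_rate * 1 + (1 - hit_rate) * 3
    routers = average_hops(k) + 1  # assumption: source router plus one router per hop
    return routers * cycles_per_router + (packet_flits - 1)


def latency_reduction(k: int, hit_rate: float) -> float:
    """Relative latency saved versus the original router (hit_rate = 0)."""
    return 1 - zero_load_latency(k, hit_rate) / zero_load_latency(k, 0.0)


if __name__ == "__main__":
    for k in (4, 8, 16):
        print(k, round(average_hops(k), 3), round(latency_reduction(k, 1.0), 3))
```

In this model the fixed serialization term (packet length) matters less as the hop count grows, so the relative saving increases with k, which is the same trend the slide reports.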

Page 25: Prediction Router:

Case study 2: Hit rate @ 8x8 mesh

[Figure: Prediction hit rate [%] for 7 NAS parallel benchmark programs and 4 synthesized traffics]

• SS: go straight (efficient for long straight comm.)
• LP: the last one
• FCM: frequently used pattern

Page 26: Prediction Router:

Case study 2: Hit rate @ 8x8 mesh

[Figure: Prediction hit rate [%] for 7 NAS parallel benchmark programs and 4 synthesized traffics]

• SS: go straight (efficient for long straight comm.)
• LP: the last one (efficient for short repeated comm.)
• FCM: frequently used pattern

Page 27: Prediction Router:

Case study 2: Hit rate @ 8x8 mesh

[Figure: Prediction hit rate [%] for 7 NAS parallel benchmark programs and 4 synthesized traffics]

• SS: go straight (efficient for long straight comm.)
• LP: the last one (efficient for short repeated comm.)
• FCM: frequently used pattern (all-arounder!)

However, the effective bypassing policy depends on traffic patterns…

• Existing bypassing routers use
  – Only a static or a single bypassing policy
• Prediction router supports
  – Multiple predictors which can be switched in a cycle
  – To accelerate a wider range of applications

Page 28: Prediction Router:

Case study 2: Area & Energy

• Area (gate count)
  – Original router
  – Pred router (SS + LP)
  – Pred router (SS+LP+FCM)
• Energy consumption

[Figure: Router area [kilo gates]; Verilog-HDL designs synthesized with 65nm library]

• Area is 6.4 - 15.9% increased, depending on type and number of predictors
• SS + LP is light-weight (small overhead)
• FCM is an all-arounder, but requires counters

Page 29: Prediction Router:

Case study 2: Area & Energy

• Area (gate count)
  – Original router
  – Pred router (SS + LP)
  – Pred router (SS+LP+FCM)
  – 6.4 - 15.9% increased, depending on type and number of predictors
• Energy consumption
  – Original router
  – Pred router (70% hit)
  – Pred router (100% hit)
  – Misprediction consumes power; 9.5% increased if hit rate is 70%

[Figure: Router area [kilo gates] and Flit switching energy [pJ / bit]]

This estimation is pessimistic:
1. More energy is consumed in links, so the effect of the router energy overhead is reduced
2. The application will finish early, so more energy is saved

Latency is 35.8%-48.2% saved w/ reasonable area/energy overheads.

Page 30: Prediction Router:

3 case studies of prediction router

• Case study 1 & 2: 2-D mesh network
• Case study 3: Fat tree network

[Figure: evaluation flow for hit rate / comm. latency (flit-level net simulation), area (Design compiler, Astro), and energy (NC-Verilog, Power compiler)]

Page 31: Prediction Router:

Case study 3: Fat tree network

[Figure: fat tree with Up and Down transfer directions]

1. LRU algorithm: the LRU (least recently used) output port is selected for upward transfer
2. LRU + LP algorithm: plus, LP for downward transfer

Page 32: Prediction Router:

Case study 3: Fat tree network

1. LRU algorithm: the LRU (least recently used) output port is selected for upward transfer
2. LRU + LP algorithm: plus, LP for downward transfer

• Comm. latency @ uniform
  – Original router
  – Pred router (LRU)
  – Pred router (LRU + LP)

[Figure: x-axis: Network size (# of cores); y-axis: Comm. latency [cycles]]

Latency 30.7% reduced @ 256-core; small area overhead (7.8%)

Page 33: Prediction Router:

Summary of the prediction router

• Prediction router for low-latency NoCs
  – Multiple predictors, which can be switched in a cycle
  – Architecture and six prediction algorithms
  – Analytical model of prediction hit rates
• Evaluations of prediction router
  – Case study 1: 2-D mesh (small core size)
  – Case study 2: 2-D mesh (large core size)
  – Case study 3: Fat tree network
• Results (from three case studies)
  1. Prediction router can be applied to various NoCs
  2. Communication latency reduced with small overheads
  3. Prediction router with multiple predictors can accelerate a wider range of applications

From Case studies 1 & 2:
  Area overhead: 6.4% (SS+LP)
  Energy overhead: 9.5% (worst)
  Latency reduction: up to 48%

Page 34: Prediction Router:

Thank you for your attention

It would be very helpful if you would speak slowly. Thank you in advance.

Page 35: Prediction Router:

Prediction router: New modifications

[Figure: prediction router with a 5x5 XBAR, an ARBITER, a FIFO and predictors (A, B, C) on the input channel, input/output ports X+, X-, Y+, Y-, and CORE, and KILL signals]

• Predictors for each input channel
• Kill mechanism to remove dead flits
• Two-level arbiter
  – “Reservation”: higher priority
  – “Tentative reservation” by the pre-execution of VSA

Currently, the critical path is related to the arbiter.

Page 36: Prediction Router:

Prediction router: Predictor selection

• Static scheme
  – A predictor is selected by user per application
  – Configuration table: Application 1: Predictor B; Application 2: Predictor A; Application 3: Predictor C; …
  – Simple, but pre-analysis is needed
• Dynamic scheme
  – A predictor is adaptively selected
  – Count up if each predictor hits (e.g., Predictor A: 100, Predictor B: 80, Predictor C: 120)
  – A predictor is selected every n cycles (e.g., n = 10,000)
  – Flexible, but more energy
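The dynamic scheme can be pictured as a set of hit counters that are compared every n cycles. The sketch below is a software illustration only: the class and method names are hypothetical, and resetting the counters after each selection is an assumption that the slides do not state.

```python
from typing import Dict, List


def select_predictor(hit_counts: Dict[str, int]) -> str:
    """Return the name of the predictor with the most hits."""
    return max(hit_counts, key=hit_counts.get)


class DynamicSelector:
    """Counts hits per predictor and re-selects the active one every `period` cycles."""

    def __init__(self, names: List[str], period: int = 10_000) -> None:
        self.period = period
        self.hit_counts = {name: 0 for name in names}
        self.active = names[0]
        self.cycle = 0

    def record_hit(self, name: str) -> None:
        """Count up when the named predictor's guess was correct."""
        self.hit_counts[name] += 1

    def tick(self) -> str:
        """Advance one cycle; switch predictors at the end of each period."""
        self.cycle += 1
        if self.cycle % self.period == 0:
            self.active = select_predictor(self.hit_counts)
            # Assumption: counters start from zero again for the next period.
            self.hit_counts = {name: 0 for name in self.hit_counts}
        return self.active


if __name__ == "__main__":
    # Counter values from the slide: the predictor with the most hits is chosen.
    print(select_predictor({"Predictor A": 100, "Predictor B": 80, "Predictor C": 120}))
```

With the counter values shown on the slide (100, 80, 120), Predictor C would be selected for the next period.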

Page 37: Prediction Router:

Case study 1: Router critical path

• RC: Routing comp.
• VSA: Arbitration
• ST: Switch traversal

[Figure: Stage delay [FO4s] of the RC, VSA, and ST stages for Original router and Pred router (SS)]

• Critical path delay is increased by 6.2% compared with the original router
• ST can occur in these stages of the prediction router

Page 38: Prediction Router:

Case study 2: Hit rate @ 8x8 mesh

[Figure: Prediction hit rate [%] for 7 NAS parallel benchmark programs and 4 synthesized traffics]

• SS: go straight (efficient for long straight comm.)
• LP: the last one (efficient for short repeated comm.)
• FCM: frequently used pattern (all-arounder!)
• Custom: user-specific path (efficient for simple comm.)

Page 39: Prediction Router:

Case study 4: Spidergon network

• Spidergon topology [Coppola,ISSOC’04]
  – Ring + across links
  – Each router has 3 ports
  – Mesh-like 2-D layout
  – Across first routing
• Hit rate @ Uniform

Page 40: Prediction Router:

Case study 4: Spidergon network

• Spidergon topology [Coppola,ISSOC’04]
  – Ring + across links
  – Each router has 3 ports
  – Mesh-like 2-D layout
  – Across first routing
• Hit rate @ Uniform
  – SS: Go straight
  – LP: Last used one
  – FCM: Frequently used one

[Figure: x-axis: Network size (# of cores); y-axis: Prediction hit rate [%]]

• Hit rates of SS & FCM are almost the same
• High hit rate is achieved (80% for 64-core; 94% for 256-core)

Page 41: Prediction Router:

4 case studies of prediction router

• Case study 1 & 2: 2-D mesh network
• Case study 3: Fat tree network
• Case study 4: Spidergon network

[Figure: evaluation flow for hit rate / comm. latency (flit-level net simulation), area (Design compiler, Astro), and energy (NC-Verilog, Power compiler)]