Prediction Router:
Yet another low-latency on-chip router architecture
Hiroki Matsutani (Keio Univ., Japan), Michihiro Koibuchi (NII, Japan),
Hideharu Amano (Keio Univ., Japan), Tsutomu Yoshinaga (UEC, Japan)
Why is a low-latency router needed?
• Tile architecture
  – Many cores (e.g., processors & caches)
  – On-chip interconnection network: a packet-switched network of routers [Dally, DAC’01]
[Figure: 16-core tile architecture; a 4x4 grid of routers, each attached to a core]
The on-chip router affects the performance and cost of the chip.
System             | Topology          | Routing                | Switching           | Flow ctrl
MIT RAW            | 2-D mesh (32bit)  | XY DOR                 | WH, no VC           | Credit
UPMC SPIN          | Fat tree (32bit)  | Up*/down*              | WH, no VC           | Credit
QuickSilver ACM    | H-Tree (32bit)    | Up*/down*              | 1-flit, no VC       | Credit
UMass Amherst aSOC | 2-D mesh          | Shortest-path          | Pipelined CS, no VC | Timeslot
Sun T1             | Crossbar (128bit) | -                      | -                   | Handshake
Cell BE EIB        | Ring (128bit)     | Shortest-path          | Pipelined CS, no VC | Credit
TRIPS (operand)    | 2-D mesh (109bit) | YX DOR                 | 1-flit, no VC       | On/off
TRIPS (on-chip)    | 2-D mesh (128bit) | YX DOR                 | WH, 4 VCs           | Credit
Intel SCC          | 2-D torus (32bit) | XY,YX DOR, odd-even TM | WH, no VC           | Stall/go
TILE64 iMesh       | 2-D mesh (32bit)  | XY DOR                 | WH, no VC           | Credit
Intel 80-core NoC  | 2-D mesh (32bit)  | Source routing         | WH, 2 lanes         | On/off
As the number of cores increases (e.g., to 64 or more), the number of hops increases, and communication latency becomes a crucial problem.
Low-latency router architectures have therefore been studied extensively.
Outline: Prediction router for low-latency NoC
• Existing low-latency routers
  – Speculative router
  – Look-ahead router
  – Bypassing router
• Prediction router
  – Architecture and the prediction algorithms
• Hit rate analysis
• Evaluations
  – Hit rate, gate count, and energy consumption
  – Case study 1: 2-D mesh (small core size)
  – Case study 2: 2-D mesh (large core size)
  – Case study 3: Fat tree network
Wormhole router: Hardware structure
[Figure: wormhole router; five input FIFOs (X+, X-, Y+, Y-, CORE) feed a 5x5 crossbar, with an arbiter granting the output ports X+, X-, Y+, Y-, CORE]
Routing, arbitration, and switch traversal are performed in a pipelined manner:
1) selecting an output channel (routing)
2) arbitration for the selected output channel (GRANT)
3) sending the packet to the output channel
• At least 3 cycles for traversing a router
  – RC (Routing computation)
  – VSA (Virtual channel & switch allocations)
  – ST (Switch traversal)
• A packet transfer from router (a) to router (c)
[Pipeline chart: at each of routers A, B, and C, the HEAD flit takes RC, VSA, and ST (one cycle each), while DATA 1-3 take ST only; 12 cycles elapse in total.]
At least 12 cycles are needed to transfer a packet from router (a) to router (c).
[Pipeline chart: in the speculative router, VA & SA are speculatively performed in parallel as a single SA stage.]
To perform RC and VSA in parallel, look-ahead routing is used.
Pipeline structure: 3-cycle router
Speculative router: VA/SA in parallel [Peh,HPCA’01]
• At least 3 cycles for traversing a router
  – NRC (Next routing computation)
  – VSA (Virtual channel & switch allocations)
  – ST (Switch traversal)
[Pipeline chart: the HEAD flit takes NRC, VSA, and ST at router A; at routers B and C, NRC overlaps with VSA, so only VSA and ST appear on the critical path. DATA flits take ST only.]
VSA can be performed without waiting for NRC.
NRC is the routing computation for the next hop: the output port of router (i+1) is selected by router (i).
Look-ahead router: RC/VA in parallel
• At least 2 cycles for traversing a router
  – NRC + VSA (Next routing computation / arbitrations)
  – ST (Switch traversal)
[Pipeline chart: at routers A, B, and C, the HEAD flit takes NRC+VSA and then ST (two cycles per hop); DATA flits take ST only; 9 cycles elapse in total.]
There is no dependency between NRC & VSA, so NRC & VSA are performed in parallel. This is a typical example of a 2-cycle router.
Look-ahead router: RC/VA in parallel
At least 9 cycles are needed to transfer a packet from router (a) to router (c).
Packing NRC, VSA, and ST into a single stage would harm the operating frequency [Dally’s book, 2004].
Bypassing router: skip some stages
• Bypassing between intermediate nodes
  – E.g., Express VCs [Kumar, ISCA’07]
[Figure: virtual bypassing paths from SRC to DST; bypassed intermediate routers take 1 cycle instead of 3.]
Bypassing router: skip some stages
• Bypassing between intermediate nodes
  – E.g., Express VCs [Kumar, ISCA’07]
• Pipeline bypassing utilizing the regularity of DOR
  – E.g., Mad postman [Izu, PDP’94]
• Pipeline stages on frequently used paths are skipped
  – E.g., Dynamic fast path [Park, HOTI’07]
• Pipeline stages on user-specified paths are skipped
  – E.g., Preferred path [Michelogiannakis, NOCS’07]
  – E.g., DBP [Koibuchi, NOCS’08]
We propose a low-latency router based on multiple predictors
Prediction router for 1-cycle transfer
• Each input channel has predictors
• When an input channel is idle:
  – Predict an output port to be used (RC pre-execution)
  – Arbitrate to use the predicted port (SA pre-execution)
[Pipeline chart: baseline, in which the HEAD flit takes RC, VSA, and ST at routers A, B, and C; 12 cycles elapse.]
RC & VSA are skipped if the prediction hits → 1-cycle transfer.
E.g., we can expect a 1.6-cycle average transfer if 70% of predictions hit.
[Yoshinaga, IWIA’06][Yoshinaga, IWIA’07]
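The 1.6-cycle figure follows from a simple weighted average of the hit and miss cases; a minimal sketch (illustrative, using the hit/miss cycle counts from the slide):

```python
def expected_hop_latency(hit_rate, hit_cycles=1, miss_cycles=3):
    """Average cycles for a head flit to traverse one router:
    a hit skips RC and VSA (1 cycle), a miss falls back to the
    full 3-stage pipeline."""
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles

print(expected_hop_latency(0.7))  # 0.7*1 + 0.3*3 = 1.6 cycles per hop
```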
Prediction router for 1-cycle transfer
[Pipeline chart, progressive builds: the prediction misses at router A (the HEAD flit takes RC, VSA, and ST there), then hits at routers B and C (the HEAD flit takes ST only); DATA flits always take ST only.]
RC & VSA are skipped if the prediction hits → 1-cycle transfer.
E.g., we can expect a 1.6-cycle average transfer if 70% of predictions hit.
[Yoshinaga, IWIA’06][Yoshinaga, IWIA’07]
Prediction router: Prediction algorithms
• An efficient predictor is the key
• A single predictor isn’t enough for applications with different traffic patterns
• Prediction router
  – Multiple predictors for each input channel
  – Select one of them in response to a given network environment
[Figure: routers with predictors A, B, and C attached to each input channel]
[Yoshinaga, IWIA’06][Yoshinaga, IWIA’07]
1. Random
2. Static Straight (SS): an output channel on the same dimension is selected (exploiting the regularity of DOR)
3. Custom: the user can specify which output channel is accelerated
4. Latest Port (LP): the previously used output channel is selected
5. Finite Context Method (FCM): the most frequently appearing pattern of an n-context sequence (n = 0, 1, 2, …) [Burtscher, TC’02]
6. Sampled Pattern Match (SPM): pattern matching using a record table [Jacquet, TIT’02]
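As an illustration of how lightweight two of these predictors can be, here is a rough software sketch of LP and a first-order FCM (my own rendering of the idea, not the hardware design in the paper):

```python
from collections import Counter, defaultdict

class LatestPort:
    """LP: predict the output port used by the previous packet."""
    def __init__(self):
        self.last = None
    def predict(self):
        return self.last
    def update(self, port):
        self.last = port

class FCM1:
    """First-order FCM: for the current port (the context), predict
    the port that has most frequently followed it so far."""
    def __init__(self):
        self.table = defaultdict(Counter)  # context port -> next-port counts
        self.context = None
    def predict(self):
        counts = self.table.get(self.context)
        return counts.most_common(1)[0][0] if counts else None
    def update(self, port):
        if self.context is not None:
            self.table[self.context][port] += 1
        self.context = port

# An alternating X+/Y- stream defeats LP but is learned by FCM.
fcm, lp = FCM1(), LatestPort()
for port in ["X+", "Y-", "X+", "Y-", "X+"]:
    fcm.update(port)
    lp.update(port)
print(lp.predict())   # X+ (just repeats the last port)
print(fcm.predict())  # Y- (learned that Y- follows X+)
```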
[Figure: prediction router datapath: input FIFOs with predictors A, B, C, an arbiter, and a 5x5 crossbar; the predicted crossbar port X+ is reserved.]
1-cycle transfer using the reserved crossbar port when the prediction hits.
Basic operation @ correct prediction:
  Idle state: output port X+ is predicted, and the crossbar port is reserved
  1st cycle: the incoming flit is transferred to X+ without RC and VSA; RC is performed in parallel and confirms the prediction is correct
  2nd cycle: the next flit is transferred to X+ without RC and VSA
[Figure: prediction router datapath; the flit sent to the reserved port X+ turns out to be a dead flit, and a KILL signal is asserted.]
Even with a miss prediction, a flit is transferred in 3 cycles, as in the original router.
Basic operation @ miss prediction:
  Idle state: output port X+ is predicted, and the crossbar port is reserved
  1st cycle: the incoming flit is transferred to X+ without RC and VSA; RC is performed in parallel and finds the prediction wrong (X- is correct)
  The KILL signal to X+ is asserted
  2nd/3rd cycle: the dead flit is removed, and the flit is retransmitted to the correct port
  (Miss predictions cost extra energy for retransmission)
Prediction hit rate analysis
• Formulas to calculate the prediction hit rates on
  – 2-D torus (Random, LP, SS, FCM, and SPM)
  – 2-D mesh (Random, LP, SS, FCM, and SPM)
  – Fat tree (Random and LRU)
• Purpose: to forecast which prediction algorithm suits a given network environment without simulations
• The accuracy of the analytical model is confirmed through simulations
(Derivation of the formulas is omitted in this talk; see Section 4 of our paper for more detail)
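The derivation is in the paper, but the flavor of such an analysis can be checked numerically. As a hypothetical illustration (not the paper's model), the SS predictor's hit condition under XY dimension-order routing is simply "the packet continues straight", which a short Monte Carlo estimate can count:

```python
import random

def ss_hit_rate(k, trials=20000, seed=0):
    """Estimate the SS hit rate on a k x k mesh with XY routing and
    uniform random traffic: the fraction of router traversals in
    which the input and output directions coincide."""
    rng = random.Random(seed)
    hits = total = 0
    for _ in range(trials):
        sx, sy = rng.randrange(k), rng.randrange(k)
        dx, dy = rng.randrange(k), rng.randrange(k)
        # Hop directions under XY routing: all X hops, then all Y hops.
        dirs = [("X", 1 if dx > sx else -1)] * abs(dx - sx)
        dirs += [("Y", 1 if dy > sy else -1)] * abs(dy - sy)
        # At each intermediate router, SS hits iff the packet goes straight.
        for prev, cur in zip(dirs, dirs[1:]):
            total += 1
            hits += (prev == cur)
    return hits / total

print(ss_hit_rate(8))  # longer straight runs on bigger meshes -> higher rate
```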
Evaluation items
Evaluation items: hit rate / communication latency (flit-level network simulation), area (gate count), and energy consumption [pJ/bit]
[Figure: CAD flow: Verilog designs synthesized with Design Compiler (Fujitsu 65nm library), placed & routed with Astro, simulated with NC-Verilog (SAIF/SDF), and analyzed with Power Compiler]
Table 1: Router & network parameters
  Packet length: 4 flits (1 flit = 64 bit)
  Switching technique: wormhole
  Channel buffer size: 4 flits / VC
  Number of VCs: 1 or 2
  Cycles / hop (miss): 3 stages
  Cycles / hop (hit): 1 stage
  (Topology and traffic are mentioned later)

Table 2: Process library
  CMOS process: 65nm
  Core voltage: 1.20V
  Temperature: 25C

Table 3: CAD tools used
  Design Compiler: 2006.06
  Astro: 2007.03
3 case studies of prediction router
  Case studies 1 & 2: 2-D mesh network
  Case study 3: Fat tree network
• The most popular network topology (e.g., MIT’s RAW [Taylor, ISCA’04] and Intel’s 80-core [Vangal, ISSCC’07])
• Dimension-order routing (XY routing)
Here, we show the results of case studies 1 and 2 together.
Case study 1: Zero-load comm.latency
[Graph: communication latency [cycles] vs. network size (k-ary 2-mesh) for the original router, the Pred router (SS), and the Pred router (100% hit); uniform random traffic on 4x4 to 16x16 meshes]
(*) 1-cycle transfer for a correct prediction, 3-cycle for a wrong prediction
Latency is reduced by 35.8% for 8x8 cores and by 48.2% for 16x16 cores (simulation results; the analytical model shows the same result).
More latency is reduced (48% for k=16) as the network size increases.
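The trend that larger meshes benefit more can be reproduced with a back-of-the-envelope zero-load model; a sketch under assumed parameters (uniform traffic, 4-flit packets, average Manhattan distance on a k-ary 2-mesh), not the paper's exact simulator:

```python
def zero_load_latency(k, cycles_per_hop, packet_flits=4):
    """Zero-load latency on a k-ary 2-mesh under uniform traffic:
    average hop count times per-hop router delay, plus serialization
    of the remaining flits."""
    avg_hops = 2 * (k * k - 1) / (3 * k)  # mean Manhattan distance
    return avg_hops * cycles_per_hop + (packet_flits - 1)

for k in (4, 8, 16):
    base = zero_load_latency(k, 3.0)     # original 3-cycle router
    pred = zero_load_latency(k, 1.0)     # prediction router, 100% hit
    print(k, round(1 - pred / base, 3))  # latency reduction grows with k
```

The fixed serialization term matters less as the hop count grows, which is why the relative saving increases with network size.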
Case study 2: Hit rate @ 8x8 mesh
• SS: go straight → efficient for long straight communication
• LP: the last used port → efficient for short repeated communication
• FCM: frequently used pattern → an all-rounder!
[Graph: prediction hit rate [%] for 7 NAS Parallel Benchmark programs and 4 synthesized traffic patterns]
However, the effective bypassing policy depends on traffic patterns…
• Existing bypassing routers use only a static or a single bypassing policy
• The prediction router supports multiple predictors, which can be switched in a cycle, to accelerate a wider range of applications
Case study 2: Area & Energy
• Area (gate count): original router vs. Pred router (SS+LP) vs. Pred router (SS+LP+FCM)
  – Verilog-HDL designs synthesized with a 65nm library
  – Router area [kilo gates]: 6.4-15.9% increase, depending on the type and number of predictors
  – SS+LP is lightweight (small overhead); FCM is an all-rounder but requires counters
• Energy consumption: original router vs. Pred router (70% hit) vs. Pred router (100% hit)
  – Flit switching energy [pJ/bit]: miss predictions consume power; 9.5% increase if the hit rate is 70%
Latency is reduced by 35.8-48.2% with reasonable area/energy overheads.
This energy estimation is pessimistic:
  1. More energy is consumed in links, so the effect of the router energy overhead is reduced
  2. Applications finish earlier, so more energy is saved
3 case studies of prediction router
  Case studies 1 & 2: 2-D mesh network
  Case study 3: Fat tree network
Case study 3: Fat tree network
[Figure: fat tree with up/down links]
1. LRU algorithm: the LRU output port is selected for upward transfers
2. LRU + LP algorithm: in addition, LP is used for downward transfers
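A minimal sketch of the LRU up-port choice (my illustration; in a fat tree, any up port reaches the same set of destinations, so the predictor can rotate through them):

```python
class LRUUpPort:
    """Reserve the least-recently-used upward port; any up port is a
    valid choice for an upward transfer in a fat tree."""
    def __init__(self, up_ports):
        self.order = list(up_ports)   # front = least recently used
    def predict(self):
        return self.order[0]
    def update(self, used_port):
        self.order.remove(used_port)
        self.order.append(used_port)  # most recently used to the back

lru = LRUUpPort(["up0", "up1"])
print(lru.predict())   # up0
lru.update("up0")
print(lru.predict())   # up1
```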
Case study 3: Fat tree network
• Comm. latency @ uniform traffic: original router vs. Pred router (LRU) vs. Pred router (LRU + LP)
[Graph: communication latency [cycles] vs. network size (number of cores)]
Latency is reduced by 30.7% at 256 cores, with a small area overhead (7.8%).
Summary of the prediction router
• Prediction router for low-latency NoCs
  – Multiple predictors, which can be switched in a cycle
  – Architecture and six prediction algorithms
  – Analytical model of prediction hit rates
• Evaluations of the prediction router
  – Case study 1: 2-D mesh (small core size)
  – Case study 2: 2-D mesh (large core size)
  – Case study 3: Fat tree network
• Results (from the three case studies)
  1. The prediction router can be applied to various NoCs
  2. Communication latency is reduced with small overheads: latency reduction up to 48%, area overhead 6.4% (SS+LP), energy overhead 9.5% (worst case) (from case studies 1 & 2)
  3. The prediction router with multiple predictors can accelerate a wider range of applications
Thank you
for your attention
It would be very helpful if you would speak slowly. Thank you in advance.
[Figure: prediction router datapath with predictors A, B, C per input channel, KILL signals, an arbiter, and a 5x5 crossbar]
Prediction router: New modifications
• Predictors for each input channel
• KILL mechanism to remove dead flits
• Two-level arbiter
  – “Reservation” has higher priority
  – “Tentative reservation” by the pre-execution of VSA
Currently, the critical path is related to the arbiter.
Prediction router: Predictor selection
• Static scheme: a predictor is selected by the user per application
  – Configuration table (e.g., Application 1 → Predictor B, Application 2 → Predictor A, Application 3 → Predictor C)
  – Simple, but pre-analysis is needed
• Dynamic scheme: a predictor is adaptively selected
  – Each predictor counts up when it hits (e.g., Predictor A: 100, Predictor B: 80, Predictor C: 120)
  – A predictor is selected every n cycles (e.g., n = 10,000)
  – Flexible, but consumes more energy
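The dynamic scheme can be sketched as hit counters plus a periodic switch (an illustration of the idea, with a hypothetical constant predictor standing in for the real ones):

```python
class Const:
    """Hypothetical stand-in predictor that always guesses one port."""
    def __init__(self, port):
        self.port = port
    def predict(self):
        return self.port
    def update(self, port):
        pass

class DynamicSelector:
    """Run all predictors in parallel, count their hits, and every
    `window` observations switch to the best-scoring one."""
    def __init__(self, predictors, window=10_000):
        self.predictors = predictors          # name -> predictor
        self.hits = dict.fromkeys(predictors, 0)
        self.window, self.count = window, 0
        self.active = next(iter(predictors))  # initial choice
    def predict(self):
        return self.predictors[self.active].predict()
    def observe(self, actual_port):
        for name, p in self.predictors.items():
            self.hits[name] += (p.predict() == actual_port)
            p.update(actual_port)
        self.count += 1
        if self.count % self.window == 0:
            self.active = max(self.hits, key=self.hits.get)
            self.hits = dict.fromkeys(self.hits, 0)

sel = DynamicSelector({"A": Const("X+"), "B": Const("Y+")}, window=4)
for _ in range(4):
    sel.observe("Y+")  # predictor B keeps hitting
print(sel.active)      # B
```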
Case study 1: Router critical path
• RC: routing computation; VSA: arbitration; ST: switch traversal
[Graph: stage delay [FO4s] for the original router vs. the Pred router (SS)]
The critical path delay increases by 6.2% compared with the original router.
(ST can occur in any of these stages of the prediction router)
Case study 2: Hit rate @ 8x8 mesh
• SS: go straight → efficient for long straight communication
• LP: the last used port → efficient for short repeated communication
• FCM: frequently used pattern → an all-rounder!
• Custom: user-specified path → efficient for simple communication
[Graph: prediction hit rate [%] for 7 NAS Parallel Benchmark programs and 4 synthesized traffic patterns]
Case study 4: Spidergon network
• Spidergon topology [Coppola, ISSOC’04]
  – Ring + across links
  – Each router has 3 ports
  – Mesh-like 2-D layout
  – Across-first routing
• Hit rate @ uniform traffic
  – SS: go straight; LP: the last used port; FCM: the most frequently used port
[Graph: prediction hit rate [%] vs. network size (number of cores)]
Hit rates of SS & FCM are almost the same.
A high hit rate is achieved (80% for 64 cores; 94% for 256 cores).
4 case studies of prediction router
  Case studies 1 & 2: 2-D mesh network
  Case study 3: Fat tree network
  Case study 4: Spidergon network