Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Transcript of Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Page 1: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Lei Wang, Yuho Jin, Hyungjun Kim and Eun Jung Kim

Department of Computer Science and Engineering

Texas A&M University

Page 2: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Lei Wang - NOCS 2009 2

Multi-Core Wave & Networks-On-Chip

Uniprocessors have hit the power wall. Multi-processors provide high performance at a lower power budget.

Shared-bus architectures have limited scalability. Networks-On-Chip (NOCs) orchestrate chip-wide communication for future many-core processors.

MIT Raw (0.18um, 300MHz): 16-core chip, four 4x4 mesh networks

Intel Polaris (65nm, 4GHz): 80-core chip, 8x10 mesh network

Page 3: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Challenges in On-Chip Communication

High performance: Low communication latency is critical for high system performance.

Bandwidth efficiency: Well-designed routing algorithms provide high network throughput.

Power and area constraints: Simple topologies and slim routers reduce communication power consumption and save chip area.

Efficient multicast support: Cache coherence protocols rely heavily on multicast or broadcast communication.

We propose a bandwidth-efficient routing scheme for multicast communication in NOCs with low latency and power consumption.

Page 4: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Prior Work in Multicast Communication

Routing Evaluation Criteria for Multicast Communication [Ni93]: multicast in multicomputer systems.

Tree-based Multicast Routing for DSM Multiprocessors [Torrellas96]: short-message multicast in DSM systems.

Virtual Circuit Tree Multicasting (VCTM) for NOCs [Lipasti08]: demonstrates the necessity of on-chip multicasting and proposes table-based multicast routing.

Region-based Multicast for CMPs [Duato08]: multicast routing for irregular topologies in CMPs.

Page 5: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Outline

Motivation

Multicast Router Design: State-of-Art Unicast Router Architecture, Replication Schemes, Destination List Management

Recursive Partitioning Multicast (RPM): Network Partitioning, Routing Rules, Example, Deadlock Avoidance

Evaluation

Conclusion

Page 6: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Different Bandwidth Usage Example

The left path requires 11 link traversals, 12 buffer writes, 15 buffer reads, and 15 crossbar traversals.

The right path requires 5 link traversals, 6 buffer writes, 10 buffer reads, and 10 crossbar traversals.

[Figure: two 4x4 meshes (nodes 0 to 15) showing the left and right multicast paths. Legend: Source, Destination.]
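To make the bandwidth comparison concrete, the following is a minimal sketch that counts link traversals for one multicast delivered two ways: as separate unicast packets, and as a single packet replicated along a shared tree. It assumes XY routing on a 4x4 mesh numbered row by row as in the figure. The source, the destination set, and all function names are hypothetical, so the numbers do not reproduce the 11 versus 5 on the slide.

```python
def xy_path(src, dst, width=4):
    """Return the directed links (node, next_node) of an XY route."""
    links = []
    row, col = divmod(src, width)
    dst_row, dst_col = divmod(dst, width)
    while col != dst_col:                      # move along the row first (X)
        nxt = col + (1 if dst_col > col else -1)
        links.append((row * width + col, row * width + nxt))
        col = nxt
    while row != dst_row:                      # then along the column (Y)
        nxt = row + (1 if dst_row > row else -1)
        links.append((row * width + col, nxt * width + col))
        row = nxt
    return links


def unicast_link_traversals(src, dests, width=4):
    """One packet per destination: every path is paid for in full."""
    return sum(len(xy_path(src, d, width)) for d in dests)


def tree_link_traversals(src, dests, width=4):
    """One replicated packet: a link shared by several paths is used once."""
    shared = set()
    for d in dests:
        shared.update(xy_path(src, d, width))
    return len(shared)


# Hypothetical example: source 9 multicasts to nodes 3, 7 and 11.
SRC, DESTS = 9, [3, 7, 11]
print(unicast_link_traversals(SRC, DESTS))  # separate unicasts
print(tree_link_traversals(SRC, DESTS))     # shared tree
```

The gap between the two counts grows with the number of destinations that share a prefix of the route, which is the effect the left and right paths illustrate.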

Page 7: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

State-of-Art Wormhole Unicast Router

[Figure: router with input ports 0 to 4, each with input buffers holding VC 1 to VC n; Route Computation, VC Allocator, and Switch Allocator units; a crossbar switch; output ports 0 to 4. Pipeline diagrams show the RC, VA, SA, and ST stages in the router, followed by LT on the link.]

RC: Route Computation; VA: VC Allocation; SA: Switch Allocation; ST: Switch Traversal; LT: Link Traversal
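As a rough illustration of what the pipeline costs per hop, here is a sketch of the zero-load latency of a wormhole packet. It assumes each of the five listed stages takes one cycle and ignores contention, injection, and ejection. These are simplifying assumptions and not figures from the slides, and the function name is hypothetical.

```python
ROUTER_STAGES = ("RC", "VA", "SA", "ST")   # stages inside the router
LINK_STAGES = ("LT",)                      # stage on the link


def zero_load_latency(hops, packet_flits, cycles_per_stage=1):
    """Cycles until the tail flit arrives when the network is empty.

    The head flit pays every stage at every hop; the remaining flits
    follow back to back, one cycle apart (assumed full pipelining).
    """
    per_hop = (len(ROUTER_STAGES) + len(LINK_STAGES)) * cycles_per_stage
    head_latency = hops * per_hop
    serialization = packet_flits - 1
    return head_latency + serialization


# Hypothetical example: a 4-flit packet crossing 3 hops.
print(zero_load_latency(hops=3, packet_flits=4))
```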

Page 8: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

What Do We Need in a Multicast Router?

Packet Replication: Synchronous Replication, Asynchronous Replication

Destination List Management: All-destination Encoding, Bit String Encoding, Multiple-region Broadcast Encoding
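Of the destination list options above, bit string encoding is the simplest to show in code: the header carries one bit per node, set when that node is a destination. The sketch below assumes that bit i stands for node i. The slides do not give the bit order, and the helper names are hypothetical.

```python
def encode_bit_string(dests, num_nodes):
    """Pack a set of destination node ids into an integer bit mask."""
    mask = 0
    for d in dests:
        if not 0 <= d < num_nodes:
            raise ValueError(f"node {d} is outside a {num_nodes}-node network")
        mask |= 1 << d          # assumed layout: bit i represents node i
    return mask


def decode_bit_string(mask, num_nodes):
    """Recover the sorted list of destination node ids from the mask."""
    return [n for n in range(num_nodes) if (mask >> n) & 1]


# Hypothetical example on a 16-node network.
header = encode_bit_string({1, 5, 10}, num_nodes=16)
print(format(header, "016b"))
print(decode_bit_string(header, num_nodes=16))
```

The header size is fixed at one bit per node regardless of how many destinations a packet has, which keeps the router logic simple but grows linearly with network size.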

Page 9: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Synchronous Replication

Packet replication happens at the switch traversal stage.

[Figure: crossbar with Inputs 0 to 3 and Outputs 0 to 3, timeline of cycles 0 to 3, showing the flits of one packet (T M M H) being replicated to the outputs. Legend: H = head flit, M = middle flit, T = tail flit.]

Page 10: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Asynchronous Replication

[Figure: crossbar with Inputs 0 to 3 and Outputs 0 to 3, timeline of cycles 0 to 3, showing the flits of one packet (T M M H) being replicated to the outputs. Legend: H = head flit, M = middle flit, T = tail flit.]
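The slides show the two replication schemes only as figures. The toy model below sketches the usual distinction, which is an assumption here and not a statement from the deck: under synchronous replication a flit advances only when every requested output is free, so the copies move in lock-step, while under asynchronous replication each copy advances on its own output whenever that output is free. The function, its parameters, and the busy schedule are hypothetical.

```python
def replicate(num_flits, ports, busy, synchronous, max_cycles=1000):
    """Return {port: cycle count at which its last flit has been sent}.

    busy maps a cycle number to the set of output ports that cannot
    accept a flit in that cycle.
    """
    sent = {p: 0 for p in ports}
    done = {}
    for cycle in range(max_cycles):
        if len(done) == len(ports):
            return done
        blocked = busy.get(cycle, set())
        if synchronous:
            # Lock-step: one blocked branch stalls every copy.
            ready = [] if blocked & set(ports) else list(ports)
        else:
            # Independent copies: only the blocked branch stalls.
            ready = [p for p in ports if p not in blocked and p not in done]
        for p in ready:
            sent[p] += 1
            if sent[p] == num_flits:
                done[p] = cycle + 1
    raise RuntimeError("replication did not finish within max_cycles")


# Hypothetical example: a 4-flit packet to outputs 1 and 2; output 2 is busy in cycle 1.
schedule = {1: {2}}
print(replicate(4, [1, 2], schedule, synchronous=True))
print(replicate(4, [1, 2], schedule, synchronous=False))
```

In this model the asynchronous copy on the unblocked output finishes a cycle earlier, at the cost of tracking progress separately for each output.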

Page 11: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Network Partitioning

[Figure: the mesh is partitioned relative to the source node. In general there are eight parts (compass directions N, E, S, W marked); a source at a corner sees only three parts: (5, 6, 7), (0, 1, 7), (3, 4, 5), or (1, 2, 3).]
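The partitioning step can be sketched as a function that classifies each destination by its position relative to the source. The numbering below (0 = NE, then counter-clockwise through N, NW, W, SW, S, SE, E) is an assumption: it was chosen because it yields the four three-part corner cases listed on the slide, but the deck itself does not spell it out. Node ids run row by row with row 0 at the top, and all names are hypothetical.

```python
# Assumed part numbers, keyed by (east/west sign, north/south sign).
PART_BY_SIGN = {
    (1, 1): 0,    # NE
    (0, 1): 1,    # N
    (-1, 1): 2,   # NW
    (-1, 0): 3,   # W
    (-1, -1): 4,  # SW
    (0, -1): 5,   # S
    (1, -1): 6,   # SE
    (1, 0): 7,    # E
}


def partition(src, dst, width):
    """Return the part (0 to 7) that dst falls in as seen from src, or None if equal."""
    src_row, src_col = divmod(src, width)
    dst_row, dst_col = divmod(dst, width)
    dx = (dst_col > src_col) - (dst_col < src_col)   # +1 east, -1 west
    dy = (dst_row < src_row) - (dst_row > src_row)   # +1 north, -1 south
    return PART_BY_SIGN.get((dx, dy))


def parts_seen_from(src, width, height):
    """List the non-empty parts for a given source position."""
    parts = {partition(src, d, width) for d in range(width * height)}
    return sorted(parts - {None})


# On a 4x4 mesh: an interior node sees eight parts, a corner node three.
print(parts_seen_from(5, 4, 4))
print(parts_seen_from(0, 4, 4))
```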

Page 12: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Basic Routing Rules

[Figure: source and destination nodes on a mesh, with NE and SW regions marked and the compass directions N, E, S, W.]

North: top right corner. West: top left corner. South: bottom left corner. East: bottom right corner.

Page 13: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Optimized Routing Rules

[Figure: source and destination nodes on a mesh. Annotation: Deadlock!!!]

Page 14: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

RPM Example: Step 1

[Figure: mesh snapshot. Legend: source, destination, multicast packet (M), partitioning.]

Page 15: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

RPM Example: Step 2

[Figure: mesh snapshot with one ejection. Legend: source, destination, multicast packet (M), partitioning.]

Page 16: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

RPM Example: Step 3

[Figure: mesh snapshot. Legend: source, destination, multicast packet (M), partitioning.]

Page 17: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

RPM Example: Step 4

[Figure: mesh snapshot with three ejections. Legend: source, destination, multicast packet (M), partitioning.]

Page 18: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

RPM Example: Step 5

[Figure: mesh snapshot with one ejection. Legend: source, destination, multicast packet (M), partitioning.]
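The five steps above can be approximated in code as a recursion: at every router the remaining destinations are re-partitioned relative to the current node, the packet is ejected if the current node is a destination, and one copy is forwarded per output port that has destinations behind it. This is a sketch under stated assumptions and not the authors' implementation. The part numbering (0 = NE, counter-clockwise) and the port assignment (North serves NE and N, West serves NW and W, South serves SW and S, East serves SE and E) are one reading of the Basic Routing Rules slide. The optimized rules are not modelled, and all names are hypothetical.

```python
from collections import defaultdict

# Assumed part numbers, keyed by (east/west sign, north/south sign).
PART_BY_SIGN = {(1, 1): 0, (0, 1): 1, (-1, 1): 2, (-1, 0): 3,
                (-1, -1): 4, (0, -1): 5, (1, -1): 6, (1, 0): 7}
# Assumed basic rule: each port serves one straight part and one corner part.
PORT_OF_PART = {0: "N", 1: "N", 2: "W", 3: "W", 4: "S", 5: "S", 6: "E", 7: "E"}
STEP = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}  # (row, col) deltas


def part_of(cur, dst, width):
    """Part that dst falls in as seen from cur (cur != dst)."""
    cur_row, cur_col = divmod(cur, width)
    dst_row, dst_col = divmod(dst, width)
    dx = (dst_col > cur_col) - (dst_col < cur_col)
    dy = (dst_row < cur_row) - (dst_row > cur_row)
    return PART_BY_SIGN[(dx, dy)]


def rpm_forward(cur, dests, width, links=None, ejected=None):
    """Recursively deliver a multicast; return (links used, nodes ejected at)."""
    links = [] if links is None else links
    ejected = [] if ejected is None else ejected
    groups = defaultdict(list)
    for d in dests:
        if d == cur:
            ejected.append(cur)                       # deliver to the local core
        else:
            groups[PORT_OF_PART[part_of(cur, d, width)]].append(d)
    row, col = divmod(cur, width)
    for port, sub_dests in groups.items():            # one replica per used port
        d_row, d_col = STEP[port]
        nxt = (row + d_row) * width + (col + d_col)
        links.append((cur, nxt))
        rpm_forward(nxt, sub_dests, width, links, ejected)  # re-partition at nxt
    return links, ejected


# Hypothetical example on a 4x4 mesh: node 5 multicasts to 2, 3, 12 and 15.
used_links, delivered = rpm_forward(5, [2, 3, 12, 15], width=4)
print(len(used_links), sorted(delivered))
```

Because each replica carries only the destinations behind its port, the destination list shrinks at every branch, and every hop moves a replica one step closer to all of its remaining destinations.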

Page 19: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Deadlock Avoidance

RPM has no turn restrictions, which can introduce deadlock. We use virtual networks (VNs) to avoid deadlock.

Two VNs share the same physical network. The virtual channels of each port are divided equally between the two virtual networks.

The virtual network ID (0 or 1) of each packet is decided at the source.

[Figure: two 4x4 meshes (nodes 0 to 15), labelled Virtual Network 0 and Virtual Network 1.]
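The equal split of virtual channels can be sketched as follows. The sketch assumes the virtual channels of a port are numbered from 0 and that each virtual network receives a contiguous block. The slides state only that the split is equal and that the VN ID is chosen at the source, so the selection policy is left as an input. All names are hypothetical.

```python
def split_virtual_channels(vcs_per_port, num_vns=2):
    """Map each virtual network id to the VC indices it may use on a port."""
    if vcs_per_port % num_vns != 0:
        raise ValueError("VCs per port must divide evenly among the VNs")
    share = vcs_per_port // num_vns
    return {vn: list(range(vn * share, (vn + 1) * share)) for vn in range(num_vns)}


def allowed_vcs(packet_vn, vcs_per_port=4):
    """VCs a packet may request, given the VN id assigned at its source."""
    return split_virtual_channels(vcs_per_port)[packet_vn]


# With 4 VCs per port, each of the two virtual networks gets two VCs.
print(split_virtual_channels(4))
print(allowed_vcs(packet_vn=1))
```

A packet never leaves the virtual network it was assigned at the source, so the VC allocator only has to filter candidate VCs by the VN ID carried in the header.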

Page 20: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Evaluation Methodology

Performance model: cycle-accurate network simulator that models all router pipeline stages in detail and is highly parameterized.

Power model: Orion, with both dynamic and leakage power models.

Network configuration:

Topology: 8×8 Mesh (6×6 Mesh, 10×10 Mesh, 16×16 Mesh)
Routing: RPM
VC/Port: 4
VC Depth: 4
Packet Length (flits): 4
Unicast Traffic Pattern: Uniform Random (Bit Complement, Transpose)
Multicast Packet Portion: 10% (5%, 20%, 40%, 80%)
Multicast Destination Number: 0-16 (uniformly distributed)

Page 21: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Uniform Random Traffic

Latency improves by around 50% before network saturation. Network throughput increases by 40%.

[Figure: latency (cycles, 0 to 120) versus injection rate (flits/cycle/core, 0.01 to 0.15). Series: RPM, Mul unicast, VCTM(20%), VCTM(40%), VCTM(80%). Annotations: 50%, 40%, 40%.]

Page 22: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Link Utilization

[Figure: link utilization (op/cycle, 0 to 0.45) versus injection rate (flits/cycle/core, 0.01 to 0.45). Series: RPM, VCTM(20%), VCTM(40%), VCTM(80%). Annotations: 33%, 45%.]

Under low load, RPM reduces link utilization by 33%. Under high load, RPM reduces link utilization by 45%.

Page 23: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Dynamic Power Consumption

[Figure: stacked bars of dynamic power (W, 0 to 12) for RPM and VCTM at injection rates from 0.01 to 0.5 flits/cycle/core. Components: Buffer, VC Arbiter, SW Arbiter, Xbar, Link. Annotations: 40%, 50%.]

Page 24: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Scalability Study: Network Size

[Figure: latency (cycles, 0 to 140) versus network size (6×6, 8×8, 10×10, 16×16). Series: RPM, VCTM. Annotation: Over 50%.]

Page 25: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Scalability Study: Multicast Traffic Portion

[Figure: latency (cycles, 0 to 140) versus portion of multicast traffic (5%, 10%, 20%, 40%, 80%, 100%). Series: RPM, VCTM.]

Page 26: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Scalability Study: Destination Number

[Figure: latency (cycles, 0 to 140) versus maximum number of destinations (4, 8, 16, 32). Series: RPM, VCTM.]

Page 27: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Conclusion

We propose a new multicast routing algorithm, Recursive Partitioning Multicast (RPM), which is bandwidth-efficient and scalable.

Performance improvement: up to 50% latency reduction; 33% link utilization reduction.

Power savings: up to 40% total dynamic power savings; 25% crossbar and link power savings.

Page 28: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Thank you!

Page 29: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Backup

Page 30: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Hardware Implementation of Routing Logic

Page 31: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Bit Complement Traffic

[Figure: latency (cycles, 0 to 120) versus injection rate (flits/cycle/core, 0.01 to 0.15). Series: RPM, Mul unicast, VCTM (20%), VCTM (40%), VCTM (80%).]

Page 32: Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Transpose Traffic

[Figure: latency (cycles, 0 to 120) versus injection rate (flits/cycle/core, 0.01 to 0.15). Series: RPM, Mul unicast, VCTM (20%), VCTM (40%), VCTM (80%).]