High-Performance Networks for Dataflow Architectures

Pravin Bhat, Andrew Putnam


Transcript of High-Performance Networks for Dataflow Architectures

Page 1: High-Performance Networks for  Dataflow Architectures

High-Performance Networks for Dataflow Architectures

Pravin Bhat

Andrew Putnam

Page 2: High-Performance Networks for  Dataflow Architectures

Overview

Motivation & Design Constraints
Network design
Performance
Adaptive Routing
Conclusion

Page 3: High-Performance Networks for  Dataflow Architectures

Overview

Motivation & Design Constraints
Network design
Performance
Adaptive Routing
Conclusion

Page 4: High-Performance Networks for  Dataflow Architectures

Motivation

Signal delay on wires is more important than transistor switching speed
Seriously decreased reliability in future processes
    Factory testing will not be possible
    Expect 20% of transistors to be DOA
    Expect 10% more to die over several months
Dataflow is an answer, but the network is currently a bottleneck

Page 5: High-Performance Networks for  Dataflow Architectures

Dataflow Characteristics

Unpredictable traffic
    Cannot pre-allocate resources
Highly bursty traffic
    Quick delivery of bursts is critical
Nodes are not guaranteed to consume messages
    Potential for livelock & deadlock

Page 6: High-Performance Networks for  Dataflow Architectures

Overview

Motivation & Design Constraints
Network design
Performance
Adaptive Routing
Conclusion

Page 7: High-Performance Networks for  Dataflow Architectures

Network Requirements

High performance during bursts
Area efficient
Guarantee message delivery
Deadlock & livelock free
Fault tolerant
Regular 2-D physical structure

Page 8: High-Performance Networks for  Dataflow Architectures

Topology

On-chip - must be implementable in 2-D
Regular tiled structure suggests:
    Grid
    Torus
    Hypercube
    Fat Tree
Hypercube is difficult to route, scale
Fat Tree has a single point of failure
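As a rough illustration of how the grid and torus candidates differ, the sketch below compares hop counts between two tiles. The 8x8 array size and the function names are assumptions made for the example, not values from the slides.

```python
def grid_hops(src, dst):
    """Hop count between two (x, y) tiles on a grid: Manhattan distance."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def torus_hops(src, dst, width, height):
    """Hop count on a torus: each dimension may take the wrap-around link."""
    dx = abs(src[0] - dst[0])
    dy = abs(src[1] - dst[1])
    return min(dx, width - dx) + min(dy, height - dy)


# Corner-to-corner traffic on a hypothetical 8x8 array of tiles.
print(grid_hops((0, 0), (7, 7)))         # 14
print(torus_hops((0, 0), (7, 7), 8, 8))  # 2
```

The wrap-around links shorten the worst-case path, at the cost of longer physical wires for those links.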

Page 9: High-Performance Networks for  Dataflow Architectures

Routing

Static routing does not provide essential fault tolerance
Use a modified Virtual Channel algorithm
VC guarantees deadlock freedom if nodes consume messages
Dynamically adaptive to handle transient faults & congestion
Initial studies used static routing

Page 10: High-Performance Networks for  Dataflow Architectures

Flow Control

Resource reservation not possible
Long-latency wires prohibit handshakes
Send messages assuming accept
Buffer just enough to allow receiver to send reject signal on subsequent clock cycle
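One possible reading of this scheme is sketched below as a cycle-by-cycle simulation: the sender transmits without a handshake, keeps a single copy of the last message, and resends it if a reject signal arrives on the following cycle. Placing the one-entry buffer at the sender, and the parameters `receiver_capacity` and `drain_every`, are assumptions made for the illustration.

```python
from collections import deque


def run(messages, receiver_capacity, drain_every, cycles):
    """Simulate optimistic sending with a one-cycle-late reject signal."""
    pending = deque(messages)
    held = None        # copy of the message sent on the previous cycle
    rejected = False   # reject signal raised by the receiver last cycle
    rx = deque()       # receiver input queue
    delivered = []
    for t in range(cycles):
        # 1. React to the reject signal for last cycle's message.
        if held is not None:
            if rejected:
                pending.appendleft(held)  # retry the rejected message
            held = None
        # 2. The receiver consumes one message every `drain_every` cycles.
        if t % drain_every == 0 and rx:
            delivered.append(rx.popleft())
        # 3. Send the next message, assuming it will be accepted.
        if pending:
            held = pending.popleft()
            if len(rx) < receiver_capacity:
                rx.append(held)
                rejected = False
            else:
                rejected = True
    return delivered


print(run(list(range(5)), receiver_capacity=1, drain_every=3, cycles=40))
# [0, 1, 2, 3, 4]
```

Every message is delivered exactly once and in order, even though the slow receiver rejects most of the attempts.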

Page 11: High-Performance Networks for  Dataflow Architectures

Deadlock-Free Operation

Nodes cannot always consume messages
Add a dedicated channel to and from memory
    Adds 8% area overhead
Rotate stalled operands out of PEs to ensure forward progress
Send first operand back at a faster rate to avoid livelock

Page 12: High-Performance Networks for  Dataflow Architectures

Overview

Motivation & Design Constraints
Network design
Performance
Adaptive Routing
Conclusion

Page 13: High-Performance Networks for  Dataflow Architectures

Performance

Ran network-centric simulations
    20 billion instructions
    Spec2000, Splash2, and Dataflow benchmarks
Goal is to find optimum balance of:
    Number of Virtual Channels
    Queue Length
    Link Bandwidth
    Packets per message

Page 14: High-Performance Networks for  Dataflow Architectures

[Figure: Virtual Channels. x-axis: Virtual Channels (0 to 16); y-axis: Performance (Runtime); series: ocean, lu, fir, art, mcf, each as (G) and (T)]

Page 15: High-Performance Networks for  Dataflow Architectures

[Figure: Queue Length. x-axis: Queue Length (0 to 64); y-axis: Performance (Runtime); series: ocean, lu, fir, art, mcf, each as (G) and (T)]

Page 16: High-Performance Networks for  Dataflow Architectures

[Figure: Link Bandwidth. x-axis: Bandwidth (0 to 16); y-axis: Performance (Runtime); series: ocean, lu, fir, art, mcf, each as (G) and (T)]

Page 17: High-Performance Networks for  Dataflow Architectures

[Figure: Link Width. x-axis: Packets per Message (0 to 64); y-axis: Performance (Runtime); series: ocean, lu, fir, art, mcf, each as (G) and (T)]

Page 18: High-Performance Networks for  Dataflow Architectures

ASIC Model

Performance must be balanced with area
Developed RTL model of WaveScalar network architecture
90 nm process ASIC standard cell library
Timing per link:
    Grid links: 2.76 ns
    Torus links: 6.16 ns
Network switch is 11.6% of chip area
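To put the link timings in perspective, the short calculation below converts each delay into an upper bound on link clock rate. It assumes that one transfer occupies the full link delay with no pipelining, which the slides do not state.

```python
def max_link_frequency_mhz(delay_ns):
    """Upper bound on link clock rate if one transfer takes delay_ns."""
    return 1000.0 / delay_ns


GRID_NS = 2.76   # grid link timing from the slide
TORUS_NS = 6.16  # torus link timing from the slide

print(round(max_link_frequency_mhz(GRID_NS)))   # 362
print(round(max_link_frequency_mhz(TORUS_NS)))  # 162
print(round(TORUS_NS / GRID_NS, 2))             # 2.23
```

Under that assumption a torus link is about 2.2 times slower than a grid link, so any hop-count advantage of the torus has to be weighed against its longer wires.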

Page 19: High-Performance Networks for  Dataflow Architectures

[Figure: Virtual Channels. x-axis: Virtual Channels (0 to 18); y-axis: Performance / Area; series: ocean, lu, fir, art, mcf, each as (G) and (T)]

Page 20: High-Performance Networks for  Dataflow Architectures

[Figure: Link Bandwidth. x-axis: Number of Links (0 to 16); y-axis: Performance / Area; series: ocean, lu, fir, art, mcf, each as (G) and (T)]

Page 21: High-Performance Networks for  Dataflow Architectures

[Figure: Queue Length. x-axis: Queue Length (0 to 64); y-axis: Performance / Area; series: ocean, lu, fir, art, mcf, each as (G) and (T)]

Page 22: High-Performance Networks for  Dataflow Architectures

Overview

Motivation & Design Constraints
Network design
Performance
Adaptive Routing
Conclusion

Page 23: High-Performance Networks for  Dataflow Architectures

Virtual Channels Flow Control

In hardware, only the head of the queue can be dequeued in one clock cycle
If the first message in a queue is blocked, then every message behind it is blocked
Network utilization suffers due to idle links

Page 24: High-Performance Networks for  Dataflow Architectures

Virtual Channels Flow Control

Virtual Channels – several small queues instead of one long queue
Decouples buffer resources from link resources
Increases network throughput by increasing link usage
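The head-of-queue blocking problem and the effect of splitting one queue into several can be sketched with a toy model. Assigning messages to virtual channels by output port is a simplification chosen for this example; the slides do not say how channels are allocated, and all names below are hypothetical.

```python
from collections import deque


def forward(queues, blocked_outputs, cycles):
    """Each cycle the link forwards at most one head-of-queue message
    whose output port is not blocked. Returns the names forwarded."""
    sent = []
    for _ in range(cycles):
        for q in queues:
            if q and q[0][1] not in blocked_outputs:
                sent.append(q.popleft()[0])
                break
    return sent


MESSAGES = [("m0", "east"), ("m1", "north"), ("m2", "north"), ("m3", "east")]


def one_long_queue():
    return [deque(MESSAGES)]


def two_virtual_channels():
    # Assumption for illustration: one small queue per output port.
    return [deque(m for m in MESSAGES if m[1] == "east"),
            deque(m for m in MESSAGES if m[1] == "north")]


# The east output is blocked for the whole run.
print(forward(one_long_queue(), {"east"}, cycles=4))        # []
print(forward(two_virtual_channels(), {"east"}, cycles=4))  # ['m1', 'm2']
```

With a single queue the blocked head message idles the link for every cycle; with two virtual channels the northbound messages still make progress.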

Page 25: High-Performance Networks for  Dataflow Architectures

Dimension Order Routing

Old WaveScalar Routing Protocol
Network topology is a static grid
Packets first travel to the correct x-coordinate and then to the correct y-coordinate
Low network utilization from not using all available paths
Not fault tolerant
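A minimal sketch of dimension order routing on a grid follows, with tiles addressed as (x, y) pairs. The function names and the `faulty` set are hypothetical, introduced only to show why a single fixed path gives no fault tolerance.

```python
def dimension_order_route(src, dst):
    """Return the tiles visited when routing X first, then Y, on a grid."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:          # first reach the correct x-coordinate
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:          # then reach the correct y-coordinate
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def route_survives(src, dst, faulty):
    """True if the single dimension-order path avoids every faulty tile."""
    return not any(tile in faulty for tile in dimension_order_route(src, dst))


print(dimension_order_route((0, 0), (2, 1)))
# [(0, 0), (1, 0), (2, 0), (2, 1)]
print(route_survives((0, 0), (2, 1), faulty={(1, 0)}))  # False
```

Because the path is fully determined by source and destination, one faulty tile on it cuts the pair off even though other minimal paths exist.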

Page 26: High-Performance Networks for  Dataflow Architectures

Adaptive Routing

Progressively chooses longer routes instead of waiting for an unavailable resource
High network utilization
Fault tolerant
Can cause deadlock

Page 27: High-Performance Networks for  Dataflow Architectures

Deadlock Free Adaptive Routing

Some Virtual Channels are reserved for Dimension Order Routing, the rest are used for Adaptive Routing
Every time a packet is routed in the wrong direction, its Dimension Reversal count is incremented
No packet is allowed to wait in a virtual channel with a packet that has a lower Dimension Reversal count
Mathematically proven to be deadlock free
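The waiting rule can be sketched as a small decision function. Falling back to a reserved dimension-order channel when waiting is forbidden is an assumption about how the reserved channels are used; the slide does not spell it out, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    name: str
    dr: int = 0  # Dimension Reversal count

    def misroute(self):
        """Record one hop taken in the wrong direction."""
        self.dr += 1


def may_wait(packet, occupants):
    """A packet may not wait behind any packet with a lower DR count."""
    return all(other.dr >= packet.dr for other in occupants)


def choose_channel(packet, adaptive_occupants, adaptive_free):
    if adaptive_free:
        return "adaptive"
    if may_wait(packet, adaptive_occupants):
        return "wait"
    # Assumed fallback: use a channel reserved for dimension order routing.
    return "dimension-order"


p = Packet("a")
p.misroute()                                        # p.dr is now 1
print(choose_channel(p, [Packet("b", 0)], False))   # dimension-order
print(choose_channel(p, [Packet("c", 2)], False))   # wait
print(choose_channel(p, [], True))                  # adaptive
```

Because packets only ever wait behind packets with an equal or higher count, the rule restricts which packets can end up waiting on one another.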

Page 28: High-Performance Networks for  Dataflow Architectures

[Figure: Virtual Channels. x-axis: Virtual Channels (0 to 16); y-axis: Performance (Runtime); series: ocean, lu, fir, art, mcf, each as (G) and (T)]

Page 29: High-Performance Networks for  Dataflow Architectures

[Figure: Queue Length (Adaptive Speedup). x-axis: Queue Length (0 to 64); y-axis: Performance (Runtime); series: ocean, lu, fir, art, mcf, each as (G) and (T)]

Page 30: High-Performance Networks for  Dataflow Architectures

[Figure: Link Bandwidth (Adaptive Speedup). x-axis: Bandwidth (0 to 16); y-axis: Performance (Runtime); series: ocean, lu, fir, art, mcf, each as (G) and (T)]

Page 31: High-Performance Networks for  Dataflow Architectures

Conclusion

Best performance per area with:
    2 Virtual Channels
    2 Links
    2-4 entries per queue
    Torus Topology
    Adaptive Routing
Dataflow chip networks can be high-performance at reasonable area