
7 • On-Chip Interconnection Networks

Chip Multiprocessors (ACS MPhil)

Robert Mullins

Introduction

• Vast transistor budgets, but...
• Poor interconnect scaling
  – Pressure to decentralise designs
• Need to manage complexity and power
• Need for flexible/fault-tolerant designs
• Parallel architectures
  – Keep core complexity constant or simplify
  – The result is a need to interconnect lots of cores, memories and other IP cores.

Introduction

• On-chip communication requirements:
  – High-performance
    • Latency and bandwidth
  – Flexibility
    • Move away from fixed application-specific wiring
  – Scalability
    • Number of modules is rapidly increasing

Introduction

• On-chip communication requirements:
  – Simplicity (ease of design and verification)
    • Structured, modular and regular
    • Optimize channel and router once
  – Efficiency
    • Ability to share global wiring resources between different flows
  – Fault tolerance (in the long term)
    • The existence of multiple communication paths between module pairs
  – Support for different traffic types and QoS

Introduction

• The design of the on-chip network is not an isolated design decision (or afterthought)
  – e.g. consider impact on cache coherency protocol
  – What is the correct balance of resources (wires and transistors, silicon area, power etc.) between the on-chip network and computational resources?
  – Where does the on-chip network stop and the design of a module or core start?
    • “integrated microarchitectural networks”
  – Does the network simply blindly allow modules to communicate or does it have additional functionality?


Introduction

• Don't we already know how to design interconnection networks?
  – Many network topologies, router designs and much theory have already been developed for high-end supercomputers and telecom switches
  – Yes, and we'll cover some of this material, but the trade-offs on-chip lead to very different designs.


On-chip vs. Off-chip

• Compare availability of pins and wiring tracks on-chip to cost of pins/connectors and cables off-chip

• Compare communication latencies on- and off-chip
  – What is the impact on router and network design?
• Applications and workloads
• Amount of memory available on-chip
  – What is the impact on router design/flow control?
• Power budgets on- and off-chip
• Need to map network to planar chip (or perhaps more recently a 3D stack of dies)


On-chip interconnect

• Typical interconnect at 45nm node:
  – 10-14 metal layers
  – Local interconnect (M1)
    • 65nm metal width, 65nm spacing
    • 7700 metal tracks/mm
  – Global (e.g. M10)
    • 400nm metal width, 400nm spacing
    • 1250 metal tracks/mm
• Remember global interconnects scale poorly when compared to transistors

Figure: 9Cu+1Al process (Fujitsu 2007)
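The track densities quoted for M1 and M10 follow directly from the wire pitch (width plus spacing). A minimal sketch of that arithmetic; the function name `tracks_per_mm` is made up for illustration:

```python
def tracks_per_mm(width_nm: float, spacing_nm: float) -> float:
    """Number of parallel wires that fit in 1 mm at the given width and spacing."""
    pitch_nm = width_nm + spacing_nm      # one wire plus one gap
    return 1_000_000 / pitch_nm           # 1 mm = 1,000,000 nm


m1_tracks = tracks_per_mm(65, 65)         # local interconnect, roughly 7700 tracks/mm
m10_tracks = tracks_per_mm(400, 400)      # global interconnect, 1250 tracks/mm
```

The roughly 6x difference in density is one way to see why global wiring is the scarcer resource.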


Bus-based interconnects

• Bus-based interconnects
  – Central arbiter provides access to bus
  – Logically the bus is simply viewed as a set of wires shared by all processors


Bus-based interconnects

• Real bus implementations are typically switch based
  – Multiplexers and unidirectional interconnects with repeaters
  – Tri-states are rarely used now
  – Interconnect itself may be pipelined
• A bus-based CMP usually exploits multiple unidirectional buses
  – e.g. address bus, response bus and data bus


Bus-based interconnects for multicore?

• Metal/wiring is cheap on-chip!

• Avoid complexity of packet-switched networks

• Keep cache-coherency simple

• Performance issues
  – Centralised arbitration
  – Low clock frequency (pipeline?)
  – Power?
  – Scalability?

Figure: Repeated Bus Global Interconnect (Shekhar Borkar, OCIN'06)



• Optimising bus-based solutions:
  – Arbitrate for next cycle on current clock cycle
  – Use wide, low-swing interconnects
  – Limit broadcast to subset of processors?
    • Segment bus and filter redundant broadcasts to some segments by maintaining some knowledge of cache contents: so-called “Filtered Segmented Buses”
  – Employ multiple buses
  – Move from electrical to on-chip optical solutions?


Filtered Segmented Bus

“Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks”, Udipi et al., HPCA 2010

• Filter broadcasts to segments with Bloom filter

• Energy savings possible vs. mesh and flattened butterfly networks (for 16, 32 and 64-cores) because routers can be removed

• For large numbers of cores multiple (address-interleaved) buses are required to avoid a significant performance penalty due to contention
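To make the filtering idea concrete, here is a minimal sketch of a per-segment Bloom filter deciding which bus segments need to see a broadcast. It is a generic Bloom filter, not the filter organisation from the paper; the class, the sizes (1024 bits, 3 hashes) and the helper `segments_to_snoop` are all assumptions for illustration.

```python
import hashlib


class BloomFilter:
    """Approximate set of cache-line addresses held in one bus segment."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                     # bit vector stored as an integer

    def _positions(self, address):
        # Derive num_hashes bit positions from the address.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{address}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, address):
        for p in self._positions(address):
            self.bits |= 1 << p

    def may_contain(self, address):
        # False means "definitely not cached here"; True may be a false positive.
        return all((self.bits >> p) & 1 for p in self._positions(address))


def segments_to_snoop(filters, address):
    """Segments whose filter cannot rule out a copy of this address."""
    return [seg for seg, f in enumerate(filters) if f.may_contain(address)]


filters = [BloomFilter() for _ in range(4)]   # one filter per bus segment
filters[2].add(0x1000)                        # segment 2 caches line 0x1000
targets = segments_to_snoop(filters, 0x1000)  # broadcast only reaches segment 2
```

A false positive only costs a redundant broadcast, so correctness of the snooping protocol is preserved. Handling evictions would need something more (e.g. counting filters), which this sketch leaves out.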



• Exploiting multiple buses (or rings):
  – Multiple address-interleaved buses
    • e.g. Sun Wildfire/Starfire
  – Use different buses for different message types
  – Subspace snooping [Huh/Burger06]
    • Associate (dynamic) address ranges with each bus. Each subspace is a region of data shared by a stable subset of the processors.
    • This technique tackles snoop bandwidth limitations, as not every processor is required to snoop every bus
  – Exploit buses at the lowest level of a hierarchical network (e.g. mesh interconnecting tiles, where each tile is a group of cores connected by a bus)
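Address interleaving simply means that the bus used for a transaction is a fixed function of its address. A minimal sketch, assuming 64-byte cache lines and low-order line-index interleaving (both are assumptions, not details of any particular machine):

```python
def bus_for_address(address: int, num_buses: int = 4, line_bytes: int = 64) -> int:
    """Pick the bus that carries snoops for the cache line containing address."""
    line_index = address // line_bytes    # all bytes of one line map to one bus
    return line_index % num_buses         # consecutive lines rotate over the buses


buses = [bus_for_address(a) for a in (0, 64, 128, 192, 256)]
```

Because consecutive lines land on different buses, snoop traffic is spread evenly as long as accesses are not strided by a multiple of `num_buses` lines.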


Sun Starfire (UE10000)

Uses 4 interleaved address buses to scale snooping protocol

Figure: system boards, each with four processors (P) with caches ($), a memory module and a board interconnect, connected by a 16x16 data crossbar

• 4 processors + memory module per system board
• Up to 64-way SMP using bus-based snooping protocol
• Separate data transfer over high-bandwidth crossbar

Slide from Krste Asanovic (Berkeley)


Ring Networks

• Exploit short point-to-point interconnects
• Can support many concurrent data transfers
• Can keep coherence protocol simple and avoid need for directory-based schemes
  – We may still broadcast transactions

• Modest area requirements

k-node ring (or k-ary 1-cube)


Ring Networks

• Control
  – May be distributed
    • Need to be a little careful to avoid possibility of deadlock (more later!)
  – Or a centralised arbiter/scheduler may be used
    • e.g. IBM Cell BE and Larrabee both appear to use a centralised scheduler
    • Try to schedule as many concurrent (non-overlapping) transfers on each available ring as possible
• Trivial routers at each node
  – Simple routers are attractive as they don't introduce significant latency, power and area overheads


Ring Networks: Examples

• IBM – Power4, Power5

• IBM/Sony/Toshiba – Cell BE (PS3, HDTV, Cell blades, ...)

• Intel
  – Larrabee (graphics), 8-core Xeon processor
• Kendall Square Research (1990s)
  – Massively parallel supercomputer design
  – Ring of rings (hierarchical or multi-ring) topology
    • Cluster = 32 nodes connected in a ring
    • Up to 34 clusters connected by higher level ring


Ring Networks: Example IBM Cell BE

• Cell Broadband Engine
  – Message-passing style (no $ coherence)
  – Element Interconnect Bus (EIB)
    • 2 rings are provided in each direction
    • Crossbar solution was deemed too large


Ring Networks: Example Larrabee

• Cache coherent
• Bi-directional ring network, 512-bit wide links
  – Short linked rings proposed for >16 processors
  – Routing decisions are made before injecting messages
    • The clockwise ring delivers on even clock cycles and the anticlockwise ring on odd clock cycles


Crossbar Networks

• A crossbar switch is able to directly connect any input to any output without any intermediate stages
  – It is an example of a strictly non-blocking network
    • It can connect any input to any output, incrementally, without the need to rearrange any of the circuits currently set up.
  – The main limitation of a crossbar is its cost. Although very useful in small configurations, n x n crossbars can quickly become prohibitively expensive as their cost increases as $n^2$


Crossbar Networks

• A 4x3 crossbar implemented using three 4:1 multiplexers

• Each multiplexer selects a particular input to be connected to the corresponding output

(Dally/Towles book Chapter 6)
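The multiplexer view of a crossbar can be sketched in a few lines: one select value per output, each driving its own 4:1 multiplexer. The function names are illustrative only.

```python
def mux(inputs, select):
    """An n:1 multiplexer: forward the selected input."""
    return inputs[select]


def crossbar(inputs, selects):
    """One multiplexer per output; selects[j] names the input driving output j."""
    return [mux(inputs, s) for s in selects]


# 4 inputs, 3 outputs: three 4:1 multiplexers.
outputs = crossbar(["a", "b", "c", "d"], [2, 0, 2])   # -> ['c', 'a', 'c']
```

Nothing here stops two outputs selecting the same input (input 2 above), which is how a crossbar supports multicast. The cost grows with inputs x outputs, since every multiplexer needs a wire from every input.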


Crossbar Networks: Example Niagara

• Crossbar switch interconnects 8 processors to banked on-chip L2 cache
  – A crossbar is actually provided in each direction:
    • Forward and Return
• Simple cache coherence protocol
  – See earlier seminar

Reproduced from IEEE Micro, Mar'05


Crossbar Networks: Example Cyclops

• IBM, US Dept. of Energy/Defense, Academia
• Full system 1M+ processors, 80 cores per chip
• Interconnect: centralised 96x96 buffered crossbar switch with a 7-stage pipeline


Crossbar Networks: Example Cyclops

• Motivation for use of crossbar
  – Simple uniform memory access model
  – System is sequentially consistent
• Crossbar interconnects:
  – 80 processors, incl:
    • 2 x thread units, 2 x scratch pad memories (can also be configured as global memory), 1 x FPU
  – Off-chip memory, I/O and interfaces to off-chip network
• Crossbar Area
  – 1.6mm x 17mm (27mm²), 6% of total die area
• Communication and arbitration is centralised
  – Power implications? Fixed core-to-core comms. delay


Crossbar Networks: On-chip Optical

• Fully optical on-chip buses, rings and crossbars have also recently been proposed. They may be constructed using ring-resonator based modulators and detectors.

[Firefly/Corona ISCA 08/09]


Interconnection Networks

• So far we have only discussed buses, ring networks and crossbar switches

• In general a network can be described by its:
  – Topology: How we arrange our shared routers and channels
    • Router = (buffers) + switch + control
  – Flow Control: How we allocate network resources, e.g. buffer capacity and channel bandwidth
  – Routing Algorithm: How we select a path through our network from the source to destination node


Nomenclature (See Dally/Towles)

• Modules (i.e. cores, memories etc.) connect to the network at network terminals

• The network itself consists of a collection of nodes (switches/routers) which are linked by channels
  – Channels are characterised by their width, frequency and latency
• Direct networks associate a terminal (connection to the outside world) with each node. In a direct network a node is both a terminal and a switch

• In indirect networks nodes are either a terminal or a switch (not both)


Nomenclature

• Paths
  – A route or path between two nodes is given by an ordered set of channels, P. The length or hop count of a path is |P|.
  – A minimal path between two nodes is a path with the smallest hop count
  – The diameter ($H_{\max}$) of the network is the largest minimal hop count over all pairs of terminals in the network
  – For an N x N mesh, the diameter is 2(N-1)
    • e.g. bottom-left node to top-right node
    • It is 1 for a crossbar switch
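The diameter definition can be checked by brute force: compute minimal hop counts with a breadth-first search from every node and take the maximum. A minimal sketch for an N x N mesh; the helper names are illustrative.

```python
from collections import deque


def mesh_neighbours(node, n):
    """Nodes one hop away from node in an n x n mesh (no wraparound)."""
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < n and 0 <= ny < n:
            yield (nx, ny)


def hop_counts(source, n):
    """Minimal hop count from source to every node, by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in mesh_neighbours(node, n):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def mesh_diameter(n):
    """Largest minimal hop count over all pairs of nodes."""
    return max(max(hop_counts((x, y), n).values())
               for x in range(n) for y in range(n))


diameter_4 = mesh_diameter(4)   # 2 * (4 - 1) = 6, corner to opposite corner
```

The same search works for any topology once `mesh_neighbours` is swapped for the appropriate neighbour function.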


Nomenclature

• Higher level messages between network clients are communicated over the network as one or more packets.

• Each packet is composed from a number of flits (flow control digits). A flit is the smallest unit of information recognised by the flow control method.
  – e.g. an on-chip network may employ 64-bit flits and have the ability to send a new flit over a channel (perhaps also 64-bits wide) every clock cycle.
  – A network may expect fixed or variable length packets
  – The first flit in a packet is often called the head flit and the last the tail flit


Nomenclature

• The average length of the minimum paths between all sources and destinations is known as the average minimum hop count, $H_{\min}$.
• The latency of a network is the time required for a packet to traverse the network
  – More precisely, this is the time between the head of the message (or packet) reaching the input terminal and its tail departing from the output terminal
  – Head latency is the time for the head of the message to traverse the network
  – The serialization latency is the time for the tail of the message to catch up (i.e. the time taken to send a complete packet across a channel)
  – The latency of a single router is $T_R$


Nomenclature

Average zero-load latency (in the absence of contention) provides us with a lower bound on average latency = (Router delay) + (Time of flight) + (serialization latency)

$T_0 = H_{\min} T_R + D_{\min}/v + L/b$

$D_{\min}$ – average distance. The physical distance between network nodes is dependent on the layout of the network (in larger networks involving multiple chips this will be dependent on packaging decisions)

$v$ – propagation velocity (how fast does a flit propagate between nodes?)

$L$ – packet length, $b$ – channel bandwidth
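As a worked example of the zero-load latency expression, the sketch below evaluates its three terms separately. The numbers are invented purely to exercise the formula (cycles as the time unit, mm as the distance unit) and do not describe any real network.

```python
def zero_load_latency(h_min, t_r, d_min, v, length, b):
    """T0 = H_min * T_R + D_min / v + L / b (all times in the same unit)."""
    router_delay = h_min * t_r        # hops x per-router latency
    time_of_flight = d_min / v        # wire distance / propagation velocity
    serialization = length / b        # packet length / channel bandwidth
    return router_delay + time_of_flight + serialization


# Assumed values: 4 hops, 2-cycle routers, 8 mm at 2 mm/cycle,
# 256-bit packets over 64 bit/cycle channels.
t0 = zero_load_latency(h_min=4, t_r=2, d_min=8, v=2, length=256, b=64)   # 8 + 4 + 4 cycles
```

Splitting the terms makes it easy to see which one dominates: with these numbers router delay is half the total, which is why single-cycle routers and low hop counts matter on-chip.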


Nomenclature

• The bisection width is the minimum number of wires that must be cut when the network is divided into two equal sets of nodes

• The collective bandwidth over the bisection width is known as the bisection bandwidth


Traffic Patterns (Dally/Towles, 3.2)

• Random traffic
  – each source is equally likely to send to any destination
• Permutation traffic: (one fixed destination per source)
  – Bit permutation, e.g. 4-bit source address = {s3, s2, s1, s0}
    • Bit complement (e.g. dest = {!s3, !s2, !s1, !s0})
    • Bit reverse (e.g. dest = {s0, s1, s2, s3})
    • Bit rotation
    • Shuffle (Sorting)
    • Transpose (FFT)
  – Digit permutations
    • Tornado (adversary for torus topologies)
    • Neighbour (fluid dynamics)
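Some of the bit permutations above are easy to express as functions from a source address to a destination address. A minimal sketch for 4-bit addresses; the transpose here swaps the upper and lower halves of the address, which is the usual definition but should be checked against Dally/Towles section 3.2 before relying on it.

```python
def bit_complement(src: int, bits: int = 4) -> int:
    """dest = {!s3, !s2, !s1, !s0}"""
    return ~src & ((1 << bits) - 1)


def bit_reverse(src: int, bits: int = 4) -> int:
    """dest = {s0, s1, s2, s3}"""
    dest = 0
    for i in range(bits):
        if src & (1 << i):
            dest |= 1 << (bits - 1 - i)
    return dest


def transpose(src: int, bits: int = 4) -> int:
    """Swap the upper and lower halves of the address: (x, y) -> (y, x)."""
    half = bits // 2
    low = src & ((1 << half) - 1)
    high = src >> half
    return (low << half) | high


destinations = [bit_complement(s) for s in range(16)]   # one fixed destination per source
```

Each function is a permutation of the address space, so every node receives from exactly one source, unlike uniform random traffic.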


Real Traffic Patterns

“A communication characterization of Splash-2 and Parsec”, Barrow-Williams/Fensch/Moore, IISWC'09


Performance

• Imagine a mesh and uniform random traffic
• If we divide the nodes of the mesh in two, it is clear that half the traffic from one half must cross the bisection
• Hence the topology limit is 2Bc/N, where Bc is the bisection bandwidth and N is the number of nodes
• e.g. for an 8x8 mesh:

32*traffic_per_node*0.5 = 0.5*Bc

Let's assume that the bisection bandwidth = 16 flits/cycle, so traffic_per_node = 0.5 flits/cycle

Offered traffic is often normalised to the topology limit
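The 8x8 example can be reproduced with a couple of lines. The sketch below solves (N/2) * t * 0.5 = 0.5 * Bc for the per-node injection rate t and then normalises an offered load to that limit; the function names are illustrative.

```python
def topology_limit(bisection_bandwidth: float, num_nodes: int) -> float:
    """Maximum per-node injection rate under uniform random traffic: 2*Bc/N."""
    return 2 * bisection_bandwidth / num_nodes


def normalised_load(offered_per_node: float, bisection_bandwidth: float,
                    num_nodes: int) -> float:
    """Offered traffic as a fraction of the topology limit."""
    return offered_per_node / topology_limit(bisection_bandwidth, num_nodes)


limit = topology_limit(bisection_bandwidth=16, num_nodes=64)   # 0.5 flits/cycle/node
load = normalised_load(0.2, 16, 64)                            # 40% of the topology limit
```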


Performance: Routing Limit

• A particular routing algorithm may be unable to balance the traffic across the channels of the bisection. In this case, the throughput limit imposed by the routing algorithm may be significantly lower than that of the topology limit.


Topology: The Basics

• Choosing the topology is the first step in designing a network
  – The flow control and routing algorithm are obviously dependent on topology
  – Selecting a topology involves a trade-off between complexity/cost and performance (bandwidth and latency)
• Don't forget we must map the network to a 2D VLSI implementation
  – Die stacking introduces need for 3D networks
• On-chip networks typically try hard to minimise communication latency and power
  – e.g. Exploit short channels to neighbouring cores
  – Direct networks


Topology: Mesh

A 4-ary 2-mesh

• Simple and regular mapping to 2D VLSI implementation

• Channels to nearest neighbours

• Simple routers
  – Low radix (or degree)
    • e.g. 5x5 switch (or two 3x3 switches)
    • 4 ports for neighbours and 1 for terminal

• Potentially high hop count


Topology: Concentrated Mesh

• Increase scalability of mesh by co-locating multiple terminals at each node
  – Here concentration has been achieved using a larger crossbar at each node
  – Could also just use a multiplexer or bus

• Concentration reduces the average hop count and hence the zero-load latency
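The effect of concentration on hop count can be estimated by brute force. The sketch below averages the minimal (Manhattan) hop count between routers over all source/destination pairs, including pairs that share a router, and compares 64 terminals on an 8x8 mesh with the same 64 terminals concentrated four per router on a 4x4 mesh. The 4-way concentration factor is an assumption chosen for the example.

```python
from itertools import product


def average_hop_count(k: int) -> float:
    """Average minimal router-to-router hop count in a k x k mesh,
    over all (source, destination) pairs including source == destination."""
    nodes = list(product(range(k), repeat=2))
    total = sum(abs(sx - dx) + abs(sy - dy)
                for (sx, sy) in nodes for (dx, dy) in nodes)
    return total / len(nodes) ** 2


plain = average_hop_count(8)          # one terminal per router: 5.25 hops
concentrated = average_hop_count(4)   # four terminals per router: 2.5 hops
```

The hop count roughly halves, but each router now needs more ports and the smaller mesh has fewer channels across its bisection.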


Topology: Conc. Mesh + Express Chan.

• We can add express channels to restore lost bisection bandwidth

• Express channels in general are additional links to non-local nodes
  – Physically, they are often longer than the minimum channel length

Concentrated Mesh with Express Channels

(Balfour/Dally, ICS'06)


Topology: Mesh + multidrop channels

• MECS topology
  – Grot et al., HPCA'09
  – Exploits multi-drop express channels
    • Point-to-multipoint unidirectional links connect a source with multiple destinations in a given row or column
• Network diameter is 2
• High connectivity with low channel count


Topology: Torus

• Meshes suffer from load imbalances for many traffic patterns as demand for the central channels is higher than for the edge channels

• The torus topology has twice the bisection bandwidth of a mesh

• The long wraparound channels may be avoided by folding the torus

(Dally/Towles, p99)

A 4-ary 2-cube


Flow Control: Circuit Switching

• In a circuit switched network, network resources (channels) are reserved before a packet is sent
  – The entire path (circuit) must be reserved first
  – Channels are often shared between different circuits using time-division multiplexing or by dividing the channel into multiple narrow links
  – The circuit is torn down once the message has been sent


Flow Control: Circuit Switching

• Minimal buffering at each switch
• Once circuit is set up, router latency and control overheads are very low
• Very poor use of channel bandwidth if lots of short packets must be sent to many different destinations
  – More commonly seen in embedded SoC applications where traffic patterns may be static and involve streaming large amounts of data between different IP blocks
  – Can also provide QoS/Guaranteed Services (GS)


Flow Control: Buffered Flow Control

• We can aim to make better use of channel resources by buffering packets (or a fraction of a packet) at each node. We then arbitrate for access to network resources dynamically.

• We distinguish between different approaches by the granularity at which we reserve resources (e.g. channels and buffers) and conditions that must be met for a packet to advance to the next node.

• Packet-Buffer Flow Control
  – Store and forward
  – Virtual cut-through
• Flit-Buffer Flow Control
  – Wormhole
    • Efficient use of (limited) buffer space


Flow Control: Buffered Flow Control

L – Packet Length


Virtual-Channel Flow Control

• Improve performance of wormhole routing, prevent a single packet blocking a free channel
  – e.g. if the green packet is blocked the red packet may still make progress through the network
  – We can interleave flits from different packets over the same channel


Virtual-channel Flow Control

• Virtual-channels are also often used to avoid deadlock
  – Deadlock may be a possibility due to the use of a particular topology, routing algorithm or higher-level protocol (message-dependent deadlock)

– Messages on one virtual-channel cannot block messages on another

• e.g. we might want to put “request” and “reply” messages on different virtual-channels

• Or provide a unique virtual channel for each message type in a directory-based cache coherency protocol


Flow Control: Backpressure

• How do we know how many buffers are free in the downstream router?

• Mechanisms for low-level flow-control between nodes (to provide backpressure)
  – on/off
    • Able to send, yes/no?
  – credit based
    • The upstream router maintains a counter for each downstream VC
    • Counter is decremented when flit is sent
    • Downstream router sends credit to upstream router whenever a flit leaves the VC buffer; when a credit is received the corresponding counter is incremented
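The credit-based scheme can be sketched as a counter on the upstream side paired with a buffer on the downstream side. This is a simplified model of one virtual channel that ignores the delay a credit takes to travel back upstream; the class and method names are illustrative.

```python
from collections import deque


class CreditLink:
    """One virtual channel between an upstream and a downstream router."""

    def __init__(self, buffer_depth: int):
        self.credits = buffer_depth   # upstream counter: free downstream slots
        self.buffer = deque()         # downstream VC buffer

    def send(self, flit) -> bool:
        """Upstream sends a flit if a credit is available."""
        if self.credits == 0:
            return False              # backpressure: downstream buffer is full
        self.credits -= 1             # counter decremented when flit is sent
        self.buffer.append(flit)
        return True

    def drain(self):
        """Downstream forwards a flit; the freed slot returns a credit upstream."""
        flit = self.buffer.popleft()
        self.credits += 1             # counter incremented when credit is received
        return flit


link = CreditLink(buffer_depth=2)
sent = [link.send(f) for f in ("a", "b", "c")]   # third flit is held back
first_out = link.drain()                          # frees a slot
retry = link.send("c")                            # now succeeds
```

In a real router the credit round-trip time determines how deep the buffer must be to keep the channel busy, which this zero-delay model does not capture.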


Deadlock

• The deadlock problem
  – A group of agents (e.g. packets) is unable to make progress as they are waiting on each other to release a resource (e.g. buffer or channel)
  – Disaster! Need to be careful even when design looks simple (e.g. ring network)
• Message-dependent deadlock
  – External dependencies in a higher level protocol may also contribute to deadlock
  – Often need to make sure different message classes (e.g. request/reply messages) travel on different virtual channels


Deadlock example

Glass/Ni


Deadlock

• Techniques for deadlock avoidance
  – Injection restriction
  – Centralised/static scheduling
  – End-to-end flow control
  – Virtual channels
  – For routing-related deadlock: the turn model, escape channels
• Deadlock Recovery
  – Deadlock might be infrequent and costly to avoid altogether, may be easier just to detect and recover
  – e.g. drain deadlocked network to another network, NACK requests, fall back on simpler protocol (e.g. Culler p.594).


Routing: The Basics

• A routing algorithm aims to maximise network throughput by balancing load across network channels
  – Latency is often important too, so routes also need to be kept short
  – Keep complexity low
    • Cycle time and energy implications
    • Can global network state information be exploited?
  – Avoid deadlock
• Deterministic and adaptive routing
• Multicast and broadcast support?


Routing: Dimension Order Routing

• Simple deterministic minimal routing algorithm (aka e-cube routing)
  – Route in one dimension at a time
  – If there is a choice of directions in each dimension (e.g. for a torus) we first compute the distance that would have to be travelled in each case and select the shortest path
• 2D version is also called XY routing
  – Route in X direction first, then Y direction
• Restricts turns to avoid deadlock
  – See The Turn Model (Dally/Towles, p.269)
  – and West-first, North-last, Negative-First routing
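XY routing on a mesh is short enough to write out in full. The sketch below returns the sequence of nodes a packet visits, correcting the X coordinate completely before touching Y; it covers the mesh case only (no wraparound choice as in a torus), and the function name is illustrative.

```python
def xy_route(src, dst):
    """Nodes visited from src to dst under XY dimension-order routing on a mesh."""
    x, y = src
    path = [src]
    while x != dst[0]:                    # route in the X dimension first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                    # then route in the Y dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


path = xy_route((0, 0), (2, 1))   # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

A packet never turns from Y back into X, so the turns that could close a cycle of channel dependencies are never taken. The price is that the route is fixed for each source/destination pair, so it cannot steer around congested channels.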


Routing: XY Routing


Virtual-Channel Router Design



(2006+) State-of-the-art routers use a single combined VC and switch allocator



• 4x4 mesh network
• Single cycle routers (incl. channel latency)
• Clock ~35 FO4
• Virtual-channel support
  – 4 virtual channels per input port
• Clocking: H-Tree or Distributed Clock Generator (DCG)

Lochside Chip: Mullins, West, Moore (2005)


Fair allocation of resources to flows

• A flow is a stream of packets between a particular source and destination

• Locally fair allocation (round-robin on input ports) does not balance flow throughputs globally

• Solutions: Arbitrate between flows, not virtual channels [Banerjee, NOCS'09] See also [Lee, ISCA'08]

Reproduced from [Lee ISCA'08]
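A small model shows why locally fair arbitration is not globally fair. Assume a chain of routers all forwarding to the same sink, each with two inputs (the upstream channel and a local source), every source saturated, and round-robin arbitration giving each input half of the output channel. The chain topology and the function name are assumptions for illustration; they are not taken from the cited papers.

```python
from fractions import Fraction


def chain_throughputs(num_sources: int):
    """Share of the final channel obtained by each source, farthest source first."""
    shares = [Fraction(1)]                # first router only sees its local source
    for _ in range(num_sources - 1):
        # Round-robin at the next router: upstream traffic as a whole gets
        # half the channel, the newly merged local source gets the other half.
        shares = [s / 2 for s in shares] + [Fraction(1, 2)]
    return shares


shares = chain_throughputs(4)   # 1/8, 1/8, 1/4, 1/2
```

The source nearest the sink receives four times the bandwidth of the farthest ones, even though every arbiter is perfectly fair between its own inputs. Arbitrating between flows rather than input ports is aimed at this imbalance.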


Many Networks

• Cost of replicating networks on-chip is much lower than off-chip. Is it advantageous to build multiple networks?
  – Simple way to provide decoupled/isolated network resources (e.g. to provide QoS guarantees or to help predict performance)
    • e.g. special-purpose low-latency synchronization networks
  – Able to carefully tailor each network for its specific purpose
    • e.g. use both static and dynamic networks
      – see MIT RAW/Tilera processor
    • Save power/energy this way? e.g. provide both low-power and high-performance networks (steer messages to one network or the other depending on how critical they are)
  – Increased serialization latency due to narrow channels
    • But improved energy efficiency?
  – Lower overall router area, but more control overhead
  – Have to partition network buffers


In-Network Optimisations

• The on-chip network can collect information (in a distributed manner) that in turn can be used to optimise system performance
  – A router can collect information from packets it routes
  – Routing of subsequent packets may be modified depending on contents of previous packets
    • This isn't just adaptive routing, the information collected isn't about the state of the network but rather about the network clients or system as a whole
  – e.g. “In-network cache coherence” [Eisley/Peh/Shang, MICRO'06]


Conclusions

• Brief introduction to on-chip network design
  – Read Dally/Towles and Duato/Yalamanchili/Ni for more details
  – Conferences: NOCS/HPCA/ISCA/MICRO...
  – Lots more to routing and deadlock issues in particular
    • See course wiki
• Very broad viable design-space, very active research area
  – Challenging power dissipation constraints
  – Interconnects scale poorly compared to transistors, the problem won't get easier!
