Interconnection Network



Page 1: Interconnection Network

SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong ([email protected])

Interconnection Network

Jinkyu Jeong ([email protected]), Computer Systems Laboratory

Sungkyunkwan University, http://csl.skku.edu

Page 2: Interconnection Network

Topics

• Taxonomy
• Metrics
• Topologies
• Characteristics
  – Cost
  – Performance

Page 3: Interconnection Network

Interconnection Networks

• Carry data between processors and to memory
• Components
  – Switches
  – Interfaces
  – Links (wires, fiber)
• Classifications
  – Static networks
    • Point-to-point communication links among processing nodes
    • A.k.a. direct networks
  – Dynamic networks
    • Built from switches and communication links
    • A.k.a. indirect networks

Page 4: Interconnection Network

Static and Dynamic Networks

A static (direct) network and a dynamic (indirect) network

Page 5: Interconnection Network

Dynamic Network Switches

• Map a fixed number of inputs to outputs
• Number of ports
  – Degree of the switch
• Switch cost
  – Grows as the square of the switch degree
  – Packaging cost grows linearly with the number of pins

Page 6: Interconnection Network

Network Interfaces

• Link processors (or nodes) to the interconnect
• Functions
  – Packetizing communication data
  – Computing routing information
  – Buffering incoming/outgoing data
• Network interface connection
  – I/O bus: Peripheral Component Interconnect Express (PCIe)
  – Memory bus: e.g., AMD HyperTransport, Intel QuickPath
• Network performance
  – Depends on the relative speeds of the I/O and memory buses

Page 7: Interconnection Network

Example: Intel QuickPath Interconnect

Page 8: Interconnection Network

Network Topologies

• A variety of network topologies exist
• Topologies trade off performance against cost
• Commercial machines often implement hybrids of multiple topologies
  – Due to packaging, cost, and available components

Page 9: Interconnection Network

Metrics for Interconnection Networks

• Degree
  – # of links per node
• Diameter
  – Longest distance between any two nodes in the network
  – Worst-case communication latency
• Bisection width
  – Minimum # of wire cuts needed to divide the network into two equal parts
• Cost
  – # of links and switches
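As a quick illustration (not from the slides), the diameter metric can be computed for a concrete topology by breadth-first search. The sketch below measures a p-node ring, whose diameter is floor(p/2); function names are illustrative.

```python
def ring_edges(p):
    # a ring: each node linked to its right neighbor, with a wrap-around link
    return [(i, (i + 1) % p) for i in range(p)]

def diameter(p, edges):
    # longest shortest-path distance over all node pairs (BFS from every node)
    adj = {u: [] for u in range(p)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    best = 0
    for src in range(p):
        dist = {src: 0}
        frontier = [src]
        while frontier:
            nxt = []
            for u in frontier:
                for v in adj[u]:
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        nxt.append(v)
            frontier = nxt
        best = max(best, max(dist.values()))
    return best
```

For example, an 8-node ring has diameter 4, and every node has degree 2.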

Page 10: Interconnection Network

Network Topologies: Buses

• All processors access a common bus to exchange data
• Used in the simplest and earliest parallel machines
  – Ex. Sun Enterprise servers, Intel Pentium
• Advantages
  – Distance between any two nodes is O(1)
  – Provides a convenient broadcast medium
• Disadvantage
  – Bus bandwidth is a performance bottleneck

Processors P, P, ..., P attached to a shared bus

Page 11: Interconnection Network

Network Topologies: Buses

Bus-based interconnects, without and with local caches. Example: the Intel Pentium Pro Quad, in which P-Pro modules (CPU, bus interface, 256-KB L2 cache, interrupt controller) share a P-Pro bus (64-bit data, 36-bit address, 66 MHz) with a memory controller/MIU driving 1-, 2-, or 4-way interleaved DRAM, and with PCI bridges connecting to PCI buses and I/O cards.

Page 12: Interconnection Network

Network Topologies: Crossbars

A crossbar network uses a p × m grid of switches to connect p inputs to m outputs in a non-blocking manner.

Page 13: Interconnection Network

Network Topologies: Crossbars

• Cost
  – O(p^2) for p processors (and memory banks)
• Difficult to scale for large values of p
• Ex. Sun Ultra HPC 10000 and the Fujitsu VPP500

Page 14: Interconnection Network

Multistage Networks

• Buses
  – Excellent cost scalability
  – Poor performance scalability
• Crossbars
  – Excellent performance scalability
  – Poor cost scalability
• Multistage interconnects
  – A compromise between these extremes

Page 15: Interconnection Network

Multistage Networks

The schematic of a typical multistage interconnection network.

Page 16: Interconnection Network

Multistage Omega Network

• Organization
  – log p stages
  – p inputs and outputs
• At each stage, input i is connected to output j by the perfect shuffle:
  – j = 2i, for 0 <= i <= p/2 - 1
  – j = 2i + 1 - p, for p/2 <= i <= p - 1
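This stage connection (the perfect shuffle, described on the next slide) can be sketched in Python, assuming p is a power of two; the function name is illustrative:

```python
def perfect_shuffle(i, p):
    # output j for input i: j = 2i for i < p/2, j = 2i + 1 - p otherwise,
    # which is a one-bit left rotation of i's log2(p)-bit label
    n = p.bit_length() - 1            # number of address bits
    return ((i << 1) | (i >> (n - 1))) & (p - 1)
```

For p = 8 this maps inputs 0..7 to outputs 0, 2, 4, 6, 1, 3, 5, 7.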

Page 17: Interconnection Network

Multistage Omega Network

Each stage of the Omega network implements a perfect shuffle as follows:

A perfect shuffle interconnection for eight inputs and outputs

Page 18: Interconnection Network

Multistage Omega Network

• The perfect shuffle patterns are connected using 2 × 2 switches
• The switches operate in two modes
  – Pass-through
  – Cross-over

Page 19: Interconnection Network

Multistage Omega Network

A complete omega network connecting eight inputs and eight outputs.

Cost: (p/2) × log p switches → O(p log p)

Page 20: Interconnection Network

Multistage Omega Network – Routing

• s is the source processor in binary representation
• d is the destination processor in binary representation
• At each stage
  – If the most significant bits of s and d are the same
    • Pass-through
  – Otherwise
    • Cross-over
  – Strip the most significant bit from both s and d
• Repeat for each of the log p switching stages
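The routing rule above can be sketched as follows (illustrative names; s and d are node numbers, stages = log2 p):

```python
def omega_route(s, d, stages):
    # per-stage switch settings for a message from source s to destination d
    settings = []
    for k in range(stages):
        bit = stages - 1 - k          # compare the most significant bit first
        if (s >> bit) & 1 == (d >> bit) & 1:
            settings.append("pass-through")
        else:
            settings.append("cross-over")
    return settings
```

omega_route(0b001, 0b100, 3) yields cross-over, pass-through, cross-over, matching the worked example on the next slide.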

Page 21: Interconnection Network

Multistage Omega Network – Routing

• Example: routing from 001 to 100
  1. Stage 1: 0 != 1 → cross-over
  2. Stage 2: 0 == 0 → pass-through
  3. Stage 3: 1 != 0 → cross-over

Page 22: Interconnection Network

Blocking in Omega Network

One of the messages (010 to 111 or 110 to 100) is blocked at link AB
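One way to see why such conflicts arise is to compute the wire each message occupies after every stage: under the bit-comparison routing above, after stage k a message sits on the wire labeled by its remaining low source bits followed by its first k destination bits. This is a derived sketch, not from the slides; names are illustrative.

```python
def omega_wire(s, d, k, n):
    # wire occupied after stage k (1-based) for a message s -> d in an
    # omega network with n = log2(p) stages: label = (low n-k bits of s)
    # followed by (top k bits of d)
    low = s & ((1 << (n - k)) - 1)
    return (low << k) | (d >> (n - k))

def conflict(m1, m2, n):
    # True if two (source, destination) messages contend for the same wire
    return any(omega_wire(*m1, k, n) == omega_wire(*m2, k, n)
               for k in range(1, n))
```

For the slide's example, conflict((0b010, 0b111), (0b110, 0b100), 3) is True: both messages need the same wire after the first stage.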

Page 23: Interconnection Network

Completely Connected Network

• Each processor is connected to every other processor
• Cost
  – # of links is O(p^2)
• Performance scales very well
• Hardware complexity is not realizable for large values of p
• The static counterpart of crossbars

Page 24: Interconnection Network

Star Connected Network

• Every node is connected only to a common node at the center
• Distance between any pair of nodes is O(1)
  – But the central node becomes a bottleneck
• The static counterpart of buses

Page 25: Interconnection Network

Linear Array & Ring

• Linear array
  – Each node has two neighbors, one to its left and one to its right
• Ring (or 1-D torus)
  – A linear array whose end nodes are connected by a wrap-around link

Page 26: Interconnection Network

Meshes and k-dimensional Meshes

• Mesh
  – Generalization of the linear array to 2D
  – 4 neighbors (north, south, east, and west)
• k-dimensional mesh
  – 2k neighbors per node

2D mesh, 2D torus, and 3D mesh
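The neighbor rule for a 2-D torus can be sketched directly (an illustrative helper; dropping the modulo wrap-around turns it into a plain mesh with boundary checks):

```python
def torus_neighbors(node, rows, cols):
    # the 4 neighbors (north, south, west, east), with wrap-around links
    x, y = node
    return [((x - 1) % rows, y), ((x + 1) % rows, y),
            (x, (y - 1) % cols), (x, (y + 1) % cols)]
```

A corner node such as (0, 0) on a 4 × 4 torus still has 4 neighbors, two of them reached via wrap-around links.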

Page 27: Interconnection Network

Hypercubes

0D, 1D, 2D, 3D, and 4D hypercubes

Page 28: Interconnection Network

Hypercubes

• Distance between any two nodes is at most log p
• Each node has log p neighbors
• Distance between two nodes
  – # of bit positions at which their labels differ
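Because node labels differ in exactly the bit positions a message must traverse, the distance is the Hamming distance of the labels; a one-line sketch:

```python
def hypercube_distance(a, b):
    # number of bit positions at which labels a and b differ (Hamming distance)
    return bin(a ^ b).count("1")
```

E.g., nodes 000 and 111 of a 3-D hypercube are 3 hops apart.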

Page 29: Interconnection Network

Tree-Based Networks

A static tree network and a dynamic tree network

Page 30: Interconnection Network

Tree-Based Networks

• Distance between any two nodes is at most 2 log p
• Easy to lay out as planar graphs
  – E.g., H-trees
• The root can become a bottleneck
  – Links closer to the root carry more traffic than those at lower levels
• Solution: fat tree
  – Fattens the links as we go up the tree

Page 31: Interconnection Network

Fat Tree

A fat tree network of 16 processing nodes.

Page 32: Interconnection Network

Evaluating Interconnection Networks

• Diameter
  – Longest distance between any two nodes
  – Measures the worst-case latency of a communication
• Bisection width
  – Minimum # of wire cuts needed to divide the network into two equal parts
  – Measures the # of possible concurrent communications
• Cost
  – # of links or switches
  – Ability to lay out the network
  – Length of wires

A network supporting two concurrent communications vs. one supporting four

Page 33: Interconnection Network

Static Interconnection Networks

Network                  | Diameter           | Bisection Width | Cost (# of links)
Completely-connected     | 1                  | p^2/4           | p(p-1)/2
Star                     | 2                  | 1               | p-1
Complete binary tree     | 2 log((p+1)/2)     | 1               | p-1
Linear array             | p-1                | 1               | p-1
2-D mesh, no wraparound  | 2(sqrt(p)-1)       | sqrt(p)         | 2(p-sqrt(p))
2-D wraparound mesh      | 2*floor(sqrt(p)/2) | 2*sqrt(p)       | 2p
Hypercube                | log p              | p/2             | (p log p)/2
Wraparound k-ary d-cube  | d*floor(k/2)       | 2k^(d-1)        | dp
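The hypercube row can be spot-checked for a small p by building the graph and measuring it directly (a sketch; for p = 8 the table predicts diameter log p = 3 and (p log p)/2 = 12 links):

```python
def hypercube_edges(p):
    # nodes 0 .. p-1; one link per pair of labels differing in a single bit
    n = p.bit_length() - 1
    return [(u, u | (1 << b)) for u in range(p) for b in range(n)
            if not u & (1 << b)]

def bfs_diameter(p, edges):
    # longest shortest-path distance over all node pairs
    adj = {u: [] for u in range(p)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    best = 0
    for src in range(p):
        dist = {src: 0}
        frontier = [src]
        while frontier:
            nxt = []
            for u in frontier:
                for v in adj[u]:
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        nxt.append(v)
            frontier = nxt
        best = max(best, max(dist.values()))
    return best
```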

Page 34: Interconnection Network

Dynamic Interconnection Networks

Network       | Diameter | Bisection Width | Cost (# of switches)
Crossbar      | 1        | p               | p^2
Omega Network | log p    | p/2             | (p/2) log p
Dynamic Tree  | 2 log p  | 1               | p-1

Page 35: Interconnection Network

Summary

• Interconnection network
  – Performance (latency, bandwidth), cost (# links, # switches)
  – Used to be important, then became less important
  – Likely to be important again for multi-core processors
• Topologies
  – Low-dimension networks
    • Bus, ring, mesh, torus – embed naturally into 2D/3D
    • Direct networks (nodes are connected directly)
  – Logarithmic networks (multi-stage networks)
    • More switches between nodes (nodes are connected indirectly)
  – High-dimension networks
    • Hypercube (binary n-cube) – theoretically good characteristics
    • But the degree of each node grows with the network size (log p) – impractical in the real world

Page 36: Interconnection Network

References

• Chapters 2.4.2–2.4.4 in "Introduction to Parallel Computing" by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, Addison Wesley, 2003
• COMP422: Parallel Computing by Prof. John Mellor-Crummey at Rice Univ.