Evolution of Interconnects for Supercomputing: transcript of a 25-slide presentation

Page 1

A Finmeccanica Company · April 2003 © Quadrics Ltd.

Evolution of Interconnects for Supercomputing

Moray McLaren, Quadrics Ltd.

Page 2

EtherNet and EtherNot!
Will standard interconnects solve all our problems?

Whatever the volume interconnect of the future is, it will be called Ethernet.

Incorporate ideas from specialised low-latency interconnects into Ethernet?
• RDMA is a start
• Common DDI with high-performance NICs?
• Price advantage not so clear for equivalent bandwidth

Successful EtherNot technologies need clear performance advantages that deliver in applications.

Page 3

What’s special about Supercomputing?

You push the extremes of scale
• Seamless switch scaling
• Global operations
• Fault tolerance

You value your compute cycles
• Compute-to-communications ratio (sketched below)
• Ultra-low latency
• Overhead
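A hedged illustration of the compute-to-communications ratio bullet, using the standard 3-D domain-decomposition argument (the cubic-subdomain model is my assumption, not from the slides):

```latex
% For a cubic subdomain of side n on each node, compute scales with
% the volume and halo exchange with the surface area, so
\frac{T_{\mathrm{compute}}}{T_{\mathrm{comms}}}
  \propto \frac{n^{3}}{6\,n^{2}} = \frac{n}{6}
% The ratio shrinks as subdomains get smaller, which is why ultra-low
% latency and low overhead matter when scaling a fixed-size problem out.
```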

Page 4

Historical scaling…

              Elan – 1990    Elan 3 – 1998    Elan 4 – 2003
Put latency   9 μs           2 μs             2 μs
MPI latency   78 μs          5 μs             3 μs
Bandwidth     44 Mbytes/s    320 Mbytes/s     900 Mbytes/s
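These figures fit the classic two-parameter Hockney cost model; a quick worked example with the Elan 4 numbers above (the model is standard, not from the slides):

```latex
% Transfer time for an n-byte message with start-up cost t_0 and
% asymptotic bandwidth B, and the half-power point n_{1/2} at which
% half of peak bandwidth is achieved:
t(n) = t_0 + \frac{n}{B}, \qquad n_{1/2} = t_0 B
% Elan 4: n_{1/2} \approx 3\,\mu\mathrm{s} \times 900\,\mathrm{Mbytes/s}
%         \approx 2.7\,\mathrm{kbytes}
```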

Page 5

Elan-3 MPI Latency Breakdown

[Chart: stacked latency breakdown, 0–6 μs, for dping, tping, and mpi tests; components: MPI, Thread, Main, Elan, PCI, Switch, Cable]

Page 6

MPI short message latency

[Chart: MPI short-message latency in microseconds (0–40) vs. message size (0–8192 bytes)]

QsNet II (Elan 4) – a leading EtherNot technology
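Curves like this are normally produced with a ping-pong microbenchmark; a minimal sketch in C with MPI (message sizes and repetition count are illustrative):

```c
/* Minimal MPI ping-pong sketch; run with exactly two ranks.
 * Reports half the round-trip time, the usual definition of
 * short-message latency in plots like the one above. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    enum { REPS = 1000, MAXBYTES = 8192 };
    char buf[MAXBYTES];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 0, sizeof buf);

    /* Sweep 0, 4, 8, ... 8192 bytes, as in the plot's x axis. */
    for (int bytes = 0; bytes <= MAXBYTES; bytes = bytes ? bytes * 2 : 4) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double usec = (MPI_Wtime() - t0) * 1e6 / (2.0 * REPS);
        if (rank == 0)
            printf("%5d bytes: %.2f us\n", bytes, usec);
    }
    MPI_Finalize();
    return 0;
}
```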

Page 7

MPI Bandwidth – Elan 4

[Chart: MPI bandwidth in Mbytes/s (0–1000) vs. message size (64 bytes – 1 Mbyte) for QsNet II]
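The bandwidth curve is typically measured with a unidirectional streaming test that keeps several messages in flight; a minimal MPI sketch, assuming exactly two ranks (window and message sizes are illustrative):

```c
/* Unidirectional streaming bandwidth sketch: rank 0 posts a window
 * of non-blocking sends so several messages are in flight at once,
 * which is what lets the NIC approach its peak rate. Two ranks only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    enum { WINDOW = 16, REPS = 64 };
    const int bytes = 1 << 20;               /* 1 Mbyte messages */
    char *buf = malloc(bytes);
    MPI_Request req[WINDOW];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int r = 0; r < REPS; r++) {
        for (int w = 0; w < WINDOW; w++) {
            if (rank == 0)
                MPI_Isend(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &req[w]);
            else
                MPI_Irecv(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &req[w]);
        }
        MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
    }
    double secs = MPI_Wtime() - t0;
    if (rank == 0)
        printf("%.0f Mbytes/s\n",
               (double)bytes * WINDOW * REPS / secs / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```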

Page 8

Bye bye MPI?

Problems with MPI
• High overhead for very short messages
• Tag-matching overhead
• MPI ordering rules imply a single point of ordering for each node

The alternative: a lower-level remote read, remote write API (sketched below)
• Drawing on libElan, ShMem
• Support for many outstanding transactions
• A target for compiler writers and library developers
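A sketch of what such a lower-level API might look like; every name here is hypothetical, chosen only to illustrate one-sided put/get with explicit completion handles so many transactions can stay outstanding:

```c
/* Hypothetical lower-level API in the spirit of libElan / ShMem:
 * one-sided put/get plus an explicit completion handle. These are
 * illustrative declarations, not an actual Quadrics interface. */
#include <stddef.h>

typedef struct rw_handle rw_handle_t;   /* opaque completion handle */

rw_handle_t *rw_put(int node, void *remote, const void *local, size_t n);
rw_handle_t *rw_get(int node, void *local, const void *remote, size_t n);
int          rw_test(rw_handle_t *h);   /* non-blocking completion check */
void         rw_wait(rw_handle_t *h);   /* block until complete */

/* Example: pipeline many small writes (k <= 64 here) without waiting
 * on each one; ordering is enforced only at the final wait. */
void scatter_updates(int node, double *remote, const double *local, int k)
{
    rw_handle_t *h[64];
    for (int i = 0; i < k; i++)
        h[i] = rw_put(node, remote + i, local + i, sizeof(double));
    for (int i = 0; i < k; i++)
        rw_wait(h[i]);
}
```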

Page 9

Worst-case latency, 8-byte put (estimates)

[Chart: latency in nsec (0–3000) for elan3, elan4 (est), and elan4 direct; components: elite, elan, local io, cable]

Low-level hardware latency

Page 10

Getting closer to the CPU…

Where?
• HyperTransport, proprietary IO port

What's the win?
• Avoid bus-bridge latency
• Lower cache refill overhead – maybe?
• Simpler interface
• Smaller transfers for peak performance

Issues – primarily commercial
• Where's the connector?
• Fragmentation of silicon volumes

Page 11

The way forward on latency

Basic hardware latency
• Many factors are reaching practical limits
• Closer integration with the CPU removes some delays
• Pipelining to support multiple outstanding short messages

Real application latency
• MPI is well understood
• A lower-level API is needed for compilers etc.
• What's the API for kernel messaging? A reliable, ordered datagram (sketched below)
• Several alternatives: Portals, VIA constructs…
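One possible shape for that kernel-messaging interface, loosely in the spirit of Portals/VIA descriptors; the names and layout are assumptions for illustration, not an existing API:

```c
/* Hypothetical reliable, ordered datagram interface for kernel
 * messaging. Reliability and ordering are provided per channel by
 * the NIC (retry, reorder), keeping the host-side path short. */
#include <stddef.h>
#include <stdint.h>

struct kmsg {
    uint32_t src_node;     /* filled in by the NIC on receive        */
    uint32_t channel;      /* ordering is per (src_node, channel)    */
    uint32_t len;          /* payload bytes actually delivered       */
    uint8_t  payload[256]; /* small fixed MTU suits kernel messages  */
};

/* Both calls are illustrative: send queues a datagram for reliable,
 * in-order delivery; recv dequeues the next one, optionally polling. */
int kmsg_send(uint32_t dst_node, uint32_t channel,
              const void *buf, size_t len);
int kmsg_recv(struct kmsg *out, int nonblock);
```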

Page 12

Bandwidth going forward

Limited by where you can connect
• Double- and quad-clock PCI-X
• PCI Express
• Direct connections

Large-scale multi-rail systems with large SMPs
• NUMA challenges
• Separate rails or one big switch? (striping sketch below)
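A minimal sketch of the separate-rails option: stripe one large transfer across two rails so the aggregate bandwidth approaches the sum of both. rail_put() and rail_wait() are hypothetical per-rail primitives, not a real API:

```c
/* Hedged multirail striping sketch: split one large put across two
 * rails and wait for both halves to complete. */
#include <stddef.h>

void rail_put(int rail, int node, void *remote,
              const void *local, size_t n);      /* hypothetical */
void rail_wait(int rail);                        /* hypothetical */

void multirail_put(int node, char *remote, const char *local, size_t n)
{
    size_t half = n / 2;
    rail_put(0, node, remote, local, half);                    /* rail 0 */
    rail_put(1, node, remote + half, local + half, n - half);  /* rail 1 */
    rail_wait(0);
    rail_wait(1);
}
```

On a large SMP, a real implementation would more likely pick the rail by the NUMA locality of the buffer rather than split every transfer evenly.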

Page 13

QsNet (Elan 3) Multirail Performance

Page 14

Switch scaling

             Elite – 1990   Elite 3 – 1998   Elite 4 – 2003
Clock        70 MHz         400 MHz          1.3 GHz
Bandwidth    44 Mbytes/s    325 Mbytes/s     900 Mbytes/s
Scale        256 nodes      ~2K nodes        ~4K nodes

Page 15

Topology

Fat trees have been very successful
• Good structure for fault tolerance
• Fairly uniform connectivity
• Good for global operations
• Quite challenging for systems integration

Packaging issues will dictate topology

Page 16

Global operations

Why do they matter?
• Improve application scaling on very large systems
• Highly scalable single-system-image functions (e.g. cluster membership)

What works now
• Range-selected broadcast, barrier in the network
• Integer collectives handled by the IO processor in the NIC

Future
• Floating-point collectives (probably more appropriate in the NIC than in the switch; see the note below)
• Alternative broadcast constructs
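One reason floating-point collectives sit more naturally in the NIC than in the switch: FP addition is not associative, so the combining order changes the answer, and a programmable NIC processor can impose a fixed, reproducible order where a hardware combining tree in the switch cannot easily. A small C demonstration:

```c
/* Floating-point addition is not associative: the two groupings
 * below give different results, so a reduction's answer depends on
 * the order in which the network combines contributions. */
#include <stdio.h>

int main(void)
{
    float a = 1e8f, b = -1e8f, c = 1.0f;
    printf("(a+b)+c = %f\n", (a + b) + c);  /* prints 1.000000 */
    printf("a+(b+c) = %f\n", a + (b + c));  /* prints 0.000000:
                                               c is absorbed into b */
    return 0;
}
```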

Page 17

Barrier Scaling (QsNet)

[Chart: barrier time in microseconds (0–8) vs. number of nodes (0–1024)]

Data Courtesy of Lawrence Livermore National Lab
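Measurements like these are usually taken by timing a batch of barriers; a minimal MPI sketch (repetition count is illustrative):

```c
/* Minimal barrier-scaling measurement: time a batch of barriers
 * and report the mean per-barrier cost at the current node count. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    enum { REPS = 10000 };
    int rank, nodes;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nodes);

    MPI_Barrier(MPI_COMM_WORLD);         /* warm up and synchronise */
    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double usec = (MPI_Wtime() - t0) * 1e6 / REPS;

    if (rank == 0)
        printf("%d nodes: %.1f us per barrier\n", nodes, usec);
    MPI_Finalize();
    return 0;
}
```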

Page 18

Fault tolerance and availability

Typical current features for RAS
• Use a topology with lots of alternate routes
• Dual redundant PSUs
• Tolerant to single fan failure
• Dual redundant control cards

Page 19

Development for fault tolerance & reliability

• Pairwise broadcast for hardware mirroring
• Rail failover in multirail systems
• The best way to improve interconnect reliability is to minimise connectors

Page 20

Physical packaging considerations

Page 21

QsNetII Physical Link

1.333 GHz design speed
• 4b5b coding for DC balance
• ~900 Mbytes/s after protocol

Copper
• 10-bit LVDS – 40 wires in total
• 10–12 m range

Optics
• 12-bit parallel optical fibre
• 100 m range

Page 22

Future link technologies

Still copper on the backplane, for cost and reliability
• Careful design gets up to 5 Gbit/s per wire for moderate runs; more with clever equalisation
• Maximum length decreases as speed increases
• Improved packaging technology to reduce connection lengths and pack more into the copper zone

Rack to rack, all fibre
• Future generations of parallel fibre
• 12 × 5 Gbit/s ≈ 6 Gbytes/s (arithmetic below)
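The arithmetic behind the 6 Gbytes/s figure, assuming the same 4b5b coding as the current QsNetII link (the 80% coding efficiency comes from that assumption):

```latex
% 12 parallel fibres at 5 Gbit/s each give the raw line rate; 4b5b
% coding carries 4 data bits in every 5 line bits (80% efficiency):
12 \times 5\,\mathrm{Gbit/s} = 60\,\mathrm{Gbit/s}, \qquad
60\,\mathrm{Gbit/s} \times 0.8 = 48\,\mathrm{Gbit/s} = 6\,\mathrm{Gbytes/s}
```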

Page 23

Optical switching?

Optical technology has been driven by telecomms requirements
• Long haul, not short haul
• Circuit switched, not packet switched
• They're not buying anything!

Combining logic, switching and buffering is easy for silicon
• Silicon switches – a distributed arbiter which delivers data as a side effect

Page 24

Optical switching?

Optical crossbars or WDM-based

Advantages
• Large crossbars possible
• Very low latency for established connections
• Integration with optical fibre

But…
• New component technologies
• Optical switches generally have a separate control plane
• Difficult to build self-routing packet switches
• Implies a switch architecture with centralised control

What sort of machines can we make with these?
