Transcript of ScicomP 11 Charles Grassl IBM May, 2005

Page 1: ScicomP 11 Charles Grassl IBM May, 2005

© 2005 IBM

HPS

ScicomP 11
Charles Grassl
IBM
May, 2005

Page 2: ScicomP 11 Charles Grassl IBM May, 2005

Agenda

• Evolution
• Technology
• Performance characteristics
  • RDMA
  • Collectives

Page 3: ScicomP 11 Charles Grassl IBM May, 2005

Evolution

• High Performance Switch (HPS)
  • Also known as "Federation"
• Follow-on to SP Switch2
  • Also known as "Colony"

Generation              Processors
Switch                  POWER2
SP Switch               POWER2 → POWER3
SP Switch 2 (Colony)    POWER3 → POWER4
HPS (Federation)        POWER4 → POWER5

Page 4: ScicomP 11 Charles Grassl IBM May, 2005

Technology

• Internal network
  • In lieu of, e.g., Gigabit Ethernet
• Multiple links per node
  • Match the number of links to the number of processors

                SP Switch2                    HPS
Latency:        15 microsec.                  5 microsec.
Bandwidth:      500 Mbyte/s                   2 Gbyte/s
Configuration:  1 adapter per logical node    1 adapter, 2 links, per MCM;
                                              16 Gbyte/s per 32-processor node

Page 5: ScicomP 11 Charles Grassl IBM May, 2005

HPS Packaging

• 4U, 24-inch drawer
• 16 ports for server-to-switch connections
• 16 ports for switch-to-switch connections
• Host attachment directly to the server bus via a 2-link or 4-link Switch Network Interface (SNI)
  • Up to two links per pSeries 655
  • Up to eight links per pSeries 690

Page 6: ScicomP 11 Charles Grassl IBM May, 2005

HPS Switch Configuration

• Link drivers
  • Copper driver
  • Fiber optics driver

[Diagram: HPS switch board (HPS switch chips) linked to HPS adapters in POWER4 and POWER5 servers; each adapter comprises LDC and Canopus chips with RAM and attaches to the server's GX bus.]

Page 7: ScicomP 11 Charles Grassl IBM May, 2005

HPS Software

• MPI-LAPI (PE V4.1)
  • Uses LAPI as the reliable transport
  • Library uses threads, not signals, for asynchronous activities
• Existing applications are binary compatible
• New performance characteristics
  • Eager
  • Bulk Transfer
• Improved collective communication

Page 8: ScicomP 11 Charles Grassl IBM May, 2005

HPS Software Architecture

[Diagram: HPS software stack over the SMA3+ adapter, spanning user space and kernel space: Application, ESSL, PESSL, MPI, LAPI, IP, the device driver (DD), sockets with TCP and UDP, IF_LS, HAL, VSD, GPFS, the hypervisor (HYP), plus LoadLeveler and CSM.]

Page 9: ScicomP 11 Charles Grassl IBM May, 2005

Supported Communication Modes

• FIFO Mode
  • Message chopped into 2K packet chunks on the host and copied by the CPU
  • Memory bus crossings depend on caching; at least one I/O bus crossing
• Remote Direct Memory Access (RDMA)
  • No slave-side protocol
  • CPU offload
  • Enhanced programming model
  • One I/O bus crossing

[Diagram: in FIFO mode the CPU moves data between the user buffer and the network FIFO with loads/stores and the adapter transfers packets by DMA; with RDMA the adapter moves data directly to and from the user buffer.]

Page 10: ScicomP 11 Charles Grassl IBM May, 2005

Underlying Message Procedures

• Protocols:
  • Rendezvous
    • "Large" messages
  • Eager
    • "Small" messages
  • MP_EAGER_LIMIT
    • Range: 0 - 65536
• Mechanisms:
  • Packet
  • Bulk
  • MP_BULK_MIN_MSG_SIZE
    • Range: any non-negative integer
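The crossover points controlled by MP_EAGER_LIMIT and MP_BULK_MIN_MSG_SIZE are easiest to see with a simple timing run. Below is a minimal ping-pong sketch, not taken from the original presentation; message lengths and repetition counts are arbitrary. Launch it on two tasks under poe with different settings of these environment variables to expose the protocol and mechanism transitions.

    /* Minimal MPI ping-pong timing sketch (illustrative).  Vary
     * MP_EAGER_LIMIT and MP_BULK_MIN_MSG_SIZE in the poe environment to
     * observe the eager/rendezvous and packet/bulk transitions. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, len, i;
        const int reps = 100;
        double t0, t;
        char *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (len = 1; len <= 2000000; len *= 2) {
            buf = calloc(len, 1);
            MPI_Barrier(MPI_COMM_WORLD);
            t0 = MPI_Wtime();
            for (i = 0; i < reps; i++) {
                if (rank == 0) {
                    MPI_Send(buf, len, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, len, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else if (rank == 1) {
                    MPI_Recv(buf, len, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(buf, len, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
                }
            }
            t = (MPI_Wtime() - t0) / (2.0 * reps);   /* one-way time */
            if (rank == 0)
                printf("%9d bytes  %10.2f usec  %8.1f Mbyte/s\n",
                       len, t * 1e6, len / t / 1e6);
            free(buf);
        }
        MPI_Finalize();
        return 0;
    }

For example, running the same binary with MP_EAGER_LIMIT=0 and then MP_EAGER_LIMIT=4096 reproduces the comparison shown on the Eager Limit slides below.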

Page 11: ScicomP 11 Charles Grassl IBM May, 2005

MPI Transfer Protocols

[Diagram: Eager protocol (small messages): the sender ships the header and message data to the receiver immediately. Rendezvous protocol (large messages): the sender first sends the header, and the message data follows separately.]

Page 12: ScicomP 11 Charles Grassl IBM May, 2005

MPI Transfer Mechanisms

[Diagram: small messages travel through the switch as individual packets; large messages travel through the switch as a single bulk transfer.]

Page 13: ScicomP 11 Charles Grassl IBM May, 2005

Remote Direct Memory Access (RDMA)

• Overlap of computation and communication (possible)
• Fragmentation and reassembly offloaded to the adapter
• Minimized packet-arrival interrupts
• Asynchronous messaging applications
  • All tasks sharing an adapter are not communicating at the same time
• One-sided programming model
• Zero-copy transport
  • Reduced memory subsystem load
• Striping of very large messages
• Implications for interference with other tasks when copying
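One way to realize the overlap of computation and communication listed above is to post nonblocking transfers, compute, and then wait. The sketch below is illustrative rather than taken from the presentation: do_local_work() and the message length are placeholders, an even number of tasks is assumed so every task has a partner, and bulk transfer (MP_USE_BULK_XFER=yes) is assumed so the adapter, not the CPU, moves the data.

    /* Overlapping computation with a large nonblocking exchange.  With
     * bulk transfer / RDMA enabled, the adapter handles fragmentation,
     * reassembly and data movement while the CPU runs do_local_work(). */
    #include <mpi.h>
    #include <stdlib.h>

    #define N (1 << 20)                  /* 1M doubles, ~8 Mbyte per message */

    static void do_local_work(double *x, int n)
    {
        int i;
        for (i = 0; i < n; i++)          /* placeholder computation */
            x[i] = x[i] * 1.0001 + 1.0;
    }

    int main(int argc, char **argv)
    {
        int rank, peer;
        double *sendbuf = calloc(N, sizeof(double));
        double *recvbuf = calloc(N, sizeof(double));
        double *work    = calloc(N, sizeof(double));
        MPI_Request req[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        peer = rank ^ 1;                 /* pair tasks 0-1, 2-3, ... */

        /* Post the transfers first ... */
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[1]);

        /* ... then compute while the adapter moves the data. */
        do_local_work(work, N);

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

        free(sendbuf); free(recvbuf); free(work);
        MPI_Finalize();
        return 0;
    }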

Page 14: ScicomP 11 Charles Grassl IBM May, 2005

New MPI Performance Models

• Possible striping of a single message
  • Bandwidth ~ n * 2 Gbyte/s for n links
• Small dependence on user large pages
• Collectives
• Eager limit
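A rough way to read the measurements on the following slides is a first-order time model, t = latency + length / bandwidth, where the bandwidth term grows to n * 2 Gbyte/s once a message is striped across n links. The sketch below encodes that model; the 5 microsecond latency and 2 Gbyte/s per-link figures come from the Technology slide, while the striping threshold is an illustrative assumption (the measurement slides suggest roughly 500000 bytes).

    /* First-order transfer-time model: t = latency + length / bandwidth,
     * with bandwidth = n_links * 2 Gbyte/s once the message is striped.
     * STRIPE_THRESHOLD is an assumption for illustration only. */
    #include <stdio.h>

    #define LATENCY_SEC       5.0e-6     /* ~5 microseconds     */
    #define LINK_BW           2.0e9      /* 2 Gbyte/s per link  */
    #define STRIPE_THRESHOLD  500000     /* bytes               */

    static double model_time(double bytes, int n_links)
    {
        double bw = (bytes >= STRIPE_THRESHOLD) ? n_links * LINK_BW : LINK_BW;
        return LATENCY_SEC + bytes / bw;
    }

    int main(void)
    {
        double lengths[] = { 1e3, 1e5, 5e5, 1e6, 2e6 };
        int i;

        for (i = 0; i < 5; i++) {
            double t = model_time(lengths[i], 2);   /* 2 links per adapter */
            printf("%10.0f bytes: %8.1f usec, %8.1f Mbyte/s\n",
                   lengths[i], t * 1e6, lengths[i] / t / 1e6);
        }
        return 0;
    }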

Page 15: ScicomP 11 Charles Grassl IBM May, 2005

MPI Single Messages: Large Page vs. Small Page

[Plot: Rate (Mbyte/s) vs. Length (bytes) for large pages (LP) and small pages (SP); p655 1.5 GHz, HPS, RDMA.]

Page 16: ScicomP 11 Charles Grassl IBM May, 2005

MPI Single Messages: LP vs. SP

[Plot: Time (microseconds) vs. Length (bytes) for large pages (LP) and small pages (SP).]

Page 17: ScicomP 11 Charles Grassl IBM May, 2005

MPI Single Messages

[Plot: Rate (Mbyte/s) vs. Length (bytes) for single messages.]

• Message striping at ~500000 bytes

• Bandwidth:
  • 1.5 Gbyte/s at "modest" sizes
  • 3 Gbyte/s at large sizes

• Small sensitivity to page size

Page 18: ScicomP 11 Charles Grassl IBM May, 2005

MPI Single Messages: Eager Limits

[Plot: Time (microseconds) vs. Length (bytes) with MP_EAGER_LIMIT set to 4096 and to 0.]

Page 19: ScicomP 11 Charles Grassl IBM May, 2005

Eager Limit

[Plot: Time (microseconds) vs. Length (bytes) with MP_EAGER_LIMIT set to 4096 and to 0.]

• Reduces latency from 20 microseconds to 7 microseconds

Page 20: ScicomP 11 Charles Grassl IBM May, 2005

MPI Single Messages: RDMA vs. no RDMA

[Plot: Rate (Mbyte/s) vs. Length (bytes) with and without RDMA; p655 1.5 GHz, HPS.]

Page 21: ScicomP 11 Charles Grassl IBM May, 2005

RDMA

[Plot: Rate (Mbyte/s) vs. Length (bytes) with RDMA.]

• Message striping starts at 500000 bytes
• Adjust with MP_BULK_MIN_MSG_SIZE

Page 22: ScicomP 11 Charles Grassl IBM May, 2005

MPI Collectives

• Take special advantage of 64-bit addressing
  • More aggressive algorithms
• Example:
  • MPI_Bcast, 32 tasks: 25% faster with 64-bit addressing
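The broadcast comparisons on the following slides come from timing loops of this general shape. The sketch below is illustrative, not the original benchmark; buffer sizes and repetition counts are arbitrary, and building it in 32-bit and then 64-bit mode (e.g. -q32 / -q64 with the IBM compilers) gives the two curves being compared.

    /* Minimal MPI_Bcast timing sketch (illustrative). */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, len, i;
        const int reps = 50;
        char *buf = calloc(750000, 1);
        double t0, t;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (len = 1000; len <= 750000; len *= 2) {
            MPI_Barrier(MPI_COMM_WORLD);
            t0 = MPI_Wtime();
            for (i = 0; i < reps; i++)
                MPI_Bcast(buf, len, MPI_BYTE, 0, MPI_COMM_WORLD);
            t = (MPI_Wtime() - t0) / reps;
            if (rank == 0)
                printf("%8d bytes: %8.1f microseconds\n", len, t * 1e6);
        }
        free(buf);
        MPI_Finalize();
        return 0;
    }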

Page 23: ScicomP 11 Charles Grassl IBM May, 2005

MPI_Bcast: 32-bit vs. 64-bit

64 and 32 Task MPI_Bcast

[Plot: Time (microseconds) vs. Length (bytes) for 64-task and 32-task MPI_Bcast, 64-bit vs. 32-bit; p655 1.5 GHz, HPS, RDMA.]

Page 24: ScicomP 11 Charles Grassl IBM May, 2005

MPI_Bcast: 32-bit vs. 64-bit

32 Tasks MPI_Bcast

[Plot: Time (microseconds) vs. Length (bytes), 64-bit vs. 32-bit; p655 1.5 GHz, HPS, RDMA.]

Page 25: ScicomP 11 Charles Grassl IBM May, 2005

MPI_Bcast: 32-bit vs. 64-bit

64 Tasks MPI_Bcast

[Plot: Time (microseconds) vs. Length (bytes), 64-bit vs. 32-bit; p655 1.5 GHz, HPS, RDMA.]

Page 26: ScicomP 11 Charles Grassl IBM May, 2005

Application Considerations

• MPI-LAPI has a different architecture than the prior version
• Bulk transfer:
  • Used for larger messages (>500K)
• Set MP_SINGLE_THREAD=yes (if possible)
• 32-bit applications:
  • Will not use LAPI shared memory for large messages: convert to 64-bit
  • Will not use MPI collective communication optimizations: convert to 64-bit

• Applications that use signal handlers may require some changes

• MPL is no longer supported

Page 27: ScicomP 11 Charles Grassl IBM May, 2005

MPI Environment Variables

Environment Variable    Recommended Value
MP_EUILIB               us
MP_EUIDEVICE            sn_single, sn_all
MP_SHARED_MEMORY        yes
MP_USE_BULK_XFER        yes
MP_SINGLE_THREAD        yes*
MP_BULK_MIN_MSG_SIZE    128 kbyte (default)

* If possible

Page 28: ScicomP 11 Charles Grassl IBM May, 2005

Bandwidth Structure: Performance Aspects

• Shared memory
  • 3 Gbyte/s POWER4
  • 4 Gbyte/s POWER5
• Large pages
  • Reduced effect
• Bulk Transfer
  • 3 Gbyte/s for two adapters
• Eager Limit
  • 15-20 microsec. reduced to 5-7 microsec.
• Single threaded
  • 1-2 microsec. reduced latency

Page 29: ScicomP 11 Charles Grassl IBM May, 2005

HPS Performance Summary

[Plot: Rate (Mbyte/s) vs. Length (bytes) for large pages (LP) and small pages (SP).]

• High asymptotic peak bandwidth
  • ~4x vs. Colony
• Extra "kink" in performance curve
  • Bulk Transfer
• Small-message performance improvement
• Low latency
  • 5 microseconds

Page 30: ScicomP 11 Charles Grassl IBM May, 2005

Prescription For Use of RDMA

• Add to the LoadLeveler configuration file:
  • /usr/lpp/LoadL/full/samples/LoadL_config
  • SCHEDULE_BY_RESOURCES = RDMA
• Verification:
  • Run with MP_INFOLEVEL=2
  • Stderr must contain the text: "Bulk Transfer is enabled"
• Running:
  • LoadLeveler switch: # @ bulkxfer = yes

Page 31: ScicomP 11 Charles Grassl IBM May, 2005

Summary

• HPS
  • Bandwidth
    • Bulk Transfers
    • 4x higher bandwidth (large messages)
  • Latency
    • 5 microseconds