Multiplexing Endpoints of HCA for Scaling MPI Applications: Design and Performance Evaluation with...

Multiplexing Endpoints of HCA for Scaling MPI Applications: Design and

Performance Evaluation with uDAPL

Jasjit Singh, Yogeshwar SonawaneC-DAC, Pune, India.

IEEE Cluster 201021st September 2010

This work has been developed under the project 'National PARAM Supercomputing Facility and Next Generation HPC Technology' sponsored by Government of India's Department of Information Technology (DIT) under Ministry of Communication and Information Technology (MCIT) vide administrative approval No. DIT/R&D/C-DAC/2(2)/2008 dated 26/05/2008.

Presentation outline

• Introduction• Problem Statement• Proposed Design• Performance Evaluation• Related Work• Conclusion & Future Work

2

Introduction

• HPC clusters are increasing in size to address the computational needs of large challenging problems.

• MPI is the de-facto standard for writing parallel applications. It typically uses fully connected topology.

• ADI provides portability to MPI for multiple networks and network interfaces.

3

uDAPL Overview

• uDAPL is proposed by Direct Access Transport (DAT) Collaborative.

• It defines lightweight, transport-independent and platform-independent set of user level APIs to exploit RDMA capabilities, such as those present in InfiniBand, VIA and iWARP.

• Supported by many MPIs like MVAPICH2, Intel MPI, OpenMPI and HP-MPI.

4

Software

Hardware

Event Completion

Descriptor Posting

SQ

RQ

EVDEVD

Endpoint

Memory buffers

Memory buffers

CQ

Process

SQ

RQ

EVDEVD

Endpoint

Memory buffers

Memory buffers

CQ

Process

uDAPL Communication Model

5

Reliable Connection

• In RC, a connection is formed between every process pair using endpoints (equivalent to queue pairs) at both ends.

• Limited endpoints of a HCA restrict the number of connections that can be established by an MPI application.– Thus limiting nodes to be deployed in cluster.

6

Endpoint (EP) requirement

• A cluster has (N * P) number of processes.where, P = number of processes or cores per node.

N = number of nodes in cluster.

• Every process need to establish connections to rest of (N * P – 1) processes. For simplicity, assume it be (N * P).

EP requirement for a Process = (N * P)EP requirement for a node = (N * P * P)

• Increasing N or P increases EP requirement.– Increasing P drastically increases the EP requirement.

• Nmax = Endpoints with HCA / (P * P)

7

Problem Statement

• Hardware upgrade to meet increased endpoint requirements is costly and time-consuming.

• Can an optimal solution with existing HCA be thought ?

8

Multiplexing approach

• Extends scalability with existing hardware.

• Maps multiple software connections to fewer hardware connections without incurring any significant performance penalty.

• Thus, same HCA can support more number of nodes in the cluster.

9

Multiplexing Design: swep & hwep

• We distinguish software ep (swep) and hardware ep (hwep).

• Multiple sweps use single hwep for data transfer.

Software

Hardware

hweps

P1 P2 P3 P4

sweps

• A hardware connection is between hweps from two nodes. – Therefore software connections only between these two nodes will use this

hardware connection.

• One hwep is shared by sweps belonging to different processes on a node.

• Multiplexing should support both connection management as well as data transfer routines such as send, receive, RDMA Write etc.

10

Multi-Way Binding Problem

P1

N1

H0

H0

P3

P5

P7

P2

N2

P4

P6

P8

H1

H1

h0

h0

h1

h1

H1

H2

H1

H3

H2

11

Multiplexing Design: Multi-way binding

• The processing (issuing or servicing) of a connection request at a node is completely independent of the processing at the remote node.

• Without multiplexing, multi-way binding will not occur as every connection request sent or received will allocate a separate hwep.

• Issue related to Connection management.

• Connection between hweps has to be strictly one-to-one.

• Two hweps on one side (H1 and H3) are trying to bind to a single remote hwep (h2).

P1

P3

H3

H1 H2

N1N1

P2

N2N2

12

H1, vid 0

H2, vid 0

Solution with VID

P1

N1

H0

H0

P3

P5

P7

P2

N2

P4

P6

P8

H1

H1

h0

h0

h1

h1

vid 0

vid 0

H1

H3

H2

13

Multiplexing Design: Solution with VID

• For equal sharing, total number of hweps on a HCA can be divided as N * m, where N is the number of nodes in cluster.

– Here m is less than the practical EP requirement of P * P.

• If range of VID for a remote node (0 to m-1) is exhausted, a hwep already used (preferably least used) has to be reused.

• Virtual Identifier (VID) as a unique identifier for a hwep.

• Hweps with the same VID will be connected to each other.

H1, vid 0P1

P3

H3, vid 1

H2, vid 0

N1N1

P2

N2N2

14

Multiplexing Design: Endpoint as a Queue-pair

• A hwep context contains all the information about a single swep or a connection, like EVD number and PZ number.

• In multiplexing, one hwep is used by multiple sweps. – Either of the queue can own the context information.

• Fig. (a) is redrawn to show hwep as a queue-pair in fig. (b). Both queues will inherit VID.

• Generally, one hwep corresponds to one swep. hweps

P1 P2

Software

Hardware

SQRQSQRQ

P1 P2

(a) (b)

sweps

15

Separating SQ and RQ

• Many MPI libraries use single EVD, single PZ and same memory privilege for a process. – Hence all sweps of a process use the same EVD and PZ.

• We share SQ among processes and RQ with only one process.– Thus RQ owns information stored in a hwep context while the

same information for SQ is conveyed as a part of descriptor.

• During connection establishment, only RQ is selected. – Remote SQ is automatically chosen with VID of the remote SQ

same as that of the local RQ.

• SRQ functionality is feasible using RQ of a hwep.16

Static Mapping: Division of sweps

• For a fixed cluster environment, static mapping avoids various multiplexing overheads.

– Such as during allocating hweps, sweps and maintaining their association.

LPID 2 to (P-2)

RN2 to RN(N-2)

RN (N-1)

RN 1

RN 0

RPID 0 RPID 0

RPID 1 RPID 1

RPID (P - 1) RPID (P - 1)

LPID 0

LPID 1

LPID (P-1)

P number of sweps for each LPID

LPID = Local Process Identifier

RPID = Remote Process Identifier

RN = Remote Node Number

17

Static Mapping: Division of hweps

• Similarly, static allocation of hweps is possible.

• Multiplexing is (N * P * P) : (N * P * X) i.e. P : X.– where X is less than P.– P sweps will share X hweps.– X SQs and X RQs will be used by P sweps.

• Combination of LPID, RPID and RN acts as a VID.

18

Performance Evaluation

• We compare results for following two modelsa) without multiplexing termed as basic model b) with multiplexing termed as scalable model.

• We have evaluated multiplexing design using uDAPL over PARAMNet-3 (pnet3) interconnect.

19

Experimental Platform• Two clusters: Cluster A of 16 nodes, Cluster B of 48 nodes.

• Each node has quad 2.93 GHz Intel Xeon Tigerton quad-core processors, 64 GB RAM and PCI-express based pnet3 HCA.

• Intel MPI having environment variable based control for using only RDMA-Write operations.

• Pnet3 is a high-performance cluster interconnect developed by C-DAC. It comprises of – 48 port switch with 10Gbps full-duplex CX4 connectivity.– X4/x8 PCIe HCA having 4096 endpoints.– Light weight protocol software stack known as KSHIPRA.

• KSHIPRA supports uDAPL library as well as some selected components of OFED stack i.e. IPoIB, SDP and iSER.

20

Multiplexing Ratio (mux-ratio)Multiplexing

ratioSweps supported

No. of nodes (max)

No multiplexing 4k 16

2:1 8k 32

4:1 16k 64

8:1 32k 128

16:1 64k 256

hweps used (sweps / mux-ratio)

mux-ratio 8 nodes 16 nodes 32 nodes 48 nodes

Basic Model 2048 4096 8192 12288

2 1024 2048 4096 6144

4 512 1024 2048 3072

8 256 512 1024 1536

16 128 256 512 768

• Mux-ratio is the ratio in which multiple sweps use a single hwep.

• It is not possible to run applications using Basic Model beyond 16 nodes.

• In multiplexing, increasing mux-ratio increases the number of nodes that can be deployed in the cluster.

– Brings down the hwep requirement to number of hweps supported by HCA.

21

Intel MPI Benchmarks (IMB)

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

20000

0 2 8 32 128 512 2K

Message size (bytes)

t_av

g (u

sec)

basic model 2:1

4:1 8:1

16:1

0

5000000

10000000

15000000

20000000

25000000

8K 32K 128K 512K 2M

Message size (bytes)t_

avg

(use

c)

basic model 2:1

4:1 8:1

16:1

• Very little variation in readings is observed across all the mux-ratios in nearly all of the benchmarks.

IMB Alltoall, 128 processes on 8 nodes

22

NAS Parallel Benchmarks (NPB)

0

1

2

3

4

5

6

7

8

9

EP IS MG

Tim

e (s

ec)

basic model 2:1

4:1 8:1

16:1

0

20

40

60

80

100

120

FT LU BT CG SP

Tim

e (s

ec)

basic model 2:1

4:1 8:1

16:1

• NPB contains computing kernels typical of various CFD scientific applications. Each benchmark has different communication pattern.

• IS shows maximum of 5 % degradation with 16:1 multiplexing.

NAS Class C readings, 256 processes on 16 nodes

23

HPL Benchmark

Nodes Processes% Memory used for N

Peak Computing

power (TFlops)

Basic Model

(Gflops)

2:1 MUX

(Gflops)

4:1 MUX

(Gflops)

8:1 MUX

(Gflops)

16:1 MUX

(Gflops)

16 256 90 3 2121 2182 2144 2181 2173

32 512 80 6 Not Applicable 4157 4111 4247 4274

48 768 80 9 Not Applicable

Not Applicable 6076 6031 6024

• 32 and 48 nodes run shows successful scalability of MPI applications using multiplexing technique.

• The marginal improvement is due to management of lesser number of hweps on HCA. 24

Related Work

• SRQ based designs for reducing communication buffer requirements.

• On-demand connection management: connection only when required.– Worst case all-to-all pattern may emerge. – As our work is incorporated into uDAPL provider, many features of MPI can be used in

conjunction with our technique.

• eXtended Reliable Connection (XRC) transport provides services of RC transport while providing additional scalability for multi-core clusters.

– It allows a single connection from one process to entire node.

• Hybrid programming model (e.g. OpenMP with MPI) uses threads within a node and MPI processes across nodes.

– All threads running on a node share same set of connections. – For hybrid model to work, MPI applications should be thread enabled. – Our work is part of transport library, so MPI applications can run seamlessly.

25

Conclusion and Future Work

• Proposed multiplexing technique to extend scalability of MPI applications.

– effort is to map the MPI requirement to the available pool of endpoints on HCA.

• The multiplexing technique can be applied to any transport library that provides connection-oriented service.

• We can scale the cluster size in a proportion same as the mux-ratio.– E.g. with 16:1 mux-ratio, the number of nodes in the cluster can be 16 times with

the same HCA.

• No visible performance degradation is observed up to 48 nodes.

• Future work includes evaluation at larger scale, addition of send-receive support and addition of SRQ support.

26

Thank you

[email protected]

www.cdac.in

www.cdac.in/html/htdg/products.asp

Backup slides

uDAPL Communication Model• Support for both Channel Semantics (Send/Receive) and Memory

Semantics (RDMA Write and RDMA Read).

• Reliable Connection oriented model with endpoints as source and sink of a communication channel.

• Data Transfer Operations (DTO) (i.e. Work Requests or descriptors) are posted on an endpoint.

• Completion of DTO is reported as an event on Event Dispatcher (EVD) (similar to CQ).

– Either polling/de-queue or wait model can be used for completion reaping.

• Protection Zone (PZ) and Memory Privilege flags validates memory access.

• defines SRQ mechanism that provides the ability to share receive buffers among several connections.

29

Send-Receive Handling Complexities

• During recv DTO processing, mismatch in receive descriptors corresponding to their send descriptors can happen. This is due to sharing of hwep RQ.

• Hwep RQ can have descriptors from different sweps of varied lengths.

• Additional hardware support to handle above complexities is required.

30

Multiplexing Endpoints of HCA for Scaling MPI Applications: Design and Performance Evaluation with...

Documents

Transcript of Multiplexing Endpoints of HCA for Scaling MPI Applications: Design and Performance Evaluation with...