Multiplexing Endpoints of HCA for Scaling MPI Applications: Design and Performance Evaluation with...
-
Upload
marion-newman -
Category
Documents
-
view
235 -
download
0
Transcript of Multiplexing Endpoints of HCA for Scaling MPI Applications: Design and Performance Evaluation with...
Multiplexing Endpoints of HCA for Scaling MPI Applications: Design and
Performance Evaluation with uDAPL
Jasjit Singh, Yogeshwar SonawaneC-DAC, Pune, India.
IEEE Cluster 201021st September 2010
This work has been developed under the project 'National PARAM Supercomputing Facility and Next Generation HPC Technology' sponsored by Government of India's Department of Information Technology (DIT) under Ministry of Communication and Information Technology (MCIT) vide administrative approval No. DIT/R&D/C-DAC/2(2)/2008 dated 26/05/2008.
Presentation outline
• Introduction• Problem Statement• Proposed Design• Performance Evaluation• Related Work• Conclusion & Future Work
2
Introduction
• HPC clusters are increasing in size to address the computational needs of large challenging problems.
• MPI is the de-facto standard for writing parallel applications. It typically uses fully connected topology.
• ADI provides portability to MPI for multiple networks and network interfaces.
3
uDAPL Overview
• uDAPL is proposed by Direct Access Transport (DAT) Collaborative.
• It defines lightweight, transport-independent and platform-independent set of user level APIs to exploit RDMA capabilities, such as those present in InfiniBand, VIA and iWARP.
• Supported by many MPIs like MVAPICH2, Intel MPI, OpenMPI and HP-MPI.
4
Software
Hardware
Event Completion
Descriptor Posting
SQ
RQ
EVDEVD
Endpoint
Memory buffers
Memory buffers
CQ
Process
SQ
RQ
EVDEVD
Endpoint
Memory buffers
Memory buffers
CQ
Process
uDAPL Communication Model
5
Reliable Connection
• In RC, a connection is formed between every process pair using endpoints (equivalent to queue pairs) at both ends.
• Limited endpoints of a HCA restrict the number of connections that can be established by an MPI application.– Thus limiting nodes to be deployed in cluster.
6
Endpoint (EP) requirement
• A cluster has (N * P) number of processes.where, P = number of processes or cores per node.
N = number of nodes in cluster.
• Every process need to establish connections to rest of (N * P – 1) processes. For simplicity, assume it be (N * P).
EP requirement for a Process = (N * P)EP requirement for a node = (N * P * P)
• Increasing N or P increases EP requirement.– Increasing P drastically increases the EP requirement.
• Nmax = Endpoints with HCA / (P * P)
7
Problem Statement
• Hardware upgrade to meet increased endpoint requirements is costly and time-consuming.
• Can an optimal solution with existing HCA be thought ?
8
Multiplexing approach
• Extends scalability with existing hardware.
• Maps multiple software connections to fewer hardware connections without incurring any significant performance penalty.
• Thus, same HCA can support more number of nodes in the cluster.
9
Multiplexing Design: swep & hwep
• We distinguish software ep (swep) and hardware ep (hwep).
• Multiple sweps use single hwep for data transfer.
Software
Hardware
hweps
P1 P2 P3 P4
sweps
• A hardware connection is between hweps from two nodes. – Therefore software connections only between these two nodes will use this
hardware connection.
• One hwep is shared by sweps belonging to different processes on a node.
• Multiplexing should support both connection management as well as data transfer routines such as send, receive, RDMA Write etc.
10
Multi-Way Binding Problem
P1
N1
H0
H0
P3
P5
P7
P2
N2
P4
P6
P8
H1
H1
h0
h0
h1
h1
H1
H2
H1
H3
H2
11
Multiplexing Design: Multi-way binding
• The processing (issuing or servicing) of a connection request at a node is completely independent of the processing at the remote node.
• Without multiplexing, multi-way binding will not occur as every connection request sent or received will allocate a separate hwep.
• Issue related to Connection management.
• Connection between hweps has to be strictly one-to-one.
• Two hweps on one side (H1 and H3) are trying to bind to a single remote hwep (h2).
P1
P3
H3
H1 H2
N1N1
P2
N2N2
12
H1, vid 0
H2, vid 0
Solution with VID
P1
N1
H0
H0
P3
P5
P7
P2
N2
P4
P6
P8
H1
H1
h0
h0
h1
h1
vid 0
vid 0
H1
H3
H2
13
Multiplexing Design: Solution with VID
• For equal sharing, total number of hweps on a HCA can be divided as N * m, where N is the number of nodes in cluster.
– Here m is less than the practical EP requirement of P * P.
• If range of VID for a remote node (0 to m-1) is exhausted, a hwep already used (preferably least used) has to be reused.
• Virtual Identifier (VID) as a unique identifier for a hwep.
• Hweps with the same VID will be connected to each other.
H1, vid 0P1
P3
H3, vid 1
H2, vid 0
N1N1
P2
N2N2
14
Multiplexing Design: Endpoint as a Queue-pair
• A hwep context contains all the information about a single swep or a connection, like EVD number and PZ number.
• In multiplexing, one hwep is used by multiple sweps. – Either of the queue can own the context information.
• Fig. (a) is redrawn to show hwep as a queue-pair in fig. (b). Both queues will inherit VID.
• Generally, one hwep corresponds to one swep. hweps
P1 P2
Software
Hardware
SQRQSQRQ
P1 P2
(a) (b)
sweps
15
Separating SQ and RQ
• Many MPI libraries use single EVD, single PZ and same memory privilege for a process. – Hence all sweps of a process use the same EVD and PZ.
• We share SQ among processes and RQ with only one process.– Thus RQ owns information stored in a hwep context while the
same information for SQ is conveyed as a part of descriptor.
• During connection establishment, only RQ is selected. – Remote SQ is automatically chosen with VID of the remote SQ
same as that of the local RQ.
• SRQ functionality is feasible using RQ of a hwep.16
Static Mapping: Division of sweps
• For a fixed cluster environment, static mapping avoids various multiplexing overheads.
– Such as during allocating hweps, sweps and maintaining their association.
LPID 2 to (P-2)
RN2 to RN(N-2)
RN (N-1)
RN 1
RN 0
RPID 0 RPID 0
RPID 1 RPID 1
RPID (P - 1) RPID (P - 1)
LPID 0
LPID 1
LPID (P-1)
P number of sweps for each LPID
LPID = Local Process Identifier
RPID = Remote Process Identifier
RN = Remote Node Number
17
Static Mapping: Division of hweps
• Similarly, static allocation of hweps is possible.
• Multiplexing is (N * P * P) : (N * P * X) i.e. P : X.– where X is less than P.– P sweps will share X hweps.– X SQs and X RQs will be used by P sweps.
• Combination of LPID, RPID and RN acts as a VID.
18
Performance Evaluation
• We compare results for following two modelsa) without multiplexing termed as basic model b) with multiplexing termed as scalable model.
• We have evaluated multiplexing design using uDAPL over PARAMNet-3 (pnet3) interconnect.
19
Experimental Platform• Two clusters: Cluster A of 16 nodes, Cluster B of 48 nodes.
• Each node has quad 2.93 GHz Intel Xeon Tigerton quad-core processors, 64 GB RAM and PCI-express based pnet3 HCA.
• Intel MPI having environment variable based control for using only RDMA-Write operations.
• Pnet3 is a high-performance cluster interconnect developed by C-DAC. It comprises of – 48 port switch with 10Gbps full-duplex CX4 connectivity.– X4/x8 PCIe HCA having 4096 endpoints.– Light weight protocol software stack known as KSHIPRA.
• KSHIPRA supports uDAPL library as well as some selected components of OFED stack i.e. IPoIB, SDP and iSER.
20
Multiplexing Ratio (mux-ratio)Multiplexing
ratioSweps supported
No. of nodes (max)
No multiplexing 4k 16
2:1 8k 32
4:1 16k 64
8:1 32k 128
16:1 64k 256
hweps used (sweps / mux-ratio)
mux-ratio 8 nodes 16 nodes 32 nodes 48 nodes
Basic Model 2048 4096 8192 12288
2 1024 2048 4096 6144
4 512 1024 2048 3072
8 256 512 1024 1536
16 128 256 512 768
• Mux-ratio is the ratio in which multiple sweps use a single hwep.
• It is not possible to run applications using Basic Model beyond 16 nodes.
• In multiplexing, increasing mux-ratio increases the number of nodes that can be deployed in the cluster.
– Brings down the hwep requirement to number of hweps supported by HCA.
21
Intel MPI Benchmarks (IMB)
0
2000
4000
6000
8000
10000
12000
14000
16000
18000
20000
0 2 8 32 128 512 2K
Message size (bytes)
t_av
g (u
sec)
basic model 2:1
4:1 8:1
16:1
0
5000000
10000000
15000000
20000000
25000000
8K 32K 128K 512K 2M
Message size (bytes)t_
avg
(use
c)
basic model 2:1
4:1 8:1
16:1
• Very little variation in readings is observed across all the mux-ratios in nearly all of the benchmarks.
IMB Alltoall, 128 processes on 8 nodes
22
NAS Parallel Benchmarks (NPB)
0
1
2
3
4
5
6
7
8
9
EP IS MG
Tim
e (s
ec)
basic model 2:1
4:1 8:1
16:1
0
20
40
60
80
100
120
FT LU BT CG SP
Tim
e (s
ec)
basic model 2:1
4:1 8:1
16:1
• NPB contains computing kernels typical of various CFD scientific applications. Each benchmark has different communication pattern.
• IS shows maximum of 5 % degradation with 16:1 multiplexing.
NAS Class C readings, 256 processes on 16 nodes
23
HPL Benchmark
Nodes Processes% Memory used for N
Peak Computing
power (TFlops)
Basic Model
(Gflops)
2:1 MUX
(Gflops)
4:1 MUX
(Gflops)
8:1 MUX
(Gflops)
16:1 MUX
(Gflops)
16 256 90 3 2121 2182 2144 2181 2173
32 512 80 6 Not Applicable 4157 4111 4247 4274
48 768 80 9 Not Applicable
Not Applicable 6076 6031 6024
• 32 and 48 nodes run shows successful scalability of MPI applications using multiplexing technique.
• The marginal improvement is due to management of lesser number of hweps on HCA. 24
Related Work
• SRQ based designs for reducing communication buffer requirements.
• On-demand connection management: connection only when required.– Worst case all-to-all pattern may emerge. – As our work is incorporated into uDAPL provider, many features of MPI can be used in
conjunction with our technique.
• eXtended Reliable Connection (XRC) transport provides services of RC transport while providing additional scalability for multi-core clusters.
– It allows a single connection from one process to entire node.
• Hybrid programming model (e.g. OpenMP with MPI) uses threads within a node and MPI processes across nodes.
– All threads running on a node share same set of connections. – For hybrid model to work, MPI applications should be thread enabled. – Our work is part of transport library, so MPI applications can run seamlessly.
25
Conclusion and Future Work
• Proposed multiplexing technique to extend scalability of MPI applications.
– effort is to map the MPI requirement to the available pool of endpoints on HCA.
• The multiplexing technique can be applied to any transport library that provides connection-oriented service.
• We can scale the cluster size in a proportion same as the mux-ratio.– E.g. with 16:1 mux-ratio, the number of nodes in the cluster can be 16 times with
the same HCA.
• No visible performance degradation is observed up to 48 nodes.
• Future work includes evaluation at larger scale, addition of send-receive support and addition of SRQ support.
26
Backup slides
uDAPL Communication Model• Support for both Channel Semantics (Send/Receive) and Memory
Semantics (RDMA Write and RDMA Read).
• Reliable Connection oriented model with endpoints as source and sink of a communication channel.
• Data Transfer Operations (DTO) (i.e. Work Requests or descriptors) are posted on an endpoint.
• Completion of DTO is reported as an event on Event Dispatcher (EVD) (similar to CQ).
– Either polling/de-queue or wait model can be used for completion reaping.
• Protection Zone (PZ) and Memory Privilege flags validates memory access.
• defines SRQ mechanism that provides the ability to share receive buffers among several connections.
29
Send-Receive Handling Complexities
• During recv DTO processing, mismatch in receive descriptors corresponding to their send descriptors can happen. This is due to sharing of hwep RQ.
• Hwep RQ can have descriptors from different sweps of varied lengths.
• Additional hardware support to handle above complexities is required.
30