
August 22, 2005

Datacenter Fabric Workshop

Open MPI: Overview and Current Status

Tim Woodall - LANL

Galen Shipman - LANL/UNM

Overview

• Point-to-Point Architecture

• OpenIB

– Implementation

– Results

• Future Work

Point-to-Point Architecture

• Component Architecture:

– “Plug-ins” for different capabilities (e.g. different networks)

– Tunable run-time parameters

• Three component frameworks:

– Point-to-point messaging layer (PML) implements MPI semantics

– Byte Transfer Layer (BTL) abstracts network interfaces

– Memory Pool (mpool) provides for memory management/registration

PML Framework

• Single PML manages multiple BTL modules

– Maintains set of BTLs on a per-peer basis

– Message fragmentation and scheduling

• Implements MPI semantics

– Synchronous / buffered / ready / normal sends

– Persistent requests / Request completion

• Eager/Rendezvous protocol

– Eager send of short messages

– Configurable threshold (short vs. long)

– Multiple long protocols
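The eager/rendezvous split can be illustrated with a minimal Python sketch. The names `EAGER_LIMIT` and `choose_protocol` and the 12 Kbyte value are hypothetical, chosen only for illustration; the slides state only that the threshold is configurable.

```python
# Hypothetical threshold in bytes; the slides only say it is configurable.
EAGER_LIMIT = 12 * 1024


def choose_protocol(message_size, eager_limit=EAGER_LIMIT):
    """Return 'eager' for short messages and 'rendezvous' for long ones."""
    if message_size < 0:
        raise ValueError("message size must be non-negative")
    # Short messages are sent immediately; long ones wait for a match first.
    return "eager" if message_size <= eager_limit else "rendezvous"


if __name__ == "__main__":
    for size in (64, 12 * 1024, 1024 * 1024):
        print(size, choose_protocol(size))
```

A real PML would then pick one of the multiple long protocols for the rendezvous case; that choice is not modeled here.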

PML Protocols

• Send / receive pipeline to / from pre-registered buffers (non-contiguous data)

• MPI_Alloc_mem support

– Red/black tree of memory registrations

– BTL associated with registration is used by scheduler

– Xfer of contiguous data with 1 RDMA (after match)

• “Leave pinned” run-time parameter

– Registration on first-use

– MRU cache (configurable size) of registrations

– Bandwidth equivalent to pre-registered buffers (MPI_Alloc_mem)
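The “leave pinned” idea (register on first use, keep a bounded cache of recent registrations) can be sketched as follows. This is an illustrative model only: `RegistrationCache` and the `register` / `deregister` callbacks are hypothetical names, and evicting the least recently used entry when the cache is full is an assumption about how the configurable size is enforced.

```python
from collections import OrderedDict


class RegistrationCache:
    """Bounded cache mapping (address, length) to a registration handle."""

    def __init__(self, max_entries, register, deregister):
        self.max_entries = max_entries
        self.register = register      # hypothetical callback that pins memory
        self.deregister = deregister  # hypothetical callback that unpins memory
        self.entries = OrderedDict()

    def lookup(self, address, length):
        key = (address, length)
        if key in self.entries:
            # Cache hit: mark as most recently used and skip registration.
            self.entries.move_to_end(key)
            return self.entries[key]
        # Cache miss: register on first use and remember the handle.
        handle = self.register(address, length)
        self.entries[key] = handle
        if len(self.entries) > self.max_entries:
            # Assumed policy: drop the least recently used registration.
            _, evicted = self.entries.popitem(last=False)
            self.deregister(evicted)
        return handle
```

Repeated sends from the same buffer then pay the registration cost once, which is consistent with the slide's claim that bandwidth matches pre-registered buffers.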

PML Protocols (Continued)

• Dynamic memory registration/deregistration

– Fragment message and build pipeline of RDMA requests

– Overlap [de-]registration with RDMA

– Bandwidth 97% of pre-registered memory at large message sizes (8 Mbytes)

– Performance impacted by bus type/bandwidth
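A minimal sketch of the pipelining idea: the message is cut into fragments, and the registration of one fragment overlaps the RDMA of the previous one. The function names and the step layout are hypothetical, and deregistration is left out to keep the example short.

```python
def fragments(total_bytes, fragment_bytes):
    """Yield (offset, length) pairs that cover the whole message."""
    if fragment_bytes <= 0:
        raise ValueError("fragment size must be positive")
    offset = 0
    while offset < total_bytes:
        length = min(fragment_bytes, total_bytes - offset)
        yield offset, length
        offset += length


def pipeline_schedule(total_bytes, fragment_bytes):
    """List the work done in each step of a two-stage pipeline.

    In step i, fragment i is registered while fragment i-1 is transferred.
    """
    frags = list(fragments(total_bytes, fragment_bytes))
    if not frags:
        return []
    steps = []
    for i in range(len(frags) + 1):
        step = {}
        if i < len(frags):
            step["register"] = frags[i]
        if i >= 1:
            step["rdma"] = frags[i - 1]
        steps.append(step)
    return steps


if __name__ == "__main__":
    for step in pipeline_schedule(10, 4):
        print(step)
```

Only the first registration and the last transfer are not overlapped, which is why the cost of dynamic registration shrinks as messages grow.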

BTL Framework

• MPI agnostic

• Provides simple API to upper layers

– Tagged send/receive primitives

– One-sided put/get operations

• Access to data type engine for zero copy data transfer

• BTL modules natively support commodity networks:

– Current (self, shared memory, Myrinet GM/MX, InfiniBand mvapi/OpenIB, Portals, TCP)

– Planned (LAPI, Quadrics Elan4)

OpenIB BTL

• BTL module initialization

• Resource allocation

• Connection management

• Small message Xfer

• Large message Xfer

• OpenIB Issues

• Future Work

BTL module initialization

• A separate BTL module is initialized for each port on each HCA

• The PML schedules across these BTL modules just as it does for any other interconnect

• When multiple BTL modules exist, peers establish QP connections by matching subnets
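Subnet matching between peers can be pictured with a short sketch. The port names and subnet identifiers below are made-up illustrative values, and `match_ports` is a hypothetical helper, not part of the OpenIB BTL.

```python
def match_ports(local_ports, remote_ports):
    """Pair local and remote ports that share a subnet ID.

    Each port is a (port_name, subnet_id) tuple; both fields are illustrative.
    """
    pairs = []
    for local_name, local_subnet in local_ports:
        for remote_name, remote_subnet in remote_ports:
            if local_subnet == remote_subnet:
                # Same subnet: a QP connection could be set up on this pair.
                pairs.append((local_name, remote_name))
    return pairs


if __name__ == "__main__":
    local = [("hca0:1", "subnet-A"), ("hca0:2", "subnet-B")]
    remote = [("hca0:1", "subnet-B"), ("hca0:2", "subnet-C")]
    print(match_ports(local, remote))
```

Ports with no partner on the same subnet are simply left unconnected, matching the local-subnet-only routing mentioned under connection management.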

Resource Allocation

SRQ Scalability

Nodes   Frag size (Kbytes)   # posted   RQ per QP (Mbytes)   K*   SRQ (Mbytes)

128     8                    64         64                   2    1

256     8                    64         128                  4    2

512     8                    64         256                  8    4

1024    8                    64         512                  10   5

* K: multiplier based on number of nodes
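The figures in the SRQ scalability table are consistent with simple arithmetic: with one receive queue per QP, buffer memory grows with the number of nodes, while with a shared receive queue it grows only with the multiplier K. The sketch below checks this reading; the formulas are inferred from the table values and are an assumption, not something the slides state explicitly.

```python
KBYTE = 1024
MBYTE = 1024 * KBYTE


def rq_per_qp_mbytes(nodes, posted, frag_kbytes):
    """Receive buffer memory when every peer QP has its own receive queue."""
    return nodes * posted * frag_kbytes * KBYTE / MBYTE


def srq_mbytes(k, posted, frag_kbytes):
    """Receive buffer memory when one shared receive queue is scaled by K."""
    return k * posted * frag_kbytes * KBYTE / MBYTE


if __name__ == "__main__":
    # (nodes, K) pairs as read from the table; 64 fragments of 8 Kbytes each.
    for nodes, k in ((128, 2), (256, 4), (512, 8), (1024, 10)):
        print(nodes, rq_per_qp_mbytes(nodes, 64, 8), srq_mbytes(k, 64, 8))
```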

Connection management

• Addressing information is exchanged dynamically via an out-of-band (OOB) channel

– This greatly improves scalability, but at the cost of increased first-message latency

– Connections are established with peers in the same subnet (local subnet routing only)
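The trade-off above (better scalability, slower first message) follows from setting up a connection only when a peer is first used. The sketch below models that behavior under stated assumptions: `LazyConnections`, `exchange_address`, and `connect` are hypothetical names, and a plain list stands in for a connected queue pair.

```python
class LazyConnections:
    """Create a connection to a peer only when the first message is sent."""

    def __init__(self, exchange_address, connect):
        self.exchange_address = exchange_address  # hypothetical OOB lookup
        self.connect = connect                    # hypothetical QP setup
        self.connections = {}

    def send(self, peer, payload):
        conn = self.connections.get(peer)
        if conn is None:
            # First message to this peer pays for address exchange and setup.
            address = self.exchange_address(peer)
            conn = self.connect(address)
            self.connections[peer] = conn
        conn.append(payload)  # the list stands in for a real send queue
        return conn
```

Peers that never communicate never consume connection resources, which is where the scalability gain comes from.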

Small Message Xfer

– Maintain list of pre-registered fragments for send and recv

– List grows dynamically in chunks (more efficient to register)

– Small messages are copied to/from pre-registered buffers

– Recv descriptors are posted as needed based on min/max thresholds
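Posting receive descriptors against min/max thresholds can be sketched as a watermark check. The exact policy is an assumption: here, once the posted count drops below the minimum, the queue is refilled up to the maximum. The function and parameter names are hypothetical.

```python
def descriptors_to_post(posted, low_water, high_water):
    """Return how many receive descriptors to post right now.

    Assumed policy: refill to high_water once posted falls below low_water.
    """
    if low_water > high_water:
        raise ValueError("low_water must not exceed high_water")
    if posted < low_water:
        return high_water - posted
    return 0


if __name__ == "__main__":
    print(descriptors_to_post(3, 4, 16))
    print(descriptors_to_post(8, 4, 16))
```

Refilling in batches rather than one descriptor at a time keeps the posting overhead off the critical path of most receives.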

Small Message Performance

Average Latency

OpenMPI - OpenIB - *optimized 5.13 usec

OpenMPI - OpenIB - *defaults 5.43 usec

OpenMPI - Mvapi - *optimized 5.64 usec

OpenMPI - Mvapi - *defaults 5.94 usec

Mvapich - Mvapi (rdma/mem poll) 4.19 usec

Mvapich - Mvapi (send/recv) 6.51 usec

* Send/Recv based protocol

Large Message Xfer

• RDMA Write and RDMA Read are both supported

• RDMA Read provides better performance than RDMA Write because fewer control messages are needed

• RDMA pipeline protocol performance is highly dependent on I/O bus performance

Results: OpenMPI/OpenIB - All

Results: OpenMPI/OpenIB - All - Log

Results: OpenMPI/OpenIB - Eager limit

Results: Combined Results

Results: Combined Results - Log

OpenIB Opportunities

– User level notification of VM activity

• Caching of memory registrations can be dangerous

• Need the ability to detect VM changes that affect memory registrations (such as sbrk and munmap)

– Reliable Multicast for collectives

– SRQ performance, 2/10 usec penalty, but who’s counting?

Future Work

• Small message RDMA (using working set of peers) - optional

• Dynamic connection management using Unreliable Datagrams

• Dynamic connection teardown - optional

Source Code Access

• Subversion repository

• Download client from:

– http://subversion.tigris.org/

– v1.2.1 or later

• Check out with:

– svn co http://svn.open-mpi.org/svn/ompi/trunk ompi

– Anonymous, read-only access

Questions?

Tim Woodall

Email: [email protected]

Phone: 505-665-5224

Galen Shipman

Email: [email protected]

Hardware Specs

• Dual Intel Xeon 3.2 GHz

– 1024 KB Cache

• 2 Gbytes memory

• Bus: Intel Corp. E7525/E7520/E7320 PCI Express

• Mellanox Technologies MT25208 InfiniHost III Ex

• 288 Port Voltaire switch
