High-performance and Scalable MPI+X Library for Emerging HPC Clusters & Cloud Platforms
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Talk at Intel HPC Developer Conference (SC ‘17)
by
Hari Subramoni
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~subramon
High-End Computing (HEC): ExaFlop & ExaByte
• ExaFlop & HPC: 100-200 PFlops in 2016-2018; 1 EFlops in 2023-2024?
• ExaByte & BigData: 10K-20K EBytes in 2016-2018; 40K EBytes in 2020?
Drivers of Modern HPC Cluster Architectures
• Multi-core/many-core technologies
• High Performance Interconnects
• High Performance Storage and Compute devices
• MPI is used by the vast majority of HPC applications
[Figure: Representative hardware – multi-core processors; high-performance interconnects (InfiniBand, Omni-Path; <1 usec latency, 100 Gbps bandwidth); accelerators/coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); SSD, NVMe-SSD, NVRAM; example systems: Tianhe-2, Titan, K Computer, Sunway TaihuLight]
Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges
[Figure: Software stack – Application Kernels/Applications; Programming Models (MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.); Communication Library or Runtime for Programming Models (point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, fault tolerance); Networking Technologies (InfiniBand, 40/100 GigE, Aries, and Omni-Path), Multi/Many-core Architectures, Accelerators (GPU and FPGA). Middleware co-design opportunities and challenges across the various layers: performance, scalability, fault-resilience]
Exascale Programming models
[Figure: Next-generation programming models – MPI+X, where X = OpenMP, OpenACC, CUDA, PGAS, Tasks, ...; motivated by highly-threaded systems (KNL), irregular communications, and heterogeneous computing with accelerators]
• The community believes the exascale programming model will be MPI+X
• But what is X? – Can it be just OpenMP?
• Many different environments and systems are emerging
– Different `X’ will satisfy the respective needs
MPI+X Programming Model: Broad Challenges at Exascale
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Scalable job start-up
• Scalable collective communication
  – Offload
  – Non-blocking
  – Topology-aware
• Balancing intra-node and inter-node communication for next-generation nodes (128-1024 cores)
  – Multiple end-points per node
• Support for efficient multi-threading
• Integrated support for GPGPUs and FPGAs
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + UPC++, MPI + OpenSHMEM, CAF, …) – a minimal MPI+OpenMP sketch follows this list
• Virtualization
• Energy-awareness
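As a concrete illustration of "MPI+X", the following is a minimal, hedged sketch of one common combination, MPI + OpenMP: MPI ranks across nodes with OpenMP threads inside each rank. The loop body and problem size are placeholders, not taken from the talk.

/* Minimal MPI+OpenMP sketch: MPI between processes, OpenMP within a process. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, size;
    /* FUNNELED: only the main thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = 0.0;
    /* Intra-node parallelism: the "X" part. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000000; i++)
        local += 1.0 / (i + 1.0 + rank);

    double global = 0.0;
    /* Inter-node communication: the MPI part. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks x %d threads = %f\n",
               size, omp_get_max_threads(), global);
    MPI_Finalize();
    return 0;
}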
Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Started in 2001, First version available in 2002
– MVAPICH2-X (MPI + PGAS), Available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Support for Virtualization (MVAPICH2-Virt), Available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 2,825 organizations in 85 countries
– More than 432,000 (> 0.4 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (June ‘17 ranking)
• 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
• 15th, 241,108-core (Pleiades) at NASA
• 20th, 462,462-core (Stampede) at TACC
• 44th, 74,520-core (Tsubame 2.5) at Tokyo Institute of Technology
– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
– System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
– Sunway TaihuLight (1st in Jun’17, 10M cores, 100 PFlops)
MVAPICH2 Software Family

High-Performance Parallel Programming Libraries
– MVAPICH2: Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
– MVAPICH2-X: Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with unified communication runtime
– MVAPICH2-GDR: Optimized MPI for clusters with NVIDIA GPUs
– MVAPICH2-Virt: High-performance and scalable MPI for hypervisor- and container-based HPC clouds
– MVAPICH2-EA: Energy-aware and high-performance MPI
– MVAPICH2-MIC: Optimized MPI for clusters with Intel KNC

Microbenchmarks
– OMB: Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs

Tools
– OSU INAM: Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
– OEMT: Utility to measure the energy consumption of MPI applications
Outline
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication
  – Scalable start-up
  – Dynamic and adaptive communication protocols and tag matching
  – Optimized collectives using SHArP and multi-leaders
  – Optimized CMA-based collectives
• Hybrid MPI+PGAS models for irregular applications
• Heterogeneous computing with accelerators
• HPC and Cloud
Startup Performance on KNL + Omni-Path
[Chart: MPI_Init and Hello World time (seconds) vs. number of processes (64 to 64K) on Oakforest-PACS, comparing Hello World (MVAPICH2-2.3a) and MPI_Init (MVAPICH2-2.3a)]
• MPI_Init takes 22 seconds for 229,376 processes on 3,584 KNL nodes (Stampede2 – full scale)
• 8.8 times faster than Intel MPI at 128K processes (Courtesy: TACC)
• At 64K processes, MPI_Init and Hello World take 5.8 s and 21 s, respectively (Oakforest-PACS)
• All numbers reported with 64 processes per node
New designs available in latest MVAPICH2 libraries and as patch for SLURM-15.08.8 and SLURM-16.05.1
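For reference, the measurement above times a program of essentially this shape: the "MPI_Init" number is the time to complete initialization at scale, and "Hello World" adds the first round of output. A minimal sketch:

/* Minimal MPI "hello world" of the kind timed above. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);                 /* startup cost measured as "MPI_Init" */
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);   /* "Hello World" point */
    MPI_Finalize();
    return 0;
}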
Dynamic and Adaptive MPI Point-to-point Communication Protocols
[Figure: Eager threshold for an example communication pattern with different designs – processes 0-3 on node 1 communicate with processes 4-7 on node 2. Default: 16 KB for every pair; Manually Tuned: 128 KB for every pair; Dynamic + Adaptive: 32 KB, 64 KB, 128 KB, and 32 KB chosen per pair]
H. Subramoni, S. Chakraborty, D. K. Panda, Designing Dynamic & Adaptive MPI Point-to-Point Communication Protocols for Efficient Overlap of Computation & Communication, ISC'17 - Best Paper
[Charts: Execution time (wall-clock seconds) and relative memory consumption of Amber at 128-1K processes, comparing Default, Threshold=17K, Threshold=64K, Threshold=128K, and Dynamic Threshold]
Default: poor overlap, low memory requirement – low performance, high productivity
Manually Tuned: good overlap, high memory requirement – high performance, low productivity
Dynamic + Adaptive: good overlap, optimal memory requirement – high performance, high productivity
Desired Eager Threshold
Process Pair   Eager Threshold (KB)
0 – 4          32
1 – 5          64
2 – 6          128
3 – 7          32
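The eager threshold matters because it decides when a send can complete without a rendezvous handshake, which in turn decides how much of a non-blocking exchange like the one below can overlap with computation. This is a minimal sketch of the overlap pattern the Amber results above exercise; compute(), the buffers, and the 64 KB size are placeholders.

/* Non-blocking exchange whose overlap depends on the eager/rendezvous threshold. */
#include <mpi.h>

void compute(void);   /* placeholder for application work to overlap */

void exchange(char *sendbuf, char *recvbuf, int peer, MPI_Comm comm)
{
    const int n = 64 * 1024;     /* illustrative size near a typical threshold */
    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, n, MPI_CHAR, peer, 0, comm, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_CHAR, peer, 0, comm, &reqs[1]);
    compute();                   /* overlap window: transfer may progress eagerly */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}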
Dynamic and Adaptive Tag Matching
[Charts: Normalized total tag matching time at 512 processes, normalized to Default (lower is better); normalized memory overhead per process at 512 processes, compared to Default (lower is better)]
Adaptive and Dynamic Design for MPI Tag Matching; M. Bayatpour, H. Subramoni, S. Chakraborty, and D. K. Panda; IEEE Cluster 2016. [Best Paper Nominee]
Challenge: Tag matching is a significant overhead for receivers. Existing solutions are static, do not adapt dynamically to the communication pattern, and do not consider memory overhead.
Solution: A new tag matching design that dynamically adapts to communication patterns, uses different strategies for different ranks, and bases decisions on the number of request objects that must be traversed before hitting the required one.
Results: Better performance than other state-of-the-art tag-matching schemes, with minimum memory consumption.
Will be available in future MVAPICH2 releases
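To make the matching cost concrete, the sketch below (illustrative only, not the MVAPICH2 design) posts many receives with distinct tags; an arriving message must be compared against the posted-receive queue until its (source, tag) pair matches, which is the traversal cost the adaptive design targets.

/* Illustration of the posted-receive (tag-matching) queue an MPI library must search. */
#include <mpi.h>

#define NUM_REQ 1024   /* illustrative number of outstanding receives */

void post_many_receives(char bufs[NUM_REQ][256], MPI_Request reqs[NUM_REQ],
                        MPI_Comm comm)
{
    for (int tag = 0; tag < NUM_REQ; tag++)
        /* Each pending receive adds one entry the library may have to traverse. */
        MPI_Irecv(bufs[tag], 256, MPI_CHAR, MPI_ANY_SOURCE, tag, comm,
                  &reqs[tag]);
}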
Advanced Allreduce Collective Designs Using SHArP and Multi-Leaders
• Socket-based design can reduce the communication latency by 23% and 40% on Xeon + IB nodes
• Support is available in MVAPICH2 2.3a and MVAPICH2-X 2.3b
[Charts: (left) OSU micro-benchmark MPI_Allreduce latency (us) vs. message size (4 B - 4 KB) on 16 nodes, 28 PPN; (right) HPCG communication latency (seconds) at 56, 224, and 448 processes (28 PPN). Compared: MVAPICH2, Proposed Socket-Based, and MVAPICH2+SHArP; 23% and 40% latency reductions, lower is better]
M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, Supercomputing '17.
Performance of MPI_Allreduce On Stampede2 (10,240 Processes)
[Charts: OSU micro-benchmark MPI_Allreduce latency (us) vs. message size at 10,240 processes (64 PPN) on Stampede2 – small messages (4 B - 4 KB) and large messages (8 KB - 256 KB), comparing MVAPICH2, MVAPICH2-OPT, and Intel MPI (IMPI); up to 2.4X improvement]
• MPI_Allreduce latency with 32K bytes reduced by 2.4X
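The reported numbers are latencies of calls of the following form, averaged over many iterations per message size. This is a minimal sketch of what an osu_allreduce-style measurement does; the count and iteration handling are illustrative.

/* Average MPI_Allreduce latency for a given element count. */
#include <mpi.h>

double time_allreduce(double *sendbuf, double *recvbuf, int count,
                      int iters, MPI_Comm comm)
{
    MPI_Barrier(comm);                       /* start everyone together */
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Allreduce(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM, comm);
    return (MPI_Wtime() - t0) / iters;       /* average latency in seconds */
}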
Performance of the MiniAMR Application on Stampede2 and Bridges
[Charts: MiniAMR latency (s) vs. number of processes – Stampede2 (32 PPN, 512-4096 processes, comparing MVAPICH2, DPML, and IMPI) and Bridges (28 PPN, 448-1792 processes, comparing MVAPICH2, MVAPICH2-OPT, and IMPI)]
• For MiniAMR with 2,048 processes, MVAPICH2-OPT reduces the latency by 2.6X on Stampede2
• On Bridges, with 1,792 processes, MVAPICH2-OPT reduces the latency by 1.5X
Enhanced MPI_Bcast with Optimized CMA-based Design
[Charts: MPI_Bcast latency (us, log scale) vs. message size (1 KB - 4 MB) on KNL (64 processes), Broadwell (28 processes), and Power8 (160 processes), comparing MVAPICH2-2.3a, Intel MPI 2017, OpenMPI 2.1.0, and the proposed design]
• Up to 2x-4x improvement over the existing implementation for 1 MB messages
• Up to 1.5x-2x faster than Intel MPI and Open MPI for 1 MB messages
• Improvements obtained for large messages only
• p-1 copies with CMA, p copies with shared memory
• Fallback to SHMEM for small messages
S. Chakraborty, H. Subramoni, and D. K. Panda, Contention Aware Kernel-Assisted MPI Collectives for Multi/Many-core Systems, IEEE Cluster ’17, BEST Paper Finalist
Support is available in MVAPICH2-X 2.3b
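CMA here is Linux Cross Memory Attach: a single kernel-assisted copy moves data directly between the address spaces of two processes on the same node, which is why the CMA path needs only p-1 copies versus p copies when staging through a shared-memory region. The sketch below shows one such copy; it is illustrative only (not the MVAPICH2 collective code) and assumes the remote pid and buffer address were exchanged beforehand, e.g., over shared memory.

/* One kernel-assisted CMA copy from another local process's memory. */
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

ssize_t cma_copy_from(pid_t remote_pid, void *remote_addr,
                      void *local_buf, size_t len)
{
    struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = len };
    /* Reads len bytes from remote_pid's address space into local_buf. */
    return process_vm_readv(remote_pid, &local, 1, &remote, 1, 0);
}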
Outline
• Scalability for million to billion processors
• Hybrid MPI+PGAS models for irregular applications
• Heterogeneous computing with accelerators
• HPC and Cloud
Hybrid (MPI+PGAS) Programming
• Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics
• Benefits:
– Best of Distributed Computing Model
– Best of Shared Memory Computing Model
• Cons
– Two different runtimes
– Need great care while programming
– Prone to deadlock if not careful
[Figure: An HPC application composed of kernels 1..N; in the hybrid version, selected kernels (e.g., Kernel 2 and Kernel N) are re-written in PGAS while the others remain in MPI]
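A minimal sketch of this hybrid pattern with MPI + OpenSHMEM is shown below: one "kernel" uses two-sided MPI, another uses a one-sided OpenSHMEM put, with both models active in the same program. It is illustrative only; it uses the newer OpenSHMEM names (shmem_init, shmem_malloc), and initialization order and interoperability details are implementation-specific (a unified runtime such as MVAPICH2-X is designed for exactly this combination).

/* Hybrid MPI + OpenSHMEM sketch: two-sided and one-sided kernels side by side. */
#include <mpi.h>
#include <shmem.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    shmem_init();                               /* both models active */

    int me   = shmem_my_pe();
    int npes = shmem_n_pes();
    long *sym = (long *) shmem_malloc(sizeof(long));  /* symmetric heap object */

    /* "Kernel 1": regular communication with two-sided MPI. */
    long local = me, sum = 0;
    MPI_Allreduce(&local, &sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

    /* "Kernel 2": irregular, one-sided update with OpenSHMEM. */
    shmem_long_p(sym, sum, (me + 1) % npes);    /* put to right neighbor */
    shmem_barrier_all();

    shmem_free(sym);
    shmem_finalize();
    MPI_Finalize();
    return 0;
}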
MVAPICH2-X for Hybrid MPI + PGAS Applications
[Figure: MPI, OpenSHMEM, UPC, CAF, UPC++, or hybrid (MPI + PGAS) applications issue MPI, OpenSHMEM, UPC, CAF, and UPC++ calls into the unified MVAPICH2-X runtime, which runs over InfiniBand, RoCE, and iWARP]
• Unified communication runtime for MPI, UPC, OpenSHMEM, CAF, UPC++ available with MVAPICH2-X 1.9 onwards! (since 2012)
– http://mvapich.cse.ohio-state.edu
• Feature Highlights
– Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, UPC++, MPI(+OpenMP) + OpenSHMEM, MPI(+OpenMP) + UPC
– MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant (with initial support for UPC 1.3), CAF 2008 standard (OpenUH), UPC++
– Scalable Inter-node and intra-node communication – point-to-point and collectives
UPC++ Collectives Performance
[Figure: Two software stacks for an MPI + UPC++ application – (left) UPC++ runtime over GASNet interfaces with an MPI network conduit; (right) UPC++ runtime and MPI interfaces directly over the MVAPICH2-X Unified Communication Runtime (UCR)]
• Full and native support for hybrid MPI + UPC++ applications
• Better performance compared to IBV and MPI conduits
• OSU Micro-benchmarks (OMB) support for UPC++
• Available since MVAPICH2-X 2.2RC1
[Chart: Inter-node broadcast time (us) vs. message size on 64 nodes (1 PPN), comparing GASNet_MPI, GASNet_IBV, and MV2-X; up to 14x improvement]
J. M. Hashmi, K. Hamidouche, and D. K. Panda, Enabling Performance Efficient Runtime Support for hybrid MPI+UPC++ Programming Models, IEEE International Conference on High Performance Computing and Communications (HPCC 2016)
Application-Level Performance with Graph500 and Sort

Graph500 Execution Time
J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC’13), June 2013
[Chart: Graph500 execution time (s) at 4K, 8K, and 16K processes, comparing MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM); 7.6X and 13X improvements]
• Performance of the hybrid (MPI+OpenSHMEM) Graph500 design
• 8,192 processes:
  – 2.4X improvement over MPI-CSR
  – 7.6X improvement over MPI-Simple
• 16,384 processes:
  – 1.5X improvement over MPI-CSR
  – 13X improvement over MPI-Simple
Sort Execution Time
[Chart: Sort execution time (seconds) for input data / process counts of 500 GB-512, 1 TB-1K, 2 TB-2K, and 4 TB-4K, comparing MPI and Hybrid; 51% improvement]
• Performance of the hybrid (MPI+OpenSHMEM) Sort application
• 4,096 processes, 4 TB input size:
  – MPI: 2408 sec (0.16 TB/min)
  – Hybrid: 1172 sec (0.36 TB/min)
  – 51% improvement over the MPI design
J. Jose, S. Potluri, H. Subramoni, X. Lu, K. Hamidouche, K. Schulz, H. Sundar and D. Panda Designing Scalable Out-of-core Sorting with Hybrid MPI+PGAS Programming Models, PGAS’14
Accelerating MaTEx k-NN with Hybrid MPI and OpenSHMEM
[Charts: k-NN execution time for the full KDD workload (2.5 GB) on 512 cores (9.0% reduction) and the truncated KDD workload (30 MB) on 256 cores (27.6% reduction)]
• Benchmark: KDD Cup 2010 (8,407,752 records, 2 classes, k=5)
• For the truncated KDD workload on 256 cores, execution time is reduced by 27.6%
• For the full KDD workload on 512 cores, execution time is reduced by 9.0%
J. Lin, K. Hamidouche, J. Zhang, X. Lu, A. Vishnu, D. Panda. Accelerating k-NN Algorithm with Hybrid MPI and OpenSHMEM, OpenSHMEM 2015
• MaTEx: MPI-based machine learning algorithm library
• k-NN: a popular supervised algorithm for classification
• Hybrid designs:
  – Overlapped data flow; one-sided data transfer; circular-buffer structure
Performance of PGAS Models on KNL using MVAPICH2-X
[Charts: Intra-node PUT and GET latency (us) vs. message size (1 B - 1 MB) on KNL, comparing shmem_put/shmem_get, upc_putmem/upc_getmem, and upcxx_async_put/upcxx_async_get]
• Intra-node performance of one-sided Put/Get operations of PGAS libraries/languages using MVAPICH2-X communication conduit
• Near-native communication performance is observed on KNL
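For reference, the shmem_put curve above measures a blocking one-sided put into a symmetric buffer on a peer PE, along the lines of this minimal sketch (assumes shmem_init has already been called; the 4 KB size and neighbor choice are illustrative):

/* One-sided OpenSHMEM put into a neighbor's symmetric buffer. */
#include <shmem.h>

void put_to_neighbor(void)
{
    const size_t n = 4096;
    char *sym = (char *) shmem_malloc(n);       /* symmetric destination buffer */
    char src[4096] = {0};
    int peer = (shmem_my_pe() + 1) % shmem_n_pes();

    shmem_putmem(sym, src, n, peer);            /* one-sided put, no receive posted */
    shmem_quiet();                              /* wait for remote completion */
    shmem_barrier_all();
    shmem_free(sym);
}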
Optimized OpenSHMEM with AVX and MCDRAM: Application Kernels Evaluation
• On heat-diffusion-based kernels, AVX-512 vectorization showed better performance
• MCDRAM showed significant benefits on the Heat-Image kernel for all process counts; combined with AVX-512 vectorization, it showed up to 4X improved performance
[Charts: Heat-Image kernel and Heat-2D (Jacobi) kernel execution time (s) at 16-128 processes, comparing KNL (Default), KNL (AVX-512), KNL (AVX-512+MCDRAM), and Broadwell]
Outline
• Scalability for million to billion processors
• Hybrid MPI+PGAS models for irregular applications
• Heterogeneous computing with accelerators
• HPC and Cloud
GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers
• High performance and high productivity

At Sender:    MPI_Send(s_devbuf, size, …);
At Receiver:  MPI_Recv(r_devbuf, size, …);
(device-buffer handling happens inside MVAPICH2)
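A slightly fuller, hedged sketch of the same idea: with a CUDA-aware MPI such as MVAPICH2-GDR, device pointers from cudaMalloc can be passed directly to MPI calls, and the library stages or pipelines the GPU transfer internally. The peer rank, message size, and send/receive ordering below are illustrative.

/* Exchange between two ranks using GPU device buffers directly in MPI calls. */
#include <mpi.h>
#include <cuda_runtime.h>

void exchange_device_buffers(int rank, int peer, size_t nbytes)
{
    void *s_devbuf, *r_devbuf;
    cudaMalloc(&s_devbuf, nbytes);              /* device memory, no host staging */
    cudaMalloc(&r_devbuf, nbytes);

    if (rank < peer) {
        MPI_Send(s_devbuf, (int) nbytes, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
        MPI_Recv(r_devbuf, (int) nbytes, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(r_devbuf, (int) nbytes, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(s_devbuf, (int) nbytes, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
    }

    cudaFree(s_devbuf);
    cudaFree(r_devbuf);
}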
CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3 Releases
• Support for MPI communication from NVIDIA GPU device memory
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU)
• High-performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host, and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
• Unified memory
Optimized MVAPICH2-GDR Design
[Charts: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth vs. message size, comparing MV2 (NO-GDR) and MV2-GDR 2.3a; up to 11X lower latency (down to 1.88 us) and roughly 9x-10x higher bandwidth/bi-bandwidth]
Platform: MVAPICH2-GDR 2.3a, Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPUDirect RDMA
Application-Level Evaluation (HOOMD-blue)
• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HOOMD-blue version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
[Charts: Average time steps per second (TPS) vs. number of processes (4-32) for 64K and 256K particles, comparing MV2 and MV2+GDR; 2X improvement in both cases]
Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland
[Charts: Normalized execution time vs. number of GPUs on the CSCS GPU cluster (16-96 GPUs) and the Wilkes GPU cluster (4-32 GPUs), comparing Default, Callback-based, and Event-based designs]
• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)
C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee , H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS’16
On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and Cosmo Application
Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
Application Evaluation: Deep Learning Frameworks
• On the RI2 cluster, 16 GPUs, 1 GPU/node
  – Microsoft Cognitive Toolkit (CNTK) [https://github.com/Microsoft/CNTK]
[Charts: CNTK training time (s) on 8 and 16 GPU nodes for the AlexNet and VGG models, comparing MV2-GDR-Knomial, MV2-GDR-Ring, and MCAST-GDR-Opt; lower is better, with 6%-24% reductions]
C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning, ICPP’17.
• Reduces latency by up to 24% and 15% for the AlexNet and VGG models, respectively
• Higher improvement can be observed for larger system sizes
MVAPICH2-GDS: Preliminary Results
• Latency oriented (Kernel+Send and Recv+Kernel): able to hide the kernel launch overhead
  – 8-15% improvement compared to the default behavior
• Overlap with host computation/communication: communication and computation tasks are asynchronously offloaded to a queue
  – 89% overlap with host computation at 128-byte message size
Platform: Intel Sandy Bridge, NVIDIA Tesla K40c, and Mellanox FDR HCA; CUDA 8.0, OFED 3.4; each kernel is ~50 us
Will be available in a public release soon
[Charts: Latency (us) and overlap (%) vs. message size (1 B - 4 KB), comparing Default MPI and Enhanced MPI+GDS]
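For context, the "Default MPI" baseline above follows the pattern sketched here: the host launches the compute kernel, blocks until it finishes, and only then sends, so the kernel launch and synchronization overheads sit on the critical path that GPUDirect Async-style offloading is designed to hide. The kernel wrapper and sizes are placeholders, not the MVAPICH2-GDS API.

/* Default Kernel+Send pattern: host waits for the GPU before sending. */
#include <mpi.h>
#include <cuda_runtime.h>

/* Placeholder for launching a ~50 us compute kernel on `stream`. */
void launch_my_kernel(char *d_buf, int n, cudaStream_t stream);

void kernel_then_send(char *d_buf, int n, int peer, cudaStream_t stream)
{
    launch_my_kernel(d_buf, n, stream);
    cudaStreamSynchronize(stream);   /* launch + sync overhead on the critical path */
    MPI_Send(d_buf, n, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
}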
Outline
• Scalability for million to billion processors
• Hybrid MPI+PGAS models for irregular applications
• Heterogeneous computing with accelerators
• HPC and Cloud
Can HPC and Virtualization be Combined?
• Virtualization has many benefits
  – Fault-tolerance
  – Job migration
  – Compaction
• It has not been very popular in HPC due to the overhead associated with virtualization
• New SR-IOV (Single Root I/O Virtualization) support available with Mellanox InfiniBand adapters changes the field
• Enhanced MVAPICH2 support for SR-IOV
• MVAPICH2-Virt 2.2 supports:
  – OpenStack, Docker, and Singularity
J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters?, EuroPar'14
J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC'14
J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 Over OpenStack with SR-IOV: an Efficient Approach to build HPC Clouds, CCGrid'15
Application-Level Performance on Docker with MVAPICH2
[Charts: (NAS) Execution time (s) of NAS Class D benchmarks (MG, FT, EP, LU, CG) and (Graph 500) BFS execution time (ms) at Scale 20, Edgefactor 16 for 1 container x 16 processes, 2 containers x 8 processes, and 4 containers x 4 processes, comparing Container-Def, Container-Opt, and Native; up to 11% and 73% reductions]
• 64 containers across 16 nodes, pinning 4 cores per container
• Compared to Container-Def, up to 11% and 73% reduction in execution time for NAS and Graph 500
• Compared to Native, less than 9% and 5% overhead for NAS and Graph 500
Application-Level Performance on Singularity with MVAPICH2
[Charts: Graph500 BFS execution time (ms) for problem sizes (Scale, Edgefactor) from (22,16) to (26,20), and NPB Class D execution time (s) for CG, EP, FT, IS, LU, and MG, comparing Singularity and Native]
• 512 processes across 32 nodes
• Less than 7% and 6% overhead for NPB and Graph500, respectively
J. Zhang, X. Lu and D. K. Panda, Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds?, UCC ’17
Looking into the Future ….
• Architectures for exascale systems are evolving
• Exascale systems will be constrained by
  – Power
  – Memory per core
  – Data movement cost
  – Faults
• Programming models, runtimes, and middleware need to be designed for
  – Scalability
  – Performance
  – Fault-resilience
  – Energy-awareness
  – Programmability
  – Productivity
• High-performance and scalable MPI+X libraries are needed
• Highlighted some of the approaches taken by the MVAPICH2 project
• Continuous innovation is needed to have the right MPI+X libraries for exascale systems
Funding Acknowledgments
Funding Support by
Equipment Support by
Personnel Acknowledgments
Current Students
– A. Awan (Ph.D.)
– M. Bayatpour (Ph.D.)
– S. Chakraborthy (Ph.D.)
– C.-H. Chu (Ph.D.)
Past Students
– A. Augustine (M.S.)
– P. Balaji (Ph.D.)
– S. Bhagvat (M.S.)
– A. Bhat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– R. Rajachandrasekar (Ph.D.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
Past Research Scientist
– K. Hamidouche
– S. Sur
Past Post-Docs
– D. Banerjee
– X. Besseron
– H.-W. Jin
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– K. Kulkarni (M.S.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– S. Guganani (Ph.D.)
– J. Hashmi (Ph.D.)
– N. Islam (Ph.D.)
– M. Li (Ph.D.)
– J. Lin
– M. Luo
– E. Mancini
Current Research Scientists
– X. Lu
– H. Subramoni
Past Programmers
– D. Bureddy
– J. Perkins
Current Research Specialist
– J. Smith
– M. Arnold
– M. Rahman (Ph.D.)
– D. Shankar (Ph.D.)
– A. Venkatesh (Ph.D.)
– J. Zhang (Ph.D.)
– S. Marcarelli
– J. Vienne
– H. Wang
Current Post-doc
– A. Ruhela
Thank You!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
Please join us for other events at SC’17
• Workshops
– ESPM2 2017: Third International Workshop on Extreme Scale Programming Models and Middleware
• Tutorials
– InfiniBand, Omni-Path, and High-Speed Ethernet for Dummies
– InfiniBand, Omni-Path, and High-Speed Ethernet: Advanced Features, Challenges in Designing, HEC Systems and Usage
• BoFs
– MPICH BoF: MVAPICH2 Project: Latest Status and Future Plans
• Technical Talks
– EReinit: Scalable and Efficient Fault-Tolerance for Bulk-Synchronous MPI Applications
– An In-Depth Performance Characterization of CPU- and GPU-Based DNN Training on Modern Architectures
– Scalable Reduction Collectives with Data Partitioning-Based Multi-Leader Design
Please refer to http://mvapich.cse.ohio-state.edu/talks/ for more details
• Technical Talks
– Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters
– Co-designing MPI Runtimes and Deep Learning Frameworks for Scalable Distributed Training on GPU Clusters
– High-Performance and Scalable Broadcast Schemes for Deep Learning on GPU Clusters
• Booth Talks
  – Scalability and Performance of MVAPICH2 on OakForest-PACS
– The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing
– Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X
– Exploiting Latest Networking and Accelerator Technologies for MPI, Streaming, and Deep Learning: An MVAPICH2-Based Approach
– MVAPICH2-GDR Library: Pushing the Frontier of HPC and Deep Learning
– MVAPICH2-GDR for HPC and Deep Learning