
Page 1

Experiences with HPC and Big Data Applications on the SDSC Comet Cluster: Using Virtualization, Singularity containers, and RDMA-enabled Data Analytics tools

Mahidhar Tatineni, Director of User Services, SDSC

HPC Advisory Council China Conference, October 18, 2017, Hefei, China

Acknowledgements: Trevor Cooper, Dmitry Mishin, Christopher Irving, Gregor von Laszewski (IU), Fugang Wang (IU), Rick Wagner (Globus group, U. Chicago), Phil Papadopoulos

Page 2

This work supported by the National Science Foundation, award ACI-1341698.

Page 3

Overview
• Comet Hardware – compute and GPU nodes, network, filesystems
• MPI implementations, including MVAPICH2-GDR results
• Data analytics frameworks and tools on Comet – RDMA-Hadoop, RDMA-Spark, OSU-Caffe
• Virtual Cluster (VC) Design – layout, software
• VC Benchmarks: MPI, Applications

NSF Award #1341698, Gateways to Discovery: Cyberinfrastructure for the Long Tail of Science
PI: Michael Norman
Co-PIs: Shawn Strande, Philip Papadopoulos, Robert Sinkovits, Nancy Wilkins-Diehr
SDSC project in collaboration with Indiana University (led by Geoffrey Fox)

Page 4

Comet: System Characteristics
• Total peak flops ~2.1 PF
• Dell primary integrator
• Intel Haswell processors w/ AVX2
• Mellanox FDR InfiniBand
• 1,944 standard compute nodes (46,656 cores)
  • Dual CPUs, each 12-core, 2.5 GHz
  • 128 GB DDR4 2133 MHz DRAM
  • 2 × 160 GB SSDs (local disk)

• 72 GPU nodes
  • 36 nodes with two NVIDIA K80 cards, each with dual Kepler GPUs; same CPUs as the main partition.
  • 36 nodes with 4 P100 GPUs each and 2 Intel Broadwell processors (14 cores each)
• 4 large-memory nodes
  • 1.5 TB DDR4 1866 MHz DRAM
  • Four Haswell processors/node
  • 64 cores/node

• Hybrid fat-tree topology
  ● FDR (56 Gbps) InfiniBand
  ● Rack-level (72 nodes, 1,728 cores) full bisection bandwidth
  ● 4:1 oversubscription cross-rack
• Performance Storage (Aeon)
  ● 7.6 PB, 200 GB/s; Lustre
  ● Scratch & Persistent Storage segments
• Durable Storage (Aeon)
  ● 6 PB, 100 GB/s; Lustre
  ● Automatic backups of critical data
• Home directory storage
• Gateway hosting nodes
• Virtual image repository
• 100 Gbps external connectivity to Internet2 & ESNet
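As a usage note, a job targeting one of the GPU nodes above would be requested through the batch scheduler. The sketch below is a minimal illustration only; the partition name, GRES syntax, and walltime are assumptions, not Comet-specific documentation.

    #!/bin/bash
    # Hedged sketch of a Slurm batch script requesting one 4-GPU node.
    #SBATCH --job-name=gpu-test
    #SBATCH --partition=gpu          # assumed GPU partition name
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=4      # one MPI rank per GPU
    #SBATCH --gres=gpu:4             # request all four GPUs on the node
    #SBATCH --time=01:00:00

    nvidia-smi                       # list the GPUs visible to the job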

Page 5

Comet Network Architecture

Page 6

Comet Lustre Filesystems
• Comet features two Lustre filesystems: scratch and projects storage.
• Projects storage is mounted on multiple systems, including non-IB-connected clusters.
• Backend storage servers are connected via a 40 Gbit Ethernet fabric.
• The Comet network design handles this by using Mellanox bridge switches.
• The design provides the flexibility to mount the filesystems on multiple machines while keeping aggregate performance high.
• Each filesystem (scratch and projects) achieved 100 GB/s in an IOR bandwidth test at scale.
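Such numbers are typically collected with the IOR benchmark driven over MPI. The sketch below is illustrative only; the process count, block/transfer sizes, and target path are assumptions rather than the configuration actually used for the 100 GB/s result.

    # Hedged IOR sketch: file-per-process write then read on the Lustre scratch
    # filesystem; sizes and the path below /oasis/scratch/comet are placeholders.
    mpirun -np 256 ior -a POSIX -F -w -r \
        -b 4g -t 1m \
        -o /oasis/scratch/comet/$USER/ior_testfile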

Page 7

Comet: MPI options, RDMA-enabled software
• MVAPICH2 (v2.1) is the default MPI on Comet. Intel MPI and OpenMPI are also available.
• MVAPICH2-X v2.2a provides a unified high-performance runtime supporting both MPI and PGAS programming models.
• MVAPICH2-GDR (v2.2) is installed on the GPU nodes featuring NVIDIA K80s and P100s; benchmark and application performance results appear in this talk (a launch sketch follows below).
• RDMA-Hadoop (2x-1.1.0) and RDMA-Spark (0.9.4) (from Dr. Panda's HiBD lab) are also available.
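The GPU latency results on the following slides come from the OSU micro-benchmarks run under MVAPICH2-GDR. As a hedged sketch (module name, hostnames, and benchmark path are illustrative assumptions), a device-to-device latency measurement looks roughly like this:

    # Two ranks with CUDA buffers on both sides ("D D"); MV2_USE_CUDA enables
    # CUDA-aware communication in MVAPICH2-GDR. Hostnames are placeholders.
    module load mvapich2-gdr          # assumed module name
    mpirun_rsh -np 2 gpu-node-1 gpu-node-2 \
        MV2_USE_CUDA=1 ./osu_latency D D

For intra-node runs both ranks are placed on the same host, and CUDA_VISIBLE_DEVICES (or process binding) selects which GPU pair is measured.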

Page 8

Comet K80 node architecture

• 4 GPUs per node
• GPUs (0,1) and (2,3) can do P2P communication
• Mellanox InfiniBand adapter associated with the second socket (GPUs 2, 3)

Page 9

OSU Latency (osu_latency) Benchmark Intra-node, K80 nodes

• Latency between GPU 2 and GPU 3: 2.82 µs
• Latency between GPU 1 and GPU 2: 3.18 µs

Page 10

OSU Latency (osu_latency) Benchmark Inter-node, K80 nodes

• Latency between GPU 2, process bound to CPU 1 on both nodes: 2.27 µs
• Latency between GPU 2, process bound to CPU 0 on both nodes: 2.47 µs
• Latency between GPU 0, process bound to CPU 0 on both nodes: 2.43 µs

Page 11

Comet P100 node architecture

• 4 GPUs per node
• GPUs (0,1) and (2,3) can do P2P communication
• Mellanox InfiniBand adapter associated with the first socket (GPUs 0, 1)

Page 12

OSU Latency (osu_latency) Benchmark Intra-node, P100 nodes

• Latency between GPU 0 and GPU 1: 2.73 µs
• Latency between GPU 2 and GPU 3: 2.95 µs
• Latency between GPU 1 and GPU 2: 3.13 µs

Page 13

OSU Latency (osu_latency) Benchmark Inter-node, P100 nodes

• Latency between GPU 0, process bound to CPU 0 on both nodes: 2.17 µs
• Latency between GPU 2, process bound to CPU 1 on both nodes: 2.35 µs

Page 14

MVAPICH2-GDR Application Example: HOOMD-blue
• HOOMD-blue is a general-purpose particle simulation toolkit.
• Results for the Hexagon benchmark are presented.

References:
• HOOMD-blue web page: http://glotzerlab.engin.umich.edu/hoomd-blue/
• HOOMD-blue benchmarks page: http://glotzerlab.engin.umich.edu/hoomd-blue/benchmarks.html
• J. A. Anderson, C. D. Lorenz, and A. Travesset. General purpose molecular dynamics simulations fully implemented on graphics processing units. Journal of Computational Physics 227(10): 5342-5359, May 2008. doi:10.1016/j.jcp.2008.01.047
• J. Glaser, T. D. Nguyen, J. A. Anderson, P. Liu, F. Spiga, J. A. Millan, D. C. Morse, S. C. Glotzer. Strong scaling of general-purpose molecular dynamics simulations on GPUs. Computer Physics Communications 192: 97-107, July 2015. doi:10.1016/j.cpc.2015.02.028

Page 15

HOOMD-Blue: Hexagon Benchmark
• N = 1,048,576
• Hard particle Monte Carlo
  ● Vertices: [[0.5, 0], [0.25, 0.433012701892219], [-0.25, 0.433012701892219], [-0.5, 0], [-0.25, -0.433012701892219], [0.25, -0.433012701892219]]
  ● d = 0.17010672166874857
  ● a = 1.0471975511965976
  ● nselect = 4
• Log file period: 10,000 time steps
• SDF analysis
  ● xmax = 0.02
  ● δx = 10⁻⁴
  ● period: 50 time steps
  ● navg = 2000
• DCD dump period: 100,000
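To connect these parameters to the scaling results that follow: a multi-GPU HOOMD-blue benchmark run under MVAPICH2-GDR is launched roughly as sketched below. The script name, host file, and rank count are assumptions based on the public HOOMD-blue benchmarks repository, not the exact commands used here.

    # Hedged sketch: 8 MPI ranks across K80 GPUs with CUDA-aware MPI enabled.
    mpirun_rsh -np 8 -hostfile hosts.txt \
        MV2_USE_CUDA=1 \
        python bmark.py --mode=gpu    # --mode=gpu selects GPU execution in HOOMD-blue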

Page 16

HOOMD-Blue: Hexagon Benchmark
Strong scaling on K80 nodes

Page 17

RDMA-Hadoop and RDMA-Spark
Network-Based Computing Lab, Ohio State University
NSF-funded project in collaboration with Dr. DK Panda*

• HDFS, MapReduce, and RPC over native InfiniBand and RDMA over Converged Ethernet (RoCE).
• Based on Apache distributions of Hadoop and Spark.
• Version RDMA-Apache-Hadoop-2.x 1.1.0 (based on Apache Hadoop 2.6.0) is available on Comet.
• Version RDMA-Spark 0.9.4 (based on Apache Spark 2.1.0) is available on Comet.
• More details on the RDMA-Hadoop and RDMA-Spark projects at:
  – http://hibd.cse.ohio-state.edu/

*NSF BIGDATA F: DKM: Collaborative Research: Scalable Middleware for Managing and Processing Big Data on Next Generation HPC Systems, Award #s 1447804 (Ohio State) and 1447861 (SDSC).

Page 18

RDMA-Hadoop, Spark
● Exploit performance on modern clusters with RDMA-enabled interconnects for Big Data applications.
● Hybrid design with in-memory and heterogeneous storage (HDD, SSDs, Lustre).
● Keep compliance with standard distributions from Apache.
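Because the RDMA packages stay compliant with the Apache distributions, a standard spark-submit works unchanged on top of the RDMA-Spark install. The sketch below is illustrative only; the master URL and example jar path are assumptions, and RDMA-specific tuning parameters are left to the HiBD user guide.

    # Hedged sketch: standard Spark job submission against an RDMA-Spark 0.9.4
    # (Apache Spark 2.1.0 based) installation; URL and paths are placeholders.
    $SPARK_HOME/bin/spark-submit \
        --master spark://head-node:7077 \
        --class org.apache.spark.examples.SparkPi \
        $SPARK_HOME/examples/jars/spark-examples_2.11-2.1.0.jar 1000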

Page 19

Page 20

OSU-Caffe, CIFAR10 Quick on K80 nodes

• Results with K80 nodes.
• Current runs use data in the Lustre filesystem (/oasis/scratch/comet); a launch sketch follows below.
• All Comet GPU nodes have 280 GB of SSD-based local scratch space. Future tests with larger test cases are planned to evaluate the performance advantages of using the SSDs.
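A run of this kind is launched through MPI with the standard Caffe solver definition for CIFAR-10 Quick. The sketch below is a rough illustration; the rank count and host file are assumptions, and any OSU-Caffe-specific scaling options should be taken from its user guide.

    # Hedged sketch: 8 GPU ranks training CIFAR-10 Quick with OSU-Caffe;
    # the dataset resides on the Lustre scratch filesystem (/oasis/scratch/comet).
    mpirun_rsh -np 8 -hostfile hosts.txt MV2_USE_CUDA=1 \
        ./build/tools/caffe train \
        --solver=examples/cifar10/cifar10_quick_solver.prototxt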

Page 21

OSU-Caffe, CIFAR10 Quick on K80 nodes

Page 22

Virtualization on Comet

• Comet Virtual Clusters – KVM-based, SR-IOV-enabled full virtualization.
• Singularity-based containerization – user space only, with namespaces and minimal SetUID.

Page 23

Comet VC Use Cases
• Root access to nodes for a custom OS and software stack.
  – Example: The CAIDA group used it for a workshop, allowing attendees to modify the network stack for research. Allows for isolation of tests.
• Simplified install for groups with existing management infrastructure.
  – Example: The Open Science Grid (OSG) used their existing installation procedures to enable multiple research groups to run on Comet (including LIGO).

Page 24

Singularity Use Cases

• Applications with newer library/OS requirements than available on Comet – e.g., TensorFlow, Torch, Caffe.
• Commercial application binaries with specific OS requirements.
• Importing Docker images to enable their use in a shared HPC environment.
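The Docker-import case typically looks like the sketch below; the image name and the generated file name are illustrative, and the flags assume a Singularity 2.x installation of the kind deployed in 2017.

    # Hedged sketch: pull a TensorFlow image from Docker Hub, then run it with
    # GPU support (--nv binds the host NVIDIA driver into the container).
    singularity pull docker://tensorflow/tensorflow:latest-gpu
    singularity exec --nv tensorflow-latest-gpu.simg \
        python -c "import tensorflow as tf; print(tf.__version__)"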

Page 25

Overview of Virtual Clusters on Comet

• Projects have a persistent VM for cluster management
  ● Modest: single core, 1-2 GB of RAM
• Standard compute nodes are scheduled as containers via the batch system
  ● One virtual compute node per container
• Virtual disk images stored as ZFS datasets
  ● Migrated to and from containers at job start and end (sketched below)
• VM use is allocated and tracked like regular computing
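As an illustration of the ZFS-backed image flow noted in the list above, the commands below show how a disk image dataset might be snapshotted and streamed to the node hosting the container at job start; the pool and dataset names are invented for the example and do not reflect Comet's actual layout.

    # Hedged sketch: snapshot an idle disk image and send it to the container host.
    zfs snapshot images/vc4-compute-0@job-start
    zfs send images/vc4-compute-0@job-start | \
        ssh container-host zfs receive -F tank/vc4-compute-0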

Page 26

Nucleus: persistent virtual front end with an API for:
• Requesting nodes
• Console & power
• Scheduling
• Storage management
• Coordinating network changes
• VM launch & shutdown

User perspective: idle disk images and active virtual compute nodes, with disk images attached and synchronized.

Page 27

Enabling Technologies

• KVM—Lets us run virtual machines (all processor features)

• SR-IOV—Makes MPI go fast on VMs

• Rocks—Systems management

• ZFS—Disk image management

• VLANs—Isolate virtual cluster management network

• pkeys—Isolate virtual cluster IB network

• Nucleus—Coordination engine (scheduling, provisioning, status, etc.)

• Client – Cloudmesh

Page 28

User-Customized HPC

(Diagram: the physical cluster, with its frontend, virtual frontend hosting nodes, disk image vault, and compute nodes on the public and private networks, shown alongside virtual clusters, each with its own virtual frontend and virtual compute nodes on private networks.)

Page 29

High Performance Virtual Cluster Characteristics

(Diagram: a virtual frontend and virtual compute nodes connected by a private Ethernet network and InfiniBand.)

All nodes have:
• Private Ethernet
• InfiniBand
• Local disk storage

Virtual compute nodes can network boot (PXE) from their virtual frontend.

All disks retain state:
• User configuration is kept between boots

InfiniBand virtualization:
• ~8% latency overhead
• Nominal bandwidth overhead

Comet: Providing Virtualized HPC for XSEDE

Page 30

Data Storage/Filesystems

• Local SSD storage on each compute node

• Limited number of large-SSD nodes (1.4 TB) for large VM images

• Local (SDSC) network access same as compute nodes

• Modest (TB) storage available via NFS now

• Future: Secure access to Lustre?

Page 31

Cloudmesh (developed by IU collaborators)
• The Cloudmesh client enables access to multiple cloud environments from a command shell and command line.
• We leverage this easy-to-use CLI for virtual cluster management, allowing the use of Comet as infrastructure for virtual clusters.
• Cloudmesh has additional functionality, with the ability to access hybrid clouds (OpenStack, EC2, AWS, Azure); it can be extended to other systems like Jetstream, Bridges, etc.
• Plans for customizable launchers available through the command line or a browser – these can target specific application user communities.

Reference: https://github.com/cloudmesh/client

Page 32

Comet Cloudmesh Client (selected commands)

• cm comet cluster ID
  • Show the cluster details
• cm comet power on ID vm-ID-[0-3] --walltime=6h
  • Power nodes 0-3 on for 6 hours
• cm comet image attach image.iso ID vm-ID-0
  • Attach an image
• cm comet boot ID vm-ID-0
  • Boot node 0
• cm comet console vc4
  • Console access
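Putting the commands above together, a session for a cluster named vc4 might look like the following; the cluster name and walltime are illustrative.

    cm comet cluster vc4                               # show cluster details
    cm comet power on vc4 vm-vc4-[0-3] --walltime=6h   # power on nodes 0-3 for 6 hours
    cm comet image attach image.iso vc4 vm-vc4-0       # attach an image to node 0
    cm comet boot vc4 vm-vc4-0                         # boot node 0
    cm comet console vc4 vm-vc4-0                      # open its console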

Page 33

Comet Cloudmesh Usage Examples

Page 34

Comet Cloudmesh Client : Console access

(COMET)host:client$ cm comet console vc4 vm-vc4-0

Page 35

MPI bandwidth slowdown from SR-IOV is at most 1.21 for medium-sized messages & negligible for small & large ones

Page 36

MPI latency slowdown from SR-IOV is at most 1.32 for small messages & negligible for large ones

Page 37

WRF Weather Modeling

• 96-core (4-node) calculation
• Nearest-neighbor communication
• Test Case: 3-hr forecast, 2.5 km resolution of the Continental US (CONUS)
• Scalable algorithms
• 2% slower w/ SR-IOV vs native IB

Page 38

PSDNS, 1024×1024×1024: Strong scaling case
• 32-core (2-node), 64-core (4-node), and 128-core (8-node) tests
• Computational core based on FFTs
• Communication intensive, mainly alltoallv – bisection bandwidth limited

Cores (Nodes)   Time/Step
32 (2)          101.51
64 (4)           67.03
128 (8)          33.99

Page 39

Quantum ESPRESSO

• 48-core (3-node) calculation
• CG matrix inversion – irregular communication
• 3D FFT matrix transposes (all-to-all communication)
• Test Case: DEISA AUSURF 112 benchmark
• 8% slower w/ SR-IOV vs native IB

Page 40

RAxML: Code for Maximum Likelihood-based inference of large phylogenetic trees.
• Widely used, including by the CIPRES gateway.
• 48-core (2-node) calculation
• Hybrid MPI/Pthreads code
• 12 MPI tasks, 4 threads per task (launch sketch below)
• Compilers: gcc + mvapich2 v2.2, AVX options
• Test Case: comprehensive analysis, 218 taxa, 2,294 characters, 1,846 patterns, 100 bootstraps specified
• 19% slower w/ SR-IOV vs native IB
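For reference, the hybrid configuration above (12 MPI tasks x 4 Pthreads, 100 bootstraps) corresponds to a launch roughly like the sketch below; the substitution model, random seeds, and input/run names are illustrative assumptions, not the actual CIPRES settings.

    # Hedged sketch: RAxML hybrid MPI/Pthreads run, 12 ranks x 4 threads each,
    # rapid bootstrapping (-f a) with 100 replicates.
    mpirun -np 12 raxmlHPC-HYBRID-AVX -T 4 \
        -f a -N 100 -x 12345 -p 12345 \
        -m GTRGAMMA -s alignment.phy -n comet_sriov_test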

Page 41

MrBayes: Software for Bayesian inference of phylogeny.
• Widely used, including by the CIPRES gateway.
• 32-core (2-node) calculation
• Hybrid MPI/OpenMP code
• 8 MPI tasks, 4 OpenMP threads per task
• Compilers: gcc + mvapich2 v2.2, AVX options
• Test Case: 218 taxa, 10,000 generations
• 3% slower with SR-IOV vs native IB

Page 42

Summary
• Comet uses the flexibility and performance of the IB network + bridging to:
  – provide multi-machine mounted parallel filesystems
  – enhance GPU applications using GPU Direct RDMA
  – enhance performance of data analytics tools with RDMA-enabled frameworks
  – enable virtualized HPC clusters at scale
• OSU benchmarks and the HOOMD-Blue application show good performance using MVAPICH2-GDR.
• Results for OSU-Caffe with the CIFAR10 benchmark show good scaling. Future tests with larger test cases are planned to evaluate the performance advantages of using the SSDs.
• Application benchmarks show good performance on the virtualized cluster. The PSDNS example, which is communication intensive, shows good strong scaling for a test case.