LLNL-PRES-770721
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

New Corona System & CTS-2 Update
March 2019 LC User Meeting

Matt Leininger, CTS-2 POC

March 28, 2019



Corona is a Follow-on to Catalyst: First AMD GPU Cluster for HPC, ML, and Data Science

[System diagram. Labels: 164 2-Socket 24-Core Compute Nodes + 328 GPUs; Management Node; Login Node; Gateway Nodes GW1-GW4; GbE Management Net; Mellanox HDR High Performance Interconnect; InfiniBand SAN; Lustre Parallel File System.]

[Node diagram. Labels: two AMD Naples sockets (cores 0-23 and 24-47); four PCIe GPGPUs; PCIe NVRAM; PCIe3 x16 and PCIe3 x8 links; InfiniBand HCA to the high-performance network fabric; Remote Server Management (IPMI) over 1 GbE.]

Node
• AMD Naples 24-core 2.0 GHz
• Memory: 256 GB; 5.3 GB/core
• Memory BW: > 300 GB/s DDR
• 1.6 TB NVMe
• Mellanox HDR100
• 4 GPUs per compute node

System Nodes
• 82 CPU-only nodes
• 82 CPU+GPU nodes
• 4 Gateways
• 1 Login
• 1 Management
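The node and system figures above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names are made up, and it assumes the "4 GPUs per compute node" figure applies to the 82 CPU+GPU nodes, which is what reproduces the 328 GPUs shown on the system diagram.

```python
# Figures are taken from the slide; names are hypothetical.
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 24
MEMORY_GB_PER_NODE = 256
CPU_ONLY_NODES = 82
CPU_GPU_NODES = 82
GPUS_PER_GPU_NODE = 4  # assumption: applies only to the CPU+GPU nodes

compute_nodes = CPU_ONLY_NODES + CPU_GPU_NODES
cores_per_node = SOCKETS_PER_NODE * CORES_PER_SOCKET
total_cores = compute_nodes * cores_per_node
total_gpus = CPU_GPU_NODES * GPUS_PER_GPU_NODE
memory_gb_per_core = MEMORY_GB_PER_NODE / cores_per_node

print(f"compute nodes:   {compute_nodes}")
print(f"cores per node:  {cores_per_node}")
print(f"total CPU cores: {total_cores}")
print(f"total GPUs:      {total_gpus}")
print(f"memory per core: {memory_gb_per_core:.1f} GB")
```

Under these assumptions the totals agree with the slide: 164 compute nodes, 328 GPUs, and about 5.3 GB of memory per core.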


Corona Highlights

[Chart: GPU FP Performance. FP64, FP32, and FP16 rates in TF/s (axis 0 to 40 TF/s) for MI-25, MI-60, and V100.]

[Chart: Corona FP Performance. FP64, FP32, and FP16 rates in TF/s (axis up to 20,000 TF/s) for MI-25, MI-60, and Total.]

Considering adding 328 AMD MI-60 GPUs to Corona


Corona NVMe

NVMe: HGST 2N200, 3 DWPD

                        Read                   Write
Sequential @ 128 KiB    3.35 GB/s              2.1 GB/s
Random @ 4 KiB          835K IOPs              200K IOPs
Total                   549 GB/s; 137M IOPs    344 GB/s; 32.8M IOPs
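The "Total" row appears to be the per-drive figures multiplied across the machine. A minimal sketch of that calculation follows, assuming one NVMe drive in each of the 164 compute nodes (the slide does not state the drive count explicitly; the names are illustrative). Note that per-drive rates in GB/s times 164 give aggregate bandwidth in GB/s.

```python
NODES = 164  # assumption: one drive per compute node

# Per-drive figures from the table above.
per_drive = {
    "read": {"seq_gb_s": 3.35, "rand_iops": 835_000},
    "write": {"seq_gb_s": 2.1, "rand_iops": 200_000},
}

# Aggregate bandwidth in GB/s and random I/O in millions of IOPs.
totals = {
    op: {
        "seq_gb_s": NODES * figures["seq_gb_s"],
        "rand_miops": NODES * figures["rand_iops"] / 1e6,
    }
    for op, figures in per_drive.items()
}

for op, agg in totals.items():
    print(f"{op}: {agg['seq_gb_s']:.0f} GB/s, {agg['rand_miops']:.1f}M IOPs")
```

With these inputs the aggregate comes to roughly 549 GB/s and 137M IOPs for reads, and 344 GB/s and 32.8M IOPs for writes.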


Corona Software Environment

• Tri-Lab Operating System Software HPC environment as base foundation
• TOSS 3.x based on RHEL 7.x
• Provides smooth transition for TOSS team and LLNL HPC users
• Includes AMD drivers, compilers, etc.
• Slurm + Flux scheduler and resource manager

• Additional software for Data Science & Machine Learning
• Containers supported
• Working with early users to explore other software

Corona is onsite and undergoing burn-in. Early User access in April.


Commodity Technology Systems

• Status of CTS-2 procurement
• Approximate Timeline
• Potential Architectures


2018-2019 CTS-2 activities leading to RFP and Contract

[Flowchart. Steps:
• Market surveys (LLNL, LANL, SNL)
• Update Tech requirements
• Release DRAFT RFP
• Feedback on DRAFT RFP
• Final RFP
• Vendor Selection
• Tri-lab negotiations
• CTS-2 contract awarded
Dates shown on the chart: Oct. 2018 - March 2019; April 2019; April 2019 - May 2019; August 2019; Sept 2019; Sept-Oct 2019; Jan. 2020.]

CTS-2 and TOSS teams continue to work together during CTS-2 deployment & lifetime support


DRAFT CTS-2 Procurement Timeline

[Timeline chart by month, from Oct. 2018 through 2019 and 2020 to 2021-2023. Milestones:
• Market Survey Begins
• Release DRAFT CTS-2 RFP
• Release Final CTS-2 RFP
• CTS-2 Proposal Review & Vendor Selection
• Contract Negotiations Complete
• CTS-2 contract awarded
• TOSS Early Evaluation System
• Begin software integration with TOSS
• Potential Architecture Decision Point
• CTS-2 SU: Phase 0 Deliveries
• CTS-2 SU: Phase 1 Deliveries
• CTS-2 SU: Phase 2 Deliveries]

Deliveries may start in 2H2020
ASC Deployments may start in 1H2021


Potential CTS-2 Node Design

[Node diagram. Labels: two CPU sockets (CPU 0-47 and CPU 48-95); IPC Link; Remote Server Management; High-Speed Network HCA to the high-performance network fabric.]

• 32-64 GB DDR5 DIMMs
• 32 GB x 8 DIMMs = 256 GB/socket
• > 200 GB/s per socket

CPU Architecture & Software Readiness are key aspects of CTS-2 Selection
• Intel Xeon, AMD Epyc, Marvell ThunderX, IBM Power all viable processors
• Maturity of platform?
• TOSS support
• Maturity of system software and overall software ecosystem?
• Cost/performance of platform?

What about GPU systems and HBM memory?
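The memory figures for the potential node design work out as follows. This is a rough sketch, not a specification: it assumes 48 cores per socket (read off the "CPU 0-47 / CPU 48-95" labels in the diagram), treats the "> 200 GB/s" figure as a lower bound, and uses hypothetical names throughout.

```python
DIMMS_PER_SOCKET = 8
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 48          # assumption: inferred from the CPU 0-47 / 48-95 labels
MIN_BW_GB_S_PER_SOCKET = 200   # slide says "> 200 GB/s", so this is a lower bound


def node_memory(dimm_gb):
    """Return capacity figures for a node built from DIMMs of the given size."""
    per_socket = dimm_gb * DIMMS_PER_SOCKET
    return {
        "gb_per_socket": per_socket,
        "gb_per_node": per_socket * SOCKETS_PER_NODE,
        "gb_per_core": per_socket / CORES_PER_SOCKET,
    }


# The slide lists 32-64 GB DIMMs; evaluate both ends of that range.
configs = {dimm_gb: node_memory(dimm_gb) for dimm_gb in (32, 64)}
min_bw_per_core = MIN_BW_GB_S_PER_SOCKET / CORES_PER_SOCKET

for dimm_gb, cfg in configs.items():
    print(f"{dimm_gb} GB DIMMs: {cfg['gb_per_socket']} GB/socket, "
          f"{cfg['gb_per_node']} GB/node, {cfg['gb_per_core']:.2f} GB/core")
print(f"memory bandwidth: at least {min_bw_per_core:.2f} GB/s per core")
```

The 32 GB case reproduces the 256 GB/socket on the slide; the per-core values depend entirely on the assumed core count.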


Bringing ATS features to CTS-2

• GPUs are becoming more widely adopted
• Past commodity procurements were dominated by CPU-only SUs
• GPU systems will be available under CTS-2
• Programs are responsible for determining the mix of CPU-only + GPU nodes/clusters that best addresses their workloads
• How much GPU memory do you need?
• What is the ratio of CPUs to GPUs?
• Is hardware support for unified memory required?
• Can all codes utilize GPUs?
• Can all workloads utilize GPUs (3D vs. 2D)?


Bringing ATS features to CTS-2

• Give me the fast GPU memory but on CPUs!!!

• Today's GPUs utilize High Bandwidth Memory (HBM v2 or HBM2)

• CPU + HBM may be a nice architecture for CTS

• Time to market is likely 2022+

• High Bandwidth Memory provides

— ~3X more bandwidth per socket

— ~4X less memory capacity per socket

— 1-1.5 GB/core – adapt applications accordingly

• CTS-2 will include options for CPU+HBM if/when available
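To see where the 1-1.5 GB/core figure could come from, the ratios above can be applied to a DDR baseline. The sketch below is a back-of-the-envelope estimate under stated assumptions: the baseline of 256 GB and at least 200 GB/s per socket is borrowed from the potential CTS-2 node design, the 48 cores per socket is inferred from that slide's diagram labels, and the names are hypothetical.

```python
# Assumed DDR baseline per socket (from the potential CTS-2 node design).
DDR_CAPACITY_GB = 256
DDR_MIN_BW_GB_S = 200      # lower bound; the slide says "> 200 GB/s"
CORES_PER_SOCKET = 48      # assumption: inferred from the CPU 0-47 / 48-95 labels

# Approximate ratios quoted on this slide.
HBM_BW_GAIN = 3.0          # ~3X more bandwidth per socket
HBM_CAPACITY_LOSS = 4.0    # ~4X less memory capacity per socket

hbm_capacity_gb = DDR_CAPACITY_GB / HBM_CAPACITY_LOSS
hbm_min_bw_gb_s = DDR_MIN_BW_GB_S * HBM_BW_GAIN
hbm_gb_per_core = hbm_capacity_gb / CORES_PER_SOCKET
ddr_gb_per_core = DDR_CAPACITY_GB / CORES_PER_SOCKET

print(f"DDR: {DDR_CAPACITY_GB} GB/socket, {ddr_gb_per_core:.2f} GB/core")
print(f"HBM: {hbm_capacity_gb:.0f} GB/socket, {hbm_gb_per_core:.2f} GB/core, "
      f"at least {hbm_min_bw_gb_s:.0f} GB/s per socket")
```

Under these assumptions a CPU+HBM socket would hold about 64 GB, or roughly 1.3 GB per core, which falls inside the 1-1.5 GB/core range quoted above.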


Questions?

• Matt Leininger

Disclaimer

This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United States government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.