An Agile Approach to Building a GPU-enabled and Performance-portable Global Cloud-resolving Atmospheric Model. Dr. Richard Loft*, Director, Technology Development, CISL/NCAR (*National Center for Atmospheric Research). GTC, San Jose, CA, March 26, 2018

Outline

• Origins Backstory

• The MPAS Model

• Team

• Tools and Design

• Status

Project began with research based on student projects

• For two years, the Summer Internships in Parallel Computational Science (SIParCS) program at NCAR funded student projects on architectural inter-comparison.

• Projects focused on optimizing atmospheric numerical PDE solvers for both CPUs and GPUs with performance portability in mind.

• Architectures compared:

o Xeon Broadwell, Haswell;

o Xeon Phi KNL;

o NVIDIA Tesla P100->V100.

Benchmark Problem

• Shallow Water Equations (SWE)

– A set of non-linear partial differential equations (PDEs)

– Capture features of atmospheric flow around the Earth

• Radial basis function-generated finite difference (RBF-FD) methods

[Figure: RBF-FD solution to the SWE test case “Flow over an isolated mountain” using 655,532 points [1]. Panels: Day 1 and Day 15. Label: cone-shaped mountain.]

[Figure: An example of a 75-point stencil on a sphere [1]. Legend: stencil points, non-stencil points. Annotation: evaluate differential operator D at every point.]
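The figure annotation "evaluate differential operator D at every point" amounts to a weighted sum over each point's stencil neighbors. The sketch below illustrates that evaluation pattern only. The function names are hypothetical, and classical 3-point finite-difference weights on a periodic 1-D grid stand in for true RBF-FD weights, which would normally be obtained by solving a small RBF interpolation system per stencil.

```python
def apply_stencil(u, neighbors, weights):
    """Return D u, where (D u)[i] = sum_k weights[i][k] * u[neighbors[i][k]].

    neighbors[i] lists the indices of the stencil points of node i and
    weights[i] holds the matching differentiation weights.
    """
    return [
        sum(w * u[j] for j, w in zip(nbrs, wts))
        for nbrs, wts in zip(neighbors, weights)
    ]


def periodic_second_derivative(n, h):
    """Build 3-point stencils for d2/dx2 on a periodic 1-D grid.

    Assumption for illustration: classical finite-difference weights
    [1, -2, 1] / h**2 replace the RBF-FD weights used in the talk.
    """
    neighbors = [[(i - 1) % n, i, (i + 1) % n] for i in range(n)]
    weights = [[1.0 / h**2, -2.0 / h**2, 1.0 / h**2] for _ in range(n)]
    return neighbors, weights


# Usage: differentiate a small periodic field.
nbrs, wts = periodic_second_derivative(n=4, h=1.0)
d2u = apply_stencil([0.0, 1.0, 0.0, -1.0], nbrs, wts)
```

Because every node carries its own index list, the same loop works unchanged for unstructured stencils such as the 75-point stencils on the sphere. Only the construction of `neighbors` and `weights` differs.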

Optimizing Stencils for different architectures

Directive-based portability in the RBF-FD shallow water equations (2-D unstructured stencil)

• The computational intensity (CI) roofline model generally predicts performance well, even for more complicated algorithms.

• Xeon performance drops to the DRAM bandwidth limit once the cache size is exceeded, with some state reuse.

• Xeon Phi (KNL), with its HBM memory, is less sensitive to problem size than Xeon and saturates at the CI figure.

• NVIDIA Pascal P100 performance fits the CI model; GPUs require higher levels of parallelism to reach saturation.

[Figure: Performance (GFLOPS), axis from 0 to 350, for Broadwell, KNL, and P100. Annotated regions: insufficient workload parallelism, sufficient workload parallelism.]
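The roofline reasoning behind these observations can be sketched in a few lines: attainable performance is the smaller of the compute peak and the product of computational intensity and memory bandwidth. The function names and all machine numbers below are hypothetical round values chosen for illustration. They are not measurements of Broadwell, KNL, or P100.

```python
def roofline_gflops(ci, peak_gflops, bandwidth_gbs):
    """Attainable GFLOP/s under the roofline model.

    ci            -- computational intensity in FLOPs per byte moved
    peak_gflops   -- compute ceiling of the device
    bandwidth_gbs -- bandwidth of the memory level feeding the kernel
    """
    return min(peak_gflops, ci * bandwidth_gbs)


def ridge_point(peak_gflops, bandwidth_gbs):
    """CI (FLOPs/byte) above which a kernel becomes compute bound."""
    return peak_gflops / bandwidth_gbs


# Hypothetical machine: 1000 GFLOP/s peak, 400 GB/s from cache,
# 100 GB/s from DRAM. A kernel with CI = 0.5 FLOPs/byte is bandwidth
# bound at either level, so falling out of cache cuts its ceiling.
in_cache = roofline_gflops(0.5, 1000.0, 400.0)
from_dram = roofline_gflops(0.5, 1000.0, 100.0)
```

In this toy setting the ceiling falls from 200 to 50 GFLOP/s when the working set no longer fits in cache, which mirrors the qualitative Xeon behavior described above.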

What is MPAS? – The Model for Prediction Across Scales

NCAR’s Global Meteorological/Climate Model; ~100,000 SLOC

[Figure: Simulation of 2012 Tropical Cyclones at 4 km Resolution – Courtesy of Falko Judt, NCAR]

Weather and Climate Alliance (WACA):

• NCAR

• NVIDIA Corporation

• IBM Corporation/The Weather Company

• University of Wyoming, CE&EE Department

• Korea Institute of Science and Technology Information (KISTI)

Initial Divide and Conquer Strategy

[Diagram labels: MPAS Dynamics; MPAS Physics; Problem Reports and Support; Ideas and Results]

Weather and Climate Alliance (WACA): A Collaboration for Earth System Model Acceleration

• NCAR (2+4)

o Dr. Rich Loft, Director, TDD

o Dr. Raghu Raj Kumar, Project Scientist, TDD

o Clint Olson, TDD

o Bill Skamarock, Senior Scientist, MMM

o Michael Duda, Software Engineer, MMM

o Dave Gill, Software Engineer, MMM

• KISTI (2+1)

o Minsu Joh, KISTI Director, Disaster Management Research Center

o Dr. Ji-Sun Kang, Senior Researcher

o Jae-Youp Kim, GRA

• NVIDIA/PGI (1+3)

o Greg Branch, NVIDIA, Sales

o Dr. Carl Ponder, Senior Applications Engineer

o Brent Leback, PGI Compiler Engineering Manager

o Craig Tierney, Solutions Architect

• University of Wyoming (1+5)

o Dr. Suresh Muknahallipatna, Professor E&CE, UW

o Supreeth Suresh, Pranay Reddy, Sumathi Lakshmiranganathan, Cena Miller, Bradley Riotto (GRAs)

~6 PIs + 13 technical staff

Started in September 2016 (18 months)

~9 FTE-years

Problem Reports and Support

Since September: added IBM and The Weather Company

IBM/TWC participants (1+2)

o Jaime Moreno

o Todd Hutchinson

o Constantinos Evangelinos

Tools for Accelerating Code Optimization

• Kernel GENerator (KGEN)

o Extracts kernels from Fortran applications

o Creates:

• Standalone source code

• Input and output state for verification

• Added support for code coverage and representation

o Broad user community

• 8 Domestic institutions

• 5 international institutions

• 1 Company

o Available on GitHub:

https://github.com/NCAR/KGen

KGEN is a useful tool for accelerating code porting and optimization
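The saved input and output state is what makes an extracted kernel useful for porting: after each optimization the kernel is rerun on the saved input and its output is compared with the saved reference. The sketch below shows one plausible form of that comparison. It is not KGen's own API. The function names, the normalized RMS metric, and the default tolerance are assumptions made for illustration.

```python
import math


def normalized_rms_diff(reference, candidate):
    """Root-mean-square difference scaled by the RMS of the reference."""
    if len(reference) != len(candidate):
        raise ValueError("reference and candidate differ in length")
    if not reference:
        return 0.0
    num = sum((c - r) ** 2 for r, c in zip(reference, candidate))
    den = sum(r ** 2 for r in reference)
    if den == 0.0:
        # All-zero reference: fall back to the unscaled RMS difference.
        return math.sqrt(num / len(reference))
    return math.sqrt(num / den)


def verify(reference, candidate, tol=1e-12):
    """Return True if the candidate output matches the saved reference.

    tol is a hypothetical tolerance; a ported kernel may need a looser
    value when the operation order changes between CPU and GPU.
    """
    return normalized_rms_diff(reference, candidate) <= tol


# Usage: compare a ported kernel's output with the saved output state.
saved_output = [1.0, 2.0, 3.0]
ported_output = [1.0, 2.0, 3.0]
passed = verify(saved_output, ported_output)
```

A relative metric is used here so that a single tolerance can serve fields of very different magnitude.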

MPAS Synchronous and Asynchronous Execution

[Diagram labels: Dynamics and Physics; Land Surface; LW and SW Radiation; Asynch I/O; Disk; 𝛥t. The diagram shows alternative placements ("or") of the LW and SW Radiation step.]

Phase 2: pushing on to a full MPAS port

• Status of GPU-based model components

o Ported, optimized, verified

    • Dry dynamical core

    • GPU-direct implementation of MPAS halo exchanges

o Ported, optimized

    • Moist dynamics (tracer transport)

    • Xu-Randall Cloud fraction

o Ported, undergoing optimization

    • WSM6 Microphysics

    • YSU Boundary layer scheme

o Awaiting Porting

    • Scale Insensitive Tiedtke convection scheme

    • Monin-Obukhov surface layer scheme

• CPU-based components

o Overlapping SW and LW RRTMG Radiation (lagged radiation)

o NOAH Land Surface Model (synchronous, remains on CPU)

o SIONlib I/O subsystem
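One way to read "lagged radiation" is that the CPU computes radiation from the state at the start of a step while the GPU advances that step, and the result is applied one step later. The sketch below illustrates that overlap pattern with a worker thread. It is not the MPAS implementation. The two model functions are hypothetical stand-ins with made-up arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor


def radiation(state):
    """Hypothetical stand-in for an expensive CPU radiation call."""
    return 0.1 * state


def dynamics_and_physics(state, heating, dt):
    """Hypothetical stand-in for the GPU-resident model step."""
    return state + dt * (heating - 0.01 * state)


def run_lagged(state, dt, nsteps):
    """Advance nsteps while overlapping radiation with the model step.

    The heating applied in step n was computed from the state at the
    start of step n-1 (a one-step lag); step 0 uses zero heating.
    """
    history = []
    heating = 0.0
    with ThreadPoolExecutor(max_workers=1) as pool:
        for _ in range(nsteps):
            # Start radiation on the current state in the background.
            pending = pool.submit(radiation, state)
            # Advance the model with the previous (lagged) heating.
            state = dynamics_and_physics(state, heating, dt)
            # Collect the radiation result for use in the next step.
            heating = pending.result()
            history.append(state)
    return history


# Usage: three steps from an initial state of 1.0.
states = run_lagged(1.0, 1.0, 3)
```

The lag trades a small inconsistency in the forcing for keeping both processors busy. Whether that trade is acceptable depends on how slowly the radiative tendencies vary relative to the time step.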

IBM/TWC MPAS Objectives

• MPAS grid with local refinement

24-hour global forecasts

• 12 km global grid

• 3 km refinement over selected regions.

• 32.8 M horizontal points

• 56 layers

Forecast requirement

• Complete 20-hour simulation

• …in 45 minutes

• xRe = 26.7 (simulated time divided by wall-clock time: 20 h / 45 min)

• For 𝛥t = 18 sec, the time step budget is 0.674 seconds
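The forecast requirement above reduces to simple arithmetic, sketched here as a check. The function name is hypothetical. The inputs are the 20-hour simulation, 45-minute wall-clock limit, and 18-second time step stated on the slide.

```python
def timestep_budget(sim_hours, wall_minutes, dt_seconds):
    """Return (speedup over real time, number of steps, seconds per step)."""
    sim_seconds = sim_hours * 3600.0
    wall_seconds = wall_minutes * 60.0
    speedup = sim_seconds / wall_seconds   # simulated time / wall time
    nsteps = sim_seconds / dt_seconds      # model steps in the forecast
    budget = wall_seconds / nsteps         # wall seconds allowed per step
    return speedup, nsteps, budget


# Usage with the figures from the slide.
speedup, nsteps, budget = timestep_budget(20.0, 45.0, 18.0)
```

With these inputs the speedup is about 26.67, the forecast takes 4000 steps, and the unrounded budget is 0.675 seconds per step. The slide's 0.674 follows from dividing 18 seconds by the rounded speedup of 26.7.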

Refined grids can be generated anywhere desired.

Dr. Kumar will show next that as few as 800 V100s could achieve this goal…