Towards GPU-accelerated Operational Weather Forecasting


Transcript of Towards GPU-accelerated Operational Weather Forecasting

Page 1: Towards GPU-accelerated Operational Weather Forecasting

Towards GPU-accelerated Operational Weather Forecasting

Oliver Fuhrer (MeteoSwiss), Carlos Osuna (C2SM/ETH), Xavier Lapillonne (C2SM/ETH), Tobias Gysi (Supercomputing Systems AG), Mauro Bianco (CSCS), Thomas Schulthess (CSCS/ETH)

Page 2: Towards GPU-accelerated Operational Weather Forecasting

Overview

•  Weather and climate modeling

•  The COSMO model

•  GPU porting approach

•  Single node performance results

•  Multi-GPU Communication

Page 3: Towards GPU-accelerated Operational Weather Forecasting

Weather & Climate System

[Figure, adapted from C. Schär: spatial scales of the weather and climate system, from snow flakes (~1 µm = 10^-6 m) through clouds and mountains to planetary flow (~1000 km = 1 Mm = 10^6 m)]

Page 4: Towards GPU-accelerated Operational Weather Forecasting

The modeling chain

COSMO-7 with 6.6 km grid: 3 x per day, 72 h forecast

COSMO-2 with 2.2 km grid: 7 x per day, 33 h forecast; 1 x per day, 45 h forecast

ECMWF model with 16 km grid: 2 x per day, 10-day forecast

Page 5: Towards GPU-accelerated Operational Weather Forecasting

What is COSMO?

•  Consortium for Small-Scale Modeling

•  http://www.cosmo-model.org/

•  Limited-area weather and climate model

•  Used by 7 weather services and O(50) universities and research institutes

Page 6: Towards GPU-accelerated Operational Weather Forecasting

COSMO Workflow

[Workflow diagram: Initialization, then a timestep loop (Δt) over Boundary Conditions, Physics, Dynamics, Data assimilation, Diagnostics and Input / Output, then Cleanup]

Properties
•  PDEs
•  Finite differences
•  Structured grid
•  Local operators
•  All steps read / write model state
•  Time splitting
•  Sequential workflow

Page 7: Towards GPU-accelerated Operational Weather Forecasting

Lines vs. Runtime

•  300’000 LOC Fortran 90

[Charts: % Lines of Code vs. % Runtime]

Page 8: Towards GPU-accelerated Operational Weather Forecasting

Algorithmic Motifs

[Diagrams: structured i-j-k grid, horizontal (i,j) and vertical (k) dependencies]

•  Stencils (finite differences)
   –  Horizontal dependencies
   –  No loop-carried dependencies
•  Tridiagonal linear solves (see the sketch below)
   –  Vertical dependencies
   –  Loop-carried dependencies
   –  Parallelizable in horizontal

Page 9: Towards GPU-accelerated Operational Weather Forecasting

Stencil Computations

•  Stencils are kernels updating array elements according to a fixed access pattern
•  Stencil computations are typically memory bandwidth bound

→ Stencil codes are memory bandwidth bound rather than flop limited

2D-Laplacian:

lap(i,j,k) = -4.0 * data(i,j,k) + data(i+1,j,k) + data(i-1,j,k)
                                + data(i,j+1,k) + data(i,j-1,k);

•  5 flops per 6 memory accesses (~ 0.1 FLOPs/Byte for 8-byte values)
•  A Tesla K20X delivers up to 1300 GFLOPS / 250 GB/s ~ 5.2 FLOPs/Byte

Page 10: Towards GPU-accelerated Operational Weather Forecasting

Algorithmic Motifs

•  Arithmetic intensity (= FLOPs per memory access)
   –  High arithmetic intensity → processor bound
   –  Low arithmetic intensity → memory bound

[Diagram: algorithmic motifs ranked by arithmetic intensity, from O(1) through O(log n) to O(n): stencils, particle methods, sparse linear algebra, FFT, BLAS 1, dense linear algebra, BLAS 2, lattice methods, BLAS 3. The COSMO dynamical core (stencils on a structured grid) sits at the low-intensity end; Top500 (Linpack), the focus of HPC system design, at the high end.]

Page 11: Towards GPU-accelerated Operational Weather Forecasting

What can be done?

•  Adapt the code, employing bandwidth-saving strategies (see the sketch after the table)
   –  Computation on the fly
   –  Increase data locality
•  Leverage the high memory bandwidth of GPUs

Architecture     Peak Performance    Memory Bandwidth
Xeon E5 2690     371 GFLOPS          51.2 GB/s
Tesla K20X       1300 GFLOPS         250 GB/s
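As an illustration of computation on the fly and improved data locality (a sketch with hypothetical field names, not actual COSMO code): two Laplacian sweeps are folded into a single 13-point stencil, so the intermediate Laplacian is never written to or re-read from main memory, at the cost of a few extra flops.

subroutine lap_of_lap(phi, out, ie, je, ke)
  implicit none
  integer, intent(in) :: ie, je, ke
  real, intent(in)    :: phi(ie,je,ke)
  real, intent(out)   :: out(ie,je,ke)
  integer :: i, j, k

  ! fused 13-point biharmonic stencil: Laplacian-of-Laplacian computed on the fly
  !$acc parallel loop collapse(3) copyin(phi) copyout(out)
  do k = 1, ke
    do j = 3, je - 2
      do i = 3, ie - 2
        out(i,j,k) = 20.0 * phi(i,j,k)                          &
                   -  8.0 * ( phi(i+1,j,k) + phi(i-1,j,k)       &
                            + phi(i,j+1,k) + phi(i,j-1,k) )     &
                   +  2.0 * ( phi(i+1,j+1,k) + phi(i+1,j-1,k)   &
                            + phi(i-1,j+1,k) + phi(i-1,j-1,k) ) &
                   +  phi(i+2,j,k) + phi(i-2,j,k)               &
                   +  phi(i,j+2,k) + phi(i,j-2,k)
      end do
    end do
  end do
end subroutine lap_of_lap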

Page 12: Towards GPU-accelerated Operational Weather Forecasting

Accelerator approach

•  Leverage high peak performance of GPU
•  CPU and GPU have different memories

[Diagram: offload model — work alternates between FLOPs on the CPU and FLOPs on the GPU, with data copies between the two memories]

Page 13: Towards GPU-accelerated Operational Weather Forecasting

Why does this not work for COSMO?

•  Transfer of data on each timestep too expensive

All code which accesses the prognostic variables within the timestepping has to be ported

Part        Time/Δt
Dynamics    172 ms
Physics     36 ms
Total       253 ms

vs. 118 ms for the transfer of ten prognostic variables

Page 14: Towards GPU-accelerated Operational Weather Forecasting

Porting Approach

[COSMO workflow diagram as on Page 6; the !$acc data region below spans the timestep loop (Δt)]

!$acc data copy(u,v,w,t,pp,…)
!$acc end data
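Schematically (a sketch only; loop-variable and routine names follow the document's own examples, not the actual COSMO sources), the data region is opened once around the timestep loop so that the prognostic fields stay resident in GPU memory:

!$acc data copy(u, v, w, t, pp)
do ntstep = 1, nstop
   call physics (u, v, w, t, pp)   ! OpenACC-ported physics (Page 15)
   call dynamics(u, v, w, t, pp)   ! C++/CUDA dynamical core via host_data (Page 19)
end do
!$acc end data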

Page 15: Towards GPU-accelerated Operational Weather Forecasting

Porting Approach: Physics

[COSMO workflow diagram as on Page 6 (Δt loop)]

•  Large group of developers
•  Code may be shared with other models
•  Less memory bandwidth bound
•  Large part of code (50% of the lines)
•  25% of runtime

→  GPU port with compiler directives (OpenACC), as sketched below
→  Code optimization
→  Critical routines currently have CPU and GPU versions
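A minimal sketch of this directive-based approach (subroutine and variable names are hypothetical): the existing Fortran loop nest is kept and annotated, so the same source still runs on the CPU.

subroutine relax_temperature(t, t_eq, tau, dt, ie, je, ke)
  implicit none
  integer, intent(in) :: ie, je, ke
  real, intent(in)    :: t_eq(ie,je,ke), tau, dt
  real, intent(inout) :: t(ie,je,ke)
  integer :: i, j, k

  ! present(): the fields are assumed to live in the enclosing !$acc data region
  !$acc parallel loop collapse(3) present(t, t_eq)
  do k = 1, ke
    do j = 1, je
      do i = 1, ie
        t(i,j,k) = t(i,j,k) + dt * (t_eq(i,j,k) - t(i,j,k)) / tau
      end do
    end do
  end do
end subroutine relax_temperature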

Page 16: Towards GPU-accelerated Operational Weather Forecasting

Performance: Physics

•  AMD Interlagos vs. Tesla K20X
•  Test: 128x128x60
•  90% of physics time in turbulence, radiation and microphysics
•  Speedup scales with effort

[Bar chart: GPU speedup (0–7x) for Microphysics, Radiation, Turbulence, Soil and Total Physics, for the PGI and Cray CCE compilers]

Page 17: Towards GPU-accelerated Operational Weather Forecasting

Porting Approach: Dynamics

[COSMO workflow diagram as on Page 6 (Δt loop)]

•  Small group of developers
•  Memory bandwidth bound
•  Complex stencils (3D)
•  60% of runtime

→  Complete rewrite in C++/CUDA
→  Development of a stencil library (DSL)
→  Target architectures: CPU (x86) and GPU
→  Extendable to other architectures
→  Long-term adaptation of the model

See the talk by Tobias Gysi at GTC12 for details

Page 18: Towards GPU-accelerated Operational Weather Forecasting

Performance: Dynamics

•  Test domain: 128 x 128 x 60 on a single CPU / GPU

•  CPU (OpenMP, kji-storage)
   –  Factor 1.6x – 1.8x faster than the Fortran version
   –  No explicit use of vector instructions (up to 30% improvement)
•  GPU (CUDA, ijk-storage)
   –  Factor 2.8x faster than the CPU version
   –  Ongoing performance optimization

A single switch to compile for GPU

Page 19: Towards GPU-accelerated Operational Weather Forecasting

Porting Approach: Dynamics

[COSMO workflow diagram as on Page 6 (Δt loop)]

!$acc data copy(u,v,w,…)
!$acc host_data use_device(u,v,w,…)
call dynamics(u,v,w,…)
!$acc end host_data
!$acc end data

No copies required for Fortran/OpenACC and C++/CUDA interoperability

Page 20: Towards GPU-accelerated Operational Weather Forecasting

Porting Approach: I/O

[COSMO workflow diagram as on Page 6 (Δt loop)]

!$acc data copy(u,v,w,…)
!$acc update host(u,v,w,…)
call inputoutput(u,v,w,…)
!$acc update device(u,v,w,…)
!$acc end data

Page 21: Towards GPU-accelerated Operational Weather Forecasting

Porting Approach: Other

[COSMO workflow diagram as on Page 6 (Δt loop)]

•  Avoid (large) data transfers
•  Not performance critical
•  A lot of code

→  GPU port with compiler directives (OpenACC)
→  Little or no code optimization
→  Strict single source

Page 22: Towards GPU-accelerated Operational Weather Forecasting

OpenACC Experience

•  Retain existing user code (e.g. Fortran)
•  Relatively easy to get code running & validating
•  Useful for large code bases

•  Performance may require significant restructuring → single source?
•  Data placement can be tricky
•  Compiler support still improving → thanks for the help!
•  No fine-grained control (e.g. data placement, register allocation)

Page 23: Towards GPU-accelerated Operational Weather Forecasting

Demonstration Project

•  Prototype implementation of the MeteoSwiss production suite

•  Same time-to-solution

[Photos: 1 cabinet Cray XE6 with 144 CPUs (AMD Magny Cours) vs. a Tyan Server FT77A "fat" node (18 cm high) with 8 GPUs (Tesla K20X)]

Page 24: Towards GPU-accelerated Operational Weather Forecasting

Tyan Server: Block Diagram

[Block diagram: two Xeon E5 2600 sockets connected by QPI, each with an IOH; the GPUs attach via PCIe through four PLX switches (two per IOH)]

Page 25: Towards GPU-accelerated Operational Weather Forecasting

Inter-GPU Communication

•  GPU-aware MPI library can directly handle GPU pointers (see the sketch below)

•  Eliminates bottleneck of CPU transfers

•  Pack / unpack kernels implemented by GCL

•  GPUs can be on the same or on different nodes

•  Bandwidth depends on “proximity” of GPUs

[Software stack: Fortran/OpenACC and C++/CUDA code call a common Communication Framework, which uses GCL on top of a P2P-capable MPI]
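For illustration, a minimal Fortran/OpenACC sketch (buffer and argument names are illustrative, the buffers are assumed to be present on the GPU, and GCL's own pack/unpack kernels are not shown) of handing device pointers directly to a GPU-aware MPI library:

subroutine exchange_halo(sendbuf, recvbuf, n, dest, source, comm)
  use mpi
  implicit none
  integer, intent(in)       :: n, dest, source, comm
  real(kind=8), intent(in)  :: sendbuf(n)
  real(kind=8), intent(out) :: recvbuf(n)
  integer :: ierr

  ! host_data exposes the device addresses of the OpenACC-managed buffers,
  ! so the P2P-capable MPI library transfers GPU memory directly.
  !$acc host_data use_device(sendbuf, recvbuf)
  call MPI_Sendrecv(sendbuf, n, MPI_DOUBLE_PRECISION, dest,   0, &
                    recvbuf, n, MPI_DOUBLE_PRECISION, source, 0, &
                    comm, MPI_STATUS_IGNORE, ierr)
  !$acc end host_data
end subroutine exchange_halo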

Page 26: Towards GPU-accelerated Operational Weather Forecasting

Performance: Communication

•  Bi-directional BW between two GPUs:

   Same…    P2P          MVAPICH2     COSMO
   PLX      13.1 GB/s    8.2 GB/s     7.1 GB/s
   IOH      11.0 GB/s    7.4 GB/s     6.9 GB/s
   Node     10.3 GB/s    4.2 GB/s     4.1 GB/s

•  Scaling (COSMO, per-GPU BW, 1d-decomposition, 2–8 GPUs):

   GPUs     2         3         4         5         6         7         8
   BW       7.0 GB/s  6.3 GB/s  6.5 GB/s  4.0 GB/s  4.5 GB/s  4.6 GB/s  4.6 GB/s

Page 27: Towards GPU-accelerated Operational Weather Forecasting

Communication Experience

•  Leveraging P2P is the key to getting performance on “fat” nodes
•  GPU-aware MPI libraries make this transparent to the user
•  Packet size matters! Use a real benchmark for performance numbers!
•  Pack / unpack of the MPI library does not perform → custom kernels
•  Pack / unpack time is 10x smaller than communication
•  Long startup times when UVA is enabled

Page 28: Towards GPU-accelerated Operational Weather Forecasting

Conclusions & Final Thoughts

•  A full port of a legacy weather and climate model is feasible!
•  Stay tuned for the demonstration project!
•  OpenACC eases the process, but there is “no free lunch”
•  Compilers are not perfect but improving rapidly
•  Separation of concerns? → DSL, DSEL, …

Page 29: Towards GPU-accelerated Operational Weather Forecasting

Backup slides

Page 30: Towards GPU-accelerated Operational Weather Forecasting

Requirements

•  Performance portability
•  Single source code
•  Code understandable / maintainable by domain scientists
•  Possibility to exchange code

Page 31: Towards GPU-accelerated Operational Weather Forecasting

Basic Equations

[Governing equations: prognostic equations for wind, pressure, temperature, water and density]

Page 32: Towards GPU-accelerated Operational Weather Forecasting

Computational Grid

[Diagram: structured i-j-k grid with horizontal spacing of 1 – 10 km and vertical spacing of 20 – 600 m]

Prognostic fields: u(i,j,k), v(i,j,k), w(i,j,k), p(i,j,k), t(i,j,k), qv(i,j,k), qc(i,j,k), qi(i,j,k), rho(i,j,k)

Page 33: Towards GPU-accelerated Operational Weather Forecasting

Optimal hardware for different use cases

[Plot: wall time per timestep (s, log scale 10^-2 – 10^0) vs. mesh dimensions (8x8 … 256x256) for an Interlagos socket, a Sandy Bridge socket and an X2090 GPU; regions labelled “high throughput” and “high performance”]

Page 34: Towards GPU-accelerated Operational Weather Forecasting

Running COSMO on a multi-GPU node

[Diagram: multi-GPU node with Intel Xeon E5-2600 CPUs]

Domain decomposition with 1 MPI task and 1 GPU per subdomain (see the sketch below)
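As a sketch of the “1 MPI task and 1 GPU per subdomain” setup (variable names are illustrative, and ranks are assumed to be placed node by node), each rank selects its own GPU before any OpenACC region runs:

subroutine bind_rank_to_gpu(comm)
  use mpi
  use openacc
  implicit none
  integer, intent(in) :: comm
  integer :: rank, ngpus, ierr

  call MPI_Comm_rank(comm, rank, ierr)
  ngpus = acc_get_num_devices(acc_device_nvidia)
  ! assumes ranks are filled node by node; otherwise use the node-local rank
  call acc_set_device_num(mod(rank, ngpus), acc_device_nvidia)
end subroutine bind_rank_to_gpu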