Post on 03-Feb-2022
Towards GPU-accelerated Operational Weather Forecasting
Oliver Fuhrer (MeteoSwiss), Carlos Osuna (C2SM/ETH), Xavier Lapillonne (C2SM/ETH), Tobias Gysi (Supercomputing Systems AG), Mauro Bianco (CSCS), Thomas Schulthess (CSCS/ETH)
Overview
• Weather and climate modeling
• The COSMO model
• GPU porting approach
• Single node performance results
• Multi-GPU Communication
Weather & Climate System
[Figure: characteristic spatial scales of the weather and climate system, from a snow flake (10⁻⁶ m = 1 µm) through clouds and mountains to planetary flow (10⁶ m = 1 Mm); adapted from C. Schär]
The modeling chain
• COSMO-7 with 6.6 km grid: 3 x per day, 72 h forecast
• COSMO-2 with 2.2 km grid: 7 x per day 33 h forecast, 1 x per day 45 h forecast
• ECMWF model with 16 km grid: 2 x per day, 10 day forecast
What is COSMO?
• Consortium for Small-Scale Modeling
• http://www.cosmo-model.org/
• Limited-area climate model
• Used by 7 weather services and O(50) universities and research institutes
COSMO Workflow
[Diagram: per-timestep (Δt) loop over Boundary Conditions, Physics, Dynamics, Data assimilation, Diagnostics, and Input / Output, framed by Initialization and Cleanup]
Properties
• PDEs
• Finite differences
• Structured grid
• Local operators
• All steps read / write model state
• Time splitting
• Sequential workflow
Lines vs. Runtime
• 300’000 LOC Fortran 90
[Chart: % lines of code vs. % runtime per model component]
Algorithmic Motifs
• Stencils (finite differences)
  – Horizontal dependencies
  – No loop-carried dependencies
• Tridiagonal linear solves
  – Vertical dependencies
  – Loop-carried dependencies
  – Parallelizable in the horizontal
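The tridiagonal solves are classically handled with the Thomas algorithm: the sweep over k is inherently sequential, but one independent system exists per horizontal grid point (i,j), so the horizontal dimensions stay fully parallel. A minimal sketch (not COSMO code; generic coefficient arrays):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Solve a tridiagonal system a[k]*x[k-1] + b[k]*x[k] + c[k]*x[k+1] = d[k]
// with the Thomas algorithm. Both sweeps carry a dependency in k and are
// sequential; in a 3D model, one such solve runs per (i,j) column.
std::vector<double> thomas_solve(std::vector<double> a,
                                 std::vector<double> b,
                                 std::vector<double> c,
                                 std::vector<double> d) {
    const std::size_t n = d.size();
    // Forward elimination: each row uses the previous row's result.
    for (std::size_t k = 1; k < n; ++k) {
        double m = a[k] / b[k - 1];
        b[k] -= m * c[k - 1];
        d[k] -= m * d[k - 1];
    }
    // Back substitution, again sequential in k.
    std::vector<double> x(n);
    x[n - 1] = d[n - 1] / b[n - 1];
    for (std::size_t k = n - 1; k-- > 0;)
        x[k] = (d[k] - c[k] * x[k + 1]) / b[k];
    return x;
}
```

Because each column's solve is independent, a GPU implementation simply maps one thread (or thread group) to each (i,j) column.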
Stencil Computations
• Stencils are kernels updating array elements according to a fixed access pattern
• Stencil computations perform few operations per memory access
→ Stencil codes are typically memory bandwidth bound rather than flop limited
2D-Laplacian:
lap(i,j,k) = -4.0 * data(i,j,k)
           + data(i+1,j,k) + data(i-1,j,k)
           + data(i,j+1,k) + data(i,j-1,k);
• Arithmetic intensity (= FLOPs per byte of memory traffic)
  – High arithmetic intensity → processor bound
  – Low arithmetic intensity → memory bound
• The 2D-Laplacian: 5 flops per 6 memory accesses ≈ 0.1 FLOPs/Byte
• A Tesla K20X delivers up to 1300 GFLOPS / 250 GB/s ≈ 5.2 FLOPs/Byte
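As an illustration of the arithmetic above, here is the 2D-Laplacian written out as a plain loop nest (a sketch; grid dimensions and storage layout are hypothetical, with i fastest in memory):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 5-point 2D Laplacian applied level by level on an (ni, nj, nk) grid.
// Per output point: 5 flops (4 adds, 1 multiply) against 6 array accesses,
// i.e. roughly 0.1 flops per byte in double precision, far below the
// ~5.2 flops/byte balance point of a Tesla K20X, hence memory bound.
void laplacian(const std::vector<double>& data, std::vector<double>& lap,
               std::size_t ni, std::size_t nj, std::size_t nk) {
    auto idx = [=](std::size_t i, std::size_t j, std::size_t k) {
        return (k * nj + j) * ni + i;  // i fastest in memory
    };
    for (std::size_t k = 0; k < nk; ++k)
        for (std::size_t j = 1; j + 1 < nj; ++j)
            for (std::size_t i = 1; i + 1 < ni; ++i)
                lap[idx(i, j, k)] = -4.0 * data[idx(i, j, k)]
                                  + data[idx(i + 1, j, k)] + data[idx(i - 1, j, k)]
                                  + data[idx(i, j + 1, k)] + data[idx(i, j - 1, k)];
}
```

On a constant field the Laplacian vanishes at interior points, which makes a convenient sanity check.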
Algorithmic Motifs
[Chart: algorithmic motifs ordered by arithmetic intensity. O(1): stencils, particle methods, sparse linear algebra, BLAS 1; O(log n): FFT; O(n): dense linear algebra, BLAS 2, lattice methods (BLAS 3). Top500 (Linpack), the focus of HPC system design, sits at the O(n) end; the COSMO dynamical core (stencils on a structured grid) sits at the O(1) end]
What can be done?
• Adapt the code employing bandwidth-saving strategies
  – Computation on the fly
  – Increase data locality
• Leverage the high memory bandwidth of GPUs
Architecture Peak Performance Memory Bandwidth
Xeon E5 2690 371 GFLOPS 51.2 GB/s
Tesla K20X 1300 GFLOPS 250 GB/s
Accelerator approach
• Leverage high peak performance of GPU
• CPU and GPU have different memories
[Diagram: offload model, with flops executed on the CPU and GPU and data copied between the two memories]
Why does this not work for COSMO?
• Transfer of data on each timestep is too expensive: per Δt the dynamics takes 172 ms and the physics 36 ms (253 ms total), while transferring the ten prognostic variables alone takes 118 ms
→ All code which accesses the prognostic variables within the timestepping has to be ported
Porting Approach
!$acc data copy(u,v,w,t,pp,…)
!$acc end data
Porting Approach: Physics
• Large group of developers
• Code may be shared with other models
• Less memory bandwidth bound
• Large part of code (50% of the lines), 25% of runtime
→ GPU port with compiler directives (OpenACC)
→ Code optimization
→ Critical routines currently have a CPU and a GPU version
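The physics columns are independent in the horizontal, which is what makes directive-based porting straightforward: one parallel directive over the horizontal loops suffices. A hypothetical column-physics kernel (not COSMO code) with an OpenACC annotation; a non-accelerating compiler ignores the pragma and runs the same source serially, preserving the single-source property:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical column physics: every (i,j) column is updated independently,
// so the two horizontal loops can be offloaded with a single directive.
// The k-loop stays sequential within a column (e.g. vertical integrals).
void column_physics(std::vector<double>& t, std::size_t ni, std::size_t nj,
                    std::size_t nk, double heating_rate, double dt) {
    double* tp = t.data();
    #pragma acc parallel loop collapse(2) copy(tp[0:ni*nj*nk])
    for (std::size_t j = 0; j < nj; ++j)
        for (std::size_t i = 0; i < ni; ++i)
            for (std::size_t k = 0; k < nk; ++k)  // sequential within a column
                tp[(k * nj + j) * ni + i] += heating_rate * dt;
}
```

In the real model the `copy` clause would be replaced by `present`, since the fields already live on the GPU inside the surrounding data region.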
Performance: Physics
• AMD Interlagos vs. Tesla K20x
• Test: 128x128x60
• 90% of physics time in turbulence, radiation and microphysics
• Speedup scales with effort
[Chart: speedup (0–7x) of the GPU port for Microphysics, Radiation, Turbulence, Soil, and Total Physics, shown for the PGI and Cray CCE compilers]
Porting Approach: Dynamics
• Small group of developers
• Memory bandwidth bound
• Complex stencils (3D)
• 60% of runtime
→ Complete rewrite in C++/CUDA
→ Development of a stencil library (DSL)
→ Target architectures: CPU (x86) and GPU
→ Extendable to other architectures
→ Long-term adaptation of the model
See the talk by Tobias Gysi at GTC12 for details.
Performance: Dynamics
• Test domain: 128 x 128 x 60 on a single CPU / GPU
• CPU (OpenMP, kji-storage)
  – Factor 1.6x – 1.8x faster than the Fortran version
  – No explicit use of vector instructions (up to 30% improvement)
• GPU (CUDA, ijk-storage)
  – Factor 2.8x faster than the CPU version
  – Ongoing performance optimization
A single compile-time switch selects the CPU or GPU build.
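The kji and ijk storage orders mentioned above differ only in which index is contiguous in memory. A sketch of the two linearizations, assuming the order names list indices from slowest- to fastest-varying (the slides do not spell out the convention, so this reading is an assumption):

```cpp
#include <cassert>
#include <cstddef>

struct Dims { std::size_t ni, nj, nk; };

// kji: i has stride 1, so a horizontal line of a k-plane is contiguous,
// which suits CPU vectorization of the inner i-loop.
std::size_t idx_kji(Dims d, std::size_t i, std::size_t j, std::size_t k) {
    return (k * d.nj + j) * d.ni + i;
}

// ijk: k has stride 1, so a whole vertical column is contiguous.
std::size_t idx_ijk(Dims d, std::size_t i, std::size_t j, std::size_t k) {
    return (i * d.nj + j) * d.nk + k;
}
```

Since the stencil library owns the storage layout, each backend can pick the linearization that matches its memory system without touching user code.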
Porting Approach: Dynamics
!$acc data copy(u,v,w,…)
!$acc host_data use_device(u,v,w,…)
call dynamics(u,v,w,…)
!$acc end host_data
!$acc end data
No copies required for Fortran/OpenACC and C++/CUDA interoperability
Porting Approach: I/O
!$acc data copy(u,v,w,…)
!$acc update host(u,v,w,…)
call inputoutput(u,v,w,…)
!$acc update device(u,v,w,…)
!$acc end data
Porting Approach: Other
• Avoid (large) data transfers
• Not performance critical
• A lot of code
→ GPU port with compiler directives (OpenACC)
→ Little or no code optimization
→ Strict single source
OpenACC Experience
• Retain existing user code (e.g. Fortran)
• Relatively easy to get code running & validating
• Useful for large code bases
• Performance may require significant restructuring → single source?
• Data placement can be tricky
• Compiler support still improving → thanks for the help!
• No fine-grain control (e.g. data placement, register allocation)
Demonstration Project
• Prototype implementation of the MeteoSwiss production suite
• Same time-to-solution
• Hardware: 1 cabinet Cray XE6 with 144 CPUs (AMD Magny Cours) vs. a single Tyan Server FT77A ("fat" node) with 8 GPUs (Tesla K20X)
Tyan Server: Block Diagram
[Diagram: two Xeon E5 2600 sockets linked by QPI, each attached to an IOH; four PLX PCIe switches fan out from the IOHs to the GPUs]
Inter-GPU Communication
• GPU-aware MPI library can directly handle GPU pointers
• Eliminates bottleneck of CPU transfers
• Pack / unpack kernels implemented by GCL
• GPUs can be on the same or on different nodes
• Bandwidth depends on “proximity” of GPUs
Communication Framework
[Stack: Fortran / OpenACC and C++ / CUDA code use GCL for halo exchanges, layered on top of a P2P-capable MPI]
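The pack / unpack step that GCL implements gathers a halo strip into a contiguous buffer before the MPI call and scatters it on receive. A simplified CPU-side sketch for the low-i halo (layout and function names are illustrative, not GCL's API):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pack the `halo`-wide strip at the low-i boundary of an (ni, nj, nk) field
// (i fastest in memory) into a contiguous send buffer. With a GPU-aware MPI
// library, the same buffer can live in device memory and be handed to MPI
// directly; in GCL this loop runs as a CUDA kernel.
std::vector<double> pack_low_i_halo(const std::vector<double>& field,
                                    std::size_t ni, std::size_t nj,
                                    std::size_t nk, std::size_t halo) {
    std::vector<double> buf;
    buf.reserve(halo * nj * nk);
    for (std::size_t k = 0; k < nk; ++k)
        for (std::size_t j = 0; j < nj; ++j)
            for (std::size_t i = 0; i < halo; ++i)
                buf.push_back(field[(k * nj + j) * ni + i]);
    return buf;
}
```

Unpacking is the mirror image: the received buffer is scattered into the neighbor's ghost strip of the local field.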
Performance: Communication
• Bi-directional BW between two GPUs
• Scaling (COSMO, per GPU BW, 1d-decomposition, 2–8 GPUs)
Same…   P2P        MVAPICH2   COSMO
PLX     13.1 GB/s  8.2 GB/s   7.1 GB/s
IOH     11.0 GB/s  7.4 GB/s   6.9 GB/s
Node    10.3 GB/s  4.2 GB/s   4.1 GB/s

# GPUs      2    3    4    5    6    7    8
BW (GB/s) 7.0  6.3  6.5  4.0  4.5  4.6  4.6
Communication Experience
• Leveraging P2P is the key to getting performance on "fat" nodes
• GPU-aware MPI libraries make this transparent to the user
• Packet size matters! Use a real benchmark for performance numbers!
• Pack / unpack of the MPI library does not perform → custom kernels
• Pack / unpack time is 10x smaller than communication time
• Long startup times when UVA is enabled
Conclusions & Final Thoughts
• A full port of a legacy weather and climate model is feasible!
• Stay tuned for the demonstration project!
• OpenACC eases the process but is "no free lunch"
• Compilers are not perfect but improving rapidly
• Separation of concerns? → DSL, DSEL, …
Backup slides
Requirements
• Performance portability
• Single source code
• Code understandable / maintainable by domain scientists
• Possibility to exchange code
Basic Equations
[Equations: prognostic equations for wind, pressure, temperature, water, and density]
Computational Grid
• Horizontal grid spacing: 1 – 10 km
• Vertical grid spacing: 20 – 600 m
• Prognostic fields on the (i,j,k) grid: u(i,j,k), v(i,j,k), w(i,j,k), p(i,j,k), t(i,j,k), qv(i,j,k), qc(i,j,k), qi(i,j,k), rho(i,j,k)
Optimal hardware for different use cases
[Chart: wall time per timestep (s, 10⁻² to 10⁰) vs. mesh dimensions (8x8 through 256x256) for an Interlagos socket, a Sandy Bridge socket, and an X2090, contrasting high-throughput and high-performance use cases]
Running COSMO on a multi-GPU node
• Domain decomposition with 1 MPI task and 1 GPU per subdomain (Intel Xeon E5-2600 host)
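The decomposition above splits each horizontal dimension as evenly as possible across the MPI tasks, one rectangular subdomain per task. A generic sketch of the 1D split that such a decomposition is built from (names hypothetical, handling non-divisible sizes by giving one extra point to the first tasks):

```cpp
#include <cassert>
#include <cstddef>

// Half-open range [begin, end) of global indices owned by task `rank`
// when `n` grid points are split across `p` tasks as evenly as possible.
struct Range { std::size_t begin, end; };

Range decompose_1d(std::size_t n, std::size_t p, std::size_t rank) {
    std::size_t base = n / p, rem = n % p;
    std::size_t begin = rank * base + (rank < rem ? rank : rem);
    std::size_t size  = base + (rank < rem ? 1 : 0);
    return {begin, begin + size};
}
```

Applying this independently to the i and j dimensions yields the 2D horizontal decomposition; each task then drives the GPU holding its subdomain.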