Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters


Computer Science X - System Simulation Group, Harald Köstler (harald.koestler@fau.de)

Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters

H. Köstler, Ch. Feichtinger

2nd International Symposium “Computer Simulations on GPU”, Freudenstadt, 29.05.2013


Contents

Motivation

waLBerla software concepts

LBM simulations on Tsubame

Future Work


Computational Science and Engineering @ LSS


Applications • Multiphysics • fluid, structure • medical imaging • laser

Applied Math • LBM • multigrid • FEM • numerics

Computer Science • HPC / hardware • performance engineering • software engineering

USE_SweepSection( getLBMsweepUID() )
  USE_Sweep()
    swUseFunction( "LBM", sweep::LBMsweep, FSUIDSet::all(), hsCPU, BSUIDSet::all() );
  USE_After()
    // Communication


Problems

Hardware: Modern HPC clusters are massively parallel: intra-core, intra-node, and inter-node

Software: Applications become more complex with increasing computational power

More complex (physical) models

Code development in interdisciplinary teams

Algorithm: Many variants exist; components and parameters depend on the computational domain or grid, the type of problem, …


waLBerla Applications


waLBerla: parallel block-structured grid framework


waLBerla @ GPU


Geometric multigrid solver on Tsubame

Computational Steering (VIPER)

CFD, fluid-structure interaction

[Plot: runtime in ms over the number of unknowns in millions (0 to 3500)]


Boltzmann equation

Mesoscopic approach to solving the Navier-Stokes equations

Boltzmann equation describes the statistical distribution of one particle in a fluid

f is the probability distribution function (PDF), ζ is the particle velocity, and Ω(f) is the change due to collisions

Models behavior of fluids in statistical physics

Lattice Boltzmann Method (LBM) solves the discrete Boltzmann equation

∂t f + ζ · ∇f = Ω(f)
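The discrete equation can be written out explicitly. A standard form, assuming the common BGK (single-relaxation-time) collision operator rather than any particular collision model used in waLBerla:

f_i(x + e_i Δt, t + Δt) = f_i(x, t) − (Δt/τ) · [ f_i(x, t) − f_i^eq(x, t) ]

where the f_i are the PDFs associated with the discrete lattice velocities e_i, τ is the relaxation time, and the macroscopic quantities are recovered as ρ = Σ_i f_i and ρu = Σ_i e_i f_i.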


Particulate Flow Simulation


D3Q19 LBM cell: collide and stream

F = m · a,   M = J · α   (translational and rotational equations of motion for the particles)
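To make the collide-and-stream update concrete, here is a minimal single-cell BGK sketch in C++. The D3Q19 velocities and weights are the standard model constants; the array layout, the relaxation time tau, and all identifiers are illustrative assumptions, not waLBerla's actual data structures.

#include <cstddef>

// Standard D3Q19 lattice velocities and weights.
static const int e[19][3] = {
  { 0, 0, 0},
  { 1, 0, 0}, {-1, 0, 0}, { 0, 1, 0}, { 0,-1, 0}, { 0, 0, 1}, { 0, 0,-1},
  { 1, 1, 0}, {-1, 1, 0}, { 1,-1, 0}, {-1,-1, 0},
  { 1, 0, 1}, {-1, 0, 1}, { 1, 0,-1}, {-1, 0,-1},
  { 0, 1, 1}, { 0,-1, 1}, { 0, 1,-1}, { 0,-1,-1}
};
static const double w[19] = {
  1.0/3.0,
  1.0/18.0, 1.0/18.0, 1.0/18.0, 1.0/18.0, 1.0/18.0, 1.0/18.0,
  1.0/36.0, 1.0/36.0, 1.0/36.0, 1.0/36.0,
  1.0/36.0, 1.0/36.0, 1.0/36.0, 1.0/36.0,
  1.0/36.0, 1.0/36.0, 1.0/36.0, 1.0/36.0
};

// BGK collide-and-stream for one interior cell (x,y,z) of an nx*ny*nz block.
// src and dst hold 19 PDFs per cell; tau is the relaxation time.
void collideStreamCell(const double* src, double* dst,
                       int x, int y, int z, int nx, int ny, double tau)
{
  auto idx = [=](int i, int cx, int cy, int cz) {
    return ((std::size_t(cz) * ny + cy) * nx + cx) * 19 + i;  // AoS layout (assumed)
  };

  // Zeroth and first moments: density and velocity.
  double f[19], rho = 0.0, ux = 0.0, uy = 0.0, uz = 0.0;
  for (int i = 0; i < 19; ++i) {
    f[i] = src[idx(i, x, y, z)];
    rho += f[i];
    ux += e[i][0] * f[i];  uy += e[i][1] * f[i];  uz += e[i][2] * f[i];
  }
  ux /= rho;  uy /= rho;  uz /= rho;
  const double usqr = ux*ux + uy*uy + uz*uz;

  for (int i = 0; i < 19; ++i) {
    // Second-order equilibrium distribution.
    const double eu  = e[i][0]*ux + e[i][1]*uy + e[i][2]*uz;
    const double feq = w[i] * rho * (1.0 + 3.0*eu + 4.5*eu*eu - 1.5*usqr);
    // Collide, then stream to the neighbor in direction i ("push" scheme).
    dst[idx(i, x + e[i][0], y + e[i][1], z + e[i][2])] = f[i] - (f[i] - feq) / tau;
  }
}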


waLBerla CPU-GPU cluster software concepts


waLBerla framework

Main goal: provide a massively parallel and efficient software framework for multi-physics simulations

waLBerla is mainly designed for HPC clusters


waLBerla (C++): code management, standard implementations

Low-level kernels for optimized architecture-specific computations (in C++, CUDA, assembler)


waLBerla: Block concept


waLBerla: Sweep concept

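The sweep itself was shown only as a figure; the macro-based registration appeared earlier on the overview slide. A slightly expanded sketch of that snippet follows. The GPU line and the comments are illustrative assumptions, not waLBerla's verbatim API:

USE_SweepSection( getLBMsweepUID() )
  USE_Sweep()
    // Register the CPU kernel for the "LBM" sweep on all block subsets.
    swUseFunction( "LBM", sweep::LBMsweep, FSUIDSet::all(), hsCPU, BSUIDSet::all() );
    // A GPU kernel would be registered analogously (hypothetical identifiers):
    swUseFunction( "LBM", sweep::LBMsweepGPU, FSUIDSet::all(), hsGPU, BSUIDSet::all() );
  USE_After()
    // Communication: ghost-layer exchange after the sweep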


Challenges on heterogeneous clusters I

Problem: Description of the heterogeneous compute resources
Solution: Description of all compute components per compute node in the input file

Problem: Management of the communication and compute kernels for each architecture
Solution: Kernel management based on metadata (see the sketch below)
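A minimal sketch of what metadata-based kernel management can look like, with every name invented for illustration (waLBerla's real interfaces are richer): kernels are registered together with the hardware they target, and the framework dispatches per subblock.

#include <functional>
#include <map>
#include <string>
#include <utility>

struct Block { /* field data, size, hardware assignment, ... */ };

enum class Hardware { CPU, GPU };

using Kernel = std::function<void(Block&)>;
using Key    = std::pair<std::string, Hardware>;

std::map<Key, Kernel> kernelRegistry;

void registerKernel(const std::string& sweep, Hardware hw, Kernel k) {
  kernelRegistry[{sweep, hw}] = std::move(k);
}

void runSweep(const std::string& sweep, Hardware hw, Block& b) {
  // Dispatch to the kernel whose metadata matches this subblock's hardware.
  kernelRegistry.at({sweep, hw})(b);
}

// Usage sketch:
//   registerKernel("LBM", Hardware::CPU, lbmSweepCPU);
//   registerKernel("LBM", Hardware::GPU, lbmSweepGPU);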


Challenges on heterogeneous clusters II

Problem: Common communication interface
Solution: Data exchange via communication buffers, also for intra-node communication (see the buffer sketch after this list)

Problem: Minimization of the heterogeneous communication overhead
Solution: Overlapping of work and communication, non-uniform domain decomposition, and intra-node communication in the z-dimension
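A sketch of the common buffer interface, assuming a simple byte-stream buffer (waLBerla's actual buffer classes differ): because every exchange serializes into raw bytes, the same path serves MPI messages, CPU-CPU, and CPU-GPU transfers.

#include <cstddef>
#include <cstring>
#include <vector>

class CommBuffer {
public:
  // Serialize n doubles into the byte stream (sending side).
  void pack(const double* data, std::size_t n) {
    const std::size_t old = bytes_.size();
    bytes_.resize(old + n * sizeof(double));
    std::memcpy(bytes_.data() + old, data, n * sizeof(double));
  }
  // Deserialize n doubles from the byte stream (receiving side).
  void unpack(double* data, std::size_t n) {
    std::memcpy(data, bytes_.data() + pos_, n * sizeof(double));
    pos_ += n * sizeof(double);
  }
  char*       raw()        { return bytes_.data(); }  // handed to MPI or cudaMemcpy
  std::size_t size() const { return bytes_.size(); }
private:
  std::vector<char> bytes_;
  std::size_t       pos_ = 0;
};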


waLBerla: Communication concept


Overlapping of work and communication
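A host-side sketch of the overlap scheme, assuming two CUDA streams and non-blocking MPI; the kernel launchers and halo buffers are placeholders, not waLBerla code. The boundary region is computed and exchanged while the inner region keeps the GPU busy.

#include <cuda_runtime.h>
#include <mpi.h>

// Placeholders for the actual compute kernels:
void launchBoundaryKernel(cudaStream_t s);
void launchInnerKernel(cudaStream_t s);

// One time step with communication hidden behind the inner-region update.
void timestepOverlapped(cudaStream_t boundary, cudaStream_t inner,
                        double* d_halo, double* h_send, double* h_recv,
                        int haloBytes, int left, int right, MPI_Comm comm)
{
  launchBoundaryKernel(boundary);  // 1. outer (boundary) cells first
  launchInnerKernel(inner);        // 2. inner cells run concurrently

  // 3. copy the boundary PDFs to the host, queued behind the boundary kernel
  cudaMemcpyAsync(h_send, d_halo, haloBytes, cudaMemcpyDeviceToHost, boundary);
  cudaStreamSynchronize(boundary);

  // 4. exchange halos over MPI while the inner kernel is still running
  MPI_Request req[2];
  MPI_Irecv(h_recv, haloBytes, MPI_BYTE, left,  0, comm, &req[0]);
  MPI_Isend(h_send, haloBytes, MPI_BYTE, right, 0, comm, &req[1]);
  MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

  // 5. move the received halo back to the device, then join both streams
  cudaMemcpyAsync(d_halo, h_recv, haloBytes, cudaMemcpyHostToDevice, boundary);
  cudaStreamSynchronize(boundary);
  cudaStreamSynchronize(inner);
}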


waLBerla: Subblocks

Assumption: A block corresponds to a (shared-memory) compute node

A node may be heterogeneous (CPU + GPU)

Distributed memory communication (via MPI) is not required within one block

Divide one block into subblocks of different sizes for (static) load balancing

Subblocks map to (local) devices
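A sketch of the static split under the assumption of a 1D decomposition along z, with the block's layers distributed proportionally to each device's measured throughput (e.g. in MLUP/s). All numbers and names are illustrative.

#include <cstddef>
#include <numeric>
#include <vector>

struct Subblock { int zBegin, zEnd; };  // slice of the block owned by one device

// Split nz cell layers among devices proportionally to their throughput,
// so that CPU and GPUs finish a sweep at roughly the same time.
std::vector<Subblock> staticLoadBalance(int nz, const std::vector<double>& perf)
{
  const double total = std::accumulate(perf.begin(), perf.end(), 0.0);
  std::vector<Subblock> parts;
  int z = 0;
  for (std::size_t d = 0; d < perf.size(); ++d) {
    const int layers = (d + 1 == perf.size())
        ? nz - z                                   // last device takes the remainder
        : static_cast<int>(nz * perf[d] / total);  // proportional share
    parts.push_back({z, z + layers});
    z += layers;
  }
  return parts;
}

// Example: one CPU socket and three GPUs with assumed throughputs (MLUP/s):
//   staticLoadBalance(512, {60.0, 250.0, 250.0, 250.0});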


Domain decomposition on one compute node


Results: LBM simulations on Tsubame 2.0


Tsubame 2.0 in Japan

Compute nodes: 1442
CPU: Intel Xeon X5670
GPUs: 3 x NVIDIA Tesla M2050 per node
Peak performance: 2.2 PFlop/s
Aggregate memory bandwidth: 633 TB/s
LINPACK performance: 1.2 PFlop/s
Power consumption: 1.4 MW
Interconnect: QDR InfiniBand


Performance Engineering


Create a performance model

Identify performance bottlenecks

Create problem-specific, hardware-dependent, and highly efficient kernels

Integrate them into the software framework

[Diagram: Algorithm and Hardware as the two inputs to performance engineering]


Performance Model I

Input:
Algorithm: LBM kernel, generic implementation
Hardware information (bandwidth, peak performance)

Assumptions:
Computation time is limited by memory bandwidth and instruction throughput
Communication time is limited by network bandwidth and latency (for direct and collective communication)

t_total = t_comp,outer + max( t_comp,inner , t_buffer + t_comm,CPU-GPU + t_comm,MPI )


Performance Model II

Single node performance on Tsubame:

Machine balance: B_m = sustainable bandwidth / peak performance

Code balance: B_c = no. of bytes loaded and stored / no. of FLOPS executed = 304 / 200

Lightspeed estimate: l = min( 1, B_m / B_c )   (if l < 1, the code is bandwidth limited)


Single Compute Node Performance I


Single Compute Node Performance II


Single Compute Node Performance III


Single Compute Node Performance IV


Communication model

t_total = t_comp,outer + max( t_comp,inner , t_buffer + t_comm,CPU-GPU + t_comm,MPI )

The communication time for one message depends on:

the size of the message s

the number of messages x that are concurrently transferred over the communication link (communication pattern)

the type of communication link ω

the relative position of the communication partners p, e.g. intra- or inter-node communication
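A minimal sketch of such a model as a C++ function. All latency and bandwidth values below are assumptions for illustration; in practice they would be fitted to benchmark measurements on the target machine.

#include <cassert>

enum class Link      { PCIe, InfiniBand };      // ω: type of communication link
enum class Placement { IntraNode, InterNode };  // p: relative position of the partners

// t(s, x, ω, p): estimated transfer time of one message of s bytes.
double commTime(double sBytes, int x, Link w, Placement p)
{
  assert(x > 0);
  double latency   = (w == Link::PCIe) ? 1e-5 : 2e-6;  // seconds  (assumed)
  double bandwidth = (w == Link::PCIe) ? 6e9  : 4e9;   // bytes/s  (assumed)
  if (w == Link::InfiniBand && p == Placement::IntraNode)
    bandwidth *= 2.0;  // assumed cheaper loop-back path inside one node
  // x messages are transferred concurrently and share the link, so each
  // message effectively sees bandwidth / x (the communication pattern).
  return latency + sBytes * x / bandwidth;
}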


Weak scaling, 3 GPUs per node


Strong scaling, 3 GPUs per node


Test case: Packed bed of hollow cylinders


Porous media: 100x100x1536, 1D dom. decomp.



Porous media: 100x100x1536, 1D/2D/3D


Porous media: 256x256x3600, 1D/2D


Current Work

Focus in waLBerla is currently on JUQUEEN and SuperMUC


Future Work

Tests on an NVIDIA Kepler cluster

Programming paradigms on future HPC clusters?

Code generation techniques to improve portability

Dynamic load balancing
