
ENZO AND EXTREME SCALE AMR FOR HYDRODYNAMIC COSMOLOGY

Michael L. Norman, UC San Diego and SDSC, mlnorman@ucsd.edu


WHAT IS ENZO?

A parallel AMR application for astrophysics and cosmology simulations: hybrid physics (fluid + particle + gravity + radiation), block-structured AMR, MPI or hybrid parallelism

Under continuous development since 1994, begun by Greg Bryan and Mike Norman at NCSA. Parallelism has evolved from shared memory to distributed memory to hierarchical memory. C++/C/Fortran, >185,000 lines of code

Community code in widespread use worldwide: hundreds of users, dozens of developers. Version 2.0 at http://enzo.googlecode.com


TWO PRIMARY APPLICATION DOMAINS

ASTROPHYSICAL FLUID DYNAMICS: supersonic turbulence
HYDRODYNAMIC COSMOLOGY: large-scale structure


ENZO PHYSICS

Physics                                      | Equations                        | Math type             | Algorithm(s)                                  | Communication
Dark matter                                  | Newtonian N-body                 | Numerical integration | Particle-mesh                                 | Gather-scatter
Gravity                                      | Poisson                          | Elliptic              | FFT, multigrid                                | Global
Gas dynamics                                 | Euler                            | Nonlinear hyperbolic  | Explicit finite volume                        | Nearest neighbor
Magnetic fields                              | Ideal MHD                        | Nonlinear hyperbolic  | Explicit finite volume                        | Nearest neighbor
Radiation transport                          | Flux-limited radiation diffusion | Nonlinear parabolic   | Implicit finite difference, multigrid solves  | Global
Multispecies chemistry                       | Kinetic equations                | Coupled stiff ODEs    | Explicit BE, implicit                         | None
Inertial, tracer, source, and sink particles | Newtonian N-body                 | Numerical integration | Particle-mesh                                 | Gather-scatter

Physics modules can be used in any combination in 1D, 2D, and 3D, making ENZO a very powerful and versatile code.


ENZO MESHING

Berger-Colella structured AMR
Cartesian base grid and subgrids
Hierarchical timestepping
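To make the hierarchical timestepping concrete: with a refinement factor of 2, each finer level takes two substeps of half the size for every step of the level above it, recursing to the finest level before the coarser level advances again. A minimal C++ sketch (illustrative only, with a hypothetical advance_grids(); this is not ENZO's actual evolution code):

```cpp
#include <cstdio>

// Illustration of Berger-Colella hierarchical timestepping: a level with
// refinement factor 2 takes two substeps (each dt/2) for every step of
// the coarser level above it.
const int kMaxLevel = 2;      // assumed hierarchy depth for the example
const int kRefineFactor = 2;  // assumed spatial/temporal refinement factor

void advance_grids(int level, double t, double dt) {
  // Placeholder for the real per-level work: hydro, gravity, particles.
  std::printf("level %d: advance from t=%.4f by dt=%.4f\n", level, t, dt);
}

void evolve_level(int level, double t, double dt) {
  advance_grids(level, t, dt);
  if (level < kMaxLevel) {
    // The finer level subcycles with a smaller timestep; the two levels
    // are then synchronized (flux correction, projection) at t + dt.
    double dt_fine = dt / kRefineFactor;
    for (int sub = 0; sub < kRefineFactor; ++sub)
      evolve_level(level + 1, t + sub * dt_fine, dt_fine);
  }
}

int main() {
  evolve_level(0, 0.0, 1.0);  // one root-level timestep
}
```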


[Figure: nested grids at Level 0, Level 1, Level 2. AMR = collection of grids (patches); each grid is a C++ object.]
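As a rough illustration of "each grid is a C++ object" (hypothetical, simplified names; ENZO's real grid class carries far more state, such as multiple field arrays, particles, and flux data):

```cpp
#include <array>
#include <vector>

// Hypothetical, simplified stand-in for an AMR grid patch object.
struct GridPatch {
  int level;                        // refinement level (0 = root grid)
  std::array<int, 3> dims;          // cells per axis on this patch
  std::array<double, 3> left_edge;  // physical bounding box
  std::array<double, 3> right_edge;
  std::vector<double> density;      // one field array as an example

  GridPatch* parent = nullptr;       // coarser patch this one refines
  std::vector<GridPatch*> subgrids;  // finer patches nested inside
};

// The AMR hierarchy is then just a collection of such patches per level.
using LevelList = std::vector<GridPatch*>;
using Hierarchy = std::vector<LevelList>;  // index = refinement level
```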


Unigrid = collection of Level 0 grid patches


EVOLUTION OF ENZO PARALLELISM

Shared memory (PowerC) parallel (1994-1998)
  SMP and DSM architectures (SGI Origin 2000, Altix)
  Parallel DO across grids at a given refinement level, including the block-decomposed base grid
  O(10,000) grids

Distributed memory (MPI) parallel (1998-2008)
  MPP and SMP cluster architectures (e.g., IBM PowerN)
  Level 0 grid partitioned across processors
  Level >0 grids within a processor executed sequentially
  Dynamic load balancing by messaging grids to underloaded processors (greedy load balancing; see the sketch below)
  O(100,000) grids
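The greedy load-balancing idea can be sketched as follows (illustrative only, with hypothetical names; the real code migrates grids between MPI ranks, whereas this just computes an assignment): sort grids by estimated work and hand each, heaviest first, to the currently least-loaded processor.

```cpp
#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Greedy assignment of grid workloads to processors: heaviest grids
// first, each placed on the least-loaded processor so far.
// Returns owner[i] = processor assigned to grid i.
std::vector<int> greedy_balance(const std::vector<double>& grid_work,
                                int num_procs) {
  std::vector<int> order(grid_work.size());
  for (size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
  std::sort(order.begin(), order.end(), [&](int a, int b) {
    return grid_work[a] > grid_work[b];  // heaviest first
  });

  // Min-heap of (accumulated load, processor rank).
  using Load = std::pair<double, int>;
  std::priority_queue<Load, std::vector<Load>, std::greater<Load>> heap;
  for (int p = 0; p < num_procs; ++p) heap.push({0.0, p});

  std::vector<int> owner(grid_work.size(), -1);
  for (int g : order) {
    auto [load, proc] = heap.top();
    heap.pop();
    owner[g] = proc;
    heap.push({load + grid_work[g], proc});
  }
  return owner;
}
```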


[Figure: projection of refinement levels; 160,000 grid patches at 4 refinement levels.]


1 MPI task per processor. Task = a Level 0 grid patch and all associated subgrids, processed sequentially across and within levels.


EVOLUTION OF ENZO PARALLELISM

Hierarchical memory (MPI+OpenMP) parallel (2008-)
  SMP and multicore cluster architectures (Sun Constellation, Cray XT4/5)
  Level 0 grid partitioned across shared-memory nodes/multicore processors
  Parallel DO across grids at a given refinement level within a node (see the sketch below)
  Dynamic load balancing less critical because of larger MPI task granularity (statistical load balancing)
  O(1,000,000) grids
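A minimal sketch of the hybrid scheme referenced above, assuming hypothetical GridPatch/advance_grid placeholders: each MPI rank holds a Level 0 tile plus its subgrids, and an OpenMP parallel loop works through the grids of one level inside the rank while levels are still processed in order.

```cpp
#include <mpi.h>
#include <vector>

struct GridPatch { /* fields, particles, metadata ... */ };

// Placeholder for the per-grid work on one level (hydro, gravity, ...).
void advance_grid(GridPatch& g, double dt) { (void)g; (void)dt; }

// Within one MPI rank: parallel DO across the grids of a given level.
// Levels are processed sequentially; grids within a level are independent
// up to boundary exchange, so they can be threaded.
void evolve_level_hybrid(std::vector<GridPatch>& grids_on_level, double dt) {
  #pragma omp parallel for schedule(dynamic)
  for (int i = 0; i < static_cast<int>(grids_on_level.size()); ++i)
    advance_grid(grids_on_level[i], dt);
}

int main(int argc, char** argv) {
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
  // This rank's Level 0 tile and subgrids (placeholder contents).
  std::vector<GridPatch> my_grids(16);
  evolve_level_hybrid(my_grids, /*dt=*/1.0e-3);
  MPI_Finalize();
}
```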


N MPI tasks per SMP, M OpenMP threads per task. Task = a Level 0 grid patch and all associated subgrids, processed concurrently within levels and sequentially across levels. Each grid is handled by an OpenMP thread.


ENZO ON PETASCALE PLATFORMS

ENZO ON CRAY XT5: 1% OF THE 6400^3 SIMULATION

Non-AMR 6400^3, 80 Mpc box
15,625 (25^3) MPI tasks, 256^3 root grid tiles
6 OpenMP threads per task: 93,750 cores
30 TB per checkpoint/restart/data dump
>15 GB/sec read, >7 GB/sec write
Benefit of threading: reduce MPI overhead and improve disk I/O


ENZO ON PETASCALE PLATFORMS

ENZO ON CRAY XT5: 10^5 SPATIAL DYNAMIC RANGE

AMR 1024^3, 50 Mpc box, 7 levels of refinement
4096 (16^3) MPI tasks, 64^3 root grid tiles
1 to 6 OpenMP threads per task: 4096 to 24,576 cores
Benefit of threading: thread count increases with memory growth, reducing replication of grid hierarchy data


Using MPI+threads to access more RAM as the AMR calculation grows in size


ENZO ON PETASCALE PLATFORMS

ENZO-RHD ON CRAY XT5: COSMIC REIONIZATION

Including radiation transport: 10x more expensive
LLNL Hypre multigrid solver dominates run time; near-ideal scaling to at least 32K MPI tasks
Non-AMR 1024^3, 8 and 16 Mpc boxes
4096 (16^3) MPI tasks, 64^3 root grid tiles


BLUE WATERS TARGET SIMULATION: RE-IONIZING THE UNIVERSE

Cosmic reionization is a weak-scaling problem: large volumes at fixed resolution to span the range of scales
Non-AMR 4096^3 with ENZO-RHD
Hybrid MPI and OpenMP; SMT and SIMD tuning
128^3 to 256^3 root grid tiles
4-8 OpenMP threads per task
4-8 TB per checkpoint/restart/data dump (HDF5; see the estimate below)
In-core intermediate checkpoints (?)
64-bit arithmetic, 64-bit integers and pointers
Aiming for 64-128K cores, 20-40 M hours (?)
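A back-of-envelope check of the quoted dump size (a sketch under assumptions not stated on the slide: roughly 8-16 double-precision cell fields dominate the output, with particle and metadata overheads ignored):

```cpp
#include <cstdio>

int main() {
  const double cells = 4096.0 * 4096.0 * 4096.0;  // non-AMR 4096^3 root grid
  const double bytes_per_value = 8.0;             // 64-bit arithmetic
  for (int fields = 8; fields <= 16; fields *= 2) {
    double tbytes = cells * bytes_per_value * fields / 1.0e12;
    std::printf("%2d fields -> ~%.1f TB per dump\n", fields, tbytes);
  }
  // ~4.4 TB with 8 fields, ~8.8 TB with 16: consistent with the
  // 4-8 TB per checkpoint/restart/data dump quoted above.
}
```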


PETASCALE AND BEYOND

ENZO's AMR infrastructure limits scalability to O(10^4) cores.

We are developing a new, extremely scalable AMR infrastructure called Cello: http://lca.ucsd.edu/projects/cello

ENZO-P will be implemented on top of Cello to scale to millions of cores.


CURRENT CAPABILITIES: AMR VS TREECODE


CELLO EXTREME AMR FRAMEWORK: DESIGN PRINCIPLES

Hierarchical parallelism and load balancing to improve localization
Relax global synchronization to a minimum
Flexible mapping between data structures and concurrency
Object-oriented design
Build on the best available software for fault-tolerant, dynamically scheduled concurrent objects (Charm++)


CELLO EXTREME AMR FRAMEWORK: APPROACH AND SOLUTIONS

1. Hybrid replicated/distributed octree-based AMR approach, with novel modifications to improve AMR scaling in both size and depth (see the sketch after this list)
2. Patch-local adaptive time steps
3. Flexible hybrid parallelization strategies
4. Hierarchical load balancing approach based on actual performance measurements
5. Dynamical task scheduling and communication
6. Flexible reorganization of AMR data in memory to permit independent optimization of computation, communication, and storage
7. Variable AMR grid block sizes while keeping parallel task sizes fixed
8. Addressing numerical precision and range issues that arise in particularly deep AMR hierarchies
9. Detecting and handling hardware or software faults during run-time to improve software resilience and enable software self-management
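A highly simplified sketch of the octree-based block organization in item 1 (illustrative only; Cello's actual data structures are more elaborate and partly distributed rather than a single in-memory tree):

```cpp
#include <memory>
#include <vector>

// Simplified octree AMR block: a leaf block carries field data, while a
// refined block owns 8 children covering its octants.
struct Block {
  int level = 0;
  std::vector<std::unique_ptr<Block>> children;  // empty => leaf

  void refine() {
    children.reserve(8);
    for (int octant = 0; octant < 8; ++octant) {
      auto child = std::make_unique<Block>();
      child->level = level + 1;
      children.push_back(std::move(child));
    }
  }

  // Count leaf blocks in this subtree (e.g., to estimate work or memory).
  int count_leaves() const {
    if (children.empty()) return 1;
    int n = 0;
    for (const auto& c : children) n += c->count_leaves();
    return n;
  }
};
```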


IMPROVING THE AMR MESH: PATCH COALESCING


IMPROVING THE AMR MESH: TARGETED REFINEMENT


IMPROVING THE AMR MESH: TARGETED REFINEMENT WITH BACKFILL


CELLO SOFTWARE COMPONENTS

http://lca.ucsd.edu/projects/cello


ROADMAP


ENZO RESOURCES

Enzo website (code, documentation): http://lca.ucsd.edu/projects/enzo
2010 Enzo User Workshop slides: http://lca.ucsd.edu/workshops/enzo2010
yt website (analysis and visualization): http://yt.enzotools.org
Jacques website (analysis and visualization): http://jacques.enzotools.org/doc/Jacques/Jacques.html


BACKUP SLIDES


GRID HIERARCHY DATA STRUCTURE

[Figure: nested grids at Levels 0-2, labeled (level, grid index): (0,0); (1,0); (2,0), (2,1).]


[Figure: the grid hierarchy as a tree, with nodes labeled (level, index) from the root (0) down to level 4; the vertical axis is depth (level), the horizontal axis is breadth (# siblings).]

Scaling the AMR grid hierarchy in depth and breadth


1024^3, 7-LEVEL AMR STATS

Level | Grids   | Memory (MB) | Work = Memory * 2^level
0     | 512     | 179,029     | 179,029
1     | 223,275 | 114,629     | 229,258
2     | 51,522  | 21,226      | 84,904
3     | 17,448  | 6,085       | 48,680
4     | 7,216   | 1,975       | 31,600
5     | 3,370   | 1,006       | 32,192
6     | 1,674   | 599         | 38,336
7     | 794     | 311         | 39,808
Total | 305,881 | 324,860     | 683,807
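The work column reflects hierarchical timestepping with refinement factor 2: a level-L grid takes 2^L substeps per root-grid step, so its work scales as memory x 2^L. A quick check of the totals (values copied from the table above):

```cpp
#include <cstdio>

int main() {
  // Memory (MB) per level from the 1024^3, 7-level AMR run above.
  const double mem_mb[8] = {179029, 114629, 21226, 6085,
                            1975,   1006,   599,   311};
  double total = 0.0;
  for (int level = 0; level <= 7; ++level) {
    double work = mem_mb[level] * (1 << level);  // 2^level substeps per root step
    total += work;
    std::printf("level %d: work = %.0f\n", level, work);
  }
  std::printf("total work = %.0f (table: 683,807)\n", total);
}
```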


[Figure: current MPI implementation. Real grid objects hold grid metadata + physics data; virtual grid objects hold grid metadata only.]
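A minimal sketch of the real vs. virtual grid distinction (hypothetical names): every rank keeps the metadata for every grid so the hierarchy can be traversed, but only the owning rank allocates the physics data.

```cpp
#include <array>
#include <vector>

// "Real" grid object on the owning rank: metadata plus physics data.
// "Virtual" grid object on every other rank: metadata only.
struct GridMetadata {
  int level;
  int owner_rank;
  std::array<int, 3> dims;
  std::array<double, 3> left_edge, right_edge;
};

struct Grid {
  GridMetadata meta;
  std::vector<double> fields;  // empty on ranks where the grid is virtual

  bool is_real(int my_rank) const { return meta.owner_rank == my_rank; }
};

// Replicated hierarchy: every rank stores one Grid entry per patch, which
// is exactly the metadata-memory problem described on the next slide.
using ReplicatedHierarchy = std::vector<Grid>;
```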


SCALING AMR GRID HIERARCHY

Flat MPI implementation is not scalable because the grid hierarchy metadata is replicated on every processor
For very large grid counts, this metadata dominates the memory requirement (not the physics data!)
Hybrid parallel implementation helps a lot: hierarchy metadata is replicated only on every SMP node instead of every processor
We would prefer fewer SMP nodes (8192-4096) with larger core counts (32-64) (= 262,144 cores)
Communication burden is partially shifted from MPI to intra-node memory accesses


CELLO EXTREME AMR FRAMEWORK

Targeted at fluid, particle, or hybrid (fluid + particle) simulations on millions of cores

Generic AMR scaling issues:
  Small AMR patches restrict available parallelism
  Dynamic load balancing
  Maintaining data locality for deep hierarchies
  Re-meshing efficiency and scalability
  Inherently global multilevel elliptic solves
  Increased range and precision requirements for deep hierarchies