Presentation of the open source CFD code Code_Saturne
HPC and CFD at EDF with Code_Saturne
Yvan Fournier, Jérôme Bonelle
EDF R&D, Fluid Dynamics, Power Generation and Environment Department
Open Source CFD International Conference Barcelona 2009
2 Open Source CFD International Conference 2009
Summary
1. General Elements on Code_Saturne
2. Real-world performance of Code_Saturne
3. Example applications: fuel assemblies
4. Parallel implementation of Code_Saturne
5. Ongoing work and future directions
3 Open Source CFD International Conference 2009
General elements on Code_Saturne
4 Open Source CFD International Conference 2009
Code_Saturne: main capabilities
Physical modelling
Single-phase laminar and turbulent flows: k-ε, k-ω SST, v2f, RSM, LES
Radiative heat transfer (DOM, P-1)
Combustion of coal, gas and heavy fuel oil (EBU, pdf, LWP)
Electric arc and Joule effect
Lagrangian module for dispersed particle tracking
Atmospheric flows (aka Mercure_Saturne)
Specific engineering module for cooling towers
ALE method for deformable meshes
Conjugate heat transfer (SYRTHES & 1D)
Common structure with NEPTUNE_CFD for Eulerian multiphase flows
Flexibility
Portability (UNIX, Linux and MacOS X)
Standalone GUI, also integrated in the SALOME platform
Parallel on distributed memory machines
Periodic boundaries (parallel, arbitrary interfaces)
Wide range of unstructured meshes with arbitrary interfaces
Code coupling capabilities (Code_Saturne/Code_Saturne, Code_Saturne/Code_Aster, ...)
5 Open Source CFD International Conference 2009
Code_Saturne: general features
Technology
Co-located finite volume, arbitrary unstructured meshes (polyhedral cells), predictor-corrector method
500 000 lines of code: 50% Fortran 90, 40% C, 10% Python
Development
1998: prototype (building on long EDF in-house experience: ESTET-ASTRID, N3S, ...)
2000: version 1.0 (basic modelling, wide range of meshes)
2001: Qualification for single phase nuclear thermal-hydraulic applications
2004: Version 1.1 (complex physics, LES, parallel computing)
2006: Version 1.2 (state of the art turbulence models, GUI)
2008: Version 1.3 (massively parallel, ALE, code coupling, ...)
Released as open source (GPL licence)
2008: development version 1.4 (parallel I/O, multigrid, atmospheric flows, cooling towers, ...)
2009: development version 2.0-beta (parallel mesh joining, code coupling, easy install & packaging, extended GUI)
Scheduled for industrial release at the beginning of 2010
Code_Saturne developed under Quality Assurance
6 Open Source CFD International Conference 2009
Code_Saturne subsystems (diagram)
Preprocessor: mesh import, mesh joining, periodicity, domain partitioning
Parallel Kernel: parallel mesh setup, CFD solver, parallel treatment, code coupling (with another Code_Saturne instance, SYRTHES, Code_Aster, the SALOME platform, ...)
GUI: produces the XML data file read by the kernel
Inputs and outputs: meshes, post-processing output, restart files
External libraries (EDF, LGPL):• BFT: Base Functions and Types (run-time environment, memory logging)• FVM: Finite Volume Mesh (parallel mesh management)• MEI: Mathematical Expression Interpreter
7 Open Source CFD International Conference 2009
Code_Saturne environment
Graphical User Interface
setting up of calculation parameters, stored in an XML file
interactive launch of calculations
some specific physics not yet covered by the GUI: advanced setup through Fortran user routines
Integration in the SALOME platform
extension of GUI capabilities, mouse selection of boundary zones
advanced user file management
from CAD to post-processing in one tool
8 Open Source CFD International Conference 2009
Allowable mesh examples
Example of mesh with stretched cells and hanging nodes
Example of composite mesh
3D polyhedral cells
PWR lower plenum
9 Open Source CFD International Conference 2009
Joining of non-conforming meshes
Arbitrary interfaces
Meshes may be contained in one single file or in several separate files, in any order
Arbitrary interfaces can be selected by mesh references
Caution must be exercised if arbitrary interfaces are used:
in critical regions, or with LES
with very different mesh refinements, or on curved CAD surfaces
Often used in ways detrimental to mesh quality, but a functionality we cannot do without as long as we do not have a proven alternative.
Joining of meshes built in several pieces may also be used to circumvent meshing tool memory limitations.
Periodicity is also constructed as an extension of mesh joining.
10 Open Source CFD International Conference 2009
Real-world performance of Code_Saturne
11 Open Source CFD International Conference 2009
Code_Saturne: features of note for HPC
Segregated solver: all variables are solved for independently, and coupling terms are treated explicitly
Diagonal-preconditioned CG is used for the pressure equation, Jacobi (or BiCGstab) for the other variables
More importantly, matrices have no block structure and are very sparse: typically 7 non-zeroes per row for hexahedra, 5 for tetrahedra
Indirect addressing and the absence of dense blocks mean fewer opportunities for MatVec optimization, as memory bandwidth matters as much as peak flops (see the sketch below)
Linear equation solvers usually amount to about 80% of CPU cost (dominated by the pressure step), gradient reconstruction to about 20%
The larger the mesh, the higher the relative cost of the pressure step
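To illustrate why such a matrix-vector product is bandwidth-bound, here is a minimal sketch of a sparse matrix-vector product in CSR form (hypothetical types and names, not the actual Code_Saturne kernels, which store the matrix face by face): with only 5 to 7 non-zeroes per row and no dense blocks, each multiply-add needs an indirect load of x[col], so memory traffic dominates over floating-point work.

    #include <stddef.h>

    /* Minimal CSR sparse matrix-vector product y = A.x (illustrative sketch;
     * the actual Code_Saturne kernels use a face-based matrix storage). */
    typedef struct {
      size_t        n_rows;
      const size_t *row_index;  /* size n_rows + 1 */
      const size_t *col_id;     /* size row_index[n_rows] */
      const double *val;        /* size row_index[n_rows] */
    } csr_matrix_t;

    void
    csr_mat_vec(const csr_matrix_t *a, const double *x, double *y)
    {
      for (size_t i = 0; i < a->n_rows; i++) {
        double s = 0.0;
        /* ~7 non-zeroes per row for hexahedra, ~5 for tetrahedra: each term
         * requires an indirect load of x[col_id[j]], so the loop is limited
         * by memory bandwidth rather than by peak flops. */
        for (size_t j = a->row_index[i]; j < a->row_index[i+1]; j++)
          s += a->val[j] * x[a->col_id[j]];
        y[i] = s;
      }
    }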
12 Open Source CFD International Conference 2009
Current performance (1/3)
2 LES test cases (most I/O factored out)
1 M cells: (n_cells_min + n_cells_max)/2 = 880 at 1024 cores, 109 at 8192
10 M cells: (n_cells_min + n_cells_max)/2 = 9345 at 1024 cores, 1150 at 8192
Figures: elapsed time vs. number of cores for the FATHER 1 M hexahedra LES test case and the HYPPI 10 M hexahedra LES test case, on Opteron + InfiniBand, Opteron + Myrinet, NovaScale and Blue Gene/L machines.
13 Open Source CFD International Conference 2009
Current performance (2/3)
RANS, 100 M tetrahedra + polyhedra (most I/O factored out)
Polyhedra due to mesh joinings may lead to higher load imbalance in the local MatVec at large core counts
96286/102242 min/max cells per core at 1024 cores, 11344/12781 at 8192 cores
Figure: elapsed time per iteration vs. number of cores for the FA grid RANS test case, on NovaScale and Blue Gene/L (CO and VN modes).
14 Open Source CFD International Conference 2009
Current performance (3/3)
Efficiency often goes through an optimum (due to better cache hit rates) before dropping (due to latency induced by parallel synchronization)
Example shown here: HYPPI (10 M cell LES test case)
Figure: parallel efficiency vs. number of MPI ranks, on the Chatou cluster, Tantale, Platine and Blue Gene machines.
15 Open Source CFD International Conference 2009
High Performance Computing with Code_Saturne
Code_Saturne is used extensively on HPC machines
in-house EDF clusters
CCRT calculation centre (CEA based)
EDF IBM Blue Gene machines (8 000 and 32 000 cores)
also run on MareNostrum (Barcelona Supercomputing Centre), Cray XT, …
Code_Saturne used as a reference in the PRACE European project
reference code for CFD benchmarks at 6 large European HPC centres
Code_Saturne obtained “gold medal” status for scalability from Daresbury Laboratory (UK) on the HPCx machine
16 Open Source CFD International Conference 2009
Example HPC applications: fuel assemblies
17 Open Source CFD International Conference 2009
Fuel Assembly Studies
Conflicting design goals
Good thermal mixing properties, requiring turbulent flow
Limit head loss
Limit vibrations
Fuel rods are held by dimples and springs, not welded, as they lengthen slightly over the years due to irradiation
Complex core geometry
Circa 150 to 250 fuel assemblies per core depending on reactor type, 8 to 10 grids per fuel assembly, 17x17 grid (mostly fuel rods, 24 guide tubes)
Geometry almost periodic, except for the mix of several fuel assembly types in a given core (reload by 1/3 or 1/4)
Inlet and wall conditions not periodic, heat production not uniform at fine scale
Why we study these flows
Deformation may lead to difficulties in core unload/reload
Turbulence-induced vibrations of fuel assemblies in PWR power plants are a potential cause of deformation and of fretting wear damage
These may lead to weeks or months of interruption of operations
18 Open Source CFD International Conference 2009
Prototype FA calculation with Code_Saturne
PWR nuclear reactor mixing grid mock-up (5x5), 100 million cells
Calculation run on 4 000 to 8 000 cores
Main issue is mesh generation
19 Open Source CFD International Conference 2009
LES simulation of reduced FA domain
Particular features for LES
SIMPLEC algorithm with Rhie and Chow interpolation
2nd order in time (Crank-Nicolson and Adams-Bashforth)
2nd order in space (fully centered, with sub-iterations for non-orthogonal faces)
Fully hexahedral mesh, 8 million cells
Boundary conditions
Implicit periodicity in the x and y directions
Constant inlet conditions
Wall functions where needed
Free outlet
Simulation
1 million time steps: 40 flow passes, 20 flow passes for averaging (no homogeneous direction)
CFL_max = 0.8 (dt = 5×10^-6 s)
Blue Gene/L system, 1024 processors
5 s per time step, i.e. about 1 week for 100 000 time steps
20 Open Source CFD International Conference 2009
Parallel implementation of Code_Saturne
21 Open Source CFD International Conference 2009
Base parallel operations (1/4)
Distributed memory parallelism using domain partitioning
The classical “ghost cell” method is used for both parallelism and periodicity
Most operations require only ghost cells sharing faces
Extended neighborhoods for gradients also require ghost cells sharing vertices
Global reductions (dot products) are also used, especially by the preconditioned conjugate gradient algorithm
Periodicity uses the same mechanism; vector and tensor rotation is also required (see the sketch below)
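A minimal sketch of the ghost-cell ("halo") exchange pattern, assuming a hypothetical data layout (the real halo structures live in the FVM library): each rank posts non-blocking receives for its ghost values, packs and sends the values of its own cells that appear on neighbouring partitions, then waits for completion.

    #include <stdlib.h>
    #include <mpi.h>

    /* Illustrative halo exchange for one cell-based variable.
     * Assumed layout: var[0..n_cells-1] holds local cells, followed by
     * ghost cells grouped by neighbouring rank. This is only a sketch of
     * the communication pattern, not the actual FVM implementation. */
    void
    halo_exchange(double      *var,
                  int          n_cells,
                  int          n_neighbors,
                  const int   *neighbor_rank,  /* [n_neighbors] */
                  const int   *send_index,     /* [n_neighbors+1] offsets into send_cell_id */
                  const int   *send_cell_id,   /* local ids of cells copied to neighbours */
                  const int   *recv_index,     /* [n_neighbors+1] offsets into ghost range */
                  MPI_Comm     comm)
    {
      MPI_Request *req = malloc(2 * n_neighbors * sizeof(MPI_Request));
      double *send_buf = malloc(send_index[n_neighbors] * sizeof(double));

      /* Post receives directly into the ghost section of the array. */
      for (int i = 0; i < n_neighbors; i++)
        MPI_Irecv(var + n_cells + recv_index[i],
                  recv_index[i+1] - recv_index[i], MPI_DOUBLE,
                  neighbor_rank[i], 0, comm, req + i);

      /* Pack and send the values of local cells adjacent to each neighbour. */
      for (int i = 0; i < n_neighbors; i++) {
        for (int j = send_index[i]; j < send_index[i+1]; j++)
          send_buf[j] = var[send_cell_id[j]];
        MPI_Isend(send_buf + send_index[i],
                  send_index[i+1] - send_index[i], MPI_DOUBLE,
                  neighbor_rank[i], 0, comm, req + n_neighbors + i);
      }

      MPI_Waitall(2 * n_neighbors, req, MPI_STATUSES_IGNORE);

      free(send_buf);
      free(req);
    }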
22 Open Source CFD International Conference 2009
Base parallel operations (2/4)
Use of global numbering
We associate a global number to each mesh entity
A specific C type (fvm_gnum_t) is used for this; currently an unsigned integer (usually 32-bit), but an unsigned long integer (64-bit) will become necessary
For hexahedral cells, the face-cell connectivity has size 4·n_faces, with n_faces ≈ 3·n_cells, hence a size of about 12·n_cells, so global numbers require 64 bits beyond roughly 350 million cells
The global number is currently equal to the initial (pre-partitioning) number
This allows partition-independent single-image files
Essential for restart files, also used for postprocessor output
Also used for legacy coupling where matches can be saved
23 Open Source CFD International Conference 2009
Base parallel operations (3/4)
Use of global numbering: redistribution over n blocks
n blocks ≤ n cores
A minimum block size may be set to avoid many small blocks (for some communication or usage schemes), or to force a single block (for I/O with non-parallel libraries)
In the future, using at most 1 of every p processors may improve MPI I/O performance if we use a smaller communicator (to be tested); a sketch of such a block distribution is given below
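A sketch of how a block distribution can be derived from global numbers alone (hypothetical helper names, not the actual FVM API): given a minimum block size and a rank step, every global number maps to exactly one block, so any rank can compute where an entity lives without communication.

    /* Illustrative block distribution computed from global numbers only
     * (hypothetical helper, not the actual FVM API). Entities with global
     * numbers 1..n_g are split into contiguous blocks of at least
     * min_block_size entities, assigned to every rank_step-th rank. */

    typedef struct {
      unsigned long long block_size;  /* entities per block */
      int                rank_step;   /* use 1 rank out of rank_step */
      int                n_blocks;
    } block_dist_t;

    block_dist_t
    block_dist_create(unsigned long long n_g, int n_ranks,
                      unsigned long long min_block_size, int rank_step)
    {
      block_dist_t d;
      int n_blocks_max = (n_ranks + rank_step - 1) / rank_step;

      d.rank_step  = rank_step;
      d.block_size = (n_g + n_blocks_max - 1) / n_blocks_max;
      if (d.block_size < min_block_size)
        d.block_size = min_block_size;      /* avoid many small blocks */
      d.n_blocks = (int)((n_g + d.block_size - 1) / d.block_size);
      return d;
    }

    /* Rank owning a given (1-based) global number. */
    int
    block_dist_rank(const block_dist_t *d, unsigned long long gnum)
    {
      return (int)((gnum - 1) / d->block_size) * d->rank_step;
    }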
24 Open Source CFD International Conference 2009
Base parallel operations (4/4)
Conversely, simply using global numbers allows reconstructing the mapping of equivalent entities on neighbouring partitions
Used for parallel ghost cell construction from an initially partitioned mesh with no ghost data
An arbitrary distribution is inefficient for halo exchange, but allows simpler data-structure-related algorithms with deterministic performance bounds
The owning processor is determined simply by the global number, and messages are aggregated
25 Open Source CFD International Conference 2009
Parallel I/O (1/2)
We prefer using single (partition-independent) files
Different stages or restarts of a calculation can easily be run on different machines or queues
Avoids having thousands or tens of thousands of files in a directory
Better transparency of parallelism for the user
MPI I/O is used when available
Uses block-to-partition exchange when reading, partition-to-block when writing
Use of indexed datatypes may be tested in the future, but will not be possible everywhere
Used for reading the preprocessor and partitioner output, as well as for restart files
These files use a unified binary format, consisting of a simple header and a succession of sections
The MPI I/O pattern is thus a succession of global reads (or local read + broadcast) for section headers, and collective reads of data (with a different portion for each rank)
We could switch to HDF5, but preferred a lighter model, and also avoid an extra dependency or dependency conflicts
Infrastructure in progress for postprocessor output: layered approach, as we allow multiple formats
26 Open Source CFD International Conference 2009
Parallel I/O (2/2)
Parallel I/O is only of benefit with parallel filesystems
Use of MPI I/O may be disabled either at build time, or for a given file using specific hints
Without MPI I/O, data for each block is written or read successively by rank 0, using the same FVM file API
Not much feedback yet, but initial results are disappointing
Similar performance with and without MPI I/O on at least 2 systems, whether using MPI_File_read/write_at_all or MPI_File_read/write_all
This needs retesting while forcing fewer processors into the MPI I/O communicator
Bugs encountered in several MPI I/O implementations
A sketch of the collective read pattern is given below.
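The sketch below illustrates the collective data read described above, assuming a hypothetical section layout (not the actual Code_Saturne file format code): after the section header is known to all ranks, each rank reads its own block of values with a collective call such as MPI_File_read_at_all, using explicit offsets so that the same partition-independent file can be read on any number of ranks.

    #include <mpi.h>

    /* Illustrative collective read of one data section (hypothetical layout,
     * not the actual Code_Saturne file format code). Each rank reads the
     * contiguous block of values it owns in the block distribution. */
    int
    read_section_block(MPI_File            fh,
                       MPI_Offset          section_data_offset, /* start of section data */
                       unsigned long long  block_start,  /* first value of this rank's block */
                       unsigned long long  block_size,   /* number of values in the block */
                       double             *block_vals)
    {
      MPI_Offset offset = section_data_offset
                        + (MPI_Offset)block_start * sizeof(double);

      /* Collective call: every rank of the file's communicator participates,
       * each with its own offset and count. */
      return MPI_File_read_at_all(fh, offset, block_vals, (int)block_size,
                                  MPI_DOUBLE, MPI_STATUS_IGNORE);
    }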
27 Open Source CFD International Conference 2009
Ongoing work and future directions
28 Open Source CFD International Conference 2009
Parallelization of mesh joining (2008-2009)
Parallelizing this algorithm requires the same main steps as the serial algorithm:
Detect intersections (within a given tolerance) between edges of overlapping faces
Uses a parallel octree for face bounding boxes, built in a bottom-up fashion (no balance condition required); see the bounding-box sketch below
Subdivide edges according to the inserted intersection vertices
Merge coincident or nearly-coincident vertices/intersections
This is the most complex step
It must be synchronized in parallel
The choice of merging criteria has a profound impact on the quality of the resulting mesh
Re-build sub-faces
With parallel mesh joining, the most memory-intensive serial preprocessing step is removed
We will add parallel mesh “append” within a few months (for version 2.1); this will allow generation of huge meshes even with serial meshing tools
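As a small illustration of the detection step, here is a sketch of the tolerance-inflated bounding-box overlap test that such an algorithm relies on (hypothetical helper, not the actual joining code): face bounding boxes are inflated by the joining tolerance before comparison, so that nearly-coincident faces remain candidate pairs for the detailed edge-intersection step.

    #include <stdbool.h>

    /* Illustrative overlap test between two face bounding boxes, each inflated
     * by a joining tolerance (hypothetical helper, not the actual joining code).
     * Boxes passing this test become candidate face pairs for edge intersection. */
    typedef struct {
      double min[3];
      double max[3];
    } bbox_t;

    bool
    bbox_overlap(const bbox_t *a, const bbox_t *b, double tolerance)
    {
      for (int i = 0; i < 3; i++) {
        if (a->max[i] + tolerance < b->min[i] - tolerance)
          return false;
        if (b->max[i] + tolerance < a->min[i] - tolerance)
          return false;
      }
      return true;
    }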
29 Open Source CFD International Conference 2009
Coupling of Code_Saturne with itself
Objectives
coupling of different models (RANS/LES)
fluid-structure interaction with large displacements
rotating machines
Two kinds of communication
data exchange at boundaries for interface coupling
volume forcing for overlapping regions
Still under development, but ...
data exchange already implemented in the FVM library
optimised localisation algorithm
compliance with parallel/parallel coupling
prototype versions with promising results
more work needed on conservativity at the exchange
first version adapted to pump modelling implemented in version 2.0: rotor/stator coupling, compares favourably with CFX
30 Open Source CFD International Conference 2009
Multigrid
Currently, multigrid coarsening does not cross processor boundaries
This implies that on p processors, the coarsest matrix may not contain fewer than p cells
With a high processor count, fewer grid levels will be used, and solving for the coarsest matrix may be significantly more expensive than with a low processor count
This reduces scalability, and may be checked (if suspected) using the solver summary info at the end of the log file
Planned solution: move grids to the nearest rank multiple of 4 or 8 when the mean local grid size is too small (see the sketch below)
The communication pattern is not expected to change too much, as partitioning is of a recursive nature and should already exhibit a “multigrid” structure
This may be less optimal than repartitioning at each level, but setup time should also remain much cheaper
This is important, as grids may be rebuilt at each time step
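A sketch of the planned rank-gathering rule, written as an assumption based on the description above rather than as implemented code: once the mean local grid size drops below a threshold, a coarse level is gathered onto every 4th (or 8th) rank, which keeps the communication pattern close to the original recursive partitioning.

    /* Illustrative rank-gathering rule for coarse multigrid levels (a sketch of
     * the planned behaviour described above, not the implemented code).
     * When the mean number of cells per active rank drops below a threshold,
     * the grid is gathered onto 1 rank out of merge_stride (e.g. 4 or 8). */
    int
    coarse_grid_target_rank(int                 rank,
                            int                 merge_stride,       /* e.g. 4 or 8 */
                            unsigned long long  n_g_cells,          /* global cells at this level */
                            int                 n_active_ranks,
                            unsigned long long  min_cells_per_rank)
    {
      if (n_g_cells / n_active_ranks >= min_cells_per_rank)
        return rank;  /* grid still large enough: keep it where it is */

      /* Otherwise, send local rows to the nearest lower rank multiple of
       * merge_stride; that rank owns the merged block at the next level. */
      return (rank / merge_stride) * merge_stride;
    }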
31 Open Source CFD International Conference 2009
Partitioning
We currently use METIS or SCOTCH, but should move to ParMETIS or PT-SCOTCH within a few months
The current infrastructure makes this quite easy
We have recently added a “backup” partitioning based on space-filling curves (see the sketch below)
We currently use the Z curve (from our octree construction for parallel joining), but appropriate changes in the coordinate comparison rules should allow switching to a Hilbert curve (reputed to lead to better partitioning)
This is fully parallel and deterministic
Performance on initial tests is about 20% worse on a single 10-million cell case on 256 processes
which is reasonable compared to unoptimized partitioning
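For reference, a sketch of the Z-curve (Morton) encoding on which such a space-filling-curve partitioning rests (illustrative only, not the actual implementation): cell centre coordinates are normalized over the domain bounding box, quantized on a uniform grid, and their bits interleaved; sorting cells by the resulting code and cutting the sorted list into equal chunks yields the partition.

    #include <stdint.h>

    /* Illustrative Morton (Z-order) code for a cell centre (not the actual
     * implementation). Coordinates are quantized on a 2^21 grid per axis. */

    static uint64_t
    spread_bits_3d(uint64_t x)   /* insert two zero bits between each bit of x */
    {
      x &= 0x1fffff;                              /* keep 21 bits */
      x = (x | (x << 32)) & 0x1f00000000ffffULL;
      x = (x | (x << 16)) & 0x1f0000ff0000ffULL;
      x = (x | (x <<  8)) & 0x100f00f00f00f00fULL;
      x = (x | (x <<  4)) & 0x10c30c30c30c30c3ULL;
      x = (x | (x <<  2)) & 0x1249249249249249ULL;
      return x;
    }

    uint64_t
    morton_code(const double xyz[3],      /* cell centre */
                const double box_min[3],  /* domain bounding box */
                const double box_max[3])
    {
      uint64_t c[3];
      for (int i = 0; i < 3; i++) {
        double t = (xyz[i] - box_min[i]) / (box_max[i] - box_min[i]);
        if (t < 0.0) t = 0.0;
        if (t > 1.0) t = 1.0;
        c[i] = (uint64_t)(t * 2097151.0);   /* 2^21 - 1 */
      }
      /* Interleave bits: sorting by this code follows the Z curve. */
      return spread_bits_3d(c[0])
           | (spread_bits_3d(c[1]) << 1)
           | (spread_bits_3d(c[2]) << 2);
    }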
32 Open Source CFD International Conference 2009
Tool chain evolution
Code_Saturne V1.3 (current production version) added many HPC-oriented improvements compared to prior versions:
Post-processor output handled by FVM / Kernel
Ghost cell construction handled by FVM / Kernel
Up to 40% gain in preprocessor peak memory compared to V1.2
Parallelized and scales (manages 2 ghost cell sets and multiple periodicities)
Well adapted up to 150 million cells (with 64 GB for preprocessing)
All fundamental limitations are pre-processing related
Version 2.0 separates partitioning from preprocessing
This also reduces their memory footprint a bit, moving newly parallelized operations to the kernel
Tool chain diagrams: in V1.3, meshes go through a serial Pre-Processor into the distributed Kernel + FVM run, which writes the post-processing output; in V2.0, a separate serial Partitioner is inserted between the Pre-Processor and the distributed run.
33 Open Source CFD International Conference 2009
Future direction: Hybrid MPI / OpenMP (1/2)
Currently, a pure MPI model is used: everything is parallel, and synchronization is explicit when required
On multiprocessor / multicore nodes, shared memory parallelism could also be used (using OpenMP directives)
Parallel sections must be marked, and parallel loops must avoid modifying the same values
Specific numberings must be used, similar to those used for vectorization, but with different constraints:
avoid false sharing, keep locality to limit cache misses (see the sketch below)
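A minimal sketch of what such loop-level parallelism could look like (illustrative, not the actual kernels): cell loops parallelize directly, while face loops that scatter contributions to both adjacent cells are assumed to have been renumbered into groups of faces that share no cell, so each group can run as an OpenMP parallel loop without write conflicts.

    /* Illustrative OpenMP parallelization of a face-based flux accumulation
     * (a sketch, not the actual Code_Saturne kernels). Faces are assumed to
     * have been renumbered into groups such that no two faces of the same
     * group touch the same cell, so the scatter to cell values is race-free. */
    void
    add_face_fluxes(int            n_groups,
                    const int     *group_index,   /* [n_groups+1] offsets into faces */
                    const int     *face_cells,    /* [2*n_faces]: cells adjacent to each face */
                    const double  *face_flux,     /* [n_faces] */
                    double        *rhs)           /* [n_cells], updated in place */
    {
      for (int g = 0; g < n_groups; g++) {
        /* Within one group, no two faces share a cell: the loop is safe. */
        #pragma omp parallel for
        for (int f = group_index[g]; f < group_index[g+1]; f++) {
          int c0 = face_cells[2*f];
          int c1 = face_cells[2*f + 1];
          rhs[c0] += face_flux[f];
          rhs[c1] -= face_flux[f];
        }
      }
    }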
34 Open Source CFD International Conference 2009
Future direction: Hybrid MPI / OpenMP (2/2)
Hybrid MPI / OpenMP is being tested
IBM is testing this on Blue Gene/P
Requires work on renumbering algorithms
OpenMP-only parallelism would ease packaging / installation on workstations
No dependency on MPI library choices (which are source- but not binary-compatible), only on the compiler runtime
Good enough for current multicore workstations
Coupling the code with itself or with SYRTHES 4 will still require MPI
The main goal is to allow MPI communicators of “only” tens of thousands of ranks on machines with 100 000 cores
Performance benefits are expected mainly at the very high end
Reduces the risk of medium-term issues with MPI_Alltoallv, used in I/O and parallelism-related data redistribution
Though sparse collective algorithms are the long-term solution for this specific issue
35 Open Source CFD International Conference 2009
Code_Saturne HPC roadmap (2003 - 2015)
Milestone applications:
2003, following the Civaux thermal fatigue event: computations help to better understand the wall thermal loading at an injection; knowing the root causes of the event allowed a new design to be defined to avoid this problem.
Part of a fuel assembly (3 grid assemblies): LES approach for turbulence modelling, mesh refined near the walls.
9 fuel assemblies: no experimental approach so far; will enable the study of side effects of the flow around neighbouring fuel assemblies, and a better understanding of vibration phenomena and rod fretting wear.
The whole reactor vessel.
Approximate computational requirements per milestone (cells / operations / machine / run time / storage / memory):
2003: 10^6 cells, 3×10^13 operations, Fujitsu VPP 5000 (1 of 4 vector processors), about 2 months, ~1 GB storage, 2 GB memory
2006: 10^7 cells, 6×10^14 operations, IBM Power5 cluster, 400 processors, 9 days, ~15 GB storage, 25 GB memory
2007: 10^8 cells, 10^16 operations, IBM Blue Gene/L “Frontier”, 8 000 processors, about 1 month, ~200 GB storage, 250 GB memory
2010: 10^9 cells, 3×10^17 operations, 30 times the power of Blue Gene/L “Frontier”, about 1 month, ~1 TB storage, 2.5 TB memory
2015: 10^10 cells, 5×10^18 operations, 500 times the power of Blue Gene/L “Frontier”, about 1 month, ~10 TB storage, 25 TB memory
The main limiting factors shift over time: power of the computer, then non-parallelized pre-processing, then mesh generation, solver scalability, and visualisation.
36 Open Source CFD International Conference 2009
Thank you for your attention!
37 Open Source CFD International Conference 2009
Additional Notes
38 Open Source CFD International Conference 2009
Load imbalance (1/3)
In this example, using 8 partitions (with METIS), we have the following local minima and maxima:
Cells: 416 / 440 (6% imbalance)
Cells + ghost cells: 469 / 519 (11% imbalance)
Interior faces: 852 / 946 (11% imbalance)
Most loops are on cells, but some are on cells + ghosts, and MatVec loops over cells + faces
39 Open Source CFD International Conference 2009
Load imbalance (2/3)
If load imbalance increases with processor count, scalability decreases
If load imbalance reaches a high value (say 30% to 50%) but does not increase, scalability is maintained, though some processor power is wasted
Perfect balancing is impossible to reach, as different loops show different imbalance levels, and synchronizations may be required between these loops
The preconditioned CG uses MatVec and dot products
Load imbalance might be reduced using weights for domain partitioning, with cell weight = 1 + f(n_faces); a sketch is given below
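A sketch of this weighting idea, with a hypothetical choice of f (the slide only suggests the form 1 + f(n_faces), so the function below is an assumption, not a tested setting): cells with more faces cost more in face-based loops and in the matrix-vector product, so they are given a larger partitioning weight.

    /* Illustrative cell weights for graph partitioning (hypothetical form of
     * the weighting function; only the idea "weight = 1 + f(n_faces)" comes
     * from the text above). Cells with more faces, e.g. polyhedra produced
     * by mesh joining, count for more in the balance. */
    void
    compute_cell_weights(int        n_cells,
                         const int *n_cell_faces,  /* faces per cell */
                         int       *weight)        /* output, e.g. METIS vertex weights */
    {
      const int n_faces_ref = 6;   /* hexahedral reference */

      for (int i = 0; i < n_cells; i++) {
        int extra = n_cell_faces[i] - n_faces_ref;
        if (extra < 0)
          extra = 0;
        weight[i] = 1 + extra;     /* weight = 1 + f(n_faces), here f = max(0, n_faces - 6) */
      }
    }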
40 Open Source CFD International Conference 2009
Load imbalance (3/3)
Another possible source of load imbalance is different cache miss rates on different ranks
Difficult to estimate a priori
With otherwise balanced loops, if one processor has a cache miss every 300 instructions and another one every 400 instructions, and considering that the cost of a cache miss is at least 100 instructions, the corresponding imbalance reaches 20%