Presentation of the open source CFD code Code_Saturne
HPC and CFD at EDF with Code_Saturne
Yvan Fournier, Jérôme Bonelle
EDF R&D, Fluid Dynamics, Power Generation and Environment Department
Open Source CFD International Conference Barcelona 2009
2 Open Source CFD International Conference 2009
Summary
1. General Elements on Code_Saturne
2. Real-world performance of Code_Saturne
3. Example applications: fuel assemblies
4. Parallel implementation of Code_Saturne
5. Ongoing work and future directions
3 Open Source CFD International Conference 2009
General elements on Code_Saturne
4 Open Source CFD International Conference 2009
Code_Saturne: main capabilities
Physical modelling
Single-phase laminar and turbulent flows: k-ε, k-ω SST, v2f, RSM, LES
Radiative heat transfer (DOM, P-1)
Combustion of coal, gas and heavy fuel oil (EBU, pdf, LWP)
Electric arc and Joule effect
Lagrangian module for dispersed particle tracking
Atmospheric flows (aka Mercure_Saturne)
Specific engineering module for cooling towers
ALE method for deformable meshes
Conjugate heat transfer (SYRTHES & 1D)
Common structure with NEPTUNE_CFD for Eulerian multiphase flows
Flexibility
Portability (UNIX, Linux and MacOS X)
Standalone GUI, also integrated in the SALOME platform
Parallel on distributed memory machines
Periodic boundaries (parallel, arbitrary interfaces)
Wide range of unstructured meshes with arbitrary interfaces
Code coupling capabilities (Code_Saturne/Code_Saturne, Code_Saturne/Code_Aster, ...)
5 Open Source CFD International Conference 2009
Code_Saturne: general features
Technology
Co-located finite volume, arbitrary unstructured meshes (polyhedral cells), predictor-corrector method
500 000 lines of code: 50% Fortran 90, 40% C, 10% Python
Development
1998: prototype (building on long EDF in-house experience: ESTET-ASTRID, N3S, ...)
2000: version 1.0 (basic modelling, wide range of meshes)
2001: Qualification for single phase nuclear thermal-hydraulic applications
2004: Version 1.1 (complex physics, LES, parallel computing)
2006: Version 1.2 (state of the art turbulence models, GUI)
2008: Version 1.3 (massively parallel, ALE, code coupling, ...)
Released as open source (GPL licence)
2008: development version 1.4 (parallel I/O, multigrid, atmospheric flows, cooling towers, ...)
2009: development version 2.0-beta (parallel mesh joining, code coupling, easy install & packaging, extended GUI)
Scheduled for industrial release at the beginning of 2010
Code_Saturne developed under Quality Assurance
6 Open Source CFD International Conference 2009
Code_Saturne subsystems (diagram)
Preprocessor: mesh import, mesh joining, periodicity, domain partitioning
Parallel Kernel: parallel mesh setup, CFD solver, parallel treatment, code coupling (with another Code_Saturne instance, SYRTHES, Code_Aster, the SALOME platform, ...)
GUI: produces the XML data file read by the kernel
Inputs and outputs: meshes, post-processing output, restart files
External libraries (EDF, LGPL):• BFT: Base Functions and Types (run-time environment, memory logging)• FVM: Finite Volume Mesh (parallel mesh management)• MEI: Mathematical Expression Interpreter
7 Open Source CFD International Conference 2009
Code_Saturne environment
Graphical User Interface
setting up of calculation parameters, stored in an XML file
interactive launch of calculations
some specific physics not yet covered by the GUI: advanced setup through Fortran user routines
Integration in the SALOME platform
extension of GUI capabilities, mouse selection of boundary zones
advanced user file management
from CAD to post-processing in one tool
8 Open Source CFD International Conference 2009
Allowable mesh examples
Example of mesh with stretched cells and hanging nodes
Example of composite mesh
3D polyhedral cells
PWR lower plenum
9 Open Source CFD International Conference 2009
Joining of non-conforming meshes
Arbitrary interfaces
Meshes may be contained in one single file or in several separate files, in any order
Arbitrary interfaces can be selected by mesh references
Caution must be exercised if arbitrary interfaces are used:
in critical regions, or with LES
with very different mesh refinements, or on curved CAD surfaces
Often used in ways detrimental to mesh quality, but a functionality we cannot do without as long as we do not have a proven alternative.
Joining of meshes built in several pieces may also be used to circumvent meshing tool memory limitations.
Periodicity is also constructed as an extension of mesh joining.
10 Open Source CFD International Conference 2009
Real-world performance of Code_Saturne
11 Open Source CFD International Conference 2009
Code_Saturne: features of note for HPC
Segregated solver: all variables are solved for independently, and coupling terms are treated explicitly
Diagonal-preconditioned CG is used for the pressure equation, Jacobi (or BiCGstab) for the other variables
More importantly, matrices have no block structure and are very sparse: typically 7 non-zeroes per row for hexahedra, 5 for tetrahedra
Indirect addressing and the absence of dense blocks mean fewer opportunities for MatVec optimization, as memory bandwidth matters as much as peak flops (see the sketch below)
Linear equation solvers usually amount to about 80% of CPU cost (dominated by the pressure step), gradient reconstruction to about 20%
The larger the mesh, the higher the relative cost of the pressure step
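To illustrate why such a matrix-vector product is bandwidth-bound, here is a minimal sketch of a sparse matrix-vector product in CSR form (hypothetical types and names, not the actual Code_Saturne kernels, which store the matrix face by face): with only 5 to 7 non-zeroes per row and no dense blocks, each multiply-add needs an indirect load of x[col], so memory traffic dominates over floating-point work.

    #include <stddef.h>

    /* Minimal CSR sparse matrix-vector product y = A.x (illustrative sketch;
     * the actual Code_Saturne kernels use a face-based matrix storage). */
    typedef struct {
      size_t        n_rows;
      const size_t *row_index;  /* size n_rows + 1 */
      const size_t *col_id;     /* size row_index[n_rows] */
      const double *val;        /* size row_index[n_rows] */
    } csr_matrix_t;

    void
    csr_mat_vec(const csr_matrix_t *a, const double *x, double *y)
    {
      for (size_t i = 0; i < a->n_rows; i++) {
        double s = 0.0;
        /* ~7 non-zeroes per row for hexahedra, ~5 for tetrahedra: each term
         * requires an indirect load of x[col_id[j]], so the loop is limited
         * by memory bandwidth rather than by peak flops. */
        for (size_t j = a->row_index[i]; j < a->row_index[i+1]; j++)
          s += a->val[j] * x[a->col_id[j]];
        y[i] = s;
      }
    }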
12 Open Source CFD International Conference 2009
Current performance (1/3)
2 LES test cases (most I/O factored out)
1 M cells: (n_cells_min + n_cells_max)/2 = 880 at 1024 cores, 109 at 8192
10 M cells: (n_cells_min + n_cells_max)/2 = 9345 at 1024 cores, 1150 at 8192
Figures: elapsed time vs. number of cores for the FATHER 1 M hexahedra LES test case and the HYPPI 10 M hexahedra LES test case, on Opteron + InfiniBand, Opteron + Myrinet, NovaScale and Blue Gene/L machines.
13 Open Source CFD International Conference 2009
Current performance (2/3)
RANS, 100 M tetrahedra + polyhedra (most I/O factored out)
Polyhedra due to mesh joinings may lead to higher load imbalance in the local MatVec at large core counts
96286/102242 min/max cells per core at 1024 cores, 11344/12781 at 8192 cores
Figure: elapsed time per iteration vs. number of cores for the FA grid RANS test case, on NovaScale and Blue Gene/L (CO and VN modes).
14 Open Source CFD International Conference 2009
Current performance (3/3)
Efficiency often goes through an optimum (due to better cache hit rates) before dropping (due to latency induced by parallel synchronization)
Example shown here: HYPPI (10 M cell LES test case)
Figure: parallel efficiency vs. number of MPI ranks, on the Chatou cluster, Tantale, Platine and Blue Gene machines.
15 Open Source CFD International Conference 2009
High Performance Computing with Code_Saturne
Code_Saturne is used extensively on HPC machines
in-house EDF clusters
CCRT calculation centre (CEA based)
EDF IBM Blue Gene machines (8 000 and 32 000 cores)
also run on MareNostrum (Barcelona Supercomputing Centre), Cray XT, …
Code_Saturne used as a reference in the PRACE European project
reference code for CFD benchmarks at 6 large European HPC centres
Code_Saturne obtained “gold medal” status for scalability from Daresbury Laboratory (UK) on the HPCx machine
16 Open Source CFD International Conference 2009
Example HPC applications: fuel assemblies
17 Open Source CFD International Conference 2009
Fuel Assembly Studies
Conflicting design goals
Good thermal mixing properties, requiring turbulent flow
Limit head loss
Limit vibrations
Fuel rods are held by dimples and springs, not welded, as they lengthen slightly over the years due to irradiation
Complex core geometry
Circa 150 to 250 fuel assemblies per core depending on reactor type, 8 to 10 grids per fuel assembly, 17x17 grid (mostly fuel rods, 24 guide tubes)
Geometry almost periodic, except for the mix of several fuel assembly types in a given core (reload by 1/3 or 1/4)
Inlet and wall conditions not periodic, heat production not uniform at fine scale
Why we study these flows
Deformation may lead to difficulties in core unload/reload
Turbulence-induced vibrations of fuel assemblies in PWR power plants are a potential cause of deformation and of fretting wear damage
These may lead to weeks or months of interruption of operations
18 Open Source CFD International Conference 2009
Prototype FA calculation with Code_Saturne
PWR nuclear reactor mixing grid mock-up (5x5), 100 million cells
Calculation run on 4 000 to 8 000 cores
Main issue is mesh generation
19 Open Source CFD International Conference 2009
LES simulation of reduced FA domain
Particular features for LES
SIMPLEC algorithm with Rhie and Chow interpolation
2nd order in time (Crank-Nicolson and Adams-Bashforth)
2nd order in space (fully centered, with sub-iterations for non-orthogonal faces)
Fully hexahedral mesh, 8 million cells
Boundary conditions
Implicit periodicity in the x and y directions
Constant inlet conditions
Wall functions where needed
Free outlet
Simulation
1 million time steps: 40 flow passes, 20 flow passes for averaging (no homogeneous direction)
CFL_max = 0.8 (dt = 5×10^-6 s)
Blue Gene/L system, 1024 processors
5 s per time step, i.e. about 1 week for 100 000 time steps
20 Open Source CFD International Conference 2009
Parallel implementation of Code_Saturne
21 Open Source CFD International Conference 2009
Base parallel operations (1/4)
Distributed memory parallelism using domain partitioning
The classical “ghost cell” method is used for both parallelism and periodicity
Most operations require only ghost cells sharing faces
Extended neighborhoods for gradients also require ghost cells sharing vertices
Global reductions (dot products) are also used, especially by the preconditioned conjugate gradient algorithm
Periodicity uses the same mechanism; vector and tensor rotation is also required (see the sketch below)
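A minimal sketch of the ghost-cell ("halo") exchange pattern, assuming a hypothetical data layout (the real halo structures live in the FVM library): each rank posts non-blocking receives for its ghost values, packs and sends the values of its own cells that appear on neighbouring partitions, then waits for completion.

    #include <stdlib.h>
    #include <mpi.h>

    /* Illustrative halo exchange for one cell-based variable.
     * Assumed layout: var[0..n_cells-1] holds local cells, followed by
     * ghost cells grouped by neighbouring rank. This is only a sketch of
     * the communication pattern, not the actual FVM implementation. */
    void
    halo_exchange(double      *var,
                  int          n_cells,
                  int          n_neighbors,
                  const int   *neighbor_rank,  /* [n_neighbors] */
                  const int   *send_index,     /* [n_neighbors+1] offsets into send_cell_id */
                  const int   *send_cell_id,   /* local ids of cells copied to neighbours */
                  const int   *recv_index,     /* [n_neighbors+1] offsets into ghost range */
                  MPI_Comm     comm)
    {
      MPI_Request *req = malloc(2 * n_neighbors * sizeof(MPI_Request));
      double *send_buf = malloc(send_index[n_neighbors] * sizeof(double));

      /* Post receives directly into the ghost section of the array. */
      for (int i = 0; i < n_neighbors; i++)
        MPI_Irecv(var + n_cells + recv_index[i],
                  recv_index[i+1] - recv_index[i], MPI_DOUBLE,
                  neighbor_rank[i], 0, comm, req + i);

      /* Pack and send the values of local cells adjacent to each neighbour. */
      for (int i = 0; i < n_neighbors; i++) {
        for (int j = send_index[i]; j < send_index[i+1]; j++)
          send_buf[j] = var[send_cell_id[j]];
        MPI_Isend(send_buf + send_index[i],
                  send_index[i+1] - send_index[i], MPI_DOUBLE,
                  neighbor_rank[i], 0, comm, req + n_neighbors + i);
      }

      MPI_Waitall(2 * n_neighbors, req, MPI_STATUSES_IGNORE);

      free(send_buf);
      free(req);
    }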
22 Open Source CFD International Conference 2009
Base parallel operations (2/4)
Use of global numbering
We associate a global number to each mesh entity
A specific C type (fvm_gnum_t) is used for this; currently an unsigned integer (usually 32-bit), but an unsigned long integer (64-bit) will become necessary
For hexahedral cells, the face-cell connectivity has size 4·n_faces, with n_faces ≈ 3·n_cells, hence a size of about 12·n_cells, so global numbers require 64 bits beyond roughly 350 million cells
The global number is currently equal to the initial (pre-partitioning) number
This allows partition-independent single-image files
Essential for restart files, also used for postprocessor output
Also used for legacy coupling where matches can be saved
23 Open Source CFD International Conference 2009
Base parallel operations (3/4)
Use of global numbering: redistribution over n blocks
n blocks ≤ n cores
A minimum block size may be set to avoid many small blocks (for some communication or usage schemes), or to force a single block (for I/O with non-parallel libraries)
In the future, using at most 1 of every p processors may improve MPI I/O performance if we use a smaller communicator (to be tested); a sketch of such a block distribution is given below
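A sketch of how a block distribution can be derived from global numbers alone (hypothetical helper names, not the actual FVM API): given a minimum block size and a rank step, every global number maps to exactly one block, so any rank can compute where an entity lives without communication.

    /* Illustrative block distribution computed from global numbers only
     * (hypothetical helper, not the actual FVM API). Entities with global
     * numbers 1..n_g are split into contiguous blocks of at least
     * min_block_size entities, assigned to every rank_step-th rank. */

    typedef struct {
      unsigned long long block_size;  /* entities per block */
      int                rank_step;   /* use 1 rank out of rank_step */
      int                n_blocks;
    } block_dist_t;

    block_dist_t
    block_dist_create(unsigned long long n_g, int n_ranks,
                      unsigned long long min_block_size, int rank_step)
    {
      block_dist_t d;
      int n_blocks_max = (n_ranks + rank_step - 1) / rank_step;

      d.rank_step  = rank_step;
      d.block_size = (n_g + n_blocks_max - 1) / n_blocks_max;
      if (d.block_size < min_block_size)
        d.block_size = min_block_size;      /* avoid many small blocks */
      d.n_blocks = (int)((n_g + d.block_size - 1) / d.block_size);
      return d;
    }

    /* Rank owning a given (1-based) global number. */
    int
    block_dist_rank(const block_dist_t *d, unsigned long long gnum)
    {
      return (int)((gnum - 1) / d->block_size) * d->rank_step;
    }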
24 Open Source CFD International Conference 2009
Base parallel operations (4/4)
Conversely, simply using global numbers allows reconstructing the mapping of equivalent entities on neighbouring partitions
Used for parallel ghost cell construction from an initially partitioned mesh with no ghost data
An arbitrary distribution is inefficient for halo exchange, but allows simpler data-structure-related algorithms with deterministic performance bounds
The owning processor is determined simply by the global number, and messages are aggregated
25 Open Source CFD International Conference 2009
Parallel I/O (1/2)
We prefer using single (partition-independent) files
Different stages or restarts of a calculation can easily be run on different machines or queues
Avoids having thousands or tens of thousands of files in a directory
Better transparency of parallelism for the user
MPI I/O is used when available
Uses block-to-partition exchange when reading, partition-to-block when writing
Use of indexed datatypes may be tested in the future, but will not be possible everywhere
Used for reading the preprocessor and partitioner output, as well as for restart files
These files use a unified binary format, consisting of a simple header and a succession of sections
The MPI I/O pattern is thus a succession of global reads (or local read + broadcast) for section headers, and collective reads of data (with a different portion for each rank)
We could switch to HDF5, but preferred a lighter model, and also avoid an extra dependency or dependency conflicts
Infrastructure in progress for postprocessor output: layered approach, as we allow multiple formats
26 Open Source CFD International Conference 2009
Parallel I/O (2/2)
Parallel I/O is only of benefit with parallel filesystems
Use of MPI I/O may be disabled either at build time, or for a given file using specific hints
Without MPI I/O, data for each block is written or read successively by rank 0, using the same FVM file API
Not much feedback yet, but initial results are disappointing
Similar performance with and without MPI I/O on at least 2 systems, whether using MPI_File_read/write_at_all or MPI_File_read/write_all
This needs retesting while forcing fewer processors into the MPI I/O communicator
Bugs encountered in several MPI I/O implementations
A sketch of the collective read pattern is given below.
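The sketch below illustrates the collective data read described above, assuming a hypothetical section layout (not the actual Code_Saturne file format code): after the section header is known to all ranks, each rank reads its own block of values with a collective call such as MPI_File_read_at_all, using explicit offsets so that the same partition-independent file can be read on any number of ranks.

    #include <mpi.h>

    /* Illustrative collective read of one data section (hypothetical layout,
     * not the actual Code_Saturne file format code). Each rank reads the
     * contiguous block of values it owns in the block distribution. */
    int
    read_section_block(MPI_File            fh,
                       MPI_Offset          section_data_offset, /* start of section data */
                       unsigned long long  block_start,  /* first value of this rank's block */
                       unsigned long long  block_size,   /* number of values in the block */
                       double             *block_vals)
    {
      MPI_Offset offset = section_data_offset
                        + (MPI_Offset)block_start * sizeof(double);

      /* Collective call: every rank of the file's communicator participates,
       * each with its own offset and count. */
      return MPI_File_read_at_all(fh, offset, block_vals, (int)block_size,
                                  MPI_DOUBLE, MPI_STATUS_IGNORE);
    }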
27 Open Source CFD International Conference 2009
Ongoing work and future directions
28 Open Source CFD International Conference 2009
Parallelization of mesh joining (2008-2009)
Parallelizing this algorithm requires the same main steps as the serial algorithm:
Detect intersections (within a given tolerance) between edges of overlapping faces
Uses a parallel octree for face bounding boxes, built in a bottom-up fashion (no balance condition required); see the bounding-box sketch below
Subdivide edges according to the inserted intersection vertices
Merge coincident or nearly-coincident vertices/intersections
This is the most complex step
It must be synchronized in parallel
The choice of merging criteria has a profound impact on the quality of the resulting mesh
Re-build sub-faces
With parallel mesh joining, the most memory-intensive serial preprocessing step is removed
We will add parallel mesh “append” within a few months (for version 2.1); this will allow generation of huge meshes even with serial meshing tools
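As a small illustration of the detection step, here is a sketch of the tolerance-inflated bounding-box overlap test that such an algorithm relies on (hypothetical helper, not the actual joining code): face bounding boxes are inflated by the joining tolerance before comparison, so that nearly-coincident faces remain candidate pairs for the detailed edge-intersection step.

    #include <stdbool.h>

    /* Illustrative overlap test between two face bounding boxes, each inflated
     * by a joining tolerance (hypothetical helper, not the actual joining code).
     * Boxes passing this test become candidate face pairs for edge intersection. */
    typedef struct {
      double min[3];
      double max[3];
    } bbox_t;

    bool
    bbox_overlap(const bbox_t *a, const bbox_t *b, double tolerance)
    {
      for (int i = 0; i < 3; i++) {
        if (a->max[i] + tolerance < b->min[i] - tolerance)
          return false;
        if (b->max[i] + tolerance < a->min[i] - tolerance)
          return false;
      }
      return true;
    }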
29 Open Source CFD International Conference 2009
Coupling of Code_Saturne with itself
Objectives
coupling of different models (RANS/LES)
fluid-structure interaction with large displacements
rotating machines
Two kinds of communication
data exchange at boundaries for interface coupling
volume forcing for overlapping regions
Still under development, but ...
data exchange already implemented in the FVM library
optimised localisation algorithm
compliance with parallel/parallel coupling
prototype versions with promising results
more work needed on conservativity at the exchange
first version adapted to pump modelling implemented in version 2.0: rotor/stator coupling, compares favourably with CFX
30 Open Source CFD International Conference 2009
Multigrid
Currently, multigrid coarsening does not cross processor boundaries
This implies that on p processors, the coarsest matrix may not contain fewer than p cells
With a high processor count, fewer grid levels will be used, and solving for the coarsest matrix may be significantly more expensive than with a low processor count
This reduces scalability, and may be checked (if suspected) using the solver summary info at the end of the log file
Planned solution: move grids to the nearest rank multiple of 4 or 8 when the mean local grid size is too small (see the sketch below)
The communication pattern is not expected to change too much, as partitioning is of a recursive nature and should already exhibit a “multigrid” structure
This may be less optimal than repartitioning at each level, but setup time should also remain much cheaper
This is important, as grids may be rebuilt at each time step
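A sketch of the planned rank-gathering rule, written as an assumption based on the description above rather than as implemented code: once the mean local grid size drops below a threshold, a coarse level is gathered onto every 4th (or 8th) rank, which keeps the communication pattern close to the original recursive partitioning.

    /* Illustrative rank-gathering rule for coarse multigrid levels (a sketch of
     * the planned behaviour described above, not the implemented code).
     * When the mean number of cells per active rank drops below a threshold,
     * the grid is gathered onto 1 rank out of merge_stride (e.g. 4 or 8). */
    int
    coarse_grid_target_rank(int                 rank,
                            int                 merge_stride,       /* e.g. 4 or 8 */
                            unsigned long long  n_g_cells,          /* global cells at this level */
                            int                 n_active_ranks,
                            unsigned long long  min_cells_per_rank)
    {
      if (n_g_cells / n_active_ranks >= min_cells_per_rank)
        return rank;  /* grid still large enough: keep it where it is */

      /* Otherwise, send local rows to the nearest lower rank multiple of
       * merge_stride; that rank owns the merged block at the next level. */
      return (rank / merge_stride) * merge_stride;
    }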
31 Open Source CFD International Conference 2009
Partitioning
We currently use METIS or SCOTCH, but should move to ParMETIS or PT-SCOTCH within a few months
The current infrastructure makes this quite easy
We have recently added a “backup” partitioning based on space-filling curves (see the sketch below)
We currently use the Z curve (from our octree construction for parallel joining), but appropriate changes in the coordinate comparison rules should allow switching to a Hilbert curve (reputed to lead to better partitioning)
This is fully parallel and deterministic
Performance on initial tests is about 20% worse on a single 10-million cell case on 256 processes
which is reasonable compared to unoptimized partitioning
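For reference, a sketch of the Z-curve (Morton) encoding on which such a space-filling-curve partitioning rests (illustrative only, not the actual implementation): cell centre coordinates are normalized over the domain bounding box, quantized on a uniform grid, and their bits interleaved; sorting cells by the resulting code and cutting the sorted list into equal chunks yields the partition.

    #include <stdint.h>

    /* Illustrative Morton (Z-order) code for a cell centre (not the actual
     * implementation). Coordinates are quantized on a 2^21 grid per axis. */

    static uint64_t
    spread_bits_3d(uint64_t x)   /* insert two zero bits between each bit of x */
    {
      x &= 0x1fffff;                              /* keep 21 bits */
      x = (x | (x << 32)) & 0x1f00000000ffffULL;
      x = (x | (x << 16)) & 0x1f0000ff0000ffULL;
      x = (x | (x <<  8)) & 0x100f00f00f00f00fULL;
      x = (x | (x <<  4)) & 0x10c30c30c30c30c3ULL;
      x = (x | (x <<  2)) & 0x1249249249249249ULL;
      return x;
    }

    uint64_t
    morton_code(const double xyz[3],      /* cell centre */
                const double box_min[3],  /* domain bounding box */
                const double box_max[3])
    {
      uint64_t c[3];
      for (int i = 0; i < 3; i++) {
        double t = (xyz[i] - box_min[i]) / (box_max[i] - box_min[i]);
        if (t < 0.0) t = 0.0;
        if (t > 1.0) t = 1.0;
        c[i] = (uint64_t)(t * 2097151.0);   /* 2^21 - 1 */
      }
      /* Interleave bits: sorting by this code follows the Z curve. */
      return spread_bits_3d(c[0])
           | (spread_bits_3d(c[1]) << 1)
           | (spread_bits_3d(c[2]) << 2);
    }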
32 Open Source CFD International Conference 2009
Tool chain evolution
Code_Saturne V1.3 (current production version) added many HPC-oriented improvements compared to prior versions:
Post-processor output handled by FVM / Kernel
Ghost cell construction handled by FVM / Kernel
Up to 40% gain in preprocessor peak memory compared to V1.2
Parallelized and scales (manages 2 ghost cell sets and multiple periodicities)
Well adapted up to 150 million cells (with 64 GB for preprocessing)
All fundamental limitations are pre-processing related
Version 2.0 separates partitioning from preprocessing
This also reduces their memory footprint a bit, moving newly parallelized operations to the kernel
Tool chain diagrams: in V1.3, meshes go through a serial Pre-Processor into the distributed Kernel + FVM run, which writes the post-processing output; in V2.0, a separate serial Partitioner is inserted between the Pre-Processor and the distributed run.
33 Open Source CFD International Conference 2009
Future direction: Hybrid MPI / OpenMP (1/2)
Currently, a pure MPI model is used: everything is parallel, and synchronization is explicit when required
On multiprocessor / multicore nodes, shared memory parallelism could also be used (using OpenMP directives)
Parallel sections must be marked, and parallel loops must avoid modifying the same values
Specific numberings must be used, similar to those used for vectorization, but with different constraints:
avoid false sharing, keep locality to limit cache misses (see the sketch below)
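A minimal sketch of what such loop-level parallelism could look like (illustrative, not the actual kernels): cell loops parallelize directly, while face loops that scatter contributions to both adjacent cells are assumed to have been renumbered into groups of faces that share no cell, so each group can run as an OpenMP parallel loop without write conflicts.

    /* Illustrative OpenMP parallelization of a face-based flux accumulation
     * (a sketch, not the actual Code_Saturne kernels). Faces are assumed to
     * have been renumbered into groups such that no two faces of the same
     * group touch the same cell, so the scatter to cell values is race-free. */
    void
    add_face_fluxes(int            n_groups,
                    const int     *group_index,   /* [n_groups+1] offsets into faces */
                    const int     *face_cells,    /* [2*n_faces]: cells adjacent to each face */
                    const double  *face_flux,     /* [n_faces] */
                    double        *rhs)           /* [n_cells], updated in place */
    {
      for (int g = 0; g < n_groups; g++) {
        /* Within one group, no two faces share a cell: the loop is safe. */
        #pragma omp parallel for
        for (int f = group_index[g]; f < group_index[g+1]; f++) {
          int c0 = face_cells[2*f];
          int c1 = face_cells[2*f + 1];
          rhs[c0] += face_flux[f];
          rhs[c1] -= face_flux[f];
        }
      }
    }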
34 Open Source CFD International Conference 2009
Future direction: Hybrid MPI / OpenMP (2/2)
Hybrid MPI / OpenMP is being tested
IBM is testing this on Blue Gene/P
Requires work on renumbering algorithms
OpenMP-only parallelism would ease packaging / installation on workstations
No dependency on MPI library choices (which are source- but not binary-compatible), only on the compiler runtime
Good enough for current multicore workstations
Coupling the code with itself or with SYRTHES 4 will still require MPI
The main goal is to allow MPI communicators of “only” tens of thousands of ranks on machines with 100 000 cores
Performance benefits are expected mainly at the very high end
Reduces the risk of medium-term issues with MPI_Alltoallv, used in I/O and parallelism-related data redistribution
Though sparse collective algorithms are the long-term solution for this specific issue
35 Open Source CFD International Conference 2009
Code_Saturne HPC roadmap (2003 - 2015)
Milestone applications:
2003, following the Civaux thermal fatigue event: computations help to better understand the wall thermal loading at an injection; knowing the root causes of the event allowed a new design to be defined to avoid this problem.
Part of a fuel assembly (3 grid assemblies): LES approach for turbulence modelling, mesh refined near the walls.
9 fuel assemblies: no experimental approach so far; will enable the study of side effects of the flow around neighbouring fuel assemblies, and a better understanding of vibration phenomena and rod fretting wear.
The whole reactor vessel.
Approximate computational requirements per milestone (cells / operations / machine / run time / storage / memory):
2003: 10^6 cells, 3×10^13 operations, Fujitsu VPP 5000 (1 of 4 vector processors), about 2 months, ~1 GB storage, 2 GB memory
2006: 10^7 cells, 6×10^14 operations, IBM Power5 cluster, 400 processors, 9 days, ~15 GB storage, 25 GB memory
2007: 10^8 cells, 10^16 operations, IBM Blue Gene/L “Frontier”, 8 000 processors, about 1 month, ~200 GB storage, 250 GB memory
2010: 10^9 cells, 3×10^17 operations, 30 times the power of Blue Gene/L “Frontier”, about 1 month, ~1 TB storage, 2.5 TB memory
2015: 10^10 cells, 5×10^18 operations, 500 times the power of Blue Gene/L “Frontier”, about 1 month, ~10 TB storage, 25 TB memory
The main limiting factors shift over time: power of the computer, then non-parallelized pre-processing, then mesh generation, solver scalability, and visualisation.
36 Open Source CFD International Conference 2009
Thank you for your attention!
37 Open Source CFD International Conference 2009
Additional Notes
38 Open Source CFD International Conference 2009
Load imbalance (1/3)
In this example, using 8 partitions (with METIS), we have the following local minima and maxima:
Cells: 416 / 440 (6% imbalance)
Cells + ghost cells: 469 / 519 (11% imbalance)
Interior faces: 852 / 946 (11% imbalance)
Most loops are on cells, but some are on cells + ghosts, and MatVec loops over cells + faces
39 Open Source CFD International Conference 2009
Load imbalance (2/3)
If load imbalance increases with processor count, scalability decreases
If load imbalance reaches a high value (say 30% to 50%) but does not increase, scalability is maintained, though some processor power is wasted
Perfect balancing is impossible to reach, as different loops show different imbalance levels, and synchronizations may be required between these loops
The preconditioned CG uses MatVec and dot products
Load imbalance might be reduced using weights for domain partitioning, with cell weight = 1 + f(n_faces); a sketch is given below
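A sketch of this weighting idea, with a hypothetical choice of f (the slide only suggests the form 1 + f(n_faces), so the function below is an assumption, not a tested setting): cells with more faces cost more in face-based loops and in the matrix-vector product, so they are given a larger partitioning weight.

    /* Illustrative cell weights for graph partitioning (hypothetical form of
     * the weighting function; only the idea "weight = 1 + f(n_faces)" comes
     * from the text above). Cells with more faces, e.g. polyhedra produced
     * by mesh joining, count for more in the balance. */
    void
    compute_cell_weights(int        n_cells,
                         const int *n_cell_faces,  /* faces per cell */
                         int       *weight)        /* output, e.g. METIS vertex weights */
    {
      const int n_faces_ref = 6;   /* hexahedral reference */

      for (int i = 0; i < n_cells; i++) {
        int extra = n_cell_faces[i] - n_faces_ref;
        if (extra < 0)
          extra = 0;
        weight[i] = 1 + extra;     /* weight = 1 + f(n_faces), here f = max(0, n_faces - 6) */
      }
    }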
40 Open Source CFD International Conference 2009
Load imbalance (3/3)
Another possible source of load imbalance is different cache miss rates on different ranks
Difficult to estimate a priori
With otherwise balanced loops, if one processor has a cache miss every 300 instructions and another one every 400 instructions, and considering that the cost of a cache miss is at least 100 instructions, the corresponding imbalance reaches 20%