Development of a Navier-Stokes solver for Direct Numerical Simulations of Isotropic Turbulence using the 2DECOMP library
Nuno Miguel Viana Rodrigues
Thesis to obtain the Master of Science Degree in
Mechanical Engineering
Supervisors: Prof. Carlos Frederico Neves Bettencourt da Silva
Dr. Pedro Manuel da Silva Cardoso Isidro Valente
Examination Committee
Chairperson: Prof. Viriato Sérgio de Almeida Semião
Supervisor: Prof. Carlos Frederico Neves Bettencourt da Silva
Member of the Committee: Prof. Luís Rego da Cunha Eça
July 2015
“Cheshire Cat: You may have noticed that I’m not all there myself.”
In “Alice in Wonderland” by Lewis Carroll.
Acknowledgements
I would like to thank my Family for the enormous amount of patience and sacrifice,
for all the arguments won and lost, and for pressing me onwards towards the completion of
this work and my course.
I would like to thank Dr. Carlos Silva, Dr. Reis and Dr. Valente for their contributions, in the order in which I interacted with them over the course of this Thesis, and to thank them for their time, patience and support. Each of them helped me greatly: by providing me with a workplace, a goal, and the basics of MPI, and, in the case of Dr. Valente, by pushing me forward towards better results.
I would like to thank my Father, who never saw the completion of this work and of my course, for his love, his patience, his influence in my choice of career, and his support until his passing. I thank my Mother for her insistence on the completion of the course and on progress towards a higher education. I thank my Brother for being there when I needed him. I dedicate this work to my family, both those present and those who have left, while I took this voyage towards goals not yet set.
I also take a moment to thank the LASEF team for their friendship over the years I have shared with them.
Abstract
This Thesis deals with the implementation of the Pseudo-Spectral Method for computationally solving the Navier-Stokes equations, using the 2DECOMP parallel library for a pencil memory arrangement, providing the capacity for distributed parallel calculation. The goals of this thesis are to enable massive simulations, ensure solver portability, optimize the solver, ensure its scalability, and perform trial runs on established international clusters. The 2DECOMP library is itself portable, and great care was taken to ensure that the MPI solver could likewise be ported from platform to platform. This culminated in a set of successful tests run through PRACE on the MareNostrum III supercomputer. Optimization steps were evaluated several times across the various intermediate versions of the code using the VAMPIR profiling tools, resulting in near-ideal behaviour when increasing both the mesh size and the number of processors involved, until the communication effort began to degrade the solver's performance.
Resumo
This Thesis deals with the implementation of the Pseudo-Spectral Method for computationally solving the Navier-Stokes equations, using the 2DECOMP parallel library for a pencil memory arrangement, providing distributed parallel computing capacity. The objectives of this thesis are to enable massive numerical simulations, ensure the portability of the solver, ensure its scalability, and run tests on international clusters. The 2DECOMP library is itself portable, and great care was taken to ensure that the MPI version would also be portable from platform to platform. This work culminated in a set of tests run through PRACE on the MareNostrum III supercomputer. Optimization steps were evaluated repeatedly across the various intermediate versions of this code using VAMPIR, a profiling tool, resulting in near-ideal behaviour as the mesh and the number of processors increased, until the communication cost began to affect the code's performance.
Table of Contents

Acknowledgements
Abstract
Resumo
Table of Contents
List of Figures
List of Tables
List of Symbols
List of Abbreviations
Chapter 1 – Introduction
    1.1 Numerical Simulation of Flows
        1.1.1 Navier-Stokes Equations
        1.1.2 Turbulence
        1.1.3 The Energy Cascade
        1.1.4 The Kolmogorov Micro Scale
        1.1.5 The Computational Cost
        1.1.6 Parallelization Schemes
Chapter 2 – Background
    2.1 Pseudo-Spectral Method
        2.1.1 Discrete Fourier Transform
        2.1.2 The Navier-Stokes Equations in the Fourier Space
        2.1.3 Spectral Space Navier-Stokes Representation
        2.1.4 Numerical Algorithm
    2.2 Simulation of Homogeneous Isotropic Turbulence
        2.2.1 Decaying Homogeneous Isotropic Turbulence
        2.2.2 Forced Homogeneous Isotropic Turbulence
        2.2.3 Forcing Method for Homogeneous Isotropic Turbulence
    2.3 Domain Decomposition; Pencil Pattern
    2.4 2decomp
Chapter 3 – Numerical Developments
    3.1 Parallelization Scheme
        3.1.1 Global Summing
        3.1.2 Local Summing
        3.1.3 Maxima/Minima
        3.1.4 Global Access at Local Level
    3.2 Fast Fourier Transforms
        3.2.1 Direct FFT Wrapper
        3.2.2 Inverse FFT Wrapper
        3.2.3 Real to Complex and Complex to Real
    3.3 Truncation and Hermitian Redundancy
    3.4 I/O
    3.5 Randomization
    3.6 Statistics
    3.7 Converter
Chapter 4 – Results
    4.1 Tests and Speed Up
        4.1.1 Development of X-to-X Version
        4.1.2 Development of X-to-Z Version
        4.1.3 Pre-Allocation Optimization
        4.1.4 Final Temporal Results
    4.2 Large Scale DNS Testing
Chapter 5 – Conclusions and Further Work
    5.1 Main Results and Project Considerations
    5.2 Future Work
References
List of Figures

Figure 1 – Turbulent Flow as depicted by Da Vinci (1452-1519)
Figure 2 – Eddy sizes within the Energy Cascade
Figure 3 – Description of the range of scales in a turbulent flow
Figure 4 – Slab Arrangement
Figure 5 – Pencil Arrangement
Figure 6 – Slab Parallelization Scheme Examples
Figure 7 – Pencil Parallelization Scheme Examples
Figure 8 – Fourier Transform Domains (Direct and Inverse)
Figure 9 – Velocity and Wave number Vectors in Spectral Space
Figure 10 – Direct and Inverse FFT Algorithm Flowchart
Figure 11 – Parallelization Scheme for Physical and Spectral Space
Figure 12 – Global Scheme versus Pencil Scheme
Figure 13 – Global Summing Algorithm Flowchart
Figure 14 – Minima/Maxima Algorithm Flowchart
Figure 15 – Local Summing Algorithm Flowchart
Figure 16 – Global Access at Local level Algorithm Flowchart
Figure 17 – 2decomp Library FFT X to Z (Standard) Implementation
Figure 18 – Direct FFT Wrapper Algorithm Flowchart
Figure 19 – Inverse FFT Wrapper Algorithm Flowchart
Figure 20 – Complex/Real Subroutine(s) Scheme
Figure 21 – Truncation Global Scheme versus Pencil Scheme
Figure 22 – Hermitian Redundancy Algorithm Depiction
Figure 23 – Global Plate Operation Visualization
Figure 24 – Converter MPI NATIVE to FORTRAN90 NATIVE Algorithm Flowchart
Figure 25 – Temporal Results (mesh 128³) in Galego (X to X)
Figure 26 – VAMPIR Results Visualization, excluding non-parallelizable time expenditure
Figure 27 – Temporal Results
Figure 28 – Adimensional Temporal Results
Figure 29 – Temporal Results
Figure 30 – Adimensional Temporal Results
Figure 31 – Temporal Results
Figure 32 – Adimensional Temporal Results
Figure 33 – Iteration Time Final Results
Figure 34 – Iteration Time Final Results
List of Tables

Table 1 – Fourier Transform Properties
Table 2 – Temporal Results (mesh 128³) in Galego
Table 3 – Temporal results for FFTI (128³)
Table 4 – FFTI Temporal usage breakdown
Table 5 – Temporal results for FFTD (128³)
Table 6 – FFTD Temporal usage breakdown
Table 7 – Total Temporal expenditure
Table 8 – Time per Cycle in workstation
Table 9 – Time per Cycle in workstation
Table 10 – Total Time Expenditure
Table 11 – Time per Cycle, 64³, 4 cores
Table 12 – Relative Time Expenditure
Table 13 – Time (seconds per Cycle)
Table 14 – Adimensional Time per Cycle
Table 15 – Time (seconds per Cycle)
Table 16 – Adimensional Time per Cycle
Table 17 – Time (seconds per Cycle)
Table 18 – Adimensional Time per Cycle
List of Symbols

Greek Symbols:

$\alpha_k$   Runge-Kutta coefficients
$\beta_k$   Runge-Kutta coefficients
$\Delta t$   Time step
$\Delta x$   Spatial variation
$\varepsilon$   Dissipation rate of turbulent kinetic energy
$\eta$   Kolmogorov scale
$\lambda$   Taylor scale
$\lambda_w$   Wavelength
$\nu$   Kinematic viscosity
$\nu^{[s]}$   Viscosity
$\rho$   Density

Roman Symbols:

$d$   Eddy diameter
$\vec{e}_1(k)$   Orthogonal vector to $\vec{k}$
$\vec{e}_2(k)$   Orthogonal vector to $\vec{k}$
$f$   Generic function
$\Im$   Direct Fourier Transform operator
$\Im^{-1}$   Inverse Fourier Transform operator
$g$   Generic function
$k$   Wavenumber
$\vec{k}$   Wavenumber vector
$K_{force}$   Energy input at forcing
$k_0$   Wavenumber for energy injection in forcing
$L$   Length
$l_0$   Macroscale length
$L_c$   Characteristic length scale
$L^{[s]}(\vec{u})$   Viscous term in the Navier-Stokes equations
$n$   Number of wavenumbers in a given direction
$N$   Total number of computational points in the calculation grid
$n_{procs}$   Total number of processing cores
$N_x$   Number of points in the calculation grid along the x-direction, also $N_1$
$N_y$   Number of points in the calculation grid along the y-direction, also $N_2$
$N_z$   Number of points in the calculation grid along the z-direction, also $N_3$
$N(\vec{u})$   Non-linear and pressure term in the Navier-Stokes equations
$P$   Power input
$P0$   Processor 0 in a multi-processor set-up
$P1$   Processor 1 in a multi-processor set-up
$P2$   Processor 2 in a multi-processor set-up
$P3$   Processor 3 in a multi-processor set-up
$\nabla p$   Pressure gradient
$\mathrm{Re}$   Reynolds number
$\mathrm{Re}_L$   Reynolds number at the computational box of length $L$
$\mathrm{Re}_\eta$   Reynolds number at the Kolmogorov scale
$T_{run}$   Total run time for a simulation
$\vec{u}$   Velocity vector
$u_i$   Velocity in direction $i$
$u_\eta$   Kolmogorov scale velocity
$u_\lambda$   Taylor scale velocity
$u_0$   Macroscale velocity
$U_c$   Characteristic velocity scale
$U_L$   Velocity at box of length $L$
$\partial \vec{u} / \partial t$   Temporal derivative of the velocity vector
$\nabla u$   Velocity gradient
$\nabla^2 u$   Laplacian of the velocity
$\hat{u}(k)$   Fourier transform of $u(x)$, as a function of the wavenumber $k$
$u(x)$   Function of the spatial coordinate $x$
List of Abbreviations

CFD Computational Fluid Dynamics
DFT Discrete Fourier Transform
DNS Direct Numerical Simulation
FDM Finite Difference Method
FEM Finite Element Method
FVM Finite Volume Method
FFT Fast Fourier Transform
HIT Homogeneous Isotropic Turbulence
LES Large Eddy Simulation
MPI Message Passing Interface
PDE Partial Differential Equation
PSM Pseudo-Spectral Method
RAM Random Access Memory
Chapter 1
Chapter outline:
Chapter 1 deals with the theoretical fundamentals that underpin the developed work. It also provides insight into why this work was undertaken and its intended goal. In addition, it provides a short explanation of the parallelization scheme used and a comparison with other options.
Introduction
Fluid flows are of engineering significance due to the variety of applications that depend on understanding their behaviour. The subject has been extensively studied both physically and mathematically. The dynamics of a fluid flow can be described by the Navier-Stokes equations, presented in the next section. In turbulent flows, there are no analytical solutions for these equations. Computational Fluid Dynamics provides engineering-grade answers to how a given flow behaves in a particular situation, with several methods used to solve the Navier-Stokes equations or simplified forms of them. The most commonly used methods in commercial software are the Finite Volume Method (FVM), the Finite Difference Method (FDM) and the Finite Element Method (FEM), the last of which takes a similar approach but rests on a different mathematical foundation. These methods all benefit from being applicable to a range of different flow geometries, but require extensive care when designing the computational mesh in order to obtain results that successfully model the flow field.
This thesis deals with the application of the Pseudo-Spectral Method (PSM), which is characterised by its high accuracy. Its main disadvantages are the need for a regularly distributed mesh, which prevents its use in complicated flow geometries, and, like all the methods previously mentioned, a high physical memory requirement to be capable of modelling the smaller scales of motion present in any fluid flow. As such, this work is aimed at extending the capabilities of a pre-existing algorithm towards massive calculations to be used as reference simulations. The solver is designed to use the Message Passing Interface standard, and takes advantage of the 2DECOMP library [1]. The 2DECOMP library uses a pencil arrangement for each processing core and provides full communication capacities for inter-core and inter-node operations.
1.1 Numerical Simulation of Flows
1.1.1. Navier-Stokes Equations
The Navier-Stokes equations for incompressible fluids may be written as:

For mass continuity:

$$\nabla \cdot \vec{u} = 0 \qquad (1.1)$$

For momentum continuity:

$$\frac{\partial \vec{u}}{\partial t} + \vec{u} \cdot \nabla \vec{u} = -\frac{1}{\rho}\nabla p + \nu \nabla^{2}\vec{u} \qquad (1.2)$$
The solution of these equations describes a flow's velocity field, defined at every point in its domain for a given time. This enables the calculation of several other properties, such as mass gradients, temperature gradients and pressure gradients. However, these equations cannot currently be solved in general, and require simplification in order to obtain solutions for the velocity fields under study. The general form of the Navier-Stokes equations still lacks a solution for all ranges of applicability, and considerable effort goes into using the equations for engineering projects. A turbulent flow is chaotic in nature, due to the non-linear convective interactions, and exhibits rapid variations of local pressure and flow velocity both spatially and temporally.
This thesis is aimed at the numerical simulation of turbulence using a computational method suitable for calculating all of the flow's properties in a simplified computational domain.
1.1.2. Turbulence
Turbulence is a phenomenon which Peter Bradshaw describes as:
"Turbulence is a three-dimensional movement dependent on time, in which vortex stretching makes fluctuations in the velocity field extend into all wavelengths, between a minimum set by viscous forces and a maximum set by the boundary conditions of the flow. It is the usual state of fluids except at low Reynolds numbers." According to Tsinober [10], this quote is accurate in describing the turbulence phenomenon, but is not very helpful when first attempting to understand the events taking place. At low Reynolds numbers, a flow behaves in a laminar fashion, with the fluid viscosity dominating the fluid's inertial behaviour. As the Reynolds number increases, the inertia present in the fluid begins to take over and the flow becomes more disorganised, acquiring a turbulent nature. Eventually the flow is entirely dominated by turbulence, which greatly alters the flow's properties at a local level. Globally, one may see large temporary movement structures in the flow, but these are not predictable. Leonardo da Vinci had already observed and attempted to study turbulence, and, much as then, the phenomenon is still under heavy study and requires additional models to compute when using the Navier-Stokes equations.
Figure 1 – Turbulent Flow as depicted by Da Vinci (1452-1519)
1.1.3. The Energy Cascade
In order to understand turbulence one must first take into account the energy transfer processes that occur within the flow. The first of these is the Energy Cascade concept, which attempts to explain the energy transfer from large, macroscopic flow scales to microscopic flow scales, until the eventual dissipation via viscosity, where a conversion into internal energy occurs.
Turbulent motions span a wide range of scales, from a macro scale at which the energy is supplied to a micro scale at which this energy is dissipated by viscosity [ref turbulence book]. The interaction among eddies of various scales passes energy sequentially from the larger eddies to the smaller ones. This process is called the Turbulent Energy Cascade, depicted in Figure 2.
Figure 2 – Eddy sizes within the Energy Cascade
If turbulence is statistically in equilibrium, then the rate of energy transfer from one scale to the next must be the same for all scales, so that no group of eddies sharing the same scale sees its total energy level increase or decrease over time. It then follows that the rate at which energy is supplied at the largest scales is equal to that dissipated at the smallest scales.
Let us imagine that the flow under study is contained within a cubic box of length $L$; the range of scales is then as shown in Figure 3:
Figure 3 – Description of the range of scales in a turbulent flow
Each of the scales represented is part of the Energy Cascade mechanism: energy injected at the large scales travels continuously towards the smaller scales, down to the molecular scales. Turbulent flows contain instantaneously generated vortices, observable in practice, which transmit energy from the macroscopic scales to the molecular scale. These vortices, according to their relative size and order of magnitude, are a means of energy dissipation and mixing in the moving fluid, and vortices at each length scale behave differently, transferring energy into the next generation of smaller vortices. Large-scale and integral-scale vortices are anisotropic and dissipate little to no energy, while also containing the highest amount of energy in a fluid flow. Smaller vortices are isotropic and contain very low amounts of energy. Both large-scale and integral-scale vortices have an apparently simple structure, but are in fact extremely complex and interact with all other scales at all points in the flow, which allows for the dissipation of turbulent kinetic energy down towards the small vortices and eventually to the molecular scale.
(Figure 3 scale labels, from the box size $L$ downwards: Large Scale, Integral Scale, Taylor Microscale, Kolmogorov Scale, Molecular Scale.)
1.1.4. The Kolmogorov Micro Scale
The Reynolds number at the Kolmogorov scale is $\mathrm{Re}_\eta = u_\eta \eta / \nu = 1$, which is consistent with energy dissipation by molecular viscosity. If $\mathrm{Re}_\eta = 1$, using the first similarity hypothesis, as enunciated by Kolmogorov [7], [8]:

First similarity hypothesis:
In every turbulent flow at sufficiently high Reynolds numbers, the statistics of small scale motions have a universal form that is uniquely determined by the kinematic viscosity, $\nu$, and the dissipation rate of turbulent kinetic energy, $\varepsilon$.

By performing a dimensional analysis, the following unique length, velocity and time scales are obtained in Eq. (1.3):

$$\eta = \left(\frac{\nu^3}{\varepsilon}\right)^{1/4} \qquad u_\eta = \left(\varepsilon \nu\right)^{1/4} \qquad \tau_\eta = \left(\frac{\nu}{\varepsilon}\right)^{1/2} \qquad (1.3)$$

By taking into account the concept of the Energy Cascade, the dissipation can be estimated from the large scales of the flow, by taking:

$$\varepsilon \sim \frac{u_0^3}{l_0} \qquad (1.4)$$

where $u_0$ and $l_0$ are the velocity and length scales of the largest eddies.
Taking all of the previous into account, the relationships between the largest scales and the smallest scales may be derived, resulting in:

$$\frac{\eta}{l_0} = \mathrm{Re}^{-3/4} \qquad \frac{u_\eta}{u_0} = \mathrm{Re}^{-1/4} \qquad \frac{\tau_\eta}{\tau_0} = \mathrm{Re}^{-1/2} \qquad (1.5)$$
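As a quick check of Eq. (1.3), the dimensional-analysis step can be written out explicitly (added here for clarity; it follows directly from the similarity hypothesis). Seeking $\eta = \nu^a \varepsilon^b$ with $[\nu] = \mathrm{m^2\,s^{-1}}$ and $[\varepsilon] = \mathrm{m^2\,s^{-3}}$:

$$[\eta] = \mathrm{m} = \left(\mathrm{m^2\,s^{-1}}\right)^a \left(\mathrm{m^2\,s^{-3}}\right)^b \;\Rightarrow\; 2a + 2b = 1, \quad -a - 3b = 0 \;\Rightarrow\; a = \tfrac{3}{4},\; b = -\tfrac{1}{4},$$

which recovers $\eta = (\nu^3/\varepsilon)^{1/4}$; the same procedure yields $u_\eta$ and $\tau_\eta$.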
1.1.5. The Computational Cost
Direct Numerical Simulation (DNS) has the capability to solve the Navier-Stokes equations directly, without any simplification. However, in order to capture the smallest scales of turbulence, a fine mesh is required, of the order of the Kolmogorov scale, which leads to a growing need for memory in order to model more minute and detailed flow properties. If this is not the case, the flow energy is not properly dissipated by the smallest scales.
The approach taken requires full 3-dimensional precise fields of the dependent variables, often unreachable via experimentation, at the expense of a growing need for memory as the computational mesh size increases to capture these smaller scales. Assuming a cube-shaped computational domain of length $L$, with $L$ also being the characteristic length of the largest scales present in the simulated flow, the number of points $N$ required is:

$$N \sim \left(\frac{L}{\eta}\right)^3 \qquad (1.6)$$

Therefore, for $N$:

$$N \sim \left(\frac{L}{\eta}\right)^3 \sim \left(\frac{u_L L}{\nu}\right)^{9/4} = \mathrm{Re}_L^{9/4} \qquad (1.7)$$

As the simulation's Reynolds number increases, an increasing number of points in the mesh is required, which translates directly into increasing memory usage.
For stability, the maximum allowed time step can be found by following the Courant-Friedrichs-Lewy (CFL) condition:

$$\frac{U_L \Delta t}{\Delta x} \sim 1 \;\leftrightarrow\; \Delta t = \frac{\eta}{U_L} \qquad (1.8)$$

To reach a fully developed turbulent state, the time required is proportional to the time scale of the largest eddies,

$$T_{run} \sim \frac{L}{U_L} \qquad (1.9)$$

Therefore, the number of time steps is of the order of:

$$N_{timesteps} \sim \frac{T_{run}}{\Delta t} \sim \frac{L}{\eta} \sim \mathrm{Re}_L^{3/4} \qquad (1.10)$$

This implies that this kind of simulation is restricted to low Reynolds numbers, as the number of nodes increases with $\mathrm{Re}_L^{9/4}$, as indicated by Eq. (1.7), while the number of time steps increases with $\mathrm{Re}_L^{3/4}$, as in Eq. (1.10), so that the total number of node-steps (points times time steps) increases as $\mathrm{Re}_L^{3}$.
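To make these scalings concrete, the short sketch below (an illustration added for this text, not part of the solver) estimates how the grid and time-step counts grow with the Reynolds number:

```fortran
program dns_cost
   ! Sketch: growth of DNS cost with Re_L, per Eqs. (1.7) and (1.10).
   ! Illustrative only; not part of the thesis solver.
   implicit none
   real :: re_l, n_points, n_steps
   re_l     = 1.0e4                 ! example large-scale Reynolds number
   n_points = re_l**(9.0/4.0)       ! grid points,  N ~ Re_L^(9/4)
   n_steps  = re_l**(3.0/4.0)       ! time steps,   ~ Re_L^(3/4)
   print '(a,es10.3)', 'grid points      ~ ', n_points
   print '(a,es10.3)', 'time steps       ~ ', n_steps
   print '(a,es10.3)', 'points x steps   ~ ', n_points*n_steps  ! ~ Re_L^3
end program dns_cost
```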
1.1.6. Parallelization Schemes
Due to the growing memory requirement as the Reynolds number of the flow increases, common workstations rapidly become overwhelmed by both the memory requirements and the calculation procedures. As such, when aiming for exceedingly large simulations, there is a need to move up to large clusters, which host thousands of cores and have large memory banks. In order to take advantage of these, specific programming models are required; for this thesis, the Message Passing Interface (MPI) is used. MPI is designed for large clusters and consists of breaking up the larger memory into smaller portions capable of being handled by individual nodes and cores with their own physical memory banks. This is termed parallelization, and for the purposes of this thesis, two parallelization schemes were initially considered:
a) Slab Scheme
Figure 4 – Slab Arrangement
The slab parallelization scheme is the simpler of the two schemes considered during this thesis. Each processor has access to a physical memory bank which, for a three-dimensional field, is operated on in two dimensions prior to a global exchange that allows the third dimension to be operated on. Only one global communication is required, although the message is of extremely large size. The main disadvantage of the slab decomposition is that it requires more memory on a given calculation processor than the pencil scheme, at least $N \times (1/n_{procs})$, with $N$ being the selected mesh size for the calculations. This places a limit on the possible calculations, linked to the physical memory available to the node to which the processor belongs.
b) Pencil Scheme
A pencil scheme consists of having each processor operate on a one-dimensional piece of the three-dimensional field under study. It uses less memory per processor, but requires more processors in order to be fully efficient. Each processor requires less memory than in a slab arrangement, but two communications are required in order to fully calculate the field.
Figure 5 – Pencil Arrangement
The disadvantage of this approach is that the communication effort that takes place will eventually begin to degrade code performance, depending on the localized space distribution and its relationship with the global dimensions. Given the goal of having as large a simulation as possible, the pencil scheme was selected, based on the 2decomp library, which provides a framework on which to build the remainder of the code.
A further benefit of a pencil memory arrangement is the option of selecting which arrangement to use during the final calculation involving the memory fields. This option was explored in the work developed for this thesis with two variations: one using an x-dimension pencil arrangement to an x-dimension pencil arrangement, which was intended, even at a higher communication cost, to capitalize on the calculation procedures; and an x-dimension pencil to a z-dimension pencil, which instead capitalized on the reduced-communications model.
Memory decomposition plays a key role in the calculation procedures and communications. Initially, a slab decomposition pattern was considered since, when compared to the pencil decomposition, it allows two 1-dimensional FFT procedures to be done in local space, requiring only one communication step before the third and last FFT operation. This results in a rotated local view of the memory pattern. Its main disadvantage, as noted above, is the higher per-processor memory requirement, which ties the feasible problem size to the physical memory available on the node to which each processor belongs.
Figure 6 – Slab Parallelization Scheme Examples
A pencil memory decomposition, presented in Figure 7, sidesteps this limitation by further restricting the memory space to only what is required for a 1-dimensional FFT to be performed, but requires two global sets of communications, with two intermediate FFT operations prior to output.
The disadvantage of this approach is that the communication effort that takes place will eventually begin to degrade code performance, depending on the localized space distribution and its relationship with the global dimensions.
Both approaches, however, assuming the minimum-communications model, produce a data distribution at output different from the data distribution at input, with the slab decomposition producing a differently orientated slab, and the pencil producing a differently orientated pencil.
Further, more detailed comparisons of these decompositions may be found in the 2Decomp Library's Overview section [1].
Figure 7 – Pencil Parallelization Scheme Examples
Chapter 2
Chapter outline:
Chapter 2 introduces the computational tools used in this work and is intended as a bridge between the theoretical introduction presented before and the numerical formulation used in the Navier-Stokes solver.
Background
The present work is an evolution of the previous spectral code that has been used in several publications by the research team [3], [4], [5]. The subject of this thesis was to develop a code that could eventually be comparable with the best existing codes in the world in terms of performance, and for that it was decided to use the 2DECOMP library available at http://www.2decomp.org/.
At the time of the creation of the MPI solver version, the original code existed as an OpenMP version, which remains more developed than the present version.
2.1 Pseudo-Spectral Method
Pseudo-Spectral Methods (PSM) originated around 1970 and are a class of numerical methods used in applied mathematics and scientific computing for the solution of partial differential equations (PDE). The PSM is used extensively in computational fluid dynamics (CFD) for turbulence simulation and is related to the spectral methods, where the "pseudo" in the nomenclature refers to the treatment of the non-linear term of a PDE. Spectral solutions to time-dependent PDE are formulated in the frequency/wavenumber domain and solutions are obtained in terms of Fourier coefficients. In the PSM, the PDE are solved pointwise in Fourier space, with the spatial derivatives calculated using orthogonal functions (e.g. Fourier integrals); the terms are evaluated using matrix multiplications, FFTs or convolutions.
2.1.1 Discrete Fourier Transform:
For a function $u(x)$ in the Physical domain, it is possible to execute a mathematical operation to convert it into $\hat{u}(k)$ in the Spectral space (or Fourier space). This operation is called the Direct Fourier Transform, denoted by $\Im$:

$$\hat{u}(k) = \Im\{u(x)\} = \frac{1}{2\pi}\int_{-\infty}^{+\infty} u(x)\, e^{-ikx}\, dx \qquad (2.1)$$

The wavenumber $k$ is computed as follows:

$$k = \frac{2\pi}{\lambda_w} \qquad (2.2)$$

The reverse operation is also possible, the Inverse Fourier Transform, which returns the transformed function in the Spectral space to the Physical space:

$$u(x) = \Im^{-1}\{\hat{u}(k)\} = \int_{-\infty}^{+\infty} \hat{u}(k)\, e^{ikx}\, dk \qquad (2.3)$$

The comparison between a function's appearance in the Physical domain and its Fourier Transform in the Spectral space is presented in Figure 8, which represents a real-domain function and its transform in the spectral space. The transformed function has only wavenumbers and amplitudes, and when transformed via the Inverse DFT returns the same real function. This property allows operations affecting the real space to take place, often more simply, in the frequency domain.
Figure 8 – Fourier Transform Domains (Direct and Inverse)
The main advantage of this method, as previously mentioned, is the simplicity of most terms when they are in the spectral domain, even if they are apparently complicated in the physical domain. To illustrate this, Table 1 lists the correspondence between several mathematical operations in the physical and Fourier space.

Property         | Physical Space                | Spectral Space                        | Notes
Linearity        | $af + bg$                     | $a\hat{f} + b\hat{g}$                 |
Derivative       | $\partial f / \partial x_i$   | $i k_i \hat{f}$                       |
Laplace operator | $\nabla^2 f$                  | $-k^2 \hat{f}$                        |
Divergence       | $\partial u_i / \partial x_i$ | $i k_i \hat{u}_i$                     |
Curl             | $\nabla \times u$             | $i\vec{k} \times \hat{u}(\vec{k},t)$  |
Convolution      | $f\, g$                       | $(\hat{f} * \hat{g})(\vec{k},t)$      | $*$ – convolution

Table 1 – Fourier Transform Properties

The table above lists the properties of the DFT used in the MPI DNS solver, showing the simplicity of the operations in the Spectral space, with the singular exception of the convolution operation.
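As a small illustration of the derivative property in Table 1, differentiation in the spectral space reduces to a pointwise multiplication by $ik$; the sketch below uses hypothetical coefficient arrays, not the solver's own data structures:

```fortran
program spectral_derivative
   ! Sketch: d/dx in spectral space is multiplication by i*kx (Table 1).
   ! The coefficients and wavenumbers here are illustrative placeholders.
   implicit none
   integer, parameter :: n = 8
   complex :: u_hat(n), dudx_hat(n)
   real    :: kx(n)
   integer :: j
   do j = 1, n
      kx(j)    = real(j - 1)             ! wavenumbers 0, 1, ..., n-1
      u_hat(j) = cmplx(1.0/real(j), 0.0) ! arbitrary spectral coefficients
   end do
   dudx_hat = cmplx(0.0, 1.0) * kx * u_hat
   print *, dudx_hat
end program spectral_derivative
```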
For a Spectral-space convolution operation, it is more time-efficient to return the spectral function $G = (\hat{f} * \hat{g})(\vec{k},t)$ to its $f(x,t)\,g(x,t)$ form in the Physical space, calculate the pointwise product as normal, and only then perform a direct DFT operation in order to obtain the convolution result:

$$(\hat{f} * \hat{g})(\vec{k},t) = \Im\left\{\Im^{-1}\{\hat{f}\} \cdot \Im^{-1}\{\hat{g}\}\right\} \qquad (2.4)$$

This defines the class of pseudo-spectral methods.
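The sketch below shows this pattern with the 2DECOMP&FFT interface; the call names follow the library's documentation but should be read as assumptions, and the normalisation comment reflects the usual unnormalised-FFT convention rather than a guarantee about the backend in use:

```fortran
program pseudo_spectral_product
   ! Sketch of Eq. (2.4): spectral convolution computed through the
   ! physical space. 2DECOMP&FFT call names follow the library docs
   ! (treated as assumptions here); run on p_row*p_col = 4 MPI ranks.
   use MPI
   use decomp_2d
   use decomp_2d_fft
   implicit none
   integer, parameter :: nx = 16, ny = 16, nz = 16
   integer :: ierr, fstart(3), fend(3), fsize(3)
   real(mytype),    allocatable :: f(:,:,:), g(:,:,:)
   complex(mytype), allocatable :: f_hat(:,:,:), g_hat(:,:,:), conv_hat(:,:,:)

   call MPI_INIT(ierr)
   call decomp_2d_init(nx, ny, nz, 2, 2)
   call decomp_2d_fft_init(PHYSICAL_IN_X)          ! real data in x-pencils
   call decomp_2d_fft_get_size(fstart, fend, fsize)

   allocate(f(xsize(1), xsize(2), xsize(3)), g(xsize(1), xsize(2), xsize(3)))
   allocate(f_hat(fsize(1), fsize(2), fsize(3)))
   allocate(g_hat(fsize(1), fsize(2), fsize(3)))
   allocate(conv_hat(fsize(1), fsize(2), fsize(3)))

   call random_number(f)                           ! placeholder fields
   call random_number(g)
   call decomp_2d_fft_3d(f, f_hat)                 ! forward, real-to-complex
   call decomp_2d_fft_3d(g, g_hat)

   ! Pseudo-spectral step: inverse transform, pointwise product, forward
   ! transform. With unnormalised FFTs the round trip scales by nx*ny*nz,
   ! hence the normalisation below; adjust to the backend in use.
   call decomp_2d_fft_3d(f_hat, f)                 ! inverse, complex-to-real
   call decomp_2d_fft_3d(g_hat, g)
   f = (f * g) / real(nx*ny*nz, mytype)**2
   call decomp_2d_fft_3d(f, conv_hat)              ! spectral convolution

   call decomp_2d_fft_finalize
   call decomp_2d_finalize
   call MPI_FINALIZE(ierr)
end program pseudo_spectral_product
```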
2.1.2 The Navier-Stokes Equations in the Fourier Space
The Navier-Stokes equations can be easily written in the Fourier space.
Using tensor notation, if we apply a Fourier Transform to the continuity equation, we get:

$$\Im\left\{\frac{\partial u_i}{\partial x_i}\right\} = i k_i \hat{u}_i \qquad (2.5)$$

and the previous may be written as:

$$k_i \hat{u}_i = 0 \qquad (2.6)$$

For the momentum conservation equation, by applying the Fourier Transform to each term:

$$\Im\left\{\frac{\partial u_i}{\partial t}\right\} = \frac{\partial \hat{u}_i}{\partial t} \qquad (2.7)$$

$$\Im\left\{\nu^{[s]}\frac{\partial^2 u_i}{\partial x_j^2}\right\} = -\nu^{[s]} k^2 \hat{u}_i \qquad (2.8)$$

$$\Im\left\{-\frac{1}{\rho}\frac{\partial p}{\partial x_i}\right\} = -\frac{1}{\rho}\, i k_i \hat{p} \qquad (2.9)$$

$$\Im\left\{\frac{\partial}{\partial x_j}\left(u_i u_j\right)\right\} = \hat{G}_i \qquad (2.10)$$

Thus, the Spectral form of the Navier-Stokes momentum equation is:

$$\frac{\partial \hat{u}_i}{\partial t} + \nu^{[s]} k^2 \hat{u}_i = -\frac{1}{\rho}\, i k_i \hat{p} - \hat{G}_i \qquad (2.11)$$

By multiplying the above equation by $i k_i$, an operation equivalent to applying the divergence in the Physical space, we obtain:
$$\frac{1}{\rho}\, k^2 \hat{p} = i k_i \hat{G}_i \qquad (2.12)$$

From the previous, we may obtain the pressure:

$$\hat{p} = \rho\, \frac{i k_i \hat{G}_i}{k^2} \qquad (2.13)$$

Thus, the final form of the Navier-Stokes momentum equation in the Spectral space is obtained:

$$\frac{\partial \hat{u}_j}{\partial t} + \nu^{[s]} k^2 \hat{u}_j = -\left(\delta_{jk} - \frac{k_j k_k}{k^2}\right)\hat{G}_k \qquad (2.14)$$

After deducing the Navier-Stokes equations in the Spectral space, a note is warranted about the difficulty of each term. While the majority of the terms are apparently simple to calculate, the term $G$ is a convolution and, as such, would be difficult to calculate in the Spectral space. Therefore, as the terms are calculated, the term $G$ is subjected to an inverse FFT operation, calculated as a pointwise product in the physical space, and then subjected to a direct FFT for the Spectral-space calculation.
This is the only step in the entirety of the code that requires the real field to be used, in order to reduce computational operation time. Doing so, however, produces errors, as the products of the Fourier transforms yield ghost terms. These are called aliasing errors, and are removed via truncation of the corresponding wavenumbers. For truncation the 2/3 rule was used, which removes the aliasing errors by discarding all Fourier coefficients for which:

$$k > \frac{2}{3}\, k_{max} \qquad (2.15)$$

This allows the highest wavenumber modes to be removed while retaining the accuracy of the PSM, which is vital in the present work.
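A minimal sketch of such a truncation (the array names and square mask are illustrative; the solver's actual routine is described in Section 3.3):

```fortran
subroutine dealias(u_hat, kx, ky, kz, k_max)
   ! Sketch: zero all coefficients beyond 2/3 of the maximum resolved
   ! wavenumber in each direction. Names are illustrative placeholders.
   implicit none
   complex, intent(inout) :: u_hat(:,:,:)
   real,    intent(in)    :: kx(:), ky(:), kz(:)  ! wavenumbers of this pencil
   real,    intent(in)    :: k_max
   integer :: i, j, k
   do k = 1, size(u_hat, 3)
      do j = 1, size(u_hat, 2)
         do i = 1, size(u_hat, 1)
            if (abs(kx(i)) > (2.0/3.0)*k_max .or. &
                abs(ky(j)) > (2.0/3.0)*k_max .or. &
                abs(kz(k)) > (2.0/3.0)*k_max) u_hat(i, j, k) = (0.0, 0.0)
         end do
      end do
   end do
end subroutine dealias
```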
2.1.3 Spectral Space Navier-Stokes representation
In the Spectral domain, the Navier-Stokes equations acquire the following configuration (see Figure 9). The wavenumbers $\vec{k}$ are three-dimensional vectors originating at the coordinate centre, and $\hat{u}$, the complex velocity, is located on a plane perpendicular to the vector $\vec{k}$ to which it corresponds. Each wavenumber vector corresponds to a single set of amplitude and angle in the plane normal to its direction. This is called the complex velocity and carries the same meaning as the conversion between a real-domain function and its representation in the spectral/Fourier space.
The complex velocity is itself a complex number which, when constrained to the plane, requires only a modulus and an angle.
Figure 9 – Velocity and Wave number Vectors in Spectral Space
2.1.4 Numerical Algorithm:
The MPI DNS solver developed, like its predecessors, uses a fully explicit temporal advancement scheme (3rd-order Runge-Kutta) for the Navier-Stokes equations.
The equations to be solved may be written as:

$$\frac{\delta \vec{u}}{\delta t} = N(\vec{u}) + L^{[s]}(\vec{u}) \quad \text{and} \quad \nabla \cdot \vec{u} = 0 \qquad (2.16)$$

in which the terms $N(\vec{u})$ and $L^{[s]}(\vec{u})$ are the convective term and the viscous term:

$$N(\vec{u}) = \vec{u} \times \vec{\omega} - \frac{1}{\rho}\nabla p \qquad (2.17)$$

$$L^{[s]}(\vec{u}) = \nu^{[s]} \nabla^2 \vec{u} \qquad (2.18)$$

The 3rd-order Runge-Kutta time stepping scheme computes, at each time step, the new velocity field at sub-step $k$ from the two previous sub-steps, $u^{k-1}$ and $u^{k-2}$ (with $u^{k=0} = u^{n}$ and the final sub-step giving $u^{n+1}$), allowing us to write:

$$\frac{u^{k} - u^{k-1}}{\Delta t} = \alpha_k \left\{ N(u^{k-1}) + L^{[s]}(u^{k-1}) \right\} + \beta_k \left\{ N(u^{k-2}) + L^{[s]}(u^{k-2}) \right\} \qquad (2.19)$$

still subject to:

$$\nabla \cdot u^{k} = 0 \qquad (2.20)$$

with the coefficients $\alpha_k$ and $\beta_k$, according to Williamson:

$$\alpha_1 = \frac{8}{15},\;\; \beta_1 = 0; \qquad \alpha_2 = \frac{5}{12},\;\; \beta_2 = -\frac{17}{60}; \qquad \alpha_3 = \frac{3}{4},\;\; \beta_3 = -\frac{5}{12} \qquad (2.21)$$
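A compact sketch of this sub-stepping loop (the rhs routine stands in for $N(u) + L^{[s]}(u)$; all names are placeholders, not the solver's actual interfaces):

```fortran
module rk3_demo
   ! Sketch of Williamson's low-storage 3rd-order Runge-Kutta, Eq. (2.19).
   implicit none
   real, parameter :: alpha(3) = [ 8.0/15.0,  5.0/12.0,  3.0/4.0  ]
   real, parameter :: beta(3)  = [ 0.0,     -17.0/60.0, -5.0/12.0 ]
contains
   subroutine rk3_step(u, dt, rhs)
      real, intent(inout) :: u(:)     ! velocity field, advanced in place
      real, intent(in)    :: dt
      interface
         function rhs(v) result(r)    ! stands in for N(v) + L[s](v)
            real, intent(in) :: v(:)
            real :: r(size(v))
         end function rhs
      end interface
      real    :: r_old(size(u)), r_new(size(u))
      integer :: k
      r_old = 0.0                     ! beta(1) = 0, so this term drops out
      do k = 1, 3
         r_new = rhs(u)               ! evaluated at sub-step k-1
         u     = u + dt * (alpha(k)*r_new + beta(k)*r_old)
         r_old = r_new                ! becomes the k-2 term next sub-step
      end do
   end subroutine rk3_step
end module rk3_demo
```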
2.2 Simulation of Homogeneous Isotropic Turbulence
With the above stated, one now turns to the simulation of turbulent flows. DNS, used in conjunction with the PSM, is the chosen tool; this effectively allows us to capture the instantaneous and chaotic essence of the turbulence phenomena. For reference, this thesis is designed with Homogeneous Isotropic Turbulence (HIT) in mind. A turbulent flow is isotropic if two conditions are met: rotation and buoyancy are not important, and therefore may be neglected; and there is no mean flow.
Rotation and buoyancy forces tend to suppress vertical motions in the fluid and create anisotropy between the vertical and horizontal directions. The presence of a mean flow with a particular orientation may also introduce anisotropy in the turbulent velocity and pressure fields. Further, a flow is homogeneous if there are no spatial gradients in any averaged quantity; this is equivalent to assuming that the statistics of the turbulent flow are not a function of space. The present work is aimed at performing HIT DNS simulations of two related kinds: decaying HIT and forced HIT. For decaying HIT simulations, an initial condition is set and the turbulent flow is allowed to decay via dissipation of the turbulent kinetic energy. For forced HIT simulations, energy is added to the flow, resulting in a statistically steady flow. Following the mechanisms of the energy cascade, energy is added to the low-wavenumber components of the velocity field. Comparing the two, the advantage of forced HIT is that stationary, long-time series are obtained and may be analysed, at the cost of polluting the large-scale (low-frequency) motions of the flow, which differ from decaying (natural) turbulence, although the small-scale (or high-frequency) statistics are unaffected due to the Energy Cascade concept. Decaying HIT, which has the advantage of having the large scales unpolluted, does not allow one to obtain long-time series, as the flow tends towards dissipating the entirety of the turbulent energy present in the initial velocity field. In forced HIT, as mentioned before, the low-frequency velocity statistics are affected by the forcing, while the high-frequency statistics are not appreciably affected. Details of the forcing method may vary, but the underlying concept remains: the forcing mimics the energy transfer to the inertial sub-range from the larger energy-containing scales, and the small scales are accurately solved. As such, forced turbulence has been used extensively in the study of the physics pertaining to the small scales of turbulence.
2.2.1 Decaying Homogeneous Isotropic Turbulence
To simulate decaying HIT, a homogeneous isotropic velocity field is generated as an initial condition, and the flow evolves by dissipation of turbulent kinetic energy, requiring no forcing.
2.2.2 Forced Homogeneous Isotropic Turbulence
A Homogeneous Isotropic Turbulence simulation consists of the following.
When a turbulent flow is statistically invariant under rotation about an arbitrary axis and, in consequence, statistically invariant under translations, it is deemed homogeneous and isotropic. While this corresponds to an idealized type of turbulent flow, one may approximate it in practical experimentation, such as in a wind tunnel.
Now, if one takes a cubic domain with periodic boundary conditions in all three directions in space, the phenomenon of Homogeneous Isotropic Turbulence may be simulated numerically, allowing the use of efficient and fast numerical schemes, such as the PSM, as is the case in this particular section. Of note is the following: the turbulent kinetic energy, present in any flow with a turbulence component, needs to be forced in order to be maintained steady, with the forcing applied at the large scales so that the small scales remain independent of it. The process consists of artificially adding energy to the low-wavenumber components of the velocity field. If this is not performed, one instead obtains a decaying turbulence simulation, which, while also accommodated by the solver, is a much simpler use case.
For statistically stationary isotropic turbulence, with forcing as mentioned, the characteristic length and velocity scales are $L_c = k_0^{-1}$ and $U_c = \left(\frac{P}{k_0}\right)^{1/3}$; thus, the corresponding Reynolds number becomes:

$$\mathrm{Re} = \frac{P^{1/3}\, k_0^{-4/3}}{\nu} \qquad (2.22)$$
2.2.3 Forcing Method for Homogeneous Isotropic Turbulence
The forcing mimics the energy transfer to the inertial sub-range from the larger energy-containing scales, and the small scales are accurately solved, as mentioned already.
As such, forced turbulence has been used extensively in the study of the physics pertaining to the small scales of turbulence. Forcing, in this solver, is done in the Spectral space, requiring additional parameters. The first is the power input parameter, $P$, defined by the user at program start, which controls the energy of the flow, $K_{force}$, which is to remain constant throughout the simulation. At a stationary stage, the rate of change of the integrated turbulent kinetic energy, $\frac{dK_{force}}{dt}$, is zero, since the dissipation rate $\varepsilon$ matches the input power $P$.
Other inputs are the wavenumbers $k$ at which the energy is injected for the simulation to progress. The implementation of the forcing is, as mentioned, performed in the Spectral space by means of a forcing term $\hat{f}$, a 3-dimensional vector.
The Navier-Stokes equation in the Spectral space can now be written as:

$$\frac{\partial \hat{u}_j}{\partial t} + \nu^{[s]} k^2 \hat{u}_j = -\left(\delta_{jk} - \frac{k_j k_k}{k^2}\right)\hat{G}_k + \hat{f}(\vec{k}, t) \qquad (2.23)$$

with the term $\hat{f}$ defined as:

$$\hat{f}(\vec{k}, t) = A_{random}(k,t)\, \vec{e}_1(k) + B_{random}(k,t)\, \vec{e}_2(k) \qquad (2.24)$$

where the terms $A_{random}(k,t)$ and $B_{random}(k,t)$ are complex randomized numbers and the terms $\vec{e}_1(k)$ and $\vec{e}_2(k)$ are unit vectors. The random force is additionally chosen to be divergence-free, which yields the following condition:

$$\vec{k} \cdot \hat{f}(\vec{k}, t) = 0 \qquad (2.25)$$

This implies that the force is projected onto the same plane as the velocity field, leading to the implication that the vectors $\vec{e}_1(k)$ and $\vec{e}_2(k)$ must be orthogonal to each other and to $\vec{k}$.
The vectors $\vec{e}_1(k)$ and $\vec{e}_2(k)$ are defined as follows:

$$\vec{e}_1(k) = (e_{1x}, e_{1y}, e_{1z}) = \left(\frac{k_y}{\left(k_x^2 + k_y^2\right)^{1/2}},\; \frac{-k_x}{\left(k_x^2 + k_y^2\right)^{1/2}},\; 0\right) \qquad (2.26)$$

$$\vec{e}_2(k) = (e_{2x}, e_{2y}, e_{2z}) = \left(\frac{k_x k_z}{\left(k_x^2 + k_y^2\right)^{1/2} k},\; \frac{k_y k_z}{\left(k_x^2 + k_y^2\right)^{1/2} k},\; \frac{-\left(k_x^2 + k_y^2\right)}{\left(k_x^2 + k_y^2\right)^{1/2} k}\right) \qquad (2.27)$$

with the random numbers $A_{random}$ and $B_{random}$ given by:

$$A_{random} = \left(\frac{F(k)}{2\pi k^2}\right)^{1/2} g_A(\phi)\, e^{i\theta_1} \qquad (2.28)$$

$$B_{random} = \left(\frac{F(k)}{2\pi k^2}\right)^{1/2} g_B(\phi)\, e^{i\theta_2} \qquad (2.29)$$

with $F(k)$ being the prescribed force spectrum, and $g_A$ and $g_B$ two real-valued functions related by $g_A + g_B = 1$. The values $\theta_1$ and $\theta_2$ are random angles such that $\theta_1, \theta_2 \in [0, 2\pi]$, with $\phi \in [0, \pi]$ being a random number as well, generated at each wavenumber and discrete time level.
In order to produce isotropic forcing, $g_A$ and $g_B$ are defined as:

$$g_A = \sin^2 \phi \qquad (2.30)$$

$$g_B = \cos^2 \phi \qquad (2.31)$$
To cancel any correlation with the velocity field, $\theta_1$ must satisfy the following:

$$\tan \theta_1 = \frac{g_A(\phi)\,\mathrm{REAL}\{\xi_1\} + g_B(\phi)\left(\sin\varphi\;\mathrm{REAL}\{\xi_2\} + \cos\varphi\;\mathrm{IMAG}\{\xi_2\}\right)}{g_A(\phi)\,\mathrm{IMAG}\{\xi_1\} + g_B(\phi)\left(\sin\varphi\;\mathrm{IMAG}\{\xi_2\} - \cos\varphi\;\mathrm{REAL}\{\xi_2\}\right)} \qquad (2.32)$$

with $\xi_1 = \vec{u} \cdot \vec{e}_1$ and $\xi_2 = \vec{u} \cdot \vec{e}_2$, and $\varphi$ being a randomly generated angle on $[0, 2\pi]$ defined by:

$$\varphi = \theta_2 - \theta_1 \qquad (2.33)$$
For the force, its spectrum shape is:

$$F(k) = A\, e^{-\frac{(k - k_0)^2}{c}} \qquad (2.34)$$

which yields a concentrated force focused at wavenumber $k_0$, with a degree of concentration defined by $c$. The force is limited to be active in the wavenumber range $k \in [k_A, k_B]$. To match the power input to the value $P$, $A$ must satisfy the following:

$$A = \frac{P}{\Delta t \displaystyle\int_{k_A}^{k_B} e^{-\frac{(k - k_0)^2}{c}}\, dk} \qquad (2.35)$$

The randomness of the scheme in time decorrelates the velocity field from the forcing, avoiding the enhancement of a particular time scale. Since the forcing is independent of the velocity field, there is the possibility of starting simulations from a zero velocity field, with turbulence generated by the forcing itself, ensuring that the final solution is independent of the initial conditions.
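To make Eq. (2.35) concrete, the sketch below evaluates the amplitude $A$ for a prescribed power input using simple trapezoidal quadrature (all parameter values are illustrative, not the solver's defaults):

```fortran
program force_amplitude
   ! Sketch: amplitude A of the force spectrum, Eq. (2.35), via the
   ! trapezoidal rule. Parameter values are illustrative only.
   implicit none
   real, parameter :: p_in = 0.1           ! prescribed power input P
   real, parameter :: dt   = 1.0e-3        ! time step
   real, parameter :: k0   = 3.0, c = 0.5  ! peak and concentration of F(k)
   real, parameter :: ka   = 1.0, kb = 5.0 ! active wavenumber band
   integer, parameter :: nq = 1000
   real    :: k, dk, quad, a
   integer :: i
   quad = 0.0
   dk   = (kb - ka) / real(nq)
   do i = 0, nq
      k    = ka + real(i)*dk
      quad = quad + merge(0.5, 1.0, i == 0 .or. i == nq) &
                    * exp(-(k - k0)**2 / c) * dk
   end do
   a = p_in / (dt * quad)
   print *, 'forcing amplitude A =', a
end program force_amplitude
```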
2.3 Domain Decomposition; Pencil Pattern
The memory pattern used for the parallelization of the code was the pencil decomposition, for the reasons mentioned earlier. This involves a minimum of two global transposition steps for each FFT-style operation, namely from the real to the complex domain, and from the complex to the real domain for the inverse operation. The memory pattern is described by an object-like structure, which includes and calculates all of the information pertaining to the local and global data, with the procedures automated by the 2Decomp Library. A separate object is used, though not strictly required, for the Physical and Spectral spaces; this eases further development, given that any further work done on the code must keep in mind that the Physical/real space has a radically different structure and organisation from the Spectral/Fourier space. Further objects are defined with different but compatible sets of global dimensions for calculations, such as for the Isotropic Turbulence initialization routines; further work will be necessary for jet initializations.
Due to the Hermitian Redundancy routine, each pencil must have matching dimensions for this version of the solver to work. This has the added benefit of ensuring, or attempting to ensure, an even memory load on all participating cores, coupled with an even processing load whenever possible, with some minor exceptions such as certain read/write-to-file operations. The 2Decomp Library supports this option automatically, if requested at compilation time. This approach leads to a simplification of the algorithm for the transposition of a processor-shared plate, which is made simpler if all participating pencils have matching dimensions. Further, due to the previously mentioned routine, there is a need for at least two participating cores in any given direction, since it relies on the possibility of mirroring the entire domain, which involves a set of communications outside the library, implemented in the functional code.
The communications performed rely on pre-allocation of the sending/receiving buffers, coupled with a minimal set of blocking send/receive operations, thus ensuring that while the messages may be large, space is reserved and no run time is spent on allocating these blocks, with only the nodes'/cores' own communication times being relevant.
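A minimal initialisation and transposition sketch using the 2DECOMP API (call and variable names follow the library's documentation and should be treated as assumptions; the mesh and processor grid are illustrative):

```fortran
program pencil_demo
   ! Sketch: pencil decomposition and the two global transpositions an
   ! FFT-style operation requires. Run with 4 MPI ranks (2x2 grid).
   use MPI
   use decomp_2d
   implicit none
   integer :: ierr
   real(mytype), allocatable :: ux(:,:,:), uy(:,:,:), uz(:,:,:)

   call MPI_INIT(ierr)
   call decomp_2d_init(64, 64, 64, 2, 2)

   ! each rank holds only its own pencil; xsize/ysize/zsize are local shapes
   allocate(ux(xsize(1), xsize(2), xsize(3)))  ! x-pencil: full x extent
   allocate(uy(ysize(1), ysize(2), ysize(3)))  ! y-pencil: full y extent
   allocate(uz(zsize(1), zsize(2), zsize(3)))  ! z-pencil: full z extent

   ux = real(nrank, mytype)                    ! tag data with the rank id
   call transpose_x_to_y(ux, uy)               ! first global transposition
   call transpose_y_to_z(uy, uz)               ! second global transposition

   call decomp_2d_finalize
   call MPI_FINALIZE(ierr)
end program pencil_demo
```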
2.4 2decomp
As mentioned several times throughout this document, the spectral code was developed atop the 2Decomp Library, available at http://www.2decomp.org/, which is based around the concept of pencil memory distribution. The library deals with the vast majority of the parallelization effort and is generic for any application that uses its fundamental basis. The Library is not aimed exclusively at FFT operations; instead, it is based around the memory pattern and the subsequent necessary operations, such as global transposition, halo-cell behaviour, and reading and writing to file, and is reported as being compatible with the architecture of several computational centres, although deviations may be expected. According to its creator, the 2Decomp Library is a FORTRAN framework for large-scale parallel applications designed around a 3-dimensional structured mesh.
At its core, the Library consists of a 2-dimensional pencil decomposition for distributed-memory calculations. It is, as a design feature, scalable and efficient, with an interface aimed at implementation and usage on supercomputers. The majority of the communication programming is left to the Library itself, and it includes an FFT interface able to be used with several different FFT backends. The FFT module is, given its basis in the 2Decomp library, also scalable, but requires specific handling: the output pencil has a different shape from the starting domain, and the output is converted from the Real to a Complex type in order to deliver the data correctly, as an FFT output must consist of modulus and phase rather than a real variable. The Library is designed with portability in mind, which further discouraged any modification of the Library when applying it to the MPI version, in order to allow fast porting from computational centre to computational centre.
Alongside the parallelization scheme, the Library also includes functionality to allow for the interpolation procedures necessary for calculating, say, a velocity derivative by means of finite differences. This functionality, called Halo-Cell, consists of a further set of operations that, for cores with globally adjacent memory blocks, allow for sharing relevant information based on the user's needs. This functionality is reported as having also been optimized, thus justifying its use; albeit, being communication-intensive (in a different fashion than global transpositions), there is some caution in using it, as each core must send/receive at least eight messages of largely different sizes. The Halo-Cell functionality is used in enforcing the Hermitian redundancy algorithm, which requires information from adjacent memory blocks present in participating processors.
For applications requiring periodic boundary conditions, the Halo-Cell functionality also allows a core, when the memory is not immediately present locally, to request information from the opposite side of the domain to which the periodicity relates. The exception occurs when the memory is entirely present in the core, voiding any need to perform communications. For this work, only a first-level memory access was required, but the library supports deeper halo levels, allowing higher-order interpolations to be performed on a global basis.
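A sketch of the halo usage is given below, based on the update_halo routine documented by the library; the single halo level reflects the first-level access used in this work, while the array names and surrounding declarations are illustrative.

    ! u holds the local x-pencil portion of a field
    real(mytype), allocatable, dimension(:,:,:) :: u, u_halo
    allocate(u(xsize(1), xsize(2), xsize(3)))
    ! ... fill u ...
    ! request one layer of ghost cells from the globally adjacent pencils
    ! (or from the opposite side of the domain, for periodic directions)
    call update_halo(u, u_halo, 1)
    ! u_halo now contains the local block expanded by one plane per side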
On the topic of pencil orientations, the Library does support x-dimension (real) input to z-dimension (complex) output for an FFT, as well as the reverse, from a z-dimension (real) input pencil to an x-dimension (complex) output pencil. In the latter case, however, the FFT operations must start along the z-dimension, which causes the auxiliary extra variables to be stored at the end of the z-pencil, so that it is the z-dimension of the output domain that is halved* at the start, with an auxiliary layer. This functionality was not used, since the majority of this code rests on the x-dimension being the halved* one.
The Library also supports I/O operations, performed either with per-routine file opening and closing, or against a previously opened file. In the latter case, the RAM-resident memory is transcribed directly to disk by each processing core, with the information requiring opening and conversion at a later time. The file saved does not carry any information pertaining to the core organisation and may be opened with different core arrangements at any time, allowing flexibility for later runs if a given cluster has a different core and node availability. Some functionality for performing portable file saves is provided, as well as a TECPLOT routine for writing slabs, but in this code these were left to exterior tools that read the memory stored at discrete intervals during execution time.
This library is used successfully in Incompact3D (https://code.google.com/p/incompact3d/). The successful use of the library, and its release for academic and research purposes, prompted its selection for this Thesis, whose intent was to provide a completely distributed parallel PSM calculation engine for research purposes, open to later expansion and refinement.
Chapter 3
Chapter outline:
Chapter 3 deals exclusively with the work done in implementing the principles presented in Chapter 1 and Chapter 2, and consists of a collection of algorithms used in the set of routines developed during implementation.
Numerical Developments
In order to develop the MPI version of the Spectral code currently in use, a deep study had to be performed of the logical processes used in the parent code, so as to plan and develop a functional MPI variant for use in computational clusters. Coupling this goal with the requirement that all participating subroutines be parallelized case by case and verified against the Serial or OpenMP versions of the same solver, completely parallelizing the entirety of the solver code was deemed beyond the scope of this Thesis.
Instead, there was a pressing need to first develop the capacity to calculate FFTs with proper real-kind outputs and verify these against functional variants in other codes. At this stage, in order to induce a minimum of changes to the way most of the memory was accessed, the choice was made to use an x-pencil orientated memory pattern for the Spectral space, which demanded two extra communications for proper memory transposition and for the global Spectral space dimensions. Such an approach, however, requires double the communication effort, as seen in the flowcharts of Figure 10: a minimum of four communications is needed to transpose the memory, with two communications required for the FFT processes themselves, and two more to rotate the memory back into an x-dimensional memory pencil.
Realizing that such communication effort would eventually require optimization, the decision was taken to re-create all participating subroutines so that the vast majority would be orientation independent, paving the way for further optimization steps. No pre-allocation was done at this stage; that would be a separate step, if taken.
Figure 10 – Direct and Inverse FFT Algorithm Flowcharts

Inverse X-to-X FFT Procedure: Spectral Input (X) → X-to-Y Global Transposition → Y-to-Z Global Transposition → 1D FFT (Z) → Z-to-Y Global Transposition → 1D FFT (Y) → Y-to-X Global Transposition → 1D FFT (X) → Physical Output (X).

Inverse X-to-Z FFT Procedure: Spectral Input (Z) → 1D FFT (Z) → Z-to-Y Global Transposition → 1D FFT (Y) → Y-to-X Global Transposition → 1D FFT (X) → Physical Output (X).

Direct X-to-X FFT Procedure: Physical Input (X) → 1D FFT (X) → X-to-Y Global Transposition → 1D FFT (Y) → Y-to-Z Global Transposition → 1D FFT (Z) → Z-to-Y Global Transposition → Y-to-X Global Transposition → Spectral Output (X).

Direct X-to-Z FFT Procedure: Physical Input (X) → 1D FFT (X) → X-to-Y Global Transposition → 1D FFT (Y) → Y-to-Z Global Transposition → 1D FFT (Z) → Spectral Output (Z).
Not all subroutines were readily set up for a posterior communication-optimization step; the Truncation and Hermitian redundancy routine, presented later in this Chapter, is one such case, due to its specific pencil orientation and minimal communication requirements. Following the correct functioning of the initial x-dimension to x-dimension pencil MPI solver, an effort was made to profile the majority of the code using the VAMPIR tracing tool, a standard profiler for MPI programs, in order to find possible optimization regions in the code. This confirmed that the communication effort made the Direct and Inverse FFTs the most time-consuming routines, prompting a first optimization step centred on reducing communication while retaining code stability.
The last and most significant step, which enabled faster processing times, was the removal of almost all allocation procedures from the regular calculation cycles and communication buffers, with all two- and three-dimensional variables being pre-allocated and their sizes remaining unchanged for the vast majority of the calculation cycles. While buffered communications were not optimized, the pre-allocation steps taken should be sufficient to ensure proper message handling, although this point may be addressed in more depth in the Further Work section.
3.1 Parallelization Scheme
In the course of the development of the MPI DNS solver, several different routines required translation to local memory access and operation. The following section introduces the schemes used, ranging from the simplest to those that demand extensive local information correlated with a global localization. These schemes affect mostly the statistical routines and are used extensively in those cases. Due to the pencil-memory distribution, some routines require global knowledge of where their local arrays sit within the global memory. Accessing these arrays is sometimes dependent on both the local position and global information, while in other cases these are minor details or not required at all. Regardless of the memory disposition, the schemes used satisfy the requirements of the original code in terms of address access whenever such access is relevant.
The parallelization scheme chosen is similar to the 2decomp Library default, with the Physical domain distributed along x-dimensional pencils and the Spectral space distributed along z-dimensional pencils. The z-pencils hold the complex output of the FFT, but in a real-kind array of near identical size, resulting in the same memory usage for the Physical space information and the Spectral space information, as depicted in Figure 11.
Figure 11 – Parallelization Scheme for Physical and Spectral Space: Physical Space (global dimensions Nx × Ny × Nz, KIND = REAL) in x-pencil orientation; Spectral Space (global dimensions Nx × Ny × Nz, KIND = REAL) in z-pencil orientation.
However, in the initial stages of development, and due to the functioning of the base code, a further step was taken towards a completely distributed parallel MPI version, by making key one-dimensional arrays match, in size and shape, the global pencil arrangement; Figure 12 shows a visualization of what was intended. This disposition enables the majority of routines to operate without global information by ensuring that arrays affecting specified entries in the field are generated with the same shapes as the corresponding z-dimensional pencils, allowing the developer to synchronize the local arrays at any point. The prime example of this logic is the initialization of the wave-number arrays, which contain information regarding the local energy and wave number at a global level. Parallelizing these arrays allows a direct, index-independent correlation between the local portion of the global memory and the wave numbers, a capacity exploited, for example, in the truncation of higher-energy nodes in the corresponding subroutine.
Figure 12 – Global Scheme versus Pencil Scheme: the Serial implementation holds the global Spectral Space (Nx × Ny × Nz, KIND = REAL) as a single block, while the MPI version distributes the same space along z-pencils of the same kind.
3.1.1 Global Summing
Global summing is dealt with efficiently by the MPI standard. A local sum is performed, either manually, by inspecting the entire local portion of the array entry by entry and accumulating into an auxiliary local variable, or with intrinsic operations, after which the local result is globally reduced using an MPI Reduce or Allreduce operation. Some routines have been further simplified by a local sum coupled with a global MPI_ALLREDUCE using the MPI_SUM operation.
Figure 13 – Global Summing Algorithm Flowchart: Input Local Field → SUMMATION procedure → Output Local SUM → MPI_ALLREDUCE (Global SUMMATION) → Output Global SUM.
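A minimal sketch of this scheme is shown below, assuming a z-pencil field and 2decomp's real_type MPI datatype; the variable names, and field itself, are illustrative.

    ! field is the local z-pencil portion of a distributed array
    real(mytype) :: local_sum, global_sum
    integer :: i, j, k, ierr
    local_sum = 0.0_mytype
    ! manual local sum over the entire local portion of the array
    do k = 1, zsize(3)
       do j = 1, zsize(2)
          do i = 1, zsize(1)
             local_sum = local_sum + field(i, j, k)
          end do
       end do
    end do
    ! global reduction: every core receives the global sum
    call MPI_ALLREDUCE(local_sum, global_sum, 1, real_type, MPI_SUM, &
                       MPI_COMM_WORLD, ierr)

Replacing MPI_SUM with MPI_MAX or MPI_MIN in the reduction yields the maxima/minima scheme of section 3.1.3.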
3.1.3 Maxima/Minima
To find a global minimum or maximum, the simplest route is for each processing core to find its local minimum or maximum and then combine these results using a global reducing operation.
Figure 14 – Minima/Maxima Algorithm Flowchart: Input Local Field → MIN/MAX procedure → Output Local MIN/MAX → MPI_ALLREDUCE (Global MIN/MAX) → Output Global MIN/MAX.
3.1.2 Local Summing
In some routines, local information is relevant at a global level, but the global information needed to perform a summation or comparison may not be locally present. In these cases, the local array under scrutiny must be accessed entry by entry, with an externally defined object used as input to describe the global details of the local information. The prime examples are statistical routines operating in the Spectral space, where information pertaining to a specific global slab must be summed differently from the remainder of the domain. Given that only a few cores hold parts of this slab, and that each core divides the memory into pencils, the entire slab is not present in any single core's memory. As such, the globally defined object allows the particular treatment of the cores containing the slab.
Figure 15 – Local Summing Algorithm Flowchart: the Local Field is cycled in k-j-i order; each position is compared with the Global-to-Local Object; depending on the comparison, SUM Procedure 1 or SUM Procedure 2 is applied; an MPI_ALLREDUCE (SUM) then delivers the Output Value.
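A sketch of the conditional summing is given below, assuming a z-pencil field and a slab normal to the (distributed) x-direction; islab stands for a hypothetical global index describing the slab, as supplied by the global-to-local object.

    real(mytype) :: local_sum, global_sum
    integer :: i, ig, ierr
    local_sum = 0.0_mytype
    do i = 1, zsize(1)
       ig = zstart(1) + i - 1          ! translate local index to global index
       if (ig == islab) then           ! this core holds part of the slab
          local_sum = local_sum + sum(field(i, :, :))
       end if
    end do
    call MPI_ALLREDUCE(local_sum, global_sum, 1, real_type, MPI_SUM, &
                       MPI_COMM_WORLD, ierr)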
3.1.4 Global Access at Local level
In specific cases only certain entries of the local arrays, defined on a global basis, are accessed, and the manner of access may or may not be simple to parallelize. The prime example is when the accessing index lies outside the bounds of the local information and, by extension, when the point of reference of that index is also outside the bounds of the local space. In this case, conditional accessing correlated with global information is required. The outer conditional section keeps each processor from performing unnecessary checks in the inner loops, thereby removing unnecessary operations from the processor.
Figure 16 – Global Access at Local Level Algorithm Flowchart: a Global Condition first establishes whether the core holds relevant information; if so, the Local Field is cycled in k-j-i order, each position is compared with the Global-to-Local Object, the information is accessed and a SUM procedure applied, otherwise the next k-j-i entry is taken; an MPI_ALLREDUCE (SUM) delivers the Output Value.
3.2 Fast Fourier Transforms
For the MPI version of the PSM DNS code developed, there was a need to develop and implement wrappers for the Direct and Inverse Fourier Transform routines present in the 2Decomp Library, in order to match the output variable type while requiring a minimum of modifications to the parent code. The chosen FFT engine was FFTW, given that the Library is flexible in regard to FFT engine selection. The logic used in designing the wrappers was to employ pre-allocated auxiliary variables, with type and dimensions matching the original Spectral space output, to obtain faster computational times by removing the need for run-time allocation. Memory considerations also played a role, allowing the pre-allocated memory array to be reused, at the cost of some coding complexity. Since these dimensions accommodate auxiliary variables, the wrappers then convert the complex output to a real-type output variable and remove the auxiliary-variable space from the memory pattern, returning a memory block of dimensions identical to the real global dimensions. This step was taken to facilitate further calculations, which retain only the relevant data and nothing else, in a departure from the original code where, due to in-house requirements, additional space was present and complicated code development.
Figure 17 – 2decomp Library FFT X to Z (Standard) Implementation: Physical Space (global dimensions Nx × Ny × Nz, KIND = REAL, x-pencil orientation) is mapped by the Direct transform, and recovered by the Inverse transform, to/from the Spectral Space (global dimensions Nx/2+1 × Ny × Nz, KIND = COMPLEX, z-pencil orientation), which includes the auxiliary variables for the FFT calculation.
3.2.1 Direct FFT Wrapper
The Direct FFT wrapper was developed to take advantage of the wrappers already existing in the 2Decomp Library, although, in order to provoke a minimal amount of changes in the remainder of the code, a variable type change is required.
Figure 18 – Direct FFT Wrapper Algorithm Flowchart: Physical Input (REAL) (X) → 1D FFT (X) → X-to-Y Global Transposition → 1D FFT (Y) → Y-to-Z Global Transposition → 1D FFT (Z) → Spectral Output (COMPLEX) (Z), stored in the auxiliary input of dimensions (Nx, Ny, Nz) → COMPLEX-to-REAL conversion (Z) → Normalization → Spectral Output (REAL, Normalized) (Z).
Figure 18 describes the algorithm used, where an auxiliary variable stores the output of 2Decomp's FFT wrapper, which is of complex kind with variable precision. As the majority of the code operates in the Spectral space using the logic of a standard real-kind FFT output, a further operation is required: the information in the complex structure is transcribed into a real-kind variable, used in the remainder of the code. This output variable has dimensions differing from the complex output, in order to simplify the remainder of the code's structure. A normalization procedure is then performed.
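A sketch of the Direct wrapper is shown below; decomp_2d_fft_3d and decomp_2d_fft_get_size follow the library's documented FFT interface (and assume decomp_2d_fft_init has been called), while the buffer handling, the complex_to_real transcription routine (sketched in section 3.2.3) and the normalization reflect the description above and are illustrative.

    subroutine fftd_wrapper(phys_in, spec_out, nx, ny, nz)
       use decomp_2d
       use decomp_2d_fft
       implicit none
       integer, intent(in) :: nx, ny, nz
       real(mytype), intent(in)  :: phys_in(:,:,:)       ! x-pencil, Physical space
       real(mytype), intent(out) :: spec_out(:,:,:)      ! z-pencil, Spectral space
       complex(mytype), allocatable, save :: aux(:,:,:)  ! reused complex buffer
       integer, dimension(3) :: fft_start, fft_end, fft_size
       if (.not. allocated(aux)) then                    ! allocate once only
          call decomp_2d_fft_get_size(fft_start, fft_end, fft_size)
          allocate(aux(fft_size(1), fft_size(2), fft_size(3)))
       end if
       call decomp_2d_fft_3d(phys_in, aux)               ! real-to-complex transform
       call complex_to_real(aux, spec_out)               ! unpack pairs, drop auxiliaries
       spec_out = spec_out / real(nx*ny*nz, mytype)      ! normalization
    end subroutine fftd_wrapper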
3.2.2 Inverse FFT Wrapper
Due to the Direct Wrapper's requirement of returning a real-kind output, a matching Inverse operation is required in order to make use of the potential of 2Decomp's calculation procedures. The flowchart presented in Figure 19 introduces the algorithm used.
Figure 19 – Inverse FFT Wrapper Algorithm Flowchart: Spectral Input (REAL) (Z) → REAL-to-COMPLEX conversion (Z) into the auxiliary input of dimensions (Nx, Ny, Nz) → 1D FFT (Z) → Z-to-Y Global Transposition → 1D FFT (Y) → Y-to-X Global Transposition → 1D FFT (X) → Physical Output (REAL) (X).
The algorithm consists of using the previously allocated complex-kind memory block to store the input for the operation, with the real input being transcribed into its correct locations. The inverse FFT operation is then performed on this auxiliary complex-kind memory block, and the result is returned to the user.
3.2.3 Real to Complex and Complex to Real
As mentioned in the previous sections regarding the Wrappers, two routines perform the conversion of the 2decomp FFT output, which is a complex-type array, into a similarly dimensioned real-type array storing the same information. However, as depicted in Figure 20, 2decomp's FFT output, in order to match existing FFT outputs in serial programs, has dimensions of (N1/2+1) × N2 × N3, as previously stated, with the first dimension being of complex kind, each entry consisting of two numbers in a structured pair. In terms of memory usage, this implies that 2decomp's FFT output has an effective (N1+2) × N2 × N3 shape, which is the size of the Spectral Output/Input auxiliary memory variable. In order to use the 2decomp Library capabilities, during development of the optimized version these extra variables, not being necessary past the FFT calculation, are not translated into the final real-kind Spectral space variable, which, in addition, is rendered into an N1 × N2 × N3 global shape, with the two components of each complex pair separated into two sequential memory entries. Enforcing identically sized memory pencils at compilation ensures that the last complex z-pencil, which contains these auxiliary variables, is the single pencil with a different size (of +1 complex pair per line). Outside of this, the necessary even number of nodes in the x-direction ensures that departing and arrival memory blocks have matching dimensions. These routines are direction independent.
Figure 20 – Complex/Real Subroutine(s) Scheme: Spectral Space (global dimensions Nx/2+1 × Ny × Nz, KIND = COMPLEX) is converted by the C-to-R routine into the Spectral Space (global dimensions Nx × Ny × Nz, KIND = REAL), and back by the R-to-C routine.
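A local sketch of the Complex-to-Real transcription is given below; the indexing is simplified to the local pencil extents, and the routine name matches the illustrative wrapper sketch of section 3.2.1 rather than the exact in-house code.

    subroutine complex_to_real(cin, rout)
       use decomp_2d
       implicit none
       complex(mytype), intent(in)  :: cin(:,:,:)   ! local (N1/2+1)-based pencil
       real(mytype),    intent(out) :: rout(:,:,:)  ! local N1-based pencil
       integer :: i, j, k
       do k = 1, size(rout, 3)
          do j = 1, size(rout, 2)
             do i = 1, size(rout, 1) / 2
                rout(2*i-1, j, k) = real(cin(i, j, k), mytype)   ! first component
                rout(2*i,   j, k) = aimag(cin(i, j, k))          ! second component
             end do
          end do
       end do
       ! entries of cin beyond size(rout,1)/2 are the auxiliary +2 planes
       ! and are simply not transcribed
    end subroutine complex_to_real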
3.3 Truncation and Hermitian Redundancy
As mentioned in section 2.2, there exists a routine which enforces the Hermitian Redundancy in the Spectral space. This routine's implementation is by no means trivial, with several steps being taken prior to the Hermitian Redundancy itself, such as the truncation of higher-energy nodes at global locations using the 2/3 rule. The Truncation set of instructions is required to remove aliasing errors from the calculation procedures: the presence of higher wave numbers may produce a cumulative numerical error which, given time, may eventually cause a divergent solution. The Truncation instructions are largely parallelizable with simple instructions, given that whether a specific entry should be truncated depends mainly on the wave number's own numerical value; since the wave-number arrays were previously parallelized, a set of IF instructions ensures proper mesh treatment at a local level, while correctly correlating with global locations.
Figure 21 – Truncation Global Scheme versus Pencil Scheme: the Serial implementation holds the global Spectral Space (Nx × Ny × Nz, KIND = REAL) as a single block, while the MPI version applies the same truncation to the space distributed along z-pencils.
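An illustrative local form of the truncation cycle is sketched below, assuming kx, ky and kz are the previously parallelized local wave-number arrays and kmax is the 2/3-rule cutoff; all names are illustrative.

    ! zero every entry whose wave-number magnitude exceeds the 2/3-rule cutoff
    do k = 1, zsize(3)
       do j = 1, zsize(2)
          do i = 1, zsize(1)
             if (kx(i)**2 + ky(j)**2 + kz(k)**2 > kmax**2) then
                field(i, j, k) = 0.0_mytype
             end if
          end do
       end do
    end do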
The Hermitian redundancy, on the other hand, depends strongly on the starting pencil orientation, and while its operation set is conceptually simple, its implementation, coupled with the need for pre-allocation and communication, makes it a non-trivial parallelization task. Two versions of this routine were created and tested: one which relied on the Spectral space being orientated along the x-dimension, and one which was orientated along
the z-dimension. In either case a complete mapping and inter-core querying and broadcasting effort is required for the localized information to be shared, so that communication partners are correctly set up. This communication and coordination step is simply the production of a communications table at the master core, coupled with a global broadcast of this table to all participating cores. Two versions were required, one for the x-dimensional pencil memory orientation and one for the z-dimensional pencil orientation.
Returning to the Hermitian redundancy routine itself, the algorithm relies on copying a local part of an otherwise global plate, offsetting some of its regions in different manners, mirroring each offset region (again differently depending on the region), and relaying the mirrored and offset region to the destination core, where a conditional replacement is performed. This is visualized in Figure 22 and explained in Figure 23. To do so, the Halo-Cell set of instructions present in 2Decomp is used to allow an inter-core memory-sharing operation where each core expands its memory using adjacent node information. The degree of memory expansion can be controlled by the programmer, and in this case it is the minimum required to perform the offset operation. As before, this requires the memory to be pre-allocated at a given dimension; the first time this operation is performed, it is reshaped into a shape that remains until its eventual de-allocation, corresponding to the necessary indexation for proper functional operation. The memory block to be expanded, given the pressing need to optimize memory, has an auxiliary nature and is reused repeatedly, as is any communication memory pattern. This was done by design and prevents data corruption of the functional memory fields corresponding to the velocities. Only once the final plate is correctly assembled is it placed back onto the original data fields, safeguarding the calculation procedures at the expense of a higher memory usage, and implying that the initial information is at all times present and can be accessed should the need arise at any point during posterior code development. The algorithm is better explained using the following visual depictions, with Figure 22 representing the domain to be operated on and the participating pencils, and Figure 23 representing the operations done on that memory plate, with the grey spaces being set to zero:
Figure 22 – Hermitian Redundancy Algorithm Depiction: the global Spectral Space (Nx × Ny × Nz, KIND = REAL) and the Auxiliary Base Plate (global dimensions 2 × Ny × Nz, KIND = REAL). Global operations on the auxiliary plate: 1. break up the plate into sections; 2. perform sectional mirroring; 3. perform sectional offset; 4. rebuild the plate from its sections; 5. enforce the Hermitian redundancy on the plate.
Figure 23 – Global Plate Operation Visualization
This routine is critically important to the correct functioning of the code, as it ensures, if working properly, that the Hermitian redundancy is truly present and enforced. As such it was extensively tested in both the x-pencil variant and the z-pencil variant, with only the latter being used in the final version, and being the one depicted here. It is written in a different fashion from the rest of the code, with an emphasis on ensuring that all participating variables, which are mostly 3-dimensional in nature, are treated using FORTRAN's own internal array-access procedures rather than the externally defined conditional access used in the remainder of the code. This routine requires that all participating pencils have the same size and shape in every processor holding the memory plate on which these operations are performed.
Global operations on the plate, as shown in Figure 23: 1. Section Breakup; 2. Section Mirroring; 3. Section Offset; 4. Section Rebuild.
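The underlying symmetry may be illustrated with the serial-minded sketch below: for a real velocity field, û(−k) = conj(û(k)), so on the kx = 0 plane each entry must equal the conjugate of its mirrored counterpart. The gathered plane and its handling here are illustrative; the MPI routine performs the equivalent operation through the offset, mirroring and communication steps described above.

    ! plane is a hypothetical fully gathered kx = 0 plane of the spectral field
    complex(mytype) :: plane(ny, nz)
    integer :: j, k, jm, km
    do k = 1, nz
       do j = 1, ny
          jm = modulo(ny - (j - 1), ny) + 1      ! index holding -ky
          km = modulo(nz - (k - 1), nz) + 1      ! index holding -kz
          ! overwrite only one member of each mirrored pair
          if ((k - 1)*ny + j > (km - 1)*ny + jm) then
             plane(j, k) = conjg(plane(jm, km))  ! enforce conjugate symmetry
          end if
       end do
    end do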
3.4 I/O
Given the code complexity and the simulation requirement of a large number of calculations, the velocity fields must be saved at specific points during runtime, for study on smaller-capacity workstations and for post-processing. The specific details of opening, reading and writing files with the MPI standard are available in a wide range of literature, with further operations provided by the Library. For the purposes of this code, to ensure less time is spent saving, the files are opened outside the main cycle and kept open, being written to when needed using the faster data-writing variant present in the 2decomp library. This functionality has two aspects that require addressing. Firstly, a displacement variable is required, updated on a local basis, implying that it is already allocated and given the correct type. Initially, this variable simply measures the byte-size of a given number. Because parts of the global variable field reside in different processors, the displacement variable is then updated and accrued to the displacement corresponding to the start location of the localized field with respect to the global field. The displacement is measured length-wise, regardless of the nature of the field in question, and given the parallelization scheme it may be forced to skip values periodically. The details of this operation are handled inside the 2decomp Library, with the user only required to create and size the displacement variable correctly; a starting value of zero is used, with follow-up updates. This set of instructions ensures that, when the writing operation is executed, the data is saved with the proper size, in terms of byte-length, for its individual memory block. The same variable is used for reading from the file, having the same values and nature.
The second point to be addressed is that the saving and reading operations use MPI-native type numbers, which correspond to transcribing a processor's memory directly onto disk. Evidently, this information cannot be read easily by non-MPI-native tools. The advantage of this kind of operation is simply reading and writing speed, as the information on file is directly transcribed to and from RAM addresses, making for less time expenditure during the operation. These details apply to the majority of the file operations used in the code, where calculation speed is more relevant than the actual manner in which data is left on the disk, although this requires that post-processing tools be able to read the MPI file and convert it into a FORTRAN90-native standard, or standardize it with a converter program prior to post-processing. Due to the inability of a workstation to load the entire global variable data into its RAM, both a Spectral space and a Physical space version
of the velocity field are stored, requiring an Inverse FFT operation shortly before file writing. The Physical space file is not meant to be read during this code's execution, since the code functions mainly in the Spectral space, where calculations are performed, but it is useful for post-processing tool development. There was also the need to save planes from the working fields during cluster run-time, for which the 2Decomp Library includes dedicated functionality. These fields are not meant to be read outside visualization efforts, and their storage is done in MPI-native format, much like the majority of the file-saving procedures available in 2decomp. The reading of these files is also left to a post-processing converter tool, which was developed shortly after the successful trial runs of the MPI solver in Marenostrum.
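A sketch of the open-file saving pattern is presented below; decomp_2d_write_var and its displacement argument follow the library's documented I/O module, while the file name and field names are illustrative.

    use decomp_2d
    use decomp_2d_io
    use MPI
    integer :: fh, ierr
    integer(kind=MPI_OFFSET_KIND) :: disp
    ! file opened once, outside the main temporal cycle
    call MPI_FILE_OPEN(MPI_COMM_WORLD, 'velocity.dat', &
                       MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fh, ierr)
    disp = 0_MPI_OFFSET_KIND          ! starting displacement of zero
    ! each call writes one distributed field and accrues disp accordingly
    call decomp_2d_write_var(fh, disp, 1, u)   ! u, v, w stored as x-pencils
    call decomp_2d_write_var(fh, disp, 1, v)
    call decomp_2d_write_var(fh, disp, 1, w)
    call MPI_FILE_CLOSE(fh, ierr)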
3.5 Randomization
Due to the Forcing method selected, a randomization method was required that took advantage of the parallel nature of the solver developed so far. Computational logic uses pseudo-random formulations to generate a string of numbers which, statistically, corresponds to a random sequence as otherwise found in nature. Pseudo-random algorithms use a set of techniques to approximate a random number distribution, but the vast majority relies on a seed, which effectively forms the basis of the formulation used by the RANDOM function in the FORTRAN standard. In order to match results with serial versions of the code, there was a need to develop a number of random number generators that could be extended with more models and algorithms at a later stage. The solution was to develop, using a pre-allocation logic, arrays containing the random numbers, generated before they are needed in the calculation; this also provides a separate optimization opportunity. As mentioned before, and especially applicable to a completely distributed parallel program, should all processors initialize with the same seed, there will be no true random dispersion, with each processor producing the same numerical sequence. Coupled with the fact that, whenever possible, blocks are of identical or very similar size, all processors would be generating the same, or a very similar, random sequence, thereby defeating the purpose of using randomization. The solution to this problem relies on the creation of a parallel random number generator, which may then be altered to obtain different number distributions depending on the intended results. For this task, the first goal was to create a generator which, in effective terms, produced the same random numbers as the original serial version, for the same seed.
To do this, whilst keeping in mind that access to the RANDOM function must be kept, two routines were created, one filling a single large 3-dimensional array and another filling two, ensuring that in both cases the generation followed the same order in which the serial version performed its calls to the RANDOM routine, for the same seed. Secondly, in order to pave the way for future developments, the choice of random number generator was tied to a run-time option variable, which allows any user to select among the implemented randomization algorithms. A number of different algorithms were implemented, based on different ways of accessing the RANDOM routine. At some point in the future, further work may be done to more closely approximate other distributions for the randomised entries.
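The simplest parallel seeding strategy is sketched below: each rank offsets a base seed by its own rank number, so that every core draws a distinct sequence. This is a simplification for illustration; the serial-matching generators described above instead reproduce the serial call order for a given seed.

    integer :: n, rank, ierr
    integer, allocatable :: seed(:)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call random_seed(size=n)          ! query the seed length for this compiler
    allocate(seed(n))
    seed = 1234 + 17 * rank           ! base seed offset by the rank number
    call random_seed(put=seed)
    call random_number(rnd_field)     ! fill a pre-allocated local array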
3.6 Statistics
During the development and execution of the code, several routines present varied statistical information to the user. These routines, in previous versions of the code, relied on the entirety of the memory being available for calculation in either one processor (for the Serial version) or one node (for the OpenMP version). For the MPI version, as mentioned before, this is not the case: the memory is physically separate from node to node, although shared by the processors within a given node. This implies that, for the simpler routines, local partial sums coupled with a global reduction are the most efficient way to proceed. The statistical routines operate, for the most part, in the Spectral space, with only two or three cases involving an inverse Fourier transform to calculate the same statistics in the Physical space. For more complex entities, such as vorticity, more complicated procedures exist in the Serial and OpenMP code versions, and their transcription into MPI, with its specific behaviours, is not immediate. These procedures were grouped together in an extensive routine responsible for calculating and delivering the relevant information to the user.
3.7 Converter
Due to the MPI file reading/saving procedures, which have an inherent focus on calculation speed, data is stored in MPI-native format, which corresponds to a near memory dump of RAM information onto the physical hard drive from each node, with memory displacements corresponding to the global information. This method of storing information is similar to FORTRAN's native means of saving calculation data to a file, but the two standards are neither identical nor directly compatible. Due to the need to run the code on large clusters, and the great difference between the RAM available at a cluster, even if distributed, and the RAM at a workstation, a converter is required to analyse any data saved during runtime.
Figure 24 – Converter MPI NATIVE to FORTRAN90 NATIVE Algorithm Flowchart: the Global MPI file (MPI NATIVE) is cycled in k-j-i order; each position is compared with the user-supplied Start/End information; matching entries are stored in an intermediate desired-field section and then in the Output Array (FORTRAN90 NATIVE).
This data access is user defined at run time and reads only the necessary fields saved during runtime at the cluster. Further, any other details stored using
MPI-derived operations can be accessed by the converter during code execution at a cluster, following a successful save, without affecting the data saved from the cluster. The converter was thus created to read the entire field line by line; if the position/displacement of the current entry corresponds to a globally defined coordinate of interest, it is stored in the workstation's RAM until the RAM holds all the relevant data. Storage of this information is then done in FORTRAN90's standard for file saving, enabling users to employ tools such as TECPLOT, or other codes, to investigate or manipulate the converted data for posterior investigation. To complement this step, the reading and storing orders may differ, enabling the converter to bridge this fact by delivering a saving order compatible with the eventual reading by other relevant post-processing codes already available. The converter may read planes or smaller fields from the main memory block, depending on the start/end information provided by the user.
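A reading sketch for the converter is shown below: since the MPI-native file is a raw byte stream holding the global field in Fortran (column-major) order, a contiguous z-plane may be read directly with stream access. The file name, plane index, global dimensions and 8-byte precision are illustrative assumptions.

    integer :: unit
    integer, parameter :: nxg = 1024, nyg = 1024   ! example global dimensions
    integer :: k_plane = 1                         ! example plane of interest
    real(kind=8) :: plane(nxg, nyg)                ! one z-plane of the field
    integer(kind=8) :: pos
    ! byte position of the k_plane-th z-plane (8-byte reals, 1-based stream)
    pos = 1_8 + int(k_plane - 1, 8) * int(nxg, 8) * int(nyg, 8) * 8_8
    open(newunit=unit, file='velocity.dat', access='stream', &
         form='unformatted', status='old')
    read(unit, pos=pos) plane                      ! read only the section of interest
    close(unit)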
Chapter 4
Chapter outline:
Chapter 4 delivers the final results obtained following the successful implementation of all algorithms presented in Chapter 3, for the subjects addressed by Chapter 1 and Chapter 2. It also addresses previous code versions produced during the coding of the solver and presents temporal results for the most important calculation routines and for the finalized solver.
Results
As mentioned in the document, at least two main versions of the MPI DNS solver were executed, with fundamental differences in how the memory pattern was handled in the Spectral space. The first version, as a requirement for minimal error at the code verification stage, was a Physical x-pencil to Spectral x-pencil solver. The majority of the Serial code, and by extension the OpenMP solver, operates x-wise in the Spectral space. In order to prevent code malfunction, the logic was to use the original solution parameters and processes, so that any further modification to the Spectral space memory arrangement could be verified at each step. However, a pencil parallelization scheme is always fundamentally different from the Serial version, which implied that, regardless of the pencil orientation for memory-handling procedures, all participating routines needed to be re-created from scratch. Further, the entirety of the code demanded all subroutines to be interfaced, greatly changing the programming structure of the entire code, in exchange for the possibility of switching the code from single to double precision using only a 2decomp-library-related compilation flag.
However, one fact that could not be removed at this stage was the dimensions of the field, of (N1/2+1) × N2 × N3 (of complex type). Even though, as previously mentioned, this is turned into an (N1+2) × N2 × N3 array (of real type) in order for the remainder of the code to operate, the (N1/2+1) x-direction extent, in global memory terms, was unchanged at this time. Similarly to what happened in the Serial and OpenMP solver versions, the two extra (+2) memory planes were, following an Inverse FFT operation, numerically forced to be 0.
While at this stage the need to avoid the extra global memory transpositions could already be guessed at, numerical testing was performed with the aforementioned logic, using VAMPIR to extract information from the local IST clusters and from the local workstations. Once the program was verified, the highest temporal expenditures were confirmed to be the communication-related MPI operations, followed by local code operations. In order to remove the first time-sink, the entirety of the code was re-structured so that a minimal communication model was achieved, and this promptly removed the communication times as the largest temporal bottleneck.
Following that, steps were taken to reduce the localized code operations, starting with fixed-time, hardware-related allocation times. At this stage, the Spectral space dimensions of (N1+2) × N2 × N3 were adjusted so that, past the FFT calculation procedures, a Spectral space of dimensions N1 × N2 × N3 was retained, which allowed all routines that cycle the field to stop depending on external information and voided the need for the wrappers to force the two extra planes to zero. At this stage the majority of the code was seen as ready to advance onto the more expensive resources made available at the Marenostrum cluster.
Subsequently, a minor glitch was detected, relating to FORTRAN90's maximum integer value, which could not have been caught locally due to the impossibility of testing larger meshes there; as this error had been unexpected, it was diagnosed and corrected in situ at Marenostrum with the help of the -backtrace flag. Once corrected, the MPI DNS solver code was extensively tested at Marenostrum. At this stage the code was functionally tested at Marenostrum on a 4096³ cubic mesh using the x-pencil to z-pencil version, but the test was an isolated attempt without further progress, as the goal of this thesis was to enable at least a 2048³ cubic mesh. Further statistical routines were then adapted as temporal testing was performed in order to bring the project to fruition; once these were completed, the same logic of first checking code behaviour on a local workstation, then moving up to the internal IST cluster, and lastly to Marenostrum, was followed. At this stage, random number generation was added, and tested at Marenostrum for the final temporal results here presented. Past these initial steps, the MPI DNS solver code presents a near-ideal scalability curve, has several random number generator options, is verified against the Serial and OpenMP versions, and is fully able to run with a double-precision calculation procedure.
4.1 Tests and Speed Up
4.1.1 Development of X to X Version
The initial development of the MPI version of the Spectral code consisted of the creation of the wrappers and their application to the initial version of the code, which uses the x-dimension as the foundation for all DO cycles, given that FORTRAN, being column-major, is faster with a k-j-i memory access order, implying that the i-th index must be the innermost in any DO cycle.
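The access-order argument is illustrated below: keeping the i-th index innermost gives a unit-stride inner loop that matches FORTRAN's column-major array storage. The array names and bounds are illustrative.

    ! k-j-i nesting: the first (i) index varies fastest, matching
    ! FORTRAN's column-major storage and giving contiguous accesses
    do k = 1, n3
       do j = 1, n2
          do i = 1, n1
             a(i, j, k) = b(i, j, k) + c(i, j, k)
          end do
       end do
    end do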
However, a closer inspection of all participating routines outside the wrappers led to the conclusion that such an approach, while desirable, would still require a major overhaul and re-writing of all participating routines. Still, at this stage, the desired goal was the development of an MPI code with minimal functional modifications to the parent code, aimed at continuing to take advantage of the k-j-i order of access to the major arrays. Regardless of this particular set of instructions, the implementer decided to verify and prepare all basic participating subroutines for posterior steps pertaining to an eventual optimization aimed at reducing communications.
The initial steps taken relate directly to the usage of the 2decomp library itself, as there was a need to alter all participating real variables to the standard used by the library. Given the heavy communication patterns and subsequent buffers, as well as the internal processes, the current code's programming structure had to be altered into one guided primarily by modules and interfaces. Such an approach demands that all variables be of identical nature, with the nature defined by a 2decomp Library parameter fixed at compilation time. This parameter, dubbed mytype, is responsible for ensuring that any variable, at declaration, is of the proper nature/length in terms of memory. Further variants of this parameter exist in the library to assist the user when dealing with MPI communication routines, ensuring proper variable identification, such as real_type and complex_type, which should be used in preference to the MPI_DOUBLE or MPI_COMPLEX instructions.
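A sketch of these conventions is given below; mytype, real_type and the compilation-time precision switch are documented 2decomp features, while the buffer names and sizes are illustrative.

    use decomp_2d        ! provides mytype, real_type, complex_type
    ! the same declaration compiles to single or double precision,
    ! depending only on the 2decomp compilation flag
    integer, parameter :: n = 1024
    real(mytype), allocatable :: buf(:), buf_sum(:)
    integer :: ierr
    allocate(buf(n), buf_sum(n))
    ! real_type always matches mytype, avoiding a hard-coded MPI_DOUBLE
    call MPI_ALLREDUCE(buf, buf_sum, n, real_type, MPI_SUM, &
                       MPI_COMM_WORLD, ierr)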
The utilization of this parameter, coupled with the modular nature of 2decomp, forces the user to construct the program with recourse to the INTERFACE construct, which ensures that any given variable, when affected or created within a subroutine, has the same nature as in the parent code; the INTERFACE guarantees that the nature of the variable is correctly translated from the parent code to the functional subroutines. This logic demanded that all re-created subroutines have an explicit external interface and be of a procedure nature, with all of the larger variables retaining their allocatable nature regardless of the level at which they were summoned. The benefit is that all variables may be allocated, de-allocated and reshaped at any point in the program, which, while of debatable general usefulness, allows separating the declaration and allocation of the variables from the parent code into a different routine. This approach proved useful in one particular subroutine, which demanded the in-situ reshaping of a particular variable into a proper size during the first temporal cycle execution; the variable then retains this shape and all correct indexing.
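An illustrative interface block is sketched below, showing how an allocatable dummy argument keeps its allocatable nature inside a subroutine, enabling the in-situ reshaping described above; all names are illustrative.

    module solver_interfaces
       interface
          subroutine reshape_buffer(buf, n1, n2, n3)
             use decomp_2d, only : mytype
             ! the explicit interface lets buf remain allocatable inside
             ! the routine, so it may be (re)allocated on the first cycle
             real(mytype), allocatable, intent(inout) :: buf(:,:,:)
             integer, intent(in) :: n1, n2, n3
          end subroutine reshape_buffer
       end interface
    end module solver_interfaces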
The heavy use of interfaces ensures the scoping of a given variable, allowing a more object-like treatment. This was, in fact, the intended goal of redoing all of the participating subroutines at this stage: to ensure that the memory blocks, when accessed, had the same nature and were de facto global entities, and as such given global scope. At this stage, the routine consuming the most development effort was the Hermitian redundancy routine, which demanded heavy parallelization but minimal communications. This subroutine, as explained in its relevant section, when done in an x-pencil to x-pencil logic, demands a communication-table set-up, and would not be immediately transferrable into an x-pencil to z-pencil logic. With the usage of objects containing all global information, all other subroutines were made dependent on the object and on the pencil orientation, so that the various access and operation types were correctly connected to globalized operations.
Once the x-pencil to x-pencil version was completed, it was subjected to testing on three different machines before implementation of the code in Instituto Superior Técnico's local cluster. During these tests, some difficulties and assumptions were corrected, since, when testing on a node, some variables were not broadcast correctly. Once these issues were corrected, the program was tested using one node up to its maximum capacity, and then expanded to two nodes up to the cluster's capacity at those nodes, to demonstrate the proper implementation of the MPI version. Numerous tests were done at this stage to compare the output of the MPI version with the Serial and OpenMP versions, in order to ensure proper calculation of all intermediate steps. At this time, VAMPIR was brought in for further testing, and numerous trials were done in order to ascertain the greatest temporal sinks; decisions were then taken to advance into the optimization stage.
A representative table of temporal results (Table 2) from the tests done using the Galego cluster for the x-to-x MPI DNS solver is presented next, for a cubic 128³ mesh.
nprocs   T (s)
4        3,73
8        3,07
16       1,61
32       1,51
Table 2 – Temporal Results (128³ mesh) in Galego
These values, when plotted and compared against ideal scaling behaviour (Figure 25), indicate a need for optimization, as well as the possibility of achieving it, given the intensive communication pattern.
Figure 25 – Temporal Results (128³ mesh) in Galego (X to X): time per cycle versus number of processes, on log-log axes, against ideal scaling.
Further testing for the same 128³ mesh, using 4 processes, focused on the primary cause of the temporal deviation reported by VAMPIR: the FFT operations.
The average values over a sample of 100 cycles of FFT operations (Direct followed by Inverse), taking ten representative values, yield the following results (Table 3), with the average calculated over the listed iterations on a given field.
TEST   R → C   x → y   y → z   FFTI   Units
1      1,2     3,8     2,8     7,3    ×10⁻² s
2      1,4     3,6     3,0     7,2    ×10⁻² s
3      1,4     3,7     3,0     7,2    ×10⁻² s
4      1,4     3,6     3,3     7,0    ×10⁻² s
5      1,3     3,6     3,1     7,2    ×10⁻² s
6      1,4     3,7     3,1     7,2    ×10⁻² s
7      1,4     3,7     2,6     7,2    ×10⁻² s
8      1,4     3,9     2,8     7,2    ×10⁻² s
9      1,4     3,5     3,1     7,3    ×10⁻² s
10     1,3     3,6     3,0     7,2    ×10⁻² s
AVG    1,36    3,58    2,68    7,2    ×10⁻² s
Table 3 – Temporal results for FFTI (128³)
If we take the average values and sum the total time, we may estimate the average time consumption of the FFTI wrapper by its participating components; taking the percentages of the time expenditures, we obtain:

Component of the FFTI wrapper   Share of time
R → C                           9,2 %
x → y                           24,2 %
y → z                           18,1 %
FFTI                            48,6 %
Table 4 – FFTI Temporal usage breakdown
Table 4 indicates that, for the x-to-x version, 42,3% of the time is spent simply transposing the memory, with a further 9,2% in the Real-to-Complex kind conversion, rather than in FFT operations proper.
For the same sample of 100 iterations, but focusing on the Direct FFT component, the following times were obtained, as listed in Table 5:
TEST   FFTD   z → y   y → x   C → R   Units
1      7,7    2,6     3,8     1,8     ×10⁻² s
2      7,6    2,6     3,8     1,7     ×10⁻² s
3      7,7    2,6     3,8     1,8     ×10⁻² s
4      7,7    2,6     3,8     1,7     ×10⁻² s
5      7,7    2,6     3,7     1,7     ×10⁻² s
6      7,7    2,6     3,8     1,7     ×10⁻² s
7      7,7    2,6     3,8     1,7     ×10⁻² s
8      7,7    2,6     3,7     1,8     ×10⁻² s
9      7,7    2,6     3,8     1,7     ×10⁻² s
10     7,7    2,7     3,8     1,7     ×10⁻² s
AVG    7,69   2,61    3,78    1,73    ×10⁻² s
Table 5 – Temporal results for FFTD (128³)
Taking the average values and performing the same breakdown as before, we get:

Component of the FFTD wrapper   Share of time
FFTD                            48,6 %
z → y                           16,5 %
y → x                           23,9 %
C → R                           10,9 %
Table 6 – FFTD Temporal usage breakdown
Table 6 indicates that, on average, 40,4% of the time is spent on memory transpositions and a further 10,9% on the kind conversion, outside the FFT operation itself. Further, the VAMPIR trace results suggested that about 30% of the total time was spent in FFT calculations (FFT plus kind conversion and allocation), with 70% being MPI-derived temporal usage, that is, communication time for the global communications and transpositions involved both externally and internally to the numerical calculation procedures.
Taken in conjunction, these results lean heavily towards the reduction of communications, even prior to an allocation-removal optimization step, which at this stage was not yet an obvious time sink. As such, the primary step was the development of a new wrapper which, using the same logic of allocation and de-allocation of intermediate buffers, would allow a direct comparison of the temporal expenditures. Such a wrapper, when developed and compared to the previous one over 1000 iterations (an arbitrarily large number) on a 64³ cubic mesh, on a workstation, produced the following results:
Version   FFTD + FFTI, 1000 iterations
X-TO-X    12,203 s
X-TO-Z    6,327 s
Table 7 – Total Temporal expenditure
The results listed in Table 7 point towards a 48,15% speed-up from simply reducing the communications involved in the wrapper-related operations. Another set of tests, now for the entire program, provided similar results for 100 runs, with the time per cycle as follows:
SOLVER    100 iterations
X-TO-X    0,4 s/cycle
X-TO-Z    0,278 s/cycle
Table 8 – Time per Cycle in workstation
The results in Table 8 point towards a 29,5% expected speed-up on the workstation; while far from the 48,15% speed-up presented earlier, both results taken in conjunction pointed towards the need to develop an alternate communication and calculation pattern.
4.1.2 Development of X-to-Z Version
Initially, the program was designed to behave, when in Spectral space, with x-pencil logic, with the x-dimension being halved*. However, when delivering an x-pencil Spectral space, the library creator opted to reduce a different dimension and therefore greatly alter the data distribution, with the z-dimension being halved*. This was not acceptable, as the code relies on the data distribution along the x-dimension being halved, as already mentioned, so the possibilities were discussed and the option to progress to a Spectral z-pencil was taken.
Thanks to the work mentioned earlier, the translation of the vast majority of the code was made simpler by the variable definitions and procedural nature already in place, with the modifications consisting of re-doing all cycle-index information to report to z-pencil information rather than x-pencil information. At this stage, some instructions were opportunistically removed, and the objects pertaining to this information were made subroutine input variables rather than local copies, further cleaning up the code. At this level of programming, the most time-consuming operations were the FFTs, which indicated a heavy communication effort and an immediate gain for further work, owing to the initial x-pencil to x-pencil logic devised in the first version.
Due to the way 2Decomp delivers the x-pencil FFT output, it was unacceptable, for the reasons mentioned earlier, to use that version; therefore, in order to reduce the communication effort, a Real space x-pencil to Spectral space z-pencil logic had to be pursued. The vast majority of the participating subroutines had already been extensively reworked for the pencil-memory logic, so the change demanded only minor modifications. The greatest complication was the truncation and Hermitian redundancy routine, which had to be extensively re-worked due to its own internal logic, in order to enable a proper global reflection of a single plate; it had to be adjusted from an x-pencil to a z-pencil logic, demanding a different communication pattern, logic and implementation.
Given the results posted in the previous section, as well as the comparison tables, the program was completely altered to obey the new Spectral space distribution. As mentioned, this required the re-construction of the Hermitian redundancy routine; additionally, all cycles required an update to the new distribution, since the localised information was now organised differently. With the program completed, and after verification and validation of the results, the timing tests could begin. Again an arbitrary number of iterations was chosen and, to avoid cluttering up computational resources, the mesh was downsized to 64³ to allow workstation usage. In this case 1500 iterations were performed, and the timed results are listed in Table 9:
Version   1500 iterations
x → x     0,400 s/cycle
x → z     0,2953 s/cycle
Table 9 – Time per Cycle in workstation
While these temporal results fall in line with the FFT temporal results for a same-sized mesh, the speed-up, calculated by comparing the new temporal expenditure with the previous value, is 26,2%. For the FFT operations (Direct plus Inverse), VAMPIR reported the calculation versus communication time expenditure, for 1500 iterations, as:
Total              162,124 s (1500 iterations)
Calculation Time   57,991 s
MPI                104,438 s
Table 10 – Total Time Expenditure
Table 10 shows that only 35,6% of the time is spent in numerical calculation, with MPI communication outside the numerical calculation procedures accounting for the remainder, a reduction from the previously obtained 70% time expenditure in global communications.
Turning again to VAMPIR, the results obtained matched the previous ones, with a slight disparity due to the added instrumentation effort, as listed in Table 11:
New (x → z)   0,3 s/cycle
Old (x → x)   0,4 s/cycle
Table 11 – Time per Cycle, 64³ mesh, 4 cores
With VAMPIR's automated instrumentation, the reported speed-up between the two competing versions is 25%. The relative time expenditures are tabled next, in Table 12:
Runge Kutta          33,8 %
AllToAll             19,3 %
FFTI                 14,0 %
FFTD                 6,2 %
R → C                2,7 %
C → R                1,3 %
Non-Parallelizable   22,7 %
Table 12 – Relative Time Expenditure
Of the results listed in Table 12, the only one which needs explaining at this stage relates to the Runge-Kutta calculation scheme. The FFT operations stopped being the highest temporal expenditure, with the higher FFTI usage explained by the fact that, in the code, two FFTI operations are called per FFTD operation. The Runge-Kutta routine, now the highest temporal expenditure, is a complicated routine that accommodates several others. Analysis of its constituent cycles and processes, which had been remade during parallelization, showed that these were not at fault for the high temporal usage. At this stage, the only logical step was to assume that the fixed, hardware-related times were responsible for this temporal sink. Given that all DO cycles had been optimized in the x-to-x to x-to-z transition, and even the calculation space had been optimized by adapting the complex-to-real and real-to-complex routines, only the several allocation and de-allocation operations of calculation spaces, message buffers, and intermediate variables of large size were still in need of optimization.
4.1.3 Pre Allocation Optimization
Figure 26 – VAMPIR Results Visualization, excluding non-parallelizable time expenditure: percentage of time spent in the Runge-Kutta routine, MPI_AlltoAll, FFTI, FFTD, and the Real-to-Complex and Complex-to-Real routines.
During the development of the x-pencil to z-pencil version of the code, there was the opportunity to remove some key objects that defined global structures. Following the completion of the x-pencil to z-pencil conversion, the program was verified to be around 25% faster, but that value was less than expected, and VAMPIR data profiling indicated that communication time had ceased to be the major time-consuming operation. Instead, the largest time consumption now lay in non-MPI-related routines, as may be seen in Figure 26, where the Runge-Kutta routine presents the highest time sink. Since the basic DO-loop order was left unchanged and there were no barriers to explain the excess time usage, the fault was traced to hardware-related mechanics, such as file opening and closing and memory allocation, of which an extensive amount was performed during code execution. In order to verify the possibility of further time gains, all participating memory blocks were allocated externally at program start and de-allocated only at program conclusion, coupled with a reduction of memory usage by allowing temporary and auxiliary memory arrays to be reused, given the identical sizes and pencil orientations present in the sub-steps of the code's functional routines. This step vastly improved the program's runtime execution; at its conclusion, the communication effort of the library occupied the majority of the execution time, and while further optimization may be possible, the major limitation is currently external to the code developed here.
The various stages are presented next: each includes a short explanation of the
modification and presents the temporal results in both tabular and graphical form, with
adimensional results included for easier visualization.
The initial stage, prior to the pre-allocation of the large three-dimensional variables,
presented the following temporal results, as seen in Tables 13 and 14:

cores \ mesh      128      256      512      1024
4                 1,9     18,1    155,3    1242,4
16                1,2      9,1     77,8     622,4
32                2,0      6,9     48,0     229,7
64*               0,5      4,1     48,9     152,1

Table 13 – Time (seconds per Cycle) (X to Z) (Galego; with Allocation)

cores \ mesh      128      256      512      1024
4                 1,86     2,26     2,43     2,43
16                1,17     1,13     1,22     1,22
32                1,98     0,86     0,75     0,45
64*               0,50     0,51     0,76     0,30

Table 14 – Adimensional Time per Cycle (X to Z) (Galego; with Allocation)
These results, as visible in the charts (Figures 27 and 28) presented on the next
page of this document, show a less than ideal behaviour in terms of scalability.
Given that by this stage all cycles had been redone, and the vast majority of
unnecessary operations had been removed from the code, only the repetitive allocation and
de-allocation of intermediate buffers and communication send/receive buffers remained,
along with some instructions that replicate global information at the local level, necessary
for specific routines.
As such, the guidelines to follow were established (a minimal sketch of the resulting pattern is given after the list):
1. Communication Buffers for MPI messages should be allocated at program start.
2. Intermediate buffers, if required, should be allocated at program start.
3. Buffers may be re-used if they are available, to reduce physical memory usage.
4. All global objects should be created at program start and referenced thereafter.
5. If a buffer requires re-shaping, its dimensions should be compatible from cycle to
cycle.
6. Minimize memory expenditure if possible, but without sacrificing calculation
speed.
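
To illustrate guidelines 1 to 4, a minimal sketch of such a pre-allocation pattern is given below. The module, routine and variable names are hypothetical and do not correspond to the solver's actual structure; in practice, the pencil dimensions would come from the decomposition library (for instance, the xsize and zsize arrays exposed by 2DECOMP).

module work_buffers
   ! Hypothetical module illustrating guidelines 1 to 4: every large
   ! buffer is allocated once at program start and re-used by any
   ! routine needing scratch space of that shape and orientation.
   implicit none
   real(kind=8), allocatable :: scratch_x(:,:,:)          ! x-pencil scratch
   real(kind=8), allocatable :: scratch_z(:,:,:)          ! z-pencil scratch
   real(kind=8), allocatable :: send_buf(:), recv_buf(:)  ! MPI message buffers
contains
   subroutine buffers_init(xsz, zsz, buf_len)
      ! Called once, right after the pencil decomposition is set up.
      integer, intent(in) :: xsz(3), zsz(3), buf_len
      allocate(scratch_x(xsz(1), xsz(2), xsz(3)))
      allocate(scratch_z(zsz(1), zsz(2), zsz(3)))
      allocate(send_buf(buf_len), recv_buf(buf_len))
   end subroutine buffers_init
   subroutine buffers_finalize()
      ! Called once, just before program termination.
      deallocate(scratch_x, scratch_z, send_buf, recv_buf)
   end subroutine buffers_finalize
end module work_buffers

Any routine that previously allocated and de-allocated its own temporaries on every time step instead uses these shared buffers, in line with guidelines 3 and 5.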
A note on all tables in this section pertaining to Galego: the cluster's nodes have a
maximum of 32 cores each, so the 64-core results (marked with *) include inter-node
communication.
[Figure: log-log plot of time per cycle versus number of cores, for the 128, 256, 512 and 1024 meshes, with the ideal scaling curve]
Figure 27 – Temporal Results (X to Z) (Galego; with Allocation)

[Figure: log-log plot of adimensional time per cycle versus number of cores, for the same meshes, with the ideal scaling curve]
Figure 28 – Adimensional Temporal Results (X to Z) (Galego; with Allocation)
After following all the previously established guidelines, the results obtained are
listed in Tables 15 and 16 and visualized in Figures 29 and 30:

cores \ mesh      128      256      512      1024
4                 1,9     15,7    137,4    1099,2
16                0,6      4,8     46,0     324,4
32                0,3      3,3     27,6     258,0
64*               0,2      2,3     18,9     190,7

Table 15 – Time (seconds per Cycle) (X to Z) (Galego; without Allocation)

cores \ mesh      128      256      512      1024
4                 1,91     1,96     1,96     1,96
16                0,56     0,59     0,66     0,58
32                0,34     0,42     0,39     0,46
64*               0,24     0,29     0,27     0,34

Table 16 – Adimensional Time per Cycle (X to Z) (Galego; without Allocation)
While the magnitude of the calculation time at 4 cores is initially similar, the new
results point towards better scalability, and no further optimization was possible without a
much larger overhaul of the entire initial algorithm.
Given that no further parts of the code were identified as time sinks, the decision
was taken to progress to MareNostrum III, since by this stage any remaining issues, if
present, would manifest themselves on a large cluster. There was also the need to test the
MPI DNS solver with a large number of processors and to explore its viability in a
production environment. Few problems were encountered, except when testing meshes with a
grid size larger than 1024³. The reason was trivial, but had not been accounted for: the
value 1024³ is close to the maximum value representable by a default (32-bit) FORTRAN90
integer. As such, the solver required slight modifications to some of the normalization
procedures to side-step this limitation; once all results were numerically validated and
confirmed for the mesh sizes possible on Galego and on a workstation, further testing was
done with a grid size of up to 2048³ as the standard for result gathering and presentation.
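
The overflow behind this limitation is easy to reproduce and to side-step. The sketch below shows the kind of change applied to the normalization factor; the variable names are illustrative and do not reflect the solver's actual routines.

program normalization_overflow
   ! Illustrates why a 2048^3 mesh breaks a default (32-bit) integer
   ! product, and the usual fix: form the product in a 64-bit kind.
   implicit none
   integer, parameter :: i8 = selected_int_kind(18)   ! 64-bit integers
   integer :: nx, ny, nz
   real(kind=8) :: norm
   nx = 2048; ny = 2048; nz = 2048
   ! nx*ny*nz = 2**33 exceeds the default integer limit of 2**31 - 1,
   ! so the product is promoted to 64 bits before use:
   norm = 1.0d0 / real(int(nx, i8) * int(ny, i8) * int(nz, i8), kind=8)
   print *, 'normalization factor =', norm
end program normalization_overflow

Note that a 1024³ mesh (2³⁰ points) still fits in a default integer, which is why the problem only surfaced at the larger mesh sizes.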
[Figure: log-log plot of time per cycle versus number of cores, for the 128, 256, 512 and 1024 meshes, with the ideal scaling curve]
Figure 29 – Temporal Results (X to Z) (Galego; without Allocation)

[Figure: log-log plot of adimensional time per cycle versus number of cores, for the same meshes, with the ideal scaling curve]
Figure 30 – Adimensional Temporal Results (X to Z) (Galego; without Allocation)
For MareNostrum, the results obtained were as follows:
cores \ mesh     128      256      512      1024     2048
4                1,60
16               0,25     2,59
32               0,15     1,55    10,80
64               0,09     0,80     5,90
128              0,06     0,43     3,12    14,00
256              0,05     0,23     1,80     6,86
512              0,06     0,18     1,09     3,91    29,90
1024             0,14     0,21     0,65     1,83    13,99
2048             0,52     0,52     0,80     1,46    20,54
4096             1,90     1,93     2,03     2,66    12,50

Table 17 – Time (seconds per Cycle) (X to Z) (MareNostrum; without Allocation)

cores \ mesh     128      256      512      1024     2048
4                1,600
16               0,250    0,250
32               0,145    0,150    0,150
64               0,090    0,077    0,082
128              0,060    0,042    0,043    0,043
256              0,050    0,022    0,025    0,021
512              0,060    0,017    0,015    0,012    0,008
1024             0,140    0,020    0,009    0,006    0,004
2048             0,520    0,050    0,011    0,005    0,006
4096             1,900    0,186    0,028    0,008    0,003

Table 18 – Adimensional Time per Cycle (X to Z) (MareNostrum; without Allocation)
These results, when plotted as in the next page (Figures 31 and 32), point towards
stable behaviour and, as the number of cores increases for the larger meshes, towards
coherent scalability.
Furthermore, compared with the ideal scalability curve, the results obtained present
near-ideal scalability. These results can also serve as a case study of the behaviour of
smaller mesh sizes and of calculation time versus communication time: for the 128³ mesh,
the gains become marginal past 64 cores, and past 256 cores the program actually becomes
slower due to the intensive communication effort.
Of particular note are the 2048³ results. Further testing was not possible, as a
maximum of 4096 cores was made available to the developer, but the results obtained point
towards continued scalability past 4096 cores.
[Figure: log-log plot of time per cycle versus number of cores (up to 4096), for the 128 to 2048 meshes, with the ideal scaling curve]
Figure 31 – Temporal Results (X to Z) (MareNostrum; without Allocation)

[Figure: log-log plot of adimensional time per cycle versus number of cores, for the same meshes, with the ideal scaling curve]
Figure 32 – Adimensional Temporal Results (X to Z) (MareNostrum; without Allocation)
4.1.4 Final Temporal Results
Following the successful transition from the serial solver to an x-to-x solver and
then to an x-to-z solver, with the temporal results numerically confirmed and the
optimization steps taken, further work was still required.
A new random number generator had to be created that provided number generation
compatible with previous versions, while at the same time offering the option of using
different random engines. Following the same pre-allocation logic, the random number
generator was designed to be separate from any calculation procedure, while providing the
same number distribution, as well as other options. Further work was done to implement a
set of statistical calculation routines, which also function during program execution, as
in the original solver, but only at particular time steps. The results for the
random-enabled version are plotted next; no numerical table is given, as the deviation from
the temporal results already attained was negligible.
Figure 33 – Iteration Time Final Results
The results presented in Figure 33 have been confirmed by the research team
several times and were delivered to the entity regulating and monitoring MareNostrum for
subsequent projects.
With this result attained, all goals initially set out by the research team and the
developer have been met, in close to the best way possible given the near-ideal program
behaviour.
4.2 Large-Scale DNS Testing
One of the stated goals of this work is to provide the capacity to generate large
simulations, producing finer results for research purposes. With the successful creation of
a functional MPI PSM calculation engine, the program can be used to generate new data at
finer resolutions. With the further creation of an MPI-to-FORTRAN data converter, and the
consequent possibility of accessing sections of the stored field, the old post-processing
tools may be used until new post-processing tools are created.
Although loading the file as performed in the converter may be conceptually
slower, such a data upload need only be done once per field under investigation, with the
added benefit of complete portability from workstation to workstation.
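
As an illustration of how an existing serial tool can access a section of a converted field, the sketch below reads a single z-plane from a raw double-precision binary file using stream I/O (a Fortran 2003 feature). The file name, layout and sizes are assumptions made for the example, not the converter's actual format.

program read_plane
   ! Sketch: extract one z-plane from a raw double-precision field
   ! stored in Fortran column-major order. Name, layout and sizes
   ! are illustrative assumptions about the converter's output.
   implicit none
   integer, parameter :: nx = 2048, ny = 2048
   integer :: k
   integer(kind=8) :: offset
   real(kind=8), allocatable :: plane(:,:)
   allocate(plane(nx, ny))
   k = 100                                                   ! plane to extract
   offset = int(nx,8) * int(ny,8) * int(k-1,8) * 8_8 + 1_8   ! byte position
   open(unit=10, file='u_field.raw', access='stream', &
        form='unformatted', status='old', action='read')
   read(10, pos=offset) plane                                ! load only this plane
   close(10)
   print *, 'plane mean =', sum(plane) / real(int(nx,8)*int(ny,8), kind=8)
end program read_plane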
The steps taken in creating the randomizing routine allow for different
initializations of the same statistical fields, so that different simulations can be
produced from a given data set. Even starting from the same previous results, the
possibility of swapping the random number generator on the fly, and therefore generating
different number sequences depending on the chosen option, enables slightly different
simulations with statistically identical results.
Furthermore, the program is completely independent of the number of cores and nodes
used, enabling the initialization or continuation of previous simulations under different
memory patterns; if the appropriate options are chosen for the random number generator,
this may produce yet another set of different results with which research can progress.
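
One way to obtain such decomposition independence, sketched below purely for illustration (the solver's actual engine and seeding scheme may differ), is to derive each point's random value from its global index rather than from the core's rank, so that any core count and pencil layout reproduces the same field:

module rand_field
   ! Illustrative only: a decomposition-independent random value per
   ! grid point, derived from the point's GLOBAL index, so the same
   ! field is reproduced for any number of cores.
   implicit none
   integer, parameter :: i8 = selected_int_kind(18)
contains
   pure function point_random(i, j, k, nx, ny) result(r)
      integer, intent(in) :: i, j, k, nx, ny   ! global indices and sizes
      real(kind=8) :: r
      integer(i8) :: s
      ! Global linear index, then a simple 64-bit integer mix
      ! (constants are arbitrary; the solver may use another engine).
      s = int(i, i8) + int(nx, i8) * (int(j, i8) - 1_i8 &
          + int(ny, i8) * (int(k, i8) - 1_i8))
      s = s * 6364136223846793005_i8 + 1442695040888963407_i8
      s = ieor(s, ishft(s, -33))
      r = real(iand(s, 2147483647_i8), kind=8) / 2147483648.0d0
   end function point_random
end module rand_field

Swapping engines then only changes the body of point_random, leaving the calculation procedures untouched, in line with the separation described in Section 4.1.4.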
With such possibilities, the generation of a large-scale DNS simulation was desired
from the outset of the program design, with the goal of achieving higher mesh sizes.
The goal was successfully attained, with the only difficulties arising from the fact
that the maximum value of a default FORTRAN integer imposed a cap related to the mesh's
global dimensions; a slight modification solved the problem. With this modification in
place, no overflow errors should occur until a single mesh dimension itself approaches
2147483647 points, at which point a default FORTRAN integer can no longer represent the
dimension.
Regardless, the stated goal was to successfully run a double-precision cubic mesh
of 2048 points per dimension on MareNostrum and to provide all data for posterior analysis
using existing post-processing tools. The previous highest result achieved was a cubic mesh
of 1024 points per dimension in double precision.
This goal was routinely met during testing procedures, with mesh sizes of up to
4096³ being tested.
Chapter 5
Chapter outline:
i) Chapter 5 concludes the thesis by presenting the final results and listing further
functionalities that were not slated for parallelization at the present time.
Conclusions and Further Work
Following the creation of the DNS MPI solver, and the results obtained, this section
deals with further programs that may be required in the future, hopefully taking advantage
of the code designed.
The translation of the OpenMP version to the MPI version was not completed, as
that was not the goal of the current thesis.
Further work is required to complete the translation of the existing code, namely the
jet portion, to MPI, but another developer has already started taking steps towards the
deployment of this modification.
Since the solver was designed to be as simple and as expandable as possible,
further development is expected in the near future to take advantage of the scalability
achieved at this stage.
While an OpenMP version of the solver was indeed designed, an OpenMP version of
the post-processing tools was not. An MPI version may eventually be required and
planned, in order to study fields directly on clusters, rather than on slower workstations
with less RAM available for post-processing.
Furthermore, other smaller routines that allow scalar quantity measurement were not
implemented, nor were other functionalities present in the code for other types of
simulation, such as particle tracking. However, the work done on the present routines may,
and should, be used as a basis for quickly devising functional variants of the routines
mentioned above.
5.1 Main results and project considerations
The project taken as the basis of this Thesis was successfully accomplished,
although the remainder of the code was not translated into MPI due to architecture-specific
parallelization issues, which usually demand a complete paradigm shift in the
memory-handling pattern from the serial implementation to a pencil implementation.
Scalability was a concern and was achieved successfully. Further slight
optimizations may eventually be made, and there is certainly room for them, namely in the
largest statistical routines, which lack pre-allocation logic; given the code's current
status, however, expansions are more urgently required.
The remainder of the code attempts to use as little communication as possible, but
memory expenditure is a concern: the statistical routine last applied prevented IST's
cluster, Galego, from reaching 1024³ mesh sizes in double precision.
A hardware expansion is required and is currently under study in order to facilitate
the work on the next stages of code expansion.
Section 4.1.4 includes the main results, which are repeated here in Figure 34:
Figure 34 – Iteration Time Final Results
As can be readily seen, the scaling curves follow the ideal scaling curve rather
closely as the number of participating cores increases, thereby allowing the program to be
used past the limits experimented with here, in larger and faster clusters, and providing
access to European and worldwide competition in simulation size.
5.2 Future work
Due to the amount of work already done in the serial code version and in the
OpenMP version, it was not possible to translate the entirety of the code into the MPI
version.
Not only was that not possible in the given time-frame, but the real focus of the
solver was to provide an optimized calculation engine capable of being further expanded by
other researchers in the near future. Future work should focus both on such expansions and
on verifying all results, by running the serial version and the MPI version without random
number generation on comparable mesh sizes, so that the numerical computation is fully
validated. Furthermore, some work may be done in earlier routines where pre-allocation does
not yet occur, such as in the stat-run physical routine, where some variables (of arguably
smaller size than the velocity fields, such as one-dimensional arrays) are allocated at run
time.
These allocations take place in each core and, when all are taken in conjunction,
for efficiency purposes it might be advisable to remove the allocation and de-allocation
steps and include them in the already present structure of allocation prior to run time.
Minor tweaks may eventually also be performed, such as the later removal of the
complex outputs from the library interface. A change that was heavily considered at all
stages was to use the complex output from the library directly, instead of translating it
from the output into the working variable; while a modest speed-up of around 10% might be
obtained, the complex-kind structure may or may not in fact be slower to access at larger
scales. A finer approach would be to contact the library developer, or to create a set of
FFT wrapper routines that return a real-kind output directly after the FFT operation; the
amount of work in this option would be staggering and would greatly reduce the portability
of the code, and as such it was not taken. Turning the entirety of the code to a complex
kind, while also functionally possible, would imply a great change to the entire structure
of the code, and as such, despite initially being planned, it was not pursued.
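
For reference, the translation step under discussion amounts, in essence, to copying the library's complex output into two real working arrays, roughly as sketched below (array and routine names are illustrative):

subroutine split_complex(spec, wre, wim)
   ! Illustrative: the complex-to-real translation discussed above.
   ! The library's complex spectral output is copied into separate
   ! real and imaginary working arrays used by the rest of the solver.
   implicit none
   complex(kind=8), intent(in)  :: spec(:,:,:)
   real(kind=8),    intent(out) :: wre(:,:,:), wim(:,:,:)
   wre = real(spec, kind=8)   ! real part      (one full copy)
   wim = aimag(spec)          ! imaginary part (one full copy)
   ! Operating on spec directly would remove both copies, at the cost
   ! of moving the whole solver to complex-kind arithmetic.
end subroutine split_complex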
As such, future work should mostly concern itself with expanding the current code and
validating those expansions.