8/3/2019 PAbreu Thesis
http://slidepdf.com/reader/full/pabreu-thesis 1/140
UNIVERSIDADE TÉCNICA DE LISBOA
INSTITUTO SUPERIOR TÉCNICO
Multiscale High-Performance Computing in Plasma Physics
fine and coarse parallelism in simulation and visualization
Paulo Tribolet de Abreu
Licenciado
Dissertation for obtaining the degree of Doutor in Physics Engineering
Supervisor: Doutor Luís Miguel de Oliveira e Silva
Co-supervisor: Doutor João António Madeiras Pereira
Jury:
Chair: President of the Scientific Council of IST
Members: Doutor Fernando António de Oliveira Carvalho Rodrigues
Doutor Carlos António Abreu Fonseca Varandas
Doutor Luís Miguel de Oliveira e Silva
Doutor João António Madeiras Pereira
Doutor Ricardo Parreira de Azambuja Fonseca
Doutor Viktor Konstantyn Decyk
December 2010
Title: Multiscale High-Performance Computing in Plasma Physics:
levels of parallelism in simulation and visualization
Name: Paulo Tribolet Abreu
Doctorate in Physics Engineering
Supervisor: Luís de Oliveira e Silva
Co-supervisor: João Madeiras Pereira
Abstract:
In the first half of this decade, the scientific community had to find new
ways to satisfy its growing need for computational power. Two architectures
stood out as especially promising: Grid computing and graphics processors. In
this thesis we show new results obtained on these recent architectures in the
field of numerical plasma simulation.
On the Grid side, we developed and tested a multithreaded library that lets
programs transfer large data sets over the Grid in a way that is simple, robust,
transparent to the user, and that minimizes the impact on performance. The
performance and the impact were tested in several usage scenarios and in a
production code. On the graphics-processor side, we implemented a fully rel-
ativistic plasma simulation code. We present some results from 2D simulations
and a comparison with a similar state-of-the-art code.
We also explored different levels of parallelism in scientific visualization.
We built a distributed-memory visualization cluster and used it to visualize
large data sets. We implemented direct visualization in our simulation code,
allowing the user to interact with and explore the results immediately, while
they are still being produced.
Keywords: Computing; Plasma Physics; Parallel Algorithms; Visualization;
Grid Computing; Simulation.
Title: Multiscale High-Performance Computing in Plasma Physics:
fine and coarse parallelism in simulation and visualization
Abstract:
When commodity processors hit the power wall in the first half of this
decade, the scientific community needed to find new means to satisfy its in-
creasing computational needs. Two architectures stood out as very promising
for the near future: Grid computing and graphics processors. They are the
coarse and the fine extremes of the spectrum of parallel computing. This thesis
shows several new results obtained with these new computing architectures in
numerical simulations in plasma physics.
At the coarse end of parallel computing, we have developed and deployed
a multithreaded library that allows code to manage transfers of large data sets
from the Grid, in a way that is simple, robust, transparent to the user, and
that minimizes the performance impact. We present an analysis of the perfor-
mance impact in several application scenarios and in a real production scenario.
At the fine-grained end, we have implemented a fully relativistic plasma code
on a graphics processor. We present some results from 2D simulations and a
comparison against a similar state-of-the-art code.
We have also explored coarse- and fine-grained parallelism in scientific visu-
alization. We have set up a distributed-memory scientific visualization cluster
and used it to visualize large data sets. We have also implemented a direct
visualization feature in our plasma simulation code, so that the user can inter-
act with the simulation on screen and watch and explore the results as they
are being produced.
Keywords: Computing; Plasma Physics; Parallel Algorithms; Visualization;
Grid Computing; Simulation.
ACKNOWLEDGMENTS
This work would never have been done without the support of a large number of
individuals and institutions. Unfortunately I cannot thank them all personally, but I
would like to mention some in particular.
Prof. Luís Silva was my supervisor during the five years of this work, and also
during my earlier years as an inexperienced researcher. During all this time he was
able to maintain the ideal environment for research at the Grupo de Lasers e Plasmas
(GoLP/IPFN) and to guide me through all the obstacles towards a successful outcome.
It was a privilege to work under the guidance of such an extraordinary scientist and
team leader.
Prof. João Madeiras Pereira was my co-supervisor and introduced me to the fan-
tastic world of Computer Graphics. That knowledge brought me to new areas I
would not have been able to explore on my own. For that I am very thankful.
Prof. Ricardo Fonseca was an invaluable help through most of the work in this
thesis. His deep knowledge of many different scientific, computational, and numerical
areas enlightened me in various stages, and his perseverance and pragmatic optimism
pulled me forward at the most difficult moments. I consider myself very lucky to have
had the opportunity to receive his guidance and support.
The research group at GoLP is the perfect environment for pushing the limits of
knowledge in Plasma Physics. Its members are excellent and highly motivated scien-
tists and support personnel who master the very fine balance between hard work and
play. They all contributed in various ways to the positive outcome of this work, but
I would like to thank in particular: Michael Marti, for companionship in the server
room and all kinds of important lessons in server maintenance; Luís Gargaté, for his
incentive in finishing his PhD earlier than me; and Jorge Vieira and Samuel Martins
for a careful revision of the first version of this thesis.
My week at UCLA was short but very fruitful. I am thankful to Prof. Warren Mori
for his support and to Dr. Viktor Decyk for the helpful discussions. Samuel Martins,
Cláudia Franco, and Frederico Fiúza were the best of hosts and created the perfect
homely setting for a fruitful research week.
My parents started this work when they gave me my first chemistry set as a present
for my 7th birthday. They never stopped supporting me in various ways through all
these years. Their example of love and courage is the foundation of my present life.
My brothers and sister never doubted my success, and all of them contributed to my
desire to study science and technology, and to become an engineer and a researcher.
I would never have embarked on this 5-year project without the support of my
lover and companion Britta, and our daughter Kalú. They offered their love and un-
conditional support from the first moment and at all times, even when I was least
expecting it. With all my love, this work is dedicated to you.

This work was partially supported by FCT, Portugal, grant SFRH/BD/17870/2004
and project GRID/GRI/81800/2006, and by the NVIDIA Professor Partnership Pro-
gram.
CONTENTS
1 Introduction 1
1.1 Grid Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Multi-core Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Scientific Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Original Contributions in this Thesis . . . . . . . . . . . . . . . . . . . . . 9
2 Coarse grain parallelism 13
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.1 The EGEE Grid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.2 Motivation for Plasma Simulation Codes . . . . . . . . . . . . . . 16
2.2 An EGEE Grid Site . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 DataLight . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.1 Available Tools for Data Transfer . . . . . . . . . . . . . . . . . . . 20
2.3.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.3 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.4 Results and analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.5 Application deployment . . . . . . . . . . . . . . . . . . . . . . . . 33
2.4 Portal development and other future work . . . . . . . . . . . . . . . 34
3 Fine grain parallelism 37
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 Basic PIC Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3 CUDA Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4 CUDA Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4.1 Particle push . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.2 Current deposit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4.3 EM field update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.4.4 Boundaries and guard cells . . . . . . . . . . . . . . . . . . . . . . 54
3.5 Application to Weibel Instability . . . . . . . . . . . . . . . . . . . . . . . 54
3.6 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.7 Overview of other work on PIC and CUDA . . . . . . . . . . . . . . . . . 62
3.8 Overview and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4 Multi-scale visualization 65
4.1 Distributed rendering with VisIt . . . . . . . . . . . . . . . . . . . . . . . 65
4.2 Direct visualization with CUDA . . . . . . . . . . . . . . . . . . . . . . . 67
4.2.1 Particle (point) visualization . . . . . . . . . . . . . . . . . . . . . 67
4.2.2 Grid Visualization and Diagnostics . . . . . . . . . . . . . . . . . . 69
4.2.3 Interactivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2.4 3D visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2.5 Previous work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5 Overview and Future Work 75
A Code listings 77
A.1 Code from Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
A.2 Code from Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
A.3 Code from Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
A.3.1 VisIt plugin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
A.3.2 Direct visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
B A GPGPU Implementation of the Boris Pusher Algorithm 99
B.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
B.2 The Boris pusher in a simplified PIC code . . . . . . . . . . . . . . . . . . 100
B.3 GPGPU implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
B.4 Parallelization Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
B.5 Cluster implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
B.6 Performance and limitations . . . . . . . . . . . . . . . . . . . . . . . . . . 103
C Color figures 105
LIST OF FIGURES
1.1 First generation Grid model. . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Second generation Grid model using Globus Toolkit 2 middleware. . . . 4
1.3 Third generation Grid model. . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 The EGEE site CFP-IST. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Using GFAL instead of POSIX calls with HDF. . . . . . . . . . . . . . . . 21
2.3 Serial Data First (DF) scenario variant: the application alternates be-
tween a simulation step and a transfer step. . . . . . . . . . . . . . . . . . 24
2.4 Serial data last (DL) scenario: the application completes the simulation
and only after that it initiates the transfer of the result data. . . . . . . . . 25
2.5 Weak Data Threaded (W-DT) scenario: data transfers occur on a dif-
ferent thread than the simulation and take usually less time than one
simulation step. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6 Intensive Data Threaded (I-DT) scenario: data transfers occur on a dif-
ferent thread than the simulation and take usually longer time than one
simulation step. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.7 A simplified fluxogram of the algorithm implemented to allow for si-
multaneous Grid simulation and network transfer of results. . . . . . . . 28
2.8 Comparing the I-DT scenario (solid line) with the I-DF (dotted line) and
I-DL (dashed line) using the F and T time intervals. . . . . . . . . . . . . 31
2.9 Comparing the W-DT scenario (solid line) with the I-DF (dotted line)
and I-DL (dashed line) using the F and T time intervals. . . . . . . . . . . 31
2.10 Comparing the I-DT scenario (solid line) with the I-DF (dotted line) and
I-DL (dashed line) using the S and C time intervals. . . . . . . . . . . . . 32
2.11 Comparing the W-DT scenario (solid line) with the I-DF (dotted line)
and I-DL (dashed line) using the S and C time intervals. . . . . . . . . . . 32
2.12 Job state machine in gLite (adapted from [1]). . . . . . . . . . . . . . . . . 35
3.1 Overview of a full cycle of an electromagnetic PIC code. Subscript j
refers to cell quantities on the grid; subscript i refers to particle quantities. 40
3.2 Staggered 2D grids in PIC codes. . . . . . . . . . . . . . . . . . . . . . . . 40
3.3 CUDA memory architecture, both from the hardware and software level.
(For a color version, see Figure C.1 on page 105.) . . . . . . . . . . . . . . 43
3.4 Illustration of particle splitting in the Villasenor-Bunemann current de-
position method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.5 Illustration of current deposition in the Esirkepov method. . . . . . . . . 50
3.6 Pseudo-code for the generalized atomic function. . . . . . . . . . . . . . 52
3.7 Illustration of a stride during particle push and current deposition, so
that kernels in the same warp handle particles in different cells. . . . . . 53
3.8 Illustration of the sequence call of the three different kernels required
for a full EM field update. . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.9 Charge density of electrons during the formation of a Weibel instability. 56
3.10 Final state of a full 2D PIC simulation of a Weibel instability (positrons
in cyan and electrons in yellow). (For a color version, see Figure C.3 on
page 106.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.11 B₁² + B₂² intensity during the formation of a 2D Weibel instability. . . . . 57
3.12 Percentage of time spent at each step of the pusher algorithm. Based on
the values from Table 3.3. (For a color version, see Figure C.4 on page 107.) 59
3.13 Evaluation of performance with stride size and particles per thread for
the 1.2 M run. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.1 Part of the hardware for distributed visualization testing. These four
machines represent 8 cores and 4.4 GB of RAM. Another similar node
(not in the picture) acted both as the scheduler and as a computing
node, and added the remaining 2 cores and 1.1 GB of RAM. . . . . . . . 66
4.2 Direct visualization of a 2D PIC simulation of the formation of a Weibel
instability. (For a color version, see Figure C.5 on page 107.) . . . . . . . 68
4.3 Examples of different diagnostics during the formation of a 2D Weibel
instability: (a) electron charge density, (b) positron charge density, (c) in-
tensity of the perpendicular component of the current (j₃), and (d) trans-
verse magnetic field energy B₁² + B₂². . . . . . . . . . . . . . . . . . . . . 70
4.4 Overview of the integration between the simulation code (middle layer,
in green), and the interaction and visualization code (top and bottom
layers, in red and blue respectively). (For a color version, see Figure C.6
on page 108.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.5 Two examples of direct interaction during a simulation: (a) depicting
the grid used in the PIC simulation over the particles, (b) zooming in
the center of the image while the simulation is running. (For a color
version, see Figure C.7 on page 108.) . . . . . . . . . . . . . . . . . . . . . 71
4.6 Direct visualization of a 3D particle pusher on a static EM field. (For a
color version, see Figure C.8 on page 109.) . . . . . . . . . . . . . . . . . . 73
A.1 A simplified fluxogram of DataLight (left and bottom) and its connec-
tion with a simulation code (top). . . . . . . . . . . . . . . . . . . . . . . . 78
A.2 The staggered 2D grid used for EM-field interpolation. . . . . . . . . . . 87
B.1 The use of textures in the GPGPU implementation of the Boris pusher. . 101
B.2 The use of forced-fields allows for easy domain decomposition. . . . . . 103
C.1 CUDA memory architecture, both from the hardware and software level.
(Reproduction of Figure 3.3 from page 43.) . . . . . . . . . . . . . . . . . 105
C.2 Illustration of a stride during particle push and current deposition, so
that kernels in the same warp handle particles in different cells. (Repro-
duction of Figure 3.7 from page 53.) . . . . . . . . . . . . . . . . . . . . . 106
C.3 Final state of a full 2D PIC simulation of a Weibel instability (positrons
in cyan and electrons in yellow). (Reproduction of Figure 3.10 from
page 56.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
C.4 Percentage of time spent at each step of the pusher algorithm. Based on
the values from Table 3.3. (Reproduction of Figure 3.12 from page 59.) . . 107
C.5 Direct visualization of a 2D PIC simulation of the formation of a Weibel
instability. (Reproduction of Figure 4.2 from page 68.) . . . . . . . . . . . 107
C.6 Overview of the integration between the simulation code (middle layer,
in green), and the interaction and visualization code (top and bottom
layers, in red and blue respectively). (Reproduction of Figure 4.4 from
page 71.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
C.7 Two examples of direct interaction during a simulation: (a) depicting
the grid used in the PIC simulation over the particles, (b) zooming in
the center of the image while the simulation is running. (Reproduction
of Figure 4.5 from page 71.) . . . . . . . . . . . . . . . . . . . . . . . . . . 108
C.8 Direct visualization of a 3D particle pusher on a static EM field. (Repro-
duction of Figure 4.6 from page 73.) . . . . . . . . . . . . . . . . . . . . . 109
C.9 The use of forced-fields allows for easy domain decomposition. (Repro-
duction of Figure B.2 from page 103.) . . . . . . . . . . . . . . . . . . . . . 109
LISTINGS
A.1 The main exported function for DataLight. It is called for each file that
should be transferred. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
A.2 The queue manager thread. . . . . . . . . . . . . . . . . . . . . . . . . . . 80
A.3 The remote file transfer thread. . . . . . . . . . . . . . . . . . . . . . . . . 81
A.4 The function that finishes the transfer threads. . . . . . . . . . . . . . . . 82
A.5 Applying the generalized atomic function from page 52 to the addition
of floats. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
A.6 Entering and leaving the current deposition function. . . . . . . . . . . . 83
A.7 Implementation of the Esirkepov method. . . . . . . . . . . . . . . . . . . 84
A.8 Magnetic field update using shared memory. . . . . . . . . . . . . . . . . 86
A.9 EM-field interpolation at a particle’s position using textures. . . . . . . . 87
A.10 EM-field interpolation at a particle’s position without textures. . . . . . . 88
A.11 The kernel for a particle push. . . . . . . . . . . . . . . . . . . . . . . . . . 89
A.12 Initialization of a CUDA device. . . . . . . . . . . . . . . . . . . . . . . . . 90
A.13 The function to check if the file contains vectorial magnetic field data
from an Osiris simulation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A.14 Domain decomposition for a serial visualization system. . . . . . . . . . 93
A.15 Domain decomposition for a parallel visualization system. We do not
decompose in more domains than the number of available processors. . 94
A.16 GLUT display function for direct visualization. . . . . . . . . . . . . . . . 95
A.17 Drawing the simulation particles. . . . . . . . . . . . . . . . . . . . . . . . 96
A.18 Drawing the diagnostics as textures. . . . . . . . . . . . . . . . . . . . . . 97
To Britta and Kalú
with whom I’ve learned
the great mysteries that can’t be taught.
CHAPTER 1
INTRODUCTION
Why Nature’s mysteries can be grasped by us humans in mathematical form is in it-
self a mystery. Several explanations have been formulated, but none can really fully
satisfy our inquiring minds. Some suggest that it is not Nature, but our mind that
is mathematical. Nonetheless, ever since the appearance of electronic computers in
the middle of the twentieth century, science has been using them to advance knowl-
edge about the mathematical aspects of Nature: as a confirmation of theories, as a
prediction of experimental results, and as a base for new directions of discovery.
The ENIAC [2], built at the University of Pennsylvania between 1943 and 1946,
is usually considered to be the first Turing-complete machine, hence the first modern
computer.¹ Although through most of its 10 years of life it was used in military ap-
plications, these usually involved solving physics problems, like ballistic trajectories
and computations related to hydrogen bombs. Scientific use of computers has not
stopped since then, and has been increasing as fast as computing power has allowed.
In the 50s and 60s, with the transistor and the integrated circuit, science started
relying more and more on digital numerical simulations. During the 70s and 80s, fol-
lowing the successful ARPANET experiment of the 60s, research centers became dig-
itally connected, first in the USA and later throughout the world. At the time, a single
powerful computer system for research would cost from several hundred thousand
up to several million dollars, like the Cray-1, which sold for $8.8 million, or the Good-
year Massively Parallel Processor, estimated at $750 thousand. This trend of single²
and very powerful systems continued through the 80s and the first half of the 90s.
¹It has been proven [3] that the Z3, built in 1941, was also Turing-complete, although not in a practical way.
²Even when implementing an internal parallel architecture, these were single systems, not a connection of several independent systems.
These machines often had unique architectures and forced the user to develop spe-
cific applications for them. Most used a mix of shared and distributed memory, often
coupled with a hierarchy of access patterns (fast/slow), and different connections
to storage and processing power. A successful application on one system was not
likely to run well on another without serious modifications.
In the 90s, the PC was finally adopted by the numerical simulation scientific com-
munity. With the appearance of the Beowulf type of cluster [4], scientists had access
to cheap and powerful systems made from commodity, off-the-shelf (COTS) compo-
nents. Several PCs, connected with inexpensive ethernet cards and running the free
Linux OS, could run successful numerical experiments for a fraction of the price of
a dedicated machine. Moreover, they were easy to upgrade gradually, as research
money became available and PCs became more powerful, unlike the specialized ma-
chines that usually required a full replacement.
The Message Passing Interface (MPI) specification [5] was essential to the wide
acceptance of the PC cluster. It defined a standard way for implementing the com-
munication between the different instances of the deployed application. If an appli-
cation or an algorithm implementation ran successfully on a certain PC cluster using
the MPI specification, it stood a good chance of working unmodified on any other PC
cluster throughout the world. For the best part of ten years, until the mid 2000s, most
computer clusters were based on the distributed-memory model, using COTS machines,
connected by ethernet and using an implementation of MPI.
1.1 GRID COMPUTING
In the late 80s, the USA Office of Science and Technology Policy issued a report where
several scientific grand challenges were identified [6]. These grand challenges were de-
fined as “fundamental problems in science or engineering whose solution would be
enabled by the application of high-performance computing resources.” In the begin-
ning of the 90s, it became obvious that these grand challenges could not be solved by
one single supercomputer, but only with the joint efforts of the different supercom-
puters installed throughout research and academic centers. And thus, in 1999,
with the proliferation of PC clusters and the ubiquitous access to the Internet, Grid
computing was born.
Grid computing was defined by Ian Foster et al. in 2001 [7] as “a system that co-
ordinates resources not subject to centralized control, that delivers a nontrivial
quality of service and that uses standard, open, general-purpose protocols and in-
terfaces.” This definition was born from a successful grand challenge project called
NUG30 [8], created to solve a quadratic assignment problem of size 30 [9]. This prob-
lem was posed in 1968, but it was only solved in 2000. It took 7 days of computation,
using a grid of 9 clusters, with a total of 1000 processors. It required more
than 96 000 CPU hours, which is equivalent to about 11 years on one single CPU. So 11
years of computation were reduced to one week by gathering several computational
clusters into one common task. A general overview of a first generation Grid, like the
one we have just described, is given in Figure 1.1.
FIGURE 1.1 : First generation Grid model.
The idea is that computing power (the “Resource Provider”) is offered to a match-
maker system. This system gathers all the resources available and waits for resource
requests from clients (the resource requesters). Once a match is made, the requester
can use the provided resources. This model had limited success, because of a few
shortcomings:
— the protocols (represented by the arrow lines in Figure 1.1) were not standards;
— the connection between the requester and the provider was fixed for its entire
lifetime;
— the requester could not choose resources directly.
These shortcomings meant that connections tended to be unreliable. If one of the
players (the requester, the provider, or the match maker) suffered an offline period,
the whole cycle (publishing, requesting, using the resources) had to be redone, usually
requiring human intervention.
Foster’s paper [7] pointed to the solution to these problems: physical machines had
to be abstracted as resources, and connections should be made using reliable open
standards. A set of software packages were already being developed to follow those
requirements, in particular the Globus Toolkit [10], which became one of the most
widely used Grid middleware.
Several Grids were built using this new approach and they became part of what is
now called the 2nd generation, where emphasis is put on resources and open proto-
cols. A general overview of a 2nd generation Grid is given in Figure 1.2.
FIGURE 1.2 : Second generation Grid model using Globus Toolkit 2 middleware. The Mon-
itoring and Discovery Services (MDS) keep track of the available services and resources. A
request for a service or resource is made using the specific API, and the use of that service or
resource is done through the Grid Resource Allocation Management (GRAM) protocol.
A 2nd generation Grid implements several general components that offer the cor-
rect services and resources:
Security: This component is responsible for the authentication of the user to the Grid.
It should allow for proxy authentication, so that a user can log in at only one
place and use any resource they are entitled to, anywhere.
Information services: This component collects the services and resources available and
announces them to any requester.
Data management: This component is responsible for the data allocation and transfer
services on the Grid.
Execution management: This component allows the user to control applications run-
ning on the Grid: launching, querying, pausing, relaunching, etc.
Common runtime: This component is the general environment required for a remote
application to be launched and run (e.g., operating system and application li-
braries).
The most notable example of a 2nd generation Grid might be the Large Hadron
Collider Computing Grid (LCG), originally built specifically to supply computing
power and data storage for the Large Hadron Collider experiments at CERN. Based
partially on the Globus Toolkit middleware and partially on other tools, it is consid-
ered to be one of the most successful 2nd generation Grid implementations. In fact,
it grew from being a Grid dedicated to the LHC into a broad eScience Grid called
EGEE.
In 2002 Ian Foster et al. laid out the plans for the third generation Grid [11], by
introducing the Open Grid Services Architecture (OGSA). It allows for a higher level
of abstraction by basing the Grid on services instead of resources. A general overview
of a 3rd generation Grid is given in Figure 1.3.
FIGURE 1.3 : Third generation Grid model. Services are published to the registry and await
invocation from a requester. All communications use open protocols: services are described
using the Web Service Description Language (WSDL), entities are identified through Universal
Description, Discovery and Integration (UDDI), and messages are sent using the Simple Object
Access Protocol (SOAP).
The main trend that evolved during these three generations of Grid models was an
increase in the abstraction of how resources are made available. This can be seen by
comparing Figures 1.1 to 1.3. In the first generation grids, resources (like computing
power or data storage) were tightly coupled to the protocols to use them and to the
physical equipment that supported them. Different equipment usually meant different
protocols and resources: a different cluster might mean a different way to transfer
and launch a program; a different storage system (e.g., hard disk or tape based) usually
meant different ways to transfer and retrieve the data. In the second generation grids,
each type of resource was associated with one protocol, and the physical equipment
was abstracted away. It could be changed and, as long as the new equipment still
provided the same kind of resource, this change would be transparent to the client,
since the same protocol could be used. Finally, in the third generation grids both the
protocol and the resource are abstracted away as a service. A client only has to know
how to communicate with a server to request a service. Each service has an interface
description that is passed to the client at the beginning of the transaction. Thus there
is a loosely-coupled contract between the client and the server.
A major improvement achieved by the 3rd generation Grid was the unification
of Web services and Grid services, creating the Web Services Resource Framework
(WSRF). The idea was that, although Web services supply a good specification for
dealing with services on the Web, they are nominally stateless, that is, no data is
retained between invocations of the service. This is not adequate for Grid services.
WSRF provided a set of operations and components that extended Web services to
allow for state preservation. This allowed for services to simply extend their existing
Web capabilities to become Grid services. The Globus Toolkit version 4 (GT4) is one of the most widely used implementations of WSRF.
At the moment, the EGEE Grid is one of the world’s largest 3rd generation Grids. It
uses the gLite middleware [1] which is actually a mix of GT4 tools with some specific
services. Most of the computing solutions for today’s grand challenges involve taking
advantage of Grid infrastructures. To that goal, one of the major contributions of this
thesis is the development of a software library that allows programs that produce as
output large amounts of data to be quickly and easily deployed on the EGEE Grid (see
Chapter 2).
1.2 MULTI-CORE MACHINES
During the time when computer grids were being developed and behind all this evo-
lution in scientific computing, Moore’s law [12] was acting in a most spectacular way.
In its current form, it states that the density of components placed on a chip doubles
roughly every two years. For a single chip, higher density can allow higher clock
speed, and so until 2005 the effect of Moore’s law was to double processor clock speed
every 24 months.3 However, around that time chip manufacturers were faced with a new
problem: the power wall.

Power consumption on a chip grows roughly with the square of the clock speed.
With the increase in speed and transistor density, the power consumed per unit area
was becoming an issue for consumer-level computers. It was clear that Moore’s
law could no longer mean a steady increase in processor speed. The solution was
to keep the same speed but increase the number of processors per chip. This reduces
the power consumption by a factor of √2 ≈ 1.41 for the same theoretical number of
instructions per second, and has been the approach of CPU makers in the last five years. At the
moment, it is common to find COTS systems with 16 cores.
Most implementations of MPI can treat different cores in one machine as different
nodes in a distributed memory system. This means that MPI code could run unmodified
in these clusters made of multi-core machines (known as fat nodes), although not
taking advantage of all the possibilities of a shared memory system. And thus we are
presented with the problem that the scientific high-performance computing commu-
nity has been facing in the last six years: how to take advantage of the multiple cores
3This effect was so reliable that some scientists report anecdotally that if a certain algorithm did not
run fast enough in a recent processor, it was cheaper, faster, and easier to wait for the processor’s speed
to increase than to rethink the algorithm or to add more computers to the calculation.
8/3/2019 PAbreu Thesis
http://slidepdf.com/reader/full/pabreu-thesis 27/140
1.2. Multi-core Machines 7
that are now available in most shared memory clusters, while still leveraging the effort
spent in the last 15 years on distributed memory architectures, especially considering
that clusters are still a very important way to advance numerical simulations?
Several solutions have been emerging, some acting on the level of programming
languages, compilers, and software in general, others as a mix of new hardware archi-
tectures and programming models. At the software level, some of the most relevant
solutions are:
OpenMP: OpenMP [13] extends C, C++, and Fortran to allow for shared memory multiprocessing.
It requires explicit parallelism, hence leaving in the hands of the
programmer the decision of which parts to parallelize with MPI and which with OpenMP.
TBB: Threading Building Blocks [14] abstracts away the idea of a thread. The pro-
gram is seen as a set of concurrent tasks running on a system with a certain num-
ber of threads available. A task scheduler launches the tasks using the available
threads and is responsible for the load balancing of the whole system. The par-
allelism has to be explicitly programmed, but data locality can be either explicit
or implicit. A disadvantage is that it is available only for C++.
PGAS: Partitioned Global Address Space [15] tries to abstract the typical memory hi-
erarchy of modern systems. On a single core, one has the RAM, usually large
and slow, and several levels of cache, much smaller in size but also much faster.
These concepts can be extended to shared memory systems (multi-core machines)
or to distributed memory systems (clusters of machines). For example, on
a cluster the slow memory is the one available over the network while the fast
memory is the one locally available to the processor. When programming with
PGAS extensions, the programmer deals with local, shared or global memory,
and the underlying physical access (locally over the processor bus or remotely
over network) is abstracted. However, parallelism has to be explicitly declared.
These software approaches give the scientific developer tools to adapt his code more
efficiently to the existing trend of clusters of fat nodes, that is, of systems with
a mix of distributed and shared hierarchical memory.
At the same time that these software solutions were being developed, there was
also a movement towards a more efficient multi-core design in hardware, without
the legacy of a single-core architecture. The most notable processor chip that came
out of this movement is the Cell Broadband Engine Architecture (Cell BE for short)
[16]. On the computational level, it is made of a main processor called the Power
Processing Element, or PPE, which is an IBM Power Architecture core, and
of eight co-processors called the Synergistic Processing Elements, or SPEs [17]. They
are connected by a high-bandwidth circular data bus called the Element Interconnect
Bus, or EIB. Cache-coherent direct memory access (DMA) to main memory and other
external storage completes the architecture. Although several projects in scientific
computing used the Cell BE [18], and the Roadrunner supercomputer was partially
based on it, it failed to gain widespread acceptance. In November 2009 IBM announced
that it had discontinued the development of the Cell.
Unexpectedly, a promising new multi-core hardware architecture became avail-
able to the HPC scientific community from another place: the graphics and gaming
industry.
Programmable Graphics Processing Units (GPUs) were introduced at the beginning
of this decade. The first generation offered limited programmability (no conditional
expressions, assembler only), but within a few years the next versions evolved into a
fully programmable system, although still with a tight connection to the graphics
engine [19, 20]. Finally, in 2007, NVIDIA launched the “Compute Unified Device
Architecture” (CUDA), a software stack (compiler and APIs) that allows for a general
programming approach on specific CUDA-enabled GPUs. Due to the ubiquitous presence
of graphics chips and their low cost, CUDA was accessible to almost anyone who
wanted to try to take advantage of its unique massively parallel architecture [21]. One
major contribution of this thesis is the implementation of a numerical plasma simula-
tion code on a GPU using CUDA (see Chapter 3).
1.3 SCIENTIFIC VISUALIZATION
R. W. Hamming’s groundbreaking classic on numerical methods [22] is based on a
very powerful motto:
“The purpose of computing is insight, not numbers.”
Throughout the history of computing, scientists have gained a very powerful capacity
to generate huge amounts of data. Let us not forget that, although processor
speed has not increased significantly in the last few years, Moore’s law is still valid:
our capacity to generate data is still doubling every 24 months [23, 24], using some of
the techniques explained previously. If we expand Hamming’s motto with the follow-
ing corollary:
Data is useless without insight.
the question comes naturally: how do we achieve insight into such amounts of data
that scientists are able to produce?
One of the answers includes visualization. Since it involves the primary sensory
apparatus (vision), as well as the processing power of the human mind, it is a simple
and effective medium for gaining insight into complex and/or voluminous information
[25]. Hence a work on new effective ways to produce scientific data as a result
of numerical simulations would not be complete without some contribution in this
area. Chapter 4 explains our novel contribution to the field of scientific visualization,
mainly as an immediate tool to gain insight into running numerical simulations.
1.4 ORIGINAL CONTRIBUTIONS IN THIS THESIS
In Chapter 2 we present a novel approach to adapt HPC applications and their users to
the Grid, in particular to the EGEE Grid.
First we give a short overview of the EGEE Grid in general and of a minimal Grid
site in particular, focusing on the services that have to be available. Next, we tackle
the problem of distributed transfer and storage of large data. We present DataLight, a
software library that allows programs that produce as output large amounts of data to
be quickly and easily deployed on the EGEE Grid. This is an important tool for a broad
class of numerical simulations, mainly because the EGEE Grid is one of the largest in
the world and, although highly optimized for dealing with both data-intensive and
computationally intensive applications, it has been mainly aimed at applications that
require large amounts of data as input but produce a relatively small amount of data
as output. Hence the need for a tool that could handle the inverse paradigm in a
simple, robust, and transparent way to the user.
Since DataLight is based on existing Grid data transfer tools, before developing
the library we make a careful examination of the tools available in the EGEE middleware
gLite. Then we develop a methodology for evaluating the performance of the library
and present a general overview of the algorithm that was implemented. Finally, we
show and analyze the results of using DataLight, both in a test environment and with
production code.
This chapter finishes with a presentation of another tool to adapt HPC applications
to the Grid: a web portal that, thanks to its tight integration with DataLight, can warn
the user as each result file becomes available for post-processing, thus avoiding the
unnecessary delay of having to wait for the whole simulation to finish.
In Chapter 3 we present the implementation of a fully relativistic plasma simula-
tion code on CUDA enabled devices. We start with an overview of particle-in-cell al-
gorithms, followed by a description of a CUDA system (hardware and software). Then
we explain our particular implementation, with details about each part of the code.
The purpose was not only to measure what performance gains could be achieved (if
any), but also to estimate the amount of effort involved in porting an existing code base to this new environment.
Next we present some results from 2D simulations (Weibel instability) and a com-
parison against a similar state-of-the-art code (Osiris). We also give an overview of
other solutions to this problem (plasma codes in CUDA devices), and finish the chap-
ter with some conclusions.
In Chapter 4 we present two solutions for the visualization and post-processing
of data produced by scientific numerical simulations in general, and by plasma sim-
ulation codes in particular. One deals with large datasets and implements a database
plugin for a parallel visualization program (VisIt [26]). This allows for visualization and exploration of datasets that do not fit in the memory of a single workstation.
Another original contribution in visualization presented in this chapter is an im-
mediate and interactive exploration of the data produced during a simulation, which
we call direct visualization. This method can integrate directly with a CUDA imple-
mentation and, with minimal influence on the performance of the simulation code,
allows for a direct interaction between the scientist and the simulation while the latter
is running.
The Appendix has two main sections. Appendix A has the code listings of the
most relevant parts of the programs developed during this thesis. This includes the DataLight library, the CUDA implementation of a plasma code, the database plugin
for distributed visualization, and parts of the direct visualization implementation.
Appendix B presents our first work in implementing simulation codes on graphics
processing units. This eventually proved to be not worth pursuing with the technology
available at the time (GLSL [27] in 2006), but led to the first development of a
direct visualization tool and to the CUDA approach 18 months later. Hence we think
it is relevant enough to be presented as part of this work, although in a secondary
position.
We end this thesis with a color reproduction of some of the figures and the bibliography.
The work presented in this thesis led to the following publications and communi-
cations:
PAPERS IN REFEREED JOURNALS
— P. Abreu, R. A. Fonseca, J. M. Pereira, L. O. Silva, PIC codes in new processors: a
full relativistic PIC code in CUDA enabled hardware with direct visualization, in
IEEE Transactions on Plasma Science, to be published in January 2011.
PAPERS IN CONFERENCE PROCEEDINGS
— P. Abreu, Real-Time Scientific Visualization and Exploration of Very Large Datasets,
in Proceedings of the Cumulus Doctoral Supervision Workshop, IADE, Lisbon, Portu-
gal, 2005.
— P. Abreu, J.M. Pereira, L. O. Silva, Exposing Stream Processors as Grid Services: a
GPGPU Example, in Ibergrid — 1st Iberian Grid Infrastructure Conference Proceedings,
Santiago de Compostela, Spain, 2007.
— P. Abreu, R. Fonseca, L. O. Silva, Migrating large output applications to Grid en-
vironments: a simple library for threaded transfers with gLite, in Ibergrid — 2nd
Iberian Grid Infrastructure Conference Proceedings, Porto, Portugal, 2008.
— P. Abreu, R. Fonseca, and L. O. Silva, Streaming the Boris Pusher: A CUDA Imple-
mentation, in Advanced Accelerator Concepts: Proceedings of the Thirteenth Advanced
Accelerator Concepts Workshop, Carl B. Schroeder, Wim Leemans, Eric Esarey (Eds.),
USA 2008, ISBN: 978-0-7354-0617-9.
CONTRIBUTED TALKS
— P. Abreu, Real-Time Scientific Visualization and Exploration of Very Large Datasets,
in Cumulus Doctoral Supervision Workshop, IADE, Lisbon, Portugal, 2005.
— P. Abreu, J.M. Pereira, L. O. Silva, A Distributed Memory GPU Implementation of
the Boris Particle Pusher Algorithm, in Eurographics Symposium on Parallel Graphics
and Visualization, Braga, Portugal, 2006.
— P. Abreu, R. Fonseca, L. Silva, Migrating large output applications to Grid environ-
ments: a simple library for threaded transfers with gLite, in Ibergrid — 2nd Iberian
Grid Infrastructure Conference, Porto, Portugal, 2008.
— P. Abreu, R. Fonseca, and L. O. Silva, Streaming the Boris Pusher: A CUDA Implementation, in Thirteenth Advanced Accelerator Concepts Workshop, Santa Cruz, USA,
2008.
POSTERS IN CONFERENCES
— P. Abreu, J. M. Pereira, L. O. Silva, A Distributed Memory Grid Enabled GPU Imple-
mentation of the Boris Particle Pusher Algorithm, in APS 2006 48th Annual Meeting
of the Division of Plasma Physics, Philadelphia, USA.
— P. Abreu, J. M. Pereira, L. O. Silva, Exposing Stream Processors as Grid Services: a
GPGPU Example, in Ibergrid — 1st Iberian Grid Infrastructure Conference, Santiago de Compostela, Spain, May 2007.
— P. Abreu, J. M. Pereira, L. O. Silva, A Distributed Memory GPU Implementation
of the Boris Particle Pusher Algorithm, in AstroGPU 2007 — Workshop on General
Purpose Computation on Graphics Processing Units in Astronomy and Astrophysics, IAS,
Princeton, USA, November 2007.
— P. Abreu, R. A. Fonseca, J. M. Pereira, L. O. Silva, PIC codes in new processors: a
full relativistic PIC code in CUDA enabled hardware with real-time visualization,
in 21st International Conference on Numerical Simulation of Plasmas, Lisbon, Portugal, 2009.
— V. K. Decyk, T. V. Singh, P. Abreu, R. A. Fonseca, L. O. Silva, Designing Particle-in-
Cell Algorithms for Advanced Computer Architectures: Application to GPUs, 51st
Annual Meeting of the APS Division of Plasma Physics, Atlanta, USA, 2009.
CHAPTER 2
COARSE GRAIN PARALLELISM
We already saw in Chapter 1 how the deployment of scientific codes on a Grid is one
of the possible ways to access enough computing power and memory space to solve
some of the grand challenges that scientists face today. However, deployment of an
existing high-performance scientific code on a Grid might not be a trivial task, since
Grids usually require specific implementations for security, data access and transfer,
system load, etc.
In this chapter, we present DataLight, a simple library based on EGEE middleware
function calls, in particular lcg_utils, that handles the transfer of large data file sets
from the local storage of the Worker Node to an EGEE Storage Element in a way that is
simple, robust, transparent to the user and minimizes performance impact. The library
is very easy to integrate into non-Grid/legacy code and it is flexible enough to allow
for overlapping of several data transfers and computation. We also present a careful
analysis of the performance impact on several application scenarios, and we analyze
the use of the library in a real-case scenario, where it is used in a high-performance plasma simulation code.
In addition, we refer to an ongoing effort to streamline the user interaction to
launch HPC codes on the Grid. We are currently developing a web service for launch-
ing and checking jobs on the Grid, and for keeping track of the output files for easy
post-processing. This service interoperates easily with DataLight and should make
accessing the large data files that result from large scale simulations a simple task for
the user.
2.1 INTRODUCTION
2.1.1 THE EGEE GRID
As explained in Chapter 1, the EGEE is one of the largest Grids in the world. It started
as a dedicated support Grid for the Large Hadron Collider experiment (LHC), with
the apt name of LCG, for Large Hadron Collider Computing Grid. At the time, the
middleware used (LCG-2) was specifically developed for it, with some tools used from
the Globus Toolkit 2. This made it a 2nd generation Grid. The LCG was so successful
that its scope broadened to include several other scientific research areas beyond the
original High Energy Physics, like Fusion and Biomedicine. This move came with the change in name to EGEE, which originally meant “Enabling Grids for eScience in
Europe.” Due to its present intercontinental scope, “Europe” was dropped from the
name, but the short version kept the original four letters.
The EGEE uses its own middleware, called gLite [1], which is based on the Globus
Toolkit 4, with other complementary services added. It is a full 3rd generation Grid
software stack, implementing most of the WSRF standards [28] and based on the com-
ponents described in [7] and [11], namely:
Security: This service includes components for authentication, authorization, and delegation, based on X.509 certificates and proxies (for the delegation part).
Data management: This service includes low level data transfer tools like GridFTP [29]
and GFAL [30], and high-level tools like the Logical File Catalogue (LFC) and the
File Transfer Services [31] (FTS). Some of these services and tools will be further
explained throughout this chapter.
Execution management: Mainly the GridICE [32] and the Workload Management System
[33] (WMS), this component is responsible for launching, monitoring,
and updating the state of jobs on the Grid.
Information services: This service is responsible for keeping track and publicizing the
state of the Grid. It is based mainly on the Monitoring and Discovery System
[34] (MDS).
Common run-time: This component ensures that applications deployed on the Grid
find a common environment (libraries, compilers, processors, operating sys-
tems) to run.
These services are usually hosted in specific machines, which get their typical
EGEE names according to their function. For example, a participating site usually has to deploy and maintain the following nodes:
User Interface (UI): The UI is the node where the user logs in to interact with the Grid.
It is the only place where he/she is required to have a local account and where
the user certificates are kept. From this node the user authenticates, queries the
Information Services, launches jobs, queries their status and retrieves results. It
usually runs most of the tools to query the gLite services, like authentication and
certificate proxy request, information services queries, job and file management,
etc.
Computing Element (CE): The CE includes both the computing power of a Grid site
and the gateway to that computing power. These two roles are usually divided
between two different types of machines that supply different functions: the
Gatekeeper (GK) service and the Worker Nodes (WN).
The GK runs services for authentication and authorization, and a local infor-
mation system that publishes information regarding local resources. It also has
to interact with the local batch system, which is responsible for running the re-
ceived jobs. The WN are the nodes where jobs are actually run. They are usually
part of a local cluster managed by a batch system and connect to the GK.
In a nutshell, users from the UI can specify which CE they want to run their jobs
on. Jobs enter a CE for execution through the GK, which interacts with the local
batch system.
Storage Element (SE): The SE is responsible for the data management services (transfer
and storage) offered at a Grid site. These include, besides the authentication and
authorization services, the Storage Resource Management (SRM), the gridFTP
protocol and the GFAL API protocol. This chapter deals mainly with the data
services in gLite, so we will explain these services more thoroughly later (Sec-
tion 2.3.1).
Monitoring Box (MON): The MON service collects the information given by different
software sensors installed on the different site machines.
Later in this chapter (Section 2.2) we will give a general overview of a minimal EGEE
Grid site that we have deployed and maintained.
Other nodes deploy more general services, called core services. They are managed
by larger Resource Centers (RC) and used by all users and smaller sites to allow for
interaction with the global Grid infrastructure. Some of them are:
Top Berkeley Database Information Index (Top-BDII): The Top-BDII collects the information
provided by the individual sites’ information systems. It provides a centralized place where Grid services and sites can query that information.
File Catalogues (FC): The FC are a database for file services. They map the logical
names of Grid files to their storage locations and their global identifiers.
Resource Broker (RB): The RB implements the match between the user requirements
and the available resources (from queries to the top-BDII and to the FC). It also
runs the Logging and Bookkeeping service (LB) that stores the status of all jobs
submitted to the RB and can be queried by the user through the UI. In the latest
version of the software it has been replaced by the WMS.
2.1.2 MOTIVATION FOR PLASMA SIMULATION CODES
A very common type of HPC algorithm is one that produces very large data files as
output. An example is the class of algorithms used in physics simulations, where
the motion and interaction of several thousands or millions of particles are simulated.
Codes like this (e.g., PIC [35]) typically require a relatively small amount of data as
input (∼ 1 kB) but can produce a huge amount of output (up to 1 TB).
We have seen that the EGEE Grid is highly optimized for dealing with both data-intensive and computationally intensive applications. In fact, since it was built on top
of the existing LCG, two of its main purposes are to offer high computing power to
applications, and high storage and replication facilities for data. However, this leads
to a certain dominance of Grid applications that require large amounts of data as input
(like the ATLAS or CMS experiments), against Grid applications that produce large
amounts of data as output.
As the EGEE project expanded and matured, other uses for its Grid infrastructure
were devised, and other Grid projects have been developed based on it, like the In-
teractive European Grid [36] (I2G). There has also been an important development of
tools that facilitate the migration and integration of existing applications to the Grid
and the development of new applications [37, 38, 39]. Some of these applications,
although well suited to be deployed on the Grid, might need different uses of its in-
frastructure than what has been usual until now.
In our case, we have a very clear motivation to work in deploying code on the
Grid, since our research center develops, maintains and deploys two massively paral-
lel plasma simulation codes (Osiris [40] and dHybrid [41]), which have been tuned for
HPC systems ranging from one to several hundreds of thousands of processors. For
this kind of massively parallel codes, deployment on a computing Grid can be an
important advantage, since one has several clusters with different capabilities available,
both in computing power (number of processor cores and speed of each core) and in
total memory. It is possible to choose the Grid cluster according to the problem size to
solve, and parameter scanning is also available, where one can launch variations of the
same kind of problem over several Grid clusters. For large problems like these, that
run over several hours or days, the overhead that the Grid middleware might add is
usually not relevant to the total computing time. Moreover, since computing clusters are
usually associated with large storage systems, the usually large output of the simulation
can easily be stored on the Grid, to be accessed later.
Our plasma simulation codes are thus good candidates for application deployment in the EGEE and I2G Grids, not only because they can make good use of the computing
power and memory available, but also because they challenge the current migration
infrastructure, since they usually produce large data sets (several hundreds of GB)
as output for post-processing (e.g., data analysis and visualization). It is important
to make these large amounts of data produced by such codes available as quickly as
possible, ideally still at runtime, so that the user can have almost immediate access to
preliminary results. It is also likely that if the application is writing the results locally
on the Worker Node (WN), this node might not have enough storage for the complete
output data, especially when it is on the order of several hundreds of GB. In this chapter,
we present a solution on how to handle this output in a way that has the least impact
on the performance of the application deployed on the Grid.
On one hand, we developed DataLight, a library that handles the transfer of large
data file sets from the local storage of the Worker Node to an EGEE Storage Element
in a way that is simple, robust, transparent to the user and minimizes performance
impact. On the other, we are developing a web portal that allows users to launch HPC
applications on the EGEE Grid and to interactively check on the data files produced.
As they are automatically transferred to the closest SE through DataLight, the portal
is updated and the user notified, so that he can access these files for post-processing
while the job is still running on the Grid.
The rest of this chapter is organized as follows: in Section 2.2 we present the EGEE
Grid site we have deployed and maintained since 2007; in Section 2.3 we describe the
development of the DataLight library, first by presenting an overview of the available
gLite tools for data transfer (page 20), then by exposing the algorithm developed for
the library and the methodology used in evaluating its performance (page 23), and
finally by analysing the results (page 29); finally, in Section 2.4 we present the current
state of the web portal and point some possible directions for future development.
2.2 AN EGEE GRID SITE
In 2007 we deployed an EGEE Grid site to allow our research group
to collaborate with the Fusion Virtual Organization [42] of the EGEE Grid. Most of the
time during 2007 it was a testing site. During 2008 it was in production, although with
some significant downtime due to hardware issues. 2009 was a year of more stable
production, where the availability and quality of service improved significantly.
The site has the name CFP-IST and is physically located in the Complexo Interdis-
ciplinar building of the IST campus at Alameda Afonso Henriques in Lisbon.
CFP-IST consists of three physically distinct systems that offer the following gLite
services:
— the LCG-CE service, offered on the physical machine with an FQDN of golp-ce.ist.utl.pt.
This is a Computing Element service (CE) with a default batch system
combination of Torque [43] and Maui [44], that makes available to the Grid the
computing power of this site.
— the site-BDII service, located on the same machine golp-ce.ist.utl.pt. This ser-
vice announces to the top-BDII the occupancy and availability of the CE.
— the UI service, located on the physical machine with a FQDN of golp-ui.ist.utl.pt. This is the user interface entry point, where users can login, compile, launch
jobs to the Grid, check the status of those jobs, and retrieve their results. This is the
only system where grid users are allowed to log in.
— the MON service, located on the same machine golp-ui.ist.utl.pt. This is the
monitoring service of the grid site, that collects all the info and status from the CE
and checks that all announced services are available.
— the SE service, located on the physical machine with a FQDN of golp-se.ist.utl.
pt. This is a storage service of 1.2 TB, available to the Grid. The storage system isthe DPM (Disk Pool Manager) based on MySQL.
Behind the CE are the Worker Nodes (WN), which actually run the jobs submitted
to the CE. At the moment they consist of 10 cores spread over 6 machines: two
single-core machines with 3 GB RAM per core and four dual-core machines with 2 GB
per core.
The original configuration in 2007 was just two single-core WNs with 2 GB per core
and a SE with 500 GB of storage. The configuration received a significant upgrade in 2008 with
more RAM (also for the service systems, which now have 3 GB each) and the four dual-core
WNs. In 2009 the storage space of the SE was upgraded to 1.2 TB.
Besides this, during 2009 there were also improvements in power dissipation and
physical space occupation. All the systems were moved from individual off-the-shelf
cases to 3U rack enclosures fitted in a 42U cabinet, and they received new power
supplies. This move increased the reliability of the system and the overall availability.
Figure 2.1 shows the physical machines and the cabinet where they are housed.
During 2009 we ran 3001 jobs for the Fusion VO, which corresponds to 3.5% of the
total number of jobs run by the SWE federation, and to a total of 12972 CPU hours. We
also ran jobs for the SWETest, dteam and ops VOs, as is normal for any
node of the south-western Europe (SWE) federation.
FIGURE 2.1: The EGEE site CFP-IST.
Currently (2010) the site is undergoing an important upgrade. We have started
the process of expanding the WNs to 54 cores with 2 GB RAM per node, and of adding a RAID system of 6 TB (expandable to 24 TB). The complete system will move to a
64-bit operating system. The RAID is already in place and being tested, and some of
the WNs are already deployed in 64-bit mode.
2.3 DATALIGHT
In this section we present DataLight, the library referred to previously, which handles the
transfer of large data files to an EGEE Storage Element. We already explained the motivation behind the development of this tool in Section 2.1.2. We start by presenting
an overview of the available gLite tools for data transfer. Then we explain the
algorithm developed for the library and the methodology used to evaluate its performance. Finally, we analyze the results, both in different test case scenarios and in
a production code.
2.3.1 AVAILABLE TOOLS FOR DATA TRANSFER
Producing large data sets as the result of simulations running on the EGEE Grid poses
an interesting problem for application development. On one hand, gLite offers two high-level APIs for data management (GFAL and lcg_util [1]), which are well suited
for the reliable transfer, replication and management of data on the Grid, and one low-
level API for data transfer submissions (FTS [31]), which is suited for point-to-point
movement of physical files. On the other hand, such data management operations
should occur during simulation time, not just after it, so that the generated data
is made generally available as soon as possible and the WNs are never filled with too
many output files. This requirement can lead to performance degradation (either at
the application level or at the WN level) due to the overhead and slowness of network
transfers when compared to local storage access.
FTS
Although specifically designed for the reliable transfer of large data sets, FTS [31] only
deals with physical files from point to point. Its asynchronous nature allows for non-
blocking transfers, but it also involves an overhead (job submission) and a complexity
(need for channel setup, lack of logical references, need for SURL endpoints) that are
not adequate for porting a user application to the Grid.
GFAL
GFAL [1] offers a POSIX-like API for data management on the Grid. Instead of acting
on local files, the API allows for remote access to Grid files, stored in Storage Elements
(SE), using the gLite data management layer in a way that is transparent to the applica-
tion [45]. However, for the case under study, GFAL is not appropriate
for the following reasons:

Opaque file formats: Output data produced by simulations is often encapsulated
in a pre-defined file format. If the application does not use POSIX file calls and
instead uses the API defined by the file format specification, this can make GFAL
impossible to use. A good example is the Hierarchical Data Format (HDF) [46] (see Figure 2.2).
Although in this case the API is open and the source is available, which would
allow the developer to modify the existing API implementation to use GFAL
instead of the usual POSIX calls, this task may be too complicated to even attempt, since it usually involves changing highly optimized code, developed and
maintained for several years by a team of developers. In other
cases the source is not available and such a modification is simply not
possible.
FIGURE 2.2: Using GFAL instead of POSIX calls with HDF. The left column shows the software
stack involved in a standard HDF usage. From top to bottom: the application reads and writes
HDF files through standard API calls; the API reads and writes the data locally using POSIX
calls, which are passed to the underlying Operating System. The right column depicts the
necessary layers to add if the GFAL API is to be used: the HDF API would have to be modified
to use GFAL instead of POSIX calls. This would allow for access to a remote Storage Element.
Application stall and memory usage: While the data is being written, it is still in
memory. In some simulation algorithms it is not possible to partition the result space in time so that results from one simulation cycle are
successively generated and stored; instead, the complete result data for one
simulation cycle must be available in memory for writing at the same time. The
application has to wait for that writing step to finish before it advances to the
next simulation cycle.

Using GFAL in this case would have a big performance impact, since a network transfer is several orders of magnitude slower than local storage access,
even more so if we add the overhead involved in using the gLite middleware.
Threading this step would avoid locking the application, but would not avoid high memory occupation during a very long period. In practical terms, this would
have the same effect as an application lock, since the new simulation cycle has to
wait for the previous data to be released in order to use the memory thus freed.
LCG_UTIL
The lcg_util API [1] allows applications to copy files between a WN and a SE (among
other Grid nodes), to register entries in the file catalogue and to replicate files between
SEs. This API deals with complete files (unlike GFAL), which makes it adequate for
the transfer of large files. In addition, it offers a higher-level abstraction for data management (unlike FTS), which makes the programmer's task simpler. Hence, it was our
natural choice when evaluating the best method for transferring large data sets of
simulation results from a WN to a SE.
OTHER TOOLS
A great amount of work has been done on replicating and transferring large
data sets on the Grid. In fact, in the first years of this century the European Union
funded the DataGrid project [47], aimed at sharing “large-scale databases, from hundreds of TeraBytes to PetaBytes, across widely distributed scientific communities.”
This project was recently included in the EGEE Grid. As it was not aimed at generating
large amounts of data, but at its replication and accessibility, most of the tools developed for it are not fit for our purpose, since they require significant code changes at the
application level [48, 49, 50].
Several of the Grid services for data transfer rely on GridFTP [29, 51]. In fact, even
the lcg_util API uses the low-level GridFTP tools. Although other tools have been
proposed for file transfers on the Grid (like [52]), we found that a GridFTP-based tool
had all the stability, reliability and performance that we needed. Also,
GridFTP is one of the most widely used middleware layers for data transfer on Grid systems, so building our library on top of a software stack based on GridFTP is a first guarantee
of portability.
2.3.2 METHODOLOGY
Our goal is to develop a tool that eases the deployment on the Grid of applications
that output large amounts of data, especially in the case where that data should be
made available as soon as it is produced. To this end, we have implemented a simple
library for threaded transfers for Grid applications using the gLite middleware. We
have also tested the impact of such a tool on a particular type of Grid application:
a typical numerical simulation run, where result data is generated regularly, after a
certain number of simulation tasks (usually, after the computation of the state of the
experiment over a certain number of time steps). This usually involves
making the result data available to the Grid, i.e., transferring the data from the WN
to a SE and creating replicas. The library was also used in a real-case scenario, with a
state-of-the-art code for kinetic plasma simulations.
We have evaluated different application scenarios for incorporating the transfer of
(usually large) data files into a running Grid application. As evaluation criteria, we define the following four measurable time intervals: total turn-over time (T ) is the total
(wall clock) time the application is running, including both simulation and data result
transfers; first turn-over time (F) is the (wall clock) time it takes to make the first result
data available to the user (e.g., on a Grid's SE); simulation step time (S) is the average
(wall clock) time a simulation step takes to complete, until the output data is locally
available on the WN to be transferred; finally, complete simulation time (C) is the average (wall clock) time the application takes from start until it finishes the last simulation
step (i.e., it is T minus the average time of a data transfer step). Figures 2.3–2.6 depict
these four time intervals in different application scenarios.
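As a sketch of how these four intervals relate to observable events, the following helper derives them from wall-clock timestamps. The function and its arguments are our own illustration, not part of DataLight:

```python
def intervals(start, end, step_durations, first_transfer_end, last_step_end):
    """Derive the four time intervals (all wall clock) from event timestamps.

    start:              when the application started
    end:                when the application exited
    step_durations:     how long each simulation step took (up to its local write)
    first_transfer_end: when the first result file became available on the SE
    last_step_end:      when the last simulation step finished
    """
    T = end - start                                      # total turn-over time
    F = first_transfer_end - start                       # first turn-over time
    S = sum(step_durations) / len(step_durations)        # average step time
    C = last_step_end - start                            # complete simulation time
    return T, F, S, C
```

For example, for a serial Data First run with three 10 s steps, each followed by a 5 s transfer, `intervals(0, 45, [10, 10, 10], 15, 40)` yields T = 45, F = 15, S = 10 and C = 40 (T minus one transfer, as per the definition above).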
We have identified two main application scenarios to evaluate: serial, where data
transfer and computational steps alternate, but the application is either simulating or
transferring, and threaded, where data transfer occurs at the same time as the next computational step, thus overlapping data transfers with computation. The serial scenario
establishes a baseline of the time intervals (defined above) against which the threaded
scenario is compared. One would expect the threaded scenario to
always yield better performance; results show, however, that a serial scenario might
be the best choice in some situations.
SERIAL SCENARIO
There are two relevant variants of the serial scenario:

Serial Data First (DF): In this scenario, the output data is transferred as soon as it is
ready, so the application alternates between a simulation step and a transfer
step (Figure 2.3). The transfer step can be seen as part of the simulation,
since the application must finish the transfer of the data results of one step before
it can continue to the next simulation step. We expect F to be the shortest in this
scenario, which might be useful for interactive Grid applications (e.g., I2G).
FIGURE 2.3: Serial Data First (DF) scenario variant: the application alternates between a simulation step and a transfer step.
Serial Data Last (DL): In this case, all simulation steps are done first, and data transfers only start after the calculation phase is over (Figure 2.4). This variant
is not relevant as a whole for our purpose, since it only starts transferring
when the complete output is available, thus not solving one of the problems we
want to address: the lack of local storage space on the WN. Besides, implementing such a scenario is very simple and can be done with a script that
transfers the whole data set after the simulation is done. However, we found this
variant useful for measuring the shortest C, that is, the total time the simulation would take if a user were running it on a desktop machine rather than on
the Grid. Thus, the difference between C in this scenario and T in other scenarios
is a measurement of the penalty of data transfer on the application.
THREADED SCENARIO
The serial scenarios described previously, although with some minimal advantages,
serve mainly to establish a minimal F (scenario DF) and a minimal C (scenario DL),
FIGURE 2.4: Serial Data Last (DL) scenario: the application completes the simulation and only after that initiates the transfer of the result data.
in order to evaluate the penalty that a threaded approach might have on those two
parameters. It also establishes a T (total turn-over time) that we aim to improve.
However, the best results might be achieved if we transfer the output of the previous
simulation step while the application is calculating the next one, thus overlapping
data transfer and computation. This allows for the optimal overlap of two tasks that
have minimal influence on each other: the simulation step, which is CPU- and memory-intensive but has little effect on IO, and the network transfer, which mainly stresses the IO
(local storage and network) and has a lesser impact on the CPU. This overlapping can
be achieved by launching a transfer thread, as the last step of each simulation cycle, for
the network transfer of the data. This data threaded scenario (DT) is the third scenario
we will evaluate.
DATA INTENSITY
Overlapping data transfers and calculation leads to two new variants of the DT scenario, depending on the amount of data generated per simulation step:
Data weak: In this scenario, the transfer time of the data result from a simulation step
is usually shorter than the duration of that step (Figure 2.5). This means that data
transfers can occur as soon as data is available. We expect T to be much shorter
than in the corresponding DF and DL scenarios, while F will not be much larger
than in the DF scenario, and C will not be much larger than in the DL scenario.
Data intensive: In this scenario, each data transfer takes longer than the average simulation step (Figure 2.6). As a result, although time gains will still be substantial
FIGURE 2.5: Weak Data Threaded (W-DT) scenario: data transfers occur on a different thread
than the simulation and usually take less time than one simulation step.
when compared to DF and DL scenarios, we expect them to be more influenced
by the transfer thread.
These two threaded variants can only be compared against the corresponding serial variants, i.e., against a DF and a DL where the simulation step takes the same
amount of time. Thus, we also define the corresponding variants for the serial scenarios: Weak Data First (W-DF), Weak Data Last (W-DL), Intensive Data First (I-DF), and
Intensive Data Last (I-DL).
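Under idealized assumptions (a single transfer slot, no middleware overhead), the expected T and F for the scenarios above can be written down directly. The following back-of-the-envelope model is our own sketch, not DataLight code:

```python
def turnover(n, s, x, scenario):
    """Predicted total (T) and first (F) turn-over times for n simulation
    steps of duration s, each producing data that takes x to transfer.
    Idealized model: one transfer slot, zero launch overhead."""
    if scenario == "DF":            # step, transfer, step, transfer, ...
        return n * (s + x), s + x
    if scenario == "DL":            # all steps first, then all transfers
        return n * s + n * x, n * s + x
    if scenario == "DT":            # transfers overlap subsequent steps
        if x <= s:                  # data weak: transfers hide behind steps
            return n * s + x, s + x
        return s + n * x, s + x     # data intensive: transfer chain dominates
    raise ValueError(scenario)
```

For n = 10 steps of s = 1 s with x = 2 s transfers, the model gives T = 30 s for both DF and DL but 21 s for DT; in this idealized model DF and DL always share the same T, so the measured differences between them in Section 2.3.4 come from overheads the model ignores.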
2.3.3 ALGORITHM
We have developed and implemented an application to test the different scenarios
outlined previously. It uses an algorithm that is flexible enough to allow for multiple transfer threads, optimizing bandwidth usage at the cost of some
processing time, although that possibility is not analyzed in the current work. In Figure 2.7 we show a simplified flowchart of the application.
The main loop of the application (top left of Figure 2.7) consists of a simulation cycle followed by a write of the result data to local storage. These two steps are repeated
until the complete simulation finishes and are the only steps in Figure 2.7
that actually belong to the test application. All the other steps are part of the DataLight library. The connection between the application and DataLight occurs after the
FIGURE 2.6: Intensive Data Threaded (I-DT) scenario: data transfers occur on a different
thread than the simulation and usually take longer than one simulation step.
FIGURE 2.7: A simplified flowchart of the algorithm implemented to allow for simultaneous
Grid simulation and network transfer of results.
local write operation. Each local write operation is followed by a call to DataLight,
which adds the just-written file to a queue of files waiting to be transferred.
The first time DataLight is called, that is, the first time the main thread
writes to this queue, it launches a queue manager thread (right-hand side of Figure 2.7), which is responsible for managing the waiting queue. This queue manager
dispatches the files on the waiting queue in FIFO order to a transfer queue. It
checks whether there are enough transfer slots available (these are bandwidth and system dependent) and, for each file that is moved from the waiting queue to the transfer queue,
it initiates a transfer thread (bottom left of Figure 2.7). Launching the transfer threads
from a manager thread instead of the main thread allows for greater flexibility: they
can be interrupted and resumed, they can occur out of order, and more than one can be
running at the same time.
Each transfer thread uses a simple algorithm represented on the bottom left of
Figure 2.7. It is up to each transfer thread to remove the file reference from the transfer
queue and to remove the local file as soon as the transfer finishes.
Finally, when the complete simulation finishes, the main thread sends a signal to the
queue manager thread, which waits until all queues are empty and then quits, thus finishing
the application.
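The queue-manager pattern just described can be sketched as follows. This is a Python illustration with invented names, not the actual library code (which, for instance, launches the manager lazily on the first call and performs the transfers through gLite):

```python
import threading
import queue

class TransferManager:
    """Sketch of the waiting-queue / transfer-thread pattern of Figure 2.7."""

    def __init__(self, transfer_fn, slots=1):
        self.waiting = queue.Queue()              # files waiting, FIFO order
        self.slots = threading.Semaphore(slots)   # available transfer slots
        self.transfer_fn = transfer_fn            # e.g. a gLite copy wrapper
        self.threads = []
        # The real library starts this thread on the first add_file() call.
        self.manager = threading.Thread(target=self._manage, daemon=True)
        self.manager.start()

    def add_file(self, path):
        """Called by the main thread after each local write."""
        self.waiting.put(path)

    def finish(self):
        """Signal that no more files are coming; wait for all transfers."""
        self.waiting.put(None)                    # sentinel ends the manager
        self.manager.join()
        for t in self.threads:
            t.join()

    def _manage(self):
        while True:
            path = self.waiting.get()             # blocks while queue is empty
            if path is None:
                return
            self.slots.acquire()                  # blocks until a slot is free
            t = threading.Thread(target=self._transfer, args=(path,))
            self.threads.append(t)
            t.start()

    def _transfer(self, path):
        try:
            self.transfer_fn(path)                # move file to the SE
        finally:
            self.slots.release()                  # free the transfer slot
```

In the library, `transfer_fn` would wrap the actual gLite copy-and-register call and delete the local file on success; here any callable works, which also makes the pattern easy to test in isolation.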
2.3.4 RESULTS AND ANALYSIS
We have used the library extensively both in test and in production codes. Although
the library allows for multiple transfer threads, this work focuses on the use of a single
transfer thread concurrently with computation. Tables 2.1 to 2.4 list the relevant time
intervals obtained when using our library in a test simulation. Times are averaged
over 20 runs and are measured for data sizes of 8 MB (Table 2.1), 32 MB (Table 2.2),
128 MB (Table 2.3), and 512 MB (Table 2.4). Figures 2.8 and 2.9 show the evolution of
the two most relevant time intervals defined previously for the different data sizes:
T (total turn-over time) and F (first turn-over time). Figure 2.8 represents the data
intensive variant, and Figure 2.9 the data weak variant.
All tests were done on WNs with a single-core 2 GHz AMD Athlon 64-bit processor
with 1 GB of RAM running Scientific Linux 3.0.8 and gLite 3.0 (lcg_util 1.5.2). The
SE belongs to the local Grid node and has a similar configuration to the WNs. The
network between the WNs and the SE was a shared 100 Mb/s LAN. No other jobs were
running on the WNs concurrently.
Examining the results, it is interesting to note that, although the DL scenario performs poorly in almost all criteria, the other serial scenario (DF) has better perfor-
time Data intensive simulation Data weak simulation
interval I-DF I-DL I-DT W-DF W-DL W-DT
F 5.4 15.4 7 27 477.6 33
T 108 109 119 550 571 542
S 0.51 0.52 0.9 21.9 23.6 26.8
C 103 104 113 545 566 536
All values in seconds
TABLE 2.1: Average measured time intervals for 8 MB files.
time Data intensive simulation Data weak simulation
interval I-DF I-DL I-DT W-DF W-DL W-DT
F 8.5 51 10.1 43 784 55
T 170 213 158 864 1053 945
S 1.96 2.2 2.4 37.1 38.5 47.0
C 163 205 150 858 1038 937
All values in seconds
TABLE 2.2: Average measured time intervals for 32 MB files.
time Data intensive simulation Data weak simulation
interval I-DF I-DL I-DT W-DF W-DL W-DT
F 19 259 59 160 3145 179
T 378 1763 972 3192 4085 3324
S 8.6 9.0 11 149 155 166
C 367 1684 924 3181 4035 3311
All values in seconds
TABLE 2.3: Average measured time intervals for 128 MB files.
time Data intensive simulation Data weak simulation
interval I-DF I-DL I-DT W-DF W-DL W-DT
F 294 2760 423 618 8946 723
T 12818 6926 5957 12357 13147 8988
S 113 127 129 421 436 437
C 5710 6706 5662 12160 12928 9702
All values in seconds
TABLE 2.4: Average measured time intervals for 512 MB files.
FIGURE 2.8: Comparing the I-DT scenario (solid line) with the I-DF (dotted line) and I-DL
(dashed line) using the F and T time intervals.
FIGURE 2.9: Comparing the W-DT scenario (solid line) with the W-DF (dotted line) and W-DL
(dashed line) using the F and T time intervals.
mance than the threaded scenario (DT) in several situations. On the data intensive
variant (Figure 2.8), DF and DT are comparable for small data sizes (8 and 32 MB, Tables 2.1 and 2.2). But with files of 128 MB, the serial scenario has better T and F than
the threaded one (Table 2.3). This means that the first output data will be available
earlier and the application will finish sooner in the DF scenario for output files of that
size. We think that the main reason for this unexpected behavior is the CPU overhead of launching a data transfer using lcg_utils, as explained below. For large data
files (512 MB, Table 2.4), the DT scenario shows better performance, which is to be expected. For the data weak variant (Figure 2.9), DF and DT have similar performance,
except for large data files, where the threaded variant performs better.
The other two time intervals (S and C) show a similar trend (Figures 2.10 and 2.11):
the threaded variant allows for earlier availability of the output data files as these files
get larger, although the DF scenario is usually shorter in all the tests. The S results
also point to an important penalty for transferring small files in gLite: calculating F − S
yields at least about 5 s. This is the minimum time it takes to transfer files from the
WN to a SE using lcg_utils, since it is the difference between the average first turn-
over time and the average simulation time. We obtain these ∼ 5 s both with 8 MB and
with 32 MB files; only above that size does the transfer take significantly longer. During
this 5 s overhead it is to be expected that the calculation thread will suffer, since
the CPU is also being requested by the transfer thread. Hence performance gains of the threaded scenario will only be noticeable for data files that take
significantly longer to transfer than the overhead of ∼ 5 s.
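The arithmetic behind the ∼ 5 s figure can be checked directly against the I-DF columns of Tables 2.1 and 2.2 (a quick sanity check on the published numbers, not part of the library):

```python
# F - S approximates the fixed lcg_utils launch overhead plus the raw
# transfer time; values taken from the I-DF columns of Tables 2.1 and 2.2.
F = {8: 5.4, 32: 8.5}      # first turn-over time (s) by file size (MB)
S = {8: 0.51, 32: 1.96}    # average simulation step time (s)

f_minus_s = {mb: round(F[mb] - S[mb], 2) for mb in F}

# On a shared 100 Mb/s LAN (at best ~10 MB/s), the raw transfer alone would
# take roughly 0.8 s for 8 MB and 3.2 s for 32 MB, so in both cases most of
# the remaining time is middleware overhead of around 4-5 s.
```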
FIGURE 2.10: Comparing the I-DT scenario (solid line) with the I-DF (dotted line) and I-DL
(dashed line) using the S and C time intervals.
FIGURE 2.11: Comparing the W-DT scenario (solid line) with the W-DF (dotted line) and W-DL
(dashed line) using the S and C time intervals.
2.3.5 APPLICATION DEPLOYMENT
We have deployed our library in the code Osiris [40], a state-of-the-art, massively parallel, fully relativistic electromagnetic Particle-In-Cell (PIC) code for kinetic
plasma simulations. In this code, the interaction of a large number of charged plasma
particles (up to ∼ 10¹⁰) is modeled using a particle-mesh technique especially suited
to this problem, called PIC [35]. We will describe the PIC algorithm in more detail in
Chapter 3, Section 3.2. Applications of this code include astrophysical shocks, ultra-intense laser-plasma interactions, and nuclear fusion using fast ignition.
Osiris is written in Fortran 95. In its current implementation, our library presents
a C interface with only two functions exposed: one to add a file to the waiting list of
files to be transferred (see Figure 2.7), which takes as its only parameter the full path of that
file, and another, taking no parameters, to tell the queue manager to finish all transfers and quit.
The user should call the first after each file is saved, and the latter at
the end of the simulation. As mentioned before, currently the URL of the SE is set at
compile time of the library. To integrate our library with Osiris, thus allowing it
to interface with gLite and produce large output files to the Grid, two small function
wrappers were written to allow direct calling of our interface from Fortran. From the
Osiris point of view it was then only required to call the function that adds a diagnostic
file to the transfer queue after it is written to local storage, and to call the function that
terminates the queue manager at the end of the simulation, so that Osiris would wait
for the last transfers to finish. The output files are in the HDF format, which is handled
transparently by our library, like any other file format (see Section 2.3.1). The whole
migration process took only a few hours;1 we expect the linking of our library with
other codes/languages to present a similar level of difficulty.
In our simulation tests, the application ran for nearly one hour, producing more
than a hundred output files of very different sizes, varying from a few kB to hundreds
of MB, for a total of ∼ 5.5 GB. Performance gains from moving from a DS (serial) to a DT implementation were very significant: the total turn-over time T was reduced from 3378 s to
2164 s, an improvement of over 35%. We also expect that for longer simulations, with
durations of up to ∼ 100 hours, the results will be even better. Furthermore, for such
simulations, receiving the output data as soon as it is produced is an added advantage
when compared to other HPC batch systems.
In Osiris, a simulation cycle is usually much shorter than the transfer time over
the network. Hence, Osiris could be considered a “data intensive” application, as
defined on page 25. This means that performance gains using DataLight would be
1This migration process and the writing of the Fortran-to-C wrappers were done by Prof. Ricardo
Fonseca.
limited, since the whole simulation is limited by the network bandwidth. However,
most Osiris simulations do not write out results at every time step; that would make the simulation very slow (even using local storage) and very expensive in terms
of data storage. Often, diagnostics are produced only after a certain number of time steps;
depending on the simulation, that number can be tens or hundreds. In this case, Osiris
actually behaves as a “data weak” application, meaning that the overall running
time is dominated by the simulation and not by the data transfer. This can lead to very
significant performance gains, as we saw when we moved from the serial scenario
(DS) to the threaded scenario (DT) in our deployment of DataLight with Osiris.
Deploying DataLight with Osiris overcame our main obstacle to using plasma simulation codes on the Grid. With this library we can now launch Osiris or dHybrid (or any other plasma simulation code, like QuickPIC [53]) on any Grid site that has gLite
and MPI (like any I2G site) and be certain that the data produced will be promptly delivered to a SE of our choice.
2.4 PORTAL DEVELOPMENT AND OTHER FUTURE WORK
One of the main strengths of DataLight is its simplicity. It allows for the overlapping
of computation and data transfers with minimal effort for the developer of the non-
Grid application. The current implementation allows some parameters to be set at
compile time, like the URL of the SE (not mandatory). However, we have found it useful
to add an initialization interface that makes such options available at run time, using
command-line parameters to be included in a JDL file. This will allow finer user
control of the run-time parameters of the library while keeping its simplicity.
The tests presented here were run with one transfer thread and two streams per
transfer in parallel (the lcg_cr parameter nbstreams). This was chosen in order to limit
the variable parameters during this first test phase, while at the same time optimizing
bandwidth. However, we want to do more complete tests on the performance of our
library with multiple transfer threads, since it was built from the beginning with this
feature in mind. We expect the performance gains to increase when compared to the
DF scenario until the bandwidth limit is reached.
We have also started development of a Grid web portal for users of the Osiris
and dHybrid codes. The main idea is to make the infrastructure where the code is
running as transparent to the user as possible. The user accesses a web
page, sets the simulation parameters, targets a certain system, and it is up to the portal
to select the correct binary and launch the job. The user does not need to use the
specific commands of the target system.
An important feature of this portal is its integration with DataLight. We have
added the ability to automatically keep track of when the output files are available for post-processing. With this feature, the user can easily take advantage of the much
shorter first turn-over time, F, instead of having to wait for the total simulation time,
C, or the total turn-over time, T. This feature has to be explicitly added to the code
(in our case, to the DataLight library) and to a portal, since the usual gLite job status,
shown in Figure 2.12, cannot be expanded to include intermediate running status.
FIGURE 2.12: Job state machine in gLite (adapted from [1]). The states are SUBMITTED, WAITING, READY, SCHEDULED, RUNNING, ABORTED, CLEARED, DONE (cancelled), DONE (failed), and DONE (ok).
In gLite, a job stays in the RUNNING state as long as it is running on the remote
system. There is no standard way for the job to communicate back to the user which different phases it might be running through. In our implementation, the portal is updated during the RUNNING state with intermediate status messages from the DataLight
library. These messages are displayed on the web page that the user is watching and
also stored in a database. In this way the user is aware of the output files becoming
available on the SE that was chosen, and he can start post-processing them while the
simulation is still running on the Grid.
CHAPTER 3
FINE GRAIN PARALLELISM
Kinetic plasma simulations using an electromagnetic particle-in-cell (PIC) algorithm
have become the tool of choice for numerical modeling of several astrophysical and
laboratory scenarios, ranging from relativistic shocks and plasma shell collisions [54,
55], to high intensity laser-plasma interactions [56], with applications to fast ignition
[57, 58], to particle acceleration [59], and to radiation generation [60, 61], among oth-
ers. However, fully relativistic kinetic codes are computationally intensive, and new
computing paradigms are required for one-to-one direct modeling of these scenarios.
In this chapter we look at the use of modern graphics processing units (GPU) for PIC
algorithm calculations, discussing the implementation of a fully relativistic PIC code
using NVIDIA’s CUDA concepts. Details on the algorithm implementation are given,
focusing on grid-particle interpolation and current deposition. Finally, results from a
test simulation of an electron/positron plasma shell collision are presented, focusing
on code validation and performance evaluation.
3.1 INTRODUCTION
There are many plasma physics scenarios where fully relativistic particle-in-cell codes play a key role in providing essential understanding of the underlying physical processes involved. Due to the problems they try to solve, PIC implementations tend
to be used in large-scale simulations that require large computing resources, usually
from tens to thousands of computing cores. Furthermore, as we explained in Chap-
ter 1, the last 15 years have witnessed a trend in the HPC scientific community to
move from highly customized shared-memory systems to cheap distributed-memory
systems, often built with commercial off-the-shelf (COTS) computers. When the CPU
annual increase in speed slowed down in the middle of this decade, these cheap clus-
ters started to show their limitations, especially in computational power (high latency), space, and power needs. As the physical problems to solve increase in complexity, requiring more memory (capacity and bandwidth) and more computational power, the scientific community has started looking for other sources for its next HPC system.
Programmable GPUs have received attention from the scientific computing com-
munity since their introduction on the market at the beginning of the 2000s [21]. They
constitute a highly interesting alternative to the usual CPUs due to their high comput-
ing power (recently over a teraflop per GPU chip), their relative low cost and power
consumption, and their capability for massive data parallelism (several thousand si-
multaneous threads are available on a single chip of a recent GPU).

However, despite their computational power, general-purpose programs on the GPU (GPGPU) suffered from a serious implementation difficulty: the algorithm had to be mapped to Computer Graphics (CG) concepts in order to use the graphics API, and the program
had to be structured in terms of the graphics pipeline. In fact, in 2006 we implemented
one part of a PIC algorithm (the particle pusher, see Sections 3.2 and 3.4.1) using the
GPGPU approach, in particular using the OpenGL Shading Language (GLSL), version
2.0 [27]. We were able to implement the full Boris pusher in GLSL in about 3 months,
and deployed it in a system with an AMD Athlon 64 3200+ processor and an NVIDIA GeForce 6600. Since the performance on that GPU was similar to a CPU implementation and the development effort was significant, we decided at the time not to advance
further with GLSL and moved later to CUDA. A full report on this GLSL effort can be
found in Appendix B.
To overcome this obstacle of mapping general algorithms to CG concepts, several
approaches were developed (e.g., BrookGPU [20] and Sh [62]). Finally in 2007 NVIDIA
made available an architecture (hardware, drivers, API, and SDK) for HPC on GPUs
called CUDA (Compute Unified Device Architecture). In particular, “C for CUDA”
offers a high-level interface to program and deploy high-performance applications on
GPUs using a C-like syntax and compiler. This system allows developers to build GPGPU applications without using CG concepts while still exploiting the high performance of the GPU. A typical CPU application can thus more easily tap into the GPU's performance without requiring CG knowledge from the programmer.
In this chapter we will present an implementation of a fully relativistic 2D PIC
code in CUDA. We will discuss the main problems faced and the solutions we found
which, in most cases, are valid for deploying most particle mesh algorithms in CUDA
systems. We will also discuss performance issues while benchmarking our validation
test, a 2D Weibel instability.
The rest of this chapter is organized as follows: in Section 3.2 we present the basic
PIC algorithms that were ported to CUDA; in Section 3.3 we present a brief overview of CUDA; in Section 3.4 we discuss the implementation of the PIC algorithms in “C
for CUDA”; in Section 3.5 we present the validation of the code by implementing a
Weibel instability simulation; after that, in Section 3.6 we discuss the results and the
performance of the code, followed by a brief description of other approaches to this
problem (Section 3.7); finally in Section 3.8 we will offer an overview and conclusions
of our work.
3.2 BASIC PIC ALGORITHMS
Directly modeling the interaction between all particles in a plasma is only feasible for a relatively small number of particles, given that this algorithm requires a number of operations scaling as O(N²). Even in a petascale system, for a state-of-the-art kinetic plasma simulation with a total number of particles of ∼10¹⁰ [63], such an algorithm would require ∼1 day for a single time step, out of potentially millions of iterations.
To overcome this limitation, numerical simulations often resort to the particle-mesh
method for calculations. In this method, particles do not interact directly, but rather
through fields defined on a grid. Field values are interpolated at particle positions
to determine the forces acting on each particle, and particles are deposited back on the grid to advance field values. Although the number of operations depends on the
number of grid cells, the number of particles, and the interpolation schemes used, it is
several orders of magnitude lower than a particle-particle method.
For plasma physics simulations this algorithm is generally referred to as the par-
ticle-in-cell (PIC) algorithm [64, 35]. Figure 3.1 illustrates a complete electromagnetic
PIC simulation loop. Simulation space is discretized as a regular grid, and the electro-
magnetic (EM) field values are defined on this grid. To advance simulation particles,
we use the Lorentz force acting on the particle, calculated by interpolating the field
values at the particle positions. The electric current resulting from particle motion is
deposited back on the grid, and used to advance the values of the EM field.
In our implementation particles are advanced using a leap frog scheme. Particle
positions x and fields are defined at time step ti, and are used to calculate the Lorentz
force acting on the particle:
du/dt = (q/m) (E + (1/c) (u/γ) × B)    (3.1)
where u is the generalized velocity (u ≡ γv), γ is the Lorentz factor, q/m is the
FIGURE 3.1: Overview of a full cycle of an electromagnetic PIC code: field interpolation, (E, B)_j → (E, B)_i; particle advance, (E, B)_i → p_i → x_i; current deposition, x_i^{n−1} − x_i^n → j_j; field advance, j_j → (E, B)_j; the cycle repeats each time step t. Subscript j refers to cell quantities on the grid; subscript i refers to particle quantities.
charge/mass ratio of the particle, and E and B are the EM field interpolated at the
particle’s position.
The interpolation of the values of E and B at the particle can be seen as the in-
teraction between the simulation grid and the particles. For linear interpolation, this
requires the values of 2 nearest grid points for every direction, leading to 4 points in
2D and 8 points in 3D. It should also be noted that, as we will discuss below, the E and B fields are defined on a staggered grid (see Figure 3.2), which means that the grid points
required for each field component may be different.
FIGURE 3.2: Staggered 2D grids in PIC codes.
Advancing the particles’ generalized velocity from time step t_{i−1/2} to time step t_{i+1/2} is done using the so-called Boris pusher [65, 35], which is a second-order ac-
curate time centered numerical technique that has been successfully applied in many
simulation algorithms [66], and, in particular, widely adapted for PIC codes. This tech-
nique separates the effects of the electric and magnetic forces in four steps: starting at
a time t_{i−1/2}: i) add half the electric impulse to u, obtaining u⁻; ii) rotate u⁻ with half the magnetic impulse, obtaining u′; iii) rotate u⁻ with the full magnetic impulse using u′, obtaining u⁺; iv) add the remaining half of the electric impulse.
Using the generalized velocity at time step t_{i+1/2}, we can then advance the particles’ positions with:

x_{i+1} = x_i + (u_{i+1/2} / γ_{i+1/2}) ∆t    (3.2)
To advance the EM field values we rewrite Maxwell’s equations, specifically Fara-
day’s and Ampère’s law, to give (again in cgs units):
∂E/∂t = c ∇ × B − 4π j    (3.3)

∂B/∂t = −c ∇ × E.    (3.4)
Starting from a known set of field values at t = 0, we can then advance the EM
field components at each time step, provided we find the rotational of the E and B fields, and the electric current j resulting from particle motion. The rotational operator is
approximated by a finite-difference operation on the grid using the technique devel-
oped by Yee [67]. To improve accuracy, the field values are not defined in the same
points inside a grid cell, but rather on staggered grids as described in Figure 3.2, which effectively results in a second-order spatially accurate algorithm. To improve the time
accuracy, we advance the fields in three steps: starting with E and B defined at a time
t_i, and j defined at a time t_{i+1/2}: i) advance B by half a time step using Equation 3.4;
ii) advance E by a full time step using Equation 3.3 and the intermediate B; iii) ad-
vance the intermediate B by the remaining half time step. This method allows for
second order accuracy with no memory penalty.
The electric current resulting from particle motion also needs to be defined in the
staggered grid. However, because of the finite difference approximation of the ro-
tational operator, a simple interpolation of qu/γ is not enough, since it will lead to charge conservation errors. To overcome this, current deposition in PIC codes has cur-
rently two widely used approaches: the Villasenor-Bunemann method [68] and the
Esirkepov method [69]. Both methods ensure exact (analytical) charge conservation
and for linear interpolation both lead to the same result. However the Villasenor-
Bunemann method requires that particle motion is split into motion segments lying
inside the same cell, which is generally realized with a set of “IF” statements. The
Esirkepov method thus yields better performance [70], and was chosen for our imple-
mentation.
3.3 CUDA OVERVIEW
CUDA [71] is both a hardware and a software architecture for creating general purpose
programs on a GPU. At the hardware level, it is available for NVIDIA’s GeForce series
(8000 and later), the Tesla systems and the Quadro equipment. At the software level,
it has a stack composed by the hardware driver, the C-like API and its runtime, and
several higher-level mathematical libraries (CUFFT, CUBLAS).
With CUDA, the GPU is viewed as a compute device capable of executing a very
high number of threads in parallel. Hence, it operates as a co-processor to the main
CPU, or host: a portion of an application that is executed many times, but independently on different data, can be isolated into a function, called a kernel, that is executed on the
device as many simultaneous threads.
CUDA hardware has a set of multiprocessors (Streaming Multiprocessors or SMP),
whose number varies depending on the GPU model. In the current architecture, each
multiprocessor has 8 processors (called Scalar Processors or SP), which set the number
of concurrent threads. Table 3.1 shows the number of SMP and the total number of
processors for some of the CUDA enabled products we have access to.
GPU model SMP/SP
Quadro FX1800 8/64
GeForce 9800 GX 16/128
GeForce 8800 GT 14/112
Tesla C1060 30/240
Tesla C870 16/128
TABLE 3.1: Number of Streaming Multiprocessors (SMP) and total number of Scalar Processors (SP) for some CUDA enabled hardware.
All the SP cores in a chip have access to the whole RAM memory of the device,
which is called Global Memory. In addition, all SP cores in a SMP have access to a
certain amount of memory in the SMP, called Shared Memory. This shared memory is
usually 32 kB in size and has access times two orders of magnitude faster than global
memory — typically, 400 cycles for global memory and 4 cycles for shared memory.
Finally, each SP also has access to its own local memory (8 kB) and 32-bit registers (32).
Figure 3.3 shows an overview of the different memories available in CUDA devices.
FIGURE 3.3: CUDA memory architecture, both from the hardware and software level: each SMP contains its SP cores, each with its own register memory, plus per-SMP shared memory, a constant cache, and a texture cache; the local, global, constant, and texture memories reside in the device DRAM, with the device (GPU) connected to the host (CPU). (For a color version, see Figure C.1 on page 105.)
Data is transferred from the host memory to the device in three ways: to global
memory where, after being processed and transformed by a kernel, it can be read back to the host; to constant memory or to texture memory where it can be accessed by each
SMP in a read-only cached fashion. From global memory, a kernel can read the data to
shared memory to be processed by all the threads in a SMP, or to register memory to be
used by each thread individually. The constant and texture caches are also distributed
per SMP. Finally, local memory is also available per thread. However, as it is allocated
in the device’s RAM, it is as slow as global memory.
These two hardware properties (multiprocessor/processor organization and glo-
bal/shared/local/register memory hierarchy) allow for a very high flexibility in paral-
lel processing power but at the same time force some limitations on programmability.
Flexibility: The grouping of processors in SMP allows for deployment of the same
kernel over multiple GPUs with different capabilities.
The batch of threads that executes a computational kernel is organized as a grid of
thread blocks. Each thread block is a batch of threads that can cooperate together by
efficiently sharing data through the fast shared memory and by synchronizing
their execution to coordinate memory accesses. From the programmer’s point
of view, all threads in a block can cooperate (sharing memory and synchroniz-
ing) as if they are running concurrently; blocks in a grid cannot cooperate. This
thread granularity allows for kernels to run efficiently on various devices with
different parallel capabilities: a device with few parallel capabilities may run all
the blocks of a grid sequentially, while a device with a lot of parallel capabilities
may run all the blocks in parallel; usually, it is a combination of both.
Constraints: Since the amount of fast shared memory is limited, the programmer has
to avoid expensive reads from and writes to global memory. Instead, she/he has
to try to map the algorithm to use the limited shared memory.
Also, although thread switching is very inexpensive in this architecture, at any
given time only 32 threads are guaranteed to be synchronized in a thread block.
This group of synchronized concurrent threads is called a warp. The penalty
for accessing global memory can be minimized if one can group the access in a
half-warp of 16 sequential threads that access sequential memory positions. In
this case, the global memory access is coalesced to one single access that can be
masked if other threads are busy with computational tasks.
Managing coalescent global memory access is of paramount importance to
achieve high performance on a CUDA device. In a way, it is similar to managing the
different levels of cache in more standard CPU processors.
The general process of launching a computational kernel with CUDA is thus:
1. Transfer data from host memory to device global memory.
2. Determine the block and grid size.
3. Launch the kernel that:
(a) copies the relevant set of data from global memory to shared or local mem-
ory,
(b) computes on that data, and
(c) transfers the results to global memory.
4. After returning to the host, it is possible to launch another kernel using the data
already residing in the device’s memory.
5. When all kernels have finished, the results are copied from the device’s global
memory to the host’s memory.
3.4 CUDA IMPLEMENTATION
As explained in Section 3.2, a PIC code can be viewed as a process that iterates over
the following steps:
Particle push, where each particle’s new position is calculated from the Lorentz equation:

dp/dt = q (E + (v/c) × B).    (3.5)
This step involves an interpolation of the EM field values at each particle’s posi-
tion.
Current deposit, where the current is calculated from the path of each particle and
its charge. This step involves a deposition of the current contributed by each
particle at each vertex of the grid.
EM fields update, using the current calculated in the previous step and Equations 3.3
and 3.4.
In the next sections we will analyze in detail the implementation of each step as a
streaming algorithm and in particular in CUDA environments.
3.4.1 PARTICLE PUSH
The computational steps of the Boris pusher are as follows:
1. Starting at a given time t, with a time step ∆t, the values of the velocity are given for t − ∆t/2: v_{t−∆t/2}; however, the position of the particles r_t and the electromagnetic fields E_t and B_t are time centered at t.

2. Calculate v_{t+∆t/2} with:

v⁻ = v_{t−∆t/2} + (q/m)(∆t/2) E_t

v′ = v⁻ + (q/m)(∆t/2) (v⁻ × B_t)

v⁺ = v⁻ + [2 (q/m)(∆t/2)] / [1 + ((q/m)(∆t/2) B_t)²] (v′ × B_t)

v_{t+∆t/2} = v⁺ + (q/m)(∆t/2) E_t.

3. Calculate r_{t+∆t} with:

r_{t+∆t} = r_t + ∆t v_{t+∆t/2}.
When used in a PIC code, E and B are defined at each cell’s vertices, and they have to be interpolated at r_t.
Due to the lightweight thread architecture in CUDA, it is possible to launch hun-
dreds of thousands of threads at once and let the CUDA implementation serialize the
respective blocks. This means that it is possible to push just one particle per thread
and have as many pusher threads as particles, or to push several particles per thread.
We have implemented a parameter in our code that defines the number of particles
pushed per thread, which allows for a customized balance between calculation, mem-
ory usage, and memory bandwidth.
To make sure that the maximum occupancy of the GPU is reached, we have to estimate the number of threads per block, which depends mainly on the amount of the device’s memory usage. In our implementation, we found 64 to be the best choice. Since
the number of particles might not be divisible by the number of threads per block, we
can either add a test at the beginning of each push kernel that checks if the current
thread number accesses a valid particle index, or do two launches of the kernel: in the
first launch, we launch as many thread blocks as possible without going beyond the
total number of particles, and the next launch processes the rest of the particles in one
incomplete block. We have found the latter approach to be marginally faster than the
former (about 1% better with 131 thousand particles). The following code illustrates
the setting of the block and grid size (PPB is a constant set to the number of particle threads per block):

// How many particle threads per block:
m_pblock_dim = make_uint3( PPB, 1, 1 );
m_pgrid_dim = make_uint3( m_Np / PPB, 1, 1 );
m_pOffset = 0;
m_pRest = m_Np % PPB;
if ( m_pRest )
    m_pOffset = m_Np - m_pRest;
After launching the kernel, each thread pushes one particle. Since it only needs the
position r and velocity v of that particle, those values are stored in the threads’ local
memory for immediate access. On the other hand, the E and B fields are needed for
interpolation for all the threads, so they are kept in the global memory of the device.
The following code illustrates the main kernel implementation:
// Find particle index from thread & block index:
int thread_id = threadIdx.x + threadIdx.y * blockDim.x;
int block_id = blockIdx.x + blockIdx.y * gridDim.x;
int idx = block_id * blockDim.x * blockDim.y + thread_id;

float3 r, v; // This particle's pos. and vel.
// Copy from global to local memory for speed:
r = R[idx]; v = V[idx];

float3 Bi, Ei; // Interpolated fields
InterpolateFields_d( r, E, B, PosMin, PosMax,
                     CellsDim, CellsNum, &Ei, &Bi );
CalculateVelocity_d( v, Ei, Bi, wt );
CalculatePosition_d( r, v, PosMin, PosMax, step );
// Store the results back in global memory (still on the device):
R[idx] = r; V[idx] = v;
After the kernel runs, all particles have been pushed one time step. Several imple-
mentation strategies are worth mentioning.
Precision: CUDA hardware only has one double precision arithmetic unit for each
SMP. This means that calculations in double precision have a performance that
is approximately 9 times lower than in single precision. To allow for the use of
single precision calculations, and thus to fully utilize the computational power
of the CUDA hardware, particle positions cannot be defined using the corner of the simulation box as a reference. Instead, we must keep track of the cell where
the particle is located and define the particle position referenced to the cell cor-
ner. In our case, absolute positions are defined with two arrays: one contains
the cell number the particle is in (integer), and the other stores the normalized
position (a single precision float in the interval [0, 1[) inside the cell, referenced
to the lower left vertex. As particles leave a given cell, their positions are cor-
rected so that they are referenced to the new cell. These extra calculations incur a small performance penalty, but the overall gain from using single precision is
far greater.
Memory bandwidth: In order to increase memory access bandwidth, the copy from
global memory to shared memory of particle quantities like r and u is done in a
coalescent fashion by all threads in a block whenever possible. Memory band-
width from/to global memory is further increased by sorting the arrays with the
particles’ data (relative position, cell number, velocity, charge) according to the
cell each particle belongs to. If these arrays are sorted in global memory, then all
accesses have maximal coherence, in particular while accessing the EM field ar-
ray for interpolation. This ensures that accessing particle data is always coherent
if we have one thread per particle.
The code allows for each thread to loop through several particles, pushing one
at a time. In this case, coherence in reading and writing particle data is lost.
However, there is an important gain in reading E and B fields, since it is likely
that the same values will be used by several particles in the same push thread.
Not only is less global memory access needed per particle, but also the computational operations of each push thread are increased relative to the memory access
operations (less data access and more computation). Hence for each particle in
a thread a check is done: if the particle belongs to a different cell than the previ-
ous particle, new values for the EM-field are read from global memory to local
(registers) memory; if the particle belongs to the same cell, then the EM-field
values have already been read and can be used for interpolation.
Hardware interpolation: An interesting feature of modern graphic processors is that
they contain texture units that can do zeroth-order and first-order interpolation.
Thus if we store the EM field as textures, linear interpolation at the particle’s
position can be done by hardware simply as a texture fetch. We have achieved
a three-fold performance increase over a non-texture fetch interpolation. This
approach, however, limits the usage of variables stored as textures to read-only.
It is possible to update a texture with new values by changing its memory buffer
to global memory, but we found the overhead too expensive to be worth it. Nonetheless, although only linear interpolation is used in this work, it is
interesting to note that higher order interpolations have also been achieved in
GPUs in general [72] and in CUDA enabled products in particular [73].
3.4.2 CURRENT DEPOSIT
Current deposition in PIC codes has currently two widely used approaches: the
Villasenor-Bunemann method [68] and the Esirkepov method [69]. In both cases, a
quasi-particle is viewed as contributing with a charge density ρ to the grid cells in
its vicinity using a weighting function. It is the change in the charge density during
the particle push that originates the current that is assigned to the grid. Hence, the
continuity equation
· j +∂ρ
∂t= 0 (3.6)
is satisfied in both methods. If a particle stays within a cell, both methods are equiv-
alent. But if a particle passes a cell boundary in one time step, each method has a
different approach to calculate the current contribution at each cell.
In the Villasenor-Bunemann method, the particle is split into virtual particles, each one staying in a single cell (Figure 3.4). This method has the advantage of being straightforward to calculate and of not being limited to cartesian geometries. It is,
however, limited to first-order interpolation.
FIGURE 3.4: Illustration of particle splitting in the Villasenor-Bunemann current deposition
method.
The Esirkepov method is a generalization of the Villasenor-Bunemann method for higher order interpolation. There is no particle splitting, hence no dynamic memory allocation during current deposition (this is important for a CUDA implementation); however,
it is limited to cartesian coordinates. Since we were working in cartesian coordinates
and especially interested in higher-order interpolation, this method for current deposition was chosen.
Most of the current deposit algorithm explained in [69], where for each particle we
calculate the amount of current to be deposited for each of the cells crossed during the
last time step motion, has a straightforward implementation in CUDA. Nevertheless,
since each particle will deposit current in up to 16 cells in 2D linear interpolation (see
Figure 3.5), particles from many cells may be contributing to the current in any given
cell.
FIGURE 3.5: Illustration of current deposition in the Esirkepov method. The 16 marked cells
have current deposited, although the bottom four and the left four have 0 deposit, as stated in [69].
This step is then likely to produce a significant amount of memory collisions, as
several different threads try to deposit the current in the same cell. These threads can
be from the same block, in which case one might devise some kind of synchronization
between them, but they can also belong to different blocks, in which case synchro-
nization is harder to implement in an efficient way. However, several strategies are
available.
One solution is to serialize the current deposition. This is very simple to implement, at the cost of a significant performance penalty, even after parallelizing the deposition
of each current component. The cost depends on the number of threads that would
have to be queued. In CUDA we would be effectively running hundreds of thousands
of threads (from one particle per thread to a few particles per thread) almost in serial,
just three at a time (one concurrent thread for each space component). The perfor-
mance hit is so significant (about 2 orders of magnitude) that it is more effective to
transfer the data back to the CPU just for the current deposition and transfer it back to
the GPU for the rest of the simulation loop.
Another possibility is to take advantage of the particle sorting per cell referred to
previously. If the particles are sorted according to the cell they belong to (before the
push), then we can accumulate the current on a one-thread-per-cell basis: each thread
loops through all the particles in a cell and deposits the current at each of the 16/25
vertices. If synchronization is guaranteed, then we can be sure that only one cell is
updated at a time. This approach has the drawback of adding a sorting step at every
simulation cycle. The sorting penalty could be minimized by adding more cells per
thread and thus sorting only every other time step. A similar approach was taken
in [74], dividing the simulation space into slabs that are assigned to thread blocks.
Since collisions are avoided within a block [75], no sorting is required within a slab,
only between slabs.
A third possibility is to use atomic operations. These operations ensure that there
are no memory collisions by locking a memory position being written by a given
thread. CUDA offers a limited set of atomic operations that act both on global and
shared memory. In particular, it provides an atomic add operation for integer values,
although this operation is not available for floating-point data. To overcome this
limitation, the atomicExch() function, which is also available for floating-point
numbers, can be used. Given a memory address and a value, this function atomically
exchanges the given value with the one currently in memory.
Using this function, it is possible to devise a simple algorithm that simulates any
atomic operation, as long as that operation is commutative and has a neutral element:
1. Given a work value to accumulate at a certain position in memory:
2. ensure that it is different from the neutral element;
3. atomically exchange the value at that memory position with the neutral
element of the accumulation operation, obtaining the previous value;
4. accumulate the previous value with the work value, generating a new value;
5. atomically exchange the value at the memory position with the new value.
6. If the value we got back is not the neutral element, some other thread placed
it there as the result of its own calculation; use it as the new work value and
go to 3.
7. Otherwise, if it is the neutral element, finish.
Figure 3.6 shows the pseudo-code that illustrates the steps described previously.
function PSEUDOATOMICOPERATION(address, value)
    work_value ← value
    while (work_value ≠ neutral_element) do
        prev_value ← atomicExchange(address, neutral_element)
        new_value ← operation(prev_value, work_value)
        work_value ← atomicExchange(address, new_value)
    end while
end function
FIGURE 3.6: Pseudo-code for the generalized atomic function.
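The pseudo-code of Figure 3.6 can be exercised outside CUDA. The sketch below, in C11 rather than C for CUDA, uses atomic_exchange (the standard-library analogue of CUDA's atomicExch()) and is specialized to addition, whose neutral element is 0; it is a minimal illustration of the technique, not the production kernel code.

```c
#include <stdatomic.h>

/* Pseudo-atomic accumulation built only on atomic exchange (Figure 3.6),
   specialized to addition: the neutral element is 0.  atomic_exchange is
   the C11 analogue of CUDA's atomicExch(). */
void pseudo_atomic_add(_Atomic float *address, float value)
{
    float work_value = value;
    while (work_value != 0.0f) {   /* loop until we get back the neutral element */
        /* grab the current contents, leaving the neutral element behind */
        float prev_value = atomic_exchange(address, 0.0f);
        /* accumulate the previous value with the work value */
        float new_value = prev_value + work_value;
        /* write back; a non-neutral return value means another thread
           deposited something meanwhile, which becomes our new work value */
        work_value = atomic_exchange(address, new_value);
    }
}
```

Correctness relies exactly on the conditions stated above: the operation must be commutative and have a neutral element.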
In our first approach, we started by implementing the pseudo atomic option. There
are several advantages to this approach. First, it allows for a straightforward code
which is easier to port from/to other multi-threaded architectures which might have
a similar atomic exchange operation, but may not suffer from the same global memo-
ry/shared memory issues. Second, this allows us to avoid load balancing issues, since
we can deposit the current per particle and not per cell. With a per-cell current deposit,
a non-uniform particle distribution would necessarily cause different loads throughout
the threads: those dealing with cells with fewer particles would finish before those
dealing with cells with more particles. With one particle per thread, or with a more
general n-particles-per-thread approach, this non-uniform thread load is avoided, since
all threads handle the same number of particles and thus have the same amount
of calculations to perform.
However, one disadvantage is that we have 2 atomic exchanges per particle and
per cell, which might slow down the code seriously if the particles are ordered per
cell inside a block of threads. A simple solution is to initialize the array of particle
positions in a way that minimizes the chance that two consecutive particle
indices refer to particles inside the same cell. Even better, all threads in a warp (the
group of 32 threads that are known to be synchronized by CUDA) should be working
with particles in different cells, preferably cells that are far apart so that there are no
collisions inside that warp. Although it is easy to initialize the particles so that this
condition is met, some care has to be taken to ensure that it still holds throughout
the simulation. As the simulation evolves and particles are moved around, this initial
distribution condition will no longer hold and some current deposition conflicts will
occur, causing a degradation in performance. When that degradation is higher than a
certain threshold, a redistribution of the particles over the threads is required, corre-
sponding to a sorting operation with a certain penalty. However, the threshold for the
sorting can be adjusted so that this penalty is minimized in the overall performance
balance.
So our final implementation included the pseudo-atomic option, a sorting operation,
and a stride distance, that is, the distance in particle index between the particles
handled by one thread and by the next consecutive thread. This ensured that
atomic collisions were minimized during current deposition. In Section 3.6, a
discussion of the sorting penalty and the optimal stride value is presented. Figure 3.7
illustrates the implementation of the stride during particle push and current deposi-
tion.
FIGURE 3.7: Illustration of a stride during particle push and current deposition, so that threads
in the same warp handle particles in different cells. p is the number of particles per thread and
s is the stride. (For a color version, see Figure C.2 on page 106.)
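The thesis does not list the exact thread-to-particle formula, but one mapping with the property illustrated in Figure 3.7 — consecutive threads handling particles a stride s apart, while still covering every particle exactly once — is the following sketch (the formula is our assumption, valid when s divides the particle count N):

```c
#include <stddef.h>

/* Hypothetical strided thread-to-particle mapping: with N particles and a
   stride s that divides N, thread t is assigned particle
       idx(t) = (t*s) mod N + (t*s) div N ,
   so threads t and t+1 handle particles s indices apart (except at wrap
   points), and idx is still a bijection over [0, N). */
size_t strided_index(size_t t, size_t stride, size_t n_particles)
{
    size_t k = t * stride;
    return k % n_particles + k / n_particles;
}
```

For example, with N = 8 and s = 4, threads 0..7 visit particles 0, 4, 1, 5, 2, 6, 3, 7.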
For the sorting step, we had to consider that, in our implementation, particles’
data is stored as an array, with each value containing all the data corresponding to
one particle: the cell it is in, its position inside the cell, its generalized velocity and its
charge. Hence sorting was done in three steps: the first step collected the cell index
of each particle as an independent array; the second step sorted that array; and the
third step shuffled the full particles’ array the same way as the cell index array. Not
sorting the full particles’ array during the second step has the advantage of avoiding
several non-coalesced memory accesses of particle data. Since the array to sort holds
just unsigned integers, we used the radix sort algorithm and its available implementation
in CUDPP, which is considered one of the fastest implementations of the
radix sort algorithm [76].
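The three sorting steps can be condensed into the following CPU-side sketch. Here a stable counting sort over the cell indices stands in for the CUDPP radix sort, and the struct layout is only an illustration of the compact per-particle storage; neither is the actual GPU code.

```c
#include <stddef.h>

/* Illustrative compact per-particle record (cell index plus in-cell data). */
typedef struct {
    unsigned cell;   /* index of the cell the particle is in */
    float x, y;      /* position inside the cell, in [0, 1[  */
} Particle;

/* Sort particles by cell index: histogram the keys, turn the histogram
   into start offsets (prefix sum), then scatter each particle to its
   slot.  The scatter is stable, like the radix sort used on the GPU.
   `tmp` (n entries) and `count` (n_cells entries) are caller scratch. */
void sort_particles_by_cell(Particle *p, Particle *tmp,
                            size_t n, size_t n_cells, size_t *count)
{
    for (size_t c = 0; c < n_cells; c++) count[c] = 0;
    for (size_t i = 0; i < n; i++) count[p[i].cell]++;      /* histogram  */
    size_t start = 0;                                       /* prefix sum */
    for (size_t c = 0; c < n_cells; c++) {
        size_t cnt = count[c];
        count[c] = start;
        start += cnt;
    }
    for (size_t i = 0; i < n; i++)                          /* scatter    */
        tmp[count[p[i].cell]++] = p[i];
    for (size_t i = 0; i < n; i++) p[i] = tmp[i];
}
```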
3.4.3 EM FIELD UPDATE
Each thread updates its value of the EM field as described previously (Equations 3.3
and 3.4), using the staggered grid of Figure 3.2 and a three-step finite-differences
method. Several kernels had to be implemented and launched in sequence, since this
process requires at least two global synchronization
steps: one after the update of the first half of the B field and another after the full
update of the E field. Each step is implemented as a kernel, and between them an
update of the guard cells is also necessary. An overview of the sequence of kernel
launches is illustrated in Figure 3.8. All these kernels are launched as one thread per
cell — or one thread per lower left vertex of a cell.
FIGURE 3.8: Illustration of the call sequence of the three different kernels required for a full
EM field update.
3.4.4 BOUNDARIES AND GUARD CELLS
The algorithm was implemented using periodic boundary conditions, defining the
simulation space as closed and periodic. This means that the neighbors of the cells
lying at the lower boundary of the simulation are the cells at the upper boundary, and
vice-versa, and that particles leaving one side of the simulation box re-enter on the
other side. This is implemented through the use of guard cells, which are extra cells
added at the simulation boundaries, where the values from the grid points at the
opposite boundaries are replicated. This allows for a
much simpler simulation algorithm, where all cells are treated equally, and guard cell
values are updated after the iteration is completed. To this end we have implemented
two different kernels to handle guard cells: one adds the accumulated current density
from the guard cells to their corresponding physical cells in the grid, while the other
copies the EM fields from the physical border cells to the guard cells. It should also be
noted that this technique is similar to what is used in distributed memory parallel PIC
algorithms, where guard cells in one computation device correspond to grid points on
a neighboring computation device, and could, in principle, be used in a system where
multiple CUDA devices operate cooperatively.
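As a concrete CPU-side sketch of the second guard-cell kernel, consider one guard layer on each side of an nx × ny periodic grid stored row-major; the layout and names are illustrative assumptions, not the actual kernel code.

```c
#include <stddef.h>

/* Index into an (nx+2) x (ny+2) row-major field array with one guard
   layer per side; physical cells occupy i = 1..nx, j = 1..ny. */
#define F(f, i, j, ny) ((f)[(size_t)(i) * ((size_t)(ny) + 2) + (size_t)(j)])

/* Copy the EM field values from the physical border cells into the guard
   cells on the opposite side (periodic boundaries).  The second pass runs
   over the full extended range so the corner guards are filled too. */
void copy_fields_to_guard_cells(float *f, size_t nx, size_t ny)
{
    for (size_t j = 1; j <= ny; j++) {        /* left/right guard columns */
        F(f, 0,      j, ny) = F(f, nx, j, ny);
        F(f, nx + 1, j, ny) = F(f, 1,  j, ny);
    }
    for (size_t i = 0; i <= nx + 1; i++) {    /* bottom/top guard rows    */
        F(f, i, 0,      ny) = F(f, i, ny, ny);
        F(f, i, ny + 1, ny) = F(f, i, 1,  ny);
    }
}
```

The companion kernel, which adds guard-cell current back into the corresponding physical cells, is the mirror image of these loops.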
3.5 APPLICATION TO WEIBEL INSTABILITY
We have tested our CUDA simulation code by running a well-known benchmark
problem of the Weibel instability [77, and references therein] in electron-positron plas-
mas. In this simulation, two plasma clouds, one positron plasma cloud and one elec-
tron plasma cloud, initially moving perpendicularly to the simulation plane with some
temperature distribution, are set to evolve self-consistently using a fixed neutralizing
ion background. The simulation parameters are as follows:
Generalized velocity: u = γv = (0; 0; ±0.6), with species 1 (electrons) having the
positive velocity and species 2 (positrons) the negative velocity.
Thermal distribution: uniform Gaussian with 0.1 width.
Grid parameters: 128 × 128 cells; simulation space: 12.8 c/ωp × 12.8 c/ωp.
Time step: ∆t = 0.07, which, together with the simulation space size, satisfies the
Courant condition.
In this scenario the evolution of the plasma clouds is governed by the Weibel insta-
bility, leading to the formation of current filaments that then coalesce leaving behind a
set of empty plasma bubbles (in 2D), as can be seen in Figures 3.9 and 3.10. These bub-
bles remain stable, with the plasma pressure being balanced by the radiation pressure
of the EM fields trapped inside the bubble. The initial kinetic energy of the particles is
partially transferred to the EM fields, leading to the formation of magnetic field loops
that slowly merge until the simulation reaches a steady state, as seen in Figure 3.11.
Most of the simulations were done with 6 × 6 = 36 particles per species per cell,
but this parameter could be changed interactively. The simulations completed
successfully on several CUDA-enabled systems. The results were benchmarked against
the EM-PIC code OSIRIS [40] in single precision, giving consistent results. The small
differences found were related to a different choice of random number generator, and
differences in the particle order for current deposition, which leads to minor differ-
ences due to numeric roundoff issues.
3.6 RESULTS AND ANALYSIS
The performance of the CUDA PIC code was evaluated using the simulation
mentioned in the previous section. The code was run with two different configurations
of particles: 8 particles per cell (4 for each species), giving a total of 131,072 particles
(further referenced as the 131 k run), and 72 particles per cell (36 for each species),
for a total of 1,179,648 particles (referenced as the 1.2 M run). Table 3.2 presents an
overview of the most relevant results using a Tesla C1060 board for simulation and a
Quadro FX 1800 for rendering.
We also ran our code on the CUDA devices listed in Table 3.1. Our purpose was
to evaluate NVIDIA’s claims on the portability of CUDA code across a wide range
FIGURE 3.9: Charge density of electrons during the formation of a Weibel instability.
FIGURE 3.10: Final state of a full 2D PIC simulation of a Weibel instability (positrons in cyan
and electrons in yellow). (For a color version, see Figure C.3 on page 106.)
FIGURE 3.11: B_1^2 + B_2^2 intensity during the formation of a 2D Weibel instability (four
panels, (1)–(4)).
# particles    render (ns/particle)    cycle (ns/particle)
131 k          0.020                   72
1.2 M          0.019                   39
TABLE 3.2: Time measurements of a full PIC CUDA 2D implementation of a Weibel instability.
Grid dimensions are 128 × 128 cells. The system used a Tesla C1060 board for CUDA and a
Quadro FX 1800 for rendering.
of devices of different computational capabilities. Although we noticed obvious per-
formance differences, our code produced consistent results across all equipment. The
Tesla C1060 was the one that presented the best computational performance, so it is
the one used throughout this chapter.
For comparison, we implemented the same code in C and ran it on the host ma-
chine, an Intel Xeon E5420 2.50 GHz with 6 MB cache. The C code was compiled using
gcc version 4.3.2 with full optimizations enabled (-O3). The results, using one core of
the CPU, were 401 ns per particle on the 131 k run and 379 ns per particle on the 1.2 M
run.
It is interesting to note the difference in performance between the 131 k run and
the 1.2 M run. The reason is that below one million particles, the Tesla C1060 is
not quite full yet, and does not have enough threads to hide the memory latency in
transfers between global and shared or register memory.
We emphasize that the rendering time can be considered irrelevant (less than 0.05%
of a cycle). Even if we manage to lower the cycle time by one order of magnitude, the
rendering overhead will still be acceptable if direct visualization is desirable (see Section 4.2).
We have also estimated the timings for the different parts of the algorithm. We note
that only an estimate is available, since it is not possible to completely isolate all the
algorithm parts for benchmarking. For example, sorting is required for effective
current deposition. Since it is possible to time the sorting step without current deposition,
but not the other way around, we estimate the current deposition step by subtracting
the sorting time from the total of current deposition and sorting. Table 3.3 and Fig-
ure 3.12 show the time duration of the different push steps: reading and writing out
the particle data from/to global memory to/from registers, sorting the particles’ array
according to the cell’s index, interpolating the fields at the particle’s position, calculat-
ing the new velocity and advancing the particle, and depositing the current for each
particle.
Step                   Time (ns/particle)
Read & write           4.5
Sort                   6.0
Field interpolation    0.66
Vel. & part. advance   0.4
Curr. deposition       23
Total                  34.6
TABLE 3.3: Time estimations for different parts of the particle pusher algorithm in the
1.2 M run.
[Pie chart: J deposition 67%, Sort 17%, R/W 13%, Interpolation 2%, Vel.+Pos. 1%]
FIGURE 3.12: Percentage of time spent at each step of the pusher algorithm. Based on the
values from Table 3.3. (For a color version, see Figure C.4 on page 107.)
The previous timings do not show the field update step, since it depends on the
number of cells and not on the number of particles. The results were 9 ns per cell for
the complete cycle depicted in Figure 3.8. During the complete simulation cycle this
value is not relevant, since usually one has many more particles (millions) than cells
(tens of thousands). For example, in our 1.2 M run, the total time per cycle was 46 ms,
of which just 0.15 ms corresponded to the EM field update.
We have implemented several of the performance parameters discussed in previous
sections and evaluated their impact on code efficiency. The optimal number of
threads per block depends on the kernel in use. We found that a number of 64 or 128
was optimal for the pusher kernel (that also includes field interpolation and current
deposition), and that 128 was best for the fields update. These values do not depend
on the number of particles or of vertices, but on the amount of shared memory and
registers required per kernel.
To avoid collisions during current deposition, two consecutive threads should not
handle particles in the same cell (Section 3.4.2 and Figure 3.7). We therefore implemented
a stride parameter that defines the index distance between the particles handled by
two consecutive threads. Also, to increase the amount of arithmetic operations per
memory access, we have implemented a parameter that defines the number of par-
ticles handled by each thread. Interestingly, we have found that these parameters have
similar effects. When handling one particle per thread, a stride of 4× the number of
particles per cell guaranteed best performance and minimal collisions. Similarly, with
a stride of 1, we have found that handling 4× the number of particles per cell per
thread achieved a similar performance. Figures 3.13 (a) and (b) represent the varia-
tion in performance per particle for different particles per thread (different lines) and
different stride sizes (x-axis), for the 1.2 M run. Figure 3.13 (a) refers to particles per
thread that are submultiples of the particles per cell, and Figure 3.13 (b) to particles
per thread that are multiples of the particles per cell. As already said, the best perfor-
mance is achieved with a stride of 1 and 4 times the number of particles per cell per
thread (39 ns per particle). However, a stride of 4 times the number of particles per
cell with one particle per thread gets a similar performance.
Two other simulation parameters were implemented and tested, but did not con-
tribute to a performance improvement: first, the possibility of serializing the launching
of kernels at the CPU level; and second, not sorting at every time step.
[Plots (a) and (b): ns/particle (y-axis, 40 to 60) versus stride size (x-axis, 1 to 64), with
one line per Ppt value: Ppc/8 = 9, Ppc/4 = 18, Ppc/2 = 36 and Ppc = 72 in (a); Ppc = 72,
2×Ppc = 144, 4×Ppc = 288 and 8×Ppc = 576 in (b).]
FIGURE 3.13: Evaluation of performance (y-axis, in ns per particle) with stride size (x-axis)
and particles per thread (Ppt, different lines), for the 1.2 M run. Each line uses a number of
particles per thread (Ppt) which is either a submultiple (a) or multiple (b) of the number of
particles per cell (Ppc).
It was considered that, to further minimize memory collisions during current
deposition, the kernel threads could be launched in batches, thus guaranteeing that the
CUDA device was loaded in a manner that minimized the risk of two particles
updating the current in the same cell at the same time. This meant that some part of the
thread block control was done at the CPU level and not by the CUDA implementation.
This led to worse performance than launching all threads at once.
We also considered that not sorting at each time step might save the 6 ns sorting step
without incurring too many extra memory conflicts. This was not the case: the memory
conflicts caused by not sorting increased the current deposition time by several tens of
nanoseconds, so that, in the end, the best performance was achieved by sorting at
each time step.

Based on the values of Table 3.3, we also estimated how efficiently the device was
being used. Table 3.4 shows the estimated number of single precision floating point
operations per second (in GFLOPS) of different parts of the pusher algorithm, and
also of the whole algorithm. A comparison with the peak performance of the Tesla
C1060 [78] is also shown. The same code running on a single core of the Intel Xeon
E5420 (theoretical peak performance of 5 GFLOPS) achieved 1.25 GFLOPS. Hence it
can be estimated that for this algorithm one Tesla C1060 can replace 3 Intel Xeon E5420
(or approximately 11 cores).
Step                   GFLOPS   Efficiency
Fields interp.         103      33%
Vel. & part. advance   200      64%
Curr. deposition       14.1     4.5%
Complete pusher        13.6     4.4%
TABLE 3.4: Performance estimates for some parts and for the complete pusher algorithm.
The simulation used the 1.2 M run and 128 × 128 cells running on the Tesla C1060. Based
on the values from Table 3.3 and the theoretical peak performance of 311 GFLOPS.
The application of this algorithm to large scale plasma simulations is ultimately
limited by the total memory available on the CUDA device (currently up to 6 GB
on a Tesla C2070), with state-of-the-art simulations [63] requiring ∼ 10^10 particles to
be followed for ∼ 10^6 − 10^7 timesteps, with total memory requirements going up to
∼ 1 TB. The solution will undoubtedly rely on the use of a (massively) parallel GPU
system, where an ecosystem of CUDA devices operates cooperatively, communicating
through some form of interconnect, much like a distributed memory parallel com-
puter. Our implementation of boundary conditions using guard cells, as explained
in Section 3.4.4, is well suited to a spatially decomposed parallel version of our algo-
rithm, where each CUDA device is responsible for a smaller region of the total
simulation space. Each CUDA device would only need to communicate with neighboring
devices, sending guard cell values and particles crossing the device boundary. However,
this implies a device-to-host transfer when sending data to other nodes, and a
host-to-device transfer when receiving, both at every timestep, which may have
a negative impact on overall performance.
3.7 OVERVIEW OF OTHER WORK ON PIC AN D CUDA
During the course of this thesis, several groups presented results in this area. We briefly review their results. Our own effort before CUDA, using GLSL
instead, is reported in Appendix B.
Stantchev et al. [74] published in 2008 the first paper on this subject. Their focus
was just on the current deposition part and they developed an algorithm to optimize
that scatter operation. In that algorithm, particles were grouped in bins and spatial
cells were grouped in clusters. Particles in one bin belonged to the same cluster. This
allowed for an easier mapping of particle bins and cell clusters to hardware SMPs,
and avoided full sorting of particles at each time step. However, as there were still
memory conflicts during current deposition, a “thread tagging” technique was used, like the one described in [75].¹
In 2009, Burau et al. [79] developed a lightweight relativistic PIC code in CUDA.
The problem of memory collisions was overcome by allowing for violation of charge
conservation, using the data structure and particle sorting technique of [74] (called
linked particle list), and resorting to atomic operations. They also used texture interpo-
lation to accelerate read access to particle data. Finally, they deployed the code in a
small GPU cluster, interleaving MPI communication calls with CUDA kernels.
Decyk et al. [80] implemented an electrostatic spectral PIC code in CUDA, and
expanded it further with parameters that adapted the same code for different multi-core architectures [81]. They also expanded the notion of the linked particle list by
having empty spaces in the particles’ array. This allowed for minimal sorting at the
cost of more memory usage. Based on this work, Kong et al. [82] implemented a full
relativistic PIC code using an adaptation of the Villasenor-Bunemann method [68] for
current deposition. In this step, they group grid cells in clusters and use an interleave
¹In this technique, the value to be deposited has some of its most significant bits reserved as a
unique thread identification. This allows each thread to detect whether it was able to successfully
deposit the value or not.
technique to avoid conflicts between clusters; inside a cluster of cells, they also use the
“thread tagging” technique of [75].
As explained previously (see Section 3.4.2), in this work we decided to implement
the Esirkepov method of current deposition. It has less thread divergence
than Villasenor-Bunemann and allows for higher order interpolation. To avoid mem-
ory conflicts, we developed an atomic operation on floats. This avoids the precision
loss of the “thread tagging” technique, caused by using some bits to store a unique
thread tag. We also used a stride distance between consecutive threads, to minimize
atomic collisions. We found this approach to be more flexible and easier to extend to
3D than the bin/cluster approach. We also use a compact memory scheme to store
particles. This requires less memory and allows for faster direct display of results (see
Section 4.2.4).
3.8 OVERVIEW AND CONCLUSIONS
We have implemented a fully relativistic 2D PIC code on a GPU using C for CUDA.
We have validated the implementation using a well-known benchmark problem of the
Weibel instability in electron-positron plasmas. The code performs significantly faster
on a Tesla C1060 than on a single core of an Intel Xeon E5420. The main performance
bottleneck is current deposition (approximately 67% of a simulation cycle), since it
involves a scattering operation to global memory. We were able to avoid serializing this
step by implementing a pseudo atomic add with floats, which can be extended to in-
clude other kinds of atomic operations, as long as they have a neutral element and are
commutative. To avoid memory access conflicts during the current deposition step,
several strategies were implemented and evaluated. The best results were obtained
with a particle sorting mechanism and by ensuring that consecutive threads deposit
current in different cells. This approach achieved minimal memory conflicts during
current deposition.
CHAPTER 4
MULTI-SCALE VISUALIZATION
4.1 DISTRIBUTED RENDERING WITH VISIT
Visual exploration of data produced by a numerical simulation is a major task in sci-
entific High-Performance Computing. The amount of time a scientist takes to explore
and understand the data generated by a numerical simulation/experiment can be
several orders of magnitude larger than the time it takes to produce it. For example, our
numerical simulation infrastructure at GoLP (simulation codes and the hardware systems
where these codes run) is able to produce terabytes of data in a few days of simulation. Hence
powerful tools for data exploration are essential.
Our research center has developed a visualization infrastructure called VisXD [63]
which is able to handle most of our current visualization needs. It is based on IDL [83]
and contains routines for manipulating large data sets and for minimizing the effort of
creating high-quality graphics.
However, VisXD is still a single-processor tool. Although the Operating System
(OS) might automatically distribute the computational tasks over the available cores,
VisXD is not able to work in distributed memory systems. That is a major limitation
for the exploration of large data sets, since in that case the display data to be processed
by the graphic processor (GPU) might not fit into the graphics RAM of a single system.
To overcome such limitations, we did an evaluation of current distributed-memory
visualization tools. We have evaluated SCIRun [84], Paraview [85] and OpenDX [86],
but finally VisIt [26] was chosen, because it was the only one to offer all the features
needed: parallel data processing and rendering (distributed and shared memory),
scriptability, expandability, high performance and scalability, and multi-platform support.
We developed a data plugin for VisIt that allows the results produced by our sim-
ulation software (Osiris, dHybrid, QuickPic) to be visualized and analyzed. We have
also set up a set of data post-processing tools in VisIt, like vector intensity and kinetic
energy diagnostics, that tightly integrate with VisIt’s visualization tools, allowing for
an easier exploration of data features that are not directly available in the raw output
data from the simulation.
In addition, we have also set up a distributed memory visualization cluster with 10
cores and 5.5 GB of RAM to test the parallel visualization capabilities of VisIt and eval-
uate its usefulness in exploring data with several gigabytes per time step (Figure 4.1).
FIGURE 4.1: Part of the hardware for distributed visualization testing. These four machines
represent 8 cores and 4.4 GB of RAM. Another similar node (not in the picture) acted both as
the scheduler and as a computing node, and added the remaining 2 cores and 1.1 GB of RAM.
The same system was also deployed on IST Cluster [87] (an AIX system with 280
cores and 2 GB of RAM per core located at IST) and on the ExPP Cluster [88] (a 40
core PowerPC cluster for the internal use of GoLP at IPFN). These efforts showed
the feasibility of expanding our visualization infrastructure to distributed memory
systems without spending too much effort in a migration and re-factoring process.
We are currently developing a plugin for HDF5 data (the previous data plugin
only allowed HDF4) and we are expanding the available post-processing tools, namely
by integrating particle tracking. This work represents a major first step towards a fully-
parallelized visualization infrastructure at IST, which is critical for multi-gigabyte data
analysis.
4.2 DIRECT VISUALIZATION WITH CUDA
As with any numerical experiment, visualization plays a critical role in particle-in-
cell simulations. This can be a time-consuming and computationally demanding task
that can benefit greatly from running the simulation on the GPU itself. Since most of
the CUDA enabled devices available are graphic processors (NVIDIA’s GeForce and
Quadro boards), the data is already available in video memory, and can readily be dis-
played, avoiding time consuming memory transfers from CPU to video memory. Even
in CUDA hardware that is not a video card, such as NVIDIA’s Tesla boards, the
bandwidth to video RAM over PCI Express ×16 is high (several GB/s), allowing for
very efficient visualization. There are also several post-processing visualization
diagnostics, such as smoothing or energy calculations, that can make use of the
available computational power on the GPU, bringing an added benefit to doing direct
visualization. In this sense, we have developed code in OpenGL that tightly integrateswith our PIC implementation in CUDA (explained in Chapter 3), expanding it to al-
low for the display and exploration of the resulting simulation data. This expanded
system is able to display millions of particles and also to produce several standard
diagnostics (EM field, current, charge density) and other custom ones at interactive
frame rates.
4.2.1 PARTICLE (POINT) VISUALIZATION
As with any particle code, point data is one of the fundamental data types to be visualized.
As explained in Section 3.4.1, particle positions are stored as coordinates normalized
to the grid cell, using values in the range [0; 1[, together with the particle
cell index, stored as an integer. To display particle (point) data in OpenGL, two
approaches are available: i) for each simulation particle, we could move the coordinate
system to the origin of its cell in the grid and render a point at the particle position
using the corresponding scale factor (the cell dimension length), or ii) we could convert
the normalized particle coordinates and cell index into a global position array that
would then be used to draw the particles.
In this work, we chose the latter approach, since it offers several benefits over the
first:
— handling different scaling in each dimension is easier and more straightforward
than with OpenGL coordinate scaling;
— the conversion of coordinates can be done very efficiently in CUDA, since each
datum is independent of the others and is already available in GPU memory;
— after conversion, the data can be displayed very quickly in OpenGL using a Vertex
Buffer Object (VBO) [89, chap. 2]; VBOs cannot be used efficiently in the first
method.
Particle visualization is then done using a CUDA kernel that converts from normalized
to absolute coordinates. The conversion is done in a single pass, maximizing memory
bandwidth. The resulting data is then used as a VBO, to be displayed in a single
OpenGL render command. All operations are performed in video memory, achieving
minimal overhead. We already showed an example of direct particle visualization
during the simulation of a 2D Weibel instability in Figure 3.10. We present in
Figure 4.2 a similar picture, since that particular simulation step will be
used in the next section as a basis for several diagnostics.
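The conversion step can be illustrated with a CPU sketch (the real code runs as a CUDA kernel with one thread per particle; the function and parameter names here, such as cell_to_global, nx, dx and dy, are ours and not the thesis API):

```c
/* CPU sketch of the normalized-to-global coordinate conversion. For each
   particle, the linear cell index is split into a column and a row, and the
   in-cell coordinates (values in [0, 1[) are offset by the cell indexes and
   scaled by the cell size. On the GPU each iteration maps to one thread. */
static void cell_to_global(const float *px, const float *py, const int *cell,
                           int n, int nx, float dx, float dy,
                           float *gx, float *gy)
{
    for (int i = 0; i < n; i++) {
        int ci = cell[i] % nx;      /* cell column */
        int cj = cell[i] / nx;      /* cell row    */
        gx[i] = (ci + px[i]) * dx;
        gy[i] = (cj + py[i]) * dy;
    }
}
```

The resulting gx/gy arrays are exactly what would be bound as a VBO for rendering.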
FIGURE 4.2: Direct visualization of a 2D PIC simulation of the formation of a Weibel instability.
(For a color version, see Figure C.5 on page 107.)
4.2.2 GRID VISUALIZATION AND DIAGNOSTICS
Since PIC is a particle-mesh algorithm, visualization tools for PIC codes are also required to
handle grid data, and we have implemented several OpenGL routines for this type
of visualization. They take advantage of the interoperability of textures
between OpenGL and CUDA. The main idea is to use data in CUDA global memory
(the GPU's RAM) as an OpenGL texture to be rendered and displayed. Again, this allows
for fast rendering, since no transfer of data from the CPU to the GPU is necessary.
We have also implemented several grid-related diagnostics, which are particularly
well suited to GPU algorithms, such as vector field magnitudes or EM field energies.
Figure 4.3 shows several different diagnostic possibilities, using the 2D Weibel instability
of Section 3.5: electron and positron charge density, intensity of the perpendicular
component of the current (j3), and the transverse magnetic field energy B₁² + B₂².
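As a sketch of the energy diagnostic, the per-cell transverse magnetic energy B₁² + B₂² reduces to an elementwise operation (plain C here for clarity; the name transverse_b_energy is ours, and on the GPU each cell maps to one thread):

```c
/* Compute the transverse magnetic field energy B1^2 + B2^2 on a grid of n
   cells. Each output value depends only on the corresponding inputs, which
   is why this kind of diagnostic maps so well onto the GPU. */
static void transverse_b_energy(const float *B1, const float *B2,
                                float *energy, int n)
{
    for (int i = 0; i < n; i++)
        energy[i] = B1[i] * B1[i] + B2[i] * B2[i];
}
```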
4.2.3 INTERACTIVITY
Interactive control of direct visualization was also implemented, to simplify interaction
with the simulation code and to allow easy switching between the several available
diagnostics. The routines implementing interactivity must be unobtrusive, so that
simulation performance does not suffer too much, while still giving the user
sufficient control over the visualization and diagnostics. We chose OpenGL's GLUT
toolkit [89], which provides a lightweight, multi-platform API with minimal impact on
performance and simple integration with the CUDA GPU simulation code. Figure 4.4 shows a generic
overview of a simulation cycle, including parsing user events and rendering. To minimize
the impact on performance, it is also possible to update the visualization
not at every timestep but only at larger intervals, by doing several simulation cycles per
event parsed, or even to skip rendering and user interaction altogether if maximum
performance is required.
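The cycle of Figure 4.4 can be sketched as the loop below (a minimal sketch: the stub names are ours, and the real code uses GLUT callbacks rather than an explicit loop):

```c
/* Sketch of the simulation/interaction cycle: 'step' runs one simulation
   cycle, 'render_and_poll' renders and parses user events. Rendering and
   event parsing happen only once every n_per_event cycles, which is how the
   priority of the interaction layer is lowered. */
static void step_stub(void)            { /* one PIC cycle would go here  */ }
static void render_and_poll_stub(void) { /* OpenGL render + event parse  */ }

static int run(int total_cycles, int n_per_event,
               void (*step)(void), void (*render_and_poll)(void))
{
    int renders = 0;
    for (int i = 0; i < total_cycles; i++) {
        step();
        if ((i + 1) % n_per_event == 0) {
            render_and_poll();
            renders++;
        }
    }
    return renders;  /* number of render/event-poll passes performed */
}
```

Raising n_per_event trades interactivity for performance, exactly the compromise discussed above.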
We have added the possibility to interactively control both the visualization
parameters, such as the zoom level and the rotation of the whole scene, and some of the
simulation parameters, such as the number of particles per cell or the grid spatial
resolution. The latter features restart the simulation with the new settings. This is
especially useful during testing, as it allows one to quickly scan the effect of certain code
changes across various simulation scenarios. Figure 4.5 shows some of the interactive
features implemented.
We have also evaluated the penalty for adding user interaction during a simulation.
Table 4.1 shows the time it takes for our system to simulate different numbers
FIGURE 4.3: Examples of different diagnostics during the formation of a 2D Weibel instability:
(a) electron charge density, (b) positron charge density, (c) intensity of the perpendicular
component of the current (j3), and (d) transverse magnetic field energy B₁² + B₂².
[Figure 4.4 flowchart nodes — Interaction layer: Get user events; Change visualization parameters; Change simulation parameters; Restart simulation; Idle. Simulation layer: Start OpenGL context; Initialize simulation; Do N simulation cycles. Visualization layer: Get positions/diagnostics; Render.]
FIGURE 4.4: Overview of the integration between the simulation code (middle layer, in green),
and the interaction and visualization code (top and bottom layers, in red and blue respec-
tively). (For a color version, see Figure C.6 on page 108.)
FIGURE 4.5: Two examples of direct interaction during a simulation: (a) depicting the grid
used in the PIC simulation over the particles, (b) zooming in on the center of the image while the
simulation is running. (For a color version, see Figure C.7 on page 108.)
of cycles with and without user interaction, for the 1.2 kpart run referred to in Sections 3.5
and 3.6.
Cycles   No interaction (s)   Interaction (s)   Penalty
  25           3.764               3.928         4.36%
 250          18.533              19.309         4.19%
2500         105.463             112.099         6.29%
5000         195.112             208.265         6.74%

TABLE 4.1: Estimation of the user interaction penalty for the 1.2 kpart run (see Section 3.6).
The measured penalty, although significantly larger than that of rendering alone
(rendering is negligible, as shown in Table 3.2 on page 58), is still acceptable.
For simulations that run for a large number of cycles it might be desirable to
lower the priority of user interaction, that is, to run several cycles per user event parsed.
The same approach applies if the cycle time is made significantly shorter. This
possibility has been included in the code, and is represented as the third line from the bottom
of Figure 4.4.
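The penalty figures in Table 4.1 are simply the relative slowdown; for 25 cycles, (3.928 − 3.764)/3.764 ≈ 4.36%. A one-line helper (ours, for illustration) reproduces the table:

```c
/* Relative interaction penalty in percent, as reported in Table 4.1. */
static double penalty_pct(double t_plain, double t_interactive)
{
    return 100.0 * (t_interactive - t_plain) / t_plain;
}
```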
4.2.4 3D VISUALIZATION
Integrating visualization with 3D simulations is a simple extension of the work presented
in the previous sections. In fact, we started this work with a visualization of
the 3D version of the particle pusher of Sections 3.2 and 3.4.1. We used forced fields,
so that we could take better advantage of the trilinear hardware interpolation of the
texture units of the GPU. We have already referred to this approach in Section 3.4.1,
page 48. Figure 4.6 shows the result of the simulation. Electrons are initialized in the
presence of a static magnetic field that forces them into a circular motion around
the axis.
4.2.5 PREVIOUS WORK
Direct visualization of simulation results in other scientific areas has been explored
since the first programmable GPUs, for example in fluid and flow simulations [90, 91]
and in rigid-body and molecular dynamics simulations [92]. Visualization of plasma
simulations is also an important topic, with several applications available [93, 63, 94].
However, to our knowledge this is the first approach to direct visualization of plasma
FIGURE 4.6: Direct visualization of a 3D particle pusher on a static EM field. (For a color
version, see Figure C.8 on page 109.)
simulations, where the user interacts with the GPU-accelerated simulation while it is
occurring.
4.3 CONCLUSIONS
We have developed large-scale and small-scale tools for scientific visualization. At
the large scale, we built a proof-of-concept visualization cluster that allowed us to use
VisIt to achieve parallel rendering and visualization of large data sets. At the small
scale, we integrated a visualization and interaction layer on a plasma simulation code.
This had the important benefit of allowing direct visualization and interactive
exploration of the results. We also estimated the impact of this system
on the overall performance of the simulation code. To keep that impact from
becoming significant, we added the option to interactively lower the priority of the
interaction layer, so that the user can control the compromise between better
performance and more user interaction. This system is easily extended from 2D to 3D,
and also with more visualization tools beyond the ones already developed.
CHAPTER 5
OVERVIEW AND FUTURE WORK
The relative stability of commodity computing hardware since the mid-1990s has led to
the development of highly optimized numerical codes that allow for very large-scale
simulations of physical phenomena. When the power limit was hit in the first half of
this decade, other alternatives for increasing performance had to be found. These
alternatives, however, must be flexible enough to allow for the reuse of previous code
that had already achieved high reliability and performance.
One possibility is to further explore the coarse-grained parallelism available with the
Grid. To that end, we developed a lightweight library to manage transfers of
large sets of data in a way that is transparent to the user and very easy to integrate
with existing applications (see Chapter 2). It is also flexible enough to allow for the
overlap of computation and data transfer. We have shown the use of the library in a
real-case scenario, where it is integrated with the high-performance plasma simulation
code Osiris.
Further development of this approach is desirable. We are already working on
extending the library's interface to increase its flexibility, and on implementing
an easier file transfer mechanism that is based on existing tools and exposes a POSIX-compliant
interface. This will allow the deployment of non-Grid code without any
modifications. A web portal for easy access to output files is also being implemented.
Another possibility to overcome the power wall is to increase the number of local
computing cores. We have developed a plasma simulation code that achieves high
performance on systems with hundreds of cores and thousands of threads (see Chapter 3).
The memory hierarchy of such systems proved to be general enough (small,
fast local memory together with large, slow global memory) that the solutions
we found can easily be ported to other massively parallel systems. In particular,
we developed a general algorithm for atomic functions that allowed for an optimized
scatter operation that would otherwise be very costly.
Work in this area is continuing with the development of a 3D version of the code
and with the evaluation of its integration with existing CPU codes through abstract
function calls. This will allow for the acceleration of existing computationally intensive
parts, while keeping the more elaborate features intact.
Data is useless without insight. The more data we are able to generate, the more
challenging it is to make sense of it. Visualization is a powerful tool for insight, but it
must be coupled with direct and immediate feedback, so that exploration is quick,
effective and fun. We have explored both large-scale and small-scale visualization
applications (see Chapter 4). On one side, we developed a cluster that allowed for
parallel visualization of simulation data. On the other, we integrated a
visualization layer directly into a plasma simulation code. This allowed for direct and
interactive exploration of the simulation results as they were being produced, with a
minimal performance penalty.
A larger scale visualization infrastructure is now being built at GoLP, partially
based on the tools (VTK) we have been using. Also, direct visualization is already
being further developed, with 3D integration and more powerful tools, like interac-
tive color spaces, particle tracking and function diagnostics.
We still have not found a limit to human understanding. With this thesis and
the tools developed for it, we hope to have contributed to the exploration of Nature's
mysteries, adding a bit more to our capacity for grasping, for being surprised, and for
standing in awe.
APPENDIX A
CODE LISTINGS
In this appendix, we present the relevant code developed during this thesis. Section A.1
refers to Chapter 2, especially the DataLight library. Section A.2 refers to Chapter 3,
especially the CUDA-relevant parts. Finally, Section A.3 refers to Chapter 4,
both to the VisIt plugin and to the direct visualization implementation.
A.1 CODE FROM CHAPTER 2

In this section, we present the relevant C code used to implement the DataLight library.
An overview of the algorithm implemented in the library was already given
in Section 2.3.3, page 26. As illustrated in Figure A.1 (which is a reproduction of
Figure 2.7, placed here for convenience), the library consists mainly of a queue manager
thread, which checks for files to be transferred, and of several transfer threads. An
application uses the library by adding a file to a waiting queue, using the exported function
write_remote.
Listing A.1 shows the implementation of the exported function write_remote,
called at the top left of Figure A.1 at the line "Add to waiting queue". The function
takes as an argument the path of a file to be transferred to an SE. First, it checks
whether the queue manager thread is already active. If not, it initializes the two main
data structures, waiting and writing, which are FIFO lists that store the files to be
transferred and the ones being transferred, respectively. Then, it launches the queue
manager thread, which is mostly implemented in the function lists_manager. Finally,
the function adds the file to the waiting list. The lists waiting and writing are
represented in Figure A.1 as the "Waiting queue" and the "Transfer queue".
[Figure A.1 flowchart nodes — Main program (main thread): Compute results; Write locally; Add to waiting queue; Disk. Queue manager thread: Waiting queue; Waiting queue empty? (Y: Wait); Transfer slots available? (N: Wait); Move first element from waiting queue to transfer queue; Initiate transfer thread; End. Transfer thread: Transfer queue; Transfer first element from transfer queue; Remove transferred element from transfer queue.]
FIGURE A.1: A simplified flowchart of DataLight (left and bottom) and its connection with a
simulation code (top).
LISTING A.1: The main exported function for DataLight. It is called for each file that should
be transferred.
int write_remote(char *fname) {
    if (manager_thread_active == 0) {
        // Initialize the lists:
        writing = list_new();
        waiting = list_new();
        // Initialize thread attr
        pthread_attr_init(&attr);
        pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
        // Launch the manager thread:
        manager_thread_active = 1;
        pthread_create(&manager_thread, &attr, (void *) lists_manager,
                       (void *) NULL);
    }
    // Add the file to be written to the waiting list:
    list_add(waiting, fname);
    return 0;
}
Listing A.2 shows the implementation of the queue manager function, called
lists_manager. This is mainly a loop that, at each iteration, takes a file name from
the waiting list, places it in the writing list, and creates a transfer thread for that
file.
Listing A.3 shows the implementation of write_thread, the file transfer function
that is run by each transfer thread. The main action is the call to lcg_cr, a gLite data
transfer function (see Section 2.3.1 on page 22 and [1]). First, some string parameters
have to be set, such as the local and remote filenames and the remote path. Although this
code is mostly not shown due to space constraints, we show the use of make_lfc_dir,
which creates a directory path recursively on the remote SE. Then the actual transfer
takes place, and finally the transferred file is removed from the writing queue.
Finally, in Listing A.4 we show the other exported function, write_finished, that
finishes the transfer threads. It simply sets the variable manager_thread_active to
zero and waits for the manager thread to exit.
Not listed here is the implementation of the waiting and writing data structures.
They are instances of a standard FIFO data type with basic functions for creation
(list_new), addition (list_add), deletion (list_del_element) and moving (list_mv).
Examples of their usage are shown in the previous listings.
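A minimal version of such a FIFO type could look like the following sketch (field and helper names are hypothetical; the DataLight implementation also guards the lists with mutexes, which is omitted here):

```c
#include <stdlib.h>
#include <string.h>

/* Minimal singly linked FIFO of file names, sketching the interface used
   by DataLight (list_new, list_add, list_mv). Not the thesis code. */
typedef struct list_element {
    char *name;
    struct list_element *next;
} list_element_t;

typedef struct {
    list_element_t *first, *last;
} list_t;

static list_t *list_new(void)
{
    return calloc(1, sizeof(list_t));
}

static void list_add(list_t *l, const char *name)
{
    list_element_t *e = calloc(1, sizeof *e);
    e->name = malloc(strlen(name) + 1);
    strcpy(e->name, name);
    if (l->last) l->last->next = e; else l->first = e;
    l->last = e;
}

/* Move the head of 'from' to the tail of 'to', preserving FIFO order. */
static void list_mv(list_t *from, list_t *to)
{
    list_element_t *e = from->first;
    if (!e) return;
    from->first = e->next;
    if (!from->first) from->last = NULL;
    e->next = NULL;
    if (to->last) to->last->next = e; else to->first = e;
    to->last = e;
}
```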
LISTING A.2: The queue manager thread.

// Manager thread that dispatches the write threads:
int lists_manager() {
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    // Main loop:
    while ((manager_thread_active == 1) || (waiting->first != NULL)) {
        if ((waiting->first == NULL) || (n_threads >= MAX_THREADS)) {
            // If there is nothing to do or we are too busy, wait 1 ms:
            usleep(1000);
        }
        else {
            if (waiting->first != NULL) {
                pthread_mutex_lock(&n_threads_mutex);
                list_mv(waiting, writing);
                pthread_create(&(writing_thread[n_threads]), &attr,
                               (void *) write_thread, (void *) writing->last);
                n_threads++;
                pthread_mutex_unlock(&n_threads_mutex);
            }
        }
    }
    // If we get here, then we got the signal to stop.
    // If there is still something going on, wait:
    while (writing->first != NULL) {
        usleep(1000);
    }
    return 0;
}
LISTING A.3: The remote file transfer thread.

int write_thread(list_element_t *element) {
    char in_name[LCG_MAXPATHLEN];
    char out_name[LCG_MAXPATHLEN];
    char vo_name[LCG_MAXPATHLEN];
    char lfn_path[LCG_MAXPATHLEN];
    // Removed code setting up the above strings.
    // Create the lfc dirs:
    lfc_umask(0755);
    make_lfc_dir(lfn_path);
    // Add the file name to lfn_path:
    strcat(lfn_path, "/");
    strcat(lfn_path, basename(element->name));
    // Transfer:
    lcg_cr(in_name, out_name,
           NULL,        // GUID
           lfn_path, vo_name,
           NULL,        // relative path
           2,           // nb streams
           NULL,        // conf_file
           0,           // insecure
           1,           // verbose
           NULL         // actual_guid
    );
    // Remove us from the writing list:
    pthread_mutex_lock(&n_threads_mutex);
    list_del_element(writing, element);
    n_threads--;
    pthread_mutex_unlock(&n_threads_mutex);
    return 0;
}
LISTING A.4: The function that finishes the transfer threads.
int write_finished() {
    if (manager_thread_active == 1) {
        manager_thread_active = 0;
        pthread_join(manager_thread, NULL);
        return 0;
    }
    else {
        fprintf(stderr, "Trying to kill a non-existent manager thread.\n");
        return 1;
    }
}
A.2 CODE FROM CHAPTER 3
In this section, we present the relevant CUDA code developed in Chapter 3.
We start with the implementation of the generalized atomic function of Figure 3.6
on page 52, shown in Listing A.5. This is a particular case of the general function,
where the operation is the addition of floats.
This atomic function was used for the current deposition. Listings A.6 and A.7
show the implementation of this step using the Esirkepov method and handling several
particles per thread. Since the complete function does not fit on one page, we
decided to separate it into these two listings.
Listing A.6 shows the initial and final tests surrounding the current calculation. These
tests are necessary since we might be dealing with more than one particle per thread.
LISTING A.5: Applying the generalized atomic function from page 52 to the addition of floats.
__device__ float myAtomicAdd_d(float *address, float value) {
    volatile float new_val = value;
    while (new_val != 0.0f) {
        new_val = atomicExch(address, 0.0f);
        new_val += value;
        new_val = atomicExch(address, new_val);
        value = new_val;
    }
    return new_val;
}
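The control flow of this function can be checked on the CPU by replacing atomicExch with a plain exchange (a single-threaded sketch with names of our choosing; note that in this emulation the retry branch, which on the GPU absorbs concurrent deposits returned by the second exchange, never triggers):

```c
/* Plain-C stand-in for CUDA's atomicExch: store 'value', return old content. */
static float exch(float *address, float value)
{
    float old = *address;
    *address = value;
    return old;
}

/* Same control flow as myAtomicAdd_d: claim the slot by exchanging in 0,
   add our contribution, and write the sum back with a second exchange. Any
   value another thread deposited in between would come back out of that
   exchange and be retried as a new 'value'. */
static float atomic_add_emulated(float *address, float value)
{
    float new_val = value;
    while (new_val != 0.0f) {
        new_val = exch(address, 0.0f);
        new_val += value;
        new_val = exch(address, new_val);
        value = new_val;
    }
    return new_val;
}
```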
LISTING A.6: Entering and leaving the current deposition function.
__device__ void CurrentDeposit2D_d(...) {
    if (gc != prev_gc) {
        unsigned int from = 0;
        for (int iy = 0; iy < 4; iy++) {
            for (int ix = 0; ix < 4; ix++) {
                int to = prev_gc + (ix - 1) + (iy - 1) * (div.x + 1 + 2);
                myAtomicAdd_d(J1 + to, J1acc[from]);
                J1acc[from] = 0.0f;
                myAtomicAdd_d(J2 + to, J2acc[from]);
                J2acc[from] = 0.0f;
                myAtomicAdd_d(J3 + to, J3acc[from]);
                J3acc[from] = 0.0f;
                from++;
            }
        }
    }
    // Code removed from this listing: implementation of the Esirkepov
    // algorithm used to calculate the current Jtemp originated
    // by one particle, and its addition to the accumulated current J*acc
    // calculated by this thread on this cell.
    // After all the particles in this thread have been accumulated,
    // deposit J*acc to the global J*:
    if (partToGo == 1) {  // Last particle in thread, so deposit now:
        unsigned int from = 0;
        for (int iy = 0; iy < 4; iy++) {
            for (int ix = 0; ix < 4; ix++) {
                int to = gc + (ix - 1) + (iy - 1) * (div.x + 1 + 2);
                myAtomicAdd_d(J1 + to, J1acc[from]);
                myAtomicAdd_d(J2 + to, J2acc[from]);
                myAtomicAdd_d(J3 + to, J3acc[from]);
                from++;
            }
        }
    }
}
LISTING A.7: Implementation of the Esirkepov method.
float S0x[] = {0, 1 - p.x, p.x, 0};  float S0y[] = {0, 1 - p.y, p.y, 0};
float S1x[] = {0, 0, 0, 0};          float S1y[] = {0, 0, 0, 0};
float DSx[4], DSy[4], Wx[16], Wy[16], Wz[16];
float Jtemp[16];
// direction = {0, 0} if part. stayed in the same cell
// direction.x = 1 if moved right, -1 if moved left
// direction.y = 1 if moved up,   -1 if moved down
S1x[direction.x + 1] = 1 - newR.x;
S1x[direction.x + 2] = newR.x;
S1y[direction.y + 1] = 1 - newR.y;
S1y[direction.y + 2] = newR.y;
for (int i = 0; i < 4; i++) {
    DSx[i] = S1x[i] - S0x[i];  DSy[i] = S1y[i] - S0y[i];
}
for (int iy = 0; iy < 4; iy++) {
    for (int ix = 0; ix < 4; ix++) {
        Wx[iy*4 + ix] = DSx[ix] * (S0y[iy] + DSy[iy]/2);
        Wy[iy*4 + ix] = DSy[iy] * (S0x[ix] + DSx[ix]/2);
        Wz[iy*4 + ix] = S0x[ix] * S0y[iy] + DSx[ix] * S0y[iy]/2 +
                        S0x[ix] * DSy[iy]/2 + DSx[ix] * DSy[iy]/3;
    }
}
// x
float qv = p.q * dim_dt.x;
for (int iy = 0; iy < 4; iy++) {
    Jtemp[iy*4] = -qv * Wx[iy*4];
    for (int ix = 1; ix < 4; ix++) {
        Jtemp[ix + iy*4] = Jtemp[ix - 1 + iy*4] - qv * Wx[ix + iy*4];
    }
}
for (int i = 0; i < 16; i++)
    J1acc[i] += Jtemp[i];
// Code removed for y and J2acc
// z
qv = p.q * p.uz * rgamma;
for (int i = 0; i < 16; i++)
    Jtemp[i] = qv * Wz[i];
for (int i = 0; i < 16; i++)
    J3acc[i] += Jtemp[i];
As explained in Sections 3.4.2 and 3.6, we implemented a stride parameter and
a "particles per thread" parameter. This means that one thread might handle more
than one particle, and that those particles might not belong to the same cell. Hence
the initial code checks whether the particle currently being handled is the first one
of a cell. If it is, then we must deposit the current accumulated so far and reset the
accumulators to zero. The final test checks whether we are handling the last particle in the
thread. In that case, a similar action has to be taken: deposit the accumulated current.
In both cases we have to use the atomic function of Listing A.5.

Listing A.7 shows the actual implementation of the Esirkepov algorithm. This is a
straightforward implementation. Since this is a 2D code, the z component is handled
slightly differently from x and y.
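One property of the Esirkepov shape factors that the deposition relies on is charge conservation: the S0 and S1 arrays of Listing A.7 each sum to 1, so the differences DS sum to 0. A small check in plain C (the helper name ds_sum is ours), for a particle that stays in its cell:

```c
/* Build the 1D shape arrays used in Listing A.7 for a particle that moved
   from normalized position x to new_x inside the same cell, and return the
   sum of the differences DSx[i] = S1x[i] - S0x[i], which must vanish. */
static float ds_sum(float x, float new_x)
{
    float S0[4] = {0.0f, 1.0f - x, x, 0.0f};
    float S1[4] = {0.0f, 1.0f - new_x, new_x, 0.0f};
    float sum = 0.0f;
    for (int i = 0; i < 4; i++)
        sum += S1[i] - S0[i];
    return sum;
}
```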
Advancing the EM fields is mostly a direct implementation of the Yee technique
(see Section 3.2), but we use shared memory to optimize bandwidth. Listing A.8
shows the magnetic field update. CTPB is a constant that defines the number of threads
per block for this kernel.
The EM-field interpolation is shown in Listings A.9 and A.10. Because of the staggered
grid, offsets have to be taken into account. They are illustrated in Figure A.2,
which is a copy of Figure 3.2.
Listing A.9 shows the interpolation using textures. It is a matter of calculating the
normalized texture indexes and doing a texture fetch, taking the offsets of the staggered
grid into account.
For the non-texture interpolation, the code in Listing A.10 is a bit more involved. The
main issue is not the interpolation per se, but taking advantage of the fact that each
thread might deal with several particles. In that case, most of the particles might
belong to the same cell, and the EM-field values need to be read only once. So we start
by checking whether we are dealing with a new cell and, if we are, we read in the necessary
values. After that, we decide which values we need for the interpolation. Because
of the staggered grid, not all values are needed by all the particles in a cell. The values
needed depend on which quadrant the particle belongs to. See Figure A.2 for a depiction of
the quadrants in a cell.
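Once the neighbours and the weight w1 have been chosen, the core operation is a bilinear interpolation, as in the Ei1 expression at the end of Listing A.10. A sketch of that core (plain C, names ours; the staggered-grid neighbour selection is deliberately omitted):

```c
/* Bilinear interpolation of a field quantity at normalized in-cell position
   (x, y) in [0, 1], given its values at the four surrounding points:
   f00 = lower-left, f10 = lower-right, f01 = upper-left, f11 = upper-right. */
static float bilinear(float f00, float f10, float f01, float f11,
                      float x, float y)
{
    return (1.0f - x) * ((1.0f - y) * f00 + y * f01)
         +         x  * ((1.0f - y) * f10 + y * f11);
}
```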
Putting this all together, we show in Listing A.11 the main kernel for the particle
push. The first three lines calculate the global index of the thread, called idx. This
is standard CUDA practice. The next lines, starting with #ifdef STRIDE, adjust idx,
which is actually the particle index, to take the stride into account. The variable Tpt is
the total number of threads and Ppt is the total number of particles pushed per thread.
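One plausible strided mapping under these parameters has thread tid handle particles tid, tid + Tpt, tid + 2·Tpt, ..., so that consecutive threads touch consecutive particles. This is an illustration only; the exact indexing in the kernel depends on the STRIDE settings of Sections 3.4.2 and 3.6. A small check (ours) that such a mapping covers each of the Tpt·Ppt particles exactly once:

```c
/* Verify that the strided thread-to-particle mapping tid + k*Tpt,
   k = 0..Ppt-1, covers every particle exactly once. The fixed-size count
   array limits the check to Tpt*Ppt <= 64. */
static int covers_all(int Tpt, int Ppt)
{
    int n = Tpt * Ppt;
    int count[64] = {0};
    for (int tid = 0; tid < Tpt; tid++)
        for (int k = 0; k < Ppt; k++)
            count[tid + k * Tpt]++;
    for (int i = 0; i < n; i++)
        if (count[i] != 1)
            return 0;
    return 1;
}
```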
LISTING A.8: Magnetic field update using shared memory.
__shared__ float E1sh[CTPB];
__shared__ float E2sh[CTPB+1];
__shared__ float E3sh[CTPB+1];
__shared__ float E1Ush[CTPB];
__shared__ float E3Ush[CTPB];
__shared__ float B1sh[CTPB];
__shared__ float B2sh[CTPB];
__shared__ float B3sh[CTPB];
// Read in the values:
E1sh[thread_id] = E1[idx];
E2sh[thread_id] = E2[idx];
E3sh[thread_id] = E3[idx];
if (thread_id == block_size - 1) {
    E2sh[thread_id + 1] = E2[idx + 1];
    E3sh[thread_id + 1] = E3[idx + 1];
}
E1Ush[thread_id] = E1[idx + (div.x + 1 + 2)];
E3Ush[thread_id] = E3[idx + (div.x + 1 + 2)];
B1sh[thread_id] = B1[idx];
B2sh[thread_id] = B2[idx];
B3sh[thread_id] = B3[idx];
__syncthreads();
// Update half B with E
B1sh[thread_id] += -dt2dr.y * (E3Ush[thread_id] - E3sh[thread_id]);
B2sh[thread_id] +=  dt2dr.x * (E3sh[thread_id + 1] - E3sh[thread_id]);
B3sh[thread_id] += -dt2dr.x * (E2sh[thread_id + 1] - E2sh[thread_id])
                 +  dt2dr.y * (E1Ush[thread_id] - E1sh[thread_id]);
// Write out the values:
B1[idx] = B1sh[thread_id];
B2[idx] = B2sh[thread_id];
B3[idx] = B3sh[thread_id];
FIGURE A.2: The staggered 2D grid used for EM-field interpolation.
LISTING A.9: EM-field interpolation at a particle's position using textures.

// Interpolation with textures:
// gi and gj are the cell indexes, including guard cells.
float s, t;  // Normalized texture indexes.
s = ((gi + p.x) * dim.x - pos_min.x) / (pos_max.x - pos_min.x);
t = ((gj + p.y) * dim.y - pos_min.y) / (pos_max.y - pos_min.y);
float offx, offy;  // Staggered grid offsets
offx = 0.5f / (div.x + 1 + 2);
offy = 0.5f / (div.y + 1 + 2);
Ei1 = tex2D(m_texE1, s - offx, t);
Ei2 = tex2D(m_texE2, s, t - offy);
Ei3 = tex2D(m_texE3, s, t);
Bi1 = tex2D(m_texB1, s, t - offy);
Bi2 = tex2D(m_texB2, s - offx, t);
Bi3 = tex2D(m_texB3, s - offx, t - offy);
LISTING A.10: EM-field interpolation at a particle’s position without textures.
if (g != prev_g) { // new cell, so read new E & B values
    // E1
    E1center = E1[g];
    E1left = E1[g-1];
    E1right = E1[g+1];
    E1up = E1[g+v];
    E1leftup = E1[g-1+v];
    E1rightup = E1[g+1+v];
    // Similar code for B1, E2, B2, E3, B3
    ...
}
float *E1hor, *E1diag; float *B2hor, *B2diag;
float *E2vert, *E2diag; float *B1vert, *B1diag;
float *B3hor, *B3vert, *B3diag;
if (p.x <= 0.5f) { // left values
    w1 = p.x + 0.5f;
    E1hor = &E1right;
    E1diag = &E1rightup;
    B2hor = &B2right;
    B2diag = &B2rightup;
    B3hor = &B3right;
}
else { // right values
    w1 = 1.5f - p.x;
    E1hor = &E1left;
    E1diag = &E1leftup;
    B2hor = &B2left;
    B2diag = &B2leftup;
    B3hor = &B3left;
}
// Similar procedure for y
if (p.y <= 0.5f) { // bottom values
    ... } else { ...
}
// Interpolate:
Ei1 = w1 * ((1 - p.y) * E1center + p.y * E1up)
    + (1 - w1) * ((1 - p.y) * (*E1hor) + p.y * (*E1diag));
// Similar procedure for Bi2, Ei2, Bi1, Ei3, Bi3
LISTING A.11: The kernel for a particle push.

unsigned int thread_id = threadIdx.x + threadIdx.y * blockDim.x;
unsigned int block_id = blockIdx.x + blockIdx.y * gridDim.x;
unsigned int idx = block_id * blockDim.x * blockDim.y
                 + thread_id + offset;
#ifdef STRIDE
unsigned int stride_idx = idx * stride;
unsigned int stride_offset = stride_idx / Tpt;
stride_idx = stride_idx % Tpt + stride_offset;
#else
unsigned int stride_idx = idx;
#endif // STRIDE
idx = stride_idx * Ppt;
// Some initializations removed.
// ...
unsigned int prev_gc = 0xffffffff;
unsigned int partToGo = Ppt;
while (partToGo) {
    // Removed code: read values from global to local memory.
    // ...
    // Calculate:
    InterpolateFields2D_d(...);
    CalculateVelocity2D_d(...);
    UpdatePositionAndCurrent2D_d(...);
    // Removed code: write values from local to global memory.
    // ...
    // Prepare for the next particle:
    partToGo--;
    prev_gc = gc;
} // while (partToGo)
Also interesting is the code to initialize the CUDA device. Usually, the CUDA
drivers are initialized when the first CUDA function is called. However, in systems
with multiple devices it is important that the user be able to select which device
he or she wants to use. We have implemented a function, shown in Listing A.12, that, when
called without parameters, simply selects the first available CUDA device. However,
it also accepts as a parameter the device to be used throughout the simulation.
LISTING A.12: Initialization of a CUDA device.

void init_cuda(int device) {
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount < 1) {
        printf("CUDA:\tNo devices found.\n");
        return;
    }
    for (int i = 0; i < deviceCount; i++) {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, i);
        printf("CUDA:\tDevice %d: %s.\n", i, deviceProp.name);
        float deviceCapability = deviceProp.major + 0.1*deviceProp.minor;
        if ((device < 0) && (deviceCapability >= 1.0)) {
            device = i; // Select the first valid device.
            // device is now >= 0. We don't break the loop so that
            // we list all the available devices.
        }
    }
    if ((device < 0) || (device >= deviceCount))
        printf("CUDA:\tNo valid device selected.\n");
    else {
        printf("CUDA:\tMaking device %d active.\n", device);
        cudaSetDevice(device);
    }
}
A.3 CODE FROM CHAPTER 4
In this section we present the relevant parts of the visualization code from Chapter 4.
We start with code from the VisIt plugin, explained in Section 4.1, and then move
to the direct visualization code from Section 4.2.
A.3.1 VISIT PLUGIN
Implementing a data plugin for VisIt is mostly straightforward, so we will not list
most of the code. We did add some code to automatically detect whether the data is vectorial
or scalar. We used the fact that the output files from Osiris and dHybrid create a well-known
directory tree. When the user points VisIt to one file to be opened, the plugin
automatically scans the directory structure looking for other possible components. If
other component files are found, it is assumed that we are dealing with a vector variable.
In that case, not only is each component presented as a scalar variable to the user,
but the full vector variable is also made available. Listing A.13 shows the vector data
check for the magnetic field files.
Another interesting feature programmed into the plugin is the spatial domain
decomposition. Usually, the output from Osiris, dHybrid or Quickpic has only one
spatial domain. However, VisIt allows several domains to be defined. In this case,
if the domains do not all fit into memory for visualization, they are processed by VisIt
either in a serial fashion or, if running in a parallel environment, in parallel
across the available processors. The plugin only has to detect the available
processors (we did that using the standard MPI functions MPI_Comm_size and MPI_Comm_rank)
and to perform the domain decomposition.
Listing A.14 creates the domains for the serial case. First, the current domain is
split in half across the largest axis. This procedure is repeated recursively until all
axes are smaller than maxCellsPerDomain. We are thus halving each domain size and
duplicating the number of domains at each pass.
Domain decomposition in the parallel case adds an interesting constraint: we do not
want to create more domains than the number of available processors. Hence the general
algorithm has some more steps and special cases:
1. Initialize n_cells, n_domains, and biggest_axis, as in the serial case.

2. If the total number of domains is less than the available number of processors,
add one domain to the biggest axis and start over (go to 1).

3. If we have more domains than the number of processors, return 0.

4. If we are returning from a recursive call, check the returned value. If it is lower than
1, then we divided too much (e.g., we returned from step 3). Remove one
domain from the biggest axis and return the total number of domains calculated
in step 1.

5. If the returned value is 1 or greater, then we have finished calculating the number of domains. Return
that value.
LISTING A.13: The function to check if the file contains vectorial magnetic field data from an
Osiris simulation.

bool avtGoLPFileFormat::CheckIsOsirisB() {
    string file_name = mData.baseName;
    string dir_name = mData.baseDir;
    // If it is vectorial, the path has a vector structure:
    // <.../[B1|B2|B3]/[B1|B2|B3]filename>
    if ((dir_name.substr(dir_name.size()-3, 3) == "/B1") ||
        (dir_name.substr(dir_name.size()-3, 3) == "/B2") ||
        (dir_name.substr(dir_name.size()-3, 3) == "/B3")) {
        // The path has a vector structure.
        // Look for the other possible files.
        dir_name.erase(dir_name.size()-2, 2);
        file_name.erase(0, 2);
        int32 status; string s;
        s = dir_name + "B1/B1" + file_name;
        status = SDstart(s.c_str(), DFACC_READ);
        if (status != FAIL) { // B1 component found
            SDend(status);
            mData.fileNames[mData.nDims] = s;
            mData.dimExists[mData.nDims] = true;
            mData.nDims++;
        }
        s = dir_name + "B2/B2" + file_name;
        status = SDstart(s.c_str(), DFACC_READ);
        if (status != FAIL) {
            SDend(status);
            mData.fileNames[mData.nDims] = s;
            mData.dimExists[mData.nDims] = true;
            mData.nDims++;
        }
        // ... Similar code removed to detect the B3 component
        if (mData.nDims > 0) {
            mData.baseDir = dir_name; mData.kind = OsirisB;
            return true;
        }
    }
    return false;
}
LISTING A.14: Domain decomposition for a serial visualization system.
int avtGoLPFileFormat::CreateDomains() {
    // This will point to the axis with the biggest domain size
    // (biggest mAxis[].size / mAxis[].nDomains).
    // For now, it is simply initialized.
    AxisInfo *biggest_axis = &mAxis[0];
    // Stores the total number of cells:
    int n_cells = 1;
    // Stores the total number of domains:
    int n_domains = 1;
    // Initialize n_cells, n_domains and biggest_axis:
    for (int i = 0; i < mNumAxis; i++) {
        n_cells *= mAxis[i].size;
        n_domains *= mAxis[i].nDomains;
        // Actualize biggest_axis:
        if (mAxis[i].size/mAxis[i].nDomains >
            biggest_axis->size/biggest_axis->nDomains)
            biggest_axis = &mAxis[i];
    }
    // If we still have too many cells per domain,
    if (n_cells/n_domains > maxCellsPerDomain) {
        // split the biggest domain and try again.
        biggest_axis->nDomains++;
        n_domains = CreateDomains(); // recursive call
    }
    return n_domains;
}
Listing A.15 shows the implementation of the previous algorithm.
LISTING A.15: Domain decomposition for a parallel visualization system. We do not decompose
into more domains than the number of available processors.

int avtGoLPFileFormat::CreateNProcDomains() {
    // ... Code removed: the same initialization as in the serial case.
    // If we still have free processors
    if (n_domains < mNProcs) {
        // split the biggest domain and try again.
        biggest_axis->nDomains++;
        int temp_domains = CreateNProcDomains();
        // If we get here, we finished calculating.
        // Either we split one time too many and we have to clean up,
        // or we already cleaned up and just have to return.
        if (temp_domains < 1) {
            // We split too much. Go back one step and finish.
            biggest_axis->nDomains--;
            return n_domains;
        } else
            return temp_domains;
    } else { // No more free processors. We return.
        if (n_domains > mNProcs) {
            // Finished splitting.
            // Go back one step if we overdid it.
            return 0;
        } else {
            // If n_domains == mNProcs, we're fine.
            return n_domains;
        }
        // We should never get here.
    }
    // We should never get here.
}
A.3.2 DIRECT VISUALIZATION
The direct visualization was implemented in OpenGL, using the GLUT toolkit.
The application begins by initializing the simulation parameters and setting the
usual GLUT callbacks. It then enters the GLUT main loop, where it checks for events
from the user and the operating system, and dispatches them to the callback functions.

The most relevant part of the code is the display function, which is called each time
the display needs refreshing. It is there that we push the simulation a certain number
of cycles and get the display data (particles’ positions, EM-field diagnostics, etc.) to the
screen. We set up the idle function callback (with glutIdleFunc()) so that it is also a
redisplay (glutPostRedisplay()). This ensures that the simulation also runs when
no events are detected.
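The effect of this idle-as-redisplay registration can be sketched outside GLUT with a toy event loop. Everything below (MiniLoop and its members) is our own illustrative naming, not GLUT or thesis code:

```cpp
#include <functional>
#include <queue>

// A miniature stand-in for the GLUT main loop (hypothetical names): each
// pass dispatches one pending event if there is one, otherwise calls the
// idle callback; a posted redisplay then triggers the display callback.
struct MiniLoop {
    std::queue<std::function<void()>> events; // pending user/OS events
    std::function<void()> display;            // redraw + simulation push
    std::function<void()> idle;               // runs when no events pend

    bool redisplay_posted = false;

    void postRedisplay() { redisplay_posted = true; }

    // One pass of the main loop.
    void iterate() {
        if (!events.empty()) {
            events.front()();
            events.pop();
        } else if (idle) {
            idle();
        }
        if (redisplay_posted) {
            redisplay_posted = false;
            if (display) display();
        }
    }
};
```

Registering idle as a plain postRedisplay() makes the display callback, and therefore the simulation push inside it, fire on every pass even with an empty event queue, which is the behavior obtained with glutIdleFunc()/glutPostRedisplay().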
Listing A.16 shows our implementation of the display function. The first lines run
a simulation cycle (pusher->Push()) n_cycles times. Then we set up the
OpenGL state and finally draw either particles or diagnostics.

The function drawParticles, called from the display function, is shown in
Listing A.17. Particles are drawn as points. We also have to take into account whether we are
LISTING A.16: GLUT display function for direct visualization.

void display(void) {
    if (!pause) {
        for (int i = 0; i < n_cycles; i++) {
            pusher->Push();
            total_cycles++;
        }
    }
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    gluPerspective(60.0, winW/winH, 0.01, back_plane);
    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();
    // View transf.
    gluLookAt(pos_from.x, pos_from.y, pos_from.z,
              pos_to.x, pos_to.y, pos_to.z,
              0.0, 1.0, 0.0);
    // Model transf.
    GLfloat m[4][4];
    build_rotmatrix(m, m_quat);
    glMultMatrixf(&m[0][0]);
    // Axis
    glCallList(axis_list);
    int num_particles = pusher->GetNP();
    if (diag_mode == none)
        drawParticles(num_particles);
    else
        drawDiagnostic();
    glutSwapBuffers();
}
using VBOs or a vertex array. Finally, if we keep the species ordered inside the array,
we can draw them with different colors without a significant time penalty.
The other relevant function called from the display function is drawDiagnostic,
shown in Listing A.18. It calls ReplaceDiagnostic, a function that gets
the correct diagnostic to display from the simulation, turns it into a normalized
texture, and makes this texture active. After that, the staggered grid has to be taken into
account (hence the variable cell_offset) and the texture can be displayed.
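The cell_offset correction follows the same half-cell logic as the texture offsets in Listing A.9. A minimal sketch of that normalization, with invented function names (not the thesis code):

```cpp
#include <cmath>

// Hypothetical helpers mirroring the normalization used in Listing A.9:
// map a cell index plus an in-cell position to a [0,1] texture coordinate,
// then shift by half a cell for quantities living on the staggered grid.

// cell_size is the physical size of one cell; pos_min/pos_max bound the axis.
float normalizedCoord(int cell, float in_cell, float cell_size,
                      float pos_min, float pos_max) {
    return ((cell + in_cell) * cell_size - pos_min) / (pos_max - pos_min);
}

// n_cells includes the guard cells (div + 1 + 2 in Listing A.9).
float staggeredOffset(int n_cells) {
    return 0.5f / n_cells; // half a cell in normalized texture units
}
```

For a 1D axis from 0 to 10 split into 10 unit cells, the middle of cell 5 maps to 0.55, and the staggered sample sits half a cell (0.05 in normalized units) to the left of it.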
LISTING A.17: Drawing the simulation particles.

void drawParticles(int num_particles) {
    glPointSize(point_size);
    glEnableClientState(GL_VERTEX_ARRAY);
    GLuint vbo;
    switch (mode) {
    case VBO:
        vbo = pusher->GetVBO();
        if (vbo) {
            glBindBuffer(GL_ARRAY_BUFFER, vbo);
            glVertexPointer(2, GL_FLOAT, 0, NULL);
        }
        break;
    default:
        particles = pusher->GetPositions();
        glVertexPointer(2, GL_FLOAT, 0, (float *)particles);
    }
    // Color per species:
    glColor3f(0.0, 1.0, 1.0); // cyan = species 1
    glDrawArrays(GL_POINTS, 0, num_particles/2);
    glColor3f(1.0, 1.0, 0.0); // yellow = species 2
    glDrawArrays(GL_POINTS, num_particles/2, num_particles/2);
    switch (mode) {
    case VBO:
        glBindBuffer(GL_ARRAY_BUFFER, 0);
        break;
    default: ; // Nothing to do
    }
    glDisableClientState(GL_VERTEX_ARRAY);
}
LISTING A.18: Drawing the diagnostics as textures.

void drawDiagnostic(void) {
    glEnable(GL_TEXTURE_2D);
    glTexEnvf(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_REPLACE);
    ReplaceDiagnostic(diag_mode);
    glBegin(GL_QUADS);
    {
        glTexCoord2f(0.0, 0.0);
        glVertex3f(box_min.x - cell_offset.x,
                   box_min.y - cell_offset.y,
                   0.0);
        glTexCoord2f(1.0, 0.0);
        glVertex3f(box_max.x - cell_offset.x,
                   box_min.y - cell_offset.y,
                   0.0);
        glTexCoord2f(1.0, 1.0);
        glVertex3f(box_max.x - cell_offset.x,
                   box_max.y - cell_offset.y,
                   0.0);
        glTexCoord2f(0.0, 1.0);
        glVertex3f(box_min.x - cell_offset.x,
                   box_max.y - cell_offset.y,
                   0.0);
    }
    glEnd();
}
APPENDIX B
A GPGPU IMPLEMENTATION
OF THE BORIS PUSHER ALGORITHM
In this appendix we describe our first approach to implementing a PIC code on GPUs,
referred to at the beginning of Chapter 3. This occurred in 2006, before the advent of
CUDA. At the time, several approaches existed to use the computing power of
programmable GPUs for general algorithms. We used the OpenGL Shading Language
(GLSL), version 2.0 [27], and were able to implement the full Boris pusher in about 3
months, after a one-semester course on Computer Graphics. A short report on that
effort follows.
B.1 MOTIVATION
The particle push is one of the most time-consuming steps in PIC codes. Since it uses
vectorial quantities and often needs to perform vectorial operations, the Boris pusher is a
very good candidate for strong code acceleration by adapting the algorithm to use vector
and stream instructions (SIMD). Although most processors have vector instructions
available, like AltiVec on the PowerPC and MMX/SSE/SSE2/SSE3 on Intel/AMD,
they are usually working at the top of their capabilities during a numerical
simulation. On the other hand, Graphics Processing Units (GPUs) in 2006 already had
SIMD capabilities and had been shown to provide good performance not only on streaming
applications, by performing the same operation over large collections of data, but
also on applications that have sufficient parallelism and computational intensity to hide
memory latency [20]. GPUs of the time, like the ATI X1900 series [95] and the NVIDIA
GeForce 7800 series [96], featured both programmable vertex and fragment processors
and provided support for floating-point operations, making them available targets for
streaming computation.
These GPUs had shown their usefulness for many scientific computations [97, 98].
Work at the time had shown the feasibility of using the stream capabilities of
programmable GPUs for particle tracing [90], mainly due to the following reasons:

— Programmable shaders offered a programmable environment with highly accelerated
vectorial operations.

— It was possible to allocate GPU memory that could be used alternately as a texture map or a
vertex array. This gave us the ability to run the whole algorithm with
minimal data transfer between CPU and GPU memory.

— GPU support for float precision was improving. GPU processors in 2006 could
already use full 32-bit floats in the entire pipeline, and 16-bit floats (OpenEXR format)
for textures.
Taking advantage of these characteristics, and of the fact that GPUs are usually idle
during numeric computations, we decided in 2006 to program a simplified PIC
code on the GPU's fragment processor, as well as to take some preliminary steps towards a
distributed implementation over a cluster of GPUs.
B.2 THE BORIS PUSHER IN A SIMPLIFIED PIC CODE
A description of the computational steps of the Boris pusher was already given in
Section 3.4.1. However, to use the particle pusher in a relevant scenario, we also had
to implement the interpolation of the EM field at each particle position. The fields
did not have to be updated with the particles' current and charge, thus simulating a
forced-field scenario. This simplified PIC code, where the self-consistent evolution of
the fields is neglected, can be summarized as follows:
1. Define the initial conditions: t = t0, r_t0, v_(t0 − Δt/2).

2. Calculate B_t and E_t.

3. For each particle:

(a) Interpolate B_t and E_t at r_t.

(b) Push the particle, obtaining the new position, with the Boris pusher.

4. Advance the time by Δt and go to 2.
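The push in step 3(b) can be sketched as a plain, non-relativistic Boris step. This is our own minimal C++ version for illustration, with assumed names and signatures; the thesis implementation is the GLSL shader described below:

```cpp
#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;

Vec3 cross(const Vec3 &a, const Vec3 &b) {
    return {a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0]};
}

// One non-relativistic Boris step (illustrative sketch): half an electric
// kick, a rotation around B, the second half of the electric kick, and
// finally the position push with the new velocity.
void borisStep(Vec3 &r, Vec3 &v, const Vec3 &E, const Vec3 &B,
               double q, double m, double dt) {
    const double h = q * dt / (2.0 * m);
    Vec3 vm, t, s, vprime, vp;
    for (int i = 0; i < 3; i++) vm[i] = v[i] + h * E[i];  // half E kick
    for (int i = 0; i < 3; i++) t[i] = h * B[i];
    const double t2 = t[0]*t[0] + t[1]*t[1] + t[2]*t[2];
    for (int i = 0; i < 3; i++) s[i] = 2.0 * t[i] / (1.0 + t2);
    const Vec3 c1 = cross(vm, t);
    for (int i = 0; i < 3; i++) vprime[i] = vm[i] + c1[i];
    const Vec3 c2 = cross(vprime, s);
    for (int i = 0; i < 3; i++) vp[i] = vm[i] + c2[i];    // B rotation
    for (int i = 0; i < 3; i++) v[i] = vp[i] + h * E[i];  // half E kick
    for (int i = 0; i < 3; i++) r[i] += dt * v[i];        // position push
}
```

With E = 0 the step is a pure rotation, so the speed of a particle gyrating in a uniform B field is conserved, a quick sanity check for any Boris implementation.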
B.3 GPGPU IMPLEMENTATION
The implementation starts by allocating and initializing the textures that will hold the
values of the vectorial quantities. The highest precision available for textures is 16-bit
floats in the OpenEXR format [99].
For N particles, r and v are stored in 1D RGB textures of length N (or of the next
power of two, since non-power-of-two textures can still be penalized in performance).
Two textures are needed for each of these quantities: one stores r_t and v_(t − Δt/2),
and the other r_(t + Δt) and v_(t + Δt/2). We call these textures T_r0, T_v0, T_r1 and T_v1.
Two 3D RGB textures, T_E and T_B, are also needed, one for E and the other for B. They have the
same size as the number of cells of the simulation space (again, some power-of-two
adjustment might be needed). Figure B.1 illustrates how these textures are used in the
algorithm.
FIGURE B.1: The use of textures in the GPGPU implementation of the Boris pusher.
8/3/2019 PAbreu Thesis
http://slidepdf.com/reader/full/pabreu-thesis 122/140
102 Appendix B. A GPGPU Implementation of the Boris Pusher Algorithm
Scalar quantities, like Δt, m and q, are passed as uniform shader parameters.

For each particle, E and B are interpolated at the particle position, after it is read
from T_r0. The new positions are calculated using the Boris pusher: v_(t + Δt/2) is calculated
and rendered to T_v1, and r_(t + Δt) is calculated and rendered to the T_r1 texture. At this
step, T_r1 can also be rendered to the screen.

The time is advanced and a new cycle begins. T_E and T_B are updated, and T_r1 and
T_v1 are used to calculate the new v and r, which are rendered to T_v0 and T_r0.
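This ping-pong use of the texture pairs can be mimicked on the CPU with two buffers whose read/write roles swap each step. The sketch below uses our own names and plain arrays in place of textures:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical CPU stand-in for the texture ping-pong described above:
// one buffer plays T_r0/T_v0 (read), the other T_r1/T_v1 (write), and the
// roles swap every step, so no buffer is read and written in the same pass.
struct PingPong {
    std::vector<float> buf_a, buf_b;
    std::vector<float> *read = &buf_a, *write = &buf_b;

    explicit PingPong(std::size_t n) : buf_a(n, 0.0f), buf_b(n, 0.0f) {}

    void step(float dv) {
        for (std::size_t i = 0; i < read->size(); i++)
            (*write)[i] = (*read)[i] + dv; // "render" into the write target
        std::swap(read, write);            // next pass reads what we wrote
    }
};
```

After three steps of adding dv = 1, the read buffer holds the three accumulated updates, exactly as T_r0/T_v0 hold the latest state at the start of each cycle.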
Boundary conditions have to be implemented. In our simulation, we have used a
periodic boundary, so that the particles are re-injected when they leave the simulation
box.
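The periodic re-injection can be sketched as a simple wrap of each coordinate back into the box (an illustrative helper with an invented name, not the shader code):

```cpp
#include <cmath>

// Wrap a coordinate leaving the box [0, L) back into it, handling both
// sides: a particle exiting at x = L + d re-enters at d, and one exiting
// at x = -d re-enters at L - d.
double wrapPeriodic(double x, double L) {
    double w = std::fmod(x, L);
    return (w < 0.0) ? w + L : w;
}
```

In the GLSL version the same effect can be obtained per component; the helper above just makes the boundary rule explicit.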
B.4 PARALLELIZATION ISSUES
With this implementation, we were able to have a real-time display of a simple PIC
algorithm with a Boris pusher. However, the memory size of the GPUs of the time
was an important limitation. A small simulation, with 128³ cells and 8 particles per
cell, requires 408 MB just for texture memory, which might not be available even on
video cards with 512 MB of RAM.
One way to overcome this limitation was to distribute the simulation over several
GPUs. Similar approaches had already been taken with other algorithms [90, 100].
With our simplified PIC code, the particles do not influence the fields. The Boris
pusher could therefore be parallelized with very little penalty by partitioning only the particles
over the nodes, while still using the full fields on each (see Figure B.2). In our previous example,
memory requirements were reduced from 408 MB to 24 + 384/P MB, where P is the number
of available GPU processors. With just 2 GPUs, this results in 216 MB of texture
memory, which fits well in processors with 512 MB of video memory.
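These figures can be reproduced with a little arithmetic: each RGB texel of 16-bit floats takes 6 bytes, the four particle textures (r and v, double-buffered) are split over the P GPUs, and the two field textures are replicated on every GPU. The helper below uses our own names:

```cpp
// RGB at 16 bits (2 bytes) per channel, as in Section B.3.
constexpr long long kTexelBytes = 6;

// Two field textures (E and B), one texel per cell, replicated per GPU.
long long fieldMB(long long n_cells) {
    return 2 * n_cells * kTexelBytes / (1024 * 1024);
}

// Four particle textures (T_r0, T_v0, T_r1, T_v1), one texel per particle.
long long particleMB(long long n_particles) {
    return 4 * n_particles * kTexelBytes / (1024 * 1024);
}

// Texture memory needed per GPU when particles are split over P GPUs.
long long perGpuMB(long long n_cells, long long ppc, int P) {
    return fieldMB(n_cells) + particleMB(n_cells * ppc) / P;
}
```

For 128³ cells and 8 particles per cell this gives 24 + 384 = 408 MB on a single GPU and 24 + 192 = 216 MB on each of two GPUs, matching the figures in the text.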
More interesting was the possibility of running simulations of a reasonable size, like
256³ cells and 8 particles per cell. This is an 8× increase in texture memory requirements.
On a 16-node cluster, 288 MB are needed per node. As long as communication between the nodes was kept to a minimum, this was an acceptable solution.
B.5 CLUSTER IMPLEMENTATION
The techniques described in the previous sections were implemented on the commodity
hardware that was used as part of the Grid node for plasma physics simulation that
we were maintaining at the time. Each working node (WN) was configured with an
FIGURE B.2: The use of forced-fields allows for easy domain decomposition: each GPU gets
the whole spatial domain (represented by the grid), but the particles are distributed over the
GPUs (circles of different tones). This leads to no particle exchange occurring between the
GPUs and less node communication. (For a color version, see Figure C.9 on page 109.)
AMD Athlon 64 3200+ CPU with 1 GB of RAM and an NVIDIA 6600GT PCIe video card
with 128 MB of RAM. A user interface node (UI) was also used, which was responsible
for initializing the textures with the data and for launching the GPU processes on
the working nodes. The LAM MPI implementation was used as the communication
interface between the nodes [5].
The application running on the UI detected the available WNs and distributed the
particles. After receiving the particles, each WN ran the simulation for a certain
simulation time or number of steps (set by the UI). During this time, no node communication
was needed. The final result (particle positions and velocities) was sent from the
WNs to the UI, where it was stored and displayed.
B.6 PERFORMANCE AND LIMITATIONS
Simulations run on a system with one UI and two WNs showed good performance,
comparable to the simulation on the CPU (see Table B.1). The overhead caused by the
transfer of textures between CPU and GPU memory was compensated by the SIMD
characteristics of the GPU.
The biggest limitation of this GPGPU implementation of the Boris pusher was the
lack of support for double-precision floats on the GPU. To study some very detailed
plasma behavior, half or limited single precision is not enough. However, even in
these cases a GPU Boris pusher with limited float precision is still helpful as a way
to see a real-time evolution of the simulation, which helps to quickly grasp the general
behavior of the particles.
GPUs  PPC  FPS
1     8    3–5 (comparable to a CPU)
2     8    6–9
2     16   3–5

TABLE B.1: Performance of the GPGPU implementation of the Boris pusher. The GPUs were
NVIDIA GeForce 6600GT and the CPU an AMD Athlon 64 3200+. Internode communication
used LAM-MPI over a gigabit ethernet. Simulation parameters included a viewport of 512 ×
512 pixels and 64³ cells.

The limited video memory available to most GPUs of the time (usually less than
1 GB) was also a constraint on the application of this algorithm to big-scale simulations.
Parallelization is a good option for a simplified PIC code, since very little node
communication is required.
Finally, mapping generic PIC algorithms to computer graphics (CG) concepts was not an easy task. The
lack of a significant performance gain made the enterprise not worth pursuing, at least
until CUDA was introduced one year later.
APPENDIX C
COLOR FIGURES
FIGURE C.1: CUDA memory architecture, both from the hardware and software level. (Reproduction of Figure 3.3 from page 43.)
FIGURE C.2: Illustration of a stride during particle push and current deposition, so that kernels
in the same warp handle particles in different cells. p is the number of particles per thread and
s is the stride. (Reproduction of Figure 3.7 from page 53.)
FIGURE C.3: Final state of a full 2D PIC simulation of a Weibel instability (positrons in cyan
and electrons in yellow). (Reproduction of Figure 3.10 from page 56.)
[Pie chart of the time spent per step: R/W, Sort, Interpolation, Vel.+Pos., J deposition, with slices of 67%, 13%, 17%, 2%, and 1%.]
FIGURE C.4: Percentage of time spent at each step of the pusher algorithm. Based on the
values from Table 3.3. (Reproduction of Figure C.4 from page 107.)
FIGURE C.5: Direct visualization of a 2D PIC simulation of the formation of a Weibel instability. (Reproduction of Figure 4.2 from page 68.)
FIGURE C.6: Overview of the integration between the simulation code (middle layer, in green),
and the interaction and visualization code (top and bottom layers, in red and blue respectively).
(Reproduction of Figure 4.4 from page 71.)
FIGURE C.7: Two examples of direct interaction during a simulation: (a) depicting the grid
used in the PIC simulation over the particles, (b) zooming in the center of the image while the
simulation is running. (Reproduction of Figure 4.5 from page 71.)
FIGURE C.8: Direct visualization of a 3D particle pusher on a static EM field. (Reproduction
of Figure 4.6 from page 73.)
FIGURE C.9: The use of forced-fields allows for easy domain decomposition: each GPU gets
the whole spatial domain (represented by the grid), but the particles are distributed over the
GPUs (circles of different tones). This leads to no particle exchange occurring between the
GPUs and less node communication. (Reproduction of Figure B.2 from page 103.)
BIBLIOGRAPHY
[1] S. Burke, S. Campana, A. Peris, F. Donno, P. Lorenzo, R. Santinelli, and A. Sciabà.
gLite 3 User Guide, 2009.
[2] D. R. Hartree. The ENIAC, an electronic computing machine. Nature, 158:500–506,
1946.
[3] R. Rojas. How to make Zuse’s Z3 a universal computer. IEEE Annals of the
History of Computing, 20(3):51–54, July–September 1998.
[4] T. Sterling, D. Becker, and D. Savarese. Beowulf: A parallel workstation for
scientific computation. In Proceedings of the International Conference on Parallel
Processing. CRC Press, 1995.
[5] W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming
with the Message-Passing Interface. MIT Press, 1999.
[6] Federal Coordinating Council for Science, Engineering, and Technology. A re-
search and development strategy for high performance computing. Technical
report, Office of Science and Technology Policy, 1987.
[7] Ian Foster. The anatomy of the Grid: Enabling scalable virtual organizations.
International Journal of Supercomputer Applications, 15(3), 2001.
[8] K. Anstreicher, N. Brixius, J. P. Goux, and J. Linderoth. Solving large quadratic
assignment problems on computational grids. Mathematical Programming, 91(3):563–588,
February 2002. 17th International Symposium of the Mathematical Programming
Society, Atlanta, Georgia, Aug 07–11, 2000.
[9] C. E. Nugent, T. E. Vollmann, and J. Ruml. An experimental comparison of
techniques for assignment of facilities to locations. Operations Research, 16(1):150,
1968.
[10] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit.
International Journal of Supercomputer Applications, 11(2):115–128, 1997.
[11] Ian Foster, Carl Kesselman, Jeffrey M. Nick, and Steven Tuecke. The Physiology
of the Grid: An Open Grid Services Architecture for Distributed Systems
Integration. In Open Grid Service Infrastructure WG. Open Grid Forum, 2002.
[12] G. E. Moore. Cramming more components onto integrated circuits. Proceedings
of the IEEE, 86(1):82–85, Jan 1998.
[13] Robit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald,
and Ramesh Menon. Parallel programming in OpenMP. Morgan Kaufmann Pub-
lishers Inc., San Francisco, CA, USA, 2001.
[14] Thomas Willhalm and Nicolae Popovici. Putting Intel® Threading Building
Blocks to work. In IWMSE ’08: Proceedings of the 1st International Workshop on
Multicore software engineering, pages 3–4, NY, USA, 2008. ACM.
[15] Christopher Barton, Calin Cascaval, George Almási, Yili Zheng, Montse Far-
reras, Siddhartha Chatterjee, and José Nelson Amaral. Shared memory program-
ming for large scale machines. In PLDI ’06: Proceedings of the 2006 ACM SIG-
PLAN conference on Programming language design and implementation, pages 108–
117, New York, NY, USA, 2006. ACM.
[16] D. Pham, S. Asano, M. Bolliger, M. N. Day, H. P. Hofstee, C. Johns, J. Kahle,
A. Kameyama, J. Keaty, Y. Masubuchi, M. Riley, D. Shippy, D. Stasiak,
M. Suzuoki, M. Wang, J. Warnock, S. Weitzel, D. Wendel, T. Yamazaki, and
K. Yazawa. The design and implementation of a first-generation cell proces-
sor. In Solid-State Circuits Conference, 2005. Digest of Technical Papers. ISSCC. 2005
IEEE International, pages 184–185 Vol. 1, 2005.
[17] B. Flachs, S. Asano, S. H. Dhong, H. P. Hofstee, G. Gervais, R. Kim, T. Le,
P. Liu, J. Leenstra, J. Liberty, B. Michael, H. Oh, S. M. Mueller, O. Takahashi,
A. Hatakeyama, Y. Watanabe, and N. Yano. A streaming processing unit for a
cell processor. In Solid-State Circuits Conference, 2005. Digest of Technical Papers.
ISSCC. 2005 IEEE International, pages 134–135 Vol. 1, 2005.
[18] Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, and
Katherine Yelick. The potential of the cell processor for scientific computing. In
CF ’06: Proceedings of the 3rd conference on Computing frontiers, pages 9–20, New
York, NY, USA, 2006. ACM.
[19] David Luebke, Mark Harris, Jens Krüger, Tim Purcell, Naga Govindaraju, Ian
Buck, Cliff Woolley, and Aaron Lefohn. GPGPU: general purpose computation
on graphics hardware. In SIGGRAPH ’04: ACM SIGGRAPH 2004 Course Notes,
page 33, New York, NY, USA, 2004. ACM.
[20] I. Buck, T. Foley, D. Horn, J. Sugerman, P. Hanrahan, M. Houston, and K. Fata-
halian. Brook for GPUs: Stream Computing on Graphics Hardware. In Proceed-
ings of the ACM SIGGRAPH 2004, 2004.
[21] John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krueger,
Aaron E. Lefohn, and Timothy J. Purcell. A survey of general-purpose compu-
tation on graphics hardware. Computer Graphics Forum, 26(1):80–113, 2007.
[22] R. W. Hamming. Numerical methods for scientists and engineers. Dover Publica-tions, Inc., New York, NY, USA, 1986.
[23] Chip Walter. Kryder’s law. Scientific American Magazine, 161(8), August 2005.
[24] M. H. Kryder and Chang S. Kim. After hard drives—what comes next? Magnet-
ics, IEEE Transactions on, 45(10):3406–3413, September 2009.
[25] Peter Shirley. Fundamentals of Computer Graphics. A K Peters, 2001.
[26] VisIt: Software that Delivers Parallel Interactive Visualization. https://wci.
llnl.gov/codes/visit/.
[27] Randi J. Rost. OpenGL(R) Shading Language (2nd Edition). Addison-Wesley Pro-
fessional, January 2006.
[28] I. Foster, K. Czajkowski, D. E. Ferguson, J. Frey, S. Graham, T. Maguire,
D. Snelling, and S. Tuecke. Modeling and Managing State in Distributed Sys-
tems: The Role of OGSI and WSRF. Proceedings of the IEEE, 93(3):604–612, March
2005.
[29] The Globus Project. GridFTP: Universal Data Transfer for the Grid. White
Paper, September 2000. http://globus.org/toolkit/docs/2.4/datagrid/
deliverables/C2WPdraft3.pdf.
[30] Data Access with GFAL. http://wiki.egee-see.org/index.php/Data_
Access_with_GFAL.
[31] Peter Kunszt. File Transfer Service User Guide, 2005. https://edms.cern.ch/
document/591792/.
[32] Sergio Andreozzi, Natascia De Bortoli, Sergio Fantinel, Antonia Ghiselli,
Gian Luca Rubini, Gennaro Tortone, and Maria Cristina Vistoli. GridICE: a monitoring
service for Grid systems. Future Generation Computer Systems, 21(4):559–571, 2005.
[33] Fabrizio Pacini. WMS User’s Guide. EGEE-JRA1. https://edms.cern.ch/
document/572489/1.
[34] Jennifer M. Schopf, Laura Pearlman, Neill Miller, Carl Kesselman, Ian Foster,
Mike D’Arcy, and Ann Chervenak. Monitoring the grid with the Globus Toolkit
MDS4. Journal of Physics: Conference Series, 46:521–525, 2006.
[35] C. K. Birdsall and A. B. Langdon. Plasma Physics via Computer Simulation (Series on
Plasma Physics). Taylor & Francis, January 1991.
[36] I. Campos, M. Plociennik, H. Rosmanith, and S. Stork. Application support in
int.eu.grid. In Proceedings of IBERGRID: 1st Iberian Grid infrastructure conference,
2007.
[37] Péter Kacsuk and Gergely Sipos. Multi-grid, multi-user workflows in the
P-GRADE Grid portal. Journal of Grid Computing, 3(3):221–238, September 2005. DOI
10.1007/s10723-005-9012-6.
[38] Thierry Delaitre, Tamas Kiss, Ariel Goyeneche, Gabor Terstyanszky, Stephen
Winter, and Peter Kacsuk. GEMLCA: Running Legacy Code Applications as
Grid Services. Journal of Grid Computing, 3(1):75–90, June 2005.
[39] M. Kupczyk, R. Lichwala, N. Meyer, B. Palak, M. Plociennik, and P. Wolniewicz.
Migrating Desktop interface for several grid infrastructures. In M. H. Hamza, editor,
Proceedings of the IASTED International Conference on Parallel and Distributed
Computing and Networks, pages 183–188, Canada, 2004. IASTED, Acta Press.
[40] R. A. Fonseca, L. O. Silva, F. S. Tsung, V. K. Decyk, W. Lu, C. Ren, W. B. Mori,
S. Deng, S. Lee, T. Katsouleas, and J. C. Adam. OSIRIS: A three-dimensional,
fully relativistic particle in cell code for modeling plasma based accelerators.
In P. Sloot, C. J. K. Tan, J. J. Dongarra, and A. G. Hoekstra, editors,
Computational Science-ICCS 2002, PT III, Proceedings, volume 2331 of Lecture Notes
in Computer Science, pages 342–351. Springer-Verlag Berlin, 2002.
[41] L. Gargaté, R. Bingham, R. A. Fonseca, and L. O. Silva. dHybrid: A massively
parallel code for hybrid simulations of space plasmas. Computer Physics Commu-
nications, 176(6):419–425, 2007.
[42] EGEE Fusion Virtual Organization. http://grid.bifi.unizar.es/egee/
fusion-vo/.
[43] Torque Admin Manual. http://www.clusterresources.com/products/
torque/docs/.
[44] Maui Scheduler Administrator’s Guide. http://www.clusterresources.com/
products/maui/docs/mauiadmin.shtml.
[45] E. Laure, S. M. Fisher, A. Frohner, and C. Grandi. Programming the Grid with
gLite. Computational Methods in Science and Technology, 12(1):33–45, 2006.
[46] The HDF Group. HDF 4.2r4 Reference Manual. http://www.hdfgroup.org/doc.
html.
[47] The CERN DataGrid Project. http://www.cern.ch/grid/.
[48] Mehnaz Hafeez, Asad Samar, and Heinz Stockinger. A Data Grid Prototype for
Distributed Data Production in CMS. In 7th International Workshop on Advanced
Computing and Analysis Techniques in Physics Research, 2000.
[49] Mehmet Balman and Tevfik Kosar. Data scheduling for large scale distributed
applications. In Proceedings of the 5th ICEIS Doctoral Consortium, in conjunction
with the International Conference on Enterprise Information Systems (ICEIS ’07), 2007.
[50] Caitriana Mairi Macleod Nicholson. File Management for HEP Data Grids.
PhD thesis, University of Glasgow, 2006.
[51] B. Radic, V. Kajic, and E. Imamagic. Optimization of Data Transfer for Grid Us-
ing GridFTP. In 29th International Conference on Information Technology Interfaces,
2007., pages 709 –715, June 2007.
[52] Jiazeng Wang and Linpeng Huang. Intelligent file transfer protocol for grid environment.
In Wu Zhang, Weiqin Tong, Zhangxin Chen, and Roland Glowinski,
editors, Current Trends in High Performance Computing and Its Applications, pages
469–476. Springer Berlin Heidelberg, 2005.
[53] C. Huang, V. K. Decyk, C. Ren, M. Zhou, W. Lu, W. B. Mori, J. H. Cooley, T. M.
Antonsen, Jr., and T. Katsouleas. QuickPIC: A highly efficient particle-in-cell
code for modeling wakefield acceleration in plasmas. Journal of Computational
Physics, 217(2):658–679, 2006.
[54] Mikhail V. Medvedev, Massimiliano Fiore, Ricardo A. Fonseca, Luis O. Silva,
and Warren B. Mori. Long-time evolution of magnetic fields in relativistic
gamma-ray burst shocks. The Astrophysical Journal Letters, 618(2):L75, 2005.
[55] S. F. Martins, R. A. Fonseca, J. Vieira, L. O. Silva, W. Lu, and W. B. Mori. Model-
ing laser wakefield accelerator experiments with ultrafast particle-in-cell simu-
lations in boosted frames. Physics of Plasmas, 17(5):056705, 2010.
[56] W. Lu, M. Tzoufras, C. Joshi, F. S. Tsung, W. B. Mori, J. Vieira, R. A. Fonseca,
and L. O. Silva. Generating multi-GeV electron bunches using single stage laser
wakefield acceleration in a 3D nonlinear regime. Phys. Rev. ST Accel. Beams,
10(6):061301, Jun 2007.
[57] C. Ren, M. Tzoufras, F. S. Tsung, W. B. Mori, S. Amorini, R. A. Fonseca, L. O.
Silva, J. C. Adam, and A. Heron. Global simulation for laser-driven MeV electrons
in fast ignition. Phys. Rev. Lett., 93(18):185004, Oct 2004.
[58] J. Tonge, J. May, W. B. Mori, F. Fiuza, S. F. Martins, R. A. Fonseca, L. O. Silva,
and C. Ren. A simulation study of fast ignition with ultrahigh intensity lasers.
Physics of Plasmas, 16(5):056311, 2009.
[59] Ian Blumenfeld, Christopher E. Clayton, Franz-Josef Decker, Mark J. Hogan,
Chengkun Huang, Rasmus Ischebeck, Richard Iverson, Chandrashekhar Joshi,Thomas Katsouleas, Neil Kirby, Wei Lu, Kenneth A. Marsh, Warren B. Mori,
Patric Muggli, Erdem Oz, Robert H. Siemann, Dieter Walz, and Miaomiao Zhou.
Energy doubling of 42 GeV electrons in a metre-scale plasma wakefield accelerator.
Nature, 445(7129):741–744, Feb 2007.
[60] Christian Hededal. Gamma-Ray Bursts, Collisionless Shocks and Synthetic Spectra.
PhD thesis, The Niels Bohr Institute, University of Copenhagen, 2005. http:
//arxiv.org/abs/astro-ph/0506559.
[61] J. L. Martins, S. F. Martins, R. A. Fonseca, and L. O. Silva. Radiation post-processing in PIC codes. In Dino A. Jaroszynski and Antoine Rousse, editors,
Harnessing Relativistic Plasma Waves as Novel Radiation Sources from Terahertz to
X-Rays and Beyond, Prague, Czech Republic, 2009. SPIE.
[62] M. McCool, S. Du Toit, T. Popa, B. Chan, and K. Moule. Shader algebra. ACM
Transactions On Graphics, 23(3):787–795, August 2004.
[63] R. A. Fonseca, S. F. Martins, L. O. Silva, J. W. Tonge, F. S. Tsung, and W. B. Mori.
One-to-one direct modeling of experiments and astrophysical scenarios: push-
ing the envelope on kinetic plasma simulations. Plasma Physics and Controlled
Fusion, 50(12):124034 (9pp), 2008.
[64] J. M. Dawson. Particle simulation of plasmas. Rev. Mod. Phys., 55(2):403–447,
April 1983.
[65] J. P. Boris. Relativistic plasma simulation—optimization of a hybrid code.
In Proc. Fourth Conference on the Numerical Simulation of Plasmas, pages 3–67,
November 1970.
[66] R. W. Hockney and J. W. Eastwood. Computer Simulation Using Particles. Institute
of Physics Publishing, Bristol and Philadelphia, 1988.
[67] Kane Yee. Numerical solution of initial boundary value problems involving
Maxwell’s equations in isotropic media. Antennas and Propagation, IEEE Transactions
on, 14(3):302–307, May 1966.
[68] J. Villasenor and O. Buneman. Rigorous Charge Conservation For Local
Electromagnetic-Field Solvers. Computer Physics Communications, 69(2-3):306–
316, March 1992.
[69] T. Z. Esirkepov. Exact charge conservation scheme for Particle-in-Cell simula-
tion with an arbitrary form-factor. Computer Physics Communications, 135(2):144–
153, April 2001.
[70] T. Umeda, Y. Omura, T. Tominaga, and H. Matsumoto. A new charge conser-
vation method in electromagnetic particle-in-cell simulations. Computer Physics
Communications, 156(1):73–85, 2003.
[71] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel
programming with CUDA. Queue, 6(2):40–53, 2008.
[72] C. Sigg and M. Hadwiger. Fast third-order texture filtering. In GPU Gems 2:
Programming Techniques for High-Performance Graphics and General-Purpose Computation,
chapter 20, pages 313–329. Addison-Wesley, 2005.
[73] Daniel Ruijters, Bart M. ter Haar Romeny, and Paul Suetens. Efficient GPU-
Based Texture Interpolation using Uniform B-Splines. Journal of Graphics, GPU,
& Game Tools, 13(4):61–69, January 2008.
[74] George Stantchev, William Dorland, and Nail Gumerov. Fast parallel Particle-
To-Grid interpolation for plasma PIC simulations on the GPU. Journal of Parallel
and Distributed Computing, 68(10):1339–1349, 2008.
[75] V. Podlozhnyuk. Histogram calculation in CUDA. Technical report, NVIDIA,
2007. http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/histogram256/doc/histogram.pdf.
[76] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms
for manycore GPUs. In Parallel Distributed Processing, 2009. IPDPS 2009. IEEE
International Symposium on, pages 1–10, May 2009.
[77] R. A. Fonseca, L. O. Silva, J. Tonge, R. G. Hemker, J. M. Dawson, and W. B.
Mori. Three-dimensional particle-in-cell simulations of the Weibel instability in
electron-positron plasmas. Plasma Science, IEEE Transactions on, 30(1):28–29, Feb
2002.
[78] Victor W. Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim,
Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennu-
paty, Per Hammarlund, Ronak Singhal, and Pradeep Dubey. Debunking the
100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and
GPU. SIGARCH Comput. Archit. News, 38(3):451–460, 2010.
[79] H. Burau, R. Widera, W. Hönig, G. Juckeland, A. Debus, T. Kluge,
U. Schramm, T.E. Cowan, R. Sauerbrey, and M. Bussmann. PIConGPU: A Fully
Relativistic Particle-in-Cell Code for a GPU Cluster. Plasma Science, IEEE Trans-
actions on, 38(10):2831–2839, October 2010.
[80] Viktor Decyk, T. V. Singh, and Scott Friedman. Graphical processing unit-based
particle-in-cell simulations. To be published in Proceedings of the 10th Interna-
tional Computational Accelerator Physics Conference, 2010.
[81] Viktor Decyk and T. V. Singh. Adaptable Particle-in-Cell Algorithms for Graph-
ical Processing Units. To be published in Computer Physics Communications Jour-
nal, 2010.
[82] Xianglong Kong, Michael C. Huang, Chuang Ren, and Viktor K. Decyk. Particle-
in-cell Simulations with Charge-Conserving Current Deposition on Graphics
Processing Units. To be published in Journal of Computational Physics, 2010.
[83] IDL. http://www.ittvis.com/IDL.
[84] SCIRun: A Scientific Computing Problem Solving Environment, Scientific Com-
puting and Imaging Institute (SCI). http://www.scirun.org.
[85] Paraview: an open-source, multi-platform data analysis and visualization ap-
plication. http://www.paraview.org.
[86] OpenDX: The Open Source Software Project Based on IBM’s Visualization Data
Explorer. http://www.opendx.org.
[87] M. Marti, L. Gargate, R. A. Fonseca, L. L. Alves, J. P. S. Bizarro, P. Fernandes,
J. P. M. Almeida, H. Pina, F. M. Silva, and L. O. Silva. The IST Cluster: an
integrated infrastructure for parallel applications in Physics and Engineering.
In Ibergrid 2008 — 2nd Iberian Grid Infrastructure Conference Proceedings. Netbiblo,
2008.
[88] EPP: Extreme Plasma Physics at IST. http://epp.ist.utl.pt/.
[89] Dave Shreiner. The OpenGL Programming Guide, The Official Guide to Learning
OpenGL. Addison-Wesley, 2009.
[90] Jens Krüger, Peter Kipfer, Polina Kondratieva, and Rüdiger Westermann. A
particle system for interactive visualization of 3d flows. IEEE Transactions on
Visualization and Computer Graphics, 11(6):744–756, 2005.
[91] Keenan Crane, Ignacio Llamas, and Sarah Tariq. Real-time simulation and ren-
dering of 3d fluids. In GPU Gems 3, chapter 30, pages 633–676. Addison-Wesley,
2008.
[92] Takahiro Harada. Real-time rigid body simulation on GPUs. In GPU Gems 3,
chapter 29, pages 611–632. Addison-Wesley, 2008.
[93] I. Campos, F. Castejón, G. Losilla, J. M. Reynolds, F. Serrano, A. Tarancón, R. Val-
lés, and J. L. Velasco. IVISDEP: a Fusion plasma application ported to the Inter-
active European Grid e-Infrastructure. In Proceedings of IBERGRID: 1st Iberian
Grid infrastructure conference, 2007.
[94] Tech-X. Vorpalview, 2010. http://www.txcorp.com/products/VORPAL/
vorpalview.php.
[95] ATI. Radeon X1900 product site, 2006. http://www.amd.com/us/products/desktop/graphics/other/Pages/x1900-oveview.aspx.
[96] NVIDIA. Geforce 7800 product site, 2005. http://www.nvidia.com/page/
geforce_7800.
[97] A. Lastra, M. Lin, and D. Manocha. GP2: ACM Workshop on General Purpose Com-
putation on Graphics Processors. The Organizing Committee of the 2004 ACM
Workshop on General-Purpose Computing on Graphics Processors, October
2004. http://www.cs.unc.edu/Events/Conferences/GP2/proc.pdf.
[98] Nico Galoppo, Naga K. Govindaraju, Michael Henson, and Dinesh Manocha.
LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware.
In SC ’05: Proceedings of the 2005 ACM/IEEE conference on Supercomputing,
page 3, Washington, DC, USA, 2005. IEEE Computer Society.
[99] Industrial Light & Magic. OpenEXR, 2010. http://www.openexr.com/.
[100] Daniel Reiter Horn, Mike Houston, and Pat Hanrahan. ClawHMMER: A
Streaming HMMer-Search Implementation. In SC ’05: Proceedings of the 2005
ACM/IEEE conference on Supercomputing, page 11, Washington, DC, USA, 2005.
IEEE Computer Society.