
UNIVERSIDADE TÉCNICA DE LISBOA
INSTITUTO SUPERIOR TÉCNICO

Multiscale High-Performance Computing in Plasma Physics

fine and coarse parallelism in simulation and visualization

Paulo Tribolet de Abreu
Licenciado

Dissertation for the degree of Doctor in Physics Engineering

Supervisor: Doutor Luís Miguel de Oliveira e Silva
Co-supervisor: Doutor João António Madeiras Pereira

Jury:
President: President of the Scientific Council of IST
Members: Doutor Fernando António de Oliveira Carvalho Rodrigues
Doutor Carlos António Abreu Fonseca Varandas
Doutor Luís Miguel de Oliveira e Silva
Doutor João António Madeiras Pereira
Doutor Ricardo Parreira de Azambuja Fonseca
Doutor Viktor Konstantyn Decyk

December 2010


Title: Computação Multi-Escala de Alta Performance em Física de Plasmas: níveis de paralelismo em simulação e visualização (Multiscale High-Performance Computing in Plasma Physics: levels of parallelism in simulation and visualization)
Name: Paulo Tribolet Abreu
PhD in Physics Engineering
Supervisor: Luís de Oliveira e Silva
Co-supervisor: João Madeiras Pereira

Resumo:

In the first half of this decade, the scientific community had to find new ways to satisfy its growing need for computational power. Two architectures stood out as especially promising: Grid computing and graphics processors. In this thesis we show new results obtained with these very recent architectures in the field of numerical plasma simulation.

On the Grid side, we developed and tested a multithreaded library that allows programs to transfer large data sets on the Grid in a way that is simple, robust, transparent to the user, and that minimizes the performance impact. The performance and the impact were tested in several usage scenarios and in a production code. On the graphics processor side, we implemented a fully relativistic plasma simulation code. We present some results from 2D simulations and a comparison with a similar state-of-the-art code.

We also explored different levels of parallelism in scientific visualization. We built a distributed-memory visualization cluster and used it to visualize large data sets. We implemented direct visualization in our simulation code, allowing the user to interact with and explore the results immediately, while they are still being produced.

Keywords: Computing; Plasma Physics; Parallel Algorithms; Visualization; Grid Computing; Simulation.


Title: Multiscale High-Performance Computing in Plasma Physics: fine and coarse parallelism in simulation and visualization

Abstract:

When commodity processors hit the power wall in the first half of this decade, the scientific community needed to find new means to satisfy its increasing computational needs. Two architectures stood out as very promising for the near future: Grid computing and graphics processors. They are the coarse and the fine extremes of the spectrum of parallel computing. This thesis shows several new results obtained with these new computing architectures in numerical simulations in plasma physics.

At the coarse end of parallel computing, we have developed and deployed a multithreaded library that allows code to manage transfers of large data sets on the Grid in a way that is simple, robust, transparent to the user, and minimizes the performance impact. We present an analysis of the performance impact in several application scenarios and in a real case scenario. On the fine-grain side, we have implemented a fully relativistic plasma code on a graphics processor. We present some results from 2D simulations and a comparison against a similar state-of-the-art code.

We have also explored coarse and fine grain parallelism in scientific visualization. We have set up a distributed-memory scientific visualization cluster and used it to visualize large data sets. We have also implemented a direct visualization feature in our plasma simulation code, so that the user can interact on screen with the simulation and watch and explore the results as they are being produced.

Keywords: Computing; Plasma Physics; Parallel Algorithms; Visualization; Grid Computing; Simulation.


ACKNOWLEDGMENTS

This work would never have been done without the support of a large number of individuals and institutions. Unfortunately I do not have the ability to personally thank them all, but at least I would like to mention some in particular.

Prof. Luís Silva was my supervisor during the five years of this work, and also during the previous years as an inexperienced researcher. During all this time he was able to maintain the ideal environment for research at the Grupo de Lasers e Plasmas (GoLP/IPFN) and to guide me through all the obstacles towards a successful outcome. It was a privilege to work under the guidance of such an extraordinary scientist and team leader.

Prof. João Madeiras Pereira was my co-supervisor and introduced me to the fantastic world of Computer Graphics. This knowledge brought me to new areas I would not have been able to explore on my own. For that I am very thankful.

Prof. Ricardo Fonseca was an invaluable help through most of the work in this thesis. His deep knowledge of many different scientific, computational, and numerical areas enlightened me at various stages, and his perseverance and pragmatic optimism pulled me forward at the most difficult moments. I consider myself very lucky to have had the opportunity to receive his guidance and support.

The research group at GoLP is the perfect environment for pushing the limits of knowledge in Plasma Physics. Its members are excellent and highly motivated scientists and support personnel who master the very fine balance between hard work and play. They all contributed in various ways to the positive outcome of this work, but I would like to thank in particular: Michael Marti, for companionship in the server room and all kinds of important lessons in server maintenance; Luís Gargaté, for the incentive of finishing his PhD earlier than me; and Jorge Vieira and Samuel Martins, for a careful revision of the first version of this thesis.

My week at UCLA was short but very fruitful. I am thankful to Prof. Warren Mori for his support and to Dr. Viktor Decyk for the helpful discussions. Samuel Martins, Cláudia Franco, and Frederico Fiúza were the best of hosts and created the perfect homely setting for a fruitful research week.

My parents started this work when they gave me my first chemistry set as a present for my 7th birthday. They never stopped supporting me in various ways through all these years. Their example of love and courage is the foundation of my present life. My brothers and sister never doubted my success, and all of them contributed to my desire to study science and technology, and to become an engineer and a researcher.

I would never have embarked on this 5-year project without the support of my lover and companion Britta, and our daughter Kalú. They offered their love and unconditional support from the first moment and at all times, even when I was least expecting it. With all my love, this work is dedicated to you.

This work was partially supported by FCT, Portugal, grant SFRH/BD/17870/2004 and project GRID/GRI/81800/2006, and by the NVIDIA Professor Partnership Program.


CONTENTS

1 Introduction 1

1.1 Grid Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Multi-core Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.3 Scientific Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.4 Original Contributions in this Thesis . . . . . . . . . . . . . . . . . . . . . 9

2 Coarse grain parallelism 13

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.1.1 The EGEE Grid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.1.2 Motivation for Plasma Simulation Codes . . . . . . . . . . . . . . 16

2.2 An EGEE Grid Site . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.3 DataLight . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.3.1 Available Tools for Data Transfer . . . . . . . . . . . . . . . . . . . 20

2.3.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.3.3 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.3.4 Results and analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 29

2.3.5 Application deployment . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.4 Portal development and other future work . . . . . . . . . . . . . . . . . 34

3 Fine grain parallelism 37

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.2 Basic PIC Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.3 CUDA Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.4 CUDA Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45


3.4.1 Particle push . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

3.4.2 Current deposit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

3.4.3 EM field update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

3.4.4 Boundaries and guard cells . . . . . . . . . . . . . . . . . . . . . . 54

3.5 Application to Weibel Instability . . . . . . . . . . . . . . . . . . . . . . . 54

3.6 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

3.7 Overview of other work on PIC and CUDA . . . . . . . . . . . . . . . . . 62

3.8 Overview and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 63

4 Multi-scale visualization 65
4.1 Distributed rendering with VisIt . . . . . . . . . . . . . . . . . . . . . . . 65

4.2 Direct visualization with CUDA . . . . . . . . . . . . . . . . . . . . . . . 67

4.2.1 Particle (point) visualization . . . . . . . . . . . . . . . . . . . . . 67

4.2.2 Grid Visualization and Diagnostics . . . . . . . . . . . . . . . . . . 69

4.2.3 Interactivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

4.2.4 3D visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

4.2.5 Previous work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

4.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5 Overview and Future Work 75

A Code listings 77

A.1 Code from Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

A.2 Code from Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

A.3 Code from Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

A.3.1 VisIt plugin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
A.3.2 Direct visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

B A GPGPU Implementation of the Boris Pusher Algorithm 99

B.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

B.2 The Boris pusher in a simplified PIC code . . . . . . . . . . . . . . . . . . 100

B.3 GPGPU implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

B.4 Parallelization Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102


B.5 Cluster implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

B.6 Performance and limitations . . . . . . . . . . . . . . . . . . . . . . . . . . 103

C Color figures 105


LIST OF FIGURES

1.1 First generation Grid model. . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 Second generation Grid model using Globus Toolkit 2 middleware. . . . 4

1.3 Third generation Grid model. . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.1 The EGEE site CFP-IST. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.2 Using GFAL instead of POSIX calls with HDF. . . . . . . . . . . . . . . . 21

2.3 Serial Data First (DF) scenario variant: the application alternates between a simulation step and a transfer step. . . . . . . . . . . . . . . . . . . . . . . . 24

2.4 Serial data last (DL) scenario: the application completes the simulation and only after that it initiates the transfer of the result data. . . . . . . . . . . . . 25

2.5 Weak Data Threaded (W-DT) scenario: data transfers occur on a different thread than the simulation and usually take less time than one simulation step. 26

2.6 Intensive Data Threaded (I-DT) scenario: data transfers occur on a different thread than the simulation and usually take longer than one simulation step. . 27

2.7 A simplified fluxogram of the algorithm implemented to allow for simultaneous Grid simulation and network transfer of results. . . . . . . . . . . . . . . . . 28

2.8 Comparing the I-DT scenario (solid line) with the I-DF (dotted line) and I-DL (dashed line) using the F and T time intervals. . . . . . . . . . . . . . . . . . 31

2.9 Comparing the W-DT scenario (solid line) with the I-DF (dotted line)

and I-DL (dashed line) using the F and T time intervals. . . . . . . . . . . 31

2.10 Comparing the I-DT scenario (solid line) with the I-DF (dotted line) and

I-DL (dashed line) using the S and C time intervals. . . . . . . . . . . . . 32

2.11 Comparing the W-DT scenario (solid line) with the I-DF (dotted line)

and I-DL (dashed line) using the S and C time intervals. . . . . . . . . . . 32


2.12 Job state machine in gLite (adapted from [1]). . . . . . . . . . . . . . . . . 35

3.1 Overview of a full cycle of an electromagnetic PIC code. Subscript j

refers to cell quantities on the grid; subscript i refers to particle quantities. 40

3.2 Staggered 2D grids in PIC codes. . . . . . . . . . . . . . . . . . . . . . . . 40

3.3 CUDA memory architecture, both from the hardware and software level.

(For a color version, see Figure C.1 on page 105.) . . . . . . . . . . . . . . 43

3.4 Illustration of particle splitting in the Villasenor-Buneman current deposition method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

3.5 Illustration of current deposition in the Esirkepov method. . . . . . . . . 50

3.6 Pseudo-code for the generalized atomic function. . . . . . . . . . . . . . 52

3.7 Illustration of a stride during particle push and current deposition, so

that kernels in the same warp handle particles in different cells. . . . . . 53

3.8 Illustration of the sequence call of the three different kernels required

for a full EM field update. . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

3.9 Charge density of electrons during the formation of a Weibel instability. 56

3.10 Final state of a full 2D PIC simulation of a Weibel instability (positrons

in cyan and electrons in yellow). (For a color version, see Figure C.3 on

page 106.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

3.11 B₁² + B₂² intensity during the formation of a 2D Weibel instability. . . . . . . 57

3.12 Percentage of time spent at each step of the pusher algorithm. Based on

the values from Table 3.3. (For a color version, see Figure C.4 on page 107.) 59

3.13 Evaluation of performance with stride size and particles per thread for

the 1.2 M run. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.1 Part of the hardware for distributed visualization testing. These four

machines represent 8 cores and 4.4 GB of RAM. Another similar node

(not in the picture) acted both as the scheduler and as a computing

node, and added the remaining 2 cores and 1.1 GB of RAM. . . . . . . . 66

4.2 Direct visualization of a 2D PIC simulation of the formation of a Weibel

instability. (For a color version, see Figure C.5 on page 107.) . . . . . . . 68

4.3 Examples of different diagnostics during the formation of a 2D Weibel instability: (a) electron charge density, (b) positron charge density, (c) intensity of the perpendicular component of the current (j₃), and (d) transverse magnetic field energy B₁² + B₂². . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


4.4 Overview of the integration between the simulation code (middle layer, in green) and the interaction and visualization code (top and bottom layers, in red and blue respectively). (For a color version, see Figure C.6 on page 108.) . . . . . . 71

4.5 Two examples of direct interaction during a simulation: (a) depicting

the grid used in the PIC simulation over the particles, (b) zooming in

the center of the image while the simulation is running. (For a color

version, see Figure C.7 on page 108.) . . . . . . . . . . . . . . . . . . . . . 71

4.6 Direct visualization of a 3D particle pusher on a static EM field. (For a

color version, see Figure C.8 on page 109.) . . . . . . . . . . . . . . . . . . 73

A.1 A simplified fluxogram of DataLight (left and bottom) and its connec-

tion with a simulation code (top). . . . . . . . . . . . . . . . . . . . . . . . 78

A.2 The staggered 2D grid used for EM-field interpolation. . . . . . . . . . . 87

B.1 The use of textures in the GPGPU implementation of the Boris pusher. . 101

B.2 The use of forced-fields allows for easy domain decomposition. . . . . . 103

C.1 CUDA memory architecture, both from the hardware and software level.

(Reproduction of Figure 3.3 from page 43.) . . . . . . . . . . . . . . . . . 105

C.2 Illustration of a stride during particle push and current deposition, so

that kernels in the same warp handle particles in different cells. . . . . . 106

C.3 Final state of a full 2D PIC simulation of a Weibel instability (positrons

in cyan and electrons in yellow). (Reproduction of Figure 3.10 from

page 56.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

C.4 Percentage of time spent at each step of the pusher algorithm. Based on the values from Table 3.3. (Reproduction of Figure 3.12 from page 59.) . . . . . . 107

C.5 Direct visualization of a 2D PIC simulation of the formation of a Weibel

instability. (Reproduction of Figure 4.2 from page 68.) . . . . . . . . . . . 107

C.6 Overview of the integration between the simulation code (middle layer,

in green), and the interaction and visualization code (top and bottom

layers, in red and blue respectively). (Reproduction of Figure 4.4 from

page 71.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

C.7 Two examples of direct interaction during a simulation: (a) depicting

the grid used in the PIC simulation over the particles, (b) zooming in

the center of the image while the simulation is running. (Reproduction

of Figure 4.5 from page 71.) . . . . . . . . . . . . . . . . . . . . . . . . . . 108


C.8 Direct visualization of a 3D particle pusher on a static EM field. (Repro-

duction of Figure 4.6 from page 73.) . . . . . . . . . . . . . . . . . . . . . 109

C.9 The use of forced-fields allows for easy domain decomposition. . . . . . 109


LISTINGS

A.1 The main exported function for DataLight. It is called for each file that should be transferred. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

A.2 The queue manager thread. . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

A.3 The remote file transfer thread. . . . . . . . . . . . . . . . . . . . . . . . . 81

A.4 The function that finishes the transfer threads. . . . . . . . . . . . . . . . 82

A.5 Applying the generalized atomic function from page 52 to the addition

of floats. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

A.6 Entering and leaving the current deposition function. . . . . . . . . . . . 83

A.7 Implementation of the Esirkepov method. . . . . . . . . . . . . . . . . . . 84

A.8 Magnetic field update using shared memory. . . . . . . . . . . . . . . . . 86

A.9 EM-field interpolation at a particle’s position using textures. . . . . . . . 87

A.10 EM-field interpolation at a particle’s position without textures. . . . . . . 88

A.11 The kernel for a particle push. . . . . . . . . . . . . . . . . . . . . . . . . . 89

A.12 Initialization of a CUDA device. . . . . . . . . . . . . . . . . . . . . . . . . 90

A.13 The function to check if the file contains vectorial magnetic field data

from an Osiris simulation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

A.14 Domain decomposition for a serial visualization system. . . . . . . . . . 93

A.15 Domain decomposition for a parallel visualization system. We do not decompose into more domains than the number of available processors. . . . . . . . 94

A.16 GLUT display function for direct visualization. . . . . . . . . . . . . . . . 95

A.17 Drawing the simulation particles. . . . . . . . . . . . . . . . . . . . . . . . 96

A.18 Drawing the diagnostics as textures. . . . . . . . . . . . . . . . . . . . . . 97


To Britta and Kalú

with whom I’ve learned

the great mysteries that can’t be taught.


CHAPTER 1

INTRODUCTION

Why Nature's mysteries can be grasped by us humans in mathematical form is in itself a mystery. Several explanations have been formulated, but none can really fully satisfy our inquiring minds. Some suggest that it is not Nature, but our mind, that is mathematical. Nonetheless, ever since the appearance of electronic computers in the middle of the twentieth century, science has been using them to advance knowledge about the mathematical aspects of Nature: as a confirmation of theories, as a prediction of experimental results, and as a base for new directions of discovery.

The ENIAC [2], built at the University of Pennsylvania between 1943 and 1946, is usually considered to be the first Turing-complete machine, hence the first modern computer.1 Although through most of its 10 years of life it was used in military applications, these usually involved solving physics problems, like ballistic trajectories and computations related to hydrogen bombs. Scientific use of computers has not stopped since then, and has been increasing as fast as computing power has allowed.

In the 50s and 60s, with the transistor and the integrated circuit, science started relying more and more on digital numerical simulations. During the 70s and 80s, following the successful ARPA experiment of the 60s, research centers got digitally connected, first in the USA and later throughout the world. At the time, a single powerful computer system for research would cost from several hundred thousand up to several million dollars, like the Cray-1, which sold for $8.8 million, or the Goodyear Massively Parallel Processor, estimated at $750 thousand. This trend of single2 and very powerful systems continued through the 80s and the first half of the 90s.

1 It has been proven [3] that the Z3, built in 1941, was also Turing-complete, although not in a practical way.
2 Even when implementing an internal parallel architecture, these were single systems, not a connection of several independent systems.


These machines often had unique architectures and forced the user to develop specific applications for them. Most used a mix of shared and distributed memory, often coupled with a hierarchy of access patterns (fast/slow), and different connections to storage and processing power. A successful application on one system was not likely to be well deployed on another without serious modifications.

In the 90s, the PC was finally adopted by the numerical simulation scientific community. With the appearance of the Beowulf type of cluster [4], scientists had access to cheap and powerful systems made from commodity, off-the-shelf (COTS) components. Several PCs, connected with inexpensive ethernet cards and running the free Linux OS, could run successful numerical experiments for a fraction of the price of a dedicated machine. Moreover, they were easy to upgrade gradually, as research money became available and PCs became more powerful, unlike the specialized machines that usually required a full replacement.

The Message Passing Interface (MPI) specification [5] was essential to the wide acceptance of the PC cluster. It defined a standard way of implementing the communication between the different instances of the deployed application. If an application or an algorithm implementation ran successfully on a certain PC cluster using the MPI specification, it stood a good chance of working unmodified on any other PC cluster throughout the world. For the best part of ten years, until the mid 2000s, most computer clusters were based on this distributed memory model, using COTS machines, connected by ethernet and running an implementation of MPI.
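To illustrate the portability that the standard brought, the short sketch below (our own illustrative example, not taken from the codes discussed later in this thesis) splits a one-dimensional domain across MPI processes and combines the partial results with a collective call; the domain size is arbitrary. The same source compiles and runs unchanged on any cluster that provides an MPI implementation:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        const long n_total = 1000000;   /* global number of cells (arbitrary) */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* each process owns a contiguous slice of the global domain */
        long n_local = n_total / size + (rank < n_total % size ? 1 : 0);

        /* stand-in for the local work: just count the local cells */
        double local_sum = (double) n_local, global_sum = 0.0;

        /* combine the partial results on every process */
        MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        if (rank == 0)
            printf("%d processes, %.0f cells in total\n", size, global_sum);

        MPI_Finalize();
        return 0;
    }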

1.1 GRID COMPUTING

In the late 80s, the USA Office of Science and Technology Policy issued a report where several scientific grand challenges were identified [6]. These grand challenges were defined as "fundamental problems in science or engineering whose solution would be enabled by the application of high-performance computing resources." In the beginning of the 90s, it became obvious that these grand challenges could not be solved by one single supercomputer, but only with the joint efforts of the different supercomputers installed throughout research and academic centers. And thus, in 1999, with the proliferation of PC clusters and ubiquitous access to the Internet, Grid computing was born.

Grid computing was defined by Ian Foster et al. in 2001 [7] as "a system that coordinates resources not subjected to a centralized control, that delivers a nontrivial quality of service and that uses standard, open, general-purpose protocols and interfaces." This definition was born from a successful grand challenge project called NUG30 [8], created to solve a 30-factor quadratic assignment problem [9]. This problem was posed in 1968, but it was only solved in 2000. It took 7 days of computation, using a grid of 9 clusters with a total of 1000 processors. It required more than 96 000 CPU hours, which is equivalent to about 11 years on one single CPU. So 11 years of computation were reduced to one week by gathering several computational clusters to work on one common task. A general overview of a first generation Grid, like the one we have just described, is given in Figure 1.1.

FIGURE 1.1: First generation Grid model.

The idea is that computing power (the "Resource Provider") is offered to a matchmaker system. This system gathers all the available resources and waits for resource requests from clients (the resource requesters). Once a match is made, the requester can use the provided resources. This model had limited success because of a few shortcomings:

— the protocols (represented by the arrow lines in Figure 1.1) were not standards;

— the connection between the requester and the provider was fixed during the lifetime of the connection;

— the requester could not choose resources directly.

These shortcomings meant the connections tended not to be reliable. If one of the players (the requester, the provider, or the matchmaker) suffered an offline period, the whole cycle (publishing, requesting, using the resources) had to be redone, usually requiring human intervention.

Foster's paper [7] pointed to the solution to these problems: physical machines had to be abstracted as resources, and connections should be made using reliable open standards. A set of software packages was already being developed to follow those requirements, in particular the Globus Toolkit [10], which became one of the most widely used Grid middleware stacks.


Several Grids were built using this new approach and they became part of what is now called the 2nd generation, where emphasis is put on resources and open protocols. A general overview of a 2nd generation Grid is given in Figure 1.2.

FIGURE 1.2: Second generation Grid model using Globus Toolkit 2 middleware. The Monitoring and Discovery Services (MDS) keep track of the available services and resources. A request for a service or resource is made using the specific API, and the use of that service or resource is done through the Grid Resource Allocation Management (GRAM) protocol.

A 2nd generation Grid implements several general components that offer the correct services and resources:

Security: This component is responsible for the authentication of the user to the Grid. It should allow for proxy authentication, so that a user can log in at only one place and use any resource anywhere she/he is entitled to.

Information services: This component collects the services and resources available and announces them to any requester.

Data management: This component is responsible for the data allocation and transfer services on the Grid.

Execution management: This component allows the user to control applications running on the Grid: launching, querying, pausing, relaunching, etc.

Common runtime: This component is the general environment required for a remote application to be launched and run (e.g., operating system and application libraries).

The most notable example of a 2nd generation Grid might be the Large Hadron Collider Computing Grid (LCG), originally built specifically to supply computing power and data storage for the Large Hadron Collider experiments at CERN. Based partially on the Globus Toolkit middleware and partially on other tools, it is considered to be one of the most successful 2nd generation Grid implementations. In fact it grew from being a Grid dedicated to the LHC to become a wide eScience-dedicated Grid called EGEE.


In 2002 Ian Foster et al. laid out the plans for the third generation Grid [11] by introducing the Open Grid Services Architecture (OGSA). It allows for a higher level of abstraction by basing the Grid on services instead of resources. A general overview of a 3rd generation Grid is given in Figure 1.3.

FIGURE 1.3: Third generation Grid model. Services are published to the registry and await invocation from a requester. All communications use open protocols: services are described using the Web Service Description Language (WSDL), entities are identified through Universal Description Discovery and Integration (UDDI), and messages are sent using the Simple Object Access Protocol (SOAP).

The main trend that evolved during these three generations of Grid models was an increase in the abstraction of how resources are made available. This can be seen by comparing Figures 1.1 to 1.3. In the first generation grids, resources (like computing power or data storage) were tightly coupled to the protocols used to access them and to the physical equipment that supported them. Different equipment usually meant different protocols and resources: a different cluster might mean a different way to transfer and launch a program; a different storage system (e.g., hard disk or tape based) usually meant different ways to transfer and retrieve the data. In the second generation grids, each type of resource was associated with one protocol, and the physical equipment was abstracted away. It could be changed and, as long as the new equipment still provided the same kind of resource, this change would be transparent to the client, since the same protocol could be used. Finally, in the third generation grids both the protocol and the resource are abstracted away as a service. A client only has to know how to communicate with a server to request a service. Each service has an interface description that is passed to the client at the beginning of the transaction. Thus there is a loosely-coupled contract between the client and the server.

A major improvement achieved by the 3rd generation Grid was the unification of Web services and Grid services, creating the Web Services Resource Framework (WSRF). The idea was that, although Web services supply a good specification for dealing with services on the Web, they are nominally stateless, that is, no data is retained between invocations of the service. This is not adequate for Grid services. WSRF provided a set of operations and components that extended Web services to allow for state preservation. This allowed services to simply extend their existing Web capabilities to become Grid services. The Globus Toolkit version 4 (GT4) is one of the most widely used implementations of WSRF.

At the moment, the EGEE Grid is one of the world's largest 3rd generation Grids. It uses the gLite middleware [1], which is actually a mix of GT4 tools with some specific services. Most of the computing solutions for today's grand challenges involve taking advantage of Grid infrastructures. To that end, one of the major contributions of this thesis is the development of a software library that allows programs that produce large amounts of data as output to be quickly and easily deployed on the EGEE Grid (see Chapter 2).

1.2 MULTI-CORE MACHINES

During the time when computer grids were being developed, and behind all this evolution in scientific computing, Moore's law [12] was acting in a most spectacular way. In its current form, it states that the density of components placed on a chip doubles roughly every two years. For a single chip, more density allows for higher clock speeds, and so until 2005 the effect of Moore's law was to double processor clock speed every 24 months.3 However, around that time chip manufacturers were faced with a new problem: the power wall.

Power consumption on a chip grows roughly with the square of the clock speed. With the increase in speed and transistor density, the power consumed per unit area was starting to be an issue for consumer-level computers. It was clear that Moore's law could no longer mean a steady increase in processor speed. The solution was to keep the same speed but increase the number of processors per chip. This reduces the power consumption by a factor of √2 ≈ 1.41 for the same theoretical number of instructions per second, and has been the approach of CPU makers in the last five years. At the moment, it is common to find COTS systems with 16 cores.

Most implementations of MPI can treat different cores in one machine as different nodes in a distributed memory system. This means that MPI code could run unmodified on these clusters made of multi-core machines (known as fat nodes), although without taking advantage of all the possibilities of a shared memory system. And thus we are presented with the problem that the scientific high-performance computing community has been facing in the last six years: how to take advantage of the multiple cores that are now available in most shared memory clusters, while still using the effort spent in the last 15 years on distributed memory architectures, especially considering that clusters are still a very important way to advance numerical simulations?

3 This effect was so reliable that some scientists report anecdotally that if a certain algorithm did not run fast enough on a recent processor, it was cheaper, faster, and easier to wait for processor speed to increase than to rethink the algorithm or to add more computers to the calculation.

Several solutions have been emerging, some acting on the level of programming languages, compilers, and software in general, others as a mix of new hardware architectures and programming models. At the software level, some of the most relevant solutions are:

OpenMP: OpenMP [13] extends C, C++ and Fortran to allow for shared memory multiprocessing. It requires explicit parallelism, hence leaving in the hands of the programmer the decision of which parts to parallelize with MPI and which parts with OpenMP (a minimal sketch of such a hybrid MPI/OpenMP loop is given after this list).

TBB: Threading Building Blocks [14] abstracts away the idea of a thread. The program is seen as a set of concurrent tasks running on a system with a certain number of threads available. A task scheduler launches the tasks using the available threads and is responsible for the load balancing of the whole system. The parallelism has to be explicitly programmed, but data locality can be either explicit or implicit. A disadvantage is that it is available only for C++.

PGAS: Partitioned Global Address Space [15] tries to abstract the typical memory hierarchy of modern systems. On a single core, one has the RAM, usually large and slow, and several levels of cache, much smaller in size but also much faster. These concepts can be extended to shared memory systems (multi-core machines) or to distributed memory systems (clusters of machines). For example, on a cluster the slow memory is the one available over the network while the fast memory is the one locally available to the processor. When programming with PGAS extensions, the programmer deals with local, shared or global memory, and the underlying physical access (locally over the processor bus or remotely over the network) is abstracted. However, parallelism has to be explicitly declared.
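The following minimal sketch (our own illustration, with an arbitrary array size) shows the hybrid approach mentioned in the OpenMP item above: MPI still handles the coarse, distributed-memory side between nodes, while a single OpenMP pragma spreads the local loop over the cores of each fat node.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    #define N_LOCAL 1000000     /* cells owned by this MPI process (arbitrary) */

    static double field[N_LOCAL];

    int main(int argc, char **argv)
    {
        int rank;
        long i;
        double local = 0.0, global = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* fine grain: the cores of this node share the local loop */
        #pragma omp parallel for reduction(+:local)
        for (i = 0; i < N_LOCAL; i++) {
            field[i] = 1.0;     /* stand-in for the real per-cell work */
            local += field[i];
        }

        /* coarse grain: nodes combine their partial results over the network */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum = %.0f (threads per node: %d)\n",
                   global, omp_get_max_threads());

        MPI_Finalize();
        return 0;
    }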

These software approaches give the scientific developer tools to adapt his or her code more efficiently to the current trend of clusters of fat nodes, that is, of systems with a mix of distributed and shared hierarchical memory.

At the same time that these software solutions were being developed, there was also a movement towards a more efficient multi-core design in hardware, without the legacy of a single-core architecture. The most notable processor chip that came out of this movement is the Cell Broadband Engine Architecture (Cell BE for short) [16]. On the computational level, it is made of a main processor called the Power Processing Element, or PPE, which is actually an IBM Power Architecture core, and of eight co-processors called the Synergistic Processing Elements, or SPEs [17]. They are connected by a high-bandwidth circular data bus called the Element Interconnect Bus, or EIB. Cache-coherent direct memory access (DMA) to main memory and other external storage completes the architecture. Although several projects in scientific computing used the Cell BE [18], and the Roadrunner supercomputer was partially based on it, it failed to gain widespread acceptance. In November 2009 IBM declared they had discontinued the development of the Cell.

Unexpectedly, a promising new multi-core hardware architecture became available to the HPC scientific community from another place: the graphics and gaming industry.

Programmable Graphics Processing Units (GPUs) were introduced at the beginning of this decade. The first generation offered limited programmability (no conditional expressions, assembler only), but within a few years the next versions evolved into a fully programmable system, although still with a tight connection to the graphics engine [19, 20]. Finally, in 2007, NVIDIA launched the "Compute Unified Device Architecture" (CUDA), a software stack (compiler and APIs) that allows for a general programming approach on specific CUDA-enabled GPUs. Due to the ubiquitous presence of graphics chips and their low cost, CUDA was accessible to almost anyone who wanted to try to take advantage of its unique massively parallel architecture [21]. One major contribution of this thesis is the implementation of a numerical plasma simulation code on a GPU using CUDA (see Chapter 3).
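To give a flavour of this programming model before the detailed treatment in Chapter 3, the sketch below (a simplified illustration, not the actual kernels listed in Appendix A) launches one CUDA thread per particle to advance positions by one time step; the particle count, block size, time step and drift velocity are all arbitrary.

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* one thread advances one particle by dt (toy free-streaming step) */
    __global__ void push(float *x, const float *vx, float dt, int np)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < np)
            x[i] += vx[i] * dt;
    }

    int main(void)
    {
        const int np = 1 << 20;                  /* number of particles */
        const size_t bytes = np * sizeof(float);

        float *h_x = (float *) malloc(bytes);
        float *h_vx = (float *) malloc(bytes);
        for (int i = 0; i < np; i++) { h_x[i] = 0.0f; h_vx[i] = 0.1f; }

        float *d_x, *d_vx;
        cudaMalloc((void **) &d_x, bytes);
        cudaMalloc((void **) &d_vx, bytes);
        cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_vx, h_vx, bytes, cudaMemcpyHostToDevice);

        /* one thread per particle, 256 threads per block */
        int threads = 256;
        int blocks = (np + threads - 1) / threads;
        push<<<blocks, threads>>>(d_x, d_vx, 0.5f, np);

        cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);
        printf("x[0] = %f\n", h_x[0]);

        cudaFree(d_x); cudaFree(d_vx);
        free(h_x); free(h_vx);
        return 0;
    }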

1.3 SCIENTIFIC VISUALIZATION

R. W. Hamming's ground-breaking classic on numerical methods [22] is based on a very powerful motto:

"The purpose of computing is insight, not numbers."

Throughout the history of computing, scientists have gained a very powerful capacity to generate huge amounts of data. Let us not forget that, although processor speed has not increased significantly in the last few years, Moore's law is still valid: our capacity to generate data is still doubling every 24 months [23, 24], using some of the techniques explained previously. If we expand Hamming's motto with the following corollary:

Data is useless without insight.

the question comes naturally: how do we achieve insight into the amounts of data that scientists are able to produce?


One of the answers includes visualization. Since it involves the primary sensory apparatus (vision), as well as the processing power of the human mind, it is a simple and effective medium for gaining insight into complex and/or voluminous information [25]. Hence a work on new effective ways to produce scientific data as a result of numerical simulations would not be complete without some contribution in this area. Chapter 4 explains our novel contribution to the field of scientific visualization, mainly as an immediate tool to gain insight into running numerical simulations.

1.4 ORIGINAL CONTRIBUTIONS IN THIS THESIS

In Chapter 2 we present a novel approach to adapt HPC applications and their users to the Grid, in particular to the EGEE Grid.

First we give a short overview of the EGEE Grid in general and of a minimal Grid site in particular, focusing on the services that have to be available. Next, we tackle the problem of distributed transfer and storage of large data. We present DataLight, a software library that allows programs that produce large amounts of data as output to be quickly and easily deployed on the EGEE Grid. This is an important tool for a broad class of numerical simulations, mainly because the EGEE Grid is one of the largest in the world and, although highly optimized for dealing with both data-intensive and computationally intensive applications, it has been mainly aimed at applications that require large amounts of data as input but produce a relatively small amount of data as output. Hence the need for a tool that could handle the inverse paradigm in a way that is simple, robust, and transparent to the user.

Since DataLight is based on existing Grid data transfer tools, before developing the library we make a careful examination of the available tools in the EGEE middleware gLite. Then we develop a methodology for evaluating the performance of the library and present a general overview of the algorithm that was implemented. Finally, we show and analyze the results of using DataLight, both in a test environment and with production code.

This chapter finishes with the presentation of another tool to adapt HPC applications to the Grid: a web portal that, thanks to its tight integration with DataLight, can warn the user as each result file becomes available for post-processing, thus avoiding the unnecessary delay of having to wait for the whole simulation to finish.

In Chapter 3 we present the implementation of a fully relativistic plasma simulation code on CUDA-enabled devices. We start with an overview of particle-in-cell algorithms, followed by a description of a CUDA system (hardware and software). Then we explain our particular implementation, with details about each part of the code. The purpose was not only to measure what performance gains could be achieved (if any), but also to estimate the amount of effort involved in porting an existing code base to this new environment.

Next we present some results from 2D simulations (Weibel instability) and a comparison against a similar state-of-the-art code (Osiris). We also give an overview of other solutions to this problem (plasma codes on CUDA devices), and finish the chapter with some conclusions.

In Chapter 4 we present two solutions for the visualization and post-processing of data produced by scientific numerical simulations in general, and by plasma simulation codes in particular. One deals with large datasets and implements a database plugin for a parallel visualization program (VisIt [26]). This allows for visualization and exploration of datasets that do not fit in the memory of a single workstation.

Another original contribution in visualization presented in this chapter is an immediate and interactive exploration of the data produced during a simulation, which we call direct visualization. This method can integrate directly with a CUDA implementation and, with minimal influence on the performance of the simulation code, allows for a direct interaction between the scientist and the simulation while the latter is running.

The Appendix has two main sections. Appendix A has the code listings of the most relevant parts of the programs developed during this thesis. This includes the DataLight library, the CUDA implementation of a plasma code, the database plugin for distributed visualization, and parts of the direct visualization implementation.

Appendix B presents our first work in implementing simulation codes on graphics processing units. This eventually proved not worth pursuing with the technology available at the time (GLSL [27] in 2006), but it led to the first development of a direct visualization tool and to the CUDA approach 18 months later. Hence we think it is relevant enough to be presented as part of this work, although in a secondary position.

We end this thesis with a color reproduction of some of the figures and the bibliography.

The work presented in this thesis led to the following publications and communications:

PAPERS IN REFEREED JOURNALS

— P. Abreu, R. A. Fonseca, J. M. Pereira, L. O. Silva, PIC codes in new processors: a full relativistic PIC code in CUDA enabled hardware with direct visualization, in IEEE Transactions on Plasma Science, to be published in January 2011.

PAPERS IN CONFERENCE PROCEEDINGS

— P. Abreu, Real-Time Scientific Visualization and Exploration of Very Large Datasets, in Proceedings of the Cumulus Doctoral Supervision Workshop, IADE, Lisbon, Portugal, 2005.

— P. Abreu, J. M. Pereira, L. O. Silva, Exposing Stream Processors as Grid Services: a GPGPU Example, in Ibergrid — 1st Iberian Grid Infrastructure Conference Proceedings, Santiago de Compostela, Spain, 2007.

— P. Abreu, R. Fonseca, L. O. Silva, Migrating large output applications to Grid environments: a simple library for threaded transfers with gLite, in Ibergrid — 2nd Iberian Grid Infrastructure Conference Proceedings, Porto, Portugal, 2008.

— P. Abreu, R. Fonseca, and L. O. Silva, Streaming the Boris Pusher: A CUDA Implementation, in Advanced Accelerator Concepts: Proceedings of the Thirteenth Advanced Accelerator Concepts Workshop, Carl B. Schroeder, Wim Leemans, Eric Esarey (Eds.), USA 2008, ISBN: 978-0-7354-0617-9.

CONTRIBUTED TALKS

— P. Abreu, Real-Time Scientific Visualization and Exploration of Very Large Datasets, in Cumulus Doctoral Supervision Workshop, IADE, Lisbon, Portugal, 2005.

— P. Abreu, J. M. Pereira, L. O. Silva, A Distributed Memory GPU Implementation of the Boris Particle Pusher Algorithm, in Eurographics Symposium on Parallel Graphics and Visualization, Braga, Portugal, 2006.

— P. Abreu, R. Fonseca, L. Silva, Migrating large output applications to Grid environments: a simple library for threaded transfers with gLite, in Ibergrid — 2nd Iberian Grid Infrastructure Conference, Porto, Portugal, 2008.

— P. Abreu, R. Fonseca, and L. O. Silva, Streaming the Boris Pusher: A CUDA Implementation, Thirteenth Advanced Accelerator Concepts Workshop, Santa Cruz, USA, 2008.

POSTERS IN CONFERENCES

— P. Abreu, J. M. Pereira, L. O. Silva, A Distributed Memory Grid Enabled GPU Implementation of the Boris Particle Pusher Algorithm, in APS 2006 48th Annual Meeting of the Division of Plasma Physics, Philadelphia, USA.

— P. Abreu, J. M. Pereira, L. O. Silva, Exposing Stream Processors as Grid Services: a GPGPU Example, in Ibergrid — 1st Iberian Grid Infrastructure Conference, Santiago de Compostela, Spain, May 2007.

— P. Abreu, J. M. Pereira, L. O. Silva, A Distributed Memory GPU Implementation of the Boris Particle Pusher Algorithm, in AstroGPU 2007 — Workshop on General Purpose Computation on Graphics Processing Units in Astronomy and Astrophysics, IAS, Princeton, USA, November 2007.

— P. Abreu, R. A. Fonseca, J. M. Pereira, L. O. Silva, PIC codes in new processors: a full relativistic PIC code in CUDA enabled hardware with real-time visualization, in 21st International Conference on Numerical Simulation of Plasmas, Lisbon, Portugal, 2009.

— V. K. Decyk, T. V. Singh, P. Abreu, R. A. Fonseca, L. O. Silva, Designing Particle-in-Cell Algorithms for Advanced Computer Architectures: Application to GPUs, 51st Annual Meeting of the APS Division of Plasma Physics, Atlanta, USA, 2009.


CHAPTER 2

COARSE GRAIN PARALLELISM

We saw in Chapter 1 how the deployment of scientific codes on a Grid is one of the possible ways to access enough computing power and memory space to solve some of the grand challenges that scientists face today. However, deployment of an existing high-performance scientific code on a Grid might not be a trivial task, since Grids usually require specific implementations for security, data access and transfer, system load, etc.

In this chapter, we present DataLight, a simple library based on EGEE middleware function calls, in particular lcg_utils, that handles the transfer of large data file sets from the local storage of the Worker Node to an EGEE Storage Element in a way that is simple, robust, transparent to the user, and minimizes the performance impact. The library is very easy to integrate into non-Grid/legacy code and it is flexible enough to allow for the overlapping of several data transfers and computation. We also present a careful analysis of the performance impact in several application scenarios, and we analyze the use of the library in a real case scenario, where it is used in a high-performance plasma simulation code.
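The idea behind this overlap can be illustrated with the minimal producer/consumer sketch below. This is not the DataLight interface itself (see Listing A.1 for that); the queue, the function names and the stub that stands in for the actual lcg_utils call are hypothetical, but the structure is the same: the simulation thread only enqueues file names and returns immediately, while a background thread drains the queue and performs the slow remote copies.

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define MAX_FILES 64   /* queue capacity (no overflow handling in this toy) */

    static char queue[MAX_FILES][256];        /* file names waiting for transfer */
    static int head = 0, tail = 0, done = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t ready = PTHREAD_COND_INITIALIZER;

    /* called by the simulation after writing each output file: returns at once */
    void transfer_enqueue(const char *path)
    {
        pthread_mutex_lock(&lock);
        strncpy(queue[tail % MAX_FILES], path, 255);
        queue[tail % MAX_FILES][255] = '\0';
        tail++;
        pthread_cond_signal(&ready);
        pthread_mutex_unlock(&lock);
    }

    /* background thread: drains the queue; the sleep stands in for the remote
       copy (in DataLight this is where the lcg_utils call would happen) */
    void *transfer_worker(void *arg)
    {
        (void) arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (head == tail && !done)
                pthread_cond_wait(&ready, &lock);
            if (head == tail && done) {
                pthread_mutex_unlock(&lock);
                break;
            }
            char path[256];
            strncpy(path, queue[head % MAX_FILES], 256);
            head++;
            pthread_mutex_unlock(&lock);

            sleep(1);                        /* stand-in for the remote copy */
            printf("transferred %s\n", path);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t worker;
        pthread_create(&worker, NULL, transfer_worker, NULL);

        for (int step = 0; step < 4; step++) {
            usleep(200000);                  /* stand-in for one simulation step */
            char name[256];
            snprintf(name, sizeof name, "diag_%03d.h5", step);
            transfer_enqueue(name);          /* the simulation never blocks here */
        }

        pthread_mutex_lock(&lock);
        done = 1;
        pthread_cond_signal(&ready);
        pthread_mutex_unlock(&lock);
        pthread_join(worker, NULL);
        return 0;
    }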

In addition, we refer to an ongoing effort to streamline the user interaction needed to launch HPC codes on the Grid. We are currently developing a web service for launching and checking jobs on the Grid, and for keeping track of the output files for easy post-processing. This service interoperates easily with DataLight and should make accessing the large data files that result from large scale simulations a simple task for the user.


2.1 INTRODUCTION

2.1.1 THE EGEE GRID

As explained in Chapter 1, the EGEE is one of the largest Grids in the world. It started as a dedicated support Grid for the Large Hadron Collider experiment (LHC), with the apt name of LCG, for Large Hadron Collider Computing Grid. At the time, the middleware used (LCG-2) was specifically developed for it, with some tools taken from the Globus Toolkit 2. This made it a 2nd generation Grid. The LCG was so successful that its scope broadened to include several other scientific research areas beyond the original High Energy Physics, like Fusion and Biomedicine. This move came with the change of name to EGEE, which originally meant "Enabling Grids for eScience in Europe." Due to its present intercontinental scope, "Europe" was dropped from the name, but the short version kept the original four letters.

The EGEE uses its own middleware, called gLite [1], which is based on the Globus Toolkit 4, with other complementary services added. It is a full 3rd generation Grid software stack, implementing most of the WSRF standards [28] and based on the components described in [7] and [11], namely:

Security: This service includes components for authentication, authorization and delegation, based on X509 certificates and proxies (for the delegation part).

Data management: This service includes low-level data transfer tools like GridFTP [29] and GFAL [30], and high-level tools like the Logical File Catalogue (LFC) and the File Transfer Service [31] (FTS). Some of these services and tools will be further explained throughout this chapter.

Execution management: Based mainly on GridICE [32] and the Workload Management System [33] (WMS), this component is responsible for launching, monitoring, and updating the state of jobs on the Grid.

Information services: This service is responsible for keeping track of and publicizing the state of the Grid. It is based mainly on the Monitoring and Discovery System [34] (MDS).

Common run-time: This component ensures that applications deployed on the Grid find a common environment (libraries, compilers, processors, operating systems) to run.


These services are usually hosted on specific machines, which get their typical EGEE names according to their function. For example, a participating site usually has to deploy and maintain the following nodes:

User Interface (UI): The UI is the node where the user logs in to interact with the Grid.

It is the only place where he/she is required to have a local account and where

the user certificates are kept. From this node the user authenticates, queries the

Information Services, launches jobs, queries their status and retrieves results. It

usually runs most of the tools to query the gLite services, like authentication and

certificate proxy request, information services queries, job and file management,

etc.

Computing Element (CE): The CE includes both the computing power of a Grid site

and the gateway to that computing power. These two roles are usually divided
into two different types of machines that supply different functions: the Gate-

keeper (GK) service and the Worker Nodes (WN).

The GK runs services for authentication and authorization, and a local infor-

mation system that publishes information regarding local resources. It also has

to interact with the local batch system, which is responsible for running the re-

ceived jobs. The WN are the nodes where jobs are actually run. They are usually

part of a local cluster managed by a batch system and connect to the GK.

In a nutshell, users from the UI can specify which CE they want to run their jobs

on. Jobs enter a CE for execution through the GK, which interacts with the local

 batch system.

Storage Element (SE): The SE is responsible for the data management services (transfer

and storage) offered at a Grid site. These include, besides the authentication and

authorization services, the Storage Resource Management (SRM), the GridFTP
protocol and the GFAL API. This chapter deals mainly with the data

services in gLite, so we will explain these services more thoroughly later (Sec-

tion 2.3.1).

 Monitoring Box (MON): The MON service collects the information given by different

software sensors installed on the different site machines.

Later in this chapter (Section 2.2) we will give a general overview of a minimal EGEE

Grid site that we have deployed and maintained.

Other nodes deploy more general services, called core services. They are managed

 by larger Resource Centers (RC) and used by all users and smaller sites to allow for

interaction with the global Grid infrastructure. Some of them are:


Top Berkeley Database Information Index (Top-BDII): The Top-BDII collects the informa-

tion provided by the individual sites’ information systems. It provides a centralized place where Grid services and sites can query that information.

File Catalogues (FC): The FC is a database for file services, mapping the logical

names of Grid files to their storage locations and their global identifiers.

Resource Broker (RB): The RB implements the match between the user requirements

and the available resources (from queries to the top-BDII and to the FC). It also

runs the Logging and Bookkeeping service (LB) that stores the status of all jobs

submitted to the RB and can be queried by the user through the UI. In the latest

version of the software it has been replaced by the WMS.

2.1.2 MOTIVATION FOR PLASMA SIMULATION CODES

A very common type of HPC algorithm is one that produces very large data files as

output. An example is the class of algorithms used in physics simulations, where

the motion and interaction of several thousands or millions of particles are simulated.

Codes like this (e.g., PIC [35]) typically require a relatively small amount of data as

input (∼ 1 kB) but can produce a huge amount of output (up to 1 TB).

We have seen that the EGEE Grid is highly optimized for dealing with both data-intensive and computationally intensive applications. In fact, since it was built on top

of the existing LCG, two of its main purposes are to offer high computing power to

applications, and high storage and replication facilities for data. However, this leads

to a certain dominance of Grid applications that require large amounts of data as input

(like the ATLAS or CMS experiments), over Grid applications that produce large

amounts of data as output.

As the EGEE project expanded and matured, other uses for its Grid infrastructure

were devised, and other Grid projects have been developed based on it, like the In-

teractive European Grid [36] (I2G). There has also been an important development of 

tools that facilitate the migration and integration of existing applications to the Grid

and the development of new applications [37, 38, 39]. Some of these applications,

although well suited to be deployed on the Grid, may need to use its in-
frastructure differently from what has been usual until now.

In our case, we have a very clear motivation to work in deploying code on the

Grid, since our research center develops, maintains and deploys two massively paral-

lel plasma simulation codes (Osiris [40] and dHybrid [41]), which have been tuned for

HPC systems ranging from one to several hundreds of thousands of processors. For


this kind of massively parallel codes, deployment on a computing Grid can be an im-

portant advantage, since one has several clusters with different capabilities available, both in computing power (number of processor cores and speed of each core) and in

total memory. It is possible to choose the Grid cluster according to the problem size to

solve, and parameter scanning is also available, where one can launch variations of the

same kind of problem over several Grid clusters. For large problems like these, that

run over several hours or days, the overhead that the Grid middleware might add is

usually not relevant to the total computing time. Moreover, since computing clusters are
usually associated with large storage systems, the usually large output of the simulation

can easily be stored on the Grid, to be accessed later.

Our plasma simulation codes are thus good candidates for application deployment in the EGEE and I2G Grids, not only because they can make good use of the computing

power and memory available, but also because they challenge the current migration

infrastructure, since they usually produce large data sets (several hundreds of GB)

as output for post-processing (e.g., data analysis and visualization). It is important

to make these large amounts of data produced by such codes available as quickly as

possible, ideally still at runtime, so that the user can have almost immediate access to

preliminary results. It is also likely that if the application is writing the results locally

on the Worker Node (WN), this node might not have enough storage for the complete

output data, especially when it is on the order of several hundreds of GB. In this chapter,

we present a solution for handling this output in a way that has the least impact

on the performance of the application deployed on the Grid.

On the one hand, we developed DataLight, a library that handles the transfer of large

data file sets from the local storage of the Worker Node to an EGEE Storage Element

in a way that is simple, robust, transparent to the user and minimizes performance

impact. On the other, we are developing a web portal that allows users to launch HPC

applications on the EGEE Grid and to interactively check on the data files produced.

As they are automatically transferred to the closest SE through DataLight, the portal

is updated and the user notified, so that he can access these files for post-processing

while the job is still running on the Grid.

The rest of this chapter is organized as follows: in Section 2.2 we present the EGEE

Grid site we have deployed and maintained since 2007; in Section 2.3 we describe the

development of the DataLight library, first by presenting an overview of the available

gLite tools for data transfer (page 20), then by exposing the algorithm developed for

the library and the methodology used in evaluating its performance (page 23), and

by analysing the results (page 29); finally, in Section 2.4 we present the current
state of the web portal and point out some possible directions for future development.


2.2 AN EGEE GRID SITE

In 2007 we deployed an EGEE Grid site to allow our research group

to collaborate with the Fusion Virtual Organization [42] of the EGEE Grid. Most of the

time during 2007 it was a testing site. During 2008 it was in production, although with

some significant down time due to hardware issues. 2009 was a year of more stable

production, where the availability and quality of service improved significantly.

The site has the name CFP-IST and is physically located in the Complexo Interdis-

ciplinar building of the IST campus at Alameda Afonso Henriques in Lisbon.

CFP-IST consists of three physically distinct systems that offer the following gLite

services:

— the LCG-CE service, offered on the physical machine with a FQDN of  golp-ce.

ist.utl.pt. This is a Computing Element service (CE) with a default batch system

combination of Torque [43] and Maui [44], that makes available to the Grid the

computing power of this site.

— the site-BDII service, located on the same machine golp-ce.ist.utl.pt. This ser-

vice announces to the top-BDII the occupancy and availability of the CE.

— the UI service, located on the physical machine with a FQDN of  golp-ui.ist.utl.pt. This is the user interface entry point, where users can login, compile, launch

 jobs to the Grid, check the status of those jobs, and retrieve their results. This is the

only system where grid users are allowed to log in.

— the MON service, located on the same machine golp-ui.ist.utl.pt. This is the

monitoring service of the grid site, that collects all the info and status from the CE

and checks that all announced services are available.

— the SE service, located on the physical machine with a FQDN of  golp-se.ist.utl.

pt. This is a storage service of 1.2 TB, available to the Grid. The storage system is the DPM (Disk Pool Manager) based on MySQL.

Behind the CE we have the Worker Nodes (WN), which actually run the jobs sub-
mitted to the CE. At the moment they consist of 10 cores, spread over 6 machines: two

single core machines with 3 GB RAM per core and four dual core machines with 2 GB

per core.

The original configuration in 2007 was just two single core WNs with 2 GB per core

and a SE with 500 GB storage. The configuration got a significant upgrade in 2008 with


more RAM (also for the service systems, which now have 3 GB each) and the four dual
core WNs. In 2009 the storage space of the SE was upgraded to 1.2 TB.

Besides this, during 2009 improvements were also made in power dissipation and
physical space occupation. All the systems were moved from individual off-the-shelf
cases to 3U rack cases fitted in a 42U cabinet. They were also fitted with new power
supplies. This move increased the reliability of the system and the overall availability.

Figure 2.1 shows the physical machines and the cabinet where they are stored.

During 2009 we ran 3001 jobs for the Fusion VO, which corresponds to 3.5% of the

total number of jobs run by the SWE federation, and to a total of 12972 CPU hours. We
also ran jobs for the SWETest VO, the dteam and the ops VO, as is normal for any

node of the south-western Europe (SWE) federation.

FIGURE 2.1 : The EGEE site CFP-IST.


Currently (2010) the site is undergoing an important upgrade. We have started

the process of expanding the WNs to 54 cores with 2 GB RAM per node, and to add a RAID system of 6 TB (expandable to 24 TB). The complete system will move to a

64-bit operating system. The RAID is already in place and being tested, and some of
the WNs are already deployed in 64-bit mode.

2.3 DATALIGHT

In this section we present DataLight, the library referred to previously, which handles the

transfer of large data files to an EGEE Storage Element. We already explained the mo-

tivation behind the development of this tool in Section 2.1.2. We start by presenting

an overview of the available gLite tools for data transfer. Then we will explain the

algorithm developed for the library and the methodology used to evaluate its perfor-

mance. Finally, we will analyze the results, both in different test case scenarios and in

a production code.

2.3.1 AVAILABLE TOOLS FOR DATA TRANSFER

Producing large data sets as a result of simulations running on the EGEE Grid poses
an interesting problem for application development. On the one hand, gLite offers two high-level APIs for data management (GFAL and lcg_util [1]) which are well suited

for the reliable transfer, replication and management of data on the Grid and one low-

level API for data transfer submissions (FTS [31]), which is suited for point to point

movement of physical files. On the other hand, such data management operations

should still occur during simulation time, not just after it, such that the generated data

is made generally available as soon as possible and the WNs are never filled with too

much output data. This requirement can lead to performance degradation (either at

the application level or at the WN level) due to the overhead and slowness of network

transfers when compared to local storage access.

FTS

Although especially designed for reliable transfer of large data sets, FTS [31] only

deals with physical files from point to point. Its asynchronous nature allows for non-

 blocking transfers, but also involves an overhead (job submission) and complexity

(need for channel setting, lack of logical references, need for SURL endpoints) that is

not adequate for porting a user application to the Grid.


GFAL

GFAL [1] offers a POSIX-like API for data management on the Grid. Instead of acting

on local files, the API allows for remote access of Grid files, stored in Storage Elements

(SE), using the gLite data management layer in a way that is transparent to the applica-

tion [45]. However, for the case under study, GFAL is not an appropriate choice,
for the following reasons:

Opaque file formats: Output data produced from simulations is often encapsulated

in a pre-defined file format. If the application does not use POSIX file calls and

instead uses the API defined by the file format specification, it may become
impossible to use GFAL. A good example is the Hierarchical Data Format (HDF) [46] (see Figure 2.2).

Although in this case the API is open and the source is available, which would

allow the developer to modify the existing API implementation to use GFAL

instead of the usual POSIX calls, this task may be too complicated to even at-

tempt, since it usually involves changing highly optimized code, developed and

maintained for several years by a team of developers. There are also other

cases where the source is not available and such a modification is simply not

possible.

[Figure 2.2 appears here: side-by-side software stacks for local POSIX-based HDF access on the WN and remote access to a SE through a modified HDF API, GFAL, and the gLite data management layer.]

FIGURE 2.2: Using GFAL instead of POSIX calls with HDF. The left column shows the software

stack involved in a standard HDF usage. From top to bottom: the application reads and writes

HDF files through standard API calls; the API reads and writes the data locally using POSIX

calls, which are passed to the underlying Operating System. The right column depicts the

necessary layers to add if the GFAL API is to be used: the HDF API would have to be modified

to use GFAL instead of POSIX calls. This would allow for access to a remote Storage Element.


Application stall and memory usage: While the data is being written, it is still in

memory. There are simulation algorithms where it is not possible to partition the result space in time such that results from one simulation cycle are

successively generated and stored, but instead the complete result data for one

simulation cycle must be available in memory for writing at the same time. The

application has to wait for that writing step to finish before it advances to the

next simulation cycle.

Using GFAL in this case would involve a significant performance impact, since a net-

work transfer is several orders of magnitude slower than local storage access,

even more so if we add the overhead involved in using the gLite middleware.
Threading this step would avoid application lock, but would not avoid high memory occupation during a very long period. This, in practical terms, would

have the same effect as an application lock, since the new simulation cycle has to

wait for the previous data to be released in order to use the memory thus freed.
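To make the contrast concrete, the sketch below compares a plain POSIX write on the WN with the equivalent write to a remote SE through GFAL. It is a minimal illustration only: the gfal_open/gfal_write/gfal_close calls follow GFAL's documented POSIX-like interface, but the header name, the SURL format and the error handling are simplified assumptions, and this code is not part of DataLight.

    /* Sketch: local POSIX write vs. remote write through the POSIX-like GFAL API. */
    #include <fcntl.h>
    #include <unistd.h>
    #include "gfal_api.h"          /* assumed GFAL 1.x header */

    int write_local(const char *path, const void *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT, 0644);   /* file on the WN disk */
        if (fd < 0) return -1;
        int n = write(fd, buf, len);
        close(fd);
        return (n == (int)len) ? 0 : -1;
    }

    int write_remote(const char *surl, const void *buf, size_t len)
    {
        /* Same call pattern, but the file lives on a Storage Element. */
        int fd = gfal_open(surl, O_WRONLY | O_CREAT, 0644);
        if (fd < 0) return -1;
        int n = gfal_write(fd, buf, len);
        gfal_close(fd);
        return (n == (int)len) ? 0 : -1;
    }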

LCG_UTIL

The lcg_util API [1] allows applications to copy files between a WN and a SE (among

other Grid nodes), to register entries in the file catalogue and to replicate files between

SEs. This API deals with complete files (unlike GFAL), which makes it adequate for

the transfer of large files. In addition, it offers a higher-level abstraction for data man-agement (unlike FTS), which makes the programmer's task simpler. Hence, it was our
natural choice when evaluating the best method for the transfer of large simulation
data sets from a WN to a SE.
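As an illustration of this level of abstraction, the following helper sketches how a file written on the WN could be copied to a SE and registered in the catalogue by invoking the lcg-cr command that accompanies lcg_util. The helper name, the VO, the destination SE and the logical file name shown here are illustrative assumptions only, and a real tool would parse the command output and retry on failure.

    /* Sketch: copy a local result file to a SE and register it in the catalogue. */
    #include <stdio.h>
    #include <stdlib.h>

    int copy_and_register(const char *local_path, const char *file_name)
    {
        char cmd[1024];
        /* -d selects the destination SE, -l the logical file name (LFN). */
        snprintf(cmd, sizeof cmd,
                 "lcg-cr --vo fusion -d golp-se.ist.utl.pt "
                 "-l lfn:/grid/fusion/results/%s file:%s",
                 file_name, local_path);
        return system(cmd);          /* returns 0 on success */
    }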

OTHER TOOLS

A great amount of work has been done on the ability to replicate and transfer large

data sets on the Grid. In fact, in the first years of this century the European Union

funded the DataGrid project [47], aimed at sharing “large-scale databases, from hun-

dreds of TeraBytes to PetaBytes, across widely distributed scientific communities.”

This project was recently included in the EGEE Grid. As it was not aimed at generating

large amounts of data, but at its replication and accessibility, most of the tools devel-
oped for it are not fit for our purpose, since they require significant code changes at the

application level [48, 49, 50].

Several of the Grid services for data transfer rely on GridFTP [29, 51]. In fact, even

the lcg_util API uses the low-level GridFTP tools. Although other tools have been
proposed for file transfers on the Grid (like [52]), we found that a GridFTP-based tool


met all the requirements of stability, reliability and performance that we needed. Also,
GridFTP is one of the most widely used middleware layers for data transfer on Grid systems, so building our library on top of a software stack based on GridFTP is a first guarantee

for portability.

2.3.2 METHODOLOGY

Our goal is to develop a tool that will ease the deployment on the Grid of applications

that output large amounts of data, especially in the case where that data should be

made available as soon as it is produced. To this end, we have implemented a simple

library for threaded transfers for Grid applications using the gLite middleware. We

have also tested the impact of such a tool in a particular type of Grid application:

a typical numerical simulation run, where result data is generated regularly, after a

certain number of simulation tasks (usually, after the computation of the state of the

experiment after a certain number of time steps) have occurred. This usually involves

making the result data available to the Grid, i.e., transferring the data from the WN

to a SE and creating replicas. The library was also used in a real-case scenario, with a

state-of-the-art code for kinetic plasma simulations.

We have evaluated different application scenarios for incorporating the transfer of
(usually large) data files into a running Grid application. As evaluation criteria, we define the following four measurable time intervals: total turn-over time (T) is the total

(wall clock) time the application is running, including both simulation and data result

transfers; first turn-over time (F) is the (wall clock) time it takes to make the first result

data available to the user (e.g., on a Grid’s SE); simulation step time (S) is the average

(wall clock) time a simulation step takes to complete, until the output data is locally

available on the WN to be transferred; finally, complete simulation time (C) is the aver-

age (wall clock) time the application takes from start until it finishes the last simulation

step (i.e., it is T minus the average time of a data transfer step). Figures 2.3–2.6 depict

these four time intervals in different application scenarios.

We have identified two main application scenarios to evaluate: serial, where data

transfer and computational steps alternate, but the application is either simulating or

transferring, and threaded, where data transfer occurs at the same time as the next com-

putational step, thus overlapping data transfers with computation. The serial scenario

establishes a baseline of the time intervals (defined above) against which the threaded sce-
nario is compared. One would expect that the threaded scenario would

always yield better performance; results show, however, that a serial scenario might

 be the best choice in some situations.


SERIAL SCENARIO

There are two relevant variants of the serial scenario of applications:

Serial Data First (DF): In this scenario, the output data is transferred as soon as it is

ready, thus the application alternates between a simulation step and a transfer

step (Figure 2.3). The transfer step can be seen as part of the simulation,

since the application must finish the transfer of the data results of one step before

it can continue to the next simulation step. We expect F to be shortest in this

scenario, which might be useful for interactive Grid applications (e.g., I2G).


FIGURE 2.3: Serial Data First (DF) scenario variant: the application alternates between a simulation step and a transfer step.

Serial Data Last (DL): In this case, all simulation steps are done first, and data trans-

fers only start after that calculation phase is over (Figure 2.4). This variant sce-

nario is not relevant as a whole for our purpose, since it only starts transferring

when the complete output is available, thus not solving one of the problems we

want to address: the lack of local storage space in the WN. Besides, implement-

ing such a scenario is very simple and can be done using a script that would

transfer all the data after the simulation is done. However, we found this
variant scenario useful to measure the shortest C, that is, the total time the simu-lation would take if the user were running it on a desktop machine, and not on

the Grid. Thus, the difference between C in this scenario and T in other scenarios

is a measurement of the penalty of data transfer on the application.

THREADED SCENARIO

The serial scenarios described previously, although with some minimal advantages,

serve mainly to establish a minimal F (scenario DF) and a minimal C (scenario DL),



FIGURE 2.4: Serial Data Last (DL) scenario: the application completes the simulation and only after that it initiates the transfer of the result data.

in order to evaluate the penalty that a threaded approach might have on those two

parameters. It also establishes a T (total turn-over time) that we aim to improve.

However, the best results might be achieved if we transfer the output of a previous

simulation step while the application is calculating the next one, thus overlapping

data transfer and computation. This allows for the optimal overlap of two tasks that

have minimal influence on each other: the simulation step, which is CPU and memory

intensive, but has little effect on IO, and network transfer, which stresses mainly the IO

(local storage and network) and has a lesser impact on the CPU. This overlapping can

 be achieved by launching a transfer thread as the last step of each simulation cycle for

the network transfer of the data. This data threaded scenario (DT) is the third scenario

we will evaluate.

DATA INTENSITY

Overlapping data transfers and calculation leads to two new variants of the DT sce-

nario, depending on the amount of data generated per simulation step:

Data weak: In this scenario, the transfer time of the data result from a simulation step

is usually shorter than the duration of that step (Figure 2.5). This means that data

transfers can occur as soon as data is available. We expect T  to be much shorter

than in the corresponding DF and DL scenarios, while F will not be much larger

than in the DF scenario, and C will not be much larger than in the DL scenario.

Data intensive: In this scenario, each data transfer takes longer than the average sim-

ulation step (Figure 2.6). As a result, although time gains will still be substantial



FIGURE 2.5: Weak Data Threaded (W-DT) scenario: data transfers occur on a different thread

than the simulation and usually take less time than one simulation step.

when compared to DF and DL scenarios, we expect them to be more influenced

 by the transfer thread.

These two threaded variants can only be compared against the corresponding se-

rial variants, i.e., against a DF and a DL where the simulation step takes the same

amount of time. Thus, we also define the corresponding variants for the serial scenar-

ios: Weak Data First (W-DF), Weak Data Last (W-DL), Intensive Data First (I-DF), and

Intensive Data Last (I-DL).
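Before presenting the measurements, it is useful to keep in mind a rough model of how these scenarios should behave. Assuming n simulation steps of average duration S and an average transfer time τ per output file, and neglecting threading overheads, we expect approximately

\[ T_{\mathrm{DF}} \approx T_{\mathrm{DL}} \approx n\,(S + \tau), \qquad F_{\mathrm{DF}} \approx S + \tau, \qquad F_{\mathrm{DL}} \approx n\,S + \tau, \]
\[ T_{\mathrm{DT}} \approx n\,S + \tau \ \ (\text{data weak},\ \tau < S), \qquad T_{\mathrm{DT}} \approx S + n\,\tau \ \ (\text{data intensive},\ \tau > S). \]

These expressions are only a simplified model used to interpret the results below; the measured values also include the middleware overhead discussed in Section 2.3.4.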

2.3.3 ALGORITHM

We have developed and implemented an application to test the different scenarios

outlined previously. It uses an algorithm that is flexible enough to allow multi-ple transfer threads, optimizing the bandwidth usage at the cost of some

processing time, although that possibility is not analyzed in the current work. In Fig-

ure 2.7 we show a simplified flowchart of the application.

The main loop of the application (top left of Figure 2.7) consists of a simulation cy-

cle followed by a write of the result data to local storage. These two steps are repeated

until the complete simulation finishes and are the only steps of the whole Figure 2.7

that actually belong to the test application. All the other steps are part of the Data-

Light library. The connection between the application and DataLight occurs after the



FIGURE 2.6: Intensive Data Threaded (I-DT) scenario: data transfers occur on a different

thread than the simulation and usually take longer than one simulation step.


[Figure 2.7 appears here: a flowchart with three lanes: the main program (compute results, write locally, add to waiting queue), the queue manager thread (move files from the waiting queue to the transfer queue whenever transfer slots are available), and the transfer threads (transfer the first element of the transfer queue and remove it when done).]

FIGURE 2.7: A simplified flowchart of the algorithm implemented to allow for simultaneous

Grid simulation and network transfer of results.


local write operation. Each local write operation is followed by a call to DataLight,

which adds the file just written to a queue of files waiting to be transferred.

The first time that DataLight is called, that is, the first time that the main thread

writes to this queue, it launches a queue manager thread (right-hand side of Fig-

ure 2.7), which is responsible for managing the waiting queue. This queue manager

dispatches the files on the waiting queue list in a FIFO order to a transfer queue. It

checks whether there are enough transfer slots available (these are bandwidth and system de-

pendent) and, for each file that is moved from the waiting queue to the transfer queue,

it initiates a transfer thread (bottom left of Figure 2.7). Launching the transfer threads

from a manager thread instead of the main thread allows for greater flexibility: they

can be interrupted and resumed, can occur out of order, and more than one can be running at the same time.

Each transfer thread uses a simple algorithm represented on the bottom left of 

Figure 2.7. It is up to each transfer thread to remove the file reference from the transfer

queue and to remove the local file as soon as the transfer finishes.

Finally, when the complete simulation finishes, the main thread sends a signal to the
queue manager thread. The queue manager waits until all queues are empty and then quits, thus finishing
the application.
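A minimal sketch of this queueing logic, using POSIX threads, is given below. The function names (dl_add_file, dl_finish), the slot count and the transfer_file() placeholder (which stands for the actual lcg_util-based copy and the removal of the local file) are illustrative assumptions and not the real DataLight interface; error handling and retries are omitted.

    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_SLOTS 1                        /* transfer slots (system dependent) */

    typedef struct node { char *path; struct node *next; } node_t;

    static node_t *head = NULL, *tail = NULL;  /* waiting queue (FIFO)              */
    static int active = 0, finishing = 0, started = 0;
    static pthread_t manager;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

    extern int transfer_file(const char *path); /* copy to SE, delete local copy    */

    static void *transfer_thread(void *arg)     /* one thread per dispatched file   */
    {
        char *path = arg;
        transfer_file(path);
        free(path);
        pthread_mutex_lock(&lock);
        active--;                               /* release the transfer slot        */
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    static void *queue_manager(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);
        while (!(finishing && head == NULL && active == 0)) {
            if (head != NULL && active < MAX_SLOTS) {   /* a slot is free: dispatch */
                node_t *n = head;
                head = n->next;
                if (head == NULL) tail = NULL;
                active++;
                pthread_t t;
                pthread_create(&t, NULL, transfer_thread, n->path);
                pthread_detach(t);
                free(n);
            } else {
                pthread_cond_wait(&cond, &lock);        /* nothing to do: wait      */
            }
        }
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    void dl_add_file(const char *path)          /* called after each local write    */
    {
        node_t *n = malloc(sizeof *n);
        n->path = strdup(path);
        n->next = NULL;
        pthread_mutex_lock(&lock);
        if (tail) tail->next = n; else head = n;
        tail = n;
        if (!started) {                         /* first call: launch the manager   */
            started = 1;
            pthread_create(&manager, NULL, queue_manager, NULL);
        }
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
    }

    void dl_finish(void)                        /* signal the end of the simulation */
    {
        pthread_mutex_lock(&lock);
        finishing = 1;
        pthread_cond_broadcast(&cond);
        int was_started = started;
        pthread_mutex_unlock(&lock);
        if (was_started)
            pthread_join(manager, NULL);        /* wait for the queues to drain     */
    }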

2.3.4 RESULTS AND ANALYSIS

We have used the library extensively both in test and in production codes. Although

the library allows for multiple transfer threads, this work focuses on the use of a single

transfer thread concurrently with computation. Tables 2.1 to 2.4 list the relevant time

intervals obtained when using our library in a test simulation. Times are averaged

over 20 runs and are measured for data sizes of 8 MB (Table 2.1), 32 MB (Table 2.2),

128 MB (Table 2.3), and 512 MB (Table 2.4). Figures 2.8 and 2.9 show the evolution of 

the two most relevant time intervals defined previously for the different data sizes:

T  (total turn-over time) and F (first turn-over time). Figure 2.8 represents the data

intensive variant, and Figure 2.9 the data weak variant.

All tests were done on WNs with a single-core 2 GHz AMD Athlon 64-bit processor

with 1 GB of RAM running Scientific Linux 3.0.8 and gLite 3.0 (lcg_util 1.5.2). The

SE belongs to the local Grid node and has a similar configuration to the WNs. The

network between the WNs and the SE was a shared 100 Mb/s LAN. No other jobs were

running on the WNs concurrently.

Examining the results, it is interesting to note that, although the DL scenario per-

forms poorly in almost all criteria, the other serial scenario (DF) has better perfor-


time      Data intensive simulation    Data weak simulation
interval  I-DF    I-DL    I-DT         W-DF    W-DL    W-DT
F         5.4     15.4    7            27      477.6   33
T         108     109     119          550     571     542
S         0.51    0.52    0.9          21.9    23.6    26.8
C         103     104     113          545     566     536

All values in seconds.

TABLE 2.1: Average measured time intervals for 8 MB files.

time      Data intensive simulation    Data weak simulation
interval  I-DF    I-DL    I-DT         W-DF    W-DL    W-DT
F         8.5     51      10.1         43      784     55
T         170     213     158          864     1053    945
S         1.96    2.2     2.4          37.1    38.5    47.0
C         163     205     150          858     1038    937

All values in seconds.

TABLE 2.2: Average measured time intervals for 32 MB files.

time      Data intensive simulation    Data weak simulation
interval  I-DF    I-DL    I-DT         W-DF    W-DL    W-DT
F         19      259     59           160     3145    179
T         378     1763    972          3192    4085    3324
S         8.6     9.0     11           149     155     166
C         367     1684    924          3181    4035    3311

All values in seconds.

TABLE 2.3: Average measured time intervals for 128 MB files.

time      Data intensive simulation    Data weak simulation
interval  I-DF    I-DL    I-DT         W-DF    W-DL    W-DT
F         294     2760    423          618     8946    723
T         12818   6926    5957         12357   13147   8988
S         113     127     129          421     436     437
C         5710    6706    5662         12160   12928   9702

All values in seconds.

TABLE 2.4: Average measured time intervals for 512 MB files.


[Figure 2.8 appears here: two log-scale plots of F (s) and T (s) versus data size (MB) for the data intensive scenario, with curves for DF, DL, and DT.]

FIGURE 2.8: Comparing the I-DT scenario (solid line) with the I-DF (dotted line) and I-DL

(dashed line) using the F and T time intervals.

[Figure 2.9 appears here: two log-scale plots of F (s) and T (s) versus data size (MB) for the data weak scenario, with curves for DF, DL, and DT.]

FIGURE 2.9: Comparing the W-DT scenario (solid line) with the W-DF (dotted line) and W-DL

(dashed line) using the F and T time intervals.

mance than the threaded scenario (DT) in several situations. On the data intensive

variant (Figure 2.8), DF and DT are comparable for small data sizes (8 and 32 MB, Ta-

 bles 2.1 and 2.2). But with files of 128 MB, the serial scenario has better T and F than

the threaded one (Table 2.3). This means that the first output data will be available

earlier and the application will finish sooner on the DF scenario for output files of that

size. We think that the main reason for this unexpected behavior is the CPU over-

head in launching a data transfer using lcg_utils, as explained below. For large data

files (512 MB, Table 2.4), the DT scenario shows better performance, which is to be ex-

pected. For the data weak variant (Figure 2.9), DF and DT have similar performance,

except for large data files, where the threaded variant performs better.

The other two time intervals (S and C) show a similar trend (Figures 2.10 and 2.11):

the threaded variant allows for earlier availability of the output data files as these files


get larger, although the DF scenario is usually shorter in all the tests. The S results

also point to an important penalty for transferring small files in gLite: calculating F − S results in at least about 5 s. This is the minimal time it takes to transfer files from the

WN to a SE using the lcg_utils, since it is the difference between the average first turn-

over time and the average simulation time. We obtain these ∼ 5 s both with 8 MB and

with 32 MB files. Only above that size does the transfer take significantly longer. During
this 5 s overhead it is to be expected that the calculation thread will suffer, since
the CPU is also being requested by the transfer thread. Hence it is to be expected that

performance gains of the threaded scenario will be noticeable for data files that take

significantly longer to transfer than the overhead of ∼ 5 s.
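As a quick check against the measurements, taking the 8 MB data intensive case of Table 2.1,

\[ F_{\mathrm{I\text{-}DF}} - S_{\mathrm{I\text{-}DF}} \approx 5.4\ \mathrm{s} - 0.5\ \mathrm{s} \approx 4.9\ \mathrm{s}, \]

which is consistent with the ∼ 5 s minimum gLite transfer overhead quoted above.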

[Figure 2.10 appears here: two log-scale plots of S (s) and C (s) versus data size (MB) for the data intensive scenario, with curves for DF, DL, and DT.]

FIGURE 2.10: Comparing the I-DT scenario (solid line) with the I-DF (dotted line) and I-DL

(dashed line) using the S and C time intervals.

[Figure 2.11 appears here: two log-scale plots of S (s) and C (s) versus data size (MB) for the data weak scenario, with curves for DF, DL, and DT.]

FIGURE 2.11: Comparing the W-DT scenario (solid line) with the W-DF (dotted line) and W-DL

(dashed line) using the S and C time intervals.


2.3.5 APPLICATION DEPLOYMENT

We have deployed our library on the code Osiris [40], which is a state-of-the-art, mas-

sively parallel, electromagnetic fully relativistic Particle-In-Cell (PIC) code for kinetic

plasma simulations. In this code, the interaction of a large number of charged plasma

particles (up to ∼ 10^10) is modeled using a particle-mesh technique especially suited

to this problem called PIC [35]. We will describe the PIC algorithm in more detail in

Chapter 3, Section 3.2. Applications of this code include astrophysical shocks, ultra-

intense laser plasma interactions, and nuclear fusion using fast ignition.

Osiris is written in Fortran 95. In its current implementation, our library presents

a C interface with only two functions exposed: one to add a file to be transferred to

the waiting list (see Figure 2.7) that takes as the only parameter the full path of that

file, and another to tell the queue manager to finish all transfers and quit that takes

no parameters. The user should call the first after each file is saved, and the latter at

the end of the simulation. As mentioned before, currently the URL of the SE is set at

the compile time of the library. To integrate our library with Osiris, thus allowing it

to interface with gLite and produce large output files to the Grid, two small function

wrappers were written to allow direct calling of our interface from Fortran. From the

Osiris point of view it was only required then to call the function that adds a diagnostic

file to the transfer queue after it is written to local storage, and to call the function to

terminate the queue manager at the end of the simulation, so that Osiris would wait for the last transfers to finish. The output files are in the HDF format, which is handled

transparently by our library, as any other file format (see Section 2.3.1). The whole

migration process took only a few hours;1 we expect the linking of our library with

other codes/languages to present a similar difficulty.
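To make this concrete, the fragment below shows what the exposed interface and its use look like on the C side, reusing the illustrative names of the sketch in Section 2.3.3 (the actual symbol names in DataLight and in the Osiris Fortran wrappers are not reproduced here):

    /* Illustrative two-function interface, as described in the text. */
    void dl_add_file(const char *full_path);   /* queue one output file for transfer */
    void dl_finish(void);                      /* flush the queues and shut down     */

    /* What the Fortran-to-C wrappers end up calling: */
    void on_diagnostic_written(const char *hdf_path)
    {
        dl_add_file(hdf_path);    /* right after each diagnostic file is saved */
    }

    void on_simulation_end(void)
    {
        dl_finish();              /* wait for the last transfers to complete   */
    }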

In our simulation tests, the application ran for nearly one hour, producing more

than a hundred output files of very different sizes, varying from a few kB to hundreds

of MB, to a total of ∼ 5.5 GB. Performance gains from moving from a serial (DS) to a threaded (DT) imple-

mentation were very significant: the total turn-over time T was reduced from 3378 s to

2164 s, an improvement of over 35%. We also expect that for longer simulations, with

durations of up to ∼ 100 hours, the results will be even better. Furthermore, for such

simulations, receiving the output data as soon as it is produced is an added advantage

when compared to other HPC batch systems.

In Osiris, a simulation cycle is usually much shorter than the transfer time over

the network. Hence, Osiris could be considered as a “data intensive” application, as

defined on page 25. This means that performance gains using DataLight would be

1This migration process and the writing of the Fortran-to-C wrappers were done by Prof. Ricardo

Fonseca.


limited, since the whole simulation is limited by the network bandwidth. However,

most Osiris simulations do not write out the results at every time step. This would make the simulation very slow (even using local storage) and very expensive in terms

of data storage. Often, diagnostics are produced after a certain number of time steps.

Depending on the simulation, that number can be tens or hundreds. In this case, Osiris

is actually behaving as a “data weak” application, meaning that the overall running

time is dominated by the simulation and not by the data transfer. This can lead to very

significant performance gains, as we saw when we moved from the serial scenario

(DS) to the threaded scenario (DT) in our deployment of DataLight with Osiris.

Deploying DataLight with Osiris overcame our main obstacle in using plasma sim-

ulation codes on the Grid. With this library we can now launch Osiris or dHybrid (or any other plasma simulation code, like QuickPIC [53]) on any Grid site that has gLite

and MPI (like any I2G site) and be certain that the data produced will be promptly de-

livered to a SE of our choice.

2.4 PORTAL DEVELOPMENT AND OTHER FUTURE WORK

One of the main strengths of DataLight is its simplicity. It allows for the overlapping

of computation and data transfers with minimal effort from the developer of the non-
Grid application. The current implementation allows for setting some parameters at compile time, like the URL of the SE (not mandatory). However, we have found it useful

to add an initialization interface to make available such options at run time, using

command line parameters to be included in a JDL file. This will allow for finer user

control of the run-time parameters of the library while keeping its simplicity.

The tests presented here were run with one transfer thread and two streams per

transfer in parallel (the lcg_cr parameter nbstreams). This was chosen in order to limit

the variable parameters during this first test phase, while at the same time optimizing

 bandwidth. However, we want to do more complete tests on the performance of our

library with multiple transfer threads, since it was built from the beginning with this

feature in mind. We expect the performance gains to increase when compared to the

DF scenario until the bandwidth limit is reached.

We have also started development of a Grid web portal for users of the Osiris

and dHybrid codes. The main idea is to make the infrastructure where the code is

running as transparent to the user as possible. The user accesses a web

page, sets the simulation parameters, targets a certain system, and it is up to the portal

to select the correct binary and launch the job. The user does not need to use the

specific commands of the target system.


An important feature of this portal is its integration with DataLight. We have

added the ability to automatically keep track of when the output files are available for post-processing. With this feature, the user can easily take advantage of the much

shorter first turn-over time, F, instead of having to wait for the total simulation time,

C, or the total turn-over time, T . This feature has to be explicitly added to the code

(in our case, to the DataLight library) and to a portal, since the usual gLite job status,

shown in Figure 2.12, cannot be expanded to include intermediate running states.

[Figure 2.12 appears here: the gLite job state machine, with states SUBMITTED, WAITING, READY, SCHEDULED, RUNNING, DONE (ok), DONE (failed), DONE (cancelled), ABORTED, and CLEARED.]

FIGURE 2.12: Job state machine in gLite (adapted from [1]).

In gLite, a job stays in the RUNNING state as long as it is running on the remote

system. There is no standard way for the job to communicate back to the user which different phases it might be running through. In our implementation, the portal is up-

dated during the RUNNING state with intermediate status messages from the DataLight

library. These messages are displayed on the web page that the user is watching and

also stored in a database. In this way the user is aware of the output files becoming

available on the SE that was chosen, and he can start post-processing them while the

simulation is still running on the Grid.


CHAPTER 3

FINE GRAIN PARALLELISM

Kinetic plasma simulations using an electromagnetic particle-in-cell (PIC) algorithm

have become the tool of choice for numerical modeling of several astrophysical and

laboratory scenarios, ranging from relativistic shocks and plasma shell collisions [54,

55], to high intensity laser-plasma interactions [56], with applications to fast ignition

[57, 58], to particle acceleration [59], and to radiation generation [60, 61], among oth-

ers. However, fully relativistic kinetic codes are computationally intensive, and new

computing paradigms are required for one-to-one direct modeling of these scenarios.

In this chapter we look at the use of modern graphics processing units (GPU) for PIC

algorithm calculations, discussing the implementation of a fully relativistic PIC code

using NVIDIA’s CUDA concepts. Details on the algorithm implementation are given,

focusing on grid-particle interpolation and current deposition. Finally, results from a

test simulation of an electron/positron plasma shell collision are presented, focusing

on code validation and performance evaluation.

3.1 INTRODUCTION

There are many plasma physics scenarios where fully relativistic particle-in-cell codes play a key role in providing essential understanding of the underlying physical pro-

cesses involved. Due to the problems they try to solve, PIC implementations tend

to be used in large-scale simulations that require large computing resources, usually

from tens to thousands of computing cores. Furthermore, as we explained in Chap-

ter 1, the last 15 years have witnessed a trend in the HPC scientific community to

move from highly customized shared-memory systems to cheap distributed-memory

systems, often built with commercial off-the-shelf (COTS) computers. When the CPU

annual increase in speed slowed down in the middle of this decade, these cheap clus-


ters started to show their limitations, especially in computational power (high latency),
space and power needs. As the physical problems to solve increase in complexity,
with more memory (capacity and bandwidth) and computational power requirements,
the scientific community has started looking for other candidates from which to

Programmable GPUs have received attention from the scientific computing com-

munity since their introduction on the market at the beginning of 2000 [21]. They

constitute a highly interesting alternative to the usual CPUs due to their high comput-

ing power (recently over a teraflop per GPU chip), their relatively low cost and power

consumption, and their capability for massive data parallelism (several thousand si-

multaneous threads are available on a single chip of a recent GPU). However, despite their computational power, general programs for the GPU (GP-

GPU) suffered a serious implementation difficulty: the algorithm had to be mapped

to Computer Graphics (CG) concepts in order to use the graphics API, and the program

had to be structured in terms of the graphics pipeline. In fact, in 2006 we implemented

one part of a PIC algorithm (the particle pusher, see Sections 3.2 and 3.4.1) using the

GPGPU approach, in particular using the OpenGL Shading Language (GLSL), version

2.0 [27]. We were able to implement the full Boris pusher in GLSL in about 3 months,

and deployed it on a system with an AMD Athlon 64 3200+ processor and an NVIDIA
GeForce 6600. Since the performance on that GPU was similar to a CPU implementa-

further with GLSL and moved later to CUDA. A full report on this GLSL effort can be

found in Appendix B.

To overcome this obstacle of mapping general algorithms to CG concepts, several

approaches were developed (e.g., BrookGPU [20] and Sh [62]). Finally, in 2007 NVIDIA

made available an architecture (hardware, drivers, API, and SDK) for HPC on GPUs

called CUDA (Compute Unified Device Architecture). In particular, “C for CUDA”

offers a high-level interface to program and deploy high-performance applications on

GPUs using a C-like syntax and compiler. This system allows developers to build
GPGPU applications without using CG concepts while still exploiting the high performance
of the GPU. A typical CPU application can thus more easily tap into the GPU's perfor-
mance without requiring CG knowledge from the programmer.

In this chapter we will present an implementation of a fully relativistic 2D PIC

code in CUDA. We will discuss the main problems faced and the solutions we found

which, in most cases, are valid for deploying most particle mesh algorithms in CUDA

systems. We will also discuss performance issues while benchmarking our validation

test, a 2D Weibel instability.


The rest of this chapter is organized as follows: in Section 3.2 we present the basic

PIC algorithms that were ported to CUDA; in Section 3.3 we present a brief overview of CUDA; in Section 3.4 we discuss the implementation of the PIC algorithms in “C

for CUDA”; in Section 3.5 we present the validation of the code by implementing a

Weibel instability simulation; after that, in Section 3.6 we discuss the results and the

performance of the code, followed by a brief description of other approaches to this

problem (Section 3.7); finally in Section 3.8 we will offer an overview and conclusions

of our work.

3.2 BASIC PIC ALGORITHMS

Directly modeling the interaction between all particles in a plasma is only feasible for

a relatively small number of particles, given that this algorithm requires a number of 

operations that scales as O(N^2). Even in a petascale system, a state-of-the-art kinetic
plasma simulation, with a total number of particles of ∼ 10^10 [63], such an algorithm

would require ∼ 1 day for a single time step, out of potentially millions of iterations.

To overcome this limitation, numerical simulations often resort to the particle-mesh

method for calculations. In this method, particles do not interact directly, but rather

through fields defined on a grid. Field values are interpolated at particle positions

to determine the forces acting on each particle, and particles are deposited back onthe grid to advance field values. Although the number of operations depends on the

number of grid cells, the number of particles, and the interpolation schemes used, it is

several orders of magnitude lower than for a particle-particle method.

For plasma physics simulations this algorithm is generally referred to as the par-

ticle-in-cell (PIC) algorithm [64, 35]. Figure 3.1 illustrates a complete electromagnetic

PIC simulation loop. Simulation space is discretized as a regular grid, and the electro-

magnetic (EM) field values are defined on this grid. To advance simulation particles,

we use the Lorentz force acting on the particle, calculated by interpolating the field

values at the particle positions. The electric current resulting from particle motion is

deposited back on the grid, and used to advance the values of the EM field.
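Before detailing each stage, the loop of Figure 3.1 can be summarized schematically as follows; the function and structure names are placeholders used only to fix ideas, not the interfaces of the CUDA implementation discussed in Section 3.4:

    /* Schematic (serial) PIC time loop; placeholder names only. */
    typedef struct { float x[2]; float u[3]; } particle_t;   /* 2D position, 3D momentum */

    void interpolate_fields(const float *E, const float *B, const float x[2],
                            float Ep[3], float Bp[3]);       /* grid -> particle  */
    void push_particle(particle_t *p, const float Ep[3], const float Bp[3], float dt);
    void deposit_current(float *J, const particle_t *p, const float x_old[2], float dt);
    void advance_fields(float *E, float *B, const float *J, float dt);

    void pic_loop(particle_t *part, int n_part, float *E, float *B, float *J,
                  int n_steps, float dt)
    {
        for (int step = 0; step < n_steps; step++) {
            /* J is assumed to be zeroed at the start of each step. */
            for (int i = 0; i < n_part; i++) {
                float Ep[3], Bp[3], x_old[2] = { part[i].x[0], part[i].x[1] };
                interpolate_fields(E, B, part[i].x, Ep, Bp);
                push_particle(&part[i], Ep, Bp, dt);          /* Boris + position push */
                deposit_current(J, &part[i], x_old, dt);      /* uses old and new x    */
            }
            advance_fields(E, B, J, dt);                      /* Yee / FDTD update     */
        }
    }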

In our implementation particles are advanced using a leap frog scheme. Particle

positions x and fields are defined at time step t_i, and are used to calculate the Lorentz

force acting on the particle:

\[ \frac{d\mathbf{u}}{dt} = \frac{q}{m}\left(\mathbf{E} + \frac{1}{c}\,\frac{\mathbf{u}}{\gamma}\times\mathbf{B}\right) \tag{3.1} \]

where u is the generalized velocity (u ≡ γv), γ is the Lorentz factor, q/m is the


[Figure 3.1 appears here: the four stages of the PIC cycle, repeated every time step: field interpolation (E, B)_j → (E, B)_i; particle advance (E, B)_i → p_i → x_i; current deposition x_i^{n-1}, x_i^n → j_j; field advance j_j → (E, B)_j.]

FIGURE 3.1: Overview of a full cycle of an electromagnetic PIC code. Subscript j refers to cell

quantities on the grid; subscript i refers to particle quantities.

charge/mass ratio of the particle, and E and B are the EM field interpolated at the

particle’s position.

The interpolation of the values of  E and B at the particle can be seen as the in-

teraction between the simulation grid and the particles. For linear interpolation, this

requires the values of the 2 nearest grid points in every direction, leading to 4 points in

2D and 8 points in 3D. It should also be noted that, as we will discuss below, the E and

B are defined on a staggered grid (see Figure 3.2), which means that the grid points

required for each field component may be different.

FIGURE 3.2: Staggered 2D grids in PIC codes.

Advancing the particles’ generalized velocity from time step t_{i−1/2} to time step
t_{i+1/2} is done using the so-called Boris pusher [65, 35], which is a second-order ac-
curate, time-centered numerical technique that has been successfully applied in many
simulation algorithms [66] and, in particular, widely adopted in PIC codes. This tech-
nique separates the effects of the electric and magnetic forces into four steps: starting at


a time t_{i−1/2}: i) add half the electric impulse to u, obtaining u⁻; ii) rotate u⁻ with half
the magnetic impulse, obtaining u′; iii) rotate u⁻ with the full magnetic impulse using u′; iv) add the remaining half of the electric impulse.

Using the generalized velocity at time step t_{i+1/2}, we can then advance the par-
ticles’ positions with:

\[ \mathbf{x}_{i+1} = \mathbf{x}_i + \frac{\mathbf{u}_{i+1/2}}{\gamma_{i+1/2}}\,\Delta t \tag{3.2} \]
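The following fragment makes the four Boris sub-steps and the position push of Equation 3.2 explicit for a single particle. It is written in normalized units with c = 1 and placeholder variable names; it is a sketch of the standard scheme described above, not the CUDA kernel of Section 3.4.

    #include <math.h>

    /* u = gamma*v (3 components), x = 2D position, qm = q/m,       */
    /* E and B already interpolated at the particle position.       */
    static void boris_push(float u[3], float x[2], const float E[3],
                           const float B[3], float qm, float dt)
    {
        float um[3], upr[3], up[3], t[3], s[3];

        /* i) half electric impulse: u- = u + (qm dt / 2) E */
        for (int k = 0; k < 3; k++) um[k] = u[k] + 0.5f * qm * dt * E[k];

        /* rotation vectors t and s, using the intermediate Lorentz factor */
        float g = sqrtf(1.0f + um[0]*um[0] + um[1]*um[1] + um[2]*um[2]);
        for (int k = 0; k < 3; k++) t[k] = 0.5f * qm * dt * B[k] / g;
        float t2 = t[0]*t[0] + t[1]*t[1] + t[2]*t[2];
        for (int k = 0; k < 3; k++) s[k] = 2.0f * t[k] / (1.0f + t2);

        /* ii) u' = u- + u- x t (half magnetic rotation) */
        upr[0] = um[0] + um[1]*t[2] - um[2]*t[1];
        upr[1] = um[1] + um[2]*t[0] - um[0]*t[2];
        upr[2] = um[2] + um[0]*t[1] - um[1]*t[0];

        /* iii) u+ = u- + u' x s (full rotation, using u') */
        up[0] = um[0] + upr[1]*s[2] - upr[2]*s[1];
        up[1] = um[1] + upr[2]*s[0] - upr[0]*s[2];
        up[2] = um[2] + upr[0]*s[1] - upr[1]*s[0];

        /* iv) remaining half electric impulse */
        for (int k = 0; k < 3; k++) u[k] = up[k] + 0.5f * qm * dt * E[k];

        /* position push, Equation 3.2 (2D positions, 3D momenta) */
        g = sqrtf(1.0f + u[0]*u[0] + u[1]*u[1] + u[2]*u[2]);
        x[0] += u[0] / g * dt;
        x[1] += u[1] / g * dt;
    }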

To advance the EM field values we rewrite Maxwell’s equations, specifically Fara-

day’s and Ampère’s law, to give (again in cgs units):

\[ \frac{\partial \mathbf{E}}{\partial t} = c\,\nabla\times\mathbf{B} - 4\pi\,\mathbf{j} \tag{3.3} \]
\[ \frac{\partial \mathbf{B}}{\partial t} = -c\,\nabla\times\mathbf{E}. \tag{3.4} \]

Starting from a known set of field values at t = 0, we can then advance the EM

field components at each time step, provided we find the curl of the E and B
fields, and the electric current j resulting from particle motion. The curl operator is

oped by Yee [67]. To improve accuracy, the field values are not defined in the same

points inside a grid cell, but rather on staggered grids as described in Figure 3.2, which effectively results in a second-order spatially accurate algorithm. To improve the time

accuracy, we advance the fields in three steps: starting with E and B defined at a time

t_i, and j defined at a time t_{i+1/2}: i) advance B by half a time step using Equation 3.4;

ii) advance E by a full time step using Equation 3.3 and the intermediate B; iii) ad-

vance the intermediate B by the remaining half time step. This method allows for

second-order accuracy with no memory penalty.
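Written out, the three sub-steps just described correspond approximately to (schematic form, with the curls evaluated by the Yee finite differences and using the cgs form of Equations 3.3 and 3.4):

\[ \mathbf{B}^{\,i+1/2} = \mathbf{B}^{\,i} - \frac{c\,\Delta t}{2}\,\nabla\times\mathbf{E}^{\,i}, \]
\[ \mathbf{E}^{\,i+1} = \mathbf{E}^{\,i} + \Delta t\left( c\,\nabla\times\mathbf{B}^{\,i+1/2} - 4\pi\,\mathbf{j}^{\,i+1/2} \right), \]
\[ \mathbf{B}^{\,i+1} = \mathbf{B}^{\,i+1/2} - \frac{c\,\Delta t}{2}\,\nabla\times\mathbf{E}^{\,i+1}. \]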

The electric current resulting from particle motion also needs to be defined on the
staggered grid. However, because of the finite-difference approximation of the curl
operator, a simple interpolation of qu/γ is not enough, since it will lead to charge conservation errors. To overcome this, current deposition in PIC codes has cur-

rently two widely used approaches: the Villasenor-Buneman method [68] and the

Esirkepov method [69]. Both methods ensure exact (analytical) charge conservation

and for linear interpolation both lead to the same result. However the Villasenor-

Bunemann method requires that particle motion is split into motion segments lying

inside the same cell, which is generally realized with a set of “IF” statements. The

Esirkepov method thus yields better performance [70], and was chosen for our imple-

mentation.


3.3 CUDA OVERVIEW

CUDA [71] is both a hardware and a software architecture for creating general-purpose programs on a GPU. At the hardware level, it is available for NVIDIA's GeForce series (8000 and later), the Tesla systems and the Quadro equipment. At the software level, it has a stack composed of the hardware driver, the C-like API and its runtime, and several higher-level mathematical libraries (CUFFT, CUBLAS).

With CUDA, the GPU is viewed as a compute device capable of executing a very high number of threads in parallel. Hence, it operates as a co-processor to the main CPU, or host: a portion of an application that is executed many times, but independently on different data, can be isolated into a function, called a kernel, that is executed on the device as many simultaneous threads.

CUDA hardware has a set of multiprocessors (Streaming Multiprocessors or SMP),

whose number varies depending on the GPU model. In the current architecture, each

multiprocessor has 8 processors (called Scalar Processors or SP), which set the number

of concurrent threads. Table 3.1 shows the number of SMP and the total number of 

processors for some of the CUDA enabled products we have access to.

GPU model          SMP/SP
Quadro FX1800       8/64
GeForce 9800 GX    16/128
GeForce 8800 GT    14/112
Tesla C1060        30/240
Tesla C870         16/128

TABLE 3.1 : Number of Streaming Multiprocessors (SMP) and total number of Scalar Proces-

sors (SP) for some CUDA enabled hardware.

All the SP cores in a chip have access to the whole RAM memory of the device,

which is called Global Memory. In addition, all SP cores in a SMP have access to a

certain amount of memory in the SMP, called Shared Memory. This shared memory is

usually 32 kB in size and has access times two orders of magnitude faster than global

memory — typically, 400 cycles for global memory and 4 cycles for shared memory.

Finally, each SP also has access to its own local memory (8 kB) and 32-bit registers (32).

Figure 3.3 shows an overview of the different memories available in CUDA devices.


FIGURE 3.3: CUDA memory architecture, both from the hardware and software level. (For a

color version, see Figure C.1 on page 105.)

8/3/2019 PAbreu Thesis

http://slidepdf.com/reader/full/pabreu-thesis 64/140

44 Chapter 3. Fine grain parallelism

Data is transferred from the host memory to the device in three ways: to global memory, where, after being processed and transformed by a kernel, it can be read back to the host; or to constant memory or texture memory, where it can be accessed by each SMP in a read-only, cached fashion. From global memory, a kernel can read the data into shared memory, to be processed by all the threads in an SMP, or into register memory, to be used by each thread individually. The constant and texture caches are also distributed per SMP. Finally, local memory is also available per thread. However, as it is allocated in the device's RAM, it is as slow as global memory.

These two hardware properties (multiprocessor/processor organization and glo-

 bal/shared/local/register memory hierarchy) allow for a very high flexibility in paral-

lel processing power but at the same time force some limitations on programmability.

Flexibility: The grouping of processors in SMP allows for deployment of the same

kernel over multiple GPUs with different capabilities.

The batch of threads that executes a computational kernel is organized as a grid of thread blocks. Each thread block is a batch of threads that can cooperate by efficiently sharing data through the fast shared memory and by synchronizing their execution to coordinate memory accesses. From the programmer's point of view, all threads in a block can cooperate (sharing memory and synchronizing) as if they were running concurrently; blocks in a grid cannot cooperate. This thread granularity allows kernels to run efficiently on devices with different parallel capabilities: a device with little parallelism may run all the blocks of a grid sequentially, while a highly parallel device may run all the blocks in parallel; usually, it is a combination of both.

Constraints: Since the amount of fast shared memory is limited, the programmer has to avoid expensive reads from and writes to global memory. Instead, she/he has to try to map the algorithm to use the limited shared memory.

Also, although thread switching is very inexpensive in this architecture, at any given time only 32 threads are guaranteed to be synchronized in a thread block. This group of synchronized concurrent threads is called a warp. The penalty for accessing global memory can be minimized if one can group the accesses of a half-warp of 16 sequential threads into sequential memory positions. In this case, the global memory accesses are coalesced into one single access that can be masked if other threads are busy with computational tasks.

Managing coalescent global memory access is of paramount importance to achieve high performance in CUDA. In a way, it is similar to managing the different levels of cache in more standard CPU processors.


The general process of launching a computational kernel with CUDA is thus as follows (a minimal host-side sketch is given after the list):

1. Transfer data from host memory to device global memory.

2. Determine the block and grid size.

3. Launch the kernel that:

(a) copies the relevant set of data from global memory to shared or local mem-

ory,

(b) computes on that data, and

(c) transfers the results to global memory.

4. After returning to the host, it is possible to launch another kernel using the data

already residing in the device’s memory.

5. When all kernels have finished, the results are copied from the device’s global

memory to the host’s memory.
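As an illustration of this sequence, the sketch below allocates an array on the device, launches a simple kernel on it, and copies the result back. It is a minimal example only; the kernel (scaleKernel) and its data are hypothetical and not part of our simulation code.

// Minimal host-side sketch of the launch sequence described above.
__global__ void scaleKernel(float *data, int n, float factor)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= factor;              // 3.(b) compute on the data
}

void runOnDevice(float *h_data, int n)
{
    float *d_data;
    size_t bytes = n * sizeof(float);

    // 1. Transfer data from host memory to device global memory.
    cudaMalloc((void **)&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // 2. Determine the block and grid size.
    int block = 64;
    int grid  = (n + block - 1) / block;

    // 3. Launch the kernel on the data residing in global memory.
    scaleKernel<<<grid, block>>>(d_data, n, 2.0f);

    // 4./5. Further kernels could reuse d_data; when all are done,
    // copy the results back from device global memory to the host.
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
}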

3.4 CUDA IMPLEMENTATION

As explained in Section 3.2, a PIC code can be viewed as a process that iterates over

the following steps:

Particle push, where each particle's new position is calculated from the Lorentz equation:

dp/dt = q (E + v/c × B).   (3.5)

This step involves an interpolation of the EM field values at each particle's position.

Current deposit, where the current is calculated from the path of each particle and

its charge. This step involves a deposition of the current contributed by each

particle at each vertex of the grid.

EM fields update, using the current calculated in the previous step and Equations 3.3

and 3.4.

In the next sections we will analyze in detail the implementation of each step as a

streaming algorithm and in particular in CUDA environments.


3.4.1 PARTICLE PUSH

The computational steps of the Boris pusher are as follows:

1. Starting at a given time t, with a time step ∆t, the values of the velocity are given at t − ∆t/2: v_{t−∆t/2}; however, the position of the particles r_t and the electromagnetic fields E_t and B_t are time centered at t.

2. Calculate v_{t+∆t/2} with:

v⁻ = v_{t−∆t/2} + (q/m)(∆t/2) E_t

v′ = v⁻ + (q/m)(∆t/2) (v⁻ × B_t)

v⁺ = v⁻ + [ 2 (q/m)(∆t/2) / (1 + ((q/m)(∆t/2) B_t)²) ] (v′ × B_t)

v_{t+∆t/2} = v⁺ + (q/m)(∆t/2) E_t.

3. Calculate r_{t+∆t} with:

r_{t+∆t} = r_t + ∆t v_{t+∆t/2}.

When used in a PIC code, E and B are defined at each cell's vertices, and they have to be interpolated at r_t.
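A device-function sketch of this velocity update is given below. It is illustrative only: the function name is hypothetical, and the parameter wt is assumed to stand for (q/m)∆t/2, matching the argument passed to CalculateVelocity_d in the kernel listing later in this section.

// Sketch of the Boris velocity update for one particle (illustrative only).
// wt is assumed to be (q/m)*dt/2; v is updated in place from v_{t-dt/2}
// to v_{t+dt/2}, given the interpolated fields E and B at the particle.
__device__ void borisVelocity(float3 *v, float3 E, float3 B, float wt)
{
    // i) half electric impulse: v- = v + wt*E
    float3 vm = make_float3(v->x + wt * E.x, v->y + wt * E.y, v->z + wt * E.z);
    // ii) rotation by half the magnetic impulse: v' = v- + v- x t, with t = wt*B
    float3 t = make_float3(wt * B.x, wt * B.y, wt * B.z);
    float3 vp = make_float3(vm.x + vm.y * t.z - vm.z * t.y,
                            vm.y + vm.z * t.x - vm.x * t.z,
                            vm.z + vm.x * t.y - vm.y * t.x);
    // iii) full rotation: v+ = v- + v' x s, with s = 2t / (1 + |t|^2)
    float f = 2.0f / (1.0f + t.x * t.x + t.y * t.y + t.z * t.z);
    float3 s = make_float3(f * t.x, f * t.y, f * t.z);
    float3 vplus = make_float3(vm.x + vp.y * s.z - vp.z * s.y,
                               vm.y + vp.z * s.x - vp.x * s.z,
                               vm.z + vp.x * s.y - vp.y * s.x);
    // iv) remaining half electric impulse
    v->x = vplus.x + wt * E.x;
    v->y = vplus.y + wt * E.y;
    v->z = vplus.z + wt * E.z;
}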

Due to the lightweight thread architecture in CUDA, it is possible to launch hun-

dreds of thousands of threads at once and let the CUDA implementation serialize the

respective blocks. This means that it is possible to push just one particle per thread

and have as many pusher threads as particles, or to push several particles per thread.

We have implemented a parameter in our code that defines the number of particles

pushed per thread, which allows for a customized balance between calculation, mem-

ory usage, and memory bandwidth.

To make sure that the maximum occupancy of the GPU is reached, we have to estimate the number of threads per block, which depends mainly on the amount of device memory used. In our implementation, we found 64 to be the best choice. Since the number of particles might not be divisible by the number of threads per block, we can either add a test at the beginning of each push kernel that checks whether the current thread number accesses a valid particle index, or do two launches of the kernel: in the first launch, we launch as many thread blocks as possible without going beyond the total number of particles, and the next launch processes the rest of the particles in one incomplete block. We have found the latter approach to be marginally faster than the


former (about 1% better with 131 thousand particles). The following code illustrates the setting of the block and grid size (PPB is a constant set to the number of particle threads per block):

// How many particle threads per block:
m_pblock_dim = make_uint3( PPB, 1, 1 );
m_pgrid_dim  = make_uint3( m_Np / PPB, 1, 1 );
m_pOffset = 0;
m_pRest   = m_Np % PPB;
if ( m_pRest )
    m_pOffset = m_Np - m_pRest;
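These quantities might then drive the two launches roughly as follows (a sketch only; the kernel name PushParticles_k and its offset argument are assumptions, and R and V are the particle arrays of the kernel listing below):

// First launch: all complete blocks (the first m_pOffset particles when
// there is a remainder, otherwise all m_Np particles).
PushParticles_k<<<m_pgrid_dim, m_pblock_dim>>>(R, V, 0);
// Second launch: one incomplete block for the remaining m_pRest particles,
// starting at index m_pOffset.
if (m_pRest)
    PushParticles_k<<<1, m_pRest>>>(R, V, m_pOffset);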

After launching the kernel, each thread pushes one particle. Since it only needs the

position r and velocity v of that particle, those values are stored in the threads’ local

memory for immediate access. On the other hand, the E and B fields are needed for

interpolation for all the threads, so they are kept in the global memory of the device.

The following code illustrates the main kernel implementation:

// Find particle index from thread & block index:
int thread_id = threadIdx.x + threadIdx.y * blockDim.x;
int block_id  = blockIdx.x + blockIdx.y * gridDim.x;
int idx = block_id * blockDim.x * blockDim.y + thread_id;
float3 r, v;    // This particle's pos and vel.
// Copy from global to local memory for speed:
r = R[idx]; v = V[idx];
float3 Bi, Ei;  // Interpolated fields
InterpolateFields_d( r, E, B, PosMin, PosMax,
                     CellsDim, CellsNum, &Ei, &Bi );
CalculateVelocity_d( v, Ei, Bi, wt );
CalculatePosition_d( r, v, PosMin, PosMax, step );
// Store the results back in global memory (still on the device):
R[idx] = r; V[idx] = v;

After the kernel runs, all particles have been pushed one time step. Several imple-

mentation strategies are worth mentioning.

Precision: CUDA hardware only has one double precision arithmetic unit for each

SMP. This means that calculations in double precision have a performance that

is approximately 9 times lower than in single precision. To allow for the use of 


single precision calculations, and thus to fully utilize the computational power

of the CUDA hardware, particle positions cannot be defined using the corner of the simulation box as a reference. Instead, we must keep track of the cell where

the particle is located and define the particle position referenced to the cell cor-

ner. In our case, absolute positions are defined with two arrays: one contains

the cell number the particle is in (integer), and the other stores the normalized

position (a single precision float in the interval [0, 1[) inside the cell, referenced

to the lower left vertex. As particles leave a given cell, their positions are cor-

rected so that they are referenced to the new cell. These extra calculations incur a small performance penalty, but the overall gain from using single precision is far greater.
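The sketch below illustrates this representation in one dimension, assuming the displacement per time step never exceeds one cell (as required by the Courant condition); the function and variable names are illustrative, not those of our code.

// Illustrative 1D update of a cell-referenced position.
// x is the normalized position in [0,1[ inside cell 'cell';
// dx_norm is the displacement during one time step, in cell units.
__device__ void advanceCellPosition(float *x, int *cell, float dx_norm, int ncells)
{
    float xnew = *x + dx_norm;
    int shift = (int)floorf(xnew);     // how many cell boundaries were crossed
    *x = xnew - (float)shift;          // back to [0,1[ referenced to the new cell
    *cell += shift;
    // Periodic boundaries: wrap the cell index around the grid
    // (valid while at most one wrap is needed per time step).
    if (*cell < 0)        *cell += ncells;
    if (*cell >= ncells)  *cell -= ncells;
}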

Memory bandwidth: In order to increase memory access bandwidth, the copy from global memory to shared memory of particle quantities like r and u is done in a coalescent fashion by all threads in a block whenever possible. Memory bandwidth from/to global memory is further increased by sorting the arrays with the particles' data (relative position, cell number, velocity, charge) according to the cell each particle belongs to. If these arrays are sorted in global memory, then all accesses have maximal coherence, in particular while accessing the EM field array for interpolation. This ensures that accessing particle data is always coherent if we have one thread per particle.

The code allows for each thread to loop through several particles, pushing one

at a time. In this case, coherence in reading and writing particle data is lost.

However, there is an important gain in reading E and B fields, since it is likely

that the same values will be used by several particles in the same push thread.

Not only is less global memory access needed per particle, but the computational operations of each push thread also increase relative to the memory access operations (less data access and more computation). Hence, for each particle in a thread a check is done: if the particle belongs to a different cell than the previous particle, new values for the EM field are read from global memory to local (register) memory; if the particle belongs to the same cell, then the EM field values have already been read and can be used for interpolation.

 Hardware interpolation: An interesting feature of modern graphic processors is that

they contain texture units that can do zeroth-order and first-order interpolation.

Thus if we store the EM field as textures, linear interpolation at the particle’s

position can be done by hardware simply as a texture fetch. We have achieved

a three-fold performance increase over a non-texture fetch interpolation. This

approach, however, limits the usage of variables stored as textures to read-only.


It is possible to update a texture with new values by changing its memory buffer

to global memory, but we found the overhead too expensive to be worth it. Nonetheless, and although only linear interpolation is used in this work, it is interesting to note that higher-order interpolations have also been achieved on GPUs in general [72] and on CUDA-enabled products in particular [73].
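As an illustration, a hardware-interpolated fetch of the E field using the legacy texture reference API might look as follows. The texture setup (binding the array, setting cudaFilterModeLinear) is omitted, the staggered offsets of the individual field components are ignored, and all names are assumptions:

// Legacy texture reference bound to the E field stored as a 2D float4 array;
// with cudaFilterModeLinear, tex2D() returns a bilinearly interpolated value.
texture<float4, 2, cudaReadModeElementType> texE;

__device__ float4 fetchE(float x_cell, float y_cell, int cellx, int celly)
{
    // Texel centres sit at integer + 0.5 coordinates, so the particle position
    // (cell index plus normalized offset) maps to (cellx + x_cell + 0.5, ...).
    return tex2D(texE, cellx + x_cell + 0.5f, celly + y_cell + 0.5f);
}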

3.4.2 CURRENT DEPOSIT

Current deposition in PIC codes has currently two widely used approaches: the

Villasenor-Bunemann method [68] and the Esirkepov method [69]. In both cases, a

quasi-particle is viewed as contributing with a charge density ρ to the grid cells in

its vicinity using a weighting function. It is the change in the charge density during the particle push that gives rise to the current assigned to the grid. Hence, the continuity equation

∇ · j + ∂ρ/∂t = 0   (3.6)

is satisfied in both methods. If a particle stays within a cell, both methods are equivalent. But if a particle crosses a cell boundary in one time step, each method has a different approach to calculate the current contribution to each cell.

In the Villasenor-Bunemann method, the particle is split into virtual particles, each one staying in a single cell (Figure 3.4). This method has the advantage of being straightforward to calculate and of not being limited to cartesian geometries. It is, however, limited to first-order interpolation.

FIGURE 3.4: Illustration of particle splitting in the Villasenor-Bunemann current deposition

method.

The Esirkepov method is a generalization of the Villasenor-Bunemann method to higher-order interpolation. There is no particle splitting, hence no dynamic memory allocation during current deposition (which is important for a CUDA implementation); however,


it is limited to cartesian coordinates. Since we were working in cartesian coordinates and were especially interested in higher-order interpolation, this method was chosen for current deposition.

Most of the current deposit algorithm explained in [69], where for each particle we

calculate the amount of current to be deposited for each of the cells crossed during the

last time step motion, has a straightforward implementation in CUDA. Nevertheless,

since each particle will deposit current in up to 16 cells in 2D linear interpolation (see

Figure 3.5), particles from many cells may be contributing to the current in any given

cell.

FIGURE 3.5: Illustration of current deposition in the Esirkepov method. The 16 marked cells have current deposited, although the bottom four and the left four have 0 deposit, as stated in [69].

This step is then likely to produce a significant amount of memory collisions, as

several different threads try to deposit the current in the same cell. These threads can

 be from the same block, in which case one might devise some kind of synchronization

 between them, but they can also belong to different blocks, in which case synchro-

nization is harder to implement in an efficient way. However, several strategies are

available.

One solution is to serialize the current deposition. This is very simple to implement, at the cost of a significant performance penalty, even after parallelizing the deposition

of each current component. The cost depends on the number of threads that would

have to be queued. In CUDA we would be effectively running hundreds of thousands

of threads (from one particle per thread to a few particles per thread) almost in serial,

  just three at a time (one concurrent thread for each space component). The perfor-

mance hit is so significant (about 2 orders of magnitude) that it is more effective to

transfer the data back to the CPU just for the current deposition and transfer it back to

the GPU for the rest of the simulation loop.


Another possibility is to take advantage of the particle sorting per cell referred to previously. If the particles are sorted according to the cell they belong to (before the push), then we can accumulate the current on a one-thread-per-cell basis: each thread

loops through all the particles in a cell and deposits the current at each of the 16/25

vertices. If synchronization is guaranteed, then we can be sure that only one cell is

updated at a time. This approach has the drawback of adding a sorting step at every

simulation cycle. The sorting penalty could be minimized by adding more cells per

thread and thus sorting only every other time step. A similar approach was taken in [74], by dividing the simulation space into slabs that get assigned to thread blocks. Since collisions are avoided within a block [75], no sorting is required in a slab, only between slabs.

A third possibility is to use atomic operations. These operations ensure that there

are no memory collisions by locking a memory position being written by a given

thread. CUDA offers a limited set of atomic operations that act both on global and

shared memory. In particular, it provides an atomic operation that allows accumulation of integer values, although such an operation is not available for floating point data. To overcome this limitation, the atomicExch() function, which is also available for floating point numbers, can be used. Given a memory address and a value, this function exchanges the given value with the one currently in memory.

Using this function, it is possible to devise a simple algorithm that simulates any

atomic operation, as long as that operation is commutative and has a neutral element:

1. Given a work value to accumulate at a certain position in memory:

2. ensure that it is different than the neutral element;

3. atomic-exchange the previous value at that memory position with the neutral

element for the accumulation function;

4. accumulate the previous value with the work value, generating a new value;

5. atomic-exchange the new value at the memory position.

6. If the value we got back is not the neutral element, this means that some other

thread placed this value there as a result of its own calculation. So use this value

as a new work value and go to 3.

7. Otherwise, if it is the neutral element, finish.

Figure 3.6 shows the pseudo-code that illustrates the steps described previously.


function PSEUDOATOMICOPERATION(address, value)
    work_value ← value
    while work_value ≠ neutral_element do
        prev_value ← atomicExchange(address, neutral_element)
        new_value ← operation(prev_value, work_value)
        work_value ← atomicExchange(address, new_value)
    end while
end function

FIGURE 3.6: Pseudo-code for the generalized atomic function.
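In CUDA C, a pseudo-atomic addition of floats built on atomicExch() can then be sketched as below (the function name is illustrative; 0.0f is the neutral element of addition):

// Minimal sketch of a pseudo-atomic float add following Figure 3.6.
__device__ void pseudoAtomicAdd(float *address, float value)
{
    float work = value;
    while (work != 0.0f) {
        // Take whatever is currently stored, leaving the neutral element behind.
        float prev = atomicExch(address, 0.0f);
        float sum  = prev + work;
        // Try to store the accumulated value; if another thread deposited
        // something in the meantime, we get it back and must accumulate it too.
        work = atomicExch(address, sum);
    }
}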

In our first approach, we started by implementing the pseudo-atomic option. There are several advantages to this approach. First, it allows for straightforward code that is easier to port from/to other multi-threaded architectures which might have a similar atomic exchange operation but may not suffer from the same global memory/shared memory issues. Second, it allows us to avoid load balancing issues, since we can deposit the current per particle and not per cell. In a per-cell current deposit, a non-uniform particle distribution would necessarily cause different loads throughout the threads: those dealing with cells with fewer particles would finish before those dealing with cells with more particles. With one particle per thread, or with a more general n-particles-per-thread approach, this non-uniform thread load is avoided, since all threads deal with the same number of particles and thus have the same amount of calculations to perform.

However, one disadvantage is that we have 2 atomic exchanges per particle and

per cell, which might slow down the code seriously if the particles are ordered per

cell inside a block of threads. A simple solution is to initialize the array of the par-

ticles’ positions in a way that minimizes the chance that two consecutive particle’s

indexes refer to particles inside the same cell. Even better, all threads in a warp (the

group of 32 threads that are known to be synchronized by CUDA) should be working

with particles in different cells, preferably cells that are far apart so that there are no

collisions inside that warp. Although it is easy to initialize the particles so that this

condition is met, some care has to be taken to ensure that it still holds throughout

the simulation. As the simulation evolves and particles are moved around, this initial

distribution condition will no longer hold and some current deposition conflicts will

occur, causing a degradation in performance. When that degradation is higher than a

certain threshold, a redistribution of the particles over the threads is required, corre-

sponding to a sorting operation with a certain penalty. However, the threshold for the

sorting can be adjusted so that this penalty is minimized in the overall performance

 balance.


So our final implementation included the pseudo-atomic option, a sorting operation and a stride distance, that is, the distance in particle index between the particles handled by one thread and the next consecutive thread. This guarantees that atomic collisions are minimized during current deposition. In Section 3.6, a discussion of the sorting penalty and the optimal stride value is presented. Figure 3.7

tion.

FIGURE 3.7: Illustration of a stride during particle push and current deposition, so that kernels

in the same warp handle particles in different cells. p is the number of particles per thread and

s is the stride. (For a color version, see Figure C.2 on page 106.)

For the sorting step, we had to consider that, in our implementation, particles’

data is stored as an array, with each value containing all the data corresponding to

one particle: the cell it is in, its position inside the cell, its generalized velocity and its

charge. Hence sorting was done in three steps: the first step collected the cell index

of each particle as an independent array; the second step sorted that array; and the

third step shuffled the full particles’ array the same way as the cell index array. Not

sorting the full particles’ array during the second step has the advantage of avoiding

several non-coalescent memory accesses of particle data. Since the array to sort is just

of unsigned integers, we used the radix sort algorithm and its available implementation in CUDPP, which is considered to be one of the fastest implementations of the radix sort algorithm [76].
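For illustration only, the same three-step scheme can be expressed with Thrust primitives; our implementation uses CUDPP's radix sort instead, and the Particle record shown here is a simplified stand-in for our compact particle data:

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/gather.h>

struct Particle {            // illustrative compact particle record
    float x, y;              // normalized position inside the cell
    unsigned int cell;       // cell index
    float ux, uy, uz, q;     // generalized velocity and charge
};

// keys: one cell index per particle; particles: the per-particle records.
void sortParticlesByCell(thrust::device_vector<unsigned int> &keys,
                         thrust::device_vector<Particle> &particles)
{
    const size_t n = keys.size();
    // Steps 1-2: sort the cell indices, carrying along a permutation index.
    thrust::device_vector<int> perm(n);
    thrust::sequence(perm.begin(), perm.end());
    thrust::sort_by_key(keys.begin(), keys.end(), perm.begin());
    // Step 3: shuffle the full particle array with the same permutation.
    thrust::device_vector<Particle> sorted(n);
    thrust::gather(perm.begin(), perm.end(), particles.begin(), sorted.begin());
    particles.swap(sorted);
}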

3.4.3 EM FIELD UPDATE

This step is implemented with one thread per cell — or one thread per lower left vertex of a cell. Each thread updates its values of the EM field as described previously (Equations 3.3 and 3.4), using the staggered grid of Figure 3.2 and a 3-step finite-differences method. Several kernels had to be implemented and launched in sequence, since this process requires at least two global synchronization


steps: one after the update of the first half of the B field and another after the full update of the E field. Each step is implemented as a kernel, and between them an update of the guard cells is also necessary. An overview of the sequence of the different kernel launches is illustrated in Figure 3.8.

FIGURE 3.8: Illustration of the sequence call of the three different kernels required for a fullEM field update.
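A host-side sketch of this sequence is given below. The kernel names, their argument lists and the exact placement of the guard-cell updates are illustrative; successive launches on the same stream execute in order, which provides the required global synchronization points.

// Host-side sketch of one full EM field update (kernels assumed declared
// elsewhere; names and arguments are illustrative only).
void updateEMFields(float3 *E, float3 *B, float3 *J, float dt,
                    dim3 grid, dim3 block)
{
    AdvanceB_k<<<grid, block>>>(E, B, 0.5f * dt);   // first half step of B
    UpdateGuardFields_k<<<grid, block>>>(B);
    AdvanceE_k<<<grid, block>>>(E, B, J, dt);       // full step of E
    UpdateGuardFields_k<<<grid, block>>>(E);
    AdvanceB_k<<<grid, block>>>(E, B, 0.5f * dt);   // remaining half step of B
    UpdateGuardFields_k<<<grid, block>>>(B);
}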

3.4.4 BOUNDARIES AND GUARD CELLS

The algorithm was implemented using Dirichlet periodic boundary conditions, defin-

ing the simulation space as closed and periodic. This means that the neighbors of 

the cells lying at the lower boundary of the simulation will be the cells at the upper boundary of the simulation and vice-versa, and that particles leaving one side of the simulation box will re-enter on the other side. This is implemented through the use of guard cells, which are extra cells added at the simulation boundaries, where the

values from the grid points at the other boundaries are replicated. This allows for a

much simpler simulation algorithm, where all cells are treated equally, and guard cell

values are updated after the iteration is completed. To this end we have implemented

two different kernels to handle guard cells: one adds the accumulated current density from the guard cells to the corresponding physical cells in the grid, while the other

copies the EM fields from the physical border cells to the guard cells. It should also be

noted that this technique is similar to what is used in distributed memory parallel PIC

algorithms, where guard cells in one computation device correspond to grid points on

a neighboring computation device, and could, in principle, be used in a system where

multiple CUDA devices operate cooperatively.
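As an illustration of the second kernel, a minimal 2D sketch that copies the physical border values into the guard cells along x (assuming a one-cell-deep guard layer and a row-major layout) could be:

// Copy the EM field values of the physical border columns into the guard
// columns, for a grid of nx x ny physical cells with one guard cell on each
// side in x (row-major layout with stride nx+2). Names are illustrative.
__global__ void CopyGuardX_k(float3 *F, int nx, int ny)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // row index
    if (j >= ny) return;
    int stride = nx + 2;
    F[j * stride + 0]      = F[j * stride + nx];     // left guard  <- right border
    F[j * stride + nx + 1] = F[j * stride + 1];      // right guard <- left border
}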

3.5 APPLICATION TO WEIBEL INSTABILITY

We have tested our CUDA simulation code by running a well-known benchmark

problem of the Weibel instability [77, and references therein] in electron-positron plas-

mas. In this simulation, two plasma clouds, one positron plasma cloud and one elec-


tron plasma cloud, initially moving perpendicularly to the simulation plane with some

temperature distribution, are set to evolve self-consistently using a fixed neutralizing ion background. The simulation parameters are as follows:

Generalized velocity: u = γv = (0; 0; ±0.6), with species 1 (electrons) having the positive velocity and species 2 (positrons) the negative velocity.

Thermal distribution: uniform Gaussian with 0.1 width.

Grid parameters: 128 × 128 cells; simulation space: 12.8 c/ω_p × 12.8 c/ω_p.

Time step: ∆t = 0.07, which, together with the simulation space size, satisfies the Courant condition.

In this scenario the evolution of the plasma clouds is governed by the Weibel insta-

 bility, leading to the formation of current filaments that then coalesce leaving behind a

set of empty plasma bubbles (in 2D), as can be seen in Figures 3.9 and 3.10. These bub-

 bles remain stable, with the plasma pressure being balanced by the radiation pressure

of the EM fields trapped inside the bubble. The initial kinetic energy of the particles is

partially transferred to the EM fields, leading to the formation of magnetic field loops

that slowly merge until the simulation reaches a steady state, as seen in Figure 3.11.

Most of the simulations were done with 6 × 6 = 36 particles per species per cell, but this parameter could be changed interactively. The simulations completed successfully on several CUDA-enabled systems. The results were benchmarked against the EM-PIC code OSIRIS [40] in single precision, giving consistent results. The small differences found were related to a different choice of random number generator and to differences in the particle order during current deposition, which lead to minor discrepancies due to numerical round-off.

3.6 RESULTS AND ANALYSIS

The performance of the CUDA PIC code was evaluated using the simulation men-tioned in the previous section. The code was run with two different configurations

of particles: 8 particles per cell (4 for each species), giving a total of 131,072 particles

(further referenced as the 131 k run), and 72 particles per cell (36 for each species),

for a total of 1,179,648 particles (referenced as the 1.2 M run). Table 3.2 presents an

overview of the most relevant results using a Tesla C1060 board for simulation and a

Quadro FX 1800 for rendering.

We also ran our code on the CUDA devices listed in Table 3.1. Our purpose was

to evaluate NVIDIA’s claims on the portability of CUDA code across a wide range


FIGURE 3.9: Charge density of electrons during the formation of a Weibel instability.

FIGURE 3.10: Final state of a full 2D PIC simulation of a Weibel instability (positrons in cyan

and electrons in yellow). (For a color version, see Figure C.3 on page 106.)


FIGURE 3.11: B₁² + B₂² intensity during the formation of a 2D Weibel instability (panels 1–4).


# particles    render (ns/particle)    cycle (ns/particle)
131 k                0.020                    72
1.2 M                0.019                    39

TABLE 3.2: Time measurements of a full PIC CUDA 2D implementation of a Weibel instability.

Grid dimensions are 128 × 128 cells. The system used a Tesla C1060 board for CUDA and a

Quadro FX 1800 for rendering.

of devices of different computational capabilities. Although we noticed obvious per-

formance differences, our code produced consistent results across all equipment. The

Tesla C1060 was the one that presented the best computational performance, so it is

the one used throughout this chapter.

For comparison, we implemented the same code in C and ran it on the host ma-

chine, an Intel Xeon E5420 2.50 GHz with 6 MB cache. The C code was compiled using

gcc version 4.3.2 with full optimizations enabled (-O3). The results, using one core of 

the CPU, were 401 ns per particle on the 131 k run and 379 ns per particle on the 1.2 M

run.

It is interesting to note the difference in performance between the 131 kpart and

the 1.2 Mpart runs. The reason is that below one million particles the Tesla C1060 is not yet fully occupied, and does not have enough threads to hide the memory latency in

transfers between global and shared or register memory.

We emphasize that the rendering time can be considered irrelevant (less than 0.05%

of a cycle). Even if we manage to lower the cycle time by one order of magnitude, it

will still be acceptable if direct visualization is desirable (see Section 4.2).

We have also estimated the timings for the different parts of the algorithm. We note

that only an estimate is available, since it is not possible to completely isolate all the

algorithm parts for benchmark. For example, sorting is required for an effective cur-rent deposition. Since it is possible to time the sorting step without current deposition,

 but not the other way around, we estimate the current deposition step by subtracting

the sorting time from the total of current deposition and sorting. Table 3.3 and Fig-

ure 3.12 show the time duration of the different push steps: reading and writing out

the particle data from/to global memory to/from registers, sorting the particles’ array

according to the cell’s index, interpolating the fields at the particle’s position, calculat-

ing the new velocity and advancing the particle, and depositing the current for each

particle.


Step                     Time (ns/particle)
Read & write                   4.5
Sort                           6.0
Field interpolation            0.66
Vel. & part. advance           0.4
Curr. deposition              23
Total                         34.6

TABLE 3.3: Time estimations for different parts of the particle pusher algorithm in the

1.2 M run.


FIGURE 3.12 : Percentage of time spent at each step of the pusher algorithm. Based on the

values from Table 3.3. (For a color version, see Figure C.4 on page 107.)

The previous timings do not show the field update step, since it depends on the

number of cells and not on the number of particles. The results were 9 ns per cell for

the complete cycle depicted in Figure 3.8. During the complete simulation cycle this

value is not relevant, since usually one has much more particles (millions) than cells

(tens of thousands). For example, in our 1.2 M run, the total time per cycle was 46 ms,

of which just 0.15 ms corresponded to the EM field update.

We have implemented several of the performance parameters exposed in previ-

ous sections and evaluated their impact on code efficiency. The optimal number of 

threads per block depends on the kernel in use. We found that a number of 64 or 128

was optimal for the pusher kernel (that also includes field interpolation and current

deposition), and that 128 was best for the fields update. These values do not depend

on the number of particles or of vertices, but on the amount of shared memory and

registers required per kernel.


To avoid collisions during current deposition, two consecutive threads should not handle particles in the same cell (Section 3.4.2 and Figure 3.7). So we implemented a stride parameter that defines the index distance between the particles handled by

two consecutive threads. Also, to increase the amount of arithmetic operations per

memory access, we have implemented a parameter that defines the number of par-

ticles handled by each thread. Interestingly, we have found these parameters have

similar effects. When handling one particle per thread, a stride of 4× the number of 

particles per cell guaranteed best performance and minimal collisions. Similarly, with

a stride of 1, we have found that handling 4× the number of particles per cell per

thread achieved a similar performance. Figures 3.13 (a) and (b) represent the varia-

tion in performance per particle for different particles per thread (different lines) and

different stride sizes (x-axis), for the 1.2 M run. Figure 3.13 (a) refers to particles per

thread that are submultiples of the particles per cell, and Figure 3.13 (b) to particles

per thread that are multiples of the particles per cell. As already said, the best perfor-

mance is achieved with a stride of 1 and 4 times the number of particles per cell per

thread (39 ns per particle). However, a stride of 4 times the number of particles per

cell with one particle per thread gets a similar performance.

Two other simulation parameters were implemented and tested, but did not con-

tribute to a performance improvement. First, the possibility to serialize the launching

of kernels at the CPU level, and second, not to sort at every time step.


FIGURE 3.13: Evaluation of performance (y-axis, in ns per particle) with stride size (x-axis)

and particles per thread (Ppt, different lines), for the 1.2 M run. Each line uses a number of 

particles per thread (Ppt) which is either a submultiple (a) or multiple (b) of the number of 

particles per cell (Ppc).


It was considered that, to further minimize memory collisions during current deposition, the kernel threads could be launched in batches, thus guaranteeing that the CUDA device was loaded in a manner that minimized the risk of two particles updating the current in the same cell at the same time. This meant that some part of the thread block control was done at the CPU level and not by the CUDA implementation. This led to worse performance than launching all threads at once.

We also considered that not sorting at each time step might save the 6 ns sorting step without incurring too many extra memory conflicts. This was not the case. The memory conflicts caused by not sorting increased the current deposition time by several tens of nanoseconds, so that, in the end, the best performance was achieved by sorting at each time step.

Based on the values of Table 3.3, we also estimated how efficiently the device was

 being used. Table 3.4 shows the estimated number of single precision floating point

operations per second (in GFLOPS) of different parts of the pusher algorithm, and

also of the whole algorithm. A comparison with the peak performance of the Tesla

C1060 [78] is also shown. The same code running on a single core of the Intel Xeon

E5420 (theoretical peak performance of 5 GFLOPS) achieved 1.25 GFLOPS. Hence it

can be estimated that for this algorithm one Tesla C1060 can replace 3 Intel Xeon E5420

(or approximately 11 cores).

Step                     GFLOPS    Efficiency
Fields interp.            103        33%
Vel. & part. advance      200        64%
Curr. deposition           14.1       4.5%
Complete pusher            13.6       4.4%

TABLE 3.4: Performance estimation for some parts and for the complete pusher algorithm. The simulation used the 1.2 Mpart run and 128 × 128 cells running on the Tesla C1060. Based on the values from Table 3.3 and the theoretical peak performance of 311 GFLOPS.

The application of this algorithm to large-scale plasma simulations is ultimately limited by the total memory available on the CUDA device (currently up to 6 GB on a Tesla C2070), with state-of-the-art simulations [63] requiring ∼10¹⁰ particles to be followed for ∼10⁶–10⁷ timesteps, with total memory requirements going up to ∼1 TB. The solution will undoubtedly rely on the use of a (massively) parallel GPU

system, where an ecosystem of CUDA devices operates cooperatively communicating

through some form of interconnect, much like a distributed memory parallel com-

puter. Our implementation of boundary conditions using guard cells, as explained


in Section 3.4.4, is well suited to a spatially decomposed parallel version of our algo-

rithm, where each CUDA device is responsible for a smaller region of the total simu-lation space. Each CUDA device would only need to communicate with neighboring

devices, sending guard cell values and particles crossing the device boundary. However, this implies a device-to-host transfer when sending data to other nodes and a host-to-device transfer when receiving, both at every timestep, which may have a negative impact on overall performance.

3.7 OVERVIEW OF OTHER WORK ON PIC AND CUDA

During the course of this work, several groups presented results in this area; we briefly review them here. Our own effort before CUDA, using GLSL instead, is reported in Appendix B.

Stantchev et al. [74] published in 2008 the first paper on this subject. Their focus

was just on the current deposition part and they developed an algorithm to optimize

that scatter operation. In that algorithm, particles were grouped in bins and spatial

cells were grouped in clusters. Particles in one bin belonged to the same cluster. This

allowed for an easier mapping of particle bins and cell clusters to hardware SMPs,

and avoided full sorting of particles at each time step. However, as there were still

memory conflicts during current deposition, a “thread tagging” technique was used, like the one described in [75].¹

¹In this technique, the value to be deposited has some of its most significant bits reserved as a unique thread identification. This allows each thread to detect whether or not it was able to successfully deposit the value.

In 2009, Burau et al. [79] developed a lightweight relativistic PIC code in CUDA.

The problem of memory collisions was overcome by allowing for violation of charge

conservation, using the data structure and particle sorting technique of [74] (called

linked particle list), and resorting to atomic operations. They also used texture interpo-

lation to accelerate read access to particle data. Finally, they deployed the code in a

small GPU cluster, interleaving MPI communication calls with CUDA kernels.

Decyk et al. [80] implemented an electrostatic spectral PIC code in CUDA, and

expanded it further with parameters that adapted the same code for different multi-core architectures [81]. They also expanded the notion of the linked particle list by

having empty spaces in the particles’ array. This allowed for minimal sorting at the

cost of more memory usage. Based on this work, Kong et al. [82] implemented a fully relativistic PIC code using an adaptation of the Villasenor-Bunemann method [68] for current deposition. In this step, they group grid cells in clusters and use an interleave


technique to avoid conflicts between clusters; inside a cluster of cells, they also use the

“thread tagging” technique of [75].

As explained previously (see Section 3.4.2), in this work we decided to implement the Esirkepov method of current deposition. It has less thread divergence than the Villasenor-Bunemann method and allows for higher-order interpolation. To avoid mem-

ory conflicts, we developed an atomic operation on floats. This avoids the precision

loss of the “thread tagging” technique, caused by using some bits to store a unique

thread tag. We also used a stride distance between consecutive threads, to minimize

atomic collisions. We found this approach to be more flexible and easier to extend to

3D than the bin/cluster approach. We also use a compact memory scheme to store par-

ticles. This requires less memory and allows for faster direct display results (see Sec-tion 4.2.4).

3.8 OVERVIEW AND CONCLUSIONS

We have implemented a fully relativistic 2D PIC code on a GPU using C for CUDA.

We have validated the implementation using a well-known benchmark problem of the

Weibel instability in electron-positron plasmas. The code performs significantly faster

on a Tesla C1060 than on a single core of an Intel Xeon E5420. The main performance

 bottleneck is current deposition (approximately 67% of a simulation cycle), since it in-volves a scattering operation to global memory. We were able to avoid serializing this

step by implementing a pseudo atomic add with floats, which can be extended to in-

clude other kinds of atomic operations, as long as they have a neutral element and are

commutative. To avoid memory access conflicts during the current deposition step,

several strategies where implemented and evaluated. The best results were obtained

with a particle sorting mechanism and by ensuring that consecutive threads deposit

current in different cells. This approach achieved minimal memory conflicts during

current deposition.


CHAPTER 4

MULTI-SCALE VISUALIZATION

4.1 DISTRIBUTED RENDERING WITH VISIT

Visual exploration of data produced by a numerical simulation is a major task in sci-

entific High-Performance Computing. The amount of time a scientist takes to explore

and understand the data generated by a numerical simulation/experiment can be several orders of magnitude larger than the time it takes to produce it. For example, our numerical simulation infrastructure at GoLP (simulation codes and the hardware systems where these codes run) is able to produce terabytes of data in a few days of simulation. Hence, powerful tools for data exploration are essential.

Our research center has developed a visualization infrastructure called VisXD [63]

which is able to handle most of our current visualization needs. It is based on IDL [83] and contains routines for manipulating large data sets and for minimizing the effort of creating high-quality graphics.

However, VisXD is still a single-processor tool. Although the Operating System

(OS) might automatically distribute the computational tasks over the available cores,

VisXD is not able to work in distributed memory systems. That is a major limitation

for the exploration of large data sets, since in that case the display data to be processed

  by the graphic processor (GPU) might not fit into the graphics RAM of a single system.

To overcome such limitations, we did an evaluation of current distributed-memory

visualization tools. We have evaluated SCIRun [84], Paraview [85] and OpenDX [86],

 but finally VisIt [26] was chosen, because it was the only one to offer all the features

needed: parallel data processing and rendering (distributed and shared memory),

scriptable, expandable, high performance and scalability, multi-platform.

We developed a data plugin for VisIt that allows the results produced by our sim-

ulation software (Osiris, dHybrid, QuickPic) to be visualized and analyzed. We have


also set up a set of data post-processing tools in VisIt, like vector intensity and kinetic energy diagnostics, that tightly integrate with VisIt's visualization tools, allowing for an easier exploration of data features that are not directly available in the raw output

data from the simulation.

In addition, we have also set up a distributed memory visualization cluster with 10

cores and 5.5 GB of RAM to test the parallel visualization capabilities of VisIt and eval-

uate its usefulness in exploring data with several gigabytes per time step (Figure 4.1).

FIGURE 4.1 : Part of the hardware for distributed visualization testing. These four machines

represent 8 cores and 4.4 GB of RAM. Another similar node (not in the picture) acted both as

the scheduler and as a computing node, and added the remaining 2 cores and 1.1 GB of RAM.

The same system was also deployed on IST Cluster [87] (an AIX system with 280

cores and 2 GB of RAM per core located at IST) and on the ExPP Cluster [88] (a 40

core PowerPC cluster for the internal use of GoLP at IPFN). These efforts showed

the feasibility of expanding our visualization infrastructure to distributed memory

systems without spending too much effort in a migration and re-factoring process.


We are currently developing a plugin for HDF5 data (the previous data plugin only supported HDF4) and we are expanding the post-processing tools available, namely by integrating particle tracking. This work represents a major first step towards a fully parallelized visualization infrastructure at IST, which is critical for multi-gigabyte data analysis.

4.2 DIRECT VISUALIZATION WITH CUDA

As with any numerical experiment, visualization plays a critical role in particle-in-

cell simulations. This can be a time consuming and computationally demanding task,

that can benefit greatly from running the simulation on the GPU itself. Since most of 

the CUDA enabled devices available are graphic processors (NVIDIA’s GeForce and

Quadro boards), the data is already available in video memory, and can readily be dis-

played, avoiding time-consuming memory transfers from CPU to video memory. Even in CUDA hardware that is not a video card, such as NVIDIA's Tesla boards, the bandwidth to video RAM over PCI Express ×16 is high (up to 8 GB/s), allowing for very efficient visualization. There are also several post-processing visual-

ization diagnostics, such as smoothing or energy calculations, that can make use of the

available computational power on the GPU, bringing an added benefit to doing direct

visualization. In this sense, we have developed code in OpenGL that tightly integrates with our PIC implementation in CUDA (explained in Chapter 3), expanding it to al-

low for the display and exploration of the resulting simulation data. This expanded

system is able to display millions of particles and also to produce several standard

diagnostics (EM field, current, charge density) and other custom ones at interactive

frame rates.

4.2.1 PARTICLE (POINT) VISUALIZATION

As with any particle code, point data is one of the fundamental datatypes to be vi-sualized. As explained in Section 3.4.1, particle positions are stored as normalized

coordinates to the grid cell, using values in the range [0; 1[, together with the particle

cell index, stored as an integer. To display particle (point) data in OpenGL, two ap-

proaches are available: i) for each simulation particle, we could move the coordinate

system to the origin of each cell in the grid and render a point in the particle position

using the corresponding scale factor (the cell dimension length) or ii) we could con-

vert the normalized particle coordinates and cell index to a global position array that

would be used to draw the particles.


In this work, we chose the latter approach, since it offers several benefits over the

first:

— handling different scaling in each dimension is easier and more straightforward than with OpenGL coordinate scaling;

— the conversion of coordinates can be done very efficiently in CUDA, since each particle's data is independent of the others and is already available in the GPU memory;

— after conversion, the data can be displayed very quickly in OpenGL using a Ver-

tex Buffer Object (VBO) [89, chap. 2]. VBOs cannot be used efficiently in the first

method.

Particle visualization is then done using a CUDA kernel that converts from normal-

ized coordinates to absolute coordinates. This conversion is done in a single pass

maximizing memory bandwidth. The resulting data is then used as a VBO to be displayed with a single OpenGL render command; a sketch of this step is given below. All operations are performed in video memory, with minimal overhead. We already showed an example of direct par-

ticle visualization during the simulation of a 2D Weibel instability in Figure 3.10. We

present in Figure 4.2 a similar picture, since that particular simulation step will be

used in the next section as a basis for several diagnostics.
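A sketch of this step is shown below: the registered VBO is mapped into CUDA's address space, a kernel converts the cell-referenced positions into absolute coordinates directly in the buffer, and the buffer is then drawn. The cudaGraphics* interoperability calls shown are the current API (the original implementation may rely on the older cudaGLRegisterBufferObject interface), and all names are illustrative.

#include <GL/gl.h>
#include <cuda_gl_interop.h>

// Kernel: convert (cell index, normalized position) into absolute 2D positions.
__global__ void FillPositionsVBO_k(float2 *vbo, const float2 *relPos,
                                   const int *cellIdx, int np,
                                   int nx, float dx, float dy)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= np) return;
    int cx = cellIdx[i] % nx;          // cell coordinates from the linear index
    int cy = cellIdx[i] / nx;
    vbo[i] = make_float2((cx + relPos[i].x) * dx,
                         (cy + relPos[i].y) * dy);
}

// Host side (error checking omitted): map the registered VBO, fill it, draw it.
void renderParticles(cudaGraphicsResource *vboRes, const float2 *relPos,
                     const int *cellIdx, int np, int nx, float dx, float dy)
{
    float2 *d_vbo; size_t size;
    cudaGraphicsMapResources(1, &vboRes, 0);
    cudaGraphicsResourceGetMappedPointer((void **)&d_vbo, &size, vboRes);
    FillPositionsVBO_k<<<(np + 63) / 64, 64>>>(d_vbo, relPos, cellIdx,
                                               np, nx, dx, dy);
    cudaGraphicsUnmapResources(1, &vboRes, 0);
    glDrawArrays(GL_POINTS, 0, np);    // VBO assumed bound as the vertex array
}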

FIGURE 4.2: Direct visualization of a 2D PIC simulation of the formation of a Weibel instability.

(For a color version, see Figure C.5 on page 107.)


4.2.2 GRID VISUALIZATION AND DIAGNOSTICS

Since PIC is a particle-mesh algorithm, visualization tools for PIC codes are also required to handle grid data, and we have implemented several OpenGL routines for this type of visualization. They take advantage of the interoperability of textures between OpenGL and CUDA. The main idea is to use data in CUDA global memory

(GPU’s RAM) as an OpenGL texture to be rendered and displayed. Again, this allows

for fast rendering, since no transfer of data from the CPU to the GPU is necessary.

We have also implemented several grid-related diagnostics, such as vector field magnitudes or EM field energies, which are particularly well suited to GPU algorithms. Figure 4.3 shows several different diagnostic possibilities, using the 2D Weibel instability of Section 3.5: electron and positron charge density, intensity of the perpendicular component of the current (j₃) and the transverse magnetic field energy B₁² + B₂².
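These grid diagnostics map naturally onto one thread per grid point. For example, the transverse magnetic field energy could be computed with a kernel of the following form (a sketch; names are illustrative):

// One thread per grid point: write B1^2 + B2^2 into a scalar diagnostic array
// that is then displayed as an OpenGL texture.
__global__ void MagneticEnergy_k(const float3 *B, float *diag, int ncells)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= ncells) return;
    diag[i] = B[i].x * B[i].x + B[i].y * B[i].y;
}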

4.2.3 INTERACTIVITY

Interactive control of direct visualization was also implemented, to simplify interac-

tion with the simulation code, and for easy switching between the several available

diagnostics. The routines implementing interactivity must be unobtrusive, so that the simulation performance does not suffer too much, but they also need to give the user sufficient control over the visualization and diagnostics. Our choice of implementation went to OpenGL's GLUT [89], providing a

lightweight, multi-platform API, with minimal impact on performance, and a sim-

ple integration with the CUDA GPU simulation code. Figure 4.4 shows a generic

overview of a simulation cycle, including parsing user events and rendering. To min-

imize the impact on performance, it is also possible not to update the visualization

at every timestep, but only at larger intervals, by doing several simulation cycles per

event parsed, or even to skip rendering and user interaction altogether if maximum

performance is required.
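A minimal sketch of this integration with GLUT is given below; the callback structure follows Figure 4.4, while the simulation and rendering entry points (runSimulationCycle and related functions) are hypothetical placeholders:

#include <GL/glut.h>

// Hypothetical hooks into the simulation and rendering code (not part of GLUT):
void runSimulationCycle(void);
void updateDiagnosticsAndPositions(void);
void renderScene(void);
void handleUserEvent(unsigned char key);

static int cyclesPerFrame = 1;       // N simulation cycles per event-parsing pass

static void idle(void)
{
    for (int i = 0; i < cyclesPerFrame; i++)
        runSimulationCycle();        // push, deposit, field update (Chapter 3)
    glutPostRedisplay();             // ask GLUT to call the display callback
}

static void display(void)
{
    updateDiagnosticsAndPositions(); // CUDA kernels writing into GL buffers
    renderScene();                   // OpenGL draw calls
    glutSwapBuffers();
}

static void keyboard(unsigned char key, int x, int y)
{
    (void)x; (void)y;
    handleUserEvent(key);            // change visualization/simulation parameters
}

void startInteractiveLoop(int argc, char **argv)
{
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGBA);
    glutCreateWindow("PIC direct visualization");
    glutDisplayFunc(display);
    glutIdleFunc(idle);
    glutKeyboardFunc(keyboard);
    glutMainLoop();                  // never returns; events are parsed here
}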

We have added the possibility to control interactively both the visualization pa-

rameters, like the zoom level and the rotation of the whole scene, and some of the

simulation parameters, such as the number of particles per cell or the grid spatial

resolution. These latter features restart the simulation with the new settings. This is especially useful in testing phases, as it allows one to quickly scan the effect of certain code changes for various simulation scenarios. Figure 4.5 shows some of the interactive

features implemented.

We have also evaluated the penalty for adding user interaction during a simula-

tion. Table 4.1 shows the time it takes for our system to simulate different numbers



FIGURE 4.3: Examples of different diagnostics during the formation of a 2D Weibel instability: (a) electron charge density, (b) positron charge density, (c) intensity of the perpendicular component of the current ($j_3$), and (d) transverse magnetic field energy $B_1^2 + B_2^2$.


[Figure 4.4 flowchart: Start OpenGL context → Initialize simulation → Simulation (Do N simulation cycles; Get positions/diagnostics) → Visualization (Render) → Interaction (Get user events → Idle, Change visualization parameters, or Change simulation parameters → Restart simulation), looping back into the simulation.]

FIGURE 4.4: Overview of the integration between the simulation code (middle layer, in green),

and the interaction and visualization code (top and bottom layers, in red and blue respec-

tively). (For a color version, see Figure C.6 on page 108.)


FIGURE 4.5 : Two examples of direct interaction during a simulation: (a) depicting the grid

used in the PIC simulation over the particles, (b) zooming in the center of the image while the

simulation is running. (For a color version, see Figure C.7 on page 108.)


of cycles with and without user interaction, for the 1.2 kpart run referred to in Sections 3.5 and 3.6.

Cycles   No interaction (seconds)   Interaction (seconds)   Penalty
  25            3.764                      3.928             4.36%
 250           18.533                     19.309             4.19%
2500          105.463                    112.099             6.29%
5000          195.112                    208.265             6.74%

TABLE 4.1: Estimation of user interaction penalty on the 1.2 kpart run (see Section 3.6).
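The penalty column is the relative increase in wall-clock time; for the 25-cycle run, for instance, $(3.928 - 3.764)/3.764 \approx 4.36\%$.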

The measured penalty, although significantly larger than that of rendering alone (which is negligible, as shown in Table 3.2 on page 58), is still acceptable. For simulations that run for a large number of cycles it might be desirable to lower the user interaction priority, that is, to run several simulation cycles per user-event parse. The same approach should be used if the cycle time becomes significantly shorter. This possibility has been included in the code and is represented by the third line from the bottom of Figure 4.4.

4.2.4 3D VISUALIZATION

Integrating visualization with 3D simulations is a simple extension of the work presented in the previous sections. In fact, we started this work with a visualization of the 3D version of the particle pusher of Sections 3.2 and 3.4.1. We used forced fields, so that we could take better advantage of the trilinear hardware interpolation of the texture units of the GPU. We have already referred to this approach in Section 3.4.1, page 48. Figure 4.6 shows the result of the simulation. Electrons are initialized in the presence of a static magnetic field that forces them into a circular motion around the axis.

4.2.5 PREVIOUS WORK

Direct visualization of simulation results in other scientific areas has been explored since the first programmable GPUs, for example in fluid and flow simulations [90, 91] and in rigid-body and molecular dynamics simulations [92]. Visualization of plasma simulations is also an important topic, with several applications available [93, 63, 94]. However, to our knowledge this is the first approach to a direct visualization of plasma


simulations, where the user interacts with the GPU-accelerated simulation while it is occurring.

FIGURE 4.6: Direct visualization of a 3D particle pusher on a static EM field. (For a color version, see Figure C.8 on page 109.)

4.3 CONCLUSIONS

We have developed large-scale and small-scale tools for scientific visualization. At the large scale, we built a proof-of-concept visualization cluster that allowed us to use VisIt to achieve parallel rendering and visualization of large data sets. At the small scale, we integrated a visualization and interaction layer into a plasma simulation code. This has the important benefit of allowing direct visualization and interactive exploration of the results. We also estimated the penalty this system imposes on the overall performance of the simulation code. To address the possibility of it becoming significant, we have added the option to interactively lower the priority of the interaction layer, so that the user is able to control the compromise between better performance and more user interaction. This system is easily extended from 2D to 3D, and with more visualization tools beyond the ones already developed.


CHAPTER 5

OVERVIEW AND FUTURE WORK

The relative stability of commodity computing hardware since the mid-1990s has led to the development of highly optimized numerical codes that allow for very large-scale simulations of physical phenomena. When the power wall was hit in the first half of this decade, other alternatives for increasing performance had to be found. These alternatives, however, must be flexible enough to allow the reuse of existing code, which had already achieved high reliability and performance.

One possibility is to further explore the coarse-grained parallelism available with the Grid. To that end, we have developed a lightweight library to manage transfers of large sets of data in a way that is transparent to the user and very easy to integrate with existing applications (see Chapter 2). It is also flexible enough to allow for the overlap of computation and data transfer. We have shown the use of the library in a real-world scenario, where it is integrated with the high-performance plasma simulation code Osiris.

Further development of this approach is desirable. We are already working on extending the interface of the library to increase its flexibility, and on implementing a simpler file-transfer layer that is based on existing tools and exposes a POSIX-compliant interface. This will allow the deployment of non-Grid code without any modifications. A web portal for easy access to output files is also being implemented.

Another possibility to overcome the power wall is to increase the number of local computing cores. We have developed a plasma simulation code that achieves high performance on systems with hundreds of cores and thousands of threads (see Chapter 3). The memory hierarchy of such systems proved to be general enough (small and fast local memory together with large and slow global memory) that the solutions we found can easily be ported to other massively parallel systems. In particular, we have developed a general algorithm for atomic functions that allowed for an opti-


mized scatter operation that would be very costly otherwise.

Work in this area is continuing with the development of a 3D version of the code and with the evaluation of its integration with existing CPU codes through abstract function calls. This will allow the acceleration of existing computationally intensive parts, while keeping the more elaborate features intact.

Data is useless without insight. The more data we are able to generate, the more challenging it is to make sense of it. Visualization is a powerful tool for insight, but it must be coupled with direct and immediate feedback, so that exploration is quick, effective and fun. We have explored both large-scale and small-scale visualization applications (see Chapter 4). On one side, we developed a cluster that allowed for parallel visualization of simulation data. On the other side, we integrated a visualization layer directly into a plasma simulation code. This allowed for a direct and interactive exploration of the simulation results as they were being produced, with a minimal penalty in performance.

A larger-scale visualization infrastructure is now being built at GoLP, partially based on the tools (VTK) we have been using. Direct visualization is also being developed further, with 3D integration and more powerful tools, like interactive color spaces, particle tracking and function diagnostics.

We still have not found a limit to human understanding. With this thesis and the tools developed for it, we hope to have contributed to the exploration of Nature's mysteries, adding a bit more to our capacity to grasp, to be surprised, and to stand in awe.


APPENDIX A

CODE LISTINGS

In this appendix, we present the relevant code developed during this thesis. Section A.1 refers to Chapter 2, especially the DataLight library. Section A.2 refers to Chapter 3, especially the CUDA-relevant parts. Finally, Section A.3 refers to Chapter 4, both to the VisIt plugin and to the direct visualization implementation.

A.1 CODE FROM CHAPTER 2

In this section, we present the relevant C code used to implement the DataLight library. An overview of the algorithm implemented in the library was already given in Section 2.3.3, page 26. As illustrated in Figure A.1 (which is a reproduction of Figure 2.7, placed here for convenience), the library consists mainly of a queue manager thread, which checks for files to be transferred, and of several transfer threads. An application uses the library by adding a file to a waiting queue, using the exported function write_remote.

Listing A.1 shows the implementation of the exported function write_remote, called in the top left of Figure A.1 at the line "Add to waiting queue". The function takes as an argument the path of a file to be transferred to an SE. First, it checks whether the queue manager thread is already active. If not, it initializes the two main data structures, waiting and writing, which are FIFO lists that store the files to be transferred and the ones being transferred, respectively. Then, it launches the queue manager thread, which is mostly implemented in the function lists_manager. Finally, the function adds the file to the waiting list. The lists waiting and writing are represented in Figure A.1 as the "Waiting queue" and the "Transfer queue".


[Figure A.1 flowchart: the main program computes results, writes them locally and adds the file to the waiting queue; the queue manager thread waits until the waiting queue is non-empty and a transfer slot is available, moves the first element from the waiting queue to the transfer queue and starts a transfer thread; each transfer thread transfers the first element of the transfer queue and removes it when the transfer ends.]

FIGURE A.1: A simplified flowchart of DataLight (left and bottom) and its connection with a simulation code (top).


LISTING A.1: The main exported function for DataLight. It is called for each file that should be transferred.

int write_remote(char *fname) {
    if (manager_thread_active == 0) {
        // Initialize the lists:
        writing = list_new();
        waiting = list_new();
        // Initialize thread attr
        pthread_attr_init(&attr);
        pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
        // Launch the manager thread:
        manager_thread_active = 1;
        pthread_create(&manager_thread, &attr, (void *) lists_manager,
                       (void *) NULL);
    }
    // Add the file to be written to the waiting list:
    list_add(waiting, fname);
    return 0;
}

Listing A.2 shows the implementation of the queue manager function, lists_manager. It is mainly a loop that, at each iteration, takes a file name from the waiting list, places it in the writing list, and creates a transfer thread for that file.

Listing A.3 shows the implementation of write_thread, the file transfer function that is run by each transfer thread. The main action is the call to lcg_cr, a gLite data transfer function (see Section 2.3.1 on page 22 and [1]). First, some string parameters have to be set, like the local and remote filenames and the remote path. Although this code is mostly not shown due to space constraints, we show the use of make_lfc_dir, which creates a directory path recursively on the remote SE. Then the actual transfer takes place and, finally, the transferred file is removed from the writing queue.

Finally, in Listing A.4 we show the other exported function, write_finished, that

finishes the transfer threads. It simply sets the variable manager_thread_active to

zero and waits for the manager thread to exit.

Not listed here is the implementation of the waiting and writing data structures.

They are instances of a standard FIFO data type with basic functions for creation

(list_new), addition (list_add), deleting (list_del_element) and moving (list_mv).

Examples of usage are shown in the previous listings.
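Since the FIFO type itself is not listed, the following is a minimal sketch of what such a structure could look like, consistent with the fields visible in the listings (first, last and name). The field names next and lock, the use of strdup, and the per-list mutex are assumptions of this sketch, not necessarily the actual DataLight implementation (which, as Listing A.2 shows, also relies on the external n_threads_mutex).

#include <stdlib.h>
#include <string.h>
#include <pthread.h>

typedef struct list_element {
    char *name;                   // path of the file to transfer
    struct list_element *next;
} list_element_t;

typedef struct {
    list_element_t *first, *last;
    pthread_mutex_t lock;
} list_t;

list_t *list_new(void) {
    list_t *l = malloc(sizeof(*l));
    l->first = l->last = NULL;
    pthread_mutex_init(&l->lock, NULL);
    return l;
}

// Append a file name at the tail of the list.
void list_add(list_t *l, const char *name) {
    list_element_t *e = malloc(sizeof(*e));
    e->name = strdup(name);
    e->next = NULL;
    pthread_mutex_lock(&l->lock);
    if (l->last) l->last->next = e; else l->first = e;
    l->last = e;
    pthread_mutex_unlock(&l->lock);
}

// Move the head of src to the tail of dst (used by the queue manager).
void list_mv(list_t *src, list_t *dst) {
    pthread_mutex_lock(&src->lock);
    list_element_t *e = src->first;
    if (e) {
        src->first = e->next;
        if (src->first == NULL) src->last = NULL;
    }
    pthread_mutex_unlock(&src->lock);
    if (e == NULL) return;
    e->next = NULL;
    pthread_mutex_lock(&dst->lock);
    if (dst->last) dst->last->next = e; else dst->first = e;
    dst->last = e;
    pthread_mutex_unlock(&dst->lock);
}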


LISTING A.2: The queue manager thread.

// Manager thread that dispatches the write threads:
int lists_manager() {
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    // Main loop:
    while ((manager_thread_active == 1) || (waiting->first != NULL)) {
        if ((waiting->first == NULL) || (n_threads >= MAX_THREADS)) {
            // If there's nothing to do or we're too busy, wait:
            usleep(1000);
        }
        else {
            if (waiting->first != NULL) {
                pthread_mutex_lock(&n_threads_mutex);
                list_mv(waiting, writing);
                pthread_create(&(writing_thread[n_threads]), &attr,
                               (void *) write_thread, (void *) writing->last);
                n_threads++;
                pthread_mutex_unlock(&n_threads_mutex);
            }
        }
    }
    // If we get here, then we got the signal to stop.
    // If there's still something going on, wait.
    while (writing->first != NULL) {
        usleep(1000);
    }
    return 0;
}


LISTING A.3: The remote file transfer thread.

int write_thread(list_element_t *element) {
    char in_name[LCG_MAXPATHLEN];
    char out_name[LCG_MAXPATHLEN];
    char vo_name[LCG_MAXPATHLEN];
    char lfn_path[LCG_MAXPATHLEN];
    // Removed code setting up the above strings.
    // Create the lfc dirs:
    lfc_umask(0755);
    make_lfc_dir(lfn_path);
    // Add the file name to lfn_path:
    strcat(lfn_path, "/");
    strcat(lfn_path, basename(element->name));
    // Transfer:
    lcg_cr(in_name, out_name,
           NULL,          // GUID
           lfn_path, vo_name,
           NULL,          // relative path
           2,             // nb streams
           NULL,          // conf_file
           0,             // insecure
           1,             // verbose
           NULL           // actual_guid
    );
    // Remove us from the writing list:
    pthread_mutex_lock(&n_threads_mutex);
    list_del_element(writing, element);
    n_threads--;
    pthread_mutex_unlock(&n_threads_mutex);
    return 0;
}


LISTING A.4: The function that finishes the transfer threads.

int write_finished() {
    if (manager_thread_active == 1) {
        manager_thread_active = 0;
        pthread_join(manager_thread, NULL);
        return 0;
    }
    else {
        fprintf(stderr, "Trying to kill a non-existent manager thread.\n");
        return 1;
    }
}

A.2 CODE FROM CHAPTER 3

In this section we will present the relevant CUDA code developed in Chapter 3. We start with the implementation of the generalized atomic function of Figure 3.6, on page 52, shown in Listing A.5. This is a particular case of the general function, where the operation is the addition of floats.

This atomic function was used for the current deposition. Listings A.6 and A.7 show the implementation of this step using the Esirkepov method and handling several particles per thread. Since the complete function does not fit in one page, we decided to separate it into these two listings.

Listing A.6 shows the initial and final tests, before the current is calculated. These tests are necessary since we might be dealing with more than one particle per thread.

LISTING A.5: Applying the generalized atomic function from page 52 to the addition of floats.

__device__ float myAtomicAdd_d(float *address, float value) {
    volatile float new_val = value;
    while (new_val != 0.0f) {
        new_val = atomicExch(address, 0.0f);
        new_val += value;
        new_val = atomicExch(address, new_val);
        value = new_val;
    }
    return new_val;
}
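Listing A.5 is the float-addition instance; the same exchange-based loop generalizes to any associative and commutative operation with a neutral element. The template form below is our illustration of that pattern (it is not the code of Figure 3.6): identity must be the neutral element of op (0.0f for addition, 1.0f for multiplication), and the file must be compiled as CUDA C++.

template <typename Op>
__device__ float myAtomicOp_d(float *address, float value, Op op, float identity) {
    volatile float new_val = value;
    while (new_val != identity) {
        new_val = atomicExch(address, identity);   // take the current contents
        new_val = op(new_val, value);              // combine with our contribution
        new_val = atomicExch(address, new_val);    // write back; keep any leftover
        value = new_val;                           // re-deposit what we displaced
    }
    return new_val;
}

// Float addition as a functor recovers Listing A.5:
struct AddOp {
    __device__ float operator()(float a, float b) const { return a + b; }
};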


LISTING A.6: Entering and leaving the current deposition function.

__device__ void CurrentDeposit2D_d(...) {
    if (gc != prev_gc) {
        unsigned int from = 0;
        for (int iy = 0; iy < 4; iy++) {
            for (int ix = 0; ix < 4; ix++) {
                int to = prev_gc + (ix - 1) + (iy - 1) * (div.x + 1 + 2);
                myAtomicAdd_d(J1 + to, J1acc[from]);
                J1acc[from] = 0.0f;
                myAtomicAdd_d(J2 + to, J2acc[from]);
                J2acc[from] = 0.0f;
                myAtomicAdd_d(J3 + to, J3acc[from]);
                J3acc[from] = 0.0f;
                from++;
            }
        }
    }

    // Code removed from this listing: implementation of the Esirkepov
    // algorithm used to calculate the current Jtemp originated
    // by one particle, and its addition to the accumulated current J*acc
    // calculated by this thread on this cell.

    // After all the particles in this thread have been accumulated,
    // deposit J*acc to the global J*:
    if ((partToGo == 1)) {  // Last particle in thread, so deposit now:
        unsigned int from = 0;
        for (int iy = 0; iy < 4; iy++) {
            for (int ix = 0; ix < 4; ix++) {
                int to = gc + (ix - 1) + (iy - 1) * (div.x + 1 + 2);
                myAtomicAdd_d(J1 + to, J1acc[from]);
                myAtomicAdd_d(J2 + to, J2acc[from]);
                myAtomicAdd_d(J3 + to, J3acc[from]);
                from++;
            }
        }
    }
}


LISTING A.7: Implementation of the Esirkepov method.

float S0x[] = {0, 1 - p.x, p.x, 0};  float S0y[] = {0, 1 - p.y, p.y, 0};
float S1x[] = {0, 0, 0, 0};          float S1y[] = {0, 0, 0, 0};
float DSx[4], DSy[4], Wx[16], Wy[16], Wz[16];
float Jtemp[16];
// direction = {0, 0} if part. stayed in the same cell
// direction.x = 1 if moved right, -1 if moved left
// direction.y = 1 if moved up, -1 if moved down
S1x[direction.x + 1] = 1 - newR.x;
S1x[direction.x + 2] = newR.x;
S1y[direction.y + 1] = 1 - newR.y;
S1y[direction.y + 2] = newR.y;
for (int i = 0; i < 4; i++) {
    DSx[i] = S1x[i] - S0x[i];  DSy[i] = S1y[i] - S0y[i];
}
for (int iy = 0; iy < 4; iy++) {
    for (int ix = 0; ix < 4; ix++) {
        Wx[iy*4 + ix] = DSx[ix] * (S0y[iy] + DSy[iy]/2);
        Wy[iy*4 + ix] = DSy[iy] * (S0x[ix] + DSx[ix]/2);
        Wz[iy*4 + ix] = S0x[ix] * S0y[iy] + DSx[ix] * S0y[iy]/2 +
                        S0x[ix] * DSy[iy]/2 + DSx[ix] * DSy[iy]/3;
    }
}
// x
float qv = p.q * dim_dt.x;
for (int iy = 0; iy < 4; iy++) {
    Jtemp[iy*4] = -qv * Wx[iy*4];
    for (int ix = 1; ix < 4; ix++) {
        Jtemp[ix + iy*4] = Jtemp[ix - 1 + iy*4] - qv*Wx[ix + iy*4];
    }
}
for (int i = 0; i < 16; i++)
    J1acc[i] += Jtemp[i];
// Code removed for y and J2acc
// z
qv = p.q * p.uz * rgamma;
for (int i = 0; i < 16; i++)
    Jtemp[i] = qv * Wz[i];
for (int i = 0; i < 16; i++)
    J3acc[i] += Jtemp[i];


As explained in Sections 3.4.2 and 3.6, we implemented a stride parameter and a "particles per thread" parameter. This means that one thread might handle more than one particle, and that those particles might not belong to the same cell. Hence the initial code checks whether the particle we are currently handling is the first one of a cell. If it is, we must deposit the current accumulated so far and reset the accumulators to zero. The final test checks whether we are handling the last particle in this thread. In that case, a similar action has to be taken: deposit the accumulated current. In both cases we have to use the atomic function of Listing A.5.

Listing A.7 shows the actual implementation of the Esirkepov algorithm. It is a straightforward implementation. Since this is a 2D code, the z component is handled slightly differently from x and y.

Advancing the EM fields is mostly a direct implementation of the Yee technique

(see Section 3.2), but we use shared memory to optimize bandwidth. Listing A.8

shows the magnetic field update. CTPB is a constant that defines the number of threads

per block for this kernel.

The EM-field interpolation is shown in Listings A.9 and A.10. Because of the staggered grid, offsets have to be taken into account. They are illustrated in Figure A.2, which is a copy of Figure 3.2.

Listing A.9 shows the interpolation using textures. It is a matter of calculating the normalized texture indexes and doing a texture fetch, taking the offsets of the staggered grid into account.

For the non-texture interpolation, Listing A.10 is a bit more complicated. The main issue is not the interpolation per se, but taking advantage of the fact that each thread might deal with several particles. In that case, most of the particles might belong to the same cell, and the EM-field values need to be read only once. So we start by checking whether we are dealing with a new cell and, if we are, we read in the necessary values. After that, we decide which values we need for the interpolation. Because of the staggered grid, not all values are needed by all the particles in a cell; they depend on which quadrant the particle belongs to. See Figure A.2 for a depiction of the quadrants in a cell.

Putting this all together, we show in Listing A.11 the main kernel for the particle push. The first three lines calculate the global index of the thread, called idx. This is standard CUDA practice. The next lines, starting with #ifdef STRIDE, adjust idx, which is actually the particle index, to take the stride into account. The variable Tpt is the total number of threads and Ppt is the total number of particles pushed per thread.
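As a small worked example with illustrative values (not taken from the thesis runs), let Tpt = 8, stride = 4 and Ppt = 2. Thread i first computes stride_idx = 4i, then the offset given by the integer division (4i)/8, then stride_idx = (4i mod 8) + offset, and finally multiplies by Ppt. For i = 0…7 this gives starting particle indices 0, 8, 2, 10, 4, 12, 6, 14: every one of the 16 particles is covered exactly once, while neighbouring threads of a warp start far apart in the cell-sorted particle array and therefore, in general, work on different cells.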


LISTING A.8: Magnetic field update using shared memory.

__shared__ float E1sh[CTPB];
__shared__ float E2sh[CTPB+1];
__shared__ float E3sh[CTPB+1];
__shared__ float E1Ush[CTPB];
__shared__ float E3Ush[CTPB];
__shared__ float B1sh[CTPB];
__shared__ float B2sh[CTPB];
__shared__ float B3sh[CTPB];
// Read in the values:
E1sh[thread_id] = E1[idx];
E2sh[thread_id] = E2[idx];
E3sh[thread_id] = E3[idx];
if (thread_id == block_size - 1) {
    E2sh[thread_id + 1] = E2[idx + 1];
    E3sh[thread_id + 1] = E3[idx + 1];
}
E1Ush[thread_id] = E1[idx + (div.x + 1 + 2)];
E3Ush[thread_id] = E3[idx + (div.x + 1 + 2)];
B1sh[thread_id] = B1[idx];
B2sh[thread_id] = B2[idx];
B3sh[thread_id] = B3[idx];
__syncthreads();
// Update half B with E
B1sh[thread_id] += -dt2dr.y * (E3Ush[thread_id] - E3sh[thread_id]);
B2sh[thread_id] +=  dt2dr.x * (E3sh[thread_id + 1] - E3sh[thread_id]);
B3sh[thread_id] += -dt2dr.x * (E2sh[thread_id + 1] - E2sh[thread_id])
                 +  dt2dr.y * (E1Ush[thread_id] - E1sh[thread_id]);
// Write out the values:
B1[idx] = B1sh[thread_id];
B2[idx] = B2sh[thread_id];
B3[idx] = B3sh[thread_id];


FIGURE A.2: The staggered 2D grid used for EM-field interpolation.

LISTING A.9: EM-field interpolation at a particle's position using textures.

// Interpolation with textures:
// gi and gj are the cell indexes, including guard cells.
float s, t;  // Normalized texture indexes.
s = ((gi + p.x) * dim.x - pos_min.x) / (pos_max.x - pos_min.x);
t = ((gj + p.y) * dim.y - pos_min.y) / (pos_max.y - pos_min.y);
float offx, offy;  // Staggered grid offsets
offx = 0.5f / (div.x + 1 + 2);
offy = 0.5f / (div.y + 1 + 2);
Ei1 = tex2D(m_texE1, s - offx, t);
Ei2 = tex2D(m_texE2, s, t - offy);
Ei3 = tex2D(m_texE3, s, t);
Bi1 = tex2D(m_texB1, s, t - offy);
Bi2 = tex2D(m_texB2, s - offx, t);
Bi3 = tex2D(m_texB3, s - offx, t - offy);


LISTING A.10: EM-field interpolation at a particle's position without textures.

if (g != prev_g) {  // new cell, so read new E & B values
    // E1
    E1center = E1[g];
    E1left = E1[g-1];
    E1right = E1[g+1];
    E1up = E1[g+v];
    E1leftup = E1[g-1+v];
    E1rightup = E1[g+1+v];
    // Similar code for B1, E2, B2, E3, B3
    ...
}
float *E1hor, *E1diag;   float *B2hor, *B2diag;
float *E2vert, *E2diag;  float *B1vert, *B1diag;
float *B3hor, *B3vert, *B3diag;
if (p.x <= 0.5f) {  // left values
    w1 = p.x + 0.5f;
    E1hor = &E1right;
    E1diag = &E1rightup;
    B2hor = &B2right;
    B2diag = &B2rightup;
    B3hor = &B3right;
}
else {  // right values
    w1 = 1.5f - p.x;
    E1hor = &E1left;
    E1diag = &E1leftup;
    B2hor = &B2left;
    B2diag = &B2leftup;
    B3hor = &B3left;
}
// Similar procedure for y
if (p.y <= 0.5f) {  // bottom values
    ... } else { ...
}
// Interpolate:
Ei1 = w1 * ((1-p.y) * E1center + p.y * E1up)
    + (1 - w1) * ((1-p.y) * *E1hor + p.y * *E1diag);
// Similar procedure for Bi2, Ei2, Bi1, Ei3, Bi3


LISTING A.11: The kernel for a particle push.

unsigned int thread_id = threadIdx.x + threadIdx.y * blockDim.x;
unsigned int block_id = blockIdx.x + blockIdx.y * gridDim.x;
unsigned int idx = block_id * blockDim.x * blockDim.y
                 + thread_id + offset;
#ifdef STRIDE
unsigned int stride_idx = idx * stride;
unsigned int stride_offset = stride_idx / Tpt;
stride_idx = stride_idx % Tpt + stride_offset;
#else
unsigned int stride_idx = idx;
#endif  // NSTRIDE
idx = stride_idx * Ppt;
// Some initializations removed.
// ...
unsigned int prev_gc = 0xffffffff;
unsigned int partToGo = Ppt;
while (partToGo) {
    // Removed code: read values from global to local memory.
    // ...
    // Calculate:
    InterpolateFields2D_d(...);
    CalculateVelocity2D_d(...);
    UpdatePositionAndCurrent2D_d(...);
    // Removed code: write values from local to global memory.
    // ...
    // Prepare for the next particle:
    partToGo--;
    prev_gc = gc;
}  // while (partToGo)


Also interesting is the code that initializes the CUDA device. Usually, the CUDA driver is initialized when the first CUDA function is called. However, on systems with multiple devices it is important that the user be able to select which device to use. We have implemented a function, shown in Listing A.12, that, when called with a negative device number, simply selects the first available CUDA device; it also accepts as a parameter a specific device to be used throughout the simulation.

LISTING A.12: Initialization of a CUDA device.

void init_cuda(int device) {
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount < 1) {
        printf("CUDA:\tNo devices found.");
        return;
    }
    for (int i = 0; i < deviceCount; i++) {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, i);
        printf("CUDA:\tDevice %d: %s.\n", i, deviceProp.name);
        float deviceCapability = deviceProp.major + 0.1*deviceProp.minor;
        if ((device < 0) && (deviceCapability >= 1.0)) {
            device = i;  // Select the first valid device.
            // Device is now >= 0. We don't break the loop so that
            // we list all the available devices.
        }
    }
    if ((device < 0) || (deviceCount < device))
        printf("CUDA:\tNo valid device selected.\n");
    else {
        printf("CUDA:\tMaking device %d active.\n", device);
        cudaSetDevice(device);
    }
}

A.3 CODE FROM CHAPTER 4

In this section we will present the relevant parts of the visualization code from Chap-

ter 4. We start with code from the VisIt plugin, explained in Section 4.1, and then move

to the direct visualization code from Section 4.2.


A.3.1 VISIT PLUGIN

Implementing a data plugin for VisIt is mostly straightforward, so we will not list most of the code. We did add some code to automatically detect whether the data is vectorial or scalar. We used the fact that the output files from Osiris and dHybrid create a well-known directory tree. When the user points VisIt to one file to be opened, the plugin automatically scans the directory structure looking for other possible components. If other component files are found, it is assumed that we are dealing with a vector variable. In that case, not only is each component presented to the user as a scalar variable, but the full vector variable is also made available. Listing A.13 shows the vector data check for the magnetic field files.

Another interesting feature programmed into the plugin is the spatial domain decomposition. Usually, the output from Osiris, dHybrid or Quickpic has only one spatial domain. However, VisIt allows several domains to be defined. In this case, if all the domains do not fit into memory for visualization, they are processed by VisIt either in a serial fashion or, if running in a parallel environment, in parallel across the available processors. The plugin only has to detect the available processors (we did that using the standard MPI functions MPI_Comm_size and MPI_Comm_rank) and to perform the domain decomposition.

Listing A.14 creates the domains for the serial case. First, the current domain is split in half across the largest axis. This procedure is repeated recursively until every domain has at most maxCellsPerDomain cells; at each pass we thus halve the domain size and double the number of domains.
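As a concrete, hypothetical example (the numbers are ours, not from the thesis), take a $1024 \times 1024$ grid with maxCellsPerDomain $= 2^{18}$. The first pass splits one axis in two, leaving domains of $512 \times 1024 = 2^{19}$ cells, which is still too many; the second pass splits the other axis, and the recursion then stops with four domains of $512 \times 512 = 2^{18}$ cells each.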

Domain decomposition in the parallel case adds a constraint: we do not want to create more domains than the number of available processors. Hence the general algorithm has some more steps and special cases:

1. Initialize n_cells, n_domains, and biggest_axis, as in the serial case.

2. If the total number of domains is less than the available number of processors, add one domain to the biggest axis and do everything again (go to 1).

3. If we have more domains than the number of processors, return 0.

4. If we are returning from ourselves, check the returned value. If it is lower than 1, then we divided too much (e.g., we returned from point 3). Remove one domain from the biggest axis and return the total number of domains calculated in 1.

5. If it is 1 or greater, then we have finished calculating the number of domains. Return that value.


LISTING A.13: The function to check if the file contains vectorial magnetic field data from an Osiris simulation.

bool avtGoLPFileFormat::CheckIsOsirisB() {
    string file_name = mData.baseName;
    string dir_name = mData.baseDir;
    // If it is vectorial, the path has a vector structure:
    // <.../[B1|B2|B3]/[B1|B2|B3]filename>
    if ((dir_name.substr(dir_name.size()-3, 3) == "/B1") ||
        (dir_name.substr(dir_name.size()-3, 3) == "/B2") ||
        (dir_name.substr(dir_name.size()-3, 3) == "/B3")) {
        // The path has a vector structure.
        // Look for the other possible files.
        dir_name.erase(dir_name.size()-2, 2);
        file_name.erase(0, 2);
        int32 status;  string s;
        s = dir_name + "B1/B1" + file_name;
        status = SDstart(s.c_str(), DFACC_READ);
        if (status != FAIL) {  // B1 component found
            SDend(status);
            mData.fileNames[mData.nDims] = s;
            mData.dimExists[mData.nDims] = true;
            mData.nDims++;
        }
        s = dir_name + "B2/B2" + file_name;
        status = SDstart(s.c_str(), DFACC_READ);
        if (status != FAIL) {
            SDend(status);
            mData.fileNames[mData.nDims] = s;
            mData.dimExists[mData.nDims] = true;
            mData.nDims++;
        }
        // ... Similar code removed to detect the B3 component
        if (mData.nDims > 0) {
            mData.baseDir = dir_name;
            mData.kind = OsirisB;
            return true;
        }
    }
    return false;
}


LISTING A.14: Domain decomposition for a serial visualization system.

int avtGoLPFileFormat::CreateDomains() {
    // This will point to the axis with biggest domain size
    // (biggest mAxis[].size / mAxis[].nDomains).
    // For now, it is simply initialized.
    AxisInfo *biggest_axis = &mAxis[0];
    // Stores the total number of cells:
    int n_cells = 1;
    // Stores the total number of domains:
    int n_domains = 1;
    // Initialize n_cells, n_domains and biggest_axis:
    for (int i = 0; i < mNumAxis; i++) {
        n_cells *= mAxis[i].size;
        n_domains *= mAxis[i].nDomains;
        // Actualize biggest_axis:
        if (mAxis[i].size/mAxis[i].nDomains >
            biggest_axis->size/biggest_axis->nDomains)
            biggest_axis = &mAxis[i];
    }
    // If we still have too many cells per domain,
    if (n_cells/n_domains > maxCellsPerDomain) {
        // split the biggest domain and try again.
        biggest_axis->nDomains++;
        n_domains = CreateDomains();  // recursive call
    }
    return n_domains;
}


Listing A.15 shows the implementation of the previous algorithm.

LISTING A.15: Domain decomposition for a parallel visualization system. We do not decompose into more domains than the number of available processors.

int avtGoLPFileFormat::CreateNProcDomains() {
    // ... Code removed: the same initialization as in the serial case.
    // If we still have free processors
    if (n_domains < mNProcs) {
        // split the biggest domain and try again.
        biggest_axis->nDomains++;
        int temp_domains = CreateNProcDomains();
        // If we get here, we finished calculating.
        // Either we split one time too many and we have to clean up
        // or we already cleaned up and just have to return.
        if (temp_domains < 1) {
            // We split too much. Go back one step and finish.
            biggest_axis->nDomains--;
            return n_domains;
        } else
            return temp_domains;
    } else {  // No more free processors. We return.
        if (n_domains > mNProcs) {
            // Finished splitting.
            // Go back one step if we overdid it.
            return 0;
        } else {
            // If n_domains = mNProcs, we're fine.
            return n_domains;
        }
        // We should never get here.
    }
    // We should never get here.
}

A.3.2 DIRECT VISUALIZATION

The direct visualization was implemented in OpenGL, using the GLUT toolkit. The application begins by initializing the simulation parameters and setting the usual GLUT callbacks. Then it enters the GLUT main loop, where it checks for events from the user and the operating system and dispatches them to the callback functions.

The most relevant part of the code is the display function, which is called each time the display needs refreshing. It is there that we push the simulation a certain number


of cycles and send the display data (particles' positions, EM field diagnostics, etc.) to the screen. We set up the idle function callback (with glutIdleFunc()) so that it also triggers a redisplay (glutPostRedisplay()). This ensures that the simulation also runs when no events are detected.
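The setup itself is not listed in Appendix A; the sketch below shows the usual GLUT boilerplate it amounts to. The callback names keyboard() and mouse() and the helper init_simulation() are placeholders of this sketch, while display() is the function of Listing A.16.

#include <GL/glut.h>

void display(void);                              // Listing A.16
void keyboard(unsigned char key, int x, int y);  // switch diagnostics, pause, ...
void mouse(int button, int state, int x, int y); // rotation / zoom
void init_simulation(void);                      // allocate fields and particles on the GPU

int main(int argc, char **argv)
{
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE | GLUT_DEPTH);
    glutInitWindowSize(800, 800);
    glutCreateWindow("Direct PIC visualization");

    init_simulation();

    glutDisplayFunc(display);
    glutKeyboardFunc(keyboard);
    glutMouseFunc(mouse);
    glutIdleFunc(glutPostRedisplay);   // idle callback = ask for a redisplay

    glutMainLoop();                    // dispatch events to the callbacks above
    return 0;
}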

Listing A.16 shows our implementation of the display function. The first lines run a simulation cycle (called through pusher->Push()) n_cycles times. Then we set up the OpenGL state and finally draw either the particles or a diagnostic.

The function drawParticles, called from the display function, is shown in Listing A.17. Particles are drawn as points. We also have to take into account whether we are

LISTING A.16: GLUT display function for direct visualization.

void display(void) {
    if (!pause) {
        for (int i = 0; i < n_cycles; i++) {
            pusher->Push();
            total_cycles++;
        }
    }
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    gluPerspective(60.0, winW/winH, 0.01, back_plane);
    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();
    // View transf.
    gluLookAt(pos_from.x, pos_from.y, pos_from.z,
              pos_to.x, pos_to.y, pos_to.z,
              0.0, 1.0, 0.0);
    // Model transf.
    GLfloat m[4][4];
    build_rotmatrix(m, m_quat);
    glMultMatrixf(&m[0][0]);
    // Axis
    glCallList(axis_list);
    int num_particles = pusher->GetNP();
    if (diag_mode == none)
        drawParticles(num_particles);
    else
        drawDiagnostic();
    glutSwapBuffers();
}


using VBOs or a vertex array. Finally, if we keep the species ordered inside the array,

we can draw them with different colors without a significant time penalty.

The other relevant function called from the display function is drawDiagnostic, shown in Listing A.18. It calls ReplaceDiagnostic, a function that gets the selected diagnostic from the simulation, turns it into a normalized texture, and makes this texture active. After that, the staggered grid has to be taken into account (hence the variable cell_offset) and the texture can be displayed.

LISTING A.17: Drawing the simulation particles.

void drawParticles(int num_particles) {
    glPointSize(point_size);
    glEnableClientState(GL_VERTEX_ARRAY);
    GLuint vbo;
    switch (mode) {
    case VBO:
        vbo = pusher->GetVBO();
        if (vbo) {
            glBindBuffer(GL_ARRAY_BUFFER, vbo);
            glVertexPointer(2, GL_FLOAT, 0, NULL);
        }
        break;
    default:
        particles = pusher->GetPositions();
        glVertexPointer(2, GL_FLOAT, 0, (float *) particles);
    }
    // Color per species:
    glColor3f(0.0, 1.0, 1.0);  // cyan = species 1
    glDrawArrays(GL_POINTS, 0, num_particles/2);
    glColor3f(1.0, 1.0, 0.0);  // yellow = species 2
    glDrawArrays(GL_POINTS, num_particles/2, num_particles/2);
    switch (mode) {
    case VBO:
        glBindBuffer(GL_ARRAY_BUFFER, 0);
        break;
    default: ;  // Nothing to do
    }
    glDisableClientState(GL_VERTEX_ARRAY);
}


LISTING A.18: Drawing the diagnostics as textures.

void drawDiagnostic(void) {
    glEnable(GL_TEXTURE_2D);
    glTexEnvf(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_REPLACE);
    ReplaceDiagnostic(diag_mode);
    glBegin(GL_QUADS);
    {
        glTexCoord2f(0.0, 0.0);
        glVertex3f(box_min.x - cell_offset.x,
                   box_min.y - cell_offset.y,
                   0.0);
        glTexCoord2f(1.0, 0.0);
        glVertex3f(box_max.x - cell_offset.x,
                   box_min.y - cell_offset.y,
                   0.0);
        glTexCoord2f(1.0, 1.0);
        glVertex3f(box_max.x - cell_offset.x,
                   box_max.y - cell_offset.y,
                   0.0);
        glTexCoord2f(0.0, 1.0);
        glVertex3f(box_min.x - cell_offset.x,
                   box_max.y - cell_offset.y,
                   0.0);
    }
    glEnd();
}


APPENDIX B

A GPGPU IMPLEMENTATION

OF THE BORIS PUSHER ALGORITHM

In this appendix we describe our first approach to implementing a PIC code on GPUs, referred to at the beginning of Chapter 3. This occurred in 2006, before the advent of CUDA. At the time, several approaches existed to use the computing power of programmable GPUs for general algorithms. We used the OpenGL Shading Language (GLSL), version 2.0 [27], and were able to implement the full Boris pusher in about three months, after a one-semester course on Computer Graphics. A short report on that effort follows.

B.1 MOTIVATION

The particle push is one of the most time-consuming steps in PIC codes. Since it uses vectorial quantities and often needs to perform vectorial operations, the Boris pusher is a very good candidate for strong code acceleration, by adapting the algorithm to use vector and stream instructions (SIMD). Although most processors have vector instructions available, like AltiVec on the PowerPC and MMX/SSE/SSE2/SSE3 on Intel/AMD, these units are usually already working at the top of their capabilities during a numerical simulation. On the other hand, Graphics Processing Units (GPUs) in 2006 already had SIMD capabilities and had been shown to provide good performance not only on streaming applications that perform the same operation over large collections of data, but also on applications with sufficient parallelism and computational intensity to hide memory latency [20]. GPUs of the time, like the ATI X1900 series [95] and the NVIDIA GeForce 7800 series [96], featured both programmable vertex and fragment processors and provided support for floating-point operations, making them viable targets for streaming computation.


These GPUs had shown their usefulness for many scientific computations [97, 98]. Work at the time had shown the feasibility of using the stream capabilities of programmable GPUs for particle tracing [90], mainly due to the following reasons:

— Programmable shaders offered a programmable environment with highly accelerated vectorial operations.

— It was possible to allocate GPU memory that could be used alternately as a texture map or a vertex array. This gave us the ability to run the whole algorithm with minimal data transfer between CPU and GPU memory.

— GPU support for floating-point precision was improving. GPU processors in 2006 could already use full 32-bit floats in the entire pipeline, and 16-bit floats (OpenEXR format) for textures.

Taking advantage of these characteristics, and of the fact that GPUs are usually idle during numerical computations, we decided in 2006 to program a simplified PIC code on the GPU's fragment processor, and to take some preliminary steps towards a distributed implementation over a cluster of GPUs.

B.2 THE BORIS PUSHER IN A SIMPLIFIED PIC CODE

A description of the computational steps of the Boris pusher was already given in Section 3.4.1. However, to use the particle pusher in a relevant scenario, we also had to implement the interpolation of the EM field at each particle position. The fields did not have to be updated with the particles' current and charge, thus simulating a forced-field scenario. This simplified PIC code, where the self-consistent evolution of the fields is neglected, can be summarized as follows:

1. Define the initial conditions: $t = t_0$, $r_{t_0}$, $v_{t_0 - \Delta t/2}$.

2. Calculate $B_t$ and $E_t$.

3. For each particle:

   (a) Interpolate $B_t$ and $E_t$ at $r_t$.

   (b) Push the particles, getting the new positions, with the Boris pusher.

4. Advance the time by $\Delta t$ and go to 2.
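For reference, step 3(b) is the standard Boris rotation (described in Section 3.4.1; the relativistic version used in Chapter 3 works with $u = \gamma v$ instead of $v$). In its non-relativistic form it reads:

\begin{aligned}
v^- &= v^{\,t-\Delta t/2} + \frac{q E_t}{m}\,\frac{\Delta t}{2}, \\
t_r &= \frac{q B_t}{m}\,\frac{\Delta t}{2}, \qquad v' = v^- + v^- \times t_r, \\
v^+ &= v^- + v' \times s, \qquad s = \frac{2\,t_r}{1 + |t_r|^2}, \\
v^{\,t+\Delta t/2} &= v^+ + \frac{q E_t}{m}\,\frac{\Delta t}{2}, \qquad
r^{\,t+\Delta t} = r^{\,t} + v^{\,t+\Delta t/2}\,\Delta t.
\end{aligned}

(We write the rotation vector as $t_r$ to avoid a clash with the time $t$.)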


B.3 GPGPU IMPLEMENTATION

The implementation starts by allocating and initializing the textures that will hold the

values of the vectorial quantities. The highest precision available for textures is 16-bit

floats in the OpenEXR format [99].

For $N$ particles, $r$ and $v$ are stored in 1D RGB textures of length $N$ (or of the next power of two, since non-power-of-two textures can still be penalized in performance). Two textures are needed for each of these quantities: one that stores $r_t$ and $v_{t-\Delta t/2}$, and another for $r_{t+\Delta t}$ and $v_{t+\Delta t/2}$. We call these textures $T_{r0}$, $T_{v0}$, $T_{r1}$ and $T_{v1}$. Two 3D RGB textures, $T_E$ and $T_B$, are also needed, one for $E$ and the other for $B$. They have the same size as the number of cells of the simulation space (again, some power-of-two adjustment might be needed). Figure B.1 illustrates how these textures are used in the algorithm.

[Figure B.1 diagram: the textures $T_{r0}$, $T_{v0}$ and the field textures $T_E$, $T_B$ feed the interpolation and the Boris pusher, which renders into $T_{r1}$, $T_{v1}$; on the next step the roles of the two texture pairs are swapped.]

FIGURE B.1: The use of textures in the GPGPU implementation of the Boris pusher.


Scalar quantities, like $\Delta t$, $m$ and $q$, are passed as uniform shader parameters.

For each particle, $E$ and $B$ are interpolated at the particle position, after it is read from $T_{r0}$. The new positions are calculated using the Boris pusher: $v_{t+\Delta t/2}$ is calculated and rendered to $T_{v1}$, and $r_{t+\Delta t}$ is calculated and rendered to the $T_{r1}$ texture. At this step, $T_{r1}$ can also be rendered to the screen.

The time is advanced and a new cycle begins. $T_E$ and $T_B$ are updated, and $T_{r1}$ and $T_{v1}$ are used to calculate the new $v$ and $r$, which are rendered to $T_{v0}$ and $T_{r0}$.

Boundary conditions have to be implemented. In our simulation, we have used a

periodic boundary, so that the particles are re-injected when they leave the simulation

 box.
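As an illustration of this re-injection (a sketch, not the shader code actually used; box_min and box_len are placeholder names for the box origin and size), one coordinate can be wrapped as follows:

#include <math.h>

/* Wrap a coordinate back into [box_min, box_min + box_len). */
static float wrap_periodic(float x, float box_min, float box_len)
{
    float r = fmodf(x - box_min, box_len);
    if (r < 0.0f)          /* fmodf keeps the sign of its first argument */
        r += box_len;
    return box_min + r;
}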

B.4 PARALLELIZATION ISSUES

With this implementation, we were able to have a real-time display of a simple PIC algorithm with a Boris pusher. However, the memory size of the GPUs of the time was an important limitation. A small simulation, with $128^3$ cells and 8 particles per cell, requires 408 MB just for texture memory, which might not be available even on video cards with 512 MB RAM.
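Assuming the 16-bit texels mentioned above, the 408 MB break down as follows: the two field textures take $2 \times 128^3 \times 3 \times 2$ bytes $\approx 24$ MB, while the four particle textures $T_{r0}$, $T_{v0}$, $T_{r1}$, $T_{v1}$ take $4 \times (8 \times 128^3) \times 3 \times 2$ bytes $= 384$ MB, which is the $24 + 384/P$ split used below.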

One way to overcome this limitation was to distribute the simulation over several GPUs. Similar approaches had already been taken with other algorithms [90, 100]. With our simplified PIC code, the particles do not influence the fields, so the Boris pusher could be parallelized with very little penalty by partitioning only the particles over the nodes, while each node keeps the full fields (see Figure B.2). In our previous example, memory requirements were reduced from 408 MB to $24 + 384/P$ MB, where $P$ is the number of available GPU processors. With just 2 GPUs, this results in 216 MB of texture memory, which fits well in cards with 512 MB of video memory.

More interesting was the possibility of running simulations of a reasonable size, like $256^3$ cells and 8 particles per cell. This is an $8\times$ increase in texture memory requirements. On a 16-node cluster, 288 MB are needed per node. As long as communication between the nodes was kept to a minimum, this was an acceptable solution.

B.5 CLUSTER IMPLEMENTATION

The techniques described in the previous sections were implemented on the commod-

ity hardware that was used as part of the Grid node for plasma physics simulation that

we were maintaining at the time. Each working node (WN) was configured with an


 

FIGURE B.2: The use of forced-fields allows for easy domain decomposition: each GPU gets

the whole spatial domain (represented by the grid), but the particles are distributed over the

GPUs (circles of different tones). This leads to no particle exchange occurring between the

GPUs and less node communication. (For a color version, see Figure C.9 on page 109.)

AMD Athlon 64 3200+ CPU with 1 GB RAM and an NVIDIA 6600GT PCIe video card with 128 MB RAM. A user interface node (UI) was also used, which was responsible for initializing the textures with the data and for launching the GPU processes on the working nodes. The LAM MPI implementation was used as the communication interface between the nodes [5].

The application running on the UI detected the available WNs and distributed the particles. After receiving the particles, each WN ran the simulation for a certain simulation time or number of steps (set by the UI). During this time, no node communication was needed. The final result (particle positions and velocities) was sent from the WNs to the UI, where it was stored and displayed.

B.6 PERFORMANCE AND LIMITATIONS

Simulations run on a system with one UI and two WNs showed good performance,

comparable to the simulation on the CPU (see Table B.1). Overhead caused by the

transfer of textures between CPU and GPU memory was compensated by the SIMD

characteristics of the GPU.

The biggest limitation of this GPGPU implementation of the Boris pusher was the lack of support for double-precision floats on the GPU. To study some very detailed plasma behavior, half or limited single precision is not enough. However, even in these cases a GPU Boris pusher with limited float precision is still helpful as a way to watch the real-time evolution of the simulation, which helps to quickly grasp the general behavior of the particles.

The limited video memory available to most GPUs of the time (usually less than 1 GB) was also a constraint on applying this algorithm to large-scale


simulations. Parallelization is a good option for a simplified PIC code, since very little node communication is required.

GPUs   PPC (particles per cell)   FPS
  1              8                3–5 (comparable to a CPU)
  2              8                6–9
  2             16                3–5

TABLE B.1: Performance of the GPGPU implementation of the Boris pusher. The GPUs were NVIDIA GeForce 6600GT cards and the CPU an AMD Athlon 64 3200+. Internode communication used LAM-MPI over gigabit Ethernet. Simulation parameters included a viewport of $512 \times 512$ pixels and $64^3$ cells.

Finally, mapping generic PIC algorithms to computer graphics concepts was not an easy task. The lack of a significant performance gain made the endeavor not worth pursuing, at least until CUDA was introduced one year later.


APPENDIX C

COLOR FIGURES

[Figure C.1 diagram: the host (CPU) connects to the device (GPU); each streaming multiprocessor (SMP 1 … SMP N) contains scalar processors SP1–SP8 with their register memory, plus shared memory, a constant cache and a texture cache; the device DRAM holds the local, global, constant and texture memory.]

FIGURE C.1 : CUDA memory architecture, both from the hardware and software level. (Re-

production of Figure 3.3 from page 43.)


FIGURE C.2: Illustration of a stride during particle push and current deposition, so that kernels in the same warp handle particles in different cells. p is the number of particles per thread and s is the stride. (Reproduction of Figure 3.7 from page 53.)

FIGURE C.3: Final state of a full 2D PIC simulation of a Weibel instability (positrons in cyan

and electrons in yellow). (Reproduction of Figure 3.10 from page 56.)


[Figure: pie chart of the time spent in each step of the pusher algorithm: read/write (R/W), sort, interpolation, velocity and position update, and current (J) deposition, the latter taking 67%; the remaining slices are 17%, 13%, 2% and 1%.]

FIGURE C.4: Percentage of time spent at each step of the pusher algorithm. Based on the values from Table 3.3. (Reproduction of Figure C.4 from page 107.)

FIGURE C.5: Direct visualization of a 2D PIC simulation of the formation of a Weibel instability. (Reproduction of Figure 4.2 from page 68.)


[Figure: flowchart of the coupled simulation/visualization loop: start the OpenGL context and initialize the simulation; then, in a loop, get user events (change visualization parameters, change simulation parameters, restart the simulation, or idle), run N simulation cycles, get positions/diagnostics, and render.]

FIGURE C.6: Overview of the integration between the simulation code (middle layer, in green), and the interaction and visualization code (top and bottom layers, in red and blue respectively). (Reproduction of Figure 4.4 from page 71.)

FIGURE C.7: Two examples of direct interaction during a simulation: (a) depicting the grid used in the PIC simulation over the particles, (b) zooming in on the center of the image while the simulation is running. (Reproduction of Figure 4.5 from page 71.)


FIGURE C.8: Direct visualization of a 3D particle pusher on a static EM field. (Reproduction

of Figure 4.6 from page 73.)

 

FIGURE C.9: The use of forced fields allows for easy domain decomposition: each GPU gets the whole spatial domain (represented by the grid), but the particles are distributed over the GPUs (circles of different tones). As a result, no particles need to be exchanged between the GPUs and node communication is reduced. (Reproduction of Figure B.2 from page 103.)


BIBLIOGRAPHY

[1] S. Burke, S. Campana, A. Peris, F. Donno, P. Lorenzo, R. Santinelli, and A. Sciabà.

 gLite 3 User Guide, 2009.

[2] D. R. Hartree. The Eniac, an electronic computing machine. Nature, 158:500–506,

1946.

[3] R. Rojas. How to make Zuse's Z3 a universal computer. IEEE Annals of the History of Computing, 20(3):51–54, July–September 1998.

[4] T. Sterling, D. Becker, and D. Savarese. Beowulf: A parallel workstation for

scientific computation. In Proceedings of the International Conference on Parallel

Processing. CRC Press, 1995.

[5] W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, 1999.

[6] Federal Coordinating Council for Science, Engineering, and Technology. A re-

search and development strategy for high performance computing. Technical

report, Office of Science and Technology Policy, 1987.

[7] Ian Foster. The anatomy of the Grid: Enabling scalable virtual organizations.

International Journal of Supercomputer Applications, 15(3), 2001.

[8] K. Anstreicher, N. Brixius, J. P. Goux, and J. Linderoth. Solving large quadratic assignment problems on computational grids. Mathematical Programming, 91(3):563–588, February 2002. 17th International Symposium of the Mathematical Programming Society, Atlanta, Georgia, Aug 07-11, 2000.

[9] C. E. Nugent, T. E. Vollmann, and J. Ruml. An experimental comparison of 

techniques for assignment of facilities to locations. Operations Research, 16(1):150,

1968.

[10] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit.

International Journal of Supercomputer Applications, 11(2):115–128, 1997.


[11] Ian Foster, Carl Kesselman, Jeffrey M. Nick, and Steven Tuecke. The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration. In Open Grid Service Infrastructure WG. Open Grid Forum, 2002.

[12] G. E. Moore. Cramming more components onto integrated circuits. Proceedings

of the IEEE, 86(1):82–85, Jan 1998.

[13] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald,

and Ramesh Menon. Parallel programming in OpenMP. Morgan Kaufmann Pub-

lishers Inc., San Francisco, CA, USA, 2001.

[14] Thomas Willhalm and Nicolae Popovici. Putting Intel® Threading Building Blocks to work. In IWMSE '08: Proceedings of the 1st International Workshop on Multicore software engineering, pages 3–4, NY, USA, 2008. ACM.

[15] Christopher Barton, Calin Cascaval, George Almási, Yili Zheng, Montse Far-

reras, Siddhartha Chatterjee, and José Nelson Amaral. Shared memory program-

ming for large scale machines. In PLDI ’06: Proceedings of the 2006 ACM SIG-

PLAN conference on Programming language design and implementation, pages 108–

117, New York, NY, USA, 2006. ACM.

[16] D. Pham, S. Asano, M. Bolliger, M. N. Day, H. P. Hofstee, C. Johns, J. Kahle,

A. Kameyama, J. Keaty, Y. Masubuchi, M. Riley, D. Shippy, D. Stasiak,

M. Suzuoki, M. Wang, J. Warnock, S. Weitzel, D. Wendel, T. Yamazaki, and

K. Yazawa. The design and implementation of a first-generation cell proces-

sor. In Solid-State Circuits Conference, 2005. Digest of Technical Papers. ISSCC. 2005

IEEE International, pages 184–185 Vol. 1, 2005.

[17] B. Flachs, S. Asano, S. H. Dhong, P. Hofstee, G. Gervais, R. Kim, T. Le,

P. Liu, J. Leenstra, J. Liberty, B. Michael, H. Oh, S. M. Mueller, O. Takahashi,

A. Hatakeyama, Y. Watanabe, and N. Yano. A streaming processing unit for a

cell processor. In Solid-State Circuits Conference, 2005. Digest of Technical Papers.

ISSCC. 2005 IEEE International, pages 134–135 Vol. 1, 2005.

[18] Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, and

Katherine Yelick. The potential of the cell processor for scientific computing. In

CF ’06: Proceedings of the 3rd conference on Computing frontiers, pages 9–20, New

York, NY, USA, 2006. ACM.

[19] David Luebke, Mark Harris, Jens Krüger, Tim Purcell, Naga Govindaraju, Ian

Buck, Cliff Woolley, and Aaron Lefohn. GPGPU: general purpose computation


on graphics hardware. In SIGGRAPH ’04: ACM SIGGRAPH 2004 Course Notes,

page 33, New York, NY, USA, 2004. ACM.

[20] I. Buck, T. Foley, D. Horn, J. Sugerman, P. Hanrahan, M. Houston, and K. Fata-

halian. Brook for GPUs: Stream Computing on Graphics Hardware. In Proceed-

ings of the ACM SIGGRAPH 2004, 2004.

[21] John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krueger,

Aaron E. Lefohn, and Timothy J. Purcell. A survey of general-purpose compu-

tation on graphics hardware. Computer Graphics Forum, 26(1):80–113, 2007.

[22] R. W. Hamming. Numerical methods for scientists and engineers. Dover Publications, Inc., New York, NY, USA, 1986.

[23] Chip Walter. Kryder’s law. Scientific American Magazine, 161(8), August 2005.

[24] M. H. Kryder and Chang S. Kim. After hard drives—what comes next? Magnet-

ics, IEEE Transactions on, 45(10):3406–3413, September 2009.

[25] Peter Shirley. Fundamentals of Computer Graphics. A K Peters, 2001.

[26] VisIt: Software that Delivers Parallel Interactive Visualization. https://wci.

llnl.gov/codes/visit/.

[27] Randi J. Rost. OpenGL(R) Shading Language (2nd Edition). Addison-Wesley Pro-

fessional, January 2006.

[28] I. Foster, K. Czajkowski, D. E. Ferguson, J. Frey, S. Graham, T. Maguire,

D. Snelling, and S. Tuecke. Modeling and Managing State in Distributed Sys-

tems: The Role of OGSI and WSRF. Proceedings of the IEEE, 93(3):604–612, March

2005.

[29] The Globus Project. GridFTP: Universal Data Transfer for the Grid. White

Paper, September 2000. http://globus.org/toolkit/docs/2.4/datagrid/

deliverables/C2WPdraft3.pdf.

[30] Data Access with GFAL. http://wiki.egee-see.org/index.php/Data_

Access_with_GFAL.

[31] Peter Kunszt. File Transfer Service User Guide, 2005. https://edms.cern.ch/

document/591792/.


[32] Sergio Andreozzi, Natascia De Bortoli, Sergio Fantinel, Antonia Ghiselli,

Gian Luca Rubini, Gennaro Tortone, and Maria Cristina Vistoli. GridICE: a monitoring service for Grid systems. Future Generation Computer Systems, 21(4):559–

571, 2005.

[33] Fabrizio Pacini. WMS User’s Guide. EGEE-JRA1. https://edms.cern.ch/

document/572489/1.

[34] Jennifer M. Schopf, Laura Pearlman, Neill Miller, Carl Kesselman, Ian Foster,

Mike D’Arcy, and Ann Chervenak. Monitoring the grid with the Globus Toolkit

MDS4. Journal of Physics: Conference Series, 46:521–525, 2006.

[35] C. K. Birdsall and A. B. Langdon. Plasma Physics via Computer Simulation (Series on

Plasma Physics). Taylor & Francis, January 1991.

[36] I. Campos, M. Plociennik, H. Rosmanith, and S. Stork. Application support in

int.eu.grid. In Proceedings of IBERGRID: 1st Iberian Grid infrastructure conference,

2007.

[37] Péter Kacsuk and Gergely Sipos. Multi-grid, multi-user workflows in the

p-grade grid portal. Journal of Grid Computing, 3(3):221–238, 9 2005. DOI

10.1007/s10723-005-9012-6.

[38] Thierry Delaitre, Tamas Kiss, Ariel Goyeneche, Gabor Terstyanszky, Stephen

Winter, and Peter Kacsuk. GEMLCA: Running Legacy Code Applications as

Grid Services. Journal of Grid Computing, 3(1):75–90, 6 2005.

[39] M. Kupczyk, R. Lichwala, N. Meyer, B. Palak, M. Plociennik, and P. Wolniewicz.

Migrating Desktop interface for several grid infrastructures. In Hamza, MH, ed-

itor, Proceedings of the IASTED International Conference on Parallel and Distributed

Computing and Networks, pages 183–188, Canada, 2004. IASTED, Acta Press.

[40] R. A. Fonseca, L. O. Silva, F. S. Tsung, V. K. Decyk, W. Lu, C. Ren, W. B. Mori,

S. Deng, S. Lee, T. Katsouleas, and J. C. Adam. OSIRIS: A three-dimensional,

fully relativistic particle in cell code for modeling plasma based accelerators.

In Sloot, P. and Tan, C. J. K. and Dongarra, J. J. and Hoekstra, A. G., editor,

Computational Science-ICCS 2002, PT III, Proceedings, volume 2331 of Lecture Notes

in Computer Science, pages 342–351. Springer-Verlag Berlin, 2002.

[41] L. Gargaté, R. Bingham, R. A. Fonseca, and L. O. Silva. dHybrid: A massively

parallel code for hybrid simulations of space plasmas. Computer Physics Commu-

nications, 176(6):419–425, 2007.


[42] EGEE Fusion Virtual Organization. http://grid.bifi.unizar.es/egee/

fusion-vo/.

[43] Torque Admin Manual. http://www.clusterresources.com/products/

torque/docs/.

[44] Maui Scheduler Administrator’s Guide. http://www.clusterresources.com/

products/maui/docs/mauiadmin.shtml.

[45] E. Laure, S. M. Fisher, A. Frohner, and C. Grandi. Programming the Grid with

gLite. Computational Methods in Science and Technology, 12(1):33–45, 2006.

[46] The HDF Group. HDF 4.2r4 Reference Manual. http://www.hdfgroup.org/doc.

html.

[47] The CERN DataGrid Project. http://www.cern.ch/grid/.

[48] Mehnaz Hafeez, Asad Samar, and Heinz Stockinger. A Data Grid Prototype for

Distributed Data Production in CMS. In 7th International Workshop on Advanced

Computing and Analysis Techniques in Physics Research, 2000.

[49] Mehmet Balman and Tevfik Kosar. Data scheduling for large scale distributed

applications. In Proceedings of the 5th ICEIS Doctoral Consortium, in conjunction with the International Conference on Enterprise Information Systems (ICEIS'07), 2007.

[50] Caitriana Mairi Macleod Nicholson. File Management for HEP Data Grids.

PhD thesis, University of Glasgow, 2006.

[51] B. Radic, V. Kajic, and E. Imamagic. Optimization of Data Transfer for Grid Us-

ing GridFTP. In 29th International Conference on Information Technology Interfaces,

2007., pages 709 –715, June 2007.

[52] Jiazeng Wang and Linpeng Huang. Intelligent file transfer protocol for grid environment. In Wu Zhang, Weiqin Tong, Zhangxin Chen, and Roland Glowinski,

editors, Current Trends in High Performance Computing and Its Applications, pages

469–476. Springer Berlin Heidelberg, 2005.

[53] C. Huang, V.K. Decyk, C. Ren, M. Zhou, W. Lu, W.B. Mori, J.H. Cooley, T.M.

Antonsen, Jr., and T. Katsouleas. QuickPIC: A highly efficient particle-in-cell

code for modeling wakefield acceleration in plasmas. Journal of Computational

Physics, 217(2):658 – 679, 2006.


[54] Mikhail V. Medvedev, Massimiliano Fiore, Ricardo A. Fonseca, Luis O. Silva,

and Warren B. Mori. Long-time evolution of magnetic fields in relativistic gamma-ray burst shocks. The Astrophysical Journal Letters, 618(2):L75, 2005.

[55] S. F. Martins, R. A. Fonseca, J. Vieira, L. O. Silva, W. Lu, and W. B. Mori. Model-

ing laser wakefield accelerator experiments with ultrafast particle-in-cell simu-

lations in boosted frames. Physics of Plasmas, 17(5):056705, 2010.

[56] W. Lu, M. Tzoufras, C. Joshi, F. S. Tsung, W. B. Mori, J. Vieira, R. A. Fonseca,

and L. O. Silva. Generating multi-GeV electron bunches using single stage laser wakefield acceleration in a 3D nonlinear regime. Phys. Rev. ST Accel. Beams,

10(6):061301, Jun 2007.

[57] C. Ren, M. Tzoufras, F. S. Tsung, W. B. Mori, S. Amorini, R. A. Fonseca, L. O.

Silva, J. C. Adam, and A. Heron. Global simulation for laser-driven MeV elec-

trons in fast ignition. Phys. Rev. Lett., 93(18):185004, Oct 2004.

[58] J. Tonge, J. May, W. B. Mori, F. Fiuza, S. F. Martins, R. A. Fonseca, L. O. Silva,

and C. Ren. A simulation study of fast ignition with ultrahigh intensity lasers.

Physics of Plasmas, 16(5):056311, 2009.

[59] Ian Blumenfeld, Christopher E. Clayton, Franz-Josef Decker, Mark J. Hogan,

Chengkun Huang, Rasmus Ischebeck, Richard Iverson, Chandrashekhar Joshi,Thomas Katsouleas, Neil Kirby, Wei Lu, Kenneth A. Marsh, Warren B. Mori,

Patric Muggli, Erdem Oz, Robert H. Siemann, Dieter Walz, and Miaomiao Zhou.

Energy doubling of 42 GeV electrons in a metre-scale plasma wakefield acceler-

ator. Nature, 445(7129):741–744, Feb 2007.

[60] Christian Hededal. Gamma-Ray Bursts, Collisionless Shocks and Synthetic Spectra.

PhD thesis, The Niels Bohr Institute, University of Copenhagen, 2005. http:

//arxiv.org/abs/astro-ph/0506559.

[61] J. L. Martins, S. F. Martins, R. A. Fonseca, and L. O. Silva. Radiation post-processing in PIC codes. In Dino A. Jaroszynski and Antoine Rousse, editors,

  Harnessing Relativistic Plasma Waves as Novel Radiation Sources from Terahertz to

X-Rays and Beyond, Prague, Czech Republic, 2009. SPIE.

[62] M. McCool, S. Du Toit, T. Popa, B. Chan, and K. Moule. Shader algebra. ACM

Transactions On Graphics, 23(3):787–795, August 2004.

[63] R. A. Fonseca, S. F. Martins, L. O. Silva, J. W. Tonge, F. S. Tsung, and W. B. Mori.

One-to-one direct modeling of experiments and astrophysical scenarios: push-


ing the envelope on kinetic plasma simulations. Plasma Physics and Controlled

Fusion, 50(12):124034 (9pp), 2008.

[64] J. M. Dawson. Particle simulation of plasmas. Rev. Mod. Phys., 55(2):403–447,

April 1983.

[65] J. P. Boris. Relativistic plasma simulation—optimization of a hybrid code.

In Proc. Fourth Conference on the Numerical Simulation of Plasmas, pages 3–67,

November 1970.

[66] R. W. Hockney and J. W. Eastwood. Computer Simulation Using Particles. Institute

of Physics Publishing, Bristol and Philadelphia, 1988.

[67] Kane Yee. Numerical solution of initial boundary value problems involving

maxwell’s equations in isotropic media. Antennas and Propagation, IEEE Transac-

tions on, 14(3):302–307, May 1966.

[68] J. Villasenor and O. Buneman. Rigorous Charge Conservation For Local

Electromagnetic-Field Solvers. Computer Physics Communications, 69(2-3):306–

316, March 1992.

[69] T. Z. Esirkepov. Exact charge conservation scheme for Particle-in-Cell simula-

tion with an arbitrary form-factor. Computer Physics Communications, 135(2):144–

153, April 2001.

[70] T. Umeda, Y. Omura, T. Tominaga, and H. Matsumoto. A new charge conser-

vation method in electromagnetic particle-in-cell simulations. Computer Physics

Communications, 156(1):73–85, 2003.

[71] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel

programming with CUDA. Queue, 6(2):40–53, 2008.

[72] C. Sigg and M. Hadwiger. Fast third-order texture filtering. In GPU Gems 2:

Programming Techniques for High-Performance Graphics and General-Purpose Computation, chapter 20, pages 313–329. Addison-Wesley, 2005.

[73] Daniel Ruijters, Bart M. ter Haar Romeny, and Paul Suetens. Efficient GPU-

Based Texture Interpolation using Uniform B-Splines. Journal of Graphics, GPU,

& Game Tools, 13(4):61–69, January 2008.

[74] George Stantchev, William Dorland, and Nail Gumerov. Fast parallel Particle-

To-Grid interpolation for plasma PIC simulations on the GPU. Journal of Parallel

and Distributed Computing, 68(10):1339–1349, 2008.


[75] V. Podlozhnyuk. Histogram calculation in CUDA. Technical report, NVIDIA,

2007. http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/histogram256/doc/histogram.pdf.

[76] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms

for manycore GPUs. In Parallel Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, pages 1–10, May 2009.

[77] R. A. Fonseca, L. O. Silva, J. Tonge, R. G. Hemker, J. M. Dawson, and W. B.

Mori. Three-dimensional particle-in-cell simulations of the Weibel instability in

electron-positron plasmas. Plasma Science, IEEE Transactions on, 30(1):28–29, Feb

2002.

[78] Victor W. Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim,

Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennu-

paty, Per Hammarlund, Ronak Singhal, and Pradeep Dubey. Debunking the

100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and

GPU. SIGARCH Comput. Archit. News, 38(3):451–460, 2010.

[79] H. Burau, R. Widera, W. Hönig, G. Juckeland, A. Debus, T. Kluge,

U. Schramm, T.E. Cowan, R. Sauerbrey, and M. Bussmann. PIConGPU: A Fully

Relativistic Particle-in-Cell Code for a GPU Cluster. Plasma Science, IEEE Trans-

actions on, 38(10):2831–2839, October 2010.

[80] Viktor Decyk, T. V. Singh, and Scott Friedman. Graphical processing unit-based

particle-in-cell simulations. To be published in Proceedings of the 10th Interna-

tional Computational Accelerator Physics Conference, 2010.

[81] Viktor Decyk and T. V. Singh. Adaptable Particle-in-Cell Algorithms for Graph-

ical Processing Units. To be published in Computer Physics Communications Jour-

nal, 2010.

[82] Xianglong Kong, Michael C. Huang, Chuang Ren, and Viktor K. Decyk. Particle-

in-cell Simulations with Charge-Conserving Current Deposition on Graphic Processing Units. To be published in Journal of Computational Physics, 2010.

[83] IDL. http://www.ittvis.com/IDL.

[84] SCIRun: A Scientific Computing Problem Solving Environment, Scientific Com-

puting and Imaging Institute (SCI). http://www.scirun.org.

[85] Paraview: an open-source, multi-platform data analysis and visualization ap-

plication. http://www.paraview.org.


[86] OpenDX: The Open Source Software Project Based on IBM’s Visualization Data

Explorer. http://www.opendx.org.

[87] M. Marti, L. Gargate, R. A. Fonseca, L. L. Alves, J. P. S. Bizarro, P. Fernandes,

  J. P. M. Almeida, H. Pina, F. M. Silva, and L. O. Silva. The IST Cluster: an

integrated infrastructure for parallel applications in Physics and Engineering.

In Ibergrid 2008 — 2nd Iberian Grid Infrastructure Conference Proceedings. Netbiblo,

2008.

[88] EPP: Extreme Plasma Physics at IST. http://epp.ist.utl.pt/.

[89] Dave Shreiner. The OpenGL Programming Guide, The Official Guide to Learning

OpenGL. Addison-Wesley, 2009.

[90] Jens Krüger, Peter Kipfer, Polina Kondratieva, and Rüdiger Westermann. A

particle system for interactive visualization of 3d flows. IEEE Transactions on

Visualization and Computer Graphics, 11(6):744–756, 2005.

[91] Keenan Crane, Ignacio Llamas, and Sarah Tariq. Real-time simulation and ren-

dering of 3d fluids. In GPU Gems 3, chapter 30, pages 633–676. Addison-Wesley,

2008.

[92] Takahiro Harada. Real-time rigid body simulation on GPUs. In GPU Gems 3,

chapter 29, pages 611–632. Addison-Wesley, 2008.

[93] I. Campos, F. Castejón, G. Losilla, J. M. Reynolds, F. Serrano, A. Tarancón, R. Val-

lés, and J. L. Velasco. IVISDEP: a Fusion plasma application ported to the Inter-

active European Grid e-Infrastructure. In Proceedings of IBERGRID: 1st Iberian

Grid infrastructure conference, 2007.

[94] Tech-X. Vorpalview, 2010. http://www.txcorp.com/products/VORPAL/

vorpalview.php.

[95] ATI. Radeon x1900 product site, 2006. http://www.amd.com/us/products/desktop/graphics/other/Pages/x1900-oveview.aspx.

[96] NVIDIA. Geforce 7800 product site, 2005. http://www.nvidia.com/page/

geforce_7800.

[97] A. Lastra, M. Lin, and D. Manocha. GP2: ACM Workshop on General Purpose Com-

 putation on Graphics Processors. The Organizing Committee of the 2004 ACM

Workshop on General-Purpose Computing on Graphics Processors, October

2004. http://www.cs.unc.edu/Events/Conferences/GP2/proc.pdf.


[98] Nico Galoppo, Naga K. Govindaraju, Michael Henson, and Dinesh Manocha.

LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware. In SC '05: Proceedings of the 2005 ACM/IEEE conference on Supercomputing,

page 3, Washington, DC, USA, 2005. IEEE Computer Society.

[99] Industrial Light & Magic. OpenEXR, 2010. http://www.openexr.com/.

[100] Daniel Reiter Horn, Mike Houston, and Pat Hanrahan. ClawHMMER: A

Streaming HMMer-Search Implementation. In SC ’05: Proceedings of the 2005

 ACM/IEEE conference on Supercomputing, page 11, Washington, DC, USA, 2005.

IEEE Computer Society.