Realtidsrendering av försimulerade partiklar (Real-Time Rendering of Pre-Simulated Particles)

Department of Science and Technology (Institutionen för teknik och naturvetenskap)

Linköping University (Linköpings universitet)

SE-601 74 Norrköping, Sweden

LiU-ITN-TEK-A-14/002-SE

Realtidsrendering av försimulerade partiklar

Nathalie Ek

2014-03-14

Master's thesis in Media Technology carried out at the Institute of Technology at Linköping University

Nathalie Ek

Supervisor: Joel Kronander

Examiner: Jonas Unger

Norrköping, 2014-03-14

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/

© Nathalie Ek

Abstract

This master's thesis presents a method for real-time streaming of pre-simulated particle systems. The particle systems are simulated offline in any software and then saved as an Alembic file. This Alembic file is then imported into Frostbite and can be loaded at run-time. The result of this master's thesis work is the implementation of a streaming and rendering framework for pre-simulated particles. The implementation contains only basic shading and lighting due to the time constraints of the work, but the streaming part features an advanced solution to get predictable and manageable CPU and memory overhead. The implementation performs well and works satisfactorily.

Contents

1 Introduction
1.1 Motivation
1.2 The Company
1.2.1 EA Digital Illusions CE - DICE
1.2.2 Frostbite
1.3 Third Party Tools
1.3.1 RealFlow
1.3.2 Alembic

2 Related Work
2.1 Related Work

3 Background
3.1 History
3.1.1 A particle system is born
3.2 Hardware evolution
3.2.1 CPU evolution
3.2.2 GPU evolution
3.3 SIMD - Single Instruction Multiple Data
3.4 Sorting Particles
3.4.1 Bubblesort
3.4.2 Combsort
3.4.3 Insertion sort
3.4.4 Mergesort
3.4.5 Quicksort
3.4.6 Radix Sort
3.4.7 Summary
3.5 Third Party Software
3.5.1 RealFlow
3.5.2 Alembic

4 Particle Systems
4.1 Particle Systems
4.1.1 A particle system in general
4.1.2 The Particle
4.1.3 Different types of Particle Systems
4.1.4 The Emitter
4.1.5 Simulation
4.1.6 Rendering
4.1.7 Performance

5 Method
5.1 Introduction
5.2 Alembic importer
5.2.1 Particle information
5.3 Pipeline
5.4 Runtime entity
5.5 Streaming
5.5.1 Streaming Cache
5.5.2 Streaming State Machine
5.5.3 Reading from the Cache
5.6 Rendering
5.7 Sorting
5.7.1 Sorting using a key-index stream
5.7.2 Sorting the Particle Stream directly
5.7.3 CPU sorting

6 Result
6.1 Pipeline results
6.2 Runtime Results
6.3 Workflow Results
6.4 Screenshots
6.5 Performance

7 Discussion
7.1 Future Improvements
7.1.1 Compression
7.1.2 Lighting

List of Figures

3.1 Particle System in Genesis
3.2 Spacewar (1962)
3.3 SIMD
3.4 Movie Texture
3.5 Mergesort
3.6 Quicksort
5.1 Data flow
5.2 State machine
5.3 Chunk states
5.4 Ringbuffer example
5.5 Information split over chunks
6.1 Particle Stream simulation
6.2 Particle Stream simulation
6.3 Combined CinemaStream and ParticleStream simulation
6.4 Combined CinemaStream and particle simulation

List of Tables

3.1 LSD - Sorting by least significant digit

Chapter 1

Introduction

This chapter will give a short motivation for the work. It will also go through

the third party tools involved in this master thesis work.

1.1 Motivation

Particles are of great value when it comes to giving the viewer/user a believable experience, whether it concerns movies or games. Hence, particle systems are an important tool for visual effects in movies and games.

The challenge with particles in the movie industry is not efficiency. It does not matter if one frame takes two days to render as long as the end result is rewarding. In games, the situation is the opposite - rendering time has to be less than 33 ms per frame (for 30 fps) to be usable. With the latest generation of consoles (Xbox One and PlayStation 4), Frostbite actually runs at 60 fps, which cuts the frame time in half to roughly 16 ms.

The quality of particle simulations and the quantity of particles that can be simulated have increased enormously since the introduction of GPU-accelerated particle systems.

In modern games, particle systems tend to use a small number of large particles combined with shading, instead of a large number of small particles. The particles are usually rendered as point sprites and an artist provides a painted texture, sometimes along with a normal map to integrate better with the light in the scene.

Since the number of particles increases with the evolution of hardware, shading becomes more and more important. Defects that previously could be concealed with textures become more obvious as the number of particles increases. One of the important aspects of shading dense particle systems is self-shadowing, i.e. that particles can cast shadows on other particles. Self-shadowing gives important cues to the density and shape of a particle cloud.

1.2 The Company

1.2.1 EA Digital Illusions CE - DICE

EA Digital Illusions CE - DICE was founded by Ulf Mandorff, Olof Gustafsson, Fredrik Liliegren, Andreas Axelsson and Markus Nystrom in Alvesta in the early 1990s. The first games released were Pinball Dreams (1992) and Pinball Illusions (1995). The release of Battlefield 1942 was a major breakthrough for the company within the game industry. Today the Battlefield series is among the most popular first-person shooter games on all platforms.

In 2004 the company entered into a collaboration with Electronic Arts (EA), resulting in DICE becoming an affiliated company of EA.

1.2.2 Frostbite

Frostbite (currently Frostbite 3) is a game engine developed by DICE. As of today, the engine is designed for use on Microsoft Windows, PlayStation 4, Xbox One, PlayStation 3, Xbox 360 and mobile platforms. It is also adapted to a wide range of video game genres.

“Frostbite empowers game creators around the world to shape

the future of gaming. Together with our game teams, we provide

a smooth, beautiful, and collaborative development experience.”

[1]

1.3 Third Party Tools

1.3.1 RealFlow

RealFlow is a fluid and dynamics simulator for the 3D industry and is used within the movie industry for creating realistic fluid simulations. RealFlow fluids have appeared in, among others, Resident Evil: Retribution, Ice Age 4, The Avengers and The Girl with the Dragon Tattoo. RealFlow has also been used in games like Crysis 2 and Mass Effect 3.

RealFlow Intuitive Fluids is an industry-standard, out-of-the-box fluid simulation software. It is fast and easy to use and it is compatible with all major 3D platforms (Maya, 3DS Max, Lightwave, Softimage, Houdini, Cinema4D).

In January 2008 RealFlow won the Academy Technical Achievement Award

2007 granted by The Academy of Motion Picture Arts and Sciences.

1.3.2 Alembic

Alembic is an open computer graphics interchange framework which contains tools for collaboration management and a generic, extensible data representation scheme. It includes a C++ library, a file format, client plugins and applications. It was initially developed in 2010 by teams from Sony Pictures Imageworks and Industrial Light & Magic.

Alembic was made to create an open standard for scene data sharing and to support a baked-data workflow. It enables easy hand-off between disciplines and fast workflows, leading to greater productivity. Alembic development continues on a daily basis.

Chapter 2

Related Work

This chapter presents a summary of related work in the fields of particle simulation and sorting algorithms.

2.1 Related Work

In 2004, A. Kolb and P. Kipfer [2], [3] introduced particle systems on the GPU for real-time animation and rendering in OpenGL, since the CPU was too slow. The GPU was subsequently used for simulating fluid motion with smoothed-particle hydrodynamics (SPH) [4].

Different application purposes require different parallelization strategies. The particle system used by A. Kolb and P. Kipfer in 2004 did not require any knowledge about neighboring particles and was easily parallelized (i.e. one thread per particle). The SPH implementation used in [4], as well as that of P. Kipfer [3], required information about the local neighbors due to collision detection and advection of the particles. Both are forward Euler solutions where the particle velocity can be adjusted by using a small uniform time step to allow the system to converge.
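As a rough illustration of a forward Euler update with a uniform time step, the following Python sketch advances a set of independent particles. The function name `euler_step`, the tuple layout and the constant acceleration are assumptions made for this example; they are not taken from the cited papers, which run on the GPU.

```python
def euler_step(positions, velocities, acceleration, dt):
    """Advance every particle by one forward Euler step of length dt.

    positions and velocities are lists of (x, y, z) tuples, and
    acceleration is a single (x, y, z) tuple shared by all particles
    (an assumption of this sketch, e.g. constant gravity).
    """
    ax, ay, az = acceleration
    new_positions = []
    new_velocities = []
    for (px, py, pz), (vx, vy, vz) in zip(positions, velocities):
        # Forward Euler: the position uses the velocity from the start of the step.
        new_positions.append((px + vx * dt, py + vy * dt, pz + vz * dt))
        new_velocities.append((vx + ax * dt, vy + ay * dt, vz + az * dt))
    return new_positions, new_velocities
```

Each particle is updated without looking at any other particle, which is why this kind of system maps naturally to one thread per particle.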

The implementation used in Dynamic Particle System for Mesh Extraction on the GPU [5] does not have a uniform time step. Instead, each particle determines its step size based on its energy and the local curvature. This allows for faster convergence for the purpose of the mesh extraction.

One recent advance in extracting high-quality meshes for isosurface computation is based on a dynamic particle system. Particle placement techniques require a significant amount of time to produce a satisfactory mesh. To address this problem, Kim et al. [5] have studied the parallelism of particle placement and the use of CUDA, a parallel programming platform for the GPU. The approach significantly improves the performance of the particle placement. Kim et al. [5] present their curvature-dependent sampling method and its implementation using CUDA on the GPU to extract high-quality meshes. They also devise an efficient implementation of a particle system on the GPU to reduce the runtime of the particle system.

Drone [6] presents different methods for creating an advanced interaction

particle system. The computations and data reside entirely on the GPU

and Drone [6] uses non-parametric particle systems on the GPU to display

the complex behavior of a particle.

Non-parametric particle systems must react to their environment as well as to each other, which gives them a true advantage over parametric systems. To handle the particle interaction in the system, Drone [6] outlines a method of dealing with N-body problems on the GPU - Force Splatting for N^2 Particle Interactions. The goal is to project the force from one particle onto all other particles in the system during a single operation. The algorithm exploits the alpha blending capabilities and the fast rasterization of modern graphics hardware without the constant need to recreate complex space partitioning structures on the GPU.

Another important aspect when it comes to particle systems is how to sort

the particles.

In the past, many sorting algorithms have been proposed (Knuth [7], Martin [8]), and Quicksort is one of the fastest algorithms used in practice. Since it is so widely used, many optimized implementations of Quicksort are available. A Quicksort implementation using SIMD instructions would be ideal, but according to Inoue [9] there is no known technique to implement the Quicksort algorithm using existing SIMD instructions.

The performance of sorting is often dominated by pipeline stalls caused by branch mispredictions, according to Sanders and Winkel [10]. The algorithm presented by Inoue [9] makes it possible to take advantage of the data parallelism of SIMD instructions while avoiding pipeline stalls caused by mispredictions.

Chapter 3

Background

This chapter will go through the history of particle systems and the hardware evolution. It will also describe the third-party software used in this report and provide a comparison of sorting algorithms suitable for particle systems.

3.1 History

3.1.1 A particle system is born

In the early 1980s, William Reeves, who worked on Star Trek II: The Wrath of Khan, started to research methods to create realistic natural phenomena in real time, in particular for creating realistic fire in the Genesis Demo sequence (see figure 3.1). Reeves realized that conventional modeling would not do the trick even though it was the best at creating objects that had smooth, well-defined surfaces [11].

He used the term “fuzzy” when referring to the objects and thought it would

be better if they were modeled as a system of particles that behaved within

a set of dynamic rules.

Reeves was neither the first who wanted to use particles nor the first to use them. Particles had been used before to create natural effects such as smoke and galaxies of stars. However, particles turned out to be hard to control. Even though they were not referred to as such, particle systems were used in some of the very first video games, for example Spacewar (see figure 3.2), which used particles to display explosions as early as 1962.

Figure 3.1: An early particle system used in Genesis.

Reeves came to the conclusion that by applying a system of rules to the

particles, a chaotic effect could be achieved while maintaining some creative

control.

During the past thirty years there have been many particle systems that made people gasp for air while watching buildings blow up, giant cities get flooded or other visually stunning effects. Notable uses in the movie industry include Lord of the Rings, Transformers, The Avengers and Ice Age.

Figure 3.2: Two spaceships, one goal - blast each other out of the sky. Built by a small team at MIT led by Steve Russell. The relatively action-packed space shooter was outrageously ahead of its time.

3.2 Hardware evolution

The evolution in both CPU and GPU hardware has been progressing extremely fast over the last couple of decades.

3.2.1 CPU evolution

In the late 1970s, Intel released the first x86 CPU (Central Processing Unit) with the introduction of the Intel 8086. This instruction set gradually became the industry standard and Intel processors were the most popular processors going into the 1990s. This decade saw a race where the clock speeds of the processors increased at a rapid pace. However, in the early 2000s, processor manufacturers hit what would become known as the power wall. This meant that it was not possible to create processors with higher clock frequencies without using large amounts of power. This led to the introduction of multicore processors. This shift was not only a shift in hardware, since a new type of programming was needed to fully utilize the power of the new processors. This is where the future of processors lies: multiple cores and threads running concurrent software.

3.2.2 GPU evolution

GPUs (Graphics Processing Units) started emerging in the early 1990s. Early examples of mass-market dedicated GPUs were those in the PlayStation and the Nintendo 64. The first GPUs were dedicated to accelerating 3D functionality with separate discrete boards, the most notable example of this setup being 3dfx with their Voodoo cards. The 1990s also saw the introduction of OpenGL and quite soon, the API influenced hardware development. During the late 1990s, Microsoft introduced an API similar to OpenGL, called DirectX. This API eventually gained popularity and became the industry standard among Windows game developers.

The next step in GPU evolution came with the introduction of GPGPU (General-Purpose computing on GPUs). The term was coined by Mark Harris in 2002 when he noticed a trend of using the GPU for non-graphics work [12]. The GPGPU evolution is still ongoing and it is still not as common as might have been expected.

Even more complex systems can be simulated by creating a new particle

system when each particle dies. The technique of using more than one

particle system was used as early as in the Genesis sequence in Star Trek II.

Up to 400 particle systems consisting of 750 000 particles were used.

3.3 SIMD - Single Instruction Multiple Data

SIMD (Single Instruction Multiple Data), see figure 3.3, is a method which lets one microinstruction operate on multiple data items at the same time. A single computer instruction performs the same action (retrieve, calculate or store) simultaneously on two or more pieces of data. Typically this is realized with many simple processors, each with a local memory in which it keeps the data it will work on.

The advantage of the SIMD format is that for the cost of doing a single instruction, N instructions' worth of work is performed. This results in large speedups for data-parallelizable algorithms.

Each processor simultaneously performs the same instruction on its local data, progressing through the instructions in lock-step, with the instructions issued by the controller processor. SIMD concepts are also applicable to GPUs since they are massively parallel units that are capable of vector operations on wide registers.

Particle systems are in general data-parallel, i.e. the same operation is

performed on a large amount of data. Therefore, SIMD programming is

beneficial for particle systems.

Figure 3.3: SISD - Single Instruction Single Data. SIMD - Single Instruction Multiple Data.

DirectCompute

DirectCompute is an API from Microsoft that supports general-purpose computing on graphics processing units. It is part of the Microsoft DirectX collection of APIs and was initially released with the DirectX 11 API. DirectCompute allows for vendor-independent development of GPGPU algorithms which can be used for particle system computations.

Movie Textures

To simulate very complicated particle effects in real time, reality has to be

faked. One way to simulate such an effect, which could be a big explosion,

is to use movie textures.

Movie textures are animated textures that are created from a video file. It

is a technique where a movie is played back on a simple polygon. They can

be used for cut scene movie sequences or to render movies into the scene

itself.

Figure 3.4: Example of a movie texture used in Battlefield 4.

3.4 Sorting Particles

Since particles are often transparent to some degree, sorting has to be performed. This is done to guarantee that the order in which the particles are rendered corresponds to their depth. The fundamental operation of many sorting algorithms is to compare two values and swap them if they are out of order. Since sorting is such a fundamental part of particle systems, it is important to conduct experiments to get performance numbers for the different algorithms before making a choice.
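The depth ordering described above can be sketched as follows. This is a minimal Python illustration, assuming that depth is measured as the distance along a normalized view direction and that transparent particles are drawn back to front; the function name and the tuple-based particle layout are hypothetical.

```python
def sort_back_to_front(particles, camera_pos, view_dir):
    """Return particle positions ordered from farthest to nearest.

    particles is a list of (x, y, z) positions, camera_pos the camera
    position and view_dir the (assumed normalized) viewing direction.
    """
    def depth(p):
        # Signed distance from the camera along the view direction.
        return sum((p[i] - camera_pos[i]) * view_dir[i] for i in range(3))

    # Farthest particles first, so that nearer ones are blended on top.
    return sorted(particles, key=depth, reverse=True)
```

Here Python's built-in `sorted` stands in for whichever sorting algorithm is eventually chosen; the sections below look at the candidates.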

3.4.1 Bubblesort

Bubblesort compares each element to the next element and makes a swap if sorting is needed. The gap between the two elements being compared is always one. The complexity for Bubblesort is Θ(n^2).
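A minimal Python sketch of Bubblesort, written for illustration only, could look like this. The early exit when a pass makes no swaps is a common refinement and an addition of this sketch.

```python
def bubblesort(values):
    """Sort a copy of values by repeatedly swapping adjacent elements."""
    items = list(values)
    for end in range(len(items) - 1, 0, -1):
        swapped = False
        for i in range(end):
            # The gap between the compared elements is always one.
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
        if not swapped:
            # No swaps in a full pass means the list is already sorted.
            break
    return items
```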

3.4.2 Combsort

Combsort is an extension of Bubblesort. It compares and, if needed, swaps two non-adjacent elements. Performance is drastically improved by comparing values that are far apart, because each value moves toward its final position more quickly. Unlike in Bubblesort, the gap between the two compared elements can be more than one. The inner loop of Bubblesort, which handles the actual sort, requires a modification: the gap between the two elements is reduced for each iteration of the outer loop by dividing it by a shrink factor, giving the gap sequence:

[inputSize / shrinkFactor,

inputSize / shrinkFactor^2,

inputSize / shrinkFactor^3,

...]

The length of the list being sorted divided by the shrink factor decides the

value of the gap and the list is then sorted with that gap. During the sorting

the gap is divided by the shrink factor again and the process repeats until

the gap is 1. If the list is not fully sorted by this point, the sort continues

using a gap of 1 until sorting is completed. The final step is thus equivalent

to an efficient Bubblesort since most problems have been dealt with.

The computational complexity of combsort approximates Θ(n · log(n)) on

average.
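The gap handling described above can be sketched in Python as follows. The default shrink factor of 1.3 is a commonly used value and an assumption of this sketch; the text does not specify one.

```python
def combsort(values, shrink_factor=1.3):
    """Sort a copy of values using Combsort with a shrinking gap."""
    items = list(values)
    gap = len(items)
    swapped = True
    # Continue until the gap has shrunk to 1 and a full pass makes no swaps.
    while gap > 1 or swapped:
        gap = max(1, int(gap / shrink_factor))
        swapped = False
        for i in range(len(items) - gap):
            if items[i] > items[i + gap]:
                items[i], items[i + gap] = items[i + gap], items[i]
                swapped = True
    return items
```

Once the gap reaches 1, the remaining passes behave like a Bubblesort on an almost sorted list.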

3.4.3 Insertion sort

Insertion sort is an algorithm that works relatively efficiently for small lists and mostly sorted lists (Shellsort is a variant of Insertion sort that is more efficient for larger lists). Being a simple sorting algorithm, it is often used as part of a more sophisticated algorithm. It builds the final sorted array (or list) one item at a time.

Insertion sort is not as efficient on large lists as more advanced algorithms (Quicksort, Heapsort or Mergesort), but it provides several advantages - it is simple to implement and is efficient for small data sets. Insertion sort is also stable, can sort a list as it receives it, and only requires a constant amount Θ(1) of additional memory space.

Insertion sort consumes one input element each repetition and the sorted output list grows. In each iteration, one element is removed from the input data, the location it belongs to in the output list is found, and the element is inserted at that position.

The sorting is typically done in-place by growing the sorted part of the array behind the position that is currently being iterated. The value of each array element is compared against the largest value in the sorted part. If the value is larger than the largest sorted value, the element is left in place and the iteration moves on to the next element. If the value is smaller, the algorithm finds the correct position in the sorted part, shifts all the larger values to make room, and inserts the value at the correct position.
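The in-place procedure can be sketched in Python as follows. The function works on a copy only so that the caller's list is left untouched, which is a choice of this example.

```python
def insertion_sort(values):
    """Sort a copy of values by growing a sorted prefix one element at a time."""
    items = list(values)
    for i in range(1, len(items)):
        current = items[i]
        j = i - 1
        # Shift larger values in the sorted prefix one slot to the right.
        while j >= 0 and items[j] > current:
            items[j + 1] = items[j]
            j -= 1
        # Insert the element at its position in the sorted prefix.
        items[j + 1] = current
    return items
```

Because only strictly larger values are shifted, equal elements keep their original order, which makes the sort stable.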

3.4.4 Mergesort

Mergesort, see figure 3.5, is a divide-and-conquer, comparison-based algorithm invented in 1945 by John von Neumann.

Mergesort first divides the unsorted list into the smallest units possible (single elements) and then merges adjacent units, i.e. it compares every two elements (the first with the second, the third with the fourth, etc.) and makes a swap if the first element should come after the second. Each of the resulting lists of two is merged into lists of four, which are then merged into lists of eight, and so on, until at last two lists are merged into the final sorted list.

Mergesort scales well to very large lists and its worst-case running time is Θ(n · log n). In the worst case, Mergesort does ca. 39 % fewer comparisons than Quicksort does in the average case.

Stable sorting in-place is possible but causes the algorithm to be a bit slower, even though it still manages Θ(n · log n) time. Stable sorting can be achieved by merging the blocks recursively.
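The bottom-up merging described above can be sketched in Python as follows. This version is not in-place; it allocates new lists for the merged runs, which is a simplification made for this example.

```python
def merge(left, right):
    """Merge two sorted lists into one sorted list."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        # "<=" keeps equal elements in their original order (stable).
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def mergesort(values):
    """Sort values bottom-up: runs of 1, then 2, then 4 elements, and so on."""
    runs = [[v] for v in values]
    if not runs:
        return []
    while len(runs) > 1:
        merged_runs = []
        for k in range(0, len(runs), 2):
            if k + 1 < len(runs):
                merged_runs.append(merge(runs[k], runs[k + 1]))
            else:
                # An odd run out is carried over to the next round.
                merged_runs.append(runs[k])
        runs = merged_runs
    return runs[0]
```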

Figure 3.5: Mergesort

3.4.5 Quicksort

Quicksort, see figure 3.6, is a divide-and-conquer algorithm that relies on a partition operation. It is also a comparison sort since the elements are compared to each other. Quicksort is a fast sorting algorithm which has, on average, Θ(n · log n) complexity, which makes it suitable for sorting large data sets.

A pivot point, an array element, is selected and then used when doing the

partition. With the pivot point comes two sublists that are re-ordered -

smaller values are moved to the left of the pivot point and larger elements

are moved to the right. This can be done efficiently in linear time and

in-place.

The two sublists are then recursively sorted, each sublist getting its new

pivot point and sublists.

The quicksort algorithm is suitable for CPU sorting.
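The partition step and the recursion can be sketched in Python as follows. Choosing the last element as the pivot (the Lomuto scheme) is an assumption of this sketch; the text does not say how the pivot is selected.

```python
def partition(items, low, high):
    """Partition items[low..high] around the pivot items[high], in place."""
    pivot = items[high]
    boundary = low
    for i in range(low, high):
        # Smaller values are moved to the left of the boundary.
        if items[i] < pivot:
            items[i], items[boundary] = items[boundary], items[i]
            boundary += 1
    # Place the pivot between the two sublists.
    items[boundary], items[high] = items[high], items[boundary]
    return boundary


def quicksort(items, low=0, high=None):
    """Sort items in place and return the same list."""
    if high is None:
        high = len(items) - 1
    if low < high:
        p = partition(items, low, high)
        quicksort(items, low, p - 1)
        quicksort(items, p + 1, high)
    return items
```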

Figure 3.6: Quicksort

3.4.6 Radix Sort

Radix sort is a sorting algorithm that sorts numbers by processing individual digits, which makes it a distribution sort. It is a simple sort that is both easy to understand and easy to use.

LSD (Least Significant Digit) radix sort sorts numbers by the least significant digit, then the next least significant digit, and so on. If two numbers have the same digit in the current position, the number that came first must always stay before the other.

LSD radix sort therefore requires a stable sort for each digit pass, so that the relative order established by earlier passes is preserved.

Table 3.1: LSD - Sorting by least significant digit.

Unsorted   Sorted by 1s   Sorted by 10s   Sorted by 100s
123        123            123             123
583        583            625             154
154        154            154             456
567        625            456             567
689        456            567             583
625        567            583             625
456        689            689             689

MSD (Most Significant Digit) radix sort does not require the use of a stable sort, and the in-place MSD radix sort is not stable.

While LSD radix sort starts with the least significant digit, MSD radix sort starts with the most significant digit.
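The LSD passes shown in table 3.1 can be reproduced with a short Python sketch. It assumes non-negative integers and uses one bucket per digit value; the function name and the returned list of intermediate passes are hypothetical and exist only to make the steps visible.

```python
def radix_sort_lsd(values, base=10):
    """Sort non-negative integers digit by digit, least significant first.

    Returns the sorted list together with the list state after each pass.
    """
    items = list(values)
    passes = []
    if not items:
        return items, passes
    largest = max(items)
    digit_weight = 1
    while largest // digit_weight > 0:
        buckets = [[] for _ in range(base)]
        for value in items:
            # Appending preserves the order from earlier passes (stable).
            buckets[(value // digit_weight) % base].append(value)
        items = [value for bucket in buckets for value in bucket]
        passes.append(list(items))
        digit_weight *= base
    return items, passes
```

Calling `radix_sort_lsd([123, 583, 154, 567, 689, 625, 456])` yields three passes that correspond to the three sorted columns of the table.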

3.4.7 Summary

The choice of sorting algorithm was based both on measurements and the

suitability for a SIMD implementation. Since quicksort is recursive, it was

discarded quite early even though it has good performance characteristics

in the single threaded case. The choice was made to use radix sort due to

the possibility for a SIMD implementation.

3.5 Third Party Software

3.5.1 RealFlow

RealFlow is used to simulate fluids, water surfaces, fluid-solid interactions, soft bodies, rigid bodies and meshes. It uses particle-based simulations which can be influenced in a multitude of ways by point-based nodes. These nodes can do anything from simulating gravity to recreating the vortex-like motion of a tornado. [13]

Large Scale

A cutting-edge hybrid grid/particle solver, Hybrido, provides endless possibilities for large-scale simulations, such as floods or oceans with breaking waves.

RealFlow automatically creates particles by calculating the conditions for splashes, foam and mist information. It is possible to generate millions of particles using this advanced feature.

Small-scale simulations

The SPH (Smoothed-Particle Hydrodynamics) solver in RealFlow is ideally

suited for highly-detailed fluid simulations with tiny splashes and turbulent

surfaces.

Particles are generated by emitters and their total amount represents the fluid. Each particle can be handled as a point in 3D space with certain properties (velocity, position or mass). The emitters can interact with solid or soft bodies and RealWave objects. The emitters can also be completely customized and it is even possible to write your own fluid engine if desired.

Oceans

When an ocean surface has to be simulated quickly and effectively, RealWave is ideal since it is a powerful simulation toolset for small to medium ocean surfaces. To achieve certain wave forms and structures it uses the displacement of the vertices in a mesh.

3.5.2 Alembic

Alembic distills complex, animated, scenes into non-procedural, application-

independent, baked geometric results. Alembic will efficiently store the an-

imated vertex positions and animated transforms that result from an ar-

bitrarily complex animation and simulation process. It will not attempt

to store any representation of the network of computations which were re-

quired to produce the final animated vertex positions and animated trans-

forms.

Chapter 4

Particle Systems

This chapter will present the concept of a particle system. The chapter will

furthermore describe different types of interaction, simulation and rendering

for particle systems.

4.1 Particle Systems

“In physical sciences, a particle is a small localized object

which can be described by physical properties such as volume or

mass. The word is rather general in meaning, and is refined as

needed by various scientific fields.” [14]

4.1.1 A particle system in general

A particle system is a collection of 3D points in space where each point represents a single particle. Unlike standard geometry objects, which are static, particle systems are dynamic. Each particle goes through a complete life cycle: it is born, changes over time and then dies. Particle systems tend to be chaotic since a given particle does not have a pre-determined path. Each particle can also have a random element, called a stochastic process, which modifies its behavior and makes the effect look organic and natural.

There are many use cases for particle systems in games and other computer

graphics applications - water, foam, smoke, dust, fire, blood splashes, sparks,

even hair and cloth simulation. Worth mentioning is that these kinds of sys-

tems also have scientific applications - cosmological simulations use particle

systems with tens of millions of particles for studying the creation of the

universe. Particle simulations are also used in research for large and costly

fusion reactors.

4.1.2 The Particle

When building a particle system, the particles can have a number of properties. The minimal set of properties is typically:

• The current position as well as the previous position.

• The direction in which the particle is currently traveling (can be stored in a direction vector).

• The speed of the particle, which can be combined with the direction vector by multiplication.

Since a particle goes through a lifetime, in addition to the above, the life

count of the particle needs to be stored. This is the number of frames that

the particle has existed which is compared to a set limit on the lifetime of

particles.

4.1.3 Different types of Particle Systems

There are many different types of particle systems. The different kinds

of systems can be categorized by the level of interaction between parti-

cles.

No Interaction

No interaction refers to the fact that the particles are independent of each

other. The computational complexity of such a system is Θ(n) per step and

places non-interacting particle systems as the least computationally complex

type.

for i = 1 to numParticles do
    move particle i
end for

Limited Interaction

This kind of system has particles that interact with neighboring particles, that is, other particles within a short range. A use case for this type of interaction would be collisions between hard spheres that bounce off each other.

There are two ways to compute this type of particle system: brute force or using spatial data structures. The computational complexity for the brute force approach is Θ(n²). However, by using a spatial data structure and a neighborhood size a, the computational complexity can be reduced, since each particle is then only tested against the particles in its neighborhood.

for i = 1 to numParticles do
    move particle i
    for all j in neighbours(particle[i]) do
        check collision between particle i and j
    end for
end for
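As an illustration of the spatial data structure approach, the sketch below finds overlapping spheres with a uniform grid stored in a hash map. It is only one possible structure and is not taken from the implementation in this work; the names Vec3, cellKey, cellOf and findCollisions, the 21 bits per axis in the hash key and the choice of a cell size equal to the sphere diameter are all assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Vec3 { float x, y, z; };

// Packs three integer cell coordinates into one hash key (21 bits per axis).
static std::uint64_t cellKey(std::int64_t cx, std::int64_t cy, std::int64_t cz)
{
    const std::uint64_t mask = (1ull << 21) - 1;
    return ((static_cast<std::uint64_t>(cx) & mask) << 42) |
           ((static_cast<std::uint64_t>(cy) & mask) << 21) |
           (static_cast<std::uint64_t>(cz) & mask);
}

// Integer cell coordinate of a position along one axis.
static std::int64_t cellOf(float v, float cellSize)
{
    return static_cast<std::int64_t>(std::floor(v / cellSize));
}

// Returns every pair (i, j) with i < j whose spheres of the given radius overlap.
std::vector<std::pair<std::size_t, std::size_t>>
findCollisions(const std::vector<Vec3>& pos, float radius)
{
    assert(radius > 0.0f);
    const float cellSize = 2.0f * radius;   // overlapping spheres lie in adjacent cells

    std::unordered_map<std::uint64_t, std::vector<std::size_t>> grid;
    for (std::size_t i = 0; i < pos.size(); ++i)
        grid[cellKey(cellOf(pos[i].x, cellSize), cellOf(pos[i].y, cellSize),
                     cellOf(pos[i].z, cellSize))].push_back(i);

    std::vector<std::pair<std::size_t, std::size_t>> hits;
    for (std::size_t i = 0; i < pos.size(); ++i) {
        const std::int64_t cx = cellOf(pos[i].x, cellSize);
        const std::int64_t cy = cellOf(pos[i].y, cellSize);
        const std::int64_t cz = cellOf(pos[i].z, cellSize);
        // Only the 27 cells around the particle are inspected.
        for (std::int64_t dx = -1; dx <= 1; ++dx)
        for (std::int64_t dy = -1; dy <= 1; ++dy)
        for (std::int64_t dz = -1; dz <= 1; ++dz) {
            const auto it = grid.find(cellKey(cx + dx, cy + dy, cz + dz));
            if (it == grid.end()) continue;
            for (std::size_t j : it->second) {
                if (j <= i) continue;       // report each pair once
                const float ex = pos[i].x - pos[j].x;
                const float ey = pos[i].y - pos[j].y;
                const float ez = pos[i].z - pos[j].z;
                if (ex * ex + ey * ey + ez * ez < cellSize * cellSize)
                    hits.emplace_back(i, j);
            }
        }
    }
    return hits;
}
```

Hash keys of cells that are very far apart may coincide in this sketch. This only adds candidates, since every candidate pair still passes through the exact distance test.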

Full Interaction

In this kind of system all the particles affect each other as occurs with a

gravitational or electrostatic force. The brute force approach is required and

this type of particle system has a computational complexity of Θ(n²).

for i = 1 to numParticles do
    for j = 1 to numParticles where j != i do
        compute interaction between j and i
    end for
    move particle i
end for
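For a gravitational force, the double loop above could look as follows. This is a sketch under stated assumptions: the types Vec3 and Body, the function name gravitationalAccelerations and the softening term (added to avoid division by zero for coincident bodies) are not part of the original text.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };
struct Body { Vec3 position; float mass; };

// Accumulates, for every body, the acceleration caused by all other bodies.
// The nested loops visit every ordered pair, hence the quadratic cost.
std::vector<Vec3> gravitationalAccelerations(const std::vector<Body>& bodies,
                                             float G, float softening)
{
    assert(softening > 0.0f);
    std::vector<Vec3> acc(bodies.size(), Vec3{0.0f, 0.0f, 0.0f});
    for (std::size_t i = 0; i < bodies.size(); ++i) {
        for (std::size_t j = 0; j < bodies.size(); ++j) {
            if (j == i) continue;
            const float dx = bodies[j].position.x - bodies[i].position.x;
            const float dy = bodies[j].position.y - bodies[i].position.y;
            const float dz = bodies[j].position.z - bodies[i].position.z;
            const float distSq = dx * dx + dy * dy + dz * dz + softening * softening;
            const float invDist = 1.0f / std::sqrt(distSq);
            const float s = G * bodies[j].mass * invDist * invDist * invDist;
            acc[i].x += s * dx;
            acc[i].y += s * dy;
            acc[i].z += s * dz;
        }
    }
    return acc;
}
```

The accelerations are computed for all bodies before any body is moved in this variant, so that every interaction uses positions from the same time step.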

Particle Data Structure

All the properties are stored in a structure of some kind, and if a more complex particle is wanted, more properties can easily be added. Possible additions are a size to animate the size of the particles, the mass of the particle, or transparency in the form of an alpha component in the color.

A restriction on the particle structure is that its size should be kept as small as possible, so that huge amounts of particles can be handled with reasonable amounts of memory.

struct Particle
{
    Vector position;
    Vector velocity;
    float mass;
    // ...
};

4.1.4 The Emitter

Once a particle system is created, it has to be placed in the world to make any sense. The particle emitter is the entity responsible for creating the particle system, and it is this object that is placed in the 3D world.

The number of particles and the general direction in which they should be

emitted as well as all the global settings are controlled by the emitter. This

could for example be the above mentioned lifetime setting for the parti-

cles.

4.1.5 Simulation

For the particles to be visually interesting, they have to move. There are

different ways to move the particles, either through real-time simulation

or off-line simulations that are baked in different formats and played back

later.

Physics

The physics model in a particle system could handle attributes such as the

mass of the particle which could be randomized, causing gravity to affect

each particle individually. Friction could be added to force some particles to

slow down while animating. Other local spatial effects such as wind gusts,

magnetic fields and rotational vortexes would make the particles stand out

from each other even more. Collision handling can also be added to the

particles to have them interacting with the surrounding world.
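A simple physics step with gravity, wind and friction could be sketched as below. The model is an assumption made for illustration: friction is expressed as a linear drag towards the wind velocity, divided by the particle mass, so that heavier particles react more slowly. The names Environment and stepParticles are hypothetical.

```cpp
#include <cassert>
#include <vector>

struct Vec3 { float x, y, z; };
struct Particle { Vec3 position; Vec3 velocity; float mass; };

// Hypothetical global settings shared by all particles.
struct Environment { Vec3 gravity; Vec3 wind; float drag; };

// Advances all particles by dt seconds using semi-implicit Euler integration:
// the velocity is updated first and the new velocity moves the particle.
void stepParticles(std::vector<Particle>& particles, const Environment& env, float dt)
{
    for (Particle& p : particles) {
        assert(p.mass > 0.0f);
        const float k = env.drag / p.mass;   // drag has less effect on heavy particles
        const Vec3 a{ env.gravity.x + k * (env.wind.x - p.velocity.x),
                      env.gravity.y + k * (env.wind.y - p.velocity.y),
                      env.gravity.z + k * (env.wind.z - p.velocity.z) };
        p.velocity.x += a.x * dt;
        p.velocity.y += a.y * dt;
        p.velocity.z += a.z * dt;
        p.position.x += p.velocity.x * dt;
        p.position.y += p.velocity.y * dt;
        p.position.z += p.velocity.z * dt;
    }
}
```

With randomized masses, particles that start with identical velocities drift apart under the same wind, which gives the individual behavior described above.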

External Influences

It is important to consider all of the possible parameters that might be

wanted when creating a particle system and build that flexibility into the

system.

With wind as a parameter there might be a need to change the wind

direction vector. For example, when a car drives on a snowy road, the

snowflakes are affected by a new wind direction generated by the car and

the snow responds to the wind as the car passes.

Updating the Particles

For each cycle of the simulation, each particle needs to be updated. To make

sure not to waste valuable time, the status of the particle is inspected to see

if the lifetime of the particle has expired. If the particle is marked as dead it

is removed from the emitter and returned to the global particle pool.

Point Cloud Vertex Animation

To enable complex animations, for example facial animations, point clouds

can be used. Point clouds are pre-rendered particle data that is exported

and played back later. This data is also attached to underlying vertex data

in a mesh, resulting in vertices being “skinned” to the point cloud data.

This way, it is possible to create detailed animations with the help of

particles. In this case, the particles are only used indirectly and not rendered

on screen.

4.1.6 Rendering

There are many ways to render a particle system depending on what the

system represents.

In a violent game with a lot of blood splatter, there might be a need for multiple blood systems: blood pool, blood splat, blood squirt and camera lens blood splat. Each blood system contains suitable particles, all requiring their own rendering technique, creating a chain of effects that together result in a final effect. The blood squirt system would render blood squirts flying through the air, and when a squirt collides with an object (a wall, the ground, etc.) the blood splat function would be called. This would create messy blood splats on the object.

Since a particle system is basically a collection of 3D points in space, it can

be rendered as just that - a set of colored 3D points. There is always the

option to calculate a polygon around the 3D point which always faces the

camera like a billboard. Perspective can be created by scaling the polygon

with the distance from the camera. Another option is to draw a 3D object of

any type at the position of the particle - the possibilities are endless.

As said above, the particles can be represented by a polygon when rendering.

This polygon is most often a quad, i.e. four vertices.

Vector vertices[] =
{
    Vector(1.f, 1.f, 0.f),
    Vector(-1.f, 1.f, 0.f),
    Vector(-1.f, -1.f, 0.f),
    Vector(1.f, -1.f, 0.f)
};
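To make such a quad face the camera like a billboard, its corners can be spanned by the right and up vectors of the camera instead of the world axes. The sketch below assumes that these two unit vectors are available from the view matrix; the names Vec3 and billboardCorners are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

struct Vec3 { float x, y, z; };

// Returns the four world space corners of a camera-facing quad centered on a
// particle. cameraRight and cameraUp are assumed to be unit vectors.
std::array<Vec3, 4> billboardCorners(const Vec3& center, const Vec3& cameraRight,
                                     const Vec3& cameraUp, float halfSize)
{
    assert(halfSize > 0.0f);
    // Same corner order as the unit quad: (1,1), (-1,1), (-1,-1), (1,-1).
    const float sx[4] = { 1.0f, -1.0f, -1.0f, 1.0f };
    const float sy[4] = { 1.0f, 1.0f, -1.0f, -1.0f };
    std::array<Vec3, 4> corners{};
    for (std::size_t i = 0; i < 4; ++i) {
        const float r = sx[i] * halfSize;   // offset along the camera right vector
        const float u = sy[i] * halfSize;   // offset along the camera up vector
        corners[i] = Vec3{ center.x + r * cameraRight.x + u * cameraUp.x,
                           center.y + r * cameraRight.y + u * cameraUp.y,
                           center.z + r * cameraRight.z + u * cameraUp.z };
    }
    return corners;
}
```

In practice this expansion is often done in a vertex shader rather than on the CPU, but the arithmetic is the same.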

4.1.7 Performance

There has always been a great difference between the spectacular effects in movies and those in games, and until recently this has mostly been due to hardware limitations. There is simply no way for a PlayStation 3 to render millions of fluid particles, skyscrapers being blown up into millions of pieces or a completely flooded city in real time.

To get an accurate and realistic simulation, most use cases require a large number of particles. The more particles, the higher the computational cost, which has a fundamental impact on performance and hence limits the size of the particle system.

Games need to have smooth animation and responsive interaction, therefore fast execution times are required. As the need for larger and more realistic particle systems increases, so does the computational cost.

Before the introduction of GPGPU computations and the game physics en-

gine PhysX, fluid simulations in games did not use particle systems. Today

substantial real-time fluid simulations can be performed.

Memory

As mentioned in section 4.1.3, it is important that the memory requirements for a single particle are kept at a minimum to be able to have huge amounts of particles. To further manage the memory requirements for a particle system, it is common to use a pool for the particle memory. Since the pool consists of blocks of the same size, managing fragmentation also becomes easier.

Memory operations, that is allocations and releases, should be as few as possible due to the performance overhead inherent in the required context switches. If a particle gets old and dies, its memory should not be released. The particle should instead be flagged as dead so that it can be re-initialized later. It is not until all the particles in the particle system are marked as dead that the allocated memory for the entire system is released. This is done by using the pool design mentioned above, so that all memory in the fixed-size pool is pre-allocated and just flagged as free when a particle is released.
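The pool design can be sketched as a pre-allocated array together with a list of free slots. This is a minimal illustration and not the pool used in Frostbite; the class name ParticlePool, its member functions and the layout of Particle are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Particle {
    float position[3];
    float velocity[3];
    std::uint32_t age;    // life count in frames
    bool alive;
};

class ParticlePool {
public:
    // All memory is allocated once, up front.
    explicit ParticlePool(std::size_t capacity) : particles_(capacity)
    {
        assert(capacity > 0);
        freeSlots_.reserve(capacity);
        for (std::size_t i = capacity; i > 0; --i)
            freeSlots_.push_back(i - 1);
    }

    // Re-initializes a free slot. Returns nullptr when the pool is exhausted.
    Particle* spawn()
    {
        if (freeSlots_.empty()) return nullptr;
        Particle& p = particles_[freeSlots_.back()];
        freeSlots_.pop_back();
        p = Particle{};
        p.alive = true;
        return &p;
    }

    // Ages all live particles. Expired particles are only flagged as dead and
    // their slot is marked as free; no memory is released.
    void update(std::uint32_t maxAge)
    {
        for (std::size_t i = 0; i < particles_.size(); ++i) {
            Particle& p = particles_[i];
            if (!p.alive) continue;
            if (++p.age >= maxAge) {
                p.alive = false;
                freeSlots_.push_back(i);
            }
        }
    }

    std::size_t liveCount() const { return particles_.size() - freeSlots_.size(); }

private:
    std::vector<Particle> particles_;
    std::vector<std::size_t> freeSlots_;
};
```

Since the free list is reserved to the full capacity in the constructor, neither spawn nor update allocates memory after construction.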

Rendering

When a particle, a 3D point in space, is supposed to correspond to, for example, a snowflake, the image of the snowflake has to be drawn on a polygon. A particle most likely needs four vertices, which form two triangles. Thus, with 3000 visible snowflake particles, 6000 visible triangles are added for the snow alone. Since most particles in a particle system move, the vertex buffer cannot be pre-calculated and needs to be updated every frame. Most particle rendering methods also use hardware instancing to lower the CPU overhead that comes from setting states and dispatching draw calls.

Chapter 5

Method

This chapter will describe in detail how the implementation of a pre-simulated

particle system was made in Frostbite. The chapter will describe implemen-

tation of pipelines, streaming and rendering for the particle system.

5.1 Introduction

An implementation of a pre-simulated streaming particle system was made

in the Frostbite game engine. It reads particles pre-simulated in RealFlow,

exported as Alembic files. The system then uses a custom streaming solution

to minimize memory usage. A basic sorting algorithm and rendering system

for the streamed particles were also implemented. The data flow for the

system is illustrated in figure 5.1.

5.2 Alembic importer

The first step of creating a ParticleStream is authoring the simulation. This is done in RealFlow (any software capable of exporting Alembic files could be used) and the result is exported to an Alembic file. Alembic is an open source exchange format for digitally created assets, developed by Sony Pictures Imageworks and Industrial Light & Magic. The format supports high-level constructs such as meshes and points (particles). It is also possible to access the low-level parts of an archive to attach and store any data such as particle scales, colors and vertex colors for meshes.

Figure 5.1: Figure illustrating the data flow from artist creation to Frostbite runtime.

To get the particle data into the Frostbite asset pipeline, a custom importer

was written that reads particle data from an Alembic file and stores it in-

side a custom binary file in the Frostbite asset pipeline, called a sandbox

file.

5.2.1 Particle information

The particle information contained in the Alembic file format is organized in

a Point structure inside Alembic. This structure contains information about

the position and has a numerical identifier for each particle. As stated above,

it is also possible to access lower-level constructs of the Alembic archive to

attach arbitrary data to each particle.

5.3 Pipeline

After the particle data has been read from the Alembic file, the pipeline processes the custom binary file to create streamable chunks of particle data. These chunks are 2 MB each, which is the standard chunk size for free streaming chunks in Frostbite. The pipeline goes through all frames in the imported particle simulation and writes them into the chunks in the following format:

[position.x position.y position.z scale]
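The packing step can be illustrated with the sketch below, which appends all frames back to back as raw bytes and cuts the byte stream into chunks of a fixed size. It assumes four 32-bit floats per particle and no per-frame header, which the description above does not specify; the names PackedParticle, Chunk and packFrames are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Matches the layout [position.x position.y position.z scale].
struct PackedParticle { float x, y, z, scale; };
static_assert(sizeof(PackedParticle) == 16, "expected four tightly packed floats");

using Chunk = std::vector<std::uint8_t>;

// Writes all frames into chunks of chunkSizeBytes. A particle may straddle
// two chunks, and the last chunk may be only partially filled.
std::vector<Chunk> packFrames(const std::vector<std::vector<PackedParticle>>& frames,
                              std::size_t chunkSizeBytes)
{
    assert(chunkSizeBytes > 0);
    std::vector<Chunk> chunks;
    for (const std::vector<PackedParticle>& frame : frames) {
        const std::uint8_t* src = reinterpret_cast<const std::uint8_t*>(frame.data());
        std::size_t remaining = frame.size() * sizeof(PackedParticle);
        while (remaining > 0) {
            // Start a new chunk when there is none or the current one is full.
            if (chunks.empty() || chunks.back().size() == chunkSizeBytes) {
                chunks.emplace_back();
                chunks.back().reserve(chunkSizeBytes);
            }
            Chunk& chunk = chunks.back();
            const std::size_t n = std::min(remaining, chunkSizeBytes - chunk.size());
            chunk.insert(chunk.end(), src, src + n);
            src += n;
            remaining -= n;
        }
    }
    return chunks;
}
```

With 16 bytes per particle, a 2 MB chunk (taken as 2 · 1024 · 1024 bytes) holds 131072 particles.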

5.4 Runtime entity

To be useful in the game, there must be a way to place the particle simula-

tion in the game world. This is made possible by the implementation of a

ParticleStreamEntity which has a world space transform. This world space

transform is used to transform all the particles in the stream from object

space to world space.

5.5 Streaming

The amount of memory required to keep all the particles for the whole

simulation in main memory is too large and some sort of streaming is needed.

The above mentioned chunks are used for this streaming.

5.5.1 Streaming Cache

The particle simulation is then created at runtime and its meta-data is read. This meta-data stores how many particles there are per frame on average in the imported file and how many streaming chunks there are. The meta-data is then used to determine how many chunks need to be kept in memory. For a simulation with a higher frame rate, more chunks naturally need to be kept in memory. The number of chunks needed also depends on the average number of particles in a frame. The chunk cache is allocated as a large array for maximum data locality.

The cache size is calculated as follows:

s = (t · fps · T) / p    (5.1)

where s is the size of the cache expressed in streaming chunks, t is the target time to have in cache in seconds, fps is the frame rate of the simulation, T is the max number of particles in any frame in the simulation and p is the number of particles per streaming chunk.
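Equation 5.1 can be evaluated as in the sketch below. Rounding the result up to a whole number of chunks is an assumption, as is the function name cacheSizeInChunks.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// s = t * fps * T / p, rounded up to whole chunks.
std::uint32_t cacheSizeInChunks(double targetSeconds, double fps,
                                std::uint64_t maxParticlesPerFrame,
                                std::uint64_t particlesPerChunk)
{
    assert(particlesPerChunk > 0);
    const double s = targetSeconds * fps *
                     static_cast<double>(maxParticlesPerFrame) /
                     static_cast<double>(particlesPerChunk);
    return static_cast<std::uint32_t>(std::ceil(s));
}
```

As a worked example with assumed numbers: if each particle occupies 16 bytes, a 2 MB chunk holds p = 131072 particles. With t = 2 s, fps = 30 and T = 500000, the formula gives 2 · 30 · 500000 / 131072 ≈ 228.9, that is 229 chunks.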

Each chunk in the chunk cache has a state which can be one of loading, empty

and ready. The streaming process is then controlled by a state machine

described in section 5.5.2.

5.5.2 Streaming State Machine

A state machine (or rather a finite state machine) is a system that can, at any given time, be in exactly one of a pre-defined set of states, as shown in figure 5.2.

The particle stream is updated once per frame. In this update, the streaming state machine is updated. The update goes through all chunks in the chunk cache and takes the appropriate action depending on their state.

If a chunk is currently loading, no action is taken. Chunks that have finished loading from disk are set to the ready state. To avoid putting too much load on the I/O system, only one load is active at any given time. This means that whenever a finished load is detected, a new one can be started if needed. This request will always be created for the next chunk unless the previously loaded chunk is the last chunk in the animation. Also, when a chunk has been rendered completely, its state is set to empty, meaning it is ready to store a new chunk. In this way, the chunk cache acts as a ring buffer, which is illustrated in figures 5.3 and 5.5.
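The per-frame update of the chunk states can be sketched as follows. This is a simplified model and not the Frostbite code: the struct StreamingCache and the functions updateStreaming and releaseSlot are hypothetical, and the loadFinished parameter stands in for polling the I/O system.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class ChunkState { Empty, Loading, Ready };

struct StreamingCache {
    std::vector<ChunkState> states;   // one state per slot in the chunk cache
    std::size_t nextSlot = 0;         // ring buffer slot that receives the next load
    std::size_t nextChunk = 0;        // next chunk of the animation to request
    std::size_t totalChunks = 0;      // number of chunks in the animation
    std::size_t loadingSlot = 0;      // slot of the load currently in flight
    bool loadInFlight = false;
};

// Called once per frame.
void updateStreaming(StreamingCache& c, bool loadFinished)
{
    // A finished load makes its slot readable.
    if (c.loadInFlight && loadFinished) {
        c.states[c.loadingSlot] = ChunkState::Ready;
        c.loadInFlight = false;
    }
    // Only one load is active at a time. A new one is started when chunks
    // remain in the animation and the next slot in the ring is empty.
    if (!c.loadInFlight && c.nextChunk < c.totalChunks &&
        c.states[c.nextSlot] == ChunkState::Empty) {
        c.states[c.nextSlot] = ChunkState::Loading;
        c.loadingSlot = c.nextSlot;
        c.loadInFlight = true;
        c.nextSlot = (c.nextSlot + 1) % c.states.size();
        ++c.nextChunk;
    }
}

// Called when a slot has been consumed completely.
void releaseSlot(StreamingCache& c, std::size_t slot)
{
    assert(slot < c.states.size() && c.states[slot] == ChunkState::Ready);
    c.states[slot] = ChunkState::Empty;
}
```

With two slots and three chunks, the third chunk is not requested until the first slot has been released, which is the ring buffer behavior described above.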

Figure 5.2: Figure illustrating a (finite) state machine with states and state transitions.

Figure 5.3: Chunks have three states - empty (E), loading (L) and ready (R). F0, F1 and Fx show frame 0, 1 and x.

5.5.3 Reading from the Cache

The cache is essentially a ring buffer. This means that reads transparently continue from the start of the buffer when the end is reached, as illustrated in figure 5.4.

Figure 5.4: A ring buffer is a standard array in memory but is conceptually treated as a ring.

The contents of the buffer are treated as raw bytes, just as the chunks. This means that particle information can be split over chunks, as illustrated in figure 5.5, and it can also be the case that one particle starts at the end of the ring buffer and the rest of its information is in a chunk placed at the beginning. To allow for this, a given frame is read from the ring buffer into an intermediate buffer where particle information is aligned so that it is not split in memory. The reading from the ring buffer into this intermediate buffer is handled by support routines. Example code for these support routines is given below.

void incrementAndWrap(u32 bytesToIncrement, u8* const chunkCache)
{
    currReadPos += bytesToIncrement;

    // u8 is an 8-bit unsigned integer
    u8* const chunkCacheEnd =
        chunkCache + (CHUNK_SIZE_BYTE * chunkCount);

    // Cross chunk border?
    if (currReadPos - chunkCache >=
        (activeChunkIndex + 1) * CHUNK_SIZE_BYTE)
    {
        // The chunk that has been read completely can be reused
        chunkStates[activeChunkIndex] = LoadingState_Empty;
        activeChunkIndex = (activeChunkIndex + 1) % chunkCount;
    }

    // Wrap around
    if (currReadPos >= chunkCacheEnd)
        currReadPos = chunkCache + (currReadPos - chunkCacheEnd);
}

Figure 5.5: Particle information can be split over chunks. Fx and Fy show frame x and y.

Below is the algorithm for handling reads from the cache into the interme-

diate buffer described above.

void safeRead(void* dest, size_t size)
{
    u8* const chunkCache = static_cast<u8* const>(m_chunkCache);
    u8* destPtr = static_cast<u8*>(dest);
    u32 bytesToRead = static_cast<u32>(size);

    while (bytesToRead > 0)
    {
        // Read a safe amount of bytes
        // s64 is a 64-bit signed integer
        s64 rem = max<s64>(chunkCount * CHUNK_SIZE_BYTE -
            (currReadPos - chunkCache), 0);
        // u32 is a 32-bit unsigned integer
        u32 remainingBytesInCache = static_cast<u32>(rem);
        u32 bytesRead = min(remainingBytesInCache, bytesToRead);

        memoryCopy(destPtr, currReadPos, bytesRead);
        destPtr += bytesRead;
        incrementAndWrap(bytesRead, chunkCache);

        // Do we have bytes left to read?
        bytesToRead -= bytesRead;
    }
}
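The two routines above depend on engine state and types. For reference, the same wrap-around read can be written as a self-contained sketch using only the standard library. The class RingReader is hypothetical and leaves out the chunk state handling.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

class RingReader {
public:
    RingReader(const std::uint8_t* buffer, std::size_t sizeBytes)
        : buffer_(buffer), size_(sizeBytes), readPos_(0)
    {
        assert(buffer != nullptr && sizeBytes > 0);
    }

    // Copies count bytes to dest. When the end of the buffer is reached, the
    // read continues from the start, so dest receives contiguous data.
    void read(void* dest, std::size_t count)
    {
        std::uint8_t* out = static_cast<std::uint8_t*>(dest);
        while (count > 0) {
            // Never read past the end of the underlying array in one copy.
            const std::size_t contiguous = std::min(count, size_ - readPos_);
            std::memcpy(out, buffer_ + readPos_, contiguous);
            out += contiguous;
            count -= contiguous;
            readPos_ = (readPos_ + contiguous) % size_;
        }
    }

private:
    const std::uint8_t* buffer_;
    std::size_t size_;
    std::size_t readPos_;
};
```

A record that starts near the end of the buffer is thereby copied in two parts and ends up unsplit in the destination, which is the purpose of the intermediate buffer.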

5.6 Rendering

As with most particle systems, the particles are rendered as screen-aligned

quads. Since the buffer at this stage contains information about particles

in the form [pos.x pos.y pos.z scale] it contains everything needed to

place the quads at the correct world space location.

To handle rendering of the particles, a separate ParticleStreamRenderer was created. This renderer is responsible for creating the needed GPU resources and copying particle buffers into their GPU counterparts. A buffer with fixed size is created for all particles, and at the beginning of each frame the renderer calls the ParticleStream with the GPU buffer as an argument. The ParticleStream then copies its internal CPU buffer over to the GPU buffer passed in as an argument.

The particles are rendered with hardware instancing on the platforms where

it is applicable. The instancing uses the per-instance data for each particle

from the buffer mentioned above. The fixed quad is transformed according to

the world space position and also scaled according to the embedded particle

scale.

5.7 Sorting

Particle systems that use additive or multiplicative blending can be rendered in any order. There are, however, particle systems where an ordering needs to be imposed on the system. These particle systems require sorting. One reason to sort particles is visual correctness. In cases where a non-commutative blending mode is used, such as alpha blending, sorting is needed to ensure the correct order of operations: the particles must be blended in back-to-front order.

Rendering alpha-blended particles in the wrong order is extremely noticeable in motion, as the particle system loses all sense of shape. Whichever sorting algorithm is used, it has to be applied to the particle data, which can be done in the two ways described in sections 5.7.1 and 5.7.2.

5.7.1 Sorting using a key-index stream

One way to sort the particles is to produce a key-index pair for each particle.

The key contains the value on which the sorting acts (the distance to the

viewer) and the index simply points out the position of the particle in the

particle stream.

The approach results in less bandwidth usage and better cache coherency

since there is no need to fetch a lot of data.
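A key-index stream can be built and sorted as in the sketch below. Here std::sort stands in for the radix sort used in this work, and the squared distance is used as key since it gives the same ordering as the distance. The names StreamParticle, KeyIndex and sortBackToFront are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };
struct StreamParticle { float x, y, z, scale; };

// 8 bytes per entry instead of the full particle.
struct KeyIndex { float key; std::uint32_t index; };

// Returns key-index pairs ordered back to front (farthest particle first).
// The particle data itself is not moved.
std::vector<KeyIndex> sortBackToFront(const std::vector<StreamParticle>& particles,
                                      const Vec3& viewer)
{
    assert(particles.size() <= 0xFFFFFFFFu);   // indices are stored in 32 bits
    std::vector<KeyIndex> keys(particles.size());
    for (std::size_t i = 0; i < particles.size(); ++i) {
        const float dx = particles[i].x - viewer.x;
        const float dy = particles[i].y - viewer.y;
        const float dz = particles[i].z - viewer.z;
        keys[i] = KeyIndex{ dx * dx + dy * dy + dz * dz,
                            static_cast<std::uint32_t>(i) };
    }
    std::sort(keys.begin(), keys.end(),
              [](const KeyIndex& a, const KeyIndex& b) { return a.key > b.key; });
    return keys;
}
```

The renderer can then walk the sorted pairs and fetch each particle through its index, so the key is computed only once per particle and frame.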

5.7.2 Sorting the Particle Stream directly

Another approach is to sort the particle stream directly without using a

key-indexed stream.

The downside of this approach is that the sorting itself has to read and write the complete particle data instead of compact key-index pairs, resulting in worse cache performance. The sorting metric (the distance) also needs to be computed in every sorting pass rather than once per frame.

5.7.3 CPU sorting

The first implementation was a simple insertion sort as a proof of concept. Insertion sort has quadratic worst-case time complexity, so something faster was needed. A radix sort algorithm is what is currently used due to its suitability for a SIMD implementation, as described in section 3.4.6.

GPU sorting could also be used but was out of the time scope for this

implementation.

Chapter 6

Result

The result of the implementation is a pipeline that can import Alembic

files containing pre-simulated particle data into Frostbite. Furthermore, a

real-time free-streaming solution for runtime was implemented. A simple

renderer was also implemented for the particle streams.

6.1 Pipeline results

The pipeline reads the Alembic file frame by frame. Since there are no real-time requirements on the pipeline, it does not have to be very efficient, and in many cases the simpler solution was chosen during this thesis. The pipeline is still very fast and only with very large datasets will performance be a problem.

6.2 Runtime Results

The runtime implementation consisted of handling streaming of the particle

stream chunks. This procedure is described in chapter 5. The solution to

the streaming problem is the most advanced part of this work and works

very well with good performance. Sorting and rendering the particles are

also part of the runtime implementation. Sorting is currently implemented

with a simple quicksort algorithm.

The rendering is implemented with instancing on supported platforms which

gives good performance even with large amounts of objects. The rendering

implementation is currently missing proper lighting and shading support due

to time running out during the implementation phase. The rendering part

of the implementation is also where the bottleneck of the implementation

lies. This is since it is expensive to render alpha-blended geometry due to

overdraw costs.

6.3 Workflow Results

The workflow for placing a particle stream in the world is straightforward.

First, the particle stream data is exported in Alembic format from any

authoring tool wanted. This Alembic file is then imported into FrostEd (the

Frostbite editor). When this import is done, the particle stream pipeline

runs and stores the particle stream data in an internal format (described in

chapter 5) inside the game database. The artists then place a special particle

stream entity somewhere in the world and attach a particle stream asset to

it. The particle stream entity can then be controlled by the visual scripting

language in Frostbite to start, stop and in other ways control the playback

of the particle stream. The particle stream is then streamed and played

back in runtime.

This workflow is streamlined apart from the fact that the authoring of con-

tent happens outside Frostbite and the iteration times for making small

changes can be large.

6.4 Screenshots

This section presents four screenshots (figures 6.1, 6.2, 6.3 and 6.4) from the result of the streaming implementation. These screenshots do not represent any finished lighting or shading.

The screenshots also show the debug rendering used to debug the streaming

process. This debug rendering illustrates the size of the buffer and each

chunk is shown as a block. The color of a block illustrates the status of

the block and there is also a marker to show the current position that is

read from in the buffer. Furthermore, some statistics on the current particle

stream are shown.

Figure 6.1: Screen capture showing a particle stream simulation and the related debugging tools.

6.5 Performance

The particle stream pipeline performs well and the performance is largely

dependent on the size of the data set. The particle stream pipeline currently

does no compression which will be a problem for more practical use-cases

where the amount of particles can be very large.

The runtime implementation also runs at good performance. The rendering is implemented with instancing on supported platforms and can handle a lot of particles since it never handles single particles. The streaming is also fast, and the major cost for the runtime implementation is still sorting and rendering the particles in the stream.

Figure 6.2: Screen capture showing a particle stream simulation where time has passed since figure 6.1.

Figure 6.3: Screen capture showing a combined CinemaStream and ParticleStream simulation.

Figure 6.4: Screen capture showing a combined CinemaStream and ParticleStream simulation.

Experimentation with a GPU-based sorting algorithm was made, but due

to shortage of time it was cancelled.

When it comes to memory usage, the runtime implementation has pre-

dictable memory overhead. This is since the size of the cache is static and

the memory usage is thus also static. This characteristic is desirable for

streaming buffers in general. However, even though the size of the stream-

ing buffer is constant during streaming it is not constant for all particle

streams. There is logic for deciding the size of the particle streaming buffer

(described in chapter 5).

Chapter 7

Discussion

To summarize, I think the implementation turned out well and uses good

schemes for handling runtime streaming and rendering.

I am very satisfied with the simplicity and efficiency of the implementation.

The ring-buffer used for caching streamed chunks makes the performance

overhead very predictable. Furthermore, since it contains heuristics for de-

termining the size, the intent is that it should work well without manual

tuning of buffer sizes.

The pipeline implementation is also simple and turned out to work well. It

basically consists of reading an Alembic archive and then packing the data

into runtime chunks for streaming. The simplicity of the implementation

makes it easy to plug in compression at a later point.

The workflow is also a part that turned out to work well. I think this

is mostly since the concepts are simple. A ParticleStream simulation is

simply placed somewhere in the world as a game entity and everything

in the simulation is then transformed in relation to the transform of the

ParticleStream entity.

A presimulated particle system has a lot of advantages in game applications, and the only real drawback is that there is no gameplay control of the particle system. This is quite a big drawback though, and it can be hard to fit a presimulated particle system in a massively dynamic world. The biggest advantage is performance, since a CPU-based particle system with gameplay elements is orders of magnitude heavier than pre-simulated particles.

7.1 Future Improvements

To make ParticleStream production-ready, a few key improvements have to be made. Most notably, compression and support for more advanced lighting have to be implemented.

7.1.1 Compression

In the current implementation, the particle data is stored uncompressed in

the Frostbite data storage. This is not sustainable for real-world use cases

that use gigabytes of data. The idea is that the ParticleStream system

would be integrated in the more general CinemaStream compression infras-

tructure in the future.

7.1.2 Lighting

What is obviously lacking in the implementation is proper lighting which

had to be skipped due to time constraints. Proper lighting of the particles

was not really the goal for this work. The work was instead focused around

the problem of getting source data into the Frostbite runtime and to be

able to utilize streaming efficiently. The goal was also to create an efficient

implementation that could handle large amounts of data.

The particles are simply rendered as alpha-tested quads with regular shaders

and support for artist-created custom shaders. One area in particular where

the lighting has to be improved is the shadowing. This is since volume

shading is a key aspect of high quality particle rendering.

Shadowing

The most obvious way to achieve shadowing for a particle system is to use

volume rendering techniques. To be able to do this, it is required that

the particles are converted into a discrete volume representation by being

rasterized into a 3D volume (voxelization). Once the particles are represented

as a volume, there are several ways to render them with shadowing.

Self-Shadowing

There has been a lot of research in this area. Some of the topics are Fourier

Shadow Mapping, Half-Angle Slice Rendering, Opacity Shadow Maps, but

none of them are scalable enough to use in large-scale in-game scenarios.

They are, however, perfect for cut-scenes and contained environments.

Half-Angle Slice Rendering

The key idea of half-angle slice rendering is to calculate a vector that lies halfway between the light direction and the view direction. The volume is then rendered as a series of slices perpendicular to this half-angle vector.

Because the slices face both the light and the camera, the same slices can be rendered from both points of view. As a result, the shadowing from the light can be accumulated at the same time as the slices are being blended.

The main advantage of this technique is that it only requires a single 2D

shadow buffer.
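As a small illustration of the half-angle computation, the sketch below derives the slicing axis from a view and a light direction. The function names are hypothetical, and the handling of a light behind the volume (inverting the view vector and reversing the blend order) follows a convention commonly described in the volume rendering literature rather than anything stated in this work.

```python
import math


def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def half_angle_axis(view_dir, light_dir):
    """Return (slice_axis, front_to_back) for half-angle slicing.

    view_dir points from the volume toward the camera,
    light_dir points from the volume toward the light.
    """
    v = normalize(view_dir)
    l = normalize(light_dir)
    cos_angle = sum(a * b for a, b in zip(v, l))
    if cos_angle >= 0.0:
        # Light and camera are on the same side: blend front to back.
        axis = tuple(a + b for a, b in zip(v, l))
        front_to_back = True
    else:
        # Light is behind the volume: use the inverted view vector
        # and blend back to front (assumed convention).
        axis = tuple(b - a for a, b in zip(v, l))
        front_to_back = False
    return normalize(axis), front_to_back


axis, front_to_back = half_angle_axis((0.0, 0.0, 1.0), (1.0, 0.0, 0.0))
```

The returned axis makes the same angle with the view and the light direction, which is what allows a single set of slices to serve both passes.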

Opacity Shadow Maps

Opacity Shadow Maps sample visibility at regular intervals, and there are numerous variants optimized to handle special cases such as hair. The method is suitable for generating self-shadows in discontinuous volumes with explicit geometry, such as fur and foliage, and continuous volumes such as smoke and clouds may also benefit from the approach.

With a set of planar opacity maps the light transmittance inside a complex

volume is approximated.

A volume made of standard primitives (points, lines, and polygons) is sliced, and each slice is rendered with graphics hardware to an opacity map that stores alpha values instead of the traditionally used depth values.

For shadow computation, the alpha values are sampled in the maps enclosing each primitive point and then interpolated.

The algorithm is memory efficient and extensively exploits existing graphics hardware.
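The sampling and interpolation steps can be sketched for a single light ray, that is, one texel of each opacity map. This is a simplified CPU illustration under stated assumptions: the function names, the exponential transmittance model and the `kappa` scale factor are not taken from this work, and a real implementation would render the maps on the GPU.

```python
import bisect
import math


def build_opacity_maps(sample_depths, sample_alphas, map_depths):
    """For one texel: opacity accumulated between the light and each map plane."""
    return [
        sum(a for z, a in zip(sample_depths, sample_alphas) if z < d)
        for d in map_depths
    ]


def transmittance(depth, map_depths, maps, kappa=1.0):
    """Interpolate the two maps enclosing `depth` and convert to transmittance."""
    if depth <= map_depths[0]:
        omega = maps[0]
    elif depth >= map_depths[-1]:
        omega = maps[-1]
    else:
        i = bisect.bisect_right(map_depths, depth) - 1
        t = (depth - map_depths[i]) / (map_depths[i + 1] - map_depths[i])
        omega = maps[i] + t * (maps[i + 1] - maps[i])   # linear interpolation
    return math.exp(-kappa * omega)   # assumed exponential attenuation


# Hypothetical data: two particles along the ray, three map planes.
map_depths = [0.0, 1.0, 2.0]
maps = build_opacity_maps([0.5, 1.5], [0.2, 0.3], map_depths)
light_at_point = transmittance(1.5, map_depths, maps)
```

Since only a fixed number of 2D maps is stored regardless of particle count, the sketch also shows why the method is memory efficient.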

Bibliography

[1] Frostbite. http://www.frostbite.com, 2014. Accessed: 2014-01-18.

[2] A. Kolb, L. Latta, and C. Rezk-Salama. Hardware-based simulation and collision detection for large particle systems. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, HWWS ’04, pages 123–131, New York, NY, USA, 2004. ACM. ISBN 3-905673-15-0. doi: 10.1145/1058129.1058147. URL http://doi.acm.org/10.1145/1058129.1058147.

[3] Peter Kipfer, Mark Segal, and Rudiger Westermann. Uberflow: A gpu-based particle engine. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, HWWS ’04, pages 115–122, New York, NY, USA, 2004. ACM. ISBN 3-905673-15-0. doi: 10.1145/1058129.1058146. URL http://doi.acm.org/10.1145/1058129.1058146.

[4] Andreas Kolb and Nicolas Cuntz. Dynamic particle coupling for gpu-based fluid simulation. In Proc. of the 18th Symposium on Simulation Technique, pages 722–727, 2005.

[5] Mark Kim, Guoning Chen, and Charles Hansen. Dynamic particle system for mesh extraction on the gpu. In Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units, GPGPU-5, pages 38–46, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1233-2. doi: 10.1145/2159430.2159435. URL http://doi.acm.org/10.1145/2159430.2159435.


[6] Shannon Drone. Real-time particle systems on the gpu in dynamic environments. In ACM SIGGRAPH 2007 Courses, SIGGRAPH ’07, pages 80–96, New York, NY, USA, 2007. ACM. ISBN 978-1-4503-1823-5. doi: 10.1145/1281500.1281670. URL http://doi.acm.org/10.1145/1281500.1281670.

[7] Donald E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching (2nd Ed.). Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA, 1998. ISBN 0-201-89685-0.

[8] W. A. Martin. Sorting. ACM Comput. Surv., 3(4):147–174, 1971. ISSN 0360-0300. doi: 10.1145/356593.356594. URL http://portal.acm.org/citation.cfm?id=356593.356594&coll=Portal&dl=GUIDE&CFID=89172762&CFTOKEN=95662085.

[9] Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu, and Toshio Nakatani. Aa-sort: A new parallel sorting algorithm for multi-core simd processors. In Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, PACT ’07, pages 189–198, Washington, DC, USA, 2007. IEEE Computer Society. ISBN 0-7695-2944-5. doi: 10.1109/PACT.2007.12. URL http://dx.doi.org/10.1109/PACT.2007.12.

[10] Peter Sanders and Sebastian Winkel. Super scalar sample sort. In Susanne Albers and Tomasz Radzik, editors, ESA, volume 3221 of Lecture Notes in Computer Science, pages 784–796. Springer, 2004. ISBN 3-540-23025-4. URL http://dblp.uni-trier.de/db/conf/esa/esa2004.html#SandersW04.

[11] W. T. Reeves. Particle systems - a technique for modeling a class of fuzzy objects. ACM Trans. Graph., 2(2):91–108, April 1983. ISSN 0730-0301. doi: 10.1145/357318.357320. URL http://doi.acm.org/10.1145/357318.357320.

[12] Gpgpu.org - general purpose computation on graphics hardware. http://gpgpu.org, 2014. Accessed: 2014-01-18.

[13] Realflow. http://www.realflow.com, 2014. Accessed: 2014-01-19.


[14] Wikipedia - particle. http://en.wikipedia.org/wiki/Particle,

2014. Accessed: 2014-01-19.