
Patrick Gordon

Selwyn College

pjg56

Computer Science Tripos Part II

Optimisation of Voxel Rendering for Large Scenes on Desktop GPUs

May 13, 2016


Proforma

Name: Patrick Gordon
College: Selwyn College
Title: Optimisation of Voxel Rendering for Large Scenes on Desktop GPUs
Examination: Computer Science Tripos Part II, July 2016
Word Count: 10000
Originator: Patrick Gordon
Supervisor: Erroll Wood

Original Aims

● Render SVOs using the GPU with good performance (at least 720p 30fps).

● Construct SVOs using the GPU, with multiple different scenes.

● Implement keyboard and mouse controls to allow moving through the 3D volume.

● Evaluate the performance with metrics such as iteration count, iteration complexity, rays per second and seconds per frame.

Work Completed

All of the success criteria have been fulfilled. Starting from no pre-existing code, two major systems have been designed and implemented: the SVO constructor and the SVO raycaster. The SVO data structure was also designed and implemented from scratch. Both systems either implement or open the possibility for major speedups over previous work in this area.

Special Difficulties

None.

Declaration

I, Patrick Gordon of Selwyn College, being a candidate for Part II of the Computer Science Tripos, hereby declare that this dissertation and the work described in it are my own work, unaided except as may be specified below, and that the dissertation does not contain material that has already been used to any substantial extent for a comparable purpose.

Signed

Date: May 13, 2016

Acknowledgements

None.


Table of Contents

Proforma
Table of Contents
Introduction
    Motivation
    Voxels
    Previous Work
    Project Overview
Preparation
    Prerequisite Knowledge
    Required Knowledge
    Ray Casting
    Octree Design
    Octree Construction
    Starting Point
    Software Engineering
        Requirements Analysis
        System Design
        Testing Process
        Backup Process
Implementation
    Sparse Voxel Octree Data Structure
    Tree Builder and SVO Generator
    SVO Raycaster
    Optimisations
Evaluation
    Testing
        Test Objectives
        Test Program
        Test Hardware
        Render Output
    SVO Construction Speed
    Tree Broadening Memory Usage
    Tree Broadening Render Performance
    CPU vs GPU Render Performance
    1080p vs 720p Render Performance
Conclusion
Bibliography
Appendix A - Screenshots
Appendix B - Code Samples
Appendix C - Project Proposal
    Introduction, The Problem To Be Addressed
    Starting Point
    Resources Required
    Work to be done
    Success Criterion for the Main Result
    Possible Extensions
    Timetable: Work Plan and Milestones to be achieved.


Introduction

Motivation

Current games and visual applications use triangle meshes, and modern GPUs are fast enough to draw billions of triangles per second. As the level of detail becomes finer, however, the triangle data becomes unwieldy, and volumetric data becomes more appropriate. Voxels are one way to store volumetric data. They are 3D cubes arranged in a uniform grid-like structure, possibly at varying scales. One problem with this approach is that storing a 3D scene in fine detail takes an enormous amount of memory. This problem can be solved by varying the scale of the detail: small voxels are used for fine detail and larger voxels for coarse detail. One implementation of scale-varying voxels is the Sparse Voxel Octree (SVO), illustrated in the figure below. This is a tree structure that represents the voxels as nodes and leaves. Each non-leaf node contains exactly eight nodes in the layer below, and each leaf node contains only one voxel. In sparse areas, the tree is shallow and large voxels are used. In dense areas, the tree becomes deeper and finer voxels are used to store the data. In many applications, the data is gathered or produced already in a volumetric form. For example, in medical imaging the data comes in a point cloud form. SVOs apply very well to these applications. The topic of this project is to explore, implement and optimise rendering methods for large voxel datasets.

Voxels

A voxel is a 3D cube of any size. Voxels are usually part of a larger structure such as an octree. An octree has eight voxel children for each internal voxel node. Together, the children occupy exactly the same geometric space as the parent voxel, split into eight parts. In a sparse voxel octree, a node above the deepest level can be a leaf without any children.


Figure: The Structure of a Sparse Voxel Octree (SVO).

Previous Work

The current state of the art is to cast rays into the SVO using the DDA method, and to optimise this. [Laine2010] implements basic raycasting along with a simple form of beam casting, which combines rays into 8x8 blocks that can be conservatively raycast together. [Crassin2011] uses a similar approach but also generates the voxel data on the fly from triangles, and uses data about the voxels, such as the original normal of the face, to influence the rendering. [Baert2014] renders the voxels in a similar way but generates them out of core. This means that scenes can be generated that are much larger than the available memory.

Figure: Examples of Previous SVO Raycasters. (From GigaVoxels, [Crassin2011])


Figure: Final Output Screenshots of my Renderer.

Project Overview

In this project I will implement and optimise an algorithm for constructing SVOs from voxel input, and an algorithm for rendering SVOs from memory. The learning and research necessary to develop these, such as learning GLSL and OpenGL, will be a part of the work. Developing the optimisations will be a substantial part of the effort and will improve the final performance.


Preparation

This chapter outlines the work that needed to be completed before the implementation could begin. This includes becoming familiar with a language and framework for interfacing with the GPU, and with new algorithms and data structures such as DDAs and SVOs.

Prerequisite Knowledge

In addition to these new frameworks and concepts, there are resources from earlier in the course which I will rely on. These include the following.

● Computer Graphics and Image Processing: I learned the main algorithms involved in rendering, such as rasterisation and, more importantly, raycasting, as well as matrix and vector methods.

● Concurrent and Distributed Systems: I learned important ways of ensuring that data stays consistent when accessed by multiple processes, as in a graphics card with a single large main memory.

● Programming in C and C++: I learned how to use the languages to implement and properly organise medium-size software projects.

● Unix Tools: I learned how to use the main tools in a Unix environment, which I will be using for development, including make, git, and various others.

Required Knowledge

To interface with the GPU, it was necessary to choose and learn an appropriate framework. I chose OpenGL because it is supported across a wider range of platforms than alternatives such as DirectX or Mantle. OpenGL is a graphics framework that allows access to important features of the GPU, such as compute shaders and fast, direct rendering to the screen. To program the compute shaders, it was necessary to learn GLSL, a C-like, application-specific language for running massively parallel code on the GPU. There was a tradeoff here, in that I decided to learn GLSL instead of using OpenCL and OpenCL C. I made this choice because OpenGL had better graphics integration across all platforms, which meant that I could use graphics shaders and compute shaders together. It also meant that I had to write a separate C version of the raycaster to test CPU performance.

Ray Casting

The algorithms required for this project took work to understand fully and to modify for its purposes. The main algorithm needed was the raycast method. This is outlined in slightly different ways in many papers, the most recent of which are [Crassin2011], [Laine2010] and [Baert2014]. Whilst retaining a fundamentally similar approach, I wrote my own implementation, which enabled much better performance and made it possible to add the optimisations. I will outline the broad algorithm in the next chapter.


Octree Design

Octrees are used in most other voxel renderers, including [Laine2010], [Crassin2011] and [Baert2014]. In this project I take the same idea and adapt it to work better with the optimisations. The memory layout of the octree was redesigned to improve performance and memory usage in SVO raycasting and construction. To start with, the structure was stripped down to its most basic form by removing neighbour pointers and data blocks. This created the simplest possible tree structure, in which every node is the same and has 8 (or 64 or 512) children pointed to by a single offset. This drastically reduces the complexity of accessing and changing the structure, which in turn increases the performance of the algorithms. The lack of neighbour pointers is compensated for by keeping a stack of the ancestors of the current node and using a fast common ancestor method. The layout of this basic structure in memory is also very flexible. The only constraint is that sibling nodes must be in consecutive memory locations, so the tree can be used, constructed and replaced in any order in memory.

Octree Construction

The main data structure involved in this project is the SVO, which stores the scene data. The data structure itself is simply a tree in which every internal node has exactly eight children and every other node is a leaf. The complexity lies in the algorithms which construct the SVO.

The construction algorithm this project uses is based on work in [Baert2014]. However, I reimplemented it and took a different approach in order to improve performance. The details are in the next chapter.

Starting Point

To fully realise the potential performance gains, the project takes the ambitious step of recreating the major algorithms from scratch. Hence the codebases in [Laine2010], [Crassin2011] and [Baert2014] are not required (other than for reference). The only libraries the project makes use of are the OpenGL library and the GLFW library. GLFW is very useful for working with OpenGL and reduces the complex initialisation of an OpenGL program to a few lines of code. GLFW also provides simple keyboard and mouse interfaces, which allows more effort to be focused on the essential work.

Software Engineering

Requirements Analysis

The main aim of the project was to render large voxel scenes with good performance on a desktop GPU. In order to achieve this the system must be able to:

● Render the voxels stored in the SVO with good performance.

● Generate voxels and construct SVOs for a range of scenes.

● Replace unseen voxels with seen voxels to reduce memory usage.


System Design

By preparing early and researching the important algorithms, I was able to plan a very modular system before starting the implementation. This system layout is shown in the following diagram.

Figure: The Structure of the Subsystems.

Testing Process

To facilitate more effective testing, the program was split into small modules which could be fully tested independently. This methodology is taken from Code Complete, which suggests that keeping functions, modules and systems to around 100-200 lines reduces their error proneness. It also made development easier, as work on one module would affect either a small number of other modules or none at all. The modules were mostly implemented as functions in both C and GLSL, with inputs and outputs that stayed largely unchanged over the history of the project. Minimising global variables and state also reduced bugs and made testing easier. Together with keeping side effects in each module to a minimum, this substantially reduced the debugging effort, further allowing the project to focus on optimisation.

Backup Process

Git was used for version control to allow easy rollbacks and backups. Git repositories are also very easy to host online. In this project, the online hosting was used merely for backup, not for collaboration. The git repository was also backed up to an external USB drive on a weekly basis. This means that at all times during the work there were three copies of the entire project history: one online, backed up daily (or more regularly), one offline, backed up weekly, and the local copy on the development machine.


Implementation

This chapter outlines the implementation of the work that was necessary to meet the criteria. This includes the data structure to describe the SVO in memory, the SVO generator, the SVO raycaster, and the optimisations that go along with both of these. This also includes a CPU implementation of the generator and raycaster to benchmark the performance and to compare with the speedup from the GPU.

Sparse Voxel Octree Data Structure

To begin with, the rendering algorithm worked with a uniform grid of voxels. The main problem with this was that, although each ray iteration was very simple, too many iterations were needed to reach the data in the grid.

To solve this problem, a sparse tree structure was used, so that similar, adjacent voxels could be combined into larger voxels. This means that the ray can pass a large space in a single iteration if the space is contiguous.

The tree structure was implemented to be as simple as possible, particularly in reducing memory accesses, so as to perform well on the GPU. For these reasons, the SVO is implemented as follows. Each node has a data value and an offset which points to its children. The children are contiguous in memory, so only one offset value is necessary and each child can be indexed by its position. The order of the nodes in memory does not matter: the child offset can be ahead of or behind the current offset. This allows the generator to work quickly.
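The node layout described above can be sketched in C as follows. This is a minimal illustration, not the project's actual code: the names `svo_node`, `svo_child` and `SVO_NO_CHILD`, the 32-bit field widths, and the use of a sentinel offset to mark leaves are all assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed sentinel marking a node without children (a leaf). */
#define SVO_NO_CHILD UINT32_MAX

/* Hypothetical node layout: a data value plus a single child offset. */
typedef struct {
    uint32_t data;   /* packed colour or normal */
    uint32_t child;  /* index of the first of 8 contiguous children */
} svo_node;

/* Returns the child of `node` in octant `oct` (0..7), or NULL for a leaf.
 * Because siblings are contiguous, one offset plus the octant is enough. */
static const svo_node *svo_child(const svo_node *nodes,
                                 const svo_node *node, unsigned oct)
{
    if (node->child == SVO_NO_CHILD)
        return NULL;
    return &nodes[node->child + (oct & 7u)];
}
```

Since the child offset is an arbitrary index into the node array, a parent may be stored before or after its children, which is the flexibility the text relies on.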


Figure: Screenshots Highlighting SVO Structure.

Tree Builder and SVO Generator

The octree is constructed by generating or loading all of the smallest voxels, eight at a time, and either combining them if they are the same or building a node for them if not. Done naively, this is very slow. To speed up the process, this project uses a stack-like structure and Morton ordering. If we step through the regular 3D grid in Morton order and add voxels one at a time to the stack, we generate the octree correctly. This method has two significant advantages. Firstly, it is very parallelisable, as discussed later in this chapter. Secondly, it is very economical in its use of memory, requiring only one voxel to be loaded or generated at a time. The method can be enhanced to massively reduce memory accesses by collecting contiguous sets of voxels and pushing them into the stack all at once, rather than one at a time.


Figure: Morton Order used in SVO Construction. (2D analogue)
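Morton order interleaves the bits of the three coordinates, so that consecutive codes visit the eight cells of each octant before moving on to the next. The following C sketch shows one straightforward way to encode and decode such codes. The function names, the choice of x as the lowest interleaved bit, and the limit of 12 bits per axis (enough for a 4096^3 grid) are assumptions made for illustration.

```c
#include <stdint.h>

/* Interleave the low 12 bits of x, y and z into a 36-bit Morton code.
 * Assumed convention: x occupies bit 0, y bit 1, z bit 2 of each triple. */
static uint64_t morton_encode(uint32_t x, uint32_t y, uint32_t z)
{
    uint64_t code = 0;
    for (unsigned i = 0; i < 12; i++) {
        code |= (uint64_t)((x >> i) & 1u) << (3 * i);
        code |= (uint64_t)((y >> i) & 1u) << (3 * i + 1);
        code |= (uint64_t)((z >> i) & 1u) << (3 * i + 2);
    }
    return code;
}

/* Inverse of morton_encode: recover the grid position from a code. */
static void morton_decode(uint64_t code, uint32_t *x, uint32_t *y, uint32_t *z)
{
    *x = *y = *z = 0;
    for (unsigned i = 0; i < 12; i++) {
        *x |= (uint32_t)((code >> (3 * i)) & 1u) << i;
        *y |= (uint32_t)((code >> (3 * i + 1)) & 1u) << i;
        *z |= (uint32_t)((code >> (3 * i + 2)) & 1u) << i;
    }
}
```

With this convention, codes 0 to 7 cover the 2x2x2 block at the origin, and codes 8 to 15 cover the neighbouring block along x.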

There are multiple ways to generate the voxels for the octree:

1. using procedural generation
2. voxelisation from triangle data
3. loading a voxel model from memory

The simplest of these is procedural generation, because a function that outputs voxels can produce very complex models from only a few instructions. Loading either voxels or triangles from memory can be much slower. Procedural generation also allows us to use mathematically perfect objects rather than approximations built from triangles or voxels. Using procedural functions that output a value for every point in space, rather than a collection of all of the points, allows us to render possibly infinite detail. However, using voxels pre-generated from triangle meshes allows us to make comparisons with other rendering techniques that use triangles. Because they are the most comparable to previous rendering methods, this project focuses on voxel models generated from triangle meshes.

Figure: Model Voxel Rasterisation and SVO Construction Diagram. (2D analogue)


The SVO generator takes advantage of the simple and flexible SVO structure to produce the tree very rapidly. The generator can even be parallelised on the GPU. The main data structure in the generator is an array of stacks. Each stack has 0-7 members corresponding to cells in the tree, and the array has as many elements as there will be levels in the final tree. Voxels are generated in Morton order, which is the order necessary for the stack data structure to produce the correct tree. The voxels are never all stored in memory; they are only held in the stacks for as long as necessary. This means that huge scenes can be voxelised with a very small memory footprint.

The memory usage for the stacks is 8 * log8(scenesize) voxels, where scenesize is the number of voxels in the scene. The example scenes produce trees that contain 4096^3 voxels, which would amount to roughly 550GB of data if it were all stored at the same time. With my method the memory usage is only 8 * 12 * sizeof(struct voxel) = 768 bytes, a reduction by a factor of roughly 0.7*10^9. A further advantage of the method developed in this project is that the voxels can be streamed into the tree construction method in such a way that only the data and the length of each contiguous space are necessary. Hence the voxel generation produces the voxels and compares them for similarity, then produces a conceptual list of lists of voxels. This project successfully achieved this and opened the possibility of further optimisation in the tree building stage.
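The figures quoted above can be checked with a few lines of C. This sketch assumes an 8-byte voxel record, which is the size implied by 8 * 12 * sizeof(struct voxel) = 768; the function names are hypothetical.

```c
#include <stdint.h>

/* Bytes needed to hold every voxel of a side^3 grid at once. */
static uint64_t dense_bytes(uint64_t side, uint64_t voxel_bytes)
{
    return side * side * side * voxel_bytes;
}

/* Number of octree levels for a grid whose side is a power of two,
 * i.e. log8 of the total voxel count. */
static unsigned octree_levels(uint64_t side)
{
    unsigned levels = 0;
    while (side > 1) {
        side >>= 1;
        levels++;
    }
    return levels;
}

/* Bytes needed by the construction stacks: 8 slots per level. */
static uint64_t stack_bytes(unsigned levels, uint64_t voxel_bytes)
{
    return 8u * levels * voxel_bytes;
}
```

For a 4096^3 scene, `octree_levels(4096)` is 12, `stack_bytes(12, 8)` is 768 bytes, and `dense_bytes(4096, 8)` is 549,755,813,888 bytes (about 550GB), giving a ratio of about 7.2 * 10^8.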

The voxels could be generated and converted into a list of contiguous spaces completely in parallel. This would massively speed up the tree construction, because the tree builder only needs to consume a single item for a possibly huge space of voxels. This, of course, relies on the voxels not being pathological (i.e. every voxel being different from its neighbours). The speedup works on the basis that a large number of the voxels will be the same as the voxels spatially next to them.

The tree builder works by streaming the voxels into the stacks. As a “lower” stack fills up, it is replaced by a single cell on the stack “above”. If the eight voxels are the same, they are made into a cell one size larger that has no children. If they are different, they are pushed out onto disk/memory and a single cell is added to the stack above, with its offset pointing to the children. This process stops when the last node is added to the top stack. This is the root node, which can then be written after all of the other nodes. To take advantage of the large similar spaces in the input stream, the builder can insert a large node into a high stack which covers the similar space. This saves a lot of time that would otherwise be spent pushing in voxels individually and looping over the stacks. This speedup means that constructing the tree can take minutes instead of hours. The detail of the SVO Construction is presented in the following pseudocode and the structure of the SVO Construction subsystem is shown in the diagram.

PseudoCode: MultiStack Octree Construction

Dmax = largest octree depth
S[Dmax][8] = empty array of stacks, Dmax stacks of length 8
V = current voxel data
Vnext = next voxel data
L = Morton order of current voxel
Lnext = Morton order of next voxel
m = maximum number of contiguous voxels
while Lnext < 8 << 3 * Dmax
    //collect a group of contiguous voxels
    while V == Vnext and Lnext - L < m
        Lnext = Lnext + 1
        Vnext = voxel(mortondecode(Lnext))
        //voxel generating function takes a position and gives a voxel
    m = update
    T = Lnext - L
    D = starting octree depth, 0
    while T > 0
        //fill the stacks
        for i = 0 to i < (T >> D) & 7
            S[D][i] = V
        D = D + 1
        //if the current stack is full, pop the voxels and push one parent node
        if S[D].pos == 8
            if S[D][0] == S[D][1 -> 7]
                //all eight are the same: push one larger leaf
                S[D + 1].push(V)
            else
                //write the eight children out and push a parent pointing to them
                write S[D][0 -> 7] to the output
                S[D + 1].push(node with offset to the written children)

Figure: Construction of an SVO from a Voxel Input Stream.
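As a concrete illustration of the stack mechanism, the following C sketch builds a small SVO by pushing voxels one at a time in Morton order. It is a simplified reading of the description above, not the project's code: it omits the run-skipping optimisation, stores unused data (0) in internal nodes, and all names (`node`, `builder`, `push`, `build`, `corner_voxel`, `NO_CHILD`, `MAX_DEPTH`) are hypothetical.

```c
#include <stdint.h>

#define NO_CHILD UINT32_MAX   /* assumed sentinel: node has no children */
#define MAX_DEPTH 12

typedef struct { uint32_t data, child; } node;

typedef struct {
    node stack[MAX_DEPTH + 1][8];  /* one 8-slot stack per tree level */
    unsigned fill[MAX_DEPTH + 1];  /* number of cells held per level */
    node *out;                     /* caller-provided output array */
    uint32_t count;                /* nodes written so far */
} builder;

/* Push one cell at `level`; once 8 siblings collect, merge or emit them. */
static void push(builder *b, unsigned level, node n)
{
    b->stack[level][b->fill[level]++] = n;
    if (b->fill[level] < 8)
        return;
    b->fill[level] = 0;
    int same = 1;
    for (int i = 0; i < 8; i++)
        if (b->stack[level][i].child != NO_CHILD ||
            b->stack[level][i].data != b->stack[level][0].data)
            same = 0;
    node parent;
    if (same) {
        parent = b->stack[level][0];   /* 8 equal leaves become 1 larger leaf */
    } else {
        parent.data = 0;               /* internal node data left unused */
        parent.child = b->count;       /* offset of the first child */
        for (int i = 0; i < 8; i++)
            b->out[b->count++] = b->stack[level][i];
    }
    push(b, level + 1, parent);
}

/* Example voxel function (hypothetical): one solid voxel at the origin. */
static uint32_t corner_voxel(uint32_t x, uint32_t y, uint32_t z)
{
    return (x | y | z) == 0 ? 1u : 0u;
}

/* Builds the SVO of a (1 << depth)^3 grid, depth <= MAX_DEPTH.
 * Returns the index of the root, which is written after all other nodes. */
static uint32_t build(builder *b, unsigned depth,
                      uint32_t (*voxel)(uint32_t, uint32_t, uint32_t))
{
    for (unsigned l = 0; l <= MAX_DEPTH; l++)
        b->fill[l] = 0;
    b->count = 0;
    uint64_t total = 1ull << (3 * depth);
    for (uint64_t m = 0; m < total; m++) {
        uint32_t x = 0, y = 0, z = 0;
        for (unsigned i = 0; i < depth; i++) {   /* Morton decode of m */
            x |= (uint32_t)((m >> (3 * i)) & 1u) << i;
            y |= (uint32_t)((m >> (3 * i + 1)) & 1u) << i;
            z |= (uint32_t)((m >> (3 * i + 2)) & 1u) << i;
        }
        node leaf = { voxel(x, y, z), NO_CHILD };
        push(b, 0, leaf);
    }
    b->out[b->count] = b->stack[depth][0];
    return b->count++;
}
```

For a 4x4x4 grid (`depth` 2) with `corner_voxel`, only the first group of eight voxels differs, so the builder writes 8 leaves, then 8 second-level cells, then the root, for 17 nodes in total instead of the 73 a full tree would need.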

The SVO generator is oblivious to the data, except that it needs a means to test for equality. The data can be a colour or a normal, or can contain multiple surface values packed together. The data is restricted to four bytes in the current method, but this could easily be expanded as necessary. Four bytes is enough for three colour channels and a transparency value, or a normal vector discretised into 255 values. The data in the non-leaf nodes of the tree can either be left unused or be generated from an average of the child nodes, which provides a quick method for “level of detail” optimisation. Once a ray is far enough from the camera, it does not need to traverse all the way to the lowest level, because the smallest voxel will be smaller than a pixel at that range.
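Packing a colour with transparency into the four data bytes might look like the following C sketch. The byte order (red in the lowest byte) and the function names are assumptions chosen for illustration.

```c
#include <stdint.h>

/* Pack four 8-bit channels into one 32-bit data value.
 * Assumed layout: r in bits 0-7, g in 8-15, b in 16-23, a in 24-31. */
static uint32_t pack_rgba(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    return (uint32_t)r | ((uint32_t)g << 8) |
           ((uint32_t)b << 16) | ((uint32_t)a << 24);
}

/* Extract channel 0 (r) to 3 (a) from a packed value. */
static uint8_t unpack_channel(uint32_t packed, unsigned channel)
{
    return (uint8_t)(packed >> (8 * (channel & 3u)));
}
```

Because the packed value is a plain integer, the equality test that the generator needs is a single comparison.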

For a powerful demonstration of the capability of the system developed in this project, I downloaded four standard models from the [Stanford] online 3D scanning repository and the [Williams] CS department. These are Sponza, Sibenik, Angel and Dragon. They all have a large amount of detail and a large number of triangles, ranging from 60 thousand to 28 million. These were voxelised by a free, open source program by [Baert2014]. The voxels were then streamed through the SVO generator, compressing either the colour or the normal vectors into 4 bytes for my format. These scenes give a good representation of real-world, non-pathological scenes. They have both large-scale and small-scale detail, which is captured by the SVO methods. A pathological case for the octree would be one where none of the adjacent voxels are the same, so that the octree structure would be as large and complex as possible. This worst case is not realistic.

The position indexing is simple. The position is converted to a three-component uint vector, scaled so that the smallest cell in the octree has size 1. This means that the bits of the vector can be used to index the tree structure directly. For example, to find the tree cell at a position (x, y, z), start with the root node. Produce a value from 0-7 by combining the nth bits of the x, y and z values, starting with the most significant bit, and add this to the child offset to get the next cell in the tree. Repeat with the next lower bit until the cell has data or has no more children. This is shown in the following diagram.

Figure: The 3D Position used to Index into the Octree.
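The indexing procedure can be sketched in C as below. The node layout, the `NO_CHILD` sentinel, the bit order within the octant value (x lowest), and the function names are assumptions; the sketch stops only when a node has no children, which is one of the two stopping conditions mentioned in the text.

```c
#include <stdint.h>

#define NO_CHILD UINT32_MAX   /* assumed sentinel: node has no children */

typedef struct { uint32_t data, child; } node;

/* Octant (0..7) selected by bit `bit` of each coordinate.
 * Assumed convention: x gives bit 0, y bit 1, z bit 2. */
static unsigned octant(uint32_t x, uint32_t y, uint32_t z, unsigned bit)
{
    return ((x >> bit) & 1u) | (((y >> bit) & 1u) << 1) |
           (((z >> bit) & 1u) << 2);
}

/* Walk from the root towards (x, y, z) in a tree of `depth` levels and
 * return the index of the leaf cell that contains the position. */
static uint32_t find_cell(const node *nodes, uint32_t root, unsigned depth,
                          uint32_t x, uint32_t y, uint32_t z)
{
    uint32_t cur = root;
    /* Consume one bit per axis per level, most significant bit first. */
    for (unsigned bit = depth; bit-- > 0 && nodes[cur].child != NO_CHILD; )
        cur = nodes[cur].child + octant(x, y, z, bit);
    return cur;
}
```

Each step costs one memory read of the current node plus a few bit operations, which is why the scaled integer position is convenient on the GPU.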

SVO Raycaster

The SVO Raycaster is based around a simple, fast ray-cube intersection. This has been simplified and sped up as much as possible, for example by pre-multiplying as many components as possible. The aligned ray-cube intersection test returns the t value of the ray at the near and far cube intersections. The ray is pre-aligned to the basis vectors of the cube to reduce the complexity. The raycaster works by intersecting the ray with the current node, starting at the root node, and traversing through the children to find the first solid voxel the ray intersects.


Figure: The Iterative SVO Raycasting Algorithm. (2D analogue)

To speed this up and drastically reduce memory accesses, a stack is kept with the ray which contains the tree nodes that are the ancestors of the current node. This means that when jumping between two spatially adjacent voxels, the traversal only has to descend from their common ancestor instead of restarting from the root every time. Due to the hierarchical nature of the octree, the common ancestor is usually within 2-3 levels. This massively reduces memory accesses, especially in large scenes with up to 12 levels from the root node to the deepest child. Because a simple three-component uint vector is used for the position inside the SVO, the common ancestor of two positions p0 and p1 can be found very simply by findMSB((p0.x ^ p1.x) + (p0.y ^ p1.y) + (p0.z ^ p1.z)). The pseudocode for this algorithm is presented below.

PseudoCode: Stack Raycasting Algorithm

P = first intersection between the ray and node
Dmax = largest octree depth
S[Dmax] = stack of nodes containing only the root node
//S[] stores the valid common ancestors
D = starting octree depth, 0
while D < Dmax
    S[D + 1] = S[D].child[x]
    //x is the child containing P
    N = S[D + 1]
    D = D + 1
    if N is not empty then
        return colour of N
    else
        Pnext = next intersection between the ray and node
        D = log2(Pnext xor P)
        //log2(Pnext xor P) is the depth of the common ancestor
        P = Pnext
return black
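The common ancestor computation can be sketched in C as follows. This version combines the per-axis XOR values with a bitwise OR rather than a sum. That is an assumption made for the sketch: OR reports exactly the highest differing bit, whereas a sum can carry into a higher bit and so name an ancestor higher than strictly necessary (which is still a valid, if less tight, choice). The helper `find_msb` mirrors the behaviour of GLSL's `findMSB` for unsigned input; both function names are hypothetical.

```c
#include <stdint.h>

/* Index of the highest set bit, or -1 when v is zero. */
static int find_msb(uint32_t v)
{
    int msb = -1;
    while (v) {
        v >>= 1;
        msb++;
    }
    return msb;
}

/* Number of levels above the smallest cells at which p0 and p1 share a
 * cell: 0 means the same smallest cell, 1 a shared cell of side 2, etc. */
static int common_ancestor_level(const uint32_t p0[3], const uint32_t p1[3])
{
    uint32_t diff = (p0[0] ^ p1[0]) | (p0[1] ^ p1[1]) | (p0[2] ^ p1[2]);
    return find_msb(diff) + 1;
}
```

For example, positions (3, 0, 0) and (4, 0, 0) are adjacent but differ in bit 2 of x, so their smallest shared cell has side 8 (level 3), while (0, 0, 0) and (1, 0, 0) share a cell of side 2 (level 1).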


In further detail, the ray-cube intersection method works by calculating the ray's t value at each of the six planes which make up the cube. Of these six values of t, the near value is the minimum value that lies on the surface of the cube, and the far value is the maximum value that lies on the surface. This is shown best by a diagram in 2D. The near and far values are then compared to check that the ray actually hit the cube and that the intersection point is in front of the camera. If the far value is less than the near value then the ray did not intersect the cube, and if the far value is less than zero then the cube is behind the camera.

Figure: The Ray/Voxel Intersection Algorithm. (2D analogue)
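One common way to compute the near and far values is the slab method, sketched below in C. It is offered as an illustration of the test described above rather than as the project's implementation. The function name and parameters are hypothetical, the reciprocal of the ray direction is assumed to be precomputed by the caller (in the spirit of the pre-multiplication mentioned earlier), and the sketch assumes no component of the direction is exactly zero.

```c
#include <math.h>
#include <stdbool.h>

/* Slab test for the ray o + t * d against the axis-aligned cube with
 * minimum corner `lo` and side `size`. `inv_d` holds 1 / d per axis.
 * Writes the near and far t values and returns true on a hit in front
 * of the ray origin. */
static bool ray_cube(const float o[3], const float inv_d[3],
                     const float lo[3], float size,
                     float *t_near, float *t_far)
{
    float tn = -INFINITY, tf = INFINITY;
    for (int a = 0; a < 3; a++) {
        /* t values at the two planes bounding the cube on this axis */
        float t0 = (lo[a] - o[a]) * inv_d[a];
        float t1 = (lo[a] + size - o[a]) * inv_d[a];
        if (t0 > t1) { float tmp = t0; t0 = t1; t1 = tmp; }
        if (t0 > tn) tn = t0;   /* latest entry across the three axes */
        if (t1 < tf) tf = t1;   /* earliest exit across the three axes */
    }
    *t_near = tn;
    *t_far = tf;
    /* far < near: missed the cube; far < 0: the cube is behind the origin */
    return tf >= tn && tf >= 0.0f;
}
```

The latest entry and earliest exit across the three axes are exactly the two plane hits that lie on the cube's surface, which is how this formulation matches the description in the text.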

All of the rays cast into the scene are independent of each other in terms of the variables they use, and the raycaster only reads from the SVO data structure and does not write to it. This means that the rays can be run in parallel and take advantage of the huge number of processors on the GPU.

I used OpenGL 4.5 compute shaders for this purpose. Compared to alternatives like CUDA and OpenCL, compute shaders have the advantage that memory can be shared between the compute shaders and the graphical shaders very simply and efficiently. In OpenCL it is possible to share the memory, but this is not as simple or as widely available across platforms. For the hardware and software combination I was using to develop the system (NVIDIA proprietary drivers on Linux), OpenGL also seems to have better support and more frequently updated drivers than OpenCL.

Simple experimentation showed that blocks of 16x16 rays work the fastest. This means that 256 rays are cast in parallel in a single work-group, although multiple work-groups may be executing at one time. This provides a massive advantage compared to a CPU renderer: although a desktop CPU may have four very fast cores, the large number of GPU cores is a better fit for this work. This is because the work does not consist of particularly complicated variable-length calculations, but rather of many thousands of simple iterations of a tight loop. I compare the performance of the GPU and CPU versions further in the evaluation section.


The position and direction vectors of the rays to be cast are determined by the camera. The camera is an object with a position and a matrix indicating its direction; the matrix is derived from an angular position. The ray position is simply equal to the camera position, and the ray direction is equal to the camera direction matrix multiplied by the ray's screen position. The vector and matrix data structures and methods were written specifically for this project, so they are implemented as fixed-length arrays of floats, with simple operations such as multiply and inverse. The camera is controlled as if it were a physical object with mass, which makes the controls more intuitive for the user. This is achieved by applying forces and torques to the camera object when the input keys are pressed. The forces and torques are applied directly to the momentum vectors of the camera, using cross products and additions. This gives the movement in the scene a more physical, realistic feel, especially when viewing SVOs generated from scans of physical models.

Figure: The Camera and Octree Objects in 3D space.
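The ray setup described above can be sketched in C using fixed-length float arrays. The screen-space convention (an image plane at z = 1 spanning [-1, 1] vertically, scaled by the aspect ratio horizontally), the row-major matrix layout, and the function names are assumptions for illustration. The returned direction is not normalised.

```c
/* Multiply a row-major 3x3 matrix by a 3-vector. */
static void mat3_mul_vec3(const float m[9], const float v[3], float out[3])
{
    for (int r = 0; r < 3; r++)
        out[r] = m[3 * r] * v[0] + m[3 * r + 1] * v[1] + m[3 * r + 2] * v[2];
}

/* Build the ray through the centre of pixel (px, py) of a
 * width x height image: origin = camera position,
 * direction = camera direction matrix * screen position. */
static void camera_ray(const float cam_pos[3], const float cam_rot[9],
                       int px, int py, int width, int height,
                       float origin[3], float dir[3])
{
    float aspect = (float)width / (float)height;
    /* Assumed convention: image plane at z = 1, y pointing up. */
    float screen[3] = {
        (2.0f * ((float)px + 0.5f) / (float)width - 1.0f) * aspect,
        1.0f - 2.0f * ((float)py + 0.5f) / (float)height,
        1.0f
    };
    for (int i = 0; i < 3; i++)
        origin[i] = cam_pos[i];
    mat3_mul_vec3(cam_rot, screen, dir);
}
```

With an identity direction matrix and a 2x2 image, the top-left pixel yields the direction (-0.5, 0.5, 1).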

The rays store the returned data or colour value in a buffer in GPU memory. In the fragment shader (separate from the compute shaders) this data is rendered from the buffer onto the screen as a simple one-to-one translation. At this point the data can be converted to a screen colour.

A lighting model can be applied at this stage, using the normal value and possibly the colour value. In my project I have implemented multiple output modes, depending on the data stored in each node. If the data is a normal value, it can be used in a [CookTorrance] lighting model, as shown in some of the screenshots in Appendix A. There are also modes to display the raw normal vector value, and to display the number of iterations for each ray/pixel as a grey value.


Figure: Rendering of the Angel Model using Cook Torrance Lighting.

Optimisations

There are multiple possible optimisations that can be applied to both the algorithms and the data structure in the project. On the algorithms side, these apply to either the SVO generation or the SVO raycasting. For the SVO generation, the main optimisation is to implement the space skipping discussed previously, which drastically reduces the number of iterations of the tree builder. For the SVO raycasting there are multiple possibilities, of which the simplest is beamcasting. This combines a block of rays into a single group and does a simple initial pass to get a conservative starting point for the finer pass. In the end, I did not have time to complete this optimisation, although I got most of the way through it.

The main optimisation I focused on was tree broadening. I chose it because its drawbacks are subtle, rather than it being a very easy tool to apply, and the evaluation is much richer as a result. Tree broadening works by expanding the number of child nodes at each level in the octree, so that instead of an octree the structure becomes a 64-tree, a 512-tree or wider. This is an advantage in two ways. Firstly, it reduces the number of memory accesses necessary for traversal, which is very important for shaders running on the GPU because pointer chasing can be very slow. Secondly, it ...

The size of the SVO in memory can be reduced even further by storing only the voxels that are currently being seen or have recently been seen. One algorithm to implement this would be random cache replacement: when a new voxel needs to be loaded, the voxel to be replaced is selected randomly, with smaller voxels being selected more often and voxels near the root very rarely. When a voxel is replaced, it is either still in view or has moved out of view. If it is still in view, it will have to be loaded again. The ratio of unseen to seen voxels must therefore be large enough that a randomly selected voxel is likely to be unseen. There are other cache replacement algorithms, but they are more complex and are unlikely to be fast enough. Random replacement is very fast, so long as there is enough spare memory.

To make raycasting at a reasonable resolution and framerate feasible, multiple optimisations have to be implemented. All of the optimisations share two features: they slightly increase the complexity of each ray iteration, but they drastically reduce the total number of ray iterations necessary. As such, these are the two main values that this project measures. The total number of ray iterations can be measured with a simple counter. The iteration complexity can be calculated by dividing the total time taken to render all of the rays by the number of ray iterations.
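Returning to tree broadening, its effect on traversal depth can be sketched in C. In a tree whose nodes have (2^k)^3 children, each level consumes k bits of every coordinate, so k = 1 is an octree, k = 2 a 64-tree and k = 3 a 512-tree. The function names and the ordering of children inside a wide node (x lowest) are assumptions for illustration.

```c
#include <stdint.h>

/* Child index inside a node with (2^k)^3 children, taking k bits per axis
 * starting at bit `shift`. Assumed layout: x in the lowest k bits. */
static unsigned wide_child_index(uint32_t x, uint32_t y, uint32_t z,
                                 unsigned shift, unsigned k)
{
    uint32_t mask = (1u << k) - 1u;
    uint32_t cx = (x >> shift) & mask;
    uint32_t cy = (y >> shift) & mask;
    uint32_t cz = (z >> shift) & mask;
    return cx | (cy << k) | (cz << (2 * k));
}

/* Node visits needed to descend from the root to a smallest cell when
 * positions have `bits` bits per axis and each level consumes k bits. */
static unsigned levels_needed(unsigned bits, unsigned k)
{
    return (bits + k - 1) / k;
}
```

For the 4096^3 scenes (12 bits per axis), an octree needs 12 node visits from root to smallest cell, a 64-tree needs 6 and a 512-tree needs 4, which is the reduction in dependent memory reads that the first advantage refers to.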

One optimisation works by splitting the screen space into batches of rays that all travel through the same cell. In more detail, this optimisation casts many rays simultaneously, as a beam. This works because, while all of its rays pass through the same voxels, the beam is equivalent to casting those rays individually. By detecting whether the extents of the beam are inside the same voxel as the primary ray, we can tell whether the beam needs to split. Once we have detected a split, there are two possible options depending on the algorithm being used. The first, called beam tracing, is simpler but not as effective: it simply spawns a ray for every pixel covered by the beam when it splits and then continues raycasting each of the rays. Thus, there is only one level of beams, and the initial pattern of beams must be carefully decided. In the second option, called adaptive beam tracing, splitting involves detecting the line or lines along which the beam should be split, and then spawning multiple child beams which together cover the same screen area as the previous beam. So, there are multiple levels of beams, and the initial beam pattern does not need to be so carefully chosen, because any beam pattern will quickly become equivalent after a few iterations. The second option also needs a test to detect whether a spawned beam is smaller than a single pixel area, because such a beam should not split any further and becomes equivalent to a ray.
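The split test can be sketched as follows. This is an illustration only: it assumes the beam is described by the current positions of its primary ray and four corner rays, and that all coordinates are non-negative voxel-space positions. The vec3 and cell structs and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct vec3 { float x, y, z; };
struct cell { int32_t x, y, z; };

/* Integer coordinates of the voxel of the given size that contains p.
 * Assumes non-negative coordinates, so truncation equals floor. */
static struct cell cell_of(struct vec3 p, float size) {
    assert(size > 0.0f && p.x >= 0.0f && p.y >= 0.0f && p.z >= 0.0f);
    struct cell c = {
        (int32_t)(p.x / size), (int32_t)(p.y / size), (int32_t)(p.z / size)
    };
    return c;
}

static bool same_cell(struct cell a, struct cell b) {
    return a.x == b.x && a.y == b.y && a.z == b.z;
}

/* True when at least one corner of the beam lies in a different voxel
 * from the primary ray, which means the beam has to split. */
bool beam_must_split(struct vec3 primary, const struct vec3 corners[4],
                     float voxel_size) {
    struct cell home = cell_of(primary, voxel_size);
    for (int i = 0; i < 4; i++) {
        if (!same_cell(home, cell_of(corners[i], voxel_size))) {
            return true;
        }
    }
    return false;
}
```

With beam tracing, a true result would spawn one ray per covered pixel; with adaptive beam tracing, it would spawn child beams instead.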

A second optimisation, which is very simple to implement, is to detect when a ray or beam reaches a voxel which is smaller than a pixel area. When this happens, continuing the ray/beam will not have any more effect on the rendered image, so it can simply be discontinued. This is very helpful in reducing the iteration count, especially in very large scenes, with large voxels.
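A minimal sketch of this termination test is given below. It assumes the camera mapping used by the raycaster listing in Appendix B, where neighbouring pixels are 1 / screen width apart at unit depth, and it assumes that depth and voxel size are in the same units. The function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* True when a voxel of the given edge length, seen at the given depth,
 * covers less than one pixel and so need not be descended into. */
bool voxel_is_subpixel(float voxel_size, float depth, float screen_width) {
    assert(voxel_size > 0.0f && depth > 0.0f && screen_width > 0.0f);
    /* Width covered by one pixel at this depth. */
    float pixel_footprint = depth / screen_width;
    return voxel_size < pixel_footprint;
}
```

For a 1280 pixel wide image, a unit voxel becomes sub-pixel once it is more than 1280 units away from the camera.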

Another optimisation that can be implemented is to detect which rays/beams actually need to be recast on the current frame. For example, if the camera has not moved, and the scene has not changed, then no rays/beams need to be recast. If only the camera has moved, then the results of the previous rays/beams can be reprojected by applying the inverse of the transformation which the camera underwent. Then the pixels which have become empty must also be raycast. If some part of the scene has changed, then the pixels which cover that object must be recast. This can be done simply by calculating the difference in camera and object matrices from one frame to the next.

Evaluation

This chapter will describe the measurements and hypotheses used to evaluate the project implementation. There were five original proposal success criteria. These have been fulfilled and surpassed. I have developed five tests in order to demonstrate this and show the capability of the system, and the performance of the optimisations.

Testing

Test Objectives

The motivation of the project is to show that voxel raycasting can achieve good performance with large scenes, necessary for real world applications. So, the test objectives are built around this. They will test the performance and memory usage of the SVO raycaster and the SVO constructor. These tests will determine:

1. Whether the SVO construction optimisations had a significant positive impact on performance. To do this, I measure the system, user, and real time taken to construct the SVOs with and without the optimisations.

2. What impact the SVO construction optimisations had on memory usage. To do this, I measure the file size of the different models with the different optimisation levels applied, after the SVO construction.

3. What impact the SVO raycaster optimisations had on performance. To do this, I measure the number of frames rendered per second for a large, varied set of frames.

4. How large the impact of running the SVO raycaster on the GPU rather than the CPU is. To do this, I measure the time taken, and frames per second, to perform the same series of renderings on both the CPU and GPU.

5. What impact the resolution has on the SVO raycaster. To do this, I limit the frame width and height to different standard resolutions such as 1080p and 720p (as mentioned in the original success criteria) and measure performance in frames per second.

Test Program

The tests will all operate on the standard, pre-voxelised models mentioned in the implementation section. This will keep external factors such as model complexity out of the measurements. The tests are all composed of bash scripts that will run without any user intervention. This means I can run the tests for a longer period of time, in order to get more accurate results. The bash script for each test runs the main program for a range of inputs and will record some measurement from the program.

Test Hardware

The hardware used was an Intel quad-core i5-3470 CPU clocked at 3.2GHz with 4GB of RAM; the GPU was an Nvidia GTX 760 Ti, also with 4GB of RAM. The CPU specification should not matter for most of the tests, as the important parts of the programs run on the GPU; however, it may be relevant for the CPU comparison. The system was running Linux, which should have no impact on performance since it was using the proprietary drivers.

Render Output

As the GPU SVO raycaster is performing exactly the same calculations every frame, the output render should be exactly the same between different frames. There will be no visual degradation from any of the optimisations. However, there could be a small difference between the CPU and GPU versions because they rely on different floating point hardware. No such difference was seen in any of the comparison images. Not every frame was compared, but the sample was large enough to be confident that the outputs were the same.

Figure: Screenshots Produced by the Optimised Raycaster.

SVO Construction Speed

This section will answer whether the SVO Construction Optimisations had an impact on the SVO Construction Speed.

The following table lists the construction times of the four models with and without optimisation. It is quite clear from this that the optimisations make a dramatic impact on the speed. They reduce the time taken from 4.75 - 6 minutes to 4 - 49 seconds.

These measurements were taken using the Linux time command. The times are the sum of the user and sys values, which measure the amount of time the process spent in userspace and in the kernel respectively. This also includes the time taken to load the model from disk and save the voxels to disk, which can take a long time given the size of the datasets. Even though the results include the system time (the time to write to disk), which was roughly constant between the optimised and non-optimised versions, they still show a dramatic improvement and a fulfilment of the second success criterion.

Model     Non-Optimised / seconds   Optimised / seconds
Angel     285.18                    3.94
Dragon    286.51                    4.38
Sponza    358.52                    48.54
Sibenik   336.63                    33.31

Tree Broadening Memory Usage

I performed an experiment to determine how large the negative impact of the tree broadening optimisation was on the data size. Although tree broadening has a positive impact on rendering speed in some cases, it can also have a large negative impact on memory usage, as shown in the following graph. The important line on this graph is at 2000 MB, which is the maximum size that will fit into GPU memory. All four of the models fit at 8-width, but at 512-width only the smaller Angel and Dragon models fit. Although I cannot measure the rendering performance directly for the two larger models at 512-width, I predict that it would follow a similar pattern, i.e. that 64-width is a large improvement and 512-width is a slightly smaller improvement in rendering performance.

This situation could be improved, to allow even larger models, by implementing a method to stream the SVOs into GPU memory from RAM or even disk. However this was beyond the scope of the project, and would take time away from implementing the core optimisations.

Tree Broadening Render Performance

This test does not explicitly fulfil one of the (slightly) conservative original success criteria; however, it does measure an important aspect of the project: how much effect an optimisation such as tree broadening has on rendering performance. Through these measurements, I have shown that the optimisations I implemented improve the performance by over 50% in every case.

As an interesting note: the 512-width version, although performing worse than the best, still performs better than the original. From small experiments, I expect it falls behind the 64-width version because, although the depth of the tree is decreased from 12 to 4, the width of the tree is greatly increased and so overall the average number of iterations increases. The 64-width performs the best because it has the best tradeoff between reducing the tree depth (from 12 to 6) and increasing the tree width (from 8 to 64).
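The depth and width figures above follow from how many bits of each coordinate are consumed per tree level. The sketch below assumes a grid with 12 bits per axis, as in the maxlevel = 12 / tree_num line of the raycaster listing in Appendix B. The struct and function names are hypothetical, and the child index uses the bit weights (1, 2, 4), (8, 16, 32) and (64, 128, 256) that appear in that listing.

```c
#include <assert.h>
#include <stdint.h>

struct tree_shape {
    uint32_t cells_per_axis; /* 2, 4 or 8 */
    uint32_t children;       /* 8, 64 or 512 children per node */
    uint32_t depth;          /* 12, 6 or 4 levels */
};

/* Shape of a broadened tree that consumes bits_per_level bits of each
 * coordinate at every level of a 12-bit grid. */
struct tree_shape tree_shape_for(uint32_t bits_per_level) {
    assert(bits_per_level >= 1 && bits_per_level <= 3);
    struct tree_shape s;
    s.cells_per_axis = 1u << bits_per_level;
    s.children = s.cells_per_axis * s.cells_per_axis * s.cells_per_axis;
    s.depth = 12u / bits_per_level;
    return s;
}

/* Index of the child that contains voxel (x, y, z) at the given level,
 * interleaving one bit of x, y and z at a time. */
uint32_t child_index(uint32_t x, uint32_t y, uint32_t z,
                     uint32_t level, uint32_t bits_per_level) {
    uint32_t index = 0;
    for (uint32_t b = 0; b < bits_per_level; b++) {
        uint32_t shift = bits_per_level * level + b;
        index |= ((x >> shift) & 1u) << (3 * b + 0);
        index |= ((y >> shift) & 1u) << (3 * b + 1);
        index |= ((z >> shift) & 1u) << (3 * b + 2);
    }
    return index;
}
```

One, two and three bits per level give 8, 64 and 512 children with depths of 12, 6 and 4, which matches the three widths compared in this section.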

Model     Width 8 / frames per second   Width 64 / frames per second   Width 512 / frames per second
Angel     40.1                          63.4                           57.8
Dragon    25.2                          42.1                           39.9
Sponza    22.0                          36.6                           -
Sibenik   19.2                          35.4                           -

CPU vs GPU Render Performance

Although this test does not explicitly fulfil one of the criteria, it is an interesting comparison. For this test I had to port the GLSL raycaster to C, with all the same optimisations. The CPU version does not take advantage of the multi-core nature of the processor, but doing so would at most improve the result by a factor of 4 (for a quad-core machine). The values in the following table show just how clearly the GPU is better suited to this task, performing over 100 times faster.

This test also took the longest to run, because the CPU version needs a huge amount of time to produce a reliable value.

Model     CPU / frames per second   GPU / frames per second
Angel     0.341372                  40.079346

1080p vs 720p Render Performance

This test was designed to determine whether the first criterion was met: that the renderer should run consistently at 720p and 30fps or better. The following graph shows that it was. All of the 720p bars (shown in orange) are over the 30fps line. For the Angel model, the 1080p bar is also well above 30fps. However, for the other three models (Dragon, Sponza, Sibenik) the 1080p bar falls just short of 30fps.

The 900p test was included simply out of curiosity, to see whether the performance would always be proportional to the number of pixels being raycast every frame. It is not. The 1080p version performs up to 90 million ray casts per second, whereas the 720p version performs only 76 million. I am unsure why this difference exists, but it does not affect the positive results.
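The relationship between resolution, frame rate and ray throughput can be sketched as below, assuming exactly one primary ray is cast per pixel per frame. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Rays cast per second when every pixel casts one ray per frame. */
double rays_per_second(uint32_t width, uint32_t height, double fps) {
    assert(fps >= 0.0);
    return (double)width * (double)height * fps;
}

/* Inverse of the above: the frame rate implied by a given ray throughput. */
double fps_for_ray_rate(uint32_t width, uint32_t height, double rays_per_sec) {
    assert(width > 0 && height > 0);
    return rays_per_sec / ((double)width * (double)height);
}
```

Under that assumption, the 720p 30fps target corresponds to about 27.6 million rays per second. The measured 76 million rays per second at 720p would correspond to roughly 82 frames per second, and 90 million at 1080p to roughly 43.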

Conclusion

This project successfully explored, implemented and optimised an SVO Constructor and Raycaster. All five of the original success criteria have been fulfilled and surpassed. The project implemented very successful optimisations to both parts, consistently improving the performance of the raycaster by over 50% and improving the construction speed of the SVOs by over 70 times in the best case.

If I were starting the project again, I would focus on fewer optimisations, because although I explored and partly implemented multiple different optimisations, I only reached completion and successful results with one. The other optimisations I explored were either too time consuming to implement completely or, from initial testing, did not provide significant benefits.

If the system were to be used in a real world application, the feature with the most immediate benefit would be to stream SVOs from RAM or disk into GPU memory. This would mitigate the need to compress the data structure as much, and also allow much larger models. This was beyond the scope of the project.

Bibliography

[Laine2010] Samuli Laine, Efficient Sparse Voxel Octrees - Analysis, Extensions, and Implementation, 2010, https://mediatech.aalto.fi/~samuli/publications/laine2010tr1_paper.pdf

[Crassin2011] Cyril Crassin, GigaVoxels: A Voxel-Based Rendering Pipeline For Efficient Exploration Of Large And Detailed Scenes, 2011, http://maverick.inria.fr/Publications/2011/Cra11/CCrassinThesis_EN_Web.pdf

[Baert2014] Jeroen Baert, Out-of-Core Construction of Sparse Voxel Octrees, 2014, http://graphics.cs.kuleuven.be/publications/BLD14OCCSVO/BLD14OCCSVO_paper.pdf

[Williams] Williams College CS Department, 2016, http://www.cs.williams.edu/

[Stanford] Stanford 3D Scanning Repository, 2016, http://graphics.stanford.edu/data/3Dscanrep/

[CookTorrance] Cook Torrance, 2016, http://ruh.li/GraphicsCookTorrance.html

Appendix A - Screenshots

Figure: Dragon Rendering showing the number of Iterations per Pixel.

Figure: Angel Rendering in Normal-Vector Mode.

Figure: Sibenik Rendering in Iterations Mode.

Appendix B - Code Samples

GLSL Code: The Raycasting Algorithm (with all optimisations).

layout(local_size_x = 16, local_size_y = 16) in;

//because the cube is always axis aligned, the test simplifies to this method
//this returns the value of the nearest and farthest t values of the ray which intersect the cube
vec2 aligned_ray_cube_intersect(vec3 rp, vec3 rd, ivec3 pos, float size) {
    vec3 t0 = (pos - rp) / rd;
    vec3 t1 = (pos - rp + vec3(size)) / rd;
    vec3 t2 = min(t0, t1);
    vec3 t3 = max(t0, t1);
    return vec2(
        max(max(t2.x, t2.y), t2.z),
        min(min(t3.x, t3.y), t3.z)
    );
}

//iteratively step through the octree, using the intersect method to find the next position
voxel raycast(voxel v, vec3 rp, vec3 rd, vec3 cp, mat3 cd) {
    const uint tree_num = 1;
    const uint cellsize = uint(pow(2, tree_num));
    const uint treesize = uint(pow(cellsize, 3));
    const uint maxlevel = 12 / tree_num;
    const uint maxiter = 256;
    rp += -cp;
    rp *= inverse(cd);
    rd *= inverse(cd);
    vec2 ti = aligned_ray_cube_intersect(rp, rd, ivec3(0), pow(cellsize, maxlevel));
    ivec3 pos;
    vec3 fpos = vec3(0);
    if (ti.x < ti.y - 0.001) {
        if (ti.y <= 0) {
            return voxel(0, vec3(0), vec3(0));
        } else if (ti.x <= 0) {
            pos = ivec3(rp);
        } else {
            pos = ivec3(rp + rd * (ti.x + 0.25));
        }
    } else {

    }
    uint level = maxlevel;
    node stack[maxlevel + 1];
    stack[maxlevel] = nodes[0];
    int i;
    for (i = 0;i < maxiter;i++) {
        while (stack[level].offset != -1) {
            stack[--level] = nodes[stack[level + 1].offset
                + uint(dot((pos >> (tree_num * level + 0)) & ivec3(1), ivec3(1, 2, 4)))
                //+ uint(dot((pos >> (tree_num * level + 1)) & ivec3(1), ivec3(8, 16, 32)))
                //+ uint(dot((pos >> (tree_num * level + 2)) & ivec3(1), ivec3(64, 128, 256)))
            ];
        }
        pos = (pos >> (tree_num * level)) << (tree_num * level);
        if (stack[level].data != 0) {
            if (mode == 0) {
                return voxel(1,
                    light(vec3(
                        float((stack[level].data >> 0 ) & 255) / 256 * 2 - 1,

                        float((stack[level].data >> 16) & 255) / 256 * 2 - 1
                    ), rd, vec3(1)),
                    vec3(rp + rd * (aligned_ray_cube_intersect(rp, rd, pos, pow(cellsize, level)).x + 0.01) + 0.001 * sign(rd))
                );
            } else if (mode == 1) {
                return voxel(1, vec3(i) / maxiter, vec3(0));

                return voxel(1, vec3(

                    float((stack[level].data >> 16) & 255) / 256 * 2 - 1
                ), fpos);
            }
        }
        uvec3 dpos = uvec3(pos);
        fpos = rp + rd * (aligned_ray_cube_intersect(rp, rd, pos, pow(cellsize, level)).y + 0.01) + 0.001 * sign(rd);
        pos = ivec3(floor(rp + rd * (aligned_ray_cube_intersect(rp, rd, pos, pow(cellsize, level)).y + 0.01) + 0.001 * sign(rd)));
        dpos ^= uvec3(pos);
        level = findMSB(dpos.x | dpos.y | dpos.z) / tree_num + 1;
        if (level > maxlevel) {
            if (mode == 0) {

                return voxel(1, vec3(i) / maxiter, vec3(0));

            }
        }
    }

};

void main() {
    //output the colour to the array for use by the fragment shader
    uvec2 screen_pos = gl_GlobalInvocationID.xy;
    if (screen_pos.x < screen.x && screen_pos.y < screen.y) {
        uint index_new = screen_pos.x + screen_pos.y * int(screen.x) + currentimage * int(screen.x) * int(screen.y);
        image[index_new] = raycast(
            voxel(0, vec3(0), vec3(0)),
            camera_pos.xyz,
            //map the screen position to 3D camera space
            mat3(camera_dir) * vec3((screen_pos.xy + vec2(0.5) - screen.xy / 2) / screen.xx, 1),
            scene_pos.xyz,
            mat3(scene_dir)
        );
    }
}

C Code: The SVO Construction Algorithm (with all optimisations).

void insert(uint32_t data, uint32_t level) {
    uint32_t value = 0;
    if (data != 0) {
        struct data_old d = payload_nodes[data];
        value =
            ((uint8_t)((d.n[0] + 1) / 2 * 256) << 0) |
            ((uint8_t)((d.n[1] + 1) / 2 * 256) << 8) |
            ((uint8_t)((d.n[2] + 1) / 2 * 256) << 16);
    }
    stack[level][pointer[level]] = (struct node_new){.data = value, .offset = -1};
    pointer[level] += 1;
    while (pointer[level] == treesize) {
        uint32_t combine = 1;
        for (int i = 0;i < treesize;i++) {
            combine = combine &&
                stack[level][0].data == stack[level][i].data &&
                stack[level][i].offset == -1;
        }
        if (combine) {
            stack[level + 1][pointer[level + 1]] = (struct node_new){.data = stack[level][0].data, .offset = -1};
        } else {
            stack[level + 1][pointer[level + 1]] = (struct node_new){.data = 0, .offset = output_index};
            for (int i = 0;i < treesize;i++) {
                output_nodes[output_index] = stack[level][i];
                output_index += 1;
            }
        }
        pointer[level + 1] += 1;
        pointer[level] = 0;
        level += 1;
    }
}

void traverse(uint32_t input_index, uint32_t level) {
    struct node_old n = input_nodes[input_index];
    if (*((uint64_t*)&n.offsets) == (uint64_t)-1) {
        uint32_t num_voxels = 1 << (3 * level);
        uint32_t l = level * 3 / (uint32_t)log2(treesize);
        for (int i = 0;i < num_voxels / (uint32_t)pow(treesize, l);i++) {
            insert(n.data, l);
        }
    } else {
        for (int i = 0;i < 8;i++) {
            if (n.offsets[i] == (uint8_t)-1) {
                uint32_t num_voxels = 1 << (3 * (level - 1));
                uint32_t l = (level - 1) * 3 / (uint32_t)log2(treesize);
                for (int j = 0;j < num_voxels / (uint32_t)pow(treesize, l);j++) {
                    insert(n.data, l);
                }
                /*for (int j = 0;j < num_voxels;j++) {
                    insert(n.data, 0);
                }*/
            } else {
                traverse(n.base + n.offsets[i], level - 1);
            }
        }
    }
}

Appendix C - Project Proposal

Patrick Gordon

Selwyn College

pjg56

Computer Science Project Proposal

Optimisation of Voxel Rendering for Large Scenes on Desktop GPUs

22 Oct 2015

Project Supervisor: Erroll Wood

Director of Studies: Richard Watts

Project Overseers: Anil Madhavapeddy & Simone Teufel

Introduction, The Problem To Be Addressed

Graphically intensive games and realtime applications generally render by rasterising triangles. This has worked well for some years, but there is a limit to the detail and complexity of a scene that is described by triangles. This limit exists because the renderer must loop through some non-trivial fraction of all of the triangles in the scene. This fraction can be improved only so much, by clever culling and clipping of the list of triangles and the triangles themselves. Another method to render 3D scenes is raycasting. This method works by looping through every pixel in the screen and casting a ray from that pixel into the scene. This does not have the same limit as triangle based rendering. The limit for raycasting is based on the screen resolution and the framerate. Compared to rasterising, the algorithms to improve the performance of raycasting can achieve much higher gains by taking advantage of the processing power of compute shaders and modern graphics cards. In this project, I implement voxel raycasting with accelerating data structures, explore the possible optimisations and show that raycasting can achieve similar results to triangle rasterising on current hardware.

Starting Point

Some of the most recent developments in voxel rendering include out of core construction of Sparse Voxel Octrees (SVOs), beam tracing of SVOs and the inclusion of normal data. The state of the art in SVO construction and rendering is outlined in papers by Samuli Laine (2010), Cyril Crassin (2011) and Jeroen Baert (2014). No existing code will be used from these projects as they differ in design. Libraries that will be used are OpenGL, for interfacing with the graphics card, and GLFW, which is used to handle the setup of OpenGL applications, including keyboard, mouse, graphics card and monitor management.

Resources Required

For this project I shall mainly use my own quad-core computer with a graphics card (Nvidia GTX 760) that runs Arch Linux. Backup will be to GitHub and a weekly copy of the git repository will be made on a USB drive. I will rely on software libraries to interface with the graphics card.

Work to be done

The project breaks down into the following sub-projects:

● Set up a graphical interface, using GLFW, which renders a basic test image using OpenGL compute shaders. This should also test the interaction between the shader and keyboard and mouse input. This should include the ability to output performance metrics like rays per second somehow.

● Implement the SVO ray casting which the optimisations will be based on.

● Implement the SVO construction.

● Design and Implement the various algorithms to improve the performance of the construction and rendering, including tree broadening and beam casting.

Success Criterion for the Main Result

The main success criterion for the project is to

● Render SVOs using the GPU, with good performance (At least 720p 30fps).

● Construct SVOs using the GPU, with multiple different scenes.

● Implement keyboard and mouse controls to allow moving through the 3D volume.

● Allow the selection of multiple different voxel test scenes.

● Evaluate the performance with metrics such as iteration count, iteration complexity, rays per second and seconds per frame.

Possible Extensions

There are multiple possible extensions to the two different parts of the implementation. For the SVO generation, random cache replacement with logarithmic distribution could be implemented in order to dramatically reduce the memory usage for large scenes. For the SVO rendering, adaptive beam tracing would reduce the number of rays needing to be cast, and so improve the overall performance. Another possible extension to the SVO rendering would be reprojection from frame to frame, which would greatly reduce the number of rays needing to be cast without affecting the image quality. Another possible extension is tree broadening, to speed up traversal of the tree on graphics hardware, which typically has slow memory access.

Timetable: Work Plan and Milestones to be achieved.

Fortnight   Work To Be Completed
22/10/15    Preparation. Research software libraries necessary for simple gpu programming and keyboard/mouse programming. Research current SVO rendering approaches. Milestone: Be ready to start programming in C.
05/11/15    Preparation. Setup environment and libraries, including the opengl library, keyboard/mouse library, version control program, latex editing program, figure drawing program. Milestone: Be ready to start programming shaders.
17/11/15    Programming. Experiment with shaders, get a basic compute shader setup that draws to the screen. Setup compute shader that draws a cube by intersecting rays with it.
03/12/15    Programming. Write the shaders to render an SVO from memory inside the cube. Milestone: Be able to render a sparse voxel octree.
17/12/15    Programming. Write the Progress Report and Write the shaders to construct an SVO on the gpu. Milestone: Both render and construct SVOs.
31/12/15    Programming. Continue writing the render/construction shaders. Milestone: Finish the primary shaders and be ready to start on the optimising shaders.
14/01/16    Programming. Implement some optimisations to the construction and ray casting. See the possible extensions section. In particular, tree broadening.
28/01/16    Programming. Implement some optimisations to the construction and ray casting. See the possible extensions section. In particular, probabilistic cache replacement. Milestone: Possibly implement some of the mentioned optimisations, finish programming.
11/02/16    Programming. Continue writing the optimisation shaders.
25/02/16    Programming. Milestone: Finish programming and be ready to start writing the dissertation.
10/03/16    Dissertation. Evaluate the project implementation and write the dissertation.
24/03/16    Dissertation. Continue writing the dissertation.
07/04/16    Dissertation. Milestone: Finish writing the dissertation and be ready to submit.
