07 Implementing a Parallel JPEG Encoder on Compute

Transcript of 07 Implementing a Parallel JPEG Encoder on Compute

Page 3: 07 Implementing a Parallel JPEG Encoder on Compute

In the previous console generations we were used to thinking of the GPU as a fully parallel machine where the programming model was well structured and where the interactions between parallel execution threads were minimal. Bending this concept is the key to exploiting the full power of our GPU.

There's a significant amount of ALU power out there waiting to be used. Most games shipped to date use only 25% of the ALU capability of the PS4 GPU.

Unless your problem is embarrassingly parallel, porting CPU code straight away is not a good idea: it might work, but quite likely it will be pretty slow.

Your algorithms must be re-engineered to exploit the nature of the GPU. So let's embrace a new programming model: we need to see the GPU as a 64-wide SIMD machine and start thinking in terms of wavefronts and lanes. Also, we now have a number of cross-thread communication primitives that we can use to share data between threads at wavefront, threadgroup and dispatch level.

Remember that besides a vector ALU our GPU also has a scalar ALU. Balancing the use of both ALUs is a key factor in achieving great performance.

Page 4: 07 Implementing a Parallel JPEG Encoder on Compute

When I started developing this JPEG encoder over one year ago I had no choice but to write most of my shaders in assembly. At that time, I did not even have a debugger.

Things have improved quite a lot since then, and from SDK 2.500 we now have a new shader compiler which dramatically improves code generation quality and exposes almost all of the GPU instructions as intrinsics.

Thanks to the new compiler I could move away from the assembler and start writing high level code again, with many advantages in terms of productivity and code readability.

Also, since last year we have a GPU debugger which allows mixed-mode debugging, so you can see your original source code interleaved with the disassembly in the debugger.

Finally, in SDK 2.5 we released orbis-shaderperf, a command line tool that performs static analysis on your shaders and provides some high level information about theoretical performance and code generation. For instance, this tool can tell you how many texture sampling instructions you're executing, or what the CU occupancy of a given shader is, and so on.

Page 5: 07 Implementing a Parallel JPEG Encoder on Compute

Many of you might be thinking: why are you writing a JPEG encoder on the GPU?

This encoder is part of the new RtSecondScreen library. This library can be used to stream audio and video from a secondary render target to a PSVita, or to an Android or iOS device.

For internal reasons, when we started development we could not use the HW encoder we use for the other video services.

We went for a simple Motion JPEG encoder, like in the PSX days. If you've never heard of MJPEG video, well, that's just a sequence of JPEG frames.

Of course we needed an encoder that would use as few game resources as possible, as we did not want to limit you guys.

Luckily, after some extensive optimization work this encoder is now super fast! A PSVita resolution frame with default quality settings can be encoded in 220 microseconds, which means we have almost 7 gigabytes per second of encoding bandwidth (for reference: a 960x544 frame at 3 bytes per pixel is roughly 1.57 MB, and 1.57 MB / 220 µs ≈ 7 GB/s).

Page 6: 07 Implementing a Parallel JPEG Encoder on Compute

Screenshot of the RtSecondScreen library sample.

If you’re interested in using this functionality in your game, please let us know!

Page 7: 07 Implementing a Parallel JPEG Encoder on Compute

JPEG is a standard lossy image compression format which has a number of applications in digital cameras, phones, the web, etc.

Technically the JPEG standard handles lossless compression as well, but that's not covered by this presentation.

The lossy algorithm in the JPEG format exploits limitations of the human eye to reduce the compressed image size.

Our eyes are more sensitive to brightness variations than to color variations. For this reason JPEG images are stored in a YCbCr color space (luminance, blue-difference chroma, red-difference chroma).

For the same reason JPEG optionally supports chroma subsampling, which means that the chrominance channels are stored at a different resolution than the luminance channel.

Moreover, our eyes tend to give less importance to high frequency information. For this reason JPEG uses a numeric transformation called the "Discrete Cosine Transform" or DCT to convert the image into the frequency domain, and applies a quantization step to filter out the high frequency components.

Finally, JPEG uses a Huffman encoder to further compress the resulting data.
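
For reference, the standard JFIF conversion from RGB to YCbCr (for 8-bit samples) is:

$$Y = 0.299R + 0.587G + 0.114B$$
$$Cb = -0.1687R - 0.3313G + 0.5B + 128$$
$$Cr = 0.5R - 0.4187G - 0.0813B + 128$$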

Page 8: 07 Implementing a Parallel JPEG Encoder on Compute

In the JPEG format images are arranged in so-called "minimum coded units" or MCUs.

The layout of the MCUs depends on the chroma subsampling scheme used to encode the image. My encoder supports only a 4:2:0 subsampling scheme, which means that chrominance is sampled at quarter resolution.

In our case, an MCU is made of 4 luminance blocks, 1 chroma red block and 1 chroma blue block.

Each block contains 8x8 pixels, which means 64 pixels per block. This magic number, 64, is also the number of threads we have in a single wavefront, and this is an incredibly helpful factor because it means that each 8x8 block can be processed by a single wavefront.

Page 9: 07 Implementing a Parallel JPEG Encoder on Compute

In the JPEG format, the discrete cosine transform is applied to retrieve a frequency domain representation of an 8x8 block.

The DCT is a very interesting transformation which expresses a set of points as a sum of cosines oscillating at different frequencies.

This transformation has strong energy compaction properties, which means that the signal data is concentrated in the low frequency components.

In the JPEG standard, the top-left DCT component of an 8x8 block is called DC, in yellow in the image on the right. The DC represents the average of all the values in the 8x8 block.

All the other components are called AC and are numbered from 1 to 63. They store the low to high frequency DCT coefficients.

Page 10: 07 Implementing a Parallel JPEG Encoder on Compute

This is an overview of all the steps required by a JPEG encoder.

I'm not going into the details of each step; we'll have a look at the most interesting ones later.

For now it's important to say that...

Page 11: 07 Implementing a Parallel JPEG Encoder on Compute

…these blocks are embarrassingly parallel and can be ported quite easily to the GPU.

Page 12: 07 Implementing a Parallel JPEG Encoder on Compute

Huffman encoding, on the other hand, is a quite serial problem, as we'll see later.

Before analysing the details of the most important passes, let's stop for one second on the high level overview.

There are at least 2 big problems with the current algorithm.

Page 13: 07 Implementing a Parallel JPEG Encoder on Compute

Problem number one: the original algorithm is made of a significant number of steps, 8 to be precise.

Each step requires data produced by the previous steps, and that's potentially bad because to implement this behaviour we need to use GPU fences. With a fence, the GPU waits for the termination of all the existing wavefronts before starting the wavefronts of the next step.

Also, dividing an algorithm into different steps normally requires the use of some staging buffers, which basically means increasing the amount of memory bandwidth used by your shaders.

Finally, there's a subtle problem which relates to the way the GPU creates wavefronts. An algorithm with a high number of short steps might suffer performance issues, as short-lived wavefronts can reduce occupancy.

Page 14: 07 Implementing a Parallel JPEG Encoder on Compute

Let's have a quick look at the short-lived wavefront issue.

Wavefronts are created by a hardware block called the Shader Pipe Interpolator or SPI.

There is one SPI in each shader engine.

Each shader engine contains 9 CUs. Each CU can keep up to 40 wavefronts in flight.

This means that each SPI needs to service 360 wavefront slots.

The number of cycles that the SPI requires to create a wavefront depends on the shader. The SPI is in fact responsible for initializing things like user SGPRs, system SGPRs and VGPRs, some LDS values in pixel shaders, etc.

So, if the number of SPI cycles multiplied by the shader engine occupancy is higher than the average number of execution cycles of your wavefronts, then boom! You have a bottleneck.

In other words, the SPI cannot create waves quickly enough to fill up the shader engine.

This topic is quite complex but it's very well explained in the SDK document mentioned in the slide.
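
Stated as a rough rule of thumb (my formalization of the paragraph above, not from the slide):

$$t_{SPI} \times N_{waves} > \bar{t}_{wave} \implies \text{SPI bound}$$

where $t_{SPI}$ is the number of cycles the SPI needs to create one wavefront, $N_{waves}$ is the number of wavefronts in flight on the shader engine (the occupancy), and $\bar{t}_{wave}$ is the average execution time of a wavefront in cycles.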

Page 15: 07 Implementing a Parallel JPEG Encoder on Compute

This is a GPU trace of a synthetic VALU test.

In this test I split my workload into 9 steps and added a GPU fence between each step.

As you can see there are some synchronization stalls in the GPU activity between each step.

The shader I'm using is purely VALU code which uses just a bunch of VGPRs. In theory this shader should have no occupancy issues. However, because each wavefront lasts just a few hundred cycles, the SPI cannot create wavefronts fast enough. The result is that each SIMD has only 4 wavefronts in flight instead of 10.

On a side note, the first step does not have occupancy problems. That's because of the initial instruction cache misses, which slow down the wavefronts, so the SPI has enough time to fill up the SIMDs.

In this case, the total GPU time is around 47 microseconds.

Page 16: 07 Implementing a Parallel JPEG Encoder on Compute

In this second test we have only 3 steps and each wavefront performs 3 times more work than in the previous experiment.

In total we're performing the same amount of work; we're just splitting it in a different way.

Each wavefront now lasts roughly 3 times longer and we're not SPI limited any more, so we have 10 wavefronts in flight at the same time.

Moreover, we now have only 2 GPU fences instead of 8.

The net result is that this test runs roughly 20% faster than the previous one.

Page 17: 07 Implementing a Parallel JPEG Encoder on Compute

There's another problem with the current algorithm, and this time it's even less intuitive.

As I said before, a shader engine can host up to 360 wavefronts in flight at the same time.

However, because of a hardware limit, if you dispatch compute jobs from a single asynchronous compute pipe you only get 320 wavefronts in flight on the same shader engine, reducing the theoretical occupancy by about 11%. Of course, this is not always an issue; for instance it's not worrying if you're VALU bound. In my case it was causing a ~10% slowdown in the encoding.

To work around this problem, we need to dispatch from at least 2 compute queues on different pipes.

Of course, this means re-thinking the algorithm so that the workload can easily be split across different dispatches.

And of course, we want to dispatch in a balanced way, to make sure that we keep each compute pipe active for the whole duration of the encoding process.

Another interesting thing we might try is to overlap different passes across different compute queues to fill the GPU synchronization gaps.

Page 18: 07 Implementing a Parallel JPEG Encoder on Compute

Here you have a GPU trace taken from the actual JPEG encoder.

In this test we're dispatching our compute jobs from a single pipe, as you can see in the "batches" section of the RTTV timeline.

The statistics panel on the right shows that we have up to 320 wavefronts per shader engine, which is less than optimal.

In this case the total GPU time is 179 microseconds.

Page 19: 07 Implementing a Parallel JPEG Encoder on Compute

Luckily for me, for totally unrelated reasons the JPEG encoder splits the image into 3 vertical slices.

That's required to parallelize the JPEG decoding on PSVita, which has 3 CPU cores available to the game.

Otherwise, even splitting into 2 would be fine to work around the hardware limitation we have on PS4.

So, each slice is encoded independently and uses its own compute ring, as shown in the image at the bottom of the slide.

Page 20: 07 Implementing a Parallel JPEG Encoder on Compute

That's a GPU trace of the JPEG encoder using 3 compute pipes.

The use of at least two compute queues on different pipes improves the launch rate. For instance, with 3 different compute pipes we have 360 wavefronts in flight per shader engine.

The net result is that the encoding is now 9% faster.

Page 21: 07 Implementing a Parallel JPEG Encoder on Compute

So, to work around the issues we just discussed, I merged the steps of the original JPEG algorithm into only 3 steps.

Page 22: 07 Implementing a Parallel JPEG Encoder on Compute

Let's have a look at the first step, which is responsible for:

color conversion

chroma subsampling

and block splitting

Page 23: 07 Implementing a Parallel JPEG Encoder on Compute

From an algorithmic point of view, this step is embarrassingly parallel and quite simple.

The original color conversion and chroma subsampling steps were SPI limited, so aggregating them together was a big win.

In general, the aggregated step is memory bound. This is due to the high number of texture operations: each thread performs 4 RGB texture samples and 6 single-channel texture writes, as in the sketch below.

The shader I'm using does not have enough ALU operations to cover all this latency. For this reason, we could try combining this step with the DCT step, which is supposedly VALU bound, and see what happens.
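
A minimal sketch of what each thread might do in this step (my reconstruction, not the actual library code; the resource declarations, PSSL spellings and output layout are all assumptions):

Texture2D<float4> sourceRT;      // secondary render target (RGB)
RW_Texture2D<float> lumaOut;     // full resolution Y plane
RW_Texture2D<float> cbOut;       // quarter resolution Cb plane
RW_Texture2D<float> crOut;       // quarter resolution Cr plane

// JFIF conversion on normalized values (offset 0.5 instead of 128).
float Luma(float3 c) { return dot(c, float3( 0.299,   0.587,   0.114 )); }
float Cb(float3 c)   { return dot(c, float3(-0.1687, -0.3313,  0.5   )) + 0.5; }
float Cr(float3 c)   { return dot(c, float3( 0.5,    -0.4187, -0.0813)) + 0.5; }

[NUM_THREADS(8, 8, 1)]
void main(uint2 dtid : S_DISPATCH_THREAD_ID)
{
    // Each thread owns a 2x2 quad of source pixels: 4 RGB samples.
    float3 c00 = sourceRT[dtid * 2 + uint2(0, 0)].rgb;
    float3 c10 = sourceRT[dtid * 2 + uint2(1, 0)].rgb;
    float3 c01 = sourceRT[dtid * 2 + uint2(0, 1)].rgb;
    float3 c11 = sourceRT[dtid * 2 + uint2(1, 1)].rgb;

    // 4 luminance writes, one per source pixel...
    lumaOut[dtid * 2 + uint2(0, 0)] = Luma(c00);
    lumaOut[dtid * 2 + uint2(1, 0)] = Luma(c10);
    lumaOut[dtid * 2 + uint2(0, 1)] = Luma(c01);
    lumaOut[dtid * 2 + uint2(1, 1)] = Luma(c11);

    // ...plus 2 chroma writes on the averaged quad (4:2:0 subsampling):
    // 6 single-channel writes per thread in total.
    float3 avg = (c00 + c10 + c01 + c11) * 0.25;
    cbOut[dtid] = Cb(avg);
    crOut[dtid] = Cr(avg);
}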

Page 24: 07 Implementing a Parallel JPEG Encoder on Compute

It's interesting to note that the 8x8 block splitting is implicitly defined by the wavefront layout.

As you can see in the code snippet, I'm using an 8-8-1 wavefront layout, which means that the threads within each wavefront are already arranged in an 8x8 block. This is nice because I can avoid some VALU work to rearrange my inputs.

The takeaway of this slide is that you should try to use the wavefront layout to simplify your problem.
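
The slide's snippet is not in the transcript; the relevant part is just the threadgroup declaration (the LDS keyword and semantic spellings are my assumptions, following the document's PSSL-style naming):

Texture2D<float> lumaIn;               // one 8x8-tiled input plane
thread_group_memory float block[8][8]; // LDS scratch for one 8x8 block

// An 8-8-1 layout: one 64-thread wavefront per threadgroup, with
// group-relative thread IDs already arranged as an 8x8 block.
[NUM_THREADS(8, 8, 1)]
void main(uint2 gtid : S_GROUP_THREAD_ID, uint2 dtid : S_DISPATCH_THREAD_ID)
{
    // gtid.xy is directly the pixel position inside the block:
    // no VALU work is needed to rearrange the inputs.
    block[gtid.y][gtid.x] = lumaIn[dtid];
}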

Page 25: 07 Implementing a Parallel JPEG Encoder on Compute

Let’s move to the second and more interesting step, which is responsible for:

DCT

zig-zag scan

and quantization

Page 26: 07 Implementing a Parallel JPEG Encoder on Compute

Again, this sums up to some embarrassingly parallel shader code.

The discrete cosine transform is a 2D convolution. Here you have its formula, just for reference.
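
The formula itself is not in the transcript; the standard 8x8 2D DCT used by JPEG is:

$$F(u,v) = \frac{1}{4} C(u) C(v) \sum_{x=0}^{7} \sum_{y=0}^{7} f(x,y) \cos\frac{(2x+1)u\pi}{16} \cos\frac{(2y+1)v\pi}{16}, \quad C(k) = \begin{cases} 1/\sqrt{2} & k = 0 \\ 1 & \text{otherwise} \end{cases}$$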

The zig-zag scan and the quantization, on the other hand, are quite cheap and simple operations.

Let's have a look at these operations in order.

Page 27: 07 Implementing a Parallel JPEG Encoder on Compute

My first implementation of the DCT was based on the formula in the previous slide.

Basically I have one compute dispatch for each color channel.

Each wavefront computes the DCT of an 8x8 block, and I'm using LDS memory to cache the 8x8 input values.

Of course this naïve approach is incredibly slow.

First of all it's VALU intensive because of the high number of floating point instructions. And remember that cosine instructions are quarter rate, which means they are 4 times slower than normal floating point instructions.

Also, because of the LDS scratch area, each wavefront produces 64 LDS accesses with many bank conflicts.
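
A hedged sketch of this naïve version (one thread per (u,v) coefficient; the names and resource declarations are mine, not the actual shipped code):

Texture2D<float>    channelIn; // one color channel, 0-255 values assumed
RW_Texture2D<float> dctOut;    // DCT coefficients for that channel

thread_group_memory float block[8][8]; // LDS cache of the 8x8 inputs

float C(uint k) { return k == 0 ? 0.70710678f : 1.0f; } // 1/sqrt(2) for k=0

[NUM_THREADS(8, 8, 1)]
void main(uint2 gtid : S_GROUP_THREAD_ID, uint2 dtid : S_DISPATCH_THREAD_ID)
{
    block[gtid.y][gtid.x] = channelIn[dtid] - 128.0f; // level shift
    // (no barrier needed: the threadgroup is a single wavefront)

    // Each thread computes one F(u,v): 64 LDS reads plus 128
    // quarter-rate cosines per thread. Slow, as noted above.
    float sum = 0.0f;
    for (uint y = 0; y < 8; ++y)
        for (uint x = 0; x < 8; ++x)
            sum += block[y][x]
                 * cos((2.0f * x + 1.0f) * gtid.x * (3.14159265f / 16.0f))
                 * cos((2.0f * y + 1.0f) * gtid.y * (3.14159265f / 16.0f));

    dctOut[dtid] = 0.25f * C(gtid.x) * C(gtid.y) * sum;
}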

Page 28: 07 Implementing a Parallel JPEG Encoder on Compute

Luckily, there's a ton of literature about the DCT transform used in JPEG. If you are even only remotely familiar with JPEG encoding, you've probably heard of Chen's fast DCT transformation.

http://www.cin.ufpe.br/~vak/tg/papers/Chen-dct.pdf

Basically, the 2D DCT formula I showed you before is a separable transform. We can apply a 1D DCT to the rows first and then to the columns.

For the 1D DCT I used a fast cosine transform which is conceptually very similar to an FFT. Basically it eliminates many redundant operations due to symmetry, reducing the complexity.

Also, this fast transform uses pre-computed cosines, which means I'm no longer using any quarter-rate cosine instructions.
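
In matrix form (my notation), separability means the 2D transform factors into two 1D passes:

$$F = D f D^{T}, \qquad D_{k,n} = s(k) \cos\frac{(2n+1)k\pi}{16}, \quad s(0) = \frac{1}{2\sqrt{2}}, \; s(k) = \frac{1}{2} \text{ for } k > 0$$

so one set of 1D DCTs over the rows ($f D^{T}$) followed by one over the columns (left-multiplying by $D$) gives the full 2D result.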

Page 29: 07 Implementing a Parallel JPEG Encoder on Compute

The fast DCT works on the rows first and then on the columns.

This slide focuses on the rows only, for simplicity.

Each thread in a wavefront loads a color component from its 8x8 block and writes it to LDS memory in row-major order (highlighted in blue in the slide).

Each thread now processes a whole row, so we only have the first 8 threads enabled. Using some LDS operations I re-load the LDS rows into 8 different VGPRs (highlighted in red in the slide).

These VGPRs are then fed to the fast DCT calculator and the intermediate results are finally written back to LDS.

At this point I need to process the columns in a very similar way, reloading my 8 VGPRs from LDS and feeding them to the fast DCT code, of course with a different access pattern this time.

This new DCT implementation is 3.6 times faster than the original one!!

However, we're using only 8 lanes out of 64, which means that our VALU is used at 12.5% of its capacity.

Page 30: 07 Implementing a Parallel JPEG Encoder on Compute

So, how can we improve the VALU utilization?

Well, we could process a whole MCU in a single wavefront.

In case you're lost in this acronym hell, this means processing six 8x8 blocks in parallel.

We load the content of each luminance and chroma block into LDS. With this modified approach, we need around 1.5 kilobytes of LDS (6 blocks × 64 values × 4 bytes = 1,536 bytes).

Then we read back from LDS, swizzling the block rows into 8 VGPRs.

The swizzling pattern is chosen so that 8 rows can be processed in parallel by the fast DCT block, which I'm omitting for simplicity.

In this way the VALU utilization jumps from 12.5% to 75% (6 blocks × 8 row-threads = 48 active lanes out of 64).

Page 31: 07 Implementing a Parallel JPEG Encoder on Compute

With this approach we're increasing the memory and LDS operations per wavefront, but at the same time we're dispatching 6 times fewer wavefronts and we have a better VALU utilization.

The net result is a 2x performance gain on the DCT pass!

Page 32: 07 Implementing a Parallel JPEG Encoder on Compute

Quantization is the lossy part of the JPEG encoding.

We divide each DCT coefficient by a well-known value, de facto throwing away some bits of precision.

At the end of this process, our 8x8 blocks will hopefully look like the image on the right: some non-zero coefficients in the top-left corner of the block and a long series of zeros towards the high frequencies.
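
In shader terms the whole step is one divide-and-round per lane (a sketch; the quantization table layout and names are assumptions):

// quantTable: the 64 divisors for this channel, from a constant buffer.
// dctCoef:    this lane's DCT coefficient. Small values round to zero.
int quantized = (int)round(dctCoef / quantTable[groupThreadID]);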

Page 33: 07 Implementing a Parallel JPEG Encoder on Compute

Here's where the zig-zag scan makes its appearance.

The idea is to reorder the 8x8 DCT block so that the low frequency components come first. Ideally we want to arrange our data so that we have all the non-zero components first, followed by a long run of zeros.

The zig-zag scan is trivially implemented with two LDS accesses and an offset matrix stored in a constant buffer.
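
Something along these lines (a sketch; zigZagOffset is the constant-buffer table mentioned above, the other names are mine, reusing quantized from the previous sketch):

thread_group_memory int reordered[64]; // LDS scratch for one block

// Scatter through LDS using the zig-zag offset table: one LDS write and
// one LDS read per lane (one wavefront per block, so no barrier needed).
reordered[zigZagOffset[groupThreadID]] = quantized;
int zigZagged = reordered[groupThreadID];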

Page 34: 07 Implementing a Parallel JPEG Encoder on Compute

Let’s move to the last and most interesting step, which is responsible for:

Delta coding for the DC components

Zero-run length encoding for the AC components

Huffman encoding

Page 35: 07 Implementing a Parallel JPEG Encoder on Compute

At this point of the encoding our inputs are quantized 8x8 DCT blocks.

Each non-zero DCT value stores the number of preceding zeros in the zig-zag scan.

The final Huffman bit stream encodes pairs made of a zero run length and a DCT component.

Each input block contains a significant number of zero AC components, depending on the amount of information that survived the quantization process.

For this reason, removing the zero components with run-length encoding produces a significant saving in storage.

Page 36: 07 Implementing a Parallel JPEG Encoder on Compute

In this slide I'm explaining how to calculate the zero run length in parallel for all the threads of a wavefront.

input = dctBuffer[dispatchThreadID]

Let's start by reading the DCT buffer into a VGPR called input.

dctMask = ballot(input != 0)

Then we use the ballot intrinsic to generate a 64-bit mask in which each bit is set if the corresponding DCT coefficient is different from zero. The 64-bit mask is a scalar value, and it's important to note that here we're using ballot as a form of cross-thread communication mechanism: through the bits of the scalar bitmask, each thread can know the state of all the other threads in the wavefront. The DCT mask is a very important value which will be used later in the Huffman shader to determine which threads contribute to the final bit stream.

threadMask = (1ul << ((ulong)groupThreadID)) - 1

Now I calculate a per-thread bit mask which can be seen as a selection of all the previous threads. Given a thread X, bit Y of the mask will be set if Y < X. E.g.: for thread 3, threadMask will be ...00111b = 7.

rleMaskPerThread = dctMask & threadMask

This value gives us the grouping of the DCT coefficients != 0. For each thread, the highest bit set indicates the rightmost preceding active thread (active means DCT != 0).

firstBitSet = firstBitSetHi(rleMaskPerThread)

Then I search for the first bit set from the high end, which basically indicates the ID of the previous thread with a DCT value != 0. Note that firstBitSetHi returns -1 when no bits are set.

rle = max((int)groupThreadID - firstBitSet - 1, 0)

Finally, I calculate the run length encoding as the difference between the current thread ID and the previous thread with a DCT value != 0.

The yellow threads are the contributing ones, or in other words, the ones defined by the DCT mask.
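
Putting the five lines together (a sketch using the same names; the declarations are mine):

int   input       = dctBuffer[dispatchThreadID];          // one coefficient per lane
ulong dctMask     = ballot(input != 0);                   // scalar: the non-zero lanes
ulong threadMask  = (1ul << ((ulong)groupThreadID)) - 1;  // all preceding lanes
int   firstBitSet = firstBitSetHi(dctMask & threadMask);  // previous non-zero lane, -1 if none
int   rle         = max((int)groupThreadID - firstBitSet - 1, 0); // zeros since that lane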

Page 37: 07 Implementing a Parallel JPEG Encoder on Compute

The JPEG standard allows zero runs of at most 15 zeros. Longer runs must be broken down into a kind of "virtual pair" made of a run of 15 zeros and a zero DCT coefficient.

In this slide, a red rectangle indicates a lane of a VGPR, while a blue rectangle indicates a bit of a scalar register.

As you can see at the bottom of the slide, we're using the ballot trick to calculate a bitmask indicating the threads whose run length needs to be broken down. To break these long zero runs, we use a scalar loop.

Page 38: 07 Implementing a Parallel JPEG Encoder on Compute

For each iteration I use a special scalar intrinsic to calculate the index of the first bit set to 1.

The __s_ff1_i32_b64 intrinsic maps directly to a SALU instruction that searches for the first bit set to 1 in a 64-bit value.

Then I use the ReadLane intrinsic to read the zero run length from the lane previously selected (53).

This value is then stored into a scalar register, the blue box here.

Then, I enable the bit of the DCT mask preceding the thread we're breaking down, and using the WriteLane intrinsic I write a "15" into the corresponding lane of the VGPR containing the run length.

Finally I subtract 16 from the scalar counter (16 because it's a run of 15 zeros plus a zero DCT coefficient).

At this point we have created a new virtual pair made of a 15-zero run and a zero DCT coefficient, on thread 52.

Page 39: 07 Implementing a Parallel JPEG Encoder on Compute

We keep looping, creating new {zrl=15, input=0} pairs, while the scalar counter containing the zero run length is greater than 15.

When we've finished looping on zrl > 15, we use WriteLane to write back the remaining zero run length to the original lane, and we disable the corresponding bit in the ballot bitmask using the __s_bitset0_b64 intrinsic.

The idea is to keep looping until no bits are left set in the ballot bitmask.

The scalar loop could probably be rewritten in vector code, but there's something extremely important to take into account: the scalar code of one wavefront can run in parallel with the vector code of another wavefront on the same SIMD! This concept of balancing scalar and vector code leads me to the next slide.

Page 40: 07 Implementing a Parallel JPEG Encoder on Compute

A single SIMD can issue one VALU and one scalar ALU instruction per cycle. This means that the theoretical number of instructions per cycle for ALU-heavy code is 2!

Of course, the scalar ALU is low power compared to the VALU, but moving work to scalar can be more effective than you might think. Consider that every time you extract some scalar work from your shaders you're basically freeing up vector ALU cycles, and these cycles might be used for something else!

The compiler can normally detect the "uniformity" of expressions, which means that it can understand when a variable can be aliased by a scalar register and when it requires a VGPR.

However, the compiler cannot rework your algorithms as I did in the previous slide with the scalar loop.

Page 41: 07 Implementing a Parallel JPEG Encoder on Compute

When you're dealing with uniformity, it's pretty easy to mess things up.

Keep in mind that every time one of your scalar values gets promoted to vector, all its dependent values have to be promoted to vector too.

Normally a value is promoted to vector because it's the result of an operation with another vector value (which means we're doing something that cannot be scalarized). However, sometimes the compiler just fails to recognize that a value is scalar for some reason. I'll provide an example of this in the next slide.

So, if you're planning to offload part of your computations to the scalar ALU, there are a couple of tricks that can help you.

First of all, you can use ReadFirstLane to force the uniformity of a value. Technically this results in reading the value of the first active lane of a vector value, so if you're sure that all the lanes contain the same value it might look like a bit of an overhead. In reality it sometimes helps the compiler understand what you're trying to do with scalarization.

Finally, you can check the uniformity of your expressions with a special intrinsic of the new wave compiler. In this slide you have a code snippet for an assert macro that determines whether a value is scalar or not.
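
The snippet itself is not in the transcript. It might look something along these lines, where __is_uniform is a placeholder I made up for the wave compiler's real uniformity-check intrinsic (check the compiler manual for the actual name):

// __is_uniform() is hypothetical: it stands in for the wave compiler
// intrinsic that reports whether an expression maps to a scalar register.
#define ASSERT_SCALAR(x) \
    if (!__is_uniform(x)) { debugBuffer[0] = __LINE__; } // flag the failure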

Page 42: 07 Implementing a Parallel JPEG Encoder on Compute

So as I said, sometimes the compiler promotes scalar values to vector for some odd reason. For instance, there's a bug in the wave compiler in SDK 2.5 that causes the vector promotion of the integer modulus of a scalar value.

In the Huffman encoding I'm using an integer modulus almost at the beginning of the shader to determine which kind of block I'm encoding (luma 0-3, chroma R or chroma B). There is a significant amount of ALU code dependent on this value that should map onto uniform variables, considering that the block type is constant across a wavefront.

Because of this bug, instead, the PSSL version of the Huffman encoder ended up using the VALU much more than my original assembly version. With the ReadFirstLane trick, I was able to restore the scalar/vector ALU balance, gaining 8% of GPU time on this pass.
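
The fix is essentially a one-liner (a sketch; the variable names and the modulus-by-6 are my guesses from the six block types per MCU):

// blockIndex % 6 picks the block type (luma 0-3, chroma B, chroma R).
// It is uniform across the wavefront, but the SDK 2.5 compiler promotes
// the modulus to a VGPR; ReadFirstLane forces it back to a scalar.
uint blockType = ReadFirstLane(blockIndex % 6);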

To help developers understand these kinds of issues, the orbis-shaderperf tool will report the execution unit utilization in the next major release.

On the left, you have the execution unit utilization of the suboptimal version of the Huffman encoding shader. As you can see there's a modest imbalance in the ALU utilization, with almost 55% of the cycles taken by the VALU and only 30% taken by the scalar ALU.

On the right, instead, you have the execution unit utilization for the Huffman shader in which I use ReadFirstLane on my integer scalar modulus to force scalarization. As you can see, we freed up quite a lot of VALU cycles here!

The integer modulus scalarization bug in the wave compiler is going to be fixed in the next release of the SDK.

Page 43: 07 Implementing a Parallel JPEG Encoder on Compute

Let's now talk about the Huffman encoding itself and how to make it parallel.

This is not an embarrassingly parallel problem, because we're basically outputting a variable-length bit stream, and the position of each element in the stream depends on all the previous elements.

Some implementations use, or rather abuse, restart markers, which are special values that can be embedded in the JPEG bit stream to split the image data into chunks.

The idea is that the encoding of a single chunk is a serial process, but we can process many chunks in parallel.

However, because a marker is encoded as a special 2-byte value, we can easily get some non-trivial overhead.

Let's think about this: a PSVita resolution frame (960x544) encoded with a 4:2:0 subsampling scheme contains >12K DCT blocks, and our GPU can have more than 46K threads in flight at the same time. If each DCT block were calculated by a single thread and required a restart marker, we would have quite a low GPU occupancy and a significant amount of data wasted on the restart markers (12K x 2 B = 24 KB per frame).

Page 44: 07 Implementing a Parallel JPEG Encoder on Compute

This is the Huffman encoding part of the JPEG algorithm.

It's very standardized and I don't really want to go into the details, apart from saying that…

Page 45: 07 Implementing a Parallel JPEG Encoder on Compute

…there is a part of the algorithm which is quite simple VALU code, easily parallelizable, while the bit packer is more complex, because each thread outputs a variable amount of data with serial dependencies.

Page 46: 07 Implementing a Parallel JPEG Encoder on Compute

As we said, each {zero run length, DCT} pair must know its bit position within the stream, and this depends on the lengths of all the previous pairs.

In the JPEG literature there are a few papers that recommend using a parallel prefix sum of the lengths of all the pairs in the image to calculate the bit position of each pair.

However, traditional parallel prefix sums require a multi-pass approach and the use of staging buffers, resulting in extra GPU synchronization cost and extra bandwidth requirements.

So I started thinking: can we do better than this?

Page 47: 07 Implementing a Parallel JPEG Encoder on Compute

A better idea consists in breaking the parallel prefix sum down to a per-block level, which means calculating the bit length of each block in parallel.

To do this, I am using a super cool feature of our GPU called ordered count to broadcast the base bit offset of each block across all the different wavefronts.

This is implemented with a special intrinsic which uses some dedicated hardware.

The ordered count hardware simply updates a GDS counter atomically, but the cool thing is that the GPU serializes the GDS updates in wavefront creation order. In other words, the hardware can serialize the execution of a portion of our wavefronts for us.

Page 48: 07 Implementing a Parallel JPEG Encoder on Compute

In the next slides I'm going to explain how to implement a local parallel prefix sum on an 8x8 pixel group and how to use the ordered count hardware, but before that I'd like to discuss how to disable execution threads in a wavefront.

That's a very common operation and it's necessary in all the steps I'm going to describe. For instance, the dctMask value we've seen before defines which threads contribute to the final bit stream. Also, the parallel prefix sum algorithm described in the next slides requires conditionally disabling execution threads.

So, as you should know by now, on our GPU we have a special 64-bit register called EXEC which defines which threads are active and which are not.

Vector control flow is implemented by modifying this mask, so that entering an IF ANDs some bits into EXEC and exiting an IF restores the bits of EXEC.

The easy way of disabling some execution threads is to use a vector condition based on the thread ID, as in the code snippet. However, this tends to generate some redundant VALU code.
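
The snippet is missing from the transcript; the easy way would look something like this:

// A vector compare on the thread ID builds the EXEC mask, but the
// compiler has to emit redundant VALU code (a v_cmp on a VGPR) for it.
if (groupThreadID < 8)
{
    // only lanes 0-7 are active here
}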

Page 49: 07 Implementing a Parallel JPEG Encoder on Compute

A more efficient solution to this problem consists in using the so-called "predicate trick" described in the wave compiler's manual.

The predicate function in the code snippet uses the v_cndmask intrinsic, which expands each bit of the scalar bitmask into the corresponding lane of the output VGPR.

The wave compiler is clever enough to eliminate that v_cndmask and just set the EXEC mask, saving a few vector ALU cycles.
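
A sketch of the predicate function (the exact intrinsic spelling, e.g. __v_cndmask_b32, is an assumption on my part; see the wave compiler's manual for the real one):

// Expands bit i of a scalar 64-bit mask into lane i of a VGPR; the
// compiler recognizes the pattern and just sets EXEC directly.
bool predicate(ulong mask)
{
    return __v_cndmask_b32(0, 1, mask) != 0;
}

Branching on predicate(dctMask), for example, leaves only the lanes whose bit is set in dctMask active.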

Page 50: 07 Implementing a Parallel JPEG Encoder on Compute

So, let's go back to the parallel prefix sum implementation.

This is a common GPU algorithm; however, I'm doing it locally within a single wavefront.

So, first of all we initialize a vector value called sum which will contain the result of the prefix sum.

Then we use QuadSwizzle to shuffle the even and odd lanes of the sum.

QuadSwizzle is a very useful intrinsic that can shuffle groups of 4 lanes in any possible pattern.

Finally, we use the predicate trick to add the original value to the swizzled value only on the odd lanes, obtaining the intermediate result in violet.

Page 51: 07 Implementing a Parallel JPEG Encoder on Compute

Now we use QuadSwizzle again to broadcast the second lane of each group of 4 lanes.

And again we use the predicate trick to add this new swizzled value to the sum value, only on the third and fourth lanes of each group.

As you can see, we're starting to build our prefix sum in parallel.

Page 52: 07 Implementing a Parallel JPEG Encoder on Compute

Now we use the LaneSwizzle intrinsic, which can shuffle lanes within groups of 32 threads using a subset of pre-determined patterns.

The three parameters define an AND, OR and XOR mask that is applied to the ID of each lane, and it's not exactly super user friendly. I'd suggest you have a look at the shader compiler's manual or at the ISA reference for the ds_swizzle_b32 instruction for further info.

In this case, we're using LaneSwizzle to broadcast the fourth lane of sum to groups of 8 lanes.

With the predicate trick we add the swizzled values only on the highest 4 lanes of each group of 8 lanes.

At this point we have built a parallel prefix sum for all the groups of 8 threads in our wavefront.

Page 53: 07 Implementing a Parallel JPEG Encoder on Compute

And we can extend the same mechanism to easily calculate the prefix sum of groups of 32 lanes.

Then, because the LaneSwizzle intrinsic works only on groups of 32 threads, we have to manually broadcast lane 31 during the last step to get the final result.

At the end of this process we have obtained the prefix sum of all the 64 threads in our wavefront and, unlike traditional approaches, we did it in a single pass and without touching the LDS banks.
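
Putting the last four slides together, the whole wavefront-wide scan could look like this (a sketch under my assumptions: QuadSwizzle takes a per-quad lane pattern, LaneSwizzle takes the ds_swizzle_b32 AND/OR/XOR masks, and predicate() is the trick from before; the real signatures are in the compiler manual):

// Inclusive prefix sum of 'v' across the 64 lanes of a wavefront,
// single pass, no LDS. The predicate masks select the accumulating lanes.
uint WavePrefixSum(uint v)
{
    uint t;

    t = QuadSwizzle(v, 1, 0, 3, 2);                    // swap even/odd lanes
    if (predicate(0xAAAAAAAAAAAAAAAAul)) v += t;       // odd lanes

    t = QuadSwizzle(v, 1, 1, 1, 1);                    // broadcast lane 1 of each quad
    if (predicate(0xCCCCCCCCCCCCCCCCul)) v += t;       // lanes 2-3 of each quad

    t = LaneSwizzle(v, 0x18, 0x03, 0x00);              // broadcast lane 3 of each 8-group
    if (predicate(0xF0F0F0F0F0F0F0F0ul)) v += t;       // lanes 4-7 of each 8-group

    t = LaneSwizzle(v, 0x10, 0x07, 0x00);              // broadcast lane 7 of each 16-group
    if (predicate(0xFF00FF00FF00FF00ul)) v += t;       // lanes 8-15 of each 16-group

    t = LaneSwizzle(v, 0x00, 0x0F, 0x00);              // broadcast lane 15 of each 32-group
    if (predicate(0xFFFF0000FFFF0000ul)) v += t;       // lanes 16-31 of each 32-group

    // LaneSwizzle can't cross the 32-lane halves: broadcast lane 31 manually.
    t = ReadLane(v, 31);
    if (predicate(0xFFFFFFFF00000000ul)) v += t;       // lanes 32-63

    return v;
}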

Remember that the QuadSwizzle and LaneSwizzle intrinsics are incredibly useful.

Unfortunately, LaneSwizzle does not support all the possible swizzling patterns in a group of 32 lanes.

To work around this problem you can use the ReadLane intrinsic, even though it's purely scalar, which means that it can move only one lane at a time.

Page 54: 07 Implementing a Parallel JPEG Encoder on Compute

There's a trivial but still interesting trick relating to the EXEC mask that I'd like to discuss: both the tricks we saw before can only be used to disable active threads. The reason behind this limitation is that re-enabling inactive threads would allow developers to break the structured nature of the control flow in the PSSL language. So, let's say that at some point in our shader we need to re-enable some inactive threads; how can we do it?

Well, let's say we are inside the first IF shown in the slide. We can extract the variables we want to manipulate with the inactive threads (e.g.: huffmanBitCount). It's extremely important to initialize the extracted variables before the IF to clean up the inactive lanes. Forget to do it and you'll have to deal with garbage data.

Naïvely enough, if we want to re-enable the inactive lanes within the IF block, we can just split the IF and move all the code which requires the inactive threads outside of it. In general we can use this trick at any nesting level of our control flow, as exiting all the conditionals restores the initial EXEC mask. Then we can access our variable from any thread, in this case to execute the parallel prefix sum which, as we discussed before, requires the cooperation of all the 64 threads of a wavefront.

Finally, we can re-evaluate the IF predicate to restore the conditions of the first IF.

Just one caveat: the initial EXEC mask might not include all threads in some cases, for instance if you're dispatching 32 threads per wavefront like in the code below.

[NUM_THREADS(32, 1, 1)]
void main(uint threadID : S_DISPATCH_THREAD_ID)
{
    // The initial EXEC mask is 0x00000000FFFFFFFF: 32 threads enabled
}
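
A sketch of the IF-splitting trick described above (hedged: huffmanBitCount comes from the slide, the predicate and the helper are my placeholders, and WavePrefixSum is the scan sketched earlier):

uint huffmanBitCount = 0;       // initialized BEFORE the IF, so the
                                // inactive lanes don't hold garbage
if (input != 0)                 // first half of the original IF
{
    huffmanBitCount = ComputePairBitLength(rle, input); // hypothetical helper
}

// Exiting the IF restores the initial EXEC mask: all lanes are active
// again and can cooperate in the wavefront-wide prefix sum.
uint bitPosition = WavePrefixSum(huffmanBitCount);

if (input != 0)                 // re-evaluate the predicate to restore
{                               // the conditions of the first IF
    // emit the {zero run length, DCT} pair at bitPosition
}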

Page 55: 07 Implementing a Parallel JPEG Encoder on Compute

Let's now have a look at how we can use the ordered count hardware to implement a fully parallel variable-length bit stream writer.

As I said before, the ordered count hardware serializes the atomic updates to a GDS counter. To do so, the hardware keeps a queue of all the wavefronts that are trying to modify the GDS counter, and an internal register which stores the index of the last wavefront that updated the counter.

Page 56: 07 Implementing a Parallel JPEG Encoder on Compute

dcb.dmaData

Let’s start by clearing the GDS counter with a DMA command.

dcb.dispatchWithOrderedAppend

Then we use a special variant of the dispatch command, which results in the creation of our wavefronts and in the initialization of the ordered count hardware.

Let’s assume no constraints on our waves: they might be living in totally different compute units.

Right now each of our waves is calculating the number of bits required to store the whole 8x8 block being encoded.


The first wavefront completing the calculation of the block size is wave 2.

The wavefront tries to execute the ordered count, but because the ordered count hardware is expecting to execute wave 0 first, wave 2 is stalled and added to the queue of waiting waves.


Wavefront 0 now completes its calculation and executes the ordered count.

The ordered count hardware updates its internal register with the ID of the last executing wavefront and performs an atomic add on the GDS counter. Of course, we’re adding the total bit count of the first 8x8 block here.

Note that the old value of the GDS counter is returned to the wavefront: it logically represents the position of the block within the bit stream.


Now both wavefront 1 and wavefront 3 complete their calculation, and they both try to execute the ordered count.

Of course, wavefront 3 is stalled because the hardware is expecting to update the GDS counter for wavefront 1.

Wavefront 1 instead can execute the ordered count and update the GDS counter.


Wavefront 2 is removed from the stall queue and the ordered count hardware updates the GDS counter again.


And finally, wavefront 3 can update the GDS counter.


At this point all the wavefronts know how many bits they need to output and their bit position within the Huffman bit stream.

Each thread can now write in parallel to the Huffman stream memory using atomic ORs.

Before moving on, let me tell you that the ordered count is surely a KEY feature for general compute programming. If you want detailed information, please refer to the ISA documentation of the DS_ORDERED_COUNT instruction.


So, to recap: once each thread in each wavefront knows how many bits it needs to output, we can calculate the total size of an 8x8 block using a parallel prefix sum, and the bit stream offset with an ordered count.

At this point each thread creates a 32-bit OR-able mask and writes to the bit stream buffer using an atomic OR. Actually, each thread might perform 2 atomic operations if its write straddles a 32-bit boundary.

Because we’re writing to linearly increasing addresses, all 8 atomic units are used evenly, which is very good.
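A minimal sketch of that writer is below. The buffer declaration and the InterlockedOr spelling follow generic HLSL conventions and might be named slightly differently in PSSL; bitOffset comes from the prefix sum plus the ordered count, and I'm assuming 1 <= bitCount <= 32 with the bits held in the low end of 'code'. Byte-order fix-ups for the final JPEG stream are omitted.

RW_RegularBuffer<uint> bitStream; // assumed buffer declaration

void emitBits(uint bitOffset, uint bitCount, uint code)
{
    uint wordIndex = bitOffset >> 5;          // target 32-bit word
    uint bitInWord = bitOffset & 31;          // bit position inside that word
    uint aligned   = code << (32 - bitCount); // MSB-align the code

    InterlockedOr(bitStream[wordIndex], aligned >> bitInWord);

    if (bitInWord + bitCount > 32)            // the write straddles a word boundary
    {
        InterlockedOr(bitStream[wordIndex + 1], aligned << (32 - bitInWord));
    }
}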


Just a few more words on the use of atomic ORs.

My original implementation was updating the bit stream buffer three times, which meant a minimum of 3 and a maximum of 6 atomic ORs per thread:

- Once for the Huffman prefix,
- Once for the normalized DCT value, and
- Once for the end of block, even though only on a single thread per wavefront.

The end of block, or EOB, is a well-known sequence of bits indicating the end of an 8x8 block, and the JPEG standard says that we need to emit it only if the last AC coefficient (so AC 63) is zero.


However, my initial approach was quite lame, and thanks to the help of a couple of experienced colleagues like Colin I could optimize it a lot.

First of all, the JPEG standard says that the Huffman prefix and the normalized DCT value each use up to 16 bits. This means that we can pack prefix and code in a single 32-bit value.
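In code, the packing is trivial; the names here are illustrative. The prefix goes in the upper bits so that it is emitted first once the packed value is MSB-aligned by the bit writer:

// prefixBits + valueBits <= 32, since each field uses at most 16 bits.
uint code     = (huffmanPrefix << valueBits) | normalizedValue;
uint bitCount = prefixBits + valueBits; // one atomic OR path instead of two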

For the End of Block bits we can just exploit lane 63: we know that if we need to emit the EOB, lane 63 is surely disabled. We can set bit 63 in the DCT mask, which will later be used as the new EXEC mask via the predicate trick we’ve seen before.

Then we use WriteLane to write the well-known EOB bits into the VGPR storing the Huffman code, on lane 63.

In case you got lost here: the Huffman code is the value that each active thread will write to the bit stream, and at this point thread 63 will write the EOB value for us.
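Sketched out, and assuming a WriteLane(value, scalar, laneID) signature symmetric to ReadLane, a 64-bit ulong mask, and hypothetical EOB_CODE/EOB_BIT_COUNT constants, the trick looks like this:

if (lastACIsZero) // AC 63 is zero, so lane 63 emitted nothing and is free for the EOB
{
    dctMask |= (ulong)1 << 63; // set bit 63 in the mask that becomes the new EXEC mask
    huffmanCode = WriteLane(huffmanCode, EOB_CODE, 63);   // lane 63 now carries the EOB bits
    bitCount    = WriteLane(bitCount, EOB_BIT_COUNT, 63); // and their length
}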


With the optimization described in the previous slide I achieved a great performance improvement: the maximum number of atomic operations per wavefront went from 6 to 2, and this trick alone saved 25% of the GPU time for the Huffman encoding step.


At the end of my optimization process I also tried to rework (again) the steps of the JPEG encoder to merge the first two steps together.


Merging the color conversion and the DCT steps together was a net win, mainly because these two steps were using the GPU in different ways: the color conversion was limited by the amount of texture operations, while the DCT step was mainly VALU bound.

Merging the two steps also allowed me to remove a staging buffer I used to store the Y Cb Cr texture in memory, de facto:

- Reducing the bandwidth usage,
- Reducing the pressure on the texture units, and
- Saving some memory.

Also, I could remove another full GPU fence, which means I could shave some more microseconds.

The net result of this operation is that the merged step is 40% faster than the original color conversion plus DCT steps.


Here you have a table summarizing the timings measured for all the different versions of my encoder.

My input was a test image from one of the samples of the RtSecondScreen library. The frame was encoded at PSVita resolution with a compression level of 50.

The green cells in the different rows represent the areas of improvement with respect to the previous row.

It’s extremely interesting to note that the optimized version of the encoder is more than 3 times faster than my original version, and I’m pretty sure the techniques we’ve seen together can produce some significant gains in the performance of your shaders too.


So, summarizing


The takeaway of this presentation is that it is possible to get high performance using the GPU even for not-so-parallel problems.

Of course, you have to rethink your algorithms and embrace a 64-wide SIMD programming model to see these results.

On the practical side, remember that we have two different ALU units to play with; balancing their usage is the key to unleashing the real power of our GPU.

Remember also that synchronization points should be reduced as much as possible, and that super short waves can kill your occupancy.

Also, if you use asynchronous compute, please remember to dispatch from at least 2 compute pipes.

And please: we have a brand new, super cool shader compiler in SDK 2.5. Try all the new intrinsics, read the manual, and I’m pretty sure you guys will be able to do some crazy stuff with it.
