
Fine-Grain Parallelization of Entropy Coding on GPGPUs

Ana Balevic

Abstract

Massively parallel GPUs are increasingly used for acceleration of data-parallel functions, such as image transforms and motion estimation. The last stage, i.e. entropy coding, is typically executed on the CPU due to inherent data dependencies in lossless compression algorithms. We propose a simple way of efficiently dealing with these dependencies, and present two novel parallel compression algorithms specifically designed for efficient execution on many-core architectures. By fine-grain parallelization of lossless compression algorithms we can significantly speed up the entropy coding and enable completion of all encoding phases on the GPGPU.


GPGPU Tesla Architecture

- Parallel array of streaming processors (SPs)
- Large number of lightweight threads
- Shared memory space for each block of threads (e.g. 256 threads)
- Fine-grain scheduling as a mechanism for memory latency hiding

Algorithm Parallelization

Parallelization levels:

I. Data partitioning
II. Parallelization on the block level
   a) Coarse-grain (one thread processes one data block)
   b) Fine-grain (one thread block processes one data block)

Fine-grain parallel algorithm design:

I. Assignment of codes to the input data
II. Computation of destination memory locations for the codes
III. Concurrent output of the codes to the compressed data array

Parallel Variable Length Encoding Algorithm (PAVLE)

PAVLE Algorithm Steps

I. Assignment of codewords to source data
- Static codeword table (e.g. Huffman codewords as in the JPEG standard)
- Codewords can also be computed on the fly (e.g. Golomb codes)

II. Computation of output positions
- Bit-accurate computation of the output position: the position is determined by the memory address and the starting bit position
- Efficient parallel computation using the prefix sum primitive

III. Parallel bit-level output
- Parallelized writing of variable-length codewords
- Safe handling of race conditions via atomic operations (a sketch follows below)
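As a concrete illustration of step III, here is a minimal CUDA sketch of the bit-level output stage. It assumes steps I and II have already run: d_codes/d_lens hold each symbol's codeword and bit length, and d_bitpos holds the exclusive prefix sum of the lengths. All names are illustrative, this is not the released PAVLE implementation, and MSB-first packing into a zero-initialized output array is an assumption.

```cuda
// Illustrative sketch of PAVLE's bit-level output stage (step III).
// Assumes d_codes[i]/d_lens[i] hold the codeword and its bit length for
// symbol i (step I), and d_bitpos[i] is the exclusive prefix sum of the
// lengths (step II), i.e. the absolute output bit position of codeword i.
__global__ void pavle_output(const unsigned int *d_codes,
                             const unsigned int *d_lens,
                             const unsigned int *d_bitpos,
                             unsigned int *d_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    unsigned int code = d_codes[i];
    unsigned int len  = d_lens[i];   // 1..32 bits (assumption)
    unsigned int pos  = d_bitpos[i];

    unsigned int word = pos / 32;    // destination 32-bit word
    unsigned int bit  = pos % 32;    // starting bit inside that word

    // Left-align the codeword in a 64-bit window and split it across up
    // to two 32-bit words. Several threads may target the same word, so
    // the writes are merged with atomic OR (d_out is zero-initialized).
    unsigned long long aligned =
        (unsigned long long)code << (64 - len - bit);
    atomicOr(&d_out[word], (unsigned int)(aligned >> 32));
    if (bit + len > 32)
        atomicOr(&d_out[word + 1], (unsigned int)(aligned & 0xffffffffu));
}
```

Atomic OR resolves the race between threads whose codewords share a destination word, which is exactly the "safe handling of race conditions" named above.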

Parallel Run Length Encoding Algorithm (PARLE)

- Repetitions of a symbol are replaced by a (symbol, run length) pair
- Starts and ends of symbol runs can be identified by parallel flag generation (see the sketch below)
- Output positions for the RLE pairs are also efficiently computed using the prefix sum
- Run lengths are accumulated using reduction or computed by differencing
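A minimal sketch of the flag-generation step, with illustrative names: each thread compares its symbol to its left neighbor and flags the start of a run. An exclusive prefix sum over the flags then yields each run's slot in the compacted output, and differencing consecutive run-start positions gives the run lengths.

```cuda
// Illustrative flag generation for PARLE: d_flags[i] = 1 iff a new run
// starts at position i. A prefix sum over d_flags then gives each run's
// index in the compacted (symbol, run length) output.
__global__ void parle_flags(const unsigned char *d_in,
                            unsigned int *d_flags, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    d_flags[i] = (i == 0 || d_in[i] != d_in[i - 1]) ? 1u : 0u;
}
```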

Performance Results

Algorithm complexity: work complexity O(n), parallel run time O(log n). Dominant component: the prefix sum primitive.
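Since the scan dominates the run time, the following single-block Hillis-Steele inclusive scan shows the primitive in its simplest form. This naive variant does O(n log n) work; the work-efficient scans used in practice (e.g. Blelloch-style) meet the O(n) work bound stated above. The names and the single-block restriction are illustrative simplifications, not the scan kernel measured here.

```cuda
// Minimal single-block inclusive scan (Hillis-Steele), for illustration
// only. Assumes n <= blockDim.x; launch with blockDim.x * sizeof(unsigned
// int) bytes of dynamic shared memory.
__global__ void scan_inclusive(const unsigned int *d_in,
                               unsigned int *d_out, int n)
{
    extern __shared__ unsigned int tmp[];
    int t = threadIdx.x;
    tmp[t] = (t < n) ? d_in[t] : 0;
    __syncthreads();

    // At each step, every element adds in its neighbor 'offset' to the
    // left; barriers separate the read and the write to avoid races.
    for (int offset = 1; offset < blockDim.x; offset <<= 1) {
        unsigned int v = (t >= offset) ? tmp[t - offset] : 0;
        __syncthreads();
        tmp[t] += v;
        __syncthreads();
    }
    if (t < n) d_out[t] = tmp[t];
}
```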

Test environment: Intel 2.66 GHz QuadCore CPU, 2 GB RAM, nVidia GeForce GTX280 GPU with shared-memory (SM) atomic instructions on 32-bit words. A serial encoder and a scan kernel are given as a reference.

Huffman encoding using PAVLE achieves a significant speed-up on a GPU in comparison to a serial (CPU) encoder: an optimized version of PAVLE using shared memory atomics achieves a ~35x speed-up. PARLE achieves a 6x speed-up.

[Figure: PAVLE encoding time [ms] vs. data size [MB], log-log scale, for the vle_kernel_gm1 and vle_kernel_sm1 kernels, the scan primitive, and the CPU encoder.]

[Figure: RLE encoding time [ms] vs. data size [MB], log-log scale, for cpu rle, gpu parle, and gpu scan.]

Conclusion

The fine-grain parallel compression algorithms for fixed-length and variable-length encoding achieved up to 6x (PARLE) and 35x (PAVLE) speed-ups on the nVidia GeForce GTX280 GPGPU supporting CUDA 1.3. These two algorithms can easily be combined with readily available data-parallel transforms and motion-estimation algorithms. In this way, we provide attractive parallel algorithm building blocks for compression, which enable completion of all coding stages of GPU-accelerated codecs directly on a GPGPU. With this approach, not only do we significantly speed up the entropy coding process (while relieving the CPU), but we also reduce the data transfer time from the GPU to system memory. PAVLE and PARLE will be released as open source in the upcoming CUDA Compression library.

Ana Balevic, 2009. Contact: ana.balevic@gmail.com Source Code: http://svn2.xp-dev.com/svn/ana-b-pavle/