
TECHNIQUES TO LEVERAGE

DATA-PARALLEL GPU ACCELERATION FOR

COMPUTER VISION ALGORITHMS

by

Allen Paul Nichols

B.S., University of Colorado Denver, 2007

A report submitted to the

University of Colorado Denver

in partial fulfillment

of the requirements for the degree of

Master of Science

Electrical Engineering

2011


This report for the Master of Science

degree by

Allen Paul Nichols

has been approved

by

Daniel A. Connors

Robert Grabbe

Yiming Deng

Date


Nichols, Allen Paul (M.S., Electrical Engineering)

TECHNIQUES TO LEVERAGE DATA-PARALLEL GPU ACCELERATION FOR COMPUTER VISION ALGORITHMS

Report directed by Professor Daniel A. Connors

ABSTRACT

Graphics Processing Units (GPUs) have proven to be a powerful and efficient computational platform. An increasing number of applications are demanding more efficient computing power at a lower cost. The modern GPU can natively perform thousands of parallel computations per clock cycle. Relative to the traditional power of a CPU, the GPU can far outperform the CPU in terms of computational power, or Floating Point Operations per Second (FLOPS). Traditionally, GPUs have been used exclusively for graphics processing. Recent developments have allowed GPUs to be used for more than just graphics processing and rendering. With a growing set of applications, these new GPUs are known as GPGPUs (General Purpose GPUs). NVIDIA® has developed the CUDA (Compute Unified Device Architecture) API (Application Programming Interface), which enables software developers to access the GPU through standard programming languages such as C. CUDA gives developers access to the GPU's virtual instruction set, onboard memory and parallel computational elements. Taking advantage of this parallel computational power can result in significant speedup for multiple applications. One such application area is computer vision algorithms. From the assembly line to home entertainment systems, the need for efficient real-time computer vision systems is growing quickly. This paper explores the potential of using the CUDA API and NVIDIA® GPUs to speed up common computer vision algorithms. Through real-life algorithm optimization and translation, several approaches to GPU optimization for existing code are proposed in this report.

This abstract accurately represents the content of the candidate's report. I recommend its publication.

Daniel A. Connors


TABLE OF CONTENTS

Tables
Figures

Chapter

1. Introduction

2. Background and Motivation
   2.1 Computer Vision
      2.1.1 San Diego Vision Benchmark Suite
      2.1.2 Image Segmentation
      2.1.3 Disparity Map
      2.1.4 Feature Tracking
      2.1.5 Support Vector Machines
   2.2 Data-level Parallelism
      2.2.1 Data-Level Parallelism Opportunities with SD-VBS
      2.2.2 The Diminishing Return of the Traditional CPU
   2.3 Graphics Processing Unit Architecture
      2.3.1 CUDA Overview
   2.4 GPGPU Motivation
      2.4.1 Arithmetic Intensity
      2.4.2 Demonstration of GPU Execution Efficiency: Sorting
      2.4.3 Demonstration of GPU Execution Efficiency: Array Reduction
      2.4.4 Non-traditional Approach for Computer Vision Applications

3. Non-traditional Exploitation of GPUs
   3.1 Data-size Based GPU versus CPU Execution
   3.2 Computation Reformulation
   3.3 Computation Speculation

4. Experimental Results
   4.1 Data-size Based GPU versus CPU Execution
   4.2 Computation Reformulation
   4.3 Computation Speculation

5. Conclusion

Bibliography


TABLES

Table
3.1 Pre-calculating look-up tables for the SVM application
3.2 Look-up table access statistics for the SVM application


FIGURES

Figure
2.1 A Pre-Processed image (left) and the same image processed by the Image Segmentation algorithm (right).
2.2 Stereo Image Inputs (left and right). Output image after processing by the Disparity algorithm (bottom).
2.3 Series of sequential image inputs (left) and the resulting Representative Motion Vectors (right).
2.4 A representative data set and the result of the Support Vector Machine algorithm.
2.5 Thread Processing Cluster for the GTX200 Series GPU.
2.6 Architecture diagram of the GTX200 series Graphics Processing Unit (GPU).
2.7 Example of CUDA processing flow.
2.8 The CPU code implementation of matrix multiplication.
2.9 Example code for invoking the memory transfers and kernel execution on the GPU for matrix multiplication.
2.10 Example code for memory transfers to and from the GPU for matrix multiplication.
2.11 Example CUDA kernel code written for a GPU to do a matrix multiplication.
2.12 Arithmetic Intensity for the SD-VBS and the SPEC 2006 benchmarks.
2.13 Identifying hotspots: percentage of dynamic execution of SD-VBS programs attributed to program code.
2.14 Operation counts for SD-VBS program hotspots.
2.15 Sorting performance (CPU vs. GPU) for various sizes of a 1-dimensional array of floating point numbers.
2.16 Sorting performance detail for GPU Execution.
2.17 Reduction performance (CPU vs. GPU) for various sizes of a 1-dimensional array of floating point numbers.
2.18 Reduction performance detail for GPU Execution.
3.1 Code for Sobel dX computation on CPU.
3.2 Diagram of Sobel dX computation on CPU.
3.3 Code for Sobel dX computation on GPU.
3.4 Diagram of Sobel dX computation on GPU.
3.5 Code for SVM computation conditionally requiring polynomial calculation.
3.6 Diagram of SVM potential polynomial calculation based on i and k.
3.7 Distributions of loop iterations that require polynomial calculations in SVM for cif input.
3.8 Timeline of unique polynomial calculations in SVM for input cif.
4.1 Image Segmentation performance change using GPU implementation.
4.2 fSortIndices performance change using GPU implementation.
4.3 Tracking kernels (BlurImage, Sobel dX, Sobel dY) performance change using GPU implementation.
4.4 Tracking kernels (BlurImage, Sobel dX, Sobel dY) GPU performance change.
4.5 Tracking kernel BlurImage comparison of CPU and GPU: minimum, maximum, average.
4.6 Tracking kernel Sobel dX comparison of CPU and GPU: minimum, maximum, average.
4.7 Tracking kernel Sobel dY comparison of CPU and GPU: minimum, maximum, average.
4.8 Tracking kernels (BlurImage, Sobel dX, Sobel dY) standard deviation as a percentage of average.
4.9 GPU performance for SVM application.
4.10 GPU performance for each table access versus generation.
4.11 GPU performance for SVM application.


1. Introduction

Computer vision is the science that enables computer systems to extract information from an image or a sequence of images. The development of computer vision is essential for the advancement of a multitude of areas including medicine, entertainment and security. Computer vision systems are useful for tasks such as industrial control, event detection, information organization and object modeling. Other domains of computer vision systems include motion analysis, image restoration and scene construction. With a wide variety of emerging applications, the demand for more advanced computer vision systems is quickly growing.

There are several limiting factors when developing accurate real-time computer vision systems. Most computer vision tasks require a great deal of mathematical computation. For many computer vision algorithms, the analysis of a single image can take anywhere from a few seconds to several hours. In short, computer vision algorithms require a large number of computations as well as an equally large number of memory accesses. Computer vision algorithms are applied to a broad range of environments, from home entertainment systems to the operation of unmanned aerial and ground vehicles. Each new generation of applications increases the need for more computational resources. Traditionally, software developers and even scientists have relied strictly on increases in processor clock frequency as the primary method of gaining performance for the next generation of applied algorithms.

Emerging technology constraints are slowing the rate of performance growth in computer systems. Specifically, designers are finding it difficult to address strict processor power consumption and cooling constraints. Design modifications that address power consumption generally limit processor performance and reduce peak operating frequency, breaking the trend of providing increased system performance with every new processor generation. As such, modern architectures have diverged from the clock speed race into the multicore era, with multiple processing cores on a single chip. While the multicore design strategy improves chip power density, significant potential remains for improving run-time power consumption.

Graphics Processing Units (GPUs) are commonly found on graphics cards and main computer boards. These specialized processors have great potential for solving a number of problems. Unlike traditional Central Processing Units (CPUs), the GPU contains as many as several hundred mathematical computation cores. These cores have recently evolved from performing simple graphics computations to serving as fully capable processing engines. With so many processing cores, GPUs can perform mathematical computations in a massively parallel manner.

Many applications and algorithms have the potential to take advantage of the parallel processing capabilities of GPU systems. In most cases, problems possessing data-level parallelism are best suited for GPU execution. Data parallelism focuses on distributing large amounts of data across different parallel computing cores. A problem is data-parallel if each core can perform the same task on different pieces of distributed data. There are distinct ranges of data-parallel problems. Small-scale image processing, including the parallel manipulation or analysis of pixels, can be handled with multiprocessor extensions such as Single Instruction, Multiple Data (SIMD). Larger data-parallel problems can be solved with large-scale distributed systems consisting of multiple independent computers that communicate through a computer network or network grid. Data-parallel problems between these extremes are ideal candidates for GPU optimization. In this categorization, modern computer vision applications fall into a unique domain: they are more complex than simple image processing tasks, yet not large enough to require the massive computing resources of a distributed system.

This thesis investigates the potential of adapting computer vision algorithms to execute on GPUs. As GPUs operate in a heterogeneous system in which both the CPU and GPU perform some fraction of the computational work, there are unique performance constraints to explore. This thesis considers two primary parameters in developing optimal GPU solutions: problem size and problem reformulation.

Problem Size: For every application, the problem size (the number of data items to be processed) is a direct factor to consider when computing results on a GPU. Most GPU systems are deployed as hardware accelerators for CPU systems, requiring memory transfers between the two computational components. As the GPU includes its own memory system, a necessary step in the heterogeneous system is to transfer data between the CPU and GPU memory systems. Applications with a smaller input size may not benefit from using the GPU even for the computationally intensive sections. An optimal use of GPUs would include either compile-time or run-time capabilities to detect computation sizes and automatically choose the best resource (CPU only or CPU/GPU combination) for the task. A compile-time technique would structure the code so that the executable decides at fixed computation thresholds to use a particular resource, while a run-time approach would allow the application to determine the optimal method at run-time based on computation size, available resources and any other computation parameters.
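As a minimal illustration of this threshold-based approach, consider the following C sketch. The helper names and the threshold value are assumptions made for illustration (the 6,000-element sorting crossover observed in Section 2.4.2 is used as a placeholder), not code from the SD-VBS suite:

/* Hypothetical run-time dispatch: choose the execution resource
 * based on problem size. process_on_cpu() and process_on_gpu()
 * stand in for any equivalent pair of CPU/GPU implementations. */
#define GPU_THRESHOLD 6000  /* assumed crossover point, tuned per machine */

void process_dispatch(float *data, int n)
{
    if (n < GPU_THRESHOLD)
        process_on_cpu(data, n);  /* small input: transfer overhead dominates */
    else
        process_on_gpu(data, n);  /* large input: GPU parallelism wins */
}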


Reformulation: For every application, there is a unique degree to which the computation tasks can be transformed for GPU resources. Often the reformulation requires significant code development and substantial experience with GPU programming techniques. Several forms of reformulation for GPU execution have been studied for sorting [1] and parallel reduction methods [1]. This thesis examines two techniques for mapping non-traditional computations to GPUs: work reformulation and computation speculation. These techniques are not normally exploited on CPU systems, as they appear to require more computational time. However, because the GPU can perform a large amount of computation very quickly, there are successful ways to exploit GPUs even for algorithms that are not data-level parallel.

This thesis is organized as follows: Chapter 2 discusses the motivation and background of computer vision applications. Chapter 3 examines several examples of problem solving on GPUs using reformulation techniques. The experimental results section, Chapter 4, shows performance data for the various optimization cases. Finally, Chapter 5 concludes this thesis.


2. Background and Motivation

2.1 Computer Vision

Computer vision systems are integrated within a wide variety of industrial and scientific applications. Such systems extract from an image or series of images the key information necessary to complete a single specific task or a series of tasks. There are increasing uses of computer vision in emerging software applications and product development. In the scientific realm, computer vision methods exist in medical imaging, intelligent robotics and topographical modeling. There are also many industrial applications such as autonomous vehicles, visual surveillance, industrial inspection and quality control. Even modern digital cameras now apply simple face detection algorithms to ensure the best outcome in photographs.

Many computer vision applications already have well-defined algorithms and associated methodologies. Most of these algorithms tend to be computationally intensive as well as memory intensive. The computationally intensive nature of computer-based vision algorithms has traditionally deterred the development of new applications because results could not be generated in real-time. In many cases, only off-line application of computer vision systems is expected. Nevertheless, as computer performance generally increases with each generation, there is great potential for real-time computer vision applications to become more common. However, as time progresses, the data input available to computer vision algorithms grows as image size and resolution increase. This increase in resolution allows for greater scientific discovery and higher industrial precision. However, the trade-off of more precise image quality and availability is an even greater demand for the speed and volume of computations that must be performed. In many ways, computer system designs are consistently behind the performance needs of the latest applications.

The application of multi-core and many-core processing architectures is an area of current research and development. A system with multiple processor cores would be ideal for a wide variety of computer vision applications. Each algorithm must be carefully studied to determine whether all or part of its computation can be performed in parallel. Some computer-based vision algorithms lend themselves to computational parallelism while others do not.

2.1.1 San Diego Vision Benchmark Suite

To gain a better understanding of the possible advantages to be gained with many-core systems, a set of common computer vision algorithms will be carefully inspected and tested. The specific subset of algorithms to be examined has been developed by the Department of Computer Science and Engineering at the University of California, San Diego. The benchmark suite they have developed is known as "The San Diego Vision Benchmark Suite" (SD-VBS) [2]. The suite contains applications from the following representative areas: Image Processing and Formation; Image Analysis; Image Understanding; and Motion, Tracking and Stereo Vision. The suite contains nine representative computer vision applications, and each application includes a set of image inputs that vary in size.


2.1.2 Image Segmentation

The Image Segmentation algorithm processes an image by dividing it into conceptual regions or segments. These regions include boundaries, borders and objects that appear in the image. The algorithm operates on the premise that a set of pixels shares a common set of characteristics [2]. This algorithm is commonly applied to fingerprint and face recognition. Other applications include medical imaging, machine vision and computational photography.

To ease understanding of this complex algorithm, it can be broken down into three separate sections. The first section of the algorithm deals with the construction of a similarity matrix. This matrix is computed by analyzing pixel pairs across the entire data input stream (image). The second section involves the computation of discrete regions based on the results of the first matrix computation. The third and final section normalizes the segmentation results from the regions previously computed. This application is very computationally intensive because of the fine granularity of its many complex operations [2]. Figure 2.1 shows an example of an image after it has been processed by the Image Segmentation algorithm.

Figure 2.1: A Pre-Processed image (left) and the same image processed by the Image Segmentation algorithm (right).

2.1.3 Disparity Map

This algorithm takes a pair of stereo images as input. The stereo images are taken from slightly different positions looking in the same direction. This is similar to a person taking a picture with the left eye looking through a traditional camera view-finder, then switching the camera to the right eye and taking an additional picture. There are a number of applications in which this algorithm could be utilized. One possible computer vision application would be to use two cameras as depth sensors over a conveyor belt on an assembly line. After running this algorithm against the input images, the computer control system would have depth information about where each product is on the conveyor belt.

The Disparity algorithm computes depth information from the two stereo images that are provided. Take for example a robot with cameras placed as eyes for virtual vision. This algorithm could be applied to the camera inputs, and the computer could calculate depth information based on the stereo input. Other industrial applications include systems such as intelligent cruise control, pedestrian tracking and collision avoidance.

The San Diego Vision Benchmark Suite implements this algorithm using the concept of stereopsis [2], which allows depth analysis to be performed on a pair of stereo images. Figure 2.2 shows an example of a set of images processed by the Disparity algorithm and the generated disparity map.

This algorithm computes a dense disparity map between two images while preserving any discontinuities resulting from image boundaries. The concept of dense disparity mapping operates on the premise that every pixel of a given image is important, not just sections or features of the image. Because every pixel is analyzed, this algorithm is very computationally intensive. The general sections of the Disparity algorithm are filtering, correlation, calculation of the sum of squared differences and sorting.

Figure 2.2: Stereo Image Inputs (left and right). Output image after processing by the Disparity algorithm (bottom).

2.1.4 Feature Tracking

The operation of feature tracking is fundamental in computer vision systems. Feature tracking is the process of locating and characterizing moving objects given a set of subsequent frames. When tracking is enabled in vision systems, multiple objects in a given field of view can be monitored and analyzed with other computer vision algorithms. Applications for this algorithm include robotic vision systems and automotive traffic monitoring systems. Figure 2.3 shows an example of a series of sequential image inputs and the resulting motion vectors that this algorithm generates.

The SD-VBS implementation uses the Kanade-Lucas-Tomasi (KLT) tracking algorithm [2]. The overall algorithm can be broken down into three major sections. The first section operates at pixel-level granularity, while the second and third sections operate on coarse-grained data, or feature points. The first section is an image processing phase. This phase accomplishes tasks such as noise filtering, gradient image and image pyramid computations. This low-level image processing is comprised mostly of multiply-and-accumulate computations. The second and third sections contain the core functionality of the algorithm. These routines involve feature extraction and feature tracking, respectively. The core functionality is based on a large number of complex matrix operations and vector estimation.

Figure 2.3: Series of sequential image inputs (left) and the resulting Representative Motion Vectors (right).

2.1.5 Support Vector Machines

The support vector machine (SVM) algorithm is used for data classification and regression analysis. For each application, the algorithm separates the input data into two categories, divided by the calculated maximal geometric margin of the data sets. This classic machine learning algorithm is closely related to neural networks and is a form of generalized linear classifier. Figure 2.4 shows a representative data set with lines depicting the maximal margins.

The SVM algorithm is organized into two distinct stages. The first stage is the training phase, in which the SVM classifier is trained on the training data. When the training phase is complete, the classifier has a polynomial function that describes the learning data. The second stage involves applying the input data set to the classifier. As the algorithm continues, these same two stages are iterated multiple times to achieve higher accuracy of the polynomial function.
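As a rough sketch of the classification stage only (this is not the SD-VBS implementation; the names, the kernel degree and the data layout are illustrative assumptions), a trained classifier with a polynomial kernel could be evaluated as follows:

/* Illustrative SVM classification with a degree-2 polynomial kernel.
 * sv holds nsv support vectors of dimension dim, alpha the learned
 * weights (already folded with the class labels), and b the bias. */
int svm_classify(const float *x, const float *sv, const float *alpha,
                 float b, int nsv, int dim)
{
    float score = b;
    for (int i = 0; i < nsv; i++) {
        float dot = 0.0f;
        for (int d = 0; d < dim; d++)
            dot += sv[i * dim + d] * x[d];
        /* polynomial kernel: K(u, v) = (u . v + 1)^2 */
        float k = (dot + 1.0f) * (dot + 1.0f);
        score += alpha[i] * k;
    }
    return (score >= 0.0f) ? +1 : -1;  /* which side of the margin */
}

The repeated kernel evaluation in the inner loop is the polynomial calculation that Chapter 3 later targets with pre-calculated look-up tables and computation speculation.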

Figure 2.4: A representative data set and the result of the Support Vector Machine algorithm. H3 (green) doesn't separate the two classes. H1 (blue) does, with a small margin, and H2 (red) with the maximum margin.

2.2 Data-level Parallelism

Data parallelism is a method of parallel computing across multiple processors. This type of parallelism focuses on distributing the data across different parallel computing nodes. When a multiprocessor system is given a data-parallel task, each processor performs the same task on a different piece of data. Flynn's taxonomy classifies this as Single Instruction, Multiple Data, or SIMD [3]. In some instances, a single execution thread controls all of the ongoing operations. Other situations dictate that different threads control the ongoing operations, but ultimately all threads execute the same code.

We call these algorithms data parallel because their parallelism comes from simultaneous operations on large sets of data [4]. To illustrate data-level parallelism, consider a system with ten processor cores whose task is to add together two matrices with ten elements each. The code can be written so that each processor does one addition and store. At run-time, each processor executes exactly the same instructions, but on different pieces of data: the matrix addition assigns each matrix position to a unique processor and completes many times faster than on a system with only one processor core. In this particular example, each matrix position is independent of the surrounding positions, making data-level parallelism possible.
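A minimal CUDA sketch of this example (one thread per matrix element; device memory setup is omitted, and the kernel name is illustrative) could look like:

// Each thread performs one addition and store: the same
// instructions applied to a different piece of data.
__global__ void MatrixAdd(const float *A, const float *B, float *C, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)              // guard threads beyond the matrix size
        C[idx] = A[idx] + B[idx];
}

// For the ten-element example above, one block of ten threads:
// MatrixAdd<<<1, 10>>>(dA, dB, dC, 10);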

2.2.1 Data-Level Parallelism Opportunities with SD-VBS

Large portions of the SD-VBS benchmark suite exhibit forms of data-level parallelism. Upon careful inspection of the internal workings of the algorithms, it can be seen that a large portion of the routines perform the same task on different pieces of data repeatedly. There are also several portions of the algorithms that depend on the previous iteration and cannot be parallelized at the data level.

In the Disparity algorithm, one section of the code calculates an output matrix based on two input matrices. The algorithm takes a given point from each matrix and performs a simple calculation. The result is then stored in an output matrix. Examples like this lend themselves very nicely to data-level parallelism. Other portions of this algorithm allow for the same optimization.

The Image Segmentation algorithm operates at pixel granularity, which results in a large number of repetitive operations. The execution time of this algorithm is almost completely consumed by sorting a single-dimensional array of numbers. A parallel sort can run significantly faster than a serial sorting method.

A closer look at the feature tracking algorithm shows several opportunities for data-level parallelization. The image processing phase operates on the entire image, which makes it parallelization friendly. The feature extraction and feature tracking portions operate at feature-level precision and are comprised of complex matrix operations. These complex operations involve matrix inversions and motion vector estimations and are computationally intensive. While parallelism is possible, it is more challenging for this algorithm [2].

The support vector machine is by nature irregular and random. Its sections are not necessarily candidates for data-level parallelism. The iterative nature of this algorithm is comprised mostly of complex computations. This algorithm can instead be optimized with thread-level parallelism and instruction-level parallelism [2].

2.2.2 The Diminishing Return of the Traditional CPU

Microprocessors featuring a single processing core have evolved over several decades. These single-core processors were the driving force behind the development of the modern personal computing platform. For example, the Intel® Pentium® processor, released in 1993, could perform over one billion floating-point operations per second (FLOPS) [5]. Comparatively, the Intel® Core™ i7 processor, released in 2008, is rated for 50 billion FLOPS [6].

The continued advancement of software applications has steadily grown the demand for faster and more powerful processing ability. As this demand has increased, software developers have mostly relied on hardware advancement to increase the speed and performance of the new applications they develop [7]. That approach began to falter when energy consumption and heat dissipation became limiting factors for increasing processor clock speed. In order to increase performance while keeping energy consumption down, processor designers increased the number of processing cores on a single chip die [8]. Modern personal computing platforms have as many as six processing cores on a single chip die.

Even with six cores on a single chip, today's CPU is not optimized for large-scale data-level parallelism. The x86 instruction set does include the Streaming SIMD Extensions [9], which are inherently parallel but very limited. A closer look at CPU architecture reveals that only about twenty percent of the chip area on a CPU is dedicated to arithmetic [10]. The main design goal of a CPU is the complex control of the entire computer. Traditionally, CPUs are designed for low-latency operations and for keeping their pipelines busy (high cache hit rates and efficient branch prediction), not for high bandwidth.

For many years, consumer-grade software has been developed to run in a serial manner, typically utilizing only a single processing core. This serial methodology will not experience a significant performance increase with the addition of computing cores. The software development community must optimize current and future applications for multicore systems to achieve the best run-time performance. This relatively new incentive for parallel program development is referred to as the concurrency revolution [11]. High-performance parallel computing is not a new concept to the computing industry: the scientific community has been using large-scale clusters of single-core processors for several decades to solve large, complicated problems [7].

2.3 Graphics Processing Unit Architecture

The underlying architecture of the GPU is optimized for data-level parallelism. A closer look at an NVIDIA® GPU reveals that the chip is organized into an array of highly threaded streaming multiprocessors (SMs). Each streaming multiprocessor consists of several streaming processors (SPs). Streaming multiprocessors are grouped into thread processing clusters (TPCs). The number of SPs per SM depends on the generation of the chip: the GTX200 series has eight SPs per SM and three SMs per TPC [10]. Each SM cluster has a portion of shared control logic and instruction cache as well as texture filtering processors. Figure 2.5 shows a TPC block diagram for the GTX200 series GPU. The GTX200 chip has 240 streaming processors (SPs). Every SP has a multiply-add unit plus an additional multiply unit. The combined power of that many SPs in this GPU exceeds one teraflop [7]. Each SM contains a special function unit that can compute floating-point functions such as square root and transcendental functions. Each stream processor is threaded and can run thousands of threads per application. Graphics cards are commonly built to run 5,000-12,000 threads simultaneously on this GPU. The GTX200 can support 1,024 threads per SM and 30,720 threads for the entire chip [10]. In contrast, the Intel® Core™ i7 series can support two threads per core.

Figure 2.5: Thread Processing Cluster for the GTX200 Series GPU.

GPUs are optimized for the execution throughput of a massive number of threads. The hardware takes advantage of this by switching to different threads while other threads wait for long-latency memory accesses. This methodology enables very minimal control logic for each execution thread [7]. Each thread is very lightweight and requires very little creation overhead.

From a memory perspective, the GPU is architected quite differently from a CPU. Each GPU currently comes with up to four gigabytes of Graphics Double Data Rate (GDDR) DRAM, which is used as global memory. For the GTX200 series, every SM has twenty-four kilobytes of dedicated memory (shared between SPs). This dedicated memory boasts a one-cycle access time. The architecture of the GPU is designed to exploit arithmetic intensity and data-level parallelism. Figure 2.6 shows the architecture of the GTX200 series.

Figure 2.6: Architecture diagram of the GTX200 series Graphics Processing Unit (GPU).

Graphics processors have traditionally been designed for very specialized tasks. Most of their transistors perform calculations related to 3D computer graphics rendering. Typically GPUs perform memory-intensive work such as texture mapping and rendering polygons. The GPU also performs geometric calculations such as rotation and translation of vertices into different coordinate systems. The on-chip programmable shaders can manipulate vertices and textures.

Specialized video decoding processes are optimized on the modern GPU. These processes include:

• Motion compensation (mocomp)
• Inverse discrete cosine transform (iDCT)
• Inverse telecine 3:2 and 2:2 pull-down correction
• Inverse modified discrete cosine transform (iMDCT)
• In-loop deblocking filter
• Intra-frame prediction
• Inverse quantization (IQ)
• Variable-Length Decoding (VLD), more commonly known as slice-level acceleration
• Spatial-temporal deinterlacing and automatic interlace/progressive source detection
• Bitstream processing (CAVLC/CABAC)

The programming interfaces and instruction sets change quite frequently in the graphics processing realm. New versions of GPU APIs such as Microsoft's Direct3D and OpenGL (originally developed by SGI) are released every year. Not only are these APIs quickly changing to meet demand, they are also very specialized for graphics rendering and video processing.

GPUs by nature are designed as numeric computing engines, but their use has historically been limited by graphics-oriented APIs. These APIs are designed with unusual programming models, and developers must be experts in computer graphics to use them. To advance the use of GPGPU programming, BrookGPU [12] and Sh [13] were projects with the goal of abstracting the GPU as a streaming processor. The Brook model was based solely on a streaming computation abstraction in which data is represented as streams and computation as kernels [14].

NVIDIA® has developed a parallel computing architecture known as the Compute Unified Device Architecture (CUDA). This computing engine, which is the core of modern NVIDIA® GPUs, is accessible to software developers through extensions of industry-standard programming languages. The development of CUDA has enabled developers to access the virtual instruction set and memory of the GPU, and with them the native parallel computational elements of the NVIDIA® GPU. Figure 2.7 shows a high-level overview of the steps required for data transfer and code execution on the GPU.

Figure 2.7: Example of CUDA processing flow.

2.3.1 CUDA Overview

The CUDA model is optimized for maximum compatibility. NVIDIA® uses a soft instruction set, which enables the GPU designers to change the low-level hardware and instruction set without having to address backwards compatibility. Similarly, NVIDIA® has built scalability into the CUDA model: the CUDA code that is written is not tied to a specific release of the NVIDIA® GPU. This can be contrasted with traditional CPUs, where the hard instruction set is published. CPU software developers often optimize their programs for how many cores are available, which can change as new CPUs are released.


As seen in Figure 2.7, a CUDA program is comprised of multiple execution phases. Depending on the phase, the execution involves the CPU, the GPU or both. Portions of the CUDA code are executed on the GPU while other portions are executed on the CPU. The NVIDIA® compiler, known as nvcc, translates the code for the GPU and CPU accordingly. This model is very easy to work with because the device code is written in ANSI C extended with keywords for labeling data-parallel functions, called kernels, and their associated data structures [14].

The GPU kernels create a large number of threads on the device in order to take advantage of data parallelism. For example, a matrix multiplication problem is very simple to implement using the CUDA API. For a 1000 x 1000 matrix, the kernel would invoke 1,000,000 threads on the GPU to perform the computation. Each thread block is responsible for computing one square sub-matrix [15], and each thread within the block is responsible for computing one element of that sub-matrix. CUDA threads are much lighter weight than CPU threads: in most cases, CPU threads take thousands of clock cycles to schedule and generate, while CUDA threads take only a few [14].

The matrix multiplication problem is a great example to explore. This common mathematical computation is easily translated into CUDA code. The CPU code for performing this multiplication is shown in Figure 2.8. This example assumes that the matrices have already been allocated and generated. Furthermore, it assumes that these two-dimensional array elements are stored in a linearly addressed memory system. The inner loop iterates over the variable k, stepping through a column of N and a row of M, and calculates the corresponding dot product. The outer loops (i and j) jointly iterate over all of the rows and columns of the matrix MN. Finally, the outermost loop identifies every row in the resultant matrix P.

void MatrixMul(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i) {
        for (int j = 0; j < Width; ++j) {
            float sum = 0;
            for (int k = 0; k < Width; ++k) {
                float a = M[i * Width + k];
                float b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
    }
}

Figure 2.8: The CPU code implementation of matrix multiplication.

Having explored the traditional implementation, we can now see how this code translates into data-parallel CUDA code. The CUDA API allows the programmer to continue to use the same function call within the code. This is useful because many programs can be modified for GPU execution without a major upset to the program architecture.

As previously discussed, GPU execution happens in three basic phases. The first phase is to allocate memory on the device and copy the input matrices to the GPU. The second phase is invoking the kernel on the GPU device itself. Once the computation is complete, the resulting matrix is copied from the GPU device back to the CPU.

Phase one, the memory transfer and allocation step, is similar to its counterpart on a CPU. A memory allocation command is called (similar to malloc) to allocate memory on the GPU device. After the memory is allocated, the CPU transfers the data to the GPU for processing. The GPU stores this data in the onboard GPU card DRAM. Figure 2.10 shows the code to transfer data to and from device memory. To put this concept in perspective, the NVIDIA® T10 processor comes with up to 4 GB of DRAM.
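The allocation helper referenced in Figure 2.9 is not listed in this report; a plausible implementation (a sketch only, assuming the same Matrix structure used in Figures 2.9 through 2.11) would simply wrap cudaMalloc:

// Assumed implementation of the AllocateDeviceMatrix() helper used
// in Figure 2.9 (not code from this report): copy the host matrix
// dimensions and allocate matching storage in device DRAM.
Matrix AllocateDeviceMatrix(const Matrix Mhost)
{
    Matrix Mdevice = Mhost;  // copy width, height and pitch fields
    int size = Mhost.width * Mhost.height * sizeof(float);
    cudaMalloc((void**)&Mdevice.elements, size);
    return Mdevice;
}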

Next, the kernel is invoked on the GPU card itself. The GPU card is optimized for efficient thread scheduling and execution. With the data available in DRAM, execution can begin. The GPU card is also optimized for memory access and caching: if one thread is waiting for DRAM, another thread executes until the previous thread can continue. With single-cycle cache available to the core blocks, the threads execute with great speed and efficiency.

The final phase of the CUDA cycle is retrieving the result from the GPU card (see Figure 2.10). This step is straightforward, as it is a simple memory transfer from the GPU back to the CPU.

The code shown in Figure 2.9 is separated into functions similar to the three phases that have been discussed. This code is the overall function that calls the memory transfer functions as well as the kernel itself.

The kernel code itself is relatively short in comparison to the setup code. This kernel is responsible for creating all of the threads previously mentioned. The CUDA kernel code is shown in Figure 2.11.

Some problems do not port over to the GPU as easily as this example, but many can be analyzed and parallelized with moderate effort. Many factors must be taken into account when considering porting code to the GPU.


void MatrixMulOnDevice(const Matrix M, const Matrix N, Matrix P)
{
    // Interface host call to the device kernel code and invoke the kernel
    // Device matrices
    Matrix Mdevice;
    Matrix Ndevice;
    Matrix Pdevice;
    // Allocate and copy the M and N matrices to the device
    Mdevice = AllocateDeviceMatrix(M);
    CopyToDeviceMatrix(Mdevice, M);
    Ndevice = AllocateDeviceMatrix(N);
    CopyToDeviceMatrix(Ndevice, N);
    // Allocate the P matrix on the device
    Pdevice = AllocateDeviceMatrix(P);
    // Define thread and grid dimensions
    dim3 threads(MATRIX_SIZE, MATRIX_SIZE);
    dim3 grid(WP / threads.x, HP / threads.y);
    // Launch the device threads to calculate matrix P
    // Invoke the CUDA Kernel
    MatrixMulKernel<<<grid, threads>>>(Mdevice, Ndevice, Pdevice);
    // Copy the resulting P matrix from the device
    CopyFromDeviceMatrix(P, Pdevice);
    // Free the device matrices
    cudaFree(Mdevice.elements);
    Mdevice.elements = NULL;
    cudaFree(Ndevice.elements);
    Ndevice.elements = NULL;
    cudaFree(Pdevice.elements);
    Pdevice.elements = NULL;
}

Figure 2.9: Example code for invoking the memory transfers and kernel execution on the GPU for matrix multiplication.


// Copy a host (CPU) matrix to a device (GPU) matrix.
void CopyToDeviceMatrix(Matrix Mdevice, const Matrix Mhost)
{
    int size = Mhost.width * Mhost.height * sizeof(float);
    Mdevice.height = Mhost.height;
    Mdevice.width = Mhost.width;
    Mdevice.pitch = Mhost.pitch;
    cudaMemcpy(Mdevice.elements, Mhost.elements, size,
               cudaMemcpyHostToDevice);
}

// Copy a device (GPU) matrix to a host (CPU) matrix.
void CopyFromDeviceMatrix(Matrix Mhost, const Matrix Mdevice)
{
    int size = Mdevice.width * Mdevice.height * sizeof(float);
    cudaMemcpy(Mhost.elements, Mdevice.elements, size,
               cudaMemcpyDeviceToHost);
}

Figure 2.10: Example code for memory transfers to and from the GPU for matrix multiplication.

2.4 GPGPU Motivation

General-purpose computing on graphics processing units (GPGPU) is the technique of using a GPU, which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the CPU. There are several opportunities to apply the data-level parallel techniques of GPUs to non-traditional programming tasks. The GPU is optimized for data-parallel operations such as scan, sort, search, data queries, differential equations and linear algebra. This optimization suits a wide range of applications, from databases to scientific simulations such as fluid dynamics [14]. The data-parallel architecture of the GPU requires programming idioms that are by no means new to the parallel computing world.


// Matrix multiplication kernel thread specification
__global__ void MatrixMulKernel(Matrix M, Matrix N, Matrix P)
{
    // Multiply the two matrices
    // Get the column and row id
    int column = threadIdx.x;
    int row = threadIdx.y;
    // Calculate the P matrix result at this column and row
    float Presult = 0;
    // Calculate the P result value by multiplying the
    // M row with the N column and adding the values
    for (int i = 0; i < WM; i++)
    {
        float Mvalue = M.elements[row * WM + i];
        float Nvalue = N.elements[i * WN + column];
        Presult += Mvalue * Nvalue;
    }
    // Set the P result value
    P.elements[row * WP + column] = Presult;
}

Figure 2.11: Example CUDA kernel code written for a GPU to do a matrix multiplication.

2.4.1 Arithmetic Intensity

Arithmetic intensity (AI) is defined as the number of operations performed per word of memory transferred. With today's computing hardware, computation is relatively cheap, but memory bandwidth is at a premium. This is especially true for GPUs, with their inherent massive floating-point computational ability. Traditionally, the best way to predict the performance of a GPU on a specific problem is to look for high arithmetic intensity.
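Restated as a ratio (a restatement of the definition above, in the units used by the measurements that follow):

    AI = (arithmetic operations executed) / (memory operations executed)

A kernel that performs many calculations on each value it loads has high arithmetic intensity and is therefore a promising GPU candidate.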


To this end, two benchmark suites, SD-VBS and SPEC 2006 [16], were examined for their arithmetic intensity characteristics. The SPEC CPU benchmark suite is designed to provide performance measurements that can be used to compare compute-intensive workloads on different computer systems. SPEC has held to the principle that better benchmarks are based on applications, and these applications can come from any area of work. For example, the current SPEC CPU2006 suite includes applications from the following areas: AI game theory, bioinformatics, chemistry, compilers, interpreters, data compression, physics, speech recognition, video processing, and weather prediction.

To investigate arithmetic intensity, an experimental tool was developed using the Pin [17] instrumentation engine. The Pin instrumentation framework allows a program's execution to be traced to collect operation counts, memory addresses and control behaviors. Figure 2.12 shows the arithmetic intensity (ratio of arithmetic operations to memory operations) of the SD-VBS benchmarks alongside the SPEC 2006 benchmarks. The results are interesting in that the two sets of benchmarks have nearly the same average arithmetic intensity, around 2.2. However, the SD-VBS suite spans a wider range of intensity: its applications are more likely to be significantly higher or lower than the average. In the case of SPEC 2006, the applications are more consistently close to the average arithmetic intensity.

The SD-VBS benchmarks with high arithmetic intensity are Localization, Sift, and Stitch. These applications appear to be naturally well suited for GPU execution. For example, the Scale-Invariant Feature Transform (SIFT) is a computer vision algorithm to detect and describe local features in images. The algorithm was published by David Lowe in 1999 [18]. Applications using SIFT include object recognition, robotic mapping and navigation, image stitching, 3D modeling, gesture recognition, and video tracking. A considerable amount of research has already gone into developing GPU versions of the Sift algorithm [19]. For the remaining parts of this thesis, the benchmarks Multi ncut, SVM, and Tracking were modified to operate on GPUs using the CUDA framework. These applications have the lowest arithmetic intensity and thus represent interesting opportunities to show how non-traditional codes may be transformed for GPU systems.

Figure 2.12: Arithmetic Intensity for the SD-VBS and the SPEC 2006 benchmarks.

Another relevant factor in considering the use of GPUs for any application is the

amount of execution credited to small regions of code. GPUs are effectively hardware

accelerators in a heterogeneous environment. In short, large portions of application code

may still be best suited for CPU execution. However, if small important kernels of code

account for large portions of the dynamic execution, then the task of off-loading some of

the execution to the GPU is straightforward. To this end, this thesis evaluated the SD-

VBS benchmarks using the GNU gprof profiling tool. Several top functions were isolated

to account for the majority of program execution. Furthermore, using a Pin profiling

tool to collect the code execution counts for each assembly instruction, Figure 2.13 and

27

Page 37: TECHNIQUES TO LEVERAGE DATA-PARALLEL GPU ACCELERATION …€¦ · TECHNIQUES TO LEVERAGE DATA-PARALLEL GPU ACCELERATION FOR COMPUTER VISION ALGORITHMS by Allen Paul Nichols B.S.,

Figure 2.14 were created. Figure 2.13 shows the cummulative percentage of dynamic

program execution attributed to the program’s static percentage of operations. The

curves show that 5% of the static operations account for over 55% of the dynamic

execution of all benchmark programs. In more detail, 5% of the code covers 90% of

execution for applications MSER, Stitch, Localization and SVM. Based on the

same curves, 15% of the static code covers 80% of the execution for all programs.

The benchmark Multi ncut is the application with the lowest percentage of dynamic

execution covered by the program code. In contrast, MSER has one of the highest coverages: 3% of its static code covers nearly 99% of the dynamic execution.

Figure 2.13: Identifying hotspots: Percentage of dynamic execution of SD-VBS programs attributed to program code.

Figure 2.14 expands on the cumulative percentage of dynamic execution by pre-

senting the exact number of assembly operations that make up a large percent of the

program’s execution. Three separate data bars show the distribution for 5%, 10%, and 15% of the static code. For example, in the case of Disparity, 5%


of the static code amounts to approximately 550 assembly code operations. Given the

coverage trends of Figure 2.13, across most benchmarks, 99% of the dynamic execution

covered by 15% of the static program code can be implemented with as few as 2250

assembly operations.

Figure 2.14: Operation counts for SD-VBS program hotspots.

2.4.2 Demonstration of GPU Execution Efficiency: Sorting

The inherent calculations of computer vision algorithms take many different

forms. For example, numerical sorting represents how GPUs can attain significant

performance increases over traditional CPUs in certain cases. For relatively small sets

of data, the CPU is very efficient at sorting. To illustrate the concept of CPU vs. GPU

performance for sorting, Figure 2.15 shows the relative sorting performance of a sort

performed on a CPU vs. the same sort performed on the GPU. One can see that near

the middle of the graph, the GPU overtakes the CPU in terms of performance. For this

particular example, the GPU over takes the CPU when the array to be sorted reaches


6,000 floating point numbers. In some cases, the CPU may outperform the GPU, since

the GPU must perform memory transfers to and from its dedicated RAM. The setup

overhead cost of tasking the GPU can often cause the CPU to win for small data sets. In other words, the CPU can finish a brute-force sort of a small array in less time than it takes just to send the task to the GPU for processing. In

the case of 2 million elements, the GPU speedup for sorting is nearly 60x over CPU

performance.

Figure 2.15: Sorting performance (CPU vs. GPU) for various sizes of a 1-dimensional array of floating point numbers.
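To make the shape of this comparison concrete, the sketch below contrasts a host sort with a device sort built on the Thrust template library that ships with CUDA. It is illustrative only; the exact implementations measured in Figure 2.15 are not reproduced here. The device_vector construction and the final copy are the host/device transfers whose overhead is discussed above.

// Illustrative CPU-vs-GPU sort, not the exact code behind Figure 2.15.
#include <cstdlib>
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>

// qsort comparator for floats.
static int cmpFloat(const void *a, const void *b) {
    float fa = *(const float *)a, fb = *(const float *)b;
    return (fa > fb) - (fa < fb);
}

void sortCPU(std::vector<float> &data) {
    std::qsort(data.data(), data.size(), sizeof(float), cmpFloat);
}

void sortGPU(std::vector<float> &data) {
    thrust::device_vector<float> d(data.begin(), data.end()); // host-to-device copy
    thrust::sort(d.begin(), d.end());                         // sort on the device
    thrust::copy(d.begin(), d.end(), data.begin());           // device-to-host copy
}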

2.4.3 Demonstration of GPU Execution Efficiency: Array Reduction

GPUs also excel at reduction algorithms such as finding the sum (average, min-

imum, maximum or variance) of a collection of numbers. Traditionally, a CPU would

step in a serial fashion through every value to find the sum, for example. The GPU

30

Page 40: TECHNIQUES TO LEVERAGE DATA-PARALLEL GPU ACCELERATION …€¦ · TECHNIQUES TO LEVERAGE DATA-PARALLEL GPU ACCELERATION FOR COMPUTER VISION ALGORITHMS by Allen Paul Nichols B.S.,

Figure 2.16: Sorting performance detail for GPU Execution.

implementation reduces the array over a number of threads, effectively using one thread per pair of numbers. Each time a round of additions is performed, the number of

elements shrinks down until the final solution is reached. Figure 2.17 shows the relative

reduction performance of summing an array on a CPU vs. summing the same array

on the GPU. Figure 2.17 shows that if the array size is under 6,000 elements, the CPU outperforms the GPU. However, as the data sizes increase, the GPU decisively outperforms the CPU. In this experiment, 4

million elements were reduced to a single summation value 3.5x faster with the GPU.
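A minimal sketch of such a tree reduction kernel is shown below. It assumes the block size is a power of two; production implementations add multi-block accumulation and loop unrolling on top of this pattern.

/* Tree reduction: each block sums its tile of the input in shared memory.
   Assumes blockDim.x is a power of two. Launch with a dynamic shared
   memory size of blockDim.x * sizeof(float); the per-block partial sums
   are then reduced again on the device or summed on the CPU. */
__global__ void reduceSum(const float *in, float *blockSums, int n) {
    extern __shared__ float s[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + tid;

    s[tid] = (i < n) ? in[i] : 0.0f;   /* pad the last block with zeros */
    __syncthreads();

    /* Halve the active threads each step: one thread adds each pair. */
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        blockSums[blockIdx.x] = s[0];  /* one partial sum per block */
}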

In this particular example, a great deal of time is consumed due to overhead.

Figure 2.18 shows the memory transfer times side-by-side with the actual GPU execution

time. In the case where 4 million elements are reduced, merely 3.6% of the total time

is spent on the actual computation.


Figure 2.17: Reduction performance (CPU vs. GPU) for various sizes of a 1-dimensional array of floating point numbers.

Figure 2.18: Reduction performance detail for GPU Execution.


2.4.4 Non-traditional Approach for Computer Vision Applications

Clearly for traditional computational cases of sorting numbers and reducing an

array of numbers, the GPU offers a significant performance advantage over CPUs. Like-

wise, since small code regions of computer visions algorithms account for substantial

portions of execution, there is good potential for using a GPU to accelerate the execu-

tion of some code functions. However, several benchmarks from the SD-VBS suite do not

appear to have sufficient arithmetic intensity required to map to the data-parallel na-

ture of GPU systems. While compilers can perform transformations to change some of the fundamental execution characteristics of applications, more significant programmer-

based reformulations are necessary to synthesize new computational models. Often such

reformulation techniques are not considered by the original programmer since the re-

sulting transformation would not achieve performance gains on a traditional CPU. In

the next section, the thesis explores reformulation of a few SD-VBS benchmarks.


3. Non-traditional Exploitation of GPUs

3.1 Data-size Based GPU versus CPU Execution

In most applications there is a section of code that runs efficiently on the GPU.

However, this is not true in all cases, specifically for small amounts of data. For small tasks, the CPU performs very well and does not require any additional memory

transfers between processing elements. For example, when sorting numbers, for a small

amount of data the CPU will win every time due to the setup overhead for the GPU

execution. When the dataset is sufficiently large, the GPU will perform better than the

CPU.

For a given program or algorithm, the input data set can vary in size depending

on what the algorithm is processing. Computer vision algorithms are a great example

of how input sizes can vary at run-time. All computer vision algorithms need an image

input, which often comes from a live camera or from a saved image taken with a camera.

With the variety of cameras and resulting picture quality, there exists a broad spectrum

of image input sizes and resolutions.

The approach to get the best performance using the GPU to complement the

CPU would be to implement intelligent run-time program execution. Conceptually, the

program would decide at run-time which device is best suited to take on the upcoming

operation. Code written in this manner will serve to complement CPU execution.

This will ultimately support the exploitation of the heterogeneous parallel computing

capability of a CPU/GPU system.


In the case where an algorithm needs to sort an array, the program would decide

if the CPU or the GPU could do the sort faster. During run-time, the program would

look at the array and make a determination based on the number of elements contained

in the array. Matrix multiplication is another example where the GPU could outperform

the CPU if the matrix size is sufficiently large.
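Conceptually, a minimal dispatch wrapper looks like the following sketch. The threshold constant and function names are illustrative assumptions; the roughly 6,000-element crossover comes from the sorting experiment in Section 2.4.2 and would be measured per machine and per operation in practice.

/* Illustrative run-time dispatch: choose the device by data size. */
#define GPU_THRESHOLD 6000  /* measured crossover point, system-specific */

void sortCPU(float *data, int n);  /* e.g., a qsort-based host sort    */
void sortGPU(float *data, int n);  /* e.g., a Thrust-based device sort */

void sortAuto(float *data, int n) {
    if (n < GPU_THRESHOLD)
        sortCPU(data, n);  /* small arrays: transfer overhead dominates */
    else
        sortGPU(data, n);  /* large arrays: GPU wins despite the copies */
}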

3.2 Computation Reformulation

An example of computation reformulation appears in the application Tracking,

which has three main routines that account for a large percentage of execution: So-

bel dX, Sobel dY, and BlurImage. The Sobel computation is commonly used in image

processing, particularly within edge detection algorithms. Figure 3.1 shows the CPU

implementation of the Sobel computation algorithm.

Figure 3.2 illustrates the computation for one point of the output image of the X-

based Sobel operation. The Sobel calculation uses a window size of three to generate the output value in two distinct phases. The first phase takes three horizontal values (the point to the left, the center point, and the point to the right) and multiplies them respectively with the values of the first kernel array. The second phase takes three vertical points (the point in the row above, the center point, and the point in the row below) and multiplies them respectively by a second kernel array. In this example, for each calculated point, there are a total of six multiplies and four additions.

The GPU solution using the NVIDIA® CUDA interface describes the computa-

tion task of each individual thread. In this implementation each thread is responsible

for calculating nine multiplies and eight additions, nearly 50% more work than the CPU

version performed. The resulting code demonstrates the set of code reformulations that


/* Code for Sobel_dX computation on the CPU */
/* Horizontal pass: convolve each row neighborhood with kernel_2. */
for(i=startRow; i<endRow; i++) {
  for(j=startCol; j<endCol; j++) {
    temp = 0;
    for(k=-halfKernel; k<=halfKernel; k++) {
      temp += subsref(imageIn,i,j+k) * asubsref(kernel_2,k+halfKernel);
    }
    subsref(tempOut,i,j) = temp/kernelSum_2;
  }
}
/* Vertical pass: convolve each column neighborhood with kernel_1. */
for(i=startRow; i<endRow; i++) {
  for(j=startCol; j<endCol; j++) {
    temp = 0;
    for(k=-halfKernel; k<=halfKernel; k++) {
      temp += subsref(tempOut,(i+k),j) * asubsref(kernel_1,k+halfKernel);
    }
    subsref(imageOut,i,j) = temp/(float)kernelSum_1;
  }
}

Figure 3.1: Code for Sobel dX computation on CPU.

Figure 3.2: Diagram of Sobel dX computation on CPU.

are necessary to generate a data-parallel computation from an existing computation


meant for sequential execution. Figure 3.3 shows the CUDA code reformulation for

Sobel dX computation.

/* First kernel applied down each of the three columns of the window. */
temp0 = SM_REF(s_data,col-1,row-1, SOBEL_DX_RADIUS) * kernel2_0;
temp0 += SM_REF(s_data,col-1,row, SOBEL_DX_RADIUS) * kernel2_1;
temp0 += SM_REF(s_data,col-1,row+1, SOBEL_DX_RADIUS) * kernel2_2;

temp1 = SM_REF(s_data,col,row-1, SOBEL_DX_RADIUS) * kernel2_0;
temp1 += SM_REF(s_data,col,row, SOBEL_DX_RADIUS) * kernel2_1;
temp1 += SM_REF(s_data,col,row+1, SOBEL_DX_RADIUS) * kernel2_2;

temp2 = SM_REF(s_data,col+1,row-1, SOBEL_DX_RADIUS) * kernel2_0;
temp2 += SM_REF(s_data,col+1,row, SOBEL_DX_RADIUS) * kernel2_1;
temp2 += SM_REF(s_data,col+1,row+1, SOBEL_DX_RADIUS) * kernel2_2;

/* Second kernel combines the three column sums; normalize and store. */
temp = temp0 * kernel1_0 + temp1 * kernel1_1 + temp2 * kernel1_2;

d_base[global_base_index] = temp/(float)(kernelSum2*kernelSum1);

Figure 3.3: Code for Sobel dX computation on GPU.

The GPU implementation is far more computationally efficient. Each thread

calculates the appropriate data point based on the input data. Figure 3.4 illustrates the

work performed by each thread on a GPU machine. From the code it can be seen that each thread loads nine points of data from memory to perform the calculation.

An interesting note is that the GPU implementation actually uses nine data points to perform the calculation. In contrast, the CPU uses six data points (three horizontal and

three vertical). The GPU architecture has an advantage over the CPU because each

thread has quick one-cycle access to the ’window’ data.
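For context, a fragment like Figure 3.3 typically sits inside a kernel along the following lines. This is a simplified sketch rather than the actual Tracking implementation: the SM_REF definition, tile size, and boundary handling shown here are illustrative, and the loading of the one-pixel halo border is elided.

#define TILE 16
#define SOBEL_DX_RADIUS 1
/* Index into a shared tile that includes a one-pixel halo border. */
#define SM_REF(s, c, r, rad) \
    ((s)[((r) + (rad)) * (TILE + 2*(rad)) + ((c) + (rad))])

/* Launch as: sobelDx<<<dim3((w+TILE-1)/TILE, (h+TILE-1)/TILE),
                        dim3(TILE, TILE)>>>(d_in, d_out, w, h); */
__global__ void sobelDx(const float *d_in, float *d_out, int width, int height) {
    __shared__ float s_data[(TILE + 2*SOBEL_DX_RADIUS) *
                            (TILE + 2*SOBEL_DX_RADIUS)];

    int col = threadIdx.x;
    int row = threadIdx.y;
    int gx = blockIdx.x * TILE + col;  /* global pixel coordinates */
    int gy = blockIdx.y * TILE + row;

    /* Stage this thread's pixel into the shared tile; threads on the
       block edge would also load the halo border (elided here). */
    if (gx < width && gy < height)
        SM_REF(s_data, col, row, SOBEL_DX_RADIUS) = d_in[gy * width + gx];
    __syncthreads();

    /* ... the nine-point computation of Figure 3.3 goes here, reading
       SM_REF(s_data, col +/- 1, row +/- 1, ...) and writing d_out. */
}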

3.3 Computation Speculation

In some computer vision applications the amount of computation depends on the

content of the data itself. Conceptually, as the content of an image changes, there may be


Figure 3.4: Diagram of Sobel dX computation on GPU.

more points of data exposed to certain computer vision algorithms. One example of this

variable amount of computation is found in the application Support Vector Machines

(SVMs). SVMs are a set of related supervised learning methods that analyze data and

recognize patterns, used for classification and regression analysis. The original SVM

algorithm was invented by Vladimir Vapnik and the current standard implementation

(soft margin) was proposed by Corinna Cortes and Vladimir Vapnik [20]. The standard

SVM takes a set of input data, and predicts, for each given input, which of two possible

classes the input is a member of, which makes the SVM a non-probabilistic binary

linear classifier. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model

that predicts whether a new example falls into one category or the other. Intuitively,

an SVM model is a representation of the examples as points in space, mapped so that

the examples of the separate categories are divided by a clear gap that is as wide as

possible. New examples are then mapped into that same space and predicted to belong

to a category based on which side of the gap they fall on.

In the case of the SVM benchmark, there is a loop with a conditional statement in

the loop body. Depending on the value of an array indexed by the loop iteration variable,


the conditional is taken and requires the program to execute additional computation.

The result is that the loop execution takes a variable amount of execution time and

is difficult to parallelize on multiple processing cores. Figure 3.5 illustrates the loop

example from the SVM application in which the calculation does not conform to a

traditional data-parallel problem. In this example, the function polynomial computes the dot product of two 256-entry vectors, indexed by i and k. The multiplication represents a significant amount of work; however, only a few iterations of the loop pass the conditional check inside the loop.

float polynomial(float exp, float A[], float B[], float dimension) {
  float sum = 0;
  /* Dot product of the two 256-entry vectors. */
  for (int index = 0; index < 256; index++) {
    sum = sum + A[index]*B[index];
  }
  return (pow(sum, exp) / dimension);
}

...

for (i=0; i<N; i++) {
  /* Only iterations passing this data-dependent check pay for polynomial. */
  if(subsref(a,i,0) > 0) {
    count++;
    s += asubsref(a,i) * asubsref(Y,i)
         * polynomial(3, Vector[i], Vector[k], 256);
  }
}

Figure 3.5: Code for SVM computation conditionally requiring polynomial calculation.

Figure 3.6 shows two vectors i and k potentially multiplied together in SVM. In

terms of optimizing this particular type of problem, it is difficult to determine if there

is a performance increase in off-loading the computational work to the GPU. If the

given input dictates that most of the conditionals are not taken, then the additional

computation would be skipped. This presents an interesting problem of finding the


Figure 3.6: Diagram of SVM potential polynomial calculation based on i and k.

methodology for best optimization using the CPU and the GPU. For this case, one

approach would be to have the GPU pre-calculate every possible computation needed

by the iterative loop and store the results in a table. This would result in performing far more computations than the algorithm necessarily needs. For a

CPU system, the additional computation would be prohibitive. For a GPU system,

computation is essentially free.
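A sketch of this speculative pre-calculation is shown below: one GPU thread per (i, k) pair evaluates the polynomial kernel from Figure 3.5 up front, whether or not the loop will ever request that entry. The function and variable names are illustrative, and a square table is used for simplicity (the actual table can be triangular, since the computation is symmetric in i and k).

#define DIM 256  /* vector length, as in Figure 3.5 */

/* Device version of the polynomial function from Figure 3.5. */
__device__ float polynomialDev(float exp_, const float *A, const float *B) {
    float sum = 0.0f;
    for (int d = 0; d < DIM; d++)
        sum += A[d] * B[d];
    return powf(sum, exp_) / DIM;
}

/* One thread per (i, k) pair fills the whole table up front.
   Launch as, e.g.:
   buildTable<<<dim3((n+15)/16, (n+15)/16), dim3(16, 16)>>>(v, t, n); */
__global__ void buildTable(const float *vectors, float *table, int n) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && k < n)
        table[i * n + k] =
            polynomialDev(3.0f, &vectors[i * DIM], &vectors[k * DIM]);
}

The original loop then replaces each call to polynomial with a table lookup.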

Table 3.1 shows for the given inputs of the SVM benchmark how much of the

table is calculated and consequently used, as well as the table size and number of table

accesses made. For example, when the algorithm is run with the cif input, 98.47% of the table is used during execution; the table contains 5050 entries and is accessed 97839 times in the program. For non-test inputs, a table of pre-calculated results would

have between 85% and 98% of the pre-calculated entries referenced. This indicates that

in the long run of the program, there is good reason to build the table all at once.

Table 3.2 shows that for each input there are phases of the program that call the

polynomial function. The polynomial function is inside a loop (see Figure 3.5). The

second column is the number of times the loop is called. The third column shows how

many times that loop iterates when it is called. For example, cif calls the loop 6441


Table 3.1: Motivation to pre-calculate the entire table

Input     Percentage of Table Used   Table Size   Run-Time Accesses to Table
cif       98.47%                     5050         97839
qcif      97.60%                     2628         30126
sqcif     97.32%                     1830         30907
sim       94.11%                     136          1753
sim fast  85.23%                     210          1776
test      50.00%                     10           20

times and each loop iteration count is 100. Based on the next columns for cif, the

minimum number of times that polynomial is called across the 6441 times is zero, the

average is 15.2 times out of 100, and the maximum number of times polynomial is called

is 30 out of 100. Finally, the standard deviation shows the variation each time the loop

is invoked. Overall, the per-loop results show that any single loop execution would not alone justify computing all of the potential values of the table.

Table 3.2: Number of accesses to table per loop invocation.

                                             Percentage of Loop Execution
Input     Loop Invocations  Loop Iterations  Min   Avg    Max    Standard Deviation
cif       6441              100              0.0%  15.2%  30.0%  9.0%
qcif      2635              72               0.0%  15.9%  41.7%  11.5%
sqcif     2197              60               0.0%  23.4%  50.0%  14.8%
sim       466               16               0.0%  23.5%  37.5%  15.6%
sim fast  214               20               0.0%  41.5%  75.0%  27.1%
test      18                4                0.0%  27.8%  75.0%  32.2%

Figure 3.7 shows the exact distribution of loop iterations that require the poly-

nomial calculation in SVM for the cif input. For this input, each loop iterates 100 times. However, as the figure shows, no loop invocation ever requires more than 30 polynomial calculations. Again, these results indicate that for regular


systems, a single loop invocation would not appear to use many of the pre-computed values.

Figure 3.7: Distributions of loop iterations that require polynomial calculations in SVM for cif input.

Figure 3.8 shows the number of unique polynomial calculations required over a

timeline of loop executions. Overall, the results indicate that over time nearly all of

the table entries are required. In this case, not building the table of pre-computed or speculated results is short-sighted if each loop invocation is considered only in isolation. It would be impractical, however, to pre-compute the table results on a traditional CPU system, given the large table size and the upfront overhead of the computation.


Figure 3.8: Timeline of unique polynomial calculations in SVM for input cif.


4. Experimental Results

The experimental results show that in many cases the GPU outperforms the

CPU as expected. The performance increase is seen in all of the translation methods

from computation speculation to reformulation. However, with a smaller problem size,

the CPU can often beat the GPU simply due to the overhead associated with

executing code on the GPU device. The data also shows that in many cases the actual

computation time of the GPU is only a small percentage of total GPU execution time.

4.1 Data-size Based GPU versus CPU Execution

The Image Segmentation algorithm yields some interesting experimental re-

sults. This benchmark requires a large amount of processing power in order to sort a

large set of floating point numbers. Figure 4.1 shows that for an input size greater than

96x128 there is a definite performance increase when using the GPU implementation.

Performance is severely decreased for the very small input sizes due to the GPU over-

head requirements such as memory transfers and kernel setup time. In this particular

case, if the input size was small then the CPU would execute the code. Mid-size and

larger data input sizes would be executed on the GPU.

The majority of the Image Segmentation algorithm consists of one major func-

tion known as fSortIndices, which consumes more than 99 percent of the execution time of the overall algorithm. The function simply sorts a large array of floating point numbers. This particular application is slowed down by the structure of the Image Segmentation algorithm, which requires the sort


Figure 4.1: Image Segmentation performance change using GPU implementation.

to be done via indexing, wherein the function builds an index to the original array as

to where the smallest to largest values occur. If this application utilized an in-place

style sort, then the overall performance would be significantly increased. The original

author of this algorithm implemented a brute-force sorting method which requires n^2 comparisons to take place (where n is the number of elements to be sorted).

This algorithm was translated to the GPU by first building an index on the

CPU side and then transferring all the values to the GPU to be sorted. Once the

GPU executes an in-place sort, the resulting sorted array of floating point numbers

is transferred back to the CPU. The CPU then correlates the returned array to the

appropriate index values. The performance is negatively impacted by the CPU doing the correlation work.
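One way to avoid the CPU-side correlation step would be to sort the values and their original indices together on the device, for example with Thrust's sort_by_key. The sketch below is an alternative formulation, not the implementation that was measured in Figure 4.2.

// Sketch: sort values on the GPU and carry their original indices along,
// so no CPU-side correlation pass is needed afterward.
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/sequence.h>
#include <thrust/copy.h>

void sortWithIndicesGPU(std::vector<float> &values, std::vector<int> &indices) {
    thrust::device_vector<float> d_vals(values.begin(), values.end()); // H->D
    thrust::device_vector<int> d_idx(values.size());
    thrust::sequence(d_idx.begin(), d_idx.end());      // 0, 1, ..., n-1
    thrust::sort_by_key(d_vals.begin(), d_vals.end(),  // keys: the values
                        d_idx.begin());                // payload: original indices
    indices.resize(values.size());
    thrust::copy(d_vals.begin(), d_vals.end(), values.begin());  // D->H
    thrust::copy(d_idx.begin(), d_idx.end(), indices.begin());   // D->H
}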

Figure 4.2 shows the relative execution time for the fSortIndices function. This

figure compares the CPU execution time side-by-side with the total GPU execution


time (including memory transfer and kernel setup time). Another interesting note is the GPU arithmetic execution time relative to the total GPU execution time. For the small input sizes nearly all of the GPU execution time is consumed by

computation. Looking at the larger input sizes one can see that the amount of time

spent on computation versus memory transfer time has a downward trend. Relative to total execution time, the memory transfer times actually increase with larger input sizes.

Figure 4.2: fSortIndices performance change using GPU implementation.

4.2 Computation Reformulation

The example of computation reformulation for GPUs in a computer vision application came from Tracking. The Tracking benchmark consists of

three main functions: Sobel dX, Sobel dY, and BlurImage. This particular example

actually reformulates how the computation is performed. Figure 4.3 depicts the per-

formance change for the overall Tracking benchmark. The graph shows that there


is a performance increase once the image input is sqcif or larger. The results for Tracking are similar to Image Segmentation in that for small input sizes there is an overall performance decrease due to the GPU memory transfer and kernel setup time.

Figure 4.3: Tracking kernels (BlurImage, Sobel dX, Sobel dY) performance change using GPU implementation.

The three sections of the Tracking benchmark yield different results. Higher

performance increases are seen with the GPU-optimized BlurImage than with the Sobel functions. BlurImage, for example, using the full HD input size has a performance increase of 8.3x, while calcSobel dY only experiences a performance increase of 2.8x. The performance increase for the dY algorithm is greater than for the dX algorithm because the CPU benefits from cache-friendly row-wise accesses in the dX version. Figure 4.4 shows that for all three algorithms

there is a performance improvement for all input image sizes. This result is interesting

due to the fact that the GPU is doing more computational work than the CPU.


Figure 4.4: Tracking kernels (BlurImage, Sobel dX, Sobel dY) GPU performance change.

Figure 4.5 shows the performance improvement of BlurImage in terms of execution

time measured in microseconds. The blue line shows how much of the overall GPU time

is spent on the execution; the rest is spent on memory transfers and overhead. For the

high definition image (1080x1920) we see that 0.35% of the total GPU time is actually

spent doing the execution and the other 99.65% of that time is used on kernel setup

and memory transfers.

The Sobel algorithm performance for the delta X direction can be seen in Figure 4.6.

The results show the performance improvement of calcSobel dX for various input sizes.

The minimum, average and maximum values have similar trends on each data set with

the CPU versus the GPU. For example, the largest (rightmost) data set is very flat for

both the CPU values as well as the GPU values. In all cases for this algorithm there is

a performance increase by using the GPU implementation of the algorithm.

Similarly, Figure 4.7 shows the performance improvement of the Sobel algorithm in the delta Y direction. The results indicate a performance improvement for every given input.


Figure 4.5: Tracking kernel BlurImage comparison of CPU and GPU: minimum, maximum, average.

Figure 4.6: Tracking kernel Sobel dX comparison of CPU and GPU: minimum, maximum, average.

When the algorithm is optimized for the GPU, less than one quarter of one percent of

the total GPU time is used for computation. More than 99.7% of the GPU time is taken

up by memory transfers and kernel setup.


Figure 4.7: Tracking kernel Sobel dY comparison of CPU and GPU: minimum, maximum, average.

Figure 4.8 shows an overview plot of all the kernels involved in Tracking. The

plot shows the standard deviation as a percentage of the average. In all cases, the GPU

has a lower standard deviation than the same function run on the CPU. Each GPU run has a more consistent execution time than the corresponding CPU-only run. Some of the

deviations on the CPU are attributed to the operating system scheduling various other

tasks on the CPU while the benchmark is running.

4.3 Computation Speculation

An example of speculative computation comes from SVM. The SVM benchmark

was implemented on the GPU in such a way that a large amount of pre-calculation is

performed at the beginning of the algorithm. Not all of the pre-calculated data may be

used during the run. Figure 4.9 shows the overall performance improvement of SVM by

using the GPU to pre-calculate the entire table. For the smaller inputs we see a 5x to

10x performance increase. For the larger inputs we see a consistent 28x increase. The


Figure 4.8: Tracking kernels (BlurImage, Sobel dX, Sobel dY) standard deviation as a percentage of average.

bars (plotted on the right-axis) show how much of the GPU time is used in the actual

computation (the remainder is GPU kernel setup and memory transfer time). For the

largest input (cif), less than 5% of the GPU execution time is used on computation.

Figure 4.9: GPU performance for SVM application.

Figure 4.10 shows the speedup of each table access versus generation. Each bar


indicates the speedup factor of the GPU compared to the CPU. The number on top of each bar shows the ratio of the standard deviation to the average. The data shows that the GPU time is more consistent than the CPU time and does not stray far from its average.

Figure 4.10: GPU performance for each table access versus generation.

The polynomial algorithm, a portion of the SVM application, exhibits extreme

performance improvement using the GPU implementation. Figure 4.11 shows the total

speedup of the polynomial function and the total speedup of the SVM application. The

polynomial function is sped up by as much as 60x over the CPU for the sqcif input. The overall application speedup is around 2.5x to 3x over the CPU-only implementation for all input sizes.


Figure 4.11: GPU performance for SVM application.


5. Conclusion

The optimization approaches described and the experimental results have proven to be successful techniques for real-world applications. The results show that, given realistic problem sizes, a performance increase is seen with GPU-optimized algorithms.

The example of sorting optimization shows that in some cases the GPU can

actually slow down the overall performance of the algorithm. This particular example

also shows that the GPU can perform tasks such as sorting significantly faster than

a CPU given a large enough input data set. Future work would potentially focus on optimizing additional commonly used algorithms.

Many cases exist where optimization can be done by reformulating the computa-

tion methods contained in the algorithm. Oftentimes the GPU will actually perform

more computations than the CPU version. Reformulation can allow the GPU to per-

form the same calculations in a parallel form. Additional work for this case would focus

on creating templates and examples of ways to reformulate commonly seen algorithms.

Speculative computation involves careful observation of the underlying algorithm and identification of opportunities to pre-generate large lookup tables and the like. Code for CPUs is

often written to optimize the code for serial execution and minimize the total number of

computations performed. Future work would involve finding a method to easily identify

when such situations occur in software code.

This thesis shows that different types of computationally intensive problems can

be optimized further by using the processing power of the GPU. By identifying and properly applying the techniques presented, a software developer can increase the performance of a given process. Utilizing the GPU, which exists on


many desktop systems today, is a great way to improve performance, boost efficiency

and decrease hardware costs.


BIBLIOGRAPHY

[1] M. Pharr and R. Fernando, GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation. Addison-Wesley Professional, 2005.

[2] SD-VBS: The San Diego Vision Benchmark Suite, October 2009.

[3] M. Flynn, “Some computer organizations and their effectiveness,” IEEE Trans. Comput., vol. C-21, pp. 948–960, 1972.

[4] W. D. Hillis and G. L. Steele, Jr., “Data parallel algorithms,” Commun. ACM,vol. 29, pp. 1170–1183, December 1986.

[5] INTEL, “The evolution of a revolution,” http://download.intel.com/pressroom/kits/intelprocessorhistory.pdf, accessed 2008.

[6] INTEL, “Intel microprocessor export compliance metrics,” http://www.intel.com/support/processors/sb/cs-023143.htm, accessed 2010.

[7] D. B. Kirk and W.-m. W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 1 ed., February 2010.

[8] W.-m. Hwu, K. Keutzer, and T. G. Mattson, “The concurrency challenge,” IEEEDes. Test, vol. 25, pp. 312–320, July 2008.

[9] Intel Corporation, Intel® 64 and IA-32 Architectures Optimization Reference Manual. No. 248966-023a, Intel Corporation, January 2011.

[10] NVIDIA, “GeForce GTX 200 GPU architectural overview,” http://www.nvidia.com/docs/io/55506/geforce_gtx_200_gpu_technical_brief.pdf, accessed 2008.

[11] H. Sutter and J. Larus, “Software and the concurrency revolution,” Queue, vol. 3,pp. 54–62, September 2005.

[12] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan, “Brook for GPUs: stream computing on graphics hardware,” in SIGGRAPH ’04: ACM SIGGRAPH 2004 Papers, (New York, NY, USA), pp. 777–786, ACM Press, 2004.

[13] M. McCool, S. Du Toit, T. Popa, B. Chan, and K. Moule, “Shader algebra,” ACMTrans. Graph., vol. 23, pp. 787–795, August 2004.


[14] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips, “GPU computing,” Proceedings of the IEEE, vol. 96, pp. 879–899, May 2008.

[15] NVIDIA, NVIDIA CUDA Programming Guide 2.0. 2008.

[16] Standard Performance Evaluation Corporation, “The SPEC CPU 2006 benchmark suite,” 2006.

[17] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J.Reddi, and K. Hazelwood, “Pin: building customized program analysis tools withdynamic instrumentation,” SIGPLAN Not., vol. 40, pp. 190–200, June 2005.

[18] D. G. Lowe, “Object recognition from local scale-invariant features,” in Proceedings of the International Conference on Computer Vision - Volume 2, ICCV ’99, (Washington, DC, USA), pp. 1150–1157, IEEE Computer Society, 1999.

[19] S. Warn, W. Emeneker, J. Cothren, and A. W. Apon, “Accelerating SIFT on parallel architectures,” in CLUSTER, pp. 1–4, IEEE, 2009.

[20] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, pp. 273–297, 1995. doi:10.1007/BF00994018.
