
Linear Algebra PACKage on the NVIDIA G80 Processor

Robert Liao and Tracy Wang
Computer Science 252 Project
University of California, Berkeley
Berkeley, CA 94720

liao_r [at] berkeley.edu, tracyx [at] berkeley.edu

Abstract

The race for performance improvement previously depended on running one application or sequence of code really quickly. Now, the processor industry and universities focus on running many things really quickly at the same time in an idea collectively known as parallelism. Traditional microprocessors achieve this by building more cores onto the architecture. A single processor may have as many as 2 to 8 cores that can execute independently of one another. However, many of these processor systems include an underutilized graphics processing unit. NVIDIA's G80 Processor represents a first step toward improving the performance of an application through the inherent parallel structure of a graphics processing unit.

This paper explores the performance of a subset of the Linear Algebra PACKage running on the NVIDIA G80 Processor. Results from the exploration show that, if utilized properly, the performance of linear algebra operations improves by a factor of 70 for suitably large input sizes. Additionally, the paper discusses the issues involved in running general programs on the GPU, a relatively new ability provided by the G80 Processor. Finally, the paper discusses the limitations of using a GPU for general computation.

1 Introduction

The effort to improve performance on microprocessors for much of the 1990s focused primarily on running instructions faster. This brought forth the frequency race between microprocessor manufacturers. Programmers could write code, wait 6 months, and their code would suddenly run faster. Additionally, many techniques for increasing instruction-level parallelism were introduced to improve performance. Today, the next performance obstacle is not clear, but the general consensus is that the next big thing involves putting many processing cores on a processor. This will enable applications to take advantage of some higher form of parallelism beyond instruction-level parallelism.

During all of this time, Graphics Processing Units (GPUs) have been doing many operations in parallel due to the relatively independent nature of their computations. Graphics scenes can be decomposed into objects, which can be decomposed into various rendering steps that are independent from one another. However, this led to a very specialized processor that was optimized for rendering.

Performing computations on a GPU is a relatively new field full of opportunities thanks to the NVIDIA G80 GPU. This processor is one of the first to expose as many as 128 computation cores to the programmer. As a result, a comparison against the CPU to see how the GPU performs is more than appropriate for determining whether this is the right direction for parallelism.


Though any program can be loaded onto the GPU for this exploration, we decided to benchmark linear algebra operations as a starting point for evaluating the GPU for other research efforts like the Berkeley View. The Berkeley View has compiled many kernels, called dwarves, that they think should perform well on parallel platforms. These include dense and sparse matrix operations, and many programs can be reduced purely to these dwarves. As a starting point for examining the efficacy of using the GPU in this capacity, we benchmark the performance of general linear algebra operations.

This paper is organized as follows. Section 2 provides a general introduction to traditional GPUs and how they worked prior to the NVIDIA G80 processor. Section 3 discusses details of the NVIDIA G80 processor and outlines its capabilities. Section 4 discusses the benchmarks used to profile the performance of the G80 with respect to the two CPUs used in this paper. Section 5 provides results, discussion, and speculation on the performance runs on the GPU. Section 6 discusses the issues associated with GPU computing and with running applications on the G80 platform. Finally, the paper concludes in Section 7 with a summary of the results, future directions for research, as well as related work on GPUs.

2 Traditional GPUs

Background

Many modern personal computing and workstation architectures include a GPU to off-load the task of rendering graphical objects from the Central Processing Unit (CPU). The relationship between the CPU and GPU is closely intertwined, as most of a computer's output reaches the user through video. In the past, architects optimized the bus routes between the CPU and GPU because of the high demand to display video on an output device like a monitor or LCD.

GPU manufacturers have kept up with demand by offering more advanced capabilities in their GPUs beyond drawing text and windows on a screen. GPUs today are typically capable of taking geometric information in the form of polygons from an application like a game and performing many different transformations to produce realistic or artistic output.

This video processing is embarrassingly parallel: each pixel on the screen can often be rendered independently of the others. As a result, GPU manufacturers have added many superscalar features to their processors to take advantage of this parallelism. This push for parallelism has reached the point where a GPU is essentially a specialized vector processor.

Motivation for Change

A fixed pipeline characterizes the traditional GPU. Many have a fixed number of specialized shaders such as vertex shaders and pixel shaders. NVIDIA noticed that during certain rendering scenarios, many of the specialized shaders remain dormant. For instance, a scene with many geometric features will use many vertex shaders, but not very many pixel shaders. As a result, NVIDIA began to look for a reconfigurable solution.

Figure 1 shows a typical high-level pipeline of a GPU. Data flows forward from the CPU through the GPU and ultimately on to the display. GPUs typically contain many of these pipelines to process scenes in parallel. Additionally, because the pipeline is designed to flow forward, certain stages of the pipeline have features like write-only registers to avoid hazards such as the read-after-write hazards found in typical CPU pipelines.

Figure 1: The Traditional GPU Pipeline

Additionally, this vector processor sitting right next to the CPU is idle during heavy computations performed on the CPU. Most developers do not send parallelizable computations to the GPU because the APIs make it too difficult to do so. Typical interfaces like OpenGL and DirectX are designed for graphics, not computation. As a result, the programmer cannot tap into the GPU's vast vector resources.

3 The NVIDIA G80 GPU

The G80 GPU is found in NVIDIA's GeForce 8 Series graphics cards as well as the NVIDIA Quadro FX 4600 and 5600. The NVIDIA Quadro FX 5600 is the card used in this exploration.

Architecture

The G80 GPU is NVIDIA's answer to many of the aforementioned concerns and issues. It represents a large departure from traditional GPU architectures. A block diagram of the architecture is shown in Figure 2. The GPU contains 8 blocks of 16 stream processors, for a total of 128 stream processors. Each stream processor can execute floating point instructions. As the block diagram shows, each group of 16 shares an L1 cache, and each block has access to 6 L2 caches. This arrangement also allows one processor to feed results directly into another processor for continued stream processing.

Each processor can be configured to be a part of some shader unit in the traditional GPU sense. This reconfigurability also means that the processors can be dedicated to performing general computations. This capability is exposed in NVIDIA’s Compute Unified Device Architecture.

Each processor also has local memory as well as shared memory with other processors. According to the NVIDIA guide, accessing local and shared memory on-chip is as fast as accessing registers.

Compute Unified Device Architecture

The Compute Unified Device Architecture (CUDA) is NVIDIA's API for exposing the processing features of the G80 GPU. This C language API provides services ranging from common GPU operations in the CUDA library to traditional C memory management semantics in the CUDA runtime and device driver layers. Additionally, NVIDIA provides a specialized C compiler to build programs targeted for the GPU.
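To make the programming model concrete, the following is a minimal sketch of the kind of program this toolchain builds. The kernel name, data sizes, and launch configuration are our own illustration rather than code from this project, and error checking is omitted.

/* Minimal CUDA sketch: the __global__ kernel is compiled for the GPU,
 * the rest for the CPU. */
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* one element per thread */
    if (i < n)
        x[i] *= a;
}

int main(void)
{
    const int n = 1 << 20;
    float *h_x = (float *)malloc(n * sizeof(float)); /* CPU (host) memory   */
    for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

    float *d_x;                                      /* GPU (device) memory */
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);   /* launch on the GPU   */

    cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("x[0] = %f\n", h_x[0]);

    cudaFree(d_x);
    free(h_x);
    return 0;
}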

Code compiled for the GPU executes on the GPU, and memory allocated on the GPU resides on the GPU. This introduces complications in interfacing programs running in CPU space with programs running in GPU space: the programmer must keep track of which pointers belong to which processor. Many programs, including the ones used to benchmark the GPU in this paper, reasonably assume that all pointers and all executing code reside in one memory space on one execution unit. Porting this style of programming to the split model is a non-trivial task.
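One way to keep this bookkeeping straight is to carry a matrix's host and device pointers together. The struct and helper names below are our own convention, shown only as a sketch of the idea; they are not part of CUDA or of our CULAPACK port.

/* Sketch: pair the CPU and GPU copies of a matrix so it is always clear
 * which pointer lives in which memory space. */
#include <cuda_runtime.h>
#include <stdlib.h>

typedef struct {
    float  *host;    /* valid only for CPU code (e.g. CLAPACK)      */
    float  *device;  /* valid only for GPU code (e.g. CUBLAS calls) */
    size_t  bytes;
} gpu_matrix;

static int gpu_matrix_create(gpu_matrix *m, int rows, int cols)
{
    m->bytes = (size_t)rows * cols * sizeof(float);
    m->host  = (float *)malloc(m->bytes);
    return m->host != NULL &&
           cudaMalloc((void **)&m->device, m->bytes) == cudaSuccess;
}

static void gpu_matrix_to_device(gpu_matrix *m)   /* CPU -> GPU */
{
    cudaMemcpy(m->device, m->host, m->bytes, cudaMemcpyHostToDevice);
}

static void gpu_matrix_to_host(gpu_matrix *m)     /* GPU -> CPU */
{
    cudaMemcpy(m->host, m->device, m->bytes, cudaMemcpyDeviceToHost);
}

static void gpu_matrix_destroy(gpu_matrix *m)
{
    cudaFree(m->device);
    free(m->host);
}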

Figure 2: The NVIDIA G80 Graphics Processor Architecture


A Note on Scarce Specifications

Due to the secretive nature of the industry, NVIDIA has not released much information about the G80 processor beyond a high-level overview. As a result, we can only speculate on specifics like L1 and L2 cache sizes in this benchmark.

4 Benchmarking

LAPACK and CLAPACK

The Linear Algebra PACKage (LAPACK) consists of a library of programs that operate on matrices. This exploration uses LAPACK to gauge the performance of the GPU with respect to the CPU. Originally written in Fortran 77, LAPACK is designed to solve simultaneous systems of equations, determine least-squares solutions for systems of linear equations, and compute eigenvalues of matrices, among other linear algebra problems. It not only solves those problems but also provides the basic tools necessary for solving them. Our exploration deals with CLAPACK, a C version of LAPACK produced by running it through f2c, a Fortran-to-C converter.

We focus on the sgesv function in LAPACK for measuring the performance of the GPU with respect to the CPU. sgesv solves a linear system of equations in the form of

A × X = B

where A is a matrix of coefficients, X is the vector of unknowns, and B is a vector of constants. LAPACK uses triangular factorization to solve this system of equations. Our performance tests measure how quickly LAPACK can perform this factorization with respect to matrix size on both the CPU and the GPU.
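As a point of reference, a call to the CLAPACK version of sgesv looks roughly like the following. The 2x2 system is an arbitrary example of ours; the f2c-style types (integer, real) come from CLAPACK's headers.

/* Sketch: solving A * X = B with CLAPACK's sgesv on the CPU. */
#include "f2c.h"
#include "clapack.h"
#include <stdio.h>

int main(void)
{
    /* 2x2 example, stored column-major as Fortran expects:
     *   [ 3 1 ] [x1]   [ 9 ]
     *   [ 1 2 ] [x2] = [ 8 ]   ->  x1 = 2, x2 = 3                    */
    real    a[4] = { 3.0f, 1.0f, 1.0f, 2.0f };   /* column-major A          */
    real    b[2] = { 9.0f, 8.0f };               /* B, overwritten with X   */
    integer n = 2, nrhs = 1, lda = 2, ldb = 2, info = 0;
    integer ipiv[2];                             /* pivot indices from LU   */

    sgesv_(&n, &nrhs, a, &lda, ipiv, b, &ldb, &info);

    if (info == 0)
        printf("x = (%f, %f)\n", b[0], b[1]);
    return 0;
}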

BLAS and CUBLAS

The LAPACK tools rely on the Basic Linear Algebra Subprograms (BLAS) library. These subprograms are a set of primitive operations on matrices and vectors. The original BLAS runs on the CPU. NVIDIA provides its own version called CUBLAS (Compute Unified BLAS). CUBLAS is designed to run on the G80 GPU and abstracts much of the CUDA programming API in a succinct mathematical package. The only major change is the inclusion of allocation and freeing functions to deal with the separation of CPU and GPU memory.
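The sketch below illustrates this extra allocate/copy step using the early CUBLAS C interface, computing a small matrix-vector product on the GPU. The example values are our own, error handling is omitted, and the exact calls should be checked against the CUBLAS library documentation cited in the references.

/* Sketch: y = A * x computed on the GPU through CUBLAS. */
#include "cublas.h"
#include <stdio.h>

int main(void)
{
    const int n = 2;
    float a[4] = { 3.0f, 1.0f, 1.0f, 2.0f };   /* column-major A          */
    float x[2] = { 2.0f, 3.0f };
    float y[2] = { 0.0f, 0.0f };
    float *d_a, *d_x, *d_y;                    /* pointers into GPU memory */

    cublasInit();

    /* Extra steps compared to CPU BLAS: allocate on the GPU and copy over. */
    cublasAlloc(n * n, sizeof(float), (void **)&d_a);
    cublasAlloc(n, sizeof(float), (void **)&d_x);
    cublasAlloc(n, sizeof(float), (void **)&d_y);
    cublasSetMatrix(n, n, sizeof(float), a, n, d_a, n);
    cublasSetVector(n, sizeof(float), x, 1, d_x, 1);

    /* y = 1.0 * A * x + 0.0 * y, entirely on the GPU */
    cublasSgemv('N', n, n, 1.0f, d_a, n, d_x, 1, 0.0f, d_y, 1);

    cublasGetVector(n, sizeof(float), d_y, 1, y, 1);
    printf("y = (%f, %f)\n", y[0], y[1]);      /* expect (9, 8) */

    cublasFree(d_a); cublasFree(d_x); cublasFree(d_y);
    cublasShutdown();
    return 0;
}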

CULAPACK

To perform this exploration, we took the general solve function and altered it to be compatible with CUBLAS. We called the resulting port CULAPACK (Compute Unified LAPACK). Figure 3 shows the organization of the aforementioned packages.

Porting LAPACK to CULAPACK is non-trivial because of LAPACK's assumption of a single memory space. LAPACK often references CPU memory for branching conditions before calling BLAS.

There are several approaches to this port. One approach is to wrap memory transfer code around each BLAS call. Unfortunately, this comes at the expense of performance, since it repeats copies that are not actually necessary for every BLAS call. To lessen this performance hit, the second approach moves the memory transfer up to the level of the LAPACK call. The third and ideal solution would be to move the GPU/CPU boundary from between LAPACK and BLAS to above LAPACK. This requires LAPACK to be re-architected to account for two memory spaces.

Figure 3: Organization of the modules.


Figure 4: Memory overhead on the NVIDIA Quadro FX 5600 (time in milliseconds versus number of FORTRAN real elements transferred).

Because the first approach brings a heavy memory overhead hit, and the third approach is more time-consuming and involved, we chose to implement the second approach in the following steps:

1. Perform a preprocessing step: Take matrices supplied by our testing framework in CPU memory, and copy them to GPU memory. Additionally, perform any computations that LAPACK needs before invoking BLAS calls.

2. Compute: Invoke as many CUBLAS calls as is possible given the preprocessing.

3. Postprocess the results: Copy relevant parts of the GPU matrix to allow LAPACK to continue processing without changing behavior. Additionally, if LAPACK needs to return the actual matrix, copy the result out of the GPU and into the CPU.

Even though step 3 requires some copying from the GPU to the CPU during the program flow, the cost is minimal and becomes negligible as the matrix size grows. We found this approach to be the most sensible, and an equally telling way to evaluate the GPU for this paper.
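The following sketch shows the shape of this second approach. The wrapper name and signature are hypothetical and the compute step is only outlined in comments; it is meant to illustrate the preprocess/compute/postprocess structure, not to reproduce the actual CULAPACK source.

/* Sketch of the second approach: copy once at the level of the LAPACK
 * routine, run the grouped CUBLAS calls on the device copy, then copy
 * back only what is needed. */
#include "cublas.h"

/* Hypothetical wrapper around a solve step on an n x n column-major matrix. */
int culapack_sgesv_like(int n, int nrhs, float *a, int lda, float *b, int ldb)
{
    float *d_a, *d_b;

    /* 1. Preprocess: one bulk transfer CPU -> GPU per LAPACK call. */
    cublasAlloc(n * lda, sizeof(float), (void **)&d_a);
    cublasAlloc(nrhs * ldb, sizeof(float), (void **)&d_b);
    cublasSetMatrix(n, n, sizeof(float), a, lda, d_a, lda);
    cublasSetMatrix(n, nrhs, sizeof(float), b, ldb, d_b, ldb);

    /* 2. Compute: group as many BLAS-level operations as possible into
     *    CUBLAS calls on d_a and d_b (in the real port this is the
     *    factorization's sequence of calls such as cublasIsamax,
     *    cublasSswap, cublasSger, and cublasStrsm).                    */

    /* 3. Postprocess: copy back only what the caller needs to continue. */
    cublasGetMatrix(n, nrhs, sizeof(float), d_b, ldb, b, ldb);
    cublasGetMatrix(n, n, sizeof(float), d_a, lda, a, lda);

    cublasFree(d_a);
    cublasFree(d_b);
    return 0;
}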

5 Performance and Analysis

Memory Allocation Overhead

CULAPACK requires many, potentially large, memory transfers between the CPU and the GPU. The benchmark first allocates a specified amount of memory on the CPU and the GPU. Randomized data is generated and used to populate the CPU array. Next, the array is copied to the GPU. After that, the array is copied back to the CPU, and both arrays are freed.
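A reconstruction of this benchmark, under our own assumptions about the timing method (CUDA events, timing only the two copies), might look like the following.

/* Sketch: time a round trip of n FORTRAN reals (floats) between CPU and
 * GPU memory. */
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

float round_trip_ms(int n)
{
    float *h = (float *)malloc(n * sizeof(float));
    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));
    for (int i = 0; i < n; ++i)
        h[i] = (float)rand() / RAND_MAX;          /* randomized data */

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    free(h);
    return ms;
}

int main(void)
{
    for (int n = 100000; n <= 700000; n += 100000)
        printf("%d elements: %.3f ms\n", n, round_trip_ms(n));
    return 0;
}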

Figure 4 shows the memory overhead for various numbers of elements transferred between the CPU and GPU. The blue line represents the average of three different trial runs of this benchmark, and the red line represents a linear regression of these points.

The memory overhead is linear, as indicated by the graph. This means that memory operations for subsequent benchmarks will scale linearly with size, which matters for the GPU since later findings will show that the GPU performs better with larger input sizes.

This also means that the programmer must be careful when dealing with the CPU-GPU interface. A naïve approach to copying memory between the CPU and GPU can result in poor computational performance.

LAPACK Mathematical Performance

The benchmarks targeted the triangular factoring step of the general solve function in LAPACK. Figure 5 shows the speedup of the NVIDIA Quadro FX 5600 over the Core 2 Duo 6700. The speedup is calculated by dividing the time to triangular factor a matrix of a given input size on the CPU by the time required for the same operation on the GPU. Since memory copying is required to perform these operations on the GPU, this memory overhead is included in the chart. In the surface graph, a higher surface means a greater speedup for a particular row and column size. We performed three time trials for each of the dimensions and averaged the times to reduce random factors affecting any particular trial.

As shown in the graph, most of the data points lie below a speedup of 1.0 (the red and blue regions). This means that the GPU was slower than the CPU in performing an identical calculation. An interesting peak occurs at a row size of 600 in the graph. The same benchmark was run on a Pentium D 820 processor, which has a different core design from the Core 2 Duo. The same peak occurs at around 600 rows. The next large peak does not occur until around 1600 rows, which is not shown in the graph above. Figure 6 shows a slice along this interesting axis.

To further investigate this unusual peak, we plotted the time required to perform this calculation, shown in Figure 7. Notice how the graph is relatively linear for larger input sizes. For the CPU, however, there is a pronounced execution time peak at 600 rows.

We speculate that the BLAS memory access and instruction patterns exacerbate some architectural feature on the Core 2 Duo 6700. The Core 2 Duo 6700 has a shared 4 MB L2 cache, which means it should be able to hold the entire matrix inside the cache with plenty of space to spare. While there are no specifications for the L1 and L2 caches of the G80 processor, NVIDIA has stated that an L1 cache is associated with each group of 16 stream processors and that the 6 L2 caches can be accessed by all groups. This grouped cache is probably large enough to hold the entire matrix for the input sizes tested in this benchmark.

Figure 5: Surface chart showing the speedup of the NVIDIA Quadro FX 5600 over the Core 2 Duo 6700 (CPU time / GPU time), by input row and column size.

Figure 6: A slice of the surface graph at 600 rows: speedup of the NVIDIA Quadro FX 5600 over the Core 2 Duo 6700 and the Pentium D 820, by input column size.

We used the Intel VTune profiler to further investigate this unusual hit in performance. We ran the profiler on the program operating on matrix row sizes of 500, 600, and 700. For cache misses, the numbers seem to be in order: the program's cache misses grow proportionally with the matrix row size. However, upon examining the branch predictor, we found that the processor seemed to be mispredicting branches at an unusually high rate for a row size of 600. The Pentium D 820 also exhibited a similar performance hit at the same row size. If both processors use the same branch predictor, this may be the cause of the large performance hit. However, determining why the branch predictor mispredicts on that particular row size is beyond the scope of this paper.

For even larger matrix sizes, the GPU simply blows the CPU out of the water. Figure 8 shows the extraordinary speedup, calculated in the same fashion. For a 5000x9000 matrix, the CPU took over two hours to determine the result; the GPU, in contrast, took under two minutes. This benchmark shows two different wins for the GPU. First, the vast number of cores on the GPU means it can perform many more operations at the same time than the CPU can. Additionally, these large matrices take up quite a bit of memory on both platforms. On a CPU, a cache miss means a penalty of many cycles; on the NVIDIA G80, much of the memory is as fast as registers. As a result, the G80 can process more data faster. We note that the total time to execute the benchmark on the Core 2 Duo was on the order of 10 hours. The Pentium D 820 was not halfway through the benchmark after 8 hours, and as a result, we decided not to show its speedup in Figure 8.

Figure 7: Time required to perform triangular factorization on the NVIDIA Quadro FX 5600 (time in seconds versus input row and column size).

Figure 8: A slice of the surface graph.

Figure 9: Time required to swap two matrices (seconds versus size of square matrix), for the NVIDIA Quadro FX 5600 and the Core 2 Duo 6700.

It is worth noting that the memory overhead in our ported version of CULAPACK quickly becomes negligible as the matrix size grows. Because we have minimized the number of memory transfers, we can attribute most of the measured time to actual computation.

Non-computational Performance

For completeness, we decided to see how well the GPU performs on non-computational operations. We decided to find out how quickly the GPU could swap two matrices. The results of this benchmark are shown in Figure 9, and they are quite surprising. The time required on the GPU increases quadratically, as expected, since the number of element swaps grows quadratically with the dimension of the square matrix. However, the CPU seems to be able to perform this swap very quickly, if near constant time is considered to be quick.
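The sketch below illustrates the two sides of this comparison: a BLAS-style swap of all n x n elements on the GPU via cublasSswap, and a plain element-wise loop on the CPU. The exact benchmark code is not reproduced here, and the loop-based CPU swap is only one plausible baseline; timing code is omitted.

/* Sketch: swap two n x n float matrices on the GPU and on the CPU. */
#include "cublas.h"

void gpu_swap(float *d_a, float *d_b, int n)
{
    /* BLAS-style swap over all n*n elements, entirely in GPU memory. */
    cublasSswap(n * n, d_a, 1, d_b, 1);
}

void cpu_swap(float *a, float *b, int n)
{
    for (int i = 0; i < n * n; ++i) {
        float t = a[i];
        a[i] = b[i];
        b[i] = t;
    }
    /* If the caller only needs the matrices exchanged logically, swapping
     * the two pointers instead would make this effectively constant time. */
}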

Without more information about NVIDIA's architecture, it is difficult to determine the cause of the slowdown. However, it is not unreasonable to speculate that it stems from a difference in optimization goals. In typical graphics computations, swaps may not occur often under the write-once semantics of traditional GPUs, so they are not a common use scenario. On the other hand, swaps occur all the time on the CPU; sorting is an excellent example. Many data sets processed on a computer require some sort of sorting, so the CPU can optimize for that. NVIDIA may simply not have optimized for this new scenario during design.

From this example, the G80 shows that there is little benefit derived from executing non-computational operations on the GPU.

6 GPU Programming Issues

GPU programming is not without its issues. The biggest issue is the memory separation between the GPU and CPU. Minimizing the memory overhead of using the GPU from CPU code requires either large memory transfers at the GPU/CPU boundary or an involved reorganization of the CPU code to avoid many memory copies. This means that while the GPU performs better for large matrix operations, current applications need to be aware of the GPU memory space to fully take advantage of the performance gain.

In the first iteration of this project, shown at the project presentations, we were not able to obtain any speedup for matrix sizes below 1000 by 1000. That iteration simply wrapped each CUBLAS call with memory copies at the appropriate places. The second iteration optimized the memory copying: we copied the matrices into GPU memory once and made as many CUBLAS calls as could be grouped together. This allowed the GPU to perform on par with, if not better than, the CPU on some benchmarks for matrices smaller than 1000 by 1000.

Another issue with operating on the G80 GPU is floating point arithmetic. A close examination of the solutions obtained from the linear algebra operations reveals a level of floating point imprecision (e.g., 0 becomes 0.0001). The NVIDIA CUDA Programming Guide documents many implementation deviations from IEEE-754: there is no support for denormalized numbers, and underflowed numbers are flushed to zero. This lack of precision indicates that the GPU is unsuitable for scenarios where high precision is required, which may limit its usefulness in scientific computing, where parallelism is king.
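In practice, this means CPU and GPU results should be compared with a tolerance rather than bit-exactly. The helper below is an illustrative sketch of ours (the function name and tolerance value are not part of the benchmark suite).

/* Sketch: compare CPU and GPU result vectors within a relative tolerance,
 * since the G80's flush-to-zero behavior prevents exact agreement. */
#include <math.h>
#include <stdio.h>

int results_match(const float *cpu, const float *gpu, int n, float tol)
{
    for (int i = 0; i < n; ++i) {
        float denom = fmaxf(fabsf(cpu[i]), 1.0f);  /* relative where sensible */
        if (fabsf(cpu[i] - gpu[i]) / denom > tol)
            return 0;
    }
    return 1;
}

int main(void)
{
    float cpu[2] = { 0.0f,    2.0f };
    float gpu[2] = { 0.0001f, 2.00003f };          /* e.g. 0 comes back as 0.0001 */
    printf("match within 1e-3: %s\n",
           results_match(cpu, gpu, 2, 1e-3f) ? "yes" : "no");
    return 0;
}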

7 Conclusion

The primary goal of this project was to explore the GPU programming space as well as assess the performance of a GPU running some type of non-graphical computation. In this paper, we have just begun the exploration of GPU computing, and there is much that others can do in this very new field.

The G80 demonstrated that it is up to the task of handling many of the computations that are typically left to the CPU. For small matrix operations, the computation is better left to the CPU since the CPU is usually as fast, if not faster, than the GPU. For matrices larger than 1000x1000, the GPU exhibits some very large performance gains. It is up to 70 times faster with matrices on the order of 5000 rows when compared to a Core 2 Duo CPU.

Additionally, NVIDIA’s new CUDA API offers a very convenient way for the programmer to leverage the GPU for non-graphical computations. This also opens the potential for the GPU and CPU to work in parallel where the GPU serves primarily as a math coprocessor and the CPU coordinates tasks for the GPU or does some other useful work.

Related Work

The project behind “The Landscape of Parallel Computing Research: A View from Berkeley” provided the inspiration for this exploration. One of its main goals is to explore a parallel landscape of processors with many cores. The GPU provides a promising step towards the project's goal of thousands of cores on a processor.

Professor James Demmel in the Electrical Engineering and Computer Science Department at the University of California, Berkeley also does mathematical work on GPUs. One of his research projects involves extracting useful computational information from an ATI GPU with the DirectX graphics API.

Vinay Bondhugula et al., in the Computer Science Department at the University of North Carolina at Chapel Hill, are currently exploring fast singular value decomposition on graphics processors. They currently run their computations on NVIDIA GeForce 7 Series graphics cards.

Future Directions

We have only benchmarked a small subset of the CLAPACK library, with limited optimization in bringing CLAPACK onto the GPU. Since the results are promising, there are many directions that can be taken.

One large-scale project could port all BLAS calls in CLAPACK functions to CUBLAS in the manner described above. Benchmarks could then be performed over this wider sample of linear algebra operations, providing better insight into the advantages of the GPU for various operations.

Additionally, porting CLAPACK to the GPU program space would provide more interesting results for a program specifically designed with the GPU in mind. This is more involved as it requires that the computations outside of CUBLAS also use CUDA to execute directly on the GPU. However, this would minimize the memory copying overhead and provide a more accurate benchmark of GPU performance.

A more detailed analysis of the benchmarks and further optimizations could be performed on CULAPACK given more specifications of the G80 processor from NVIDIA. With this information, the project could optimize allocations based on the GPU's cache sizes and adjust code ordering, among other optimizations. Optimized performance for linear algebra operations may open new areas for the Berkeley View project.

Acknowledgements

The authors would like to thank Professor James Demmel from the UC Berkeley Electrical Engineering and Computer Science Department for his guidance on the project. Additionally, the authors would like to thank Professor Sara McMains from the UC Berkeley Mechanical Engineering Department for supplying the NVIDIA Quadro FX 5600 graphics board used in this exploration. Finally, the authors would like to thank the CITRIS Tele-immersion Group for providing the machine used to host the NVIDIA Quadro FX 5600 card. The card requires two PCI-Express power supply sources and around 12 inches of clearance; not many machines support these requirements.

References

1. “NVIDIA CUDA Compute Unified Device Architecture.” Available at http://developer.download.nvidia.com/compute/cuda/0_8/NVIDIA_CUDA_Programming_Guide_0.8.pdf

2. “NVIDIA CUDA CUBLAS Library.” Available at http://developer.download.nvidia.com/compute/cuda/0_8/NVIDIA_CUBLAS_Library_0.8.pdf

3. Linear Algebra PACKage (LAPACK). Available at http://www.netlib.org/lapack/

4. A. Stephin, Y. Lyssenko, and A. Shilov. “Directly Unified: Nvidia GeForce 8800 Architecture Review.” X-bit Laboratories. Available at http://www.xbitlabs.com/articles/video/display/gf8800.html

5. Intel Core 2 Duo Processor Technical Details. Available at http://www.intel.com/design/core2duo/documentation.htm