HPCG benchmark for characterising performance...

HPCG benchmark forcharacterising performance of SoC

devices

Rabi Javed Abbasi

Project report submitted for the degree ofMaster of Computing

The Australian National University

October 2015

Final version – 30 October 2015

c© Rabi Javed Abbasi 2015


Except where otherwise indicated, this report is my own original work.

Rabi Javed Abbasi30 October 2015


Abstract

HPCG (High Performance Conjugate Gradient) is an emerging benchmarkthat has rapidly gained importance as an efficient and reliable performanceevaluation tool. It was proposed to overcome the deficiencies with the widelyused benchmark, HPL, which was no longer considered to be representativeof actual system performance. HPCG implementations are however only intheir infancy and lack coverage for a number of device architectures and plat-forms.

System on Chips (SoC) are highly integrated systems which combine hard-ware elements of a conventional computing system onto a single chip. Witha shift in computing trends towards low cost, low powered computing, SoCshave gained immense importance.

We aim to develop a benchmark for SoCs which can fully utilize the hard-ware capabilities of the devices through the parallel programming model.Two popular and competing APIs for achieving this are Cuda and OpenCl.We explore the HPCG algorithm in detail, providing context for Cuda andOpenCl based implementations. We evaluate the performance and powerefficiency of the implementations on two SoC devices and assess its effective-ness as a benchmarking tool.

We find that SoC devices can benefit from offloading work to accelerators butare bound by the limiting memory bandwidth of the systems. We observeimproved power efficiency, making the benchmark relevant as an evaluationtool for SoCs.

v


vi


Contents

Abstract v

1 Introduction 11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Project Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Background and Motivation 32.1 HPL Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.3 High Performance Conjugate Gradient (HPCG) . . . . . . . . . 4

2.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.3.2 Problem Size . . . . . . . . . . . . . . . . . . . . . . . . . 52.3.3 Problem Description . . . . . . . . . . . . . . . . . . . . . 62.3.4 Execution Flow . . . . . . . . . . . . . . . . . . . . . . . . 8

2.4 System on Chip (SoC) . . . . . . . . . . . . . . . . . . . . . . . . 92.5 Programming SoC Devices . . . . . . . . . . . . . . . . . . . . . 10

2.5.1 CUDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102.5.2 OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.6 Related HPCG Publications . . . . . . . . . . . . . . . . . . . . . 112.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3 Design and Implementation 133.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133.2 Modeling the Problem . . . . . . . . . . . . . . . . . . . . . . . . 14

3.2.1 Assumptions/Data Access rates . . . . . . . . . . . . . . 143.2.2 Problem Size . . . . . . . . . . . . . . . . . . . . . . . . . 143.2.3 Hardware Architecture A . . . . . . . . . . . . . . . . . . 153.2.4 Hardware Architecture B . . . . . . . . . . . . . . . . . . 163.2.5 Hardware Architecture C . . . . . . . . . . . . . . . . . . 173.2.6 Model Summary and Deductions . . . . . . . . . . . . . 17

3.3 Implementation of parallel HPCG algorithm . . . . . . . . . . . 193.3.1 Sparse matrix vector multiplication (SPMV) . . . . . . . 193.3.2 Scaled Vector Addition (WAXPBY) . . . . . . . . . . . . . 203.3.3 Dot Product . . . . . . . . . . . . . . . . . . . . . . . . . . 213.3.4 Multi Grid Preconditioner (Compute MG) . . . . . . . . 22

vii


viii Contents

3.3.4.1 Restriction . . . . . . . . . . . . . . . . . . . . . 233.3.4.2 Prolongation . . . . . . . . . . . . . . . . . . . . 233.3.4.3 Gauss Seidel Preconditioner . . . . . . . . . . . 243.3.4.4 Level Scheduling . . . . . . . . . . . . . . . . . 243.3.4.5 Multigrid coloring . . . . . . . . . . . . . . . . . 25

3.3.5 Integration with HPCG . . . . . . . . . . . . . . . . . . . 263.3.6 Unit/Integration Testing . . . . . . . . . . . . . . . . . . . 27

3.4 Optimization Summary . . . . . . . . . . . . . . . . . . . . . . . 273.5 CUDA vs OpenCL development . . . . . . . . . . . . . . . . . . 293.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4 Results and Findings 314.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.1.1 Hardware Systems . . . . . . . . . . . . . . . . . . . . . . 314.1.2 Test Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324.2.1 Performance Comparison . . . . . . . . . . . . . . . . . . 324.2.2 Run time contribution of Algorithms . . . . . . . . . . . 334.2.3 Individual Speedup of Algorithms . . . . . . . . . . . . . 344.2.4 Problem Size Scaling . . . . . . . . . . . . . . . . . . . . . 36

4.2.4.1 Relative Scaling of the Benchmark . . . . . . . 364.2.4.2 Relative Scaling of Algorithms . . . . . . . . . . 37

4.3 Overall Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374.4 Power Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . 394.5 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . 40

5 Conclusion and Future Work 415.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415.2 Recommendations for future work . . . . . . . . . . . . . . . . . 42

6 References 43

Appendix A Software Artifacts and Results 46A.1 Software Artifacts . . . . . . . . . . . . . . . . . . . . . . . . . . . 46A.2 Results (Jetson) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48A.3 Results (FireFly) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

Appendix B Project Proposal 50B.1 Project Description . . . . . . . . . . . . . . . . . . . . . . . . . . 50B.2 Project Contract . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

Appendix C ReadMe File 53


List of Figures

2.1 Average time to completion for HPL benchmark over the last10 years (ICL 2014). . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.2 HPCG, 27 point sparse stencil matrix (Rubel 2009) . . . . . . . . 62.3 Programming model employed by CUDA and OpenCL. . . . . 11

3.1 Architecture A: Serial computation. . . . . . . . . . . . . . . . . 153.2 Architecture B: GPU access via PCI-Express bus. . . . . . . . . . 163.3 Architecture C: GPU access via main memory. . . . . . . . . . . 173.4 Local reduction of Dot product on GPU . . . . . . . . . . . . . . 223.5 Level Scheduling on two threads with barrier synchronization. 253.6 Graph coloring using min-max approach. . . . . . . . . . . . . . 26

4.1 Time to completion for increasing domain size (Jetson). . . . . . 334.2 Time to completion for increasing domain size (FireFly) . . . . 344.3 Time distribution on Jetson for domain size of 32. . . . . . . . . 344.4 Run-time comparison of algorithms on Jetson, for max domain

size of 80. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354.5 Run-time comparison of algorithms on FireFly, for max domain

size of 48. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364.6 Time to completion with algorithm wise breakdown (Jetson). . 384.7 Time to completion with algorithm wise breakdown (FireFly). . 384.8 Serial vs parallel performance comparison (Jetson). . . . . . . . 394.9 Serial vs parallel performance comparison (FireFly). . . . . . . . 39

A.1 CUDA Artifacts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46A.2 OpenCL Artifacts . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

ix


x LIST OF FIGURES


List of Tables

2.1 Features and differences of OpenCL and CUDA. . . . . . . . . . 12

3.1 Computations and memory transfers required by HPCG fordomain size of 323 . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.2 Calculated/theoretical execution times for Architecture A. . . . 163.3 Calculated/theoretical execution times for Architecture B. . . . 173.4 Calculated/theoretical execution times for Architecture C . . . 183.5 Comparison of execution times on different architectures. . . . 18

4.1 Hardware specification for selected SoCs . . . . . . . . . . . . . 324.2 Problem Scaling for Jetson and Firefly systems. . . . . . . . . . 374.3 Energy consumed in Joules, for a domain size of 80 . . . . . . . 40

xi


xii LIST OF TABLES


Chapter 1

Introduction

1.1 Introduction

High Performance Conjugate Gradient is an emerging benchmark which wasproposed with the aim of introducing a relevant and effective tool for per-formance evaluations. It overcomes the deficiencies posed by HPL: a widelyrecognized benchmark which gained prominence over 15 years ago but haslost relevance in recent years due to change in data access patters of modernapplications. HPCG uses a preconditioned conjugate gradient (PCG) methodwith a local symmetric Gauss-Seidel pre-conditioner to record and evaluatethe system performance. HPCG has not reached maturity and lacks widescale implementations, limiting its use over varying system architectures andplatforms.

A System on Chip is an integrated circuit that combines components of aconventional computer onto a single chip. The integration of Graphics Pro-cessing Units (GPU) or Digital Signal Processors (DSP) with the CPU canresult in large performance benefits, where the CPU performance is limitedby the architecture. Due to the increasing importance of SoCs, we want todevelop a benchmark that is able to efficiently utilize the hardware featuresof the devices. This can be achieved with the parallel programming model.We will look at two competing APIs, CUDA and OpenCl for harnessing thecomplete computational power of the systems.

The aim of this report is to develop OpenCl and CUDA based implementa-tions of the HPCG benchmark, which are able to utilize on-board acceleratorsfor the benchmark computation. We present an in depth evaluation of thebenchmark algorithm and develop a design for efficiently parallelizing theproblem. A detailed performance and power comparison of the OpenCl andCUDA based implementations is also presented.

1


2 Introduction

We deduce that performance gain for the implementations can be achieved byCPU offloading but is highly dependent on the memory bandwidth and ar-chitecture of the device hardware. Due to the high energy efficiency achieved,we conclude that the implementation is a relevant tool for characterizing SoCdevices.

1.2 Project Outline

Chapter 2 provides the background and motivation for the project.Chapter 3 details the HPCG performance model and the implementation ofthe parallel benchmark.Chapter 4 lists a comprehensive analysis and comparison of the results.Chapter 5 concludes the report and presents possible future work.


Chapter 2

Background and Motivation

Section 2.1 provides the background on HPL (High Performance Linpack)and its shortcomings, forming the motivation behind the project. In Sec-tion 2.3 we discuss the HPCG benchmark in detail. We introduce the readerto System on Chip (SoC) devices and GPU computing in Sections 2.4 and 2.5respectively. We conclude the chapter by listing previous related work on thesubject.

2.1 HPL Benchmark

HPL is a scalable and freely available implementation of Linpack and hasbeen widely used as a yardstick to report the top 500 supercomputer ranks(Strohmaier, 2015). The benchmark solves a uniformly random system oflinear equations and reports time and floating-point execution rates using astandard formula for operation count (Dongarra 1979). Having been writtenin ANSI C and without any external dependencies, it is highly portable. Someof the important features of the HPL algorithm are as follows.

1. Distributed 2 dimensional matrix blocks.

2. LU factorization with varying depths of look-ahead.

3. Recursive factorization.

4. Six panel broadcasting variants and swap-broadcast algorithm for re-ducing bandwidth drain.

5. Problem solving through backward substitution.

3


4 Background and Motivation

2.2 Motivation

HPL gained importance in the early 1990s, when there was a strong cor-relation between the predicted and actual application performance (Heroux2013). However, in recent years, HPL has been criticized as being only par-tially representative of the system performance for a large number of scien-tific applications and computing systems. A number of factors have beenhighlighted for the ineffectiveness of HPL, as a tool for ranking modern ap-plications.

1. There has been a drastic change in computer architectures and applica-tion usage patterns since HPL was introduced. As computer vendorscompete to achieve higher HPL performance for the next generation ofsystems, it can lead to architectures that do not perform well on realworld applications but have higher benchmark rating.

2. Modern application usage can be distributed into two categories. Type 1patterns represent dense matrix multiplications which show high stream-ing data access. Type 2 patterns consist of recursive computations withirregular data access patterns. While HPL can efficiently evaluate Type1 patterns, it is unable to assess Type 2 patterns of application usage(Sandia 2013).

3. Only assessing achievable CPU performance is not a sufficient represen-tation of the performance and a higher emphasis on local bandwidthand network performance should be highlighted.

4. The time to complete the HPL benchmark increases with the number ofcores on the system due to inherent scalability issues. Figure 2.1 showsthe current run-time of HPL, for the top 500 computers list. For somesystems it may take up to 100 hours to complete execution.

5. HPL does not efficiently probe the architecture of the system which de-motivates vendors, designing systems for increased HPL performance.

2.3 High Performance Conjugate Gradient (HPCG)

2.3.1 Overview

HPCG benchmark was introduced to overcome the deficiencies of HPL witha focus on the issues mentioned in Section 2.2. The HPCG algorithm at-tempts to solve a system of linear equations, Ax = b given an initial guessof x0 (Sandia 2013). The problem is solved using domain decomposition. It


§2.3 High Performance Conjugate Gradient (HPCG) 5

Figure 2.1: Average time to completion for HPL benchmark over the last 10years (ICL 2014).

is discretized as a 27 point, three dimensional heat diffusion problem on asparse semi regular grid. The benchmark generates preconditioned conjugategradient iterations, using a two step Gauss-Seidel sweep.

2.3.2 Problem Size

The problem domain is three dimensional. The global domain of Mx, My, Mzin the x, y, z dimensions is distributed into a local domain of Nx, Ny andNz. Each local sub grid is assigned to a process. The total number of pro-cesses are provided as input, or detected at run-time. These processes aredistributed into a three dimensional domain, where the total number of pro-cesses P = Px ∗ Py ∗ Pz. The global domain Mx, My, Mz can be calculated asPx.Nx, Py.Ny, Pz.Nz. A restriction is also imposed on the domain for accu-racy and efficiency purposes. Minimum grid size in each dimension must beat least 16 and the grid should be a multiple of 8.

As an example consider a local domain size of 48 (Nx), 25(Ny), 20(Nz) exe-cuted with 32 MPI processes.

P (processes) = 4, 4, 2 (4 ∗ 4 ∗ 2 = 32)M (global domain) = 192(Mx), 100(My), 40(Mz)Total number of equations (TE) = 768000 (192 ∗ 100 ∗ 40)Number of equations per process = 24000 (TE/P)∗Total number of non zeros = 20736000 (27 ∗ TE).

∗This is only an approximation. The exact number of zeros will be less than



Figure 2.2: HPCG, 27 point sparse stencil matrix (Rubel 2009)

the indicated number due to the boundary points.

2.3.3 Problem Description

The benchmark constructs a 3 dimensional 27 point stencil matrix such thateach point, represented by i, j and k depends on the value of the surrounding26 points. The properties of the linear system of equations are as follows :

The generated matrix (A) is sparse and has at least 27 non zero values for in-terior points. The boundaries have 7 to 18 non zero points depending on theboundary position. Matrix A, is symmetric with positive eigenvalues (sym-metric positive definite). It is stored in Compressed Sparse Row format (CSR).

Figure 2.2 shows a corresponding 27 point stencil matrix where the centralpoint (highlighted in red) is dependent on the surrounding 26 points.

The algorithm listed below gives a detailed step by step execution for thesolution of the equation Ax = b where A is a real, symmetric, positive-definitematrix and x0 is the approximate initial solution.


§2.3 High Performance Conjugate Gradient (HPCG) 7

Initializationp(0) = x(0)r(0) = b − Ap(0) ( r = residual vector )i = 0

Loop ( i: 1...n ) (Start Iterations)

if( preconditioning is active )z(i) = M−1 ∗ r(i − 1) (Apply preconditioner)

else {z(i) = r(i − 1) (Copy residual vector to z)

if (i == 1) { (do this only for the first iteration)p(i) = z(i) (copy preconditioned vector to pi)α(i) := dotProduct(r(i − 1), z(i))

}else { (iterations)

α(i) := dotProduct(r(i − 1), z(i))(i) := (i)/(i − 1)p(i) := (i) ∗ p(i − 1) + z(i) (waxpby)

}end if

end if

ap(i) = A ∗ p(i) (spmv)α(i) := dotProduct(r(i − 1), z(i))/dotProduct(p(i), ap(i))x(i + 1) := α(i) ∗ p(i) + x(i) (waxpby)r(i) := r(i − 1)α(i) ∗ A ∗ p(i) (waxpby)

End Loop

∗spmv = sparse matrix vector multiplication∗waxpby = scaled vector addition

From the above algorithm we can identify four major steps in the executionprocess.

1. MG (Multi Grid Preconditioning) : The preconditioner uses a symmetricGauss-Seidel smoother with two step sweeps. The forward and back-ward sweeps act as sparse triangular solvers and apply the approxima-



tion to each element one row at a time.

2. SPMV (sparse matrix vector multiplication) : This operation collects thevalues of neighbouring points and proceeds to multiply the vector withthe input vector to produce a result.

3. WAXPBY (scaled vector addition) : Computes the update of a vectorwith the sum of two scaled vectors.

4. Dot Product: Computes the sum of the products of the correspondingentries of the two vector matrices consisting of double precision num-bers. An inner local product is first computed which is then reducedglobally to produce the final dot product value.

2.3.4 Execution Flow

The program execution is divided into 5 discrete steps which include theproblem generation, testing and execution. The detailed steps for the bench-mark execution are as follows:

1. Problem Setup

(a) Generate Geometry : Calculate the domain and problem size (Referto Section 2.3.2).

(b) Generate Problem: Construct the symmetric positive definite ma-trix. Populate the associated sparse row format data structures(Values ,Row/Column Indices).

(c) Setup Halo: Setup data structures for exchange of data.

(d) Initialize Sparse CG Data: Setup data structures for precondition-ing.

(e) Optimize Problem: Perform optimizations to the matrix layout andstructures. This includes optimizations for faster memory access ofdata points within the matrix, or rearranging the matrix to comple-ment data parallelism.

2. Testing

(a) Spectral Test: Temporarily convert the matrix into a diagonal ma-trix by multiplying the diagonal values by factor of 106. The mod-ified matrix should then converge in approximately 10 to 12 itera-tions.

(b) SPMV : Compare the results of SPMV with the solution vector.


§2.4 System on Chip (SoC) 9

(c) Symmetry: The scalar products for SPMV and SYMGS are com-puted, which represent the departure from symmetry for the givenmatrices.

3. Reference Run

(a) Run the reference algorithm for 50 iterations. The residual valueachieved by reference run should be the same as calculated by theoptimized algorithm. However the optimized algorithm may ormay not converge in the same number of iterations.

4. Optimized Problem Setup

(a) Run the optimized algorithm and record the time and number of it-erations it takes for it to achieve the same residual as with referencerun.

(b) Using the time to completion for a single run, compute the totalnumber of sets that will be required. The number of set requiredcan be calculated as: No of runs = Total Time / Time for one itera-tion.

(c) If the residual drop, of reference and optimized algorithms is thesame, the convergence will occur in approximately 50 iterations foreach set. If this value is not achieved then the optimized algorithmis not valid for the benchmark calculations.

5. Benchmarking and Reporting

(a) Run the optimized algorithm for the total number of sets requiredto consume the total execution time.

(b) The residual value of each set is recorded. Small variations in theresidual are allowed which may occur from rounding off or slightchange in precision of data structures.

(c) Two reporting files are generated. The report indicates if the algo-rithm completed successfully as well as the residual for each set.

2.4 System on Chip (SoC)

A system on chip device integrates the components of a conventional com-puting system on a single chip. Along with a CPU, a SoC is integrated withDSP/GPU and other standard I/O peripherals. Due to the high level of inte-gration, it consumes considerably lower power, in comparison to conventional



systems. Reducing the number of physically distributed chips results in a de-vice that is low cost. These systems are used widely in consumer electronics,mobile and embedded systems market. SoCs are evolving to achieve higherperformance while decreasing power consumption, increasing their use andimportance in the computing industry.

Due to the low performance of central processing units, computation on thesedevices is done by offloading tasks to the DSP/GPU. Two competing APIs forparallel processing, CUDA and OpenCL are used for harnessing the availablecomputational power of the systems.

2.5 Programming SoC Devices

Accelerators/GPUs have evolved into highly parallel multi-core systems al-lowing efficient manipulation of large blocks of data at the cost of memorytransfers. Due to the constraints on the available memory and performanceof a single computation unit, accelerators on the integrated hardware shouldbe utilized to realize the full potential of the device. For developing a parallelimplementation and for it to be applicable to a range of devices, an OpenCLas well as CUDA version of parallel HPCG was developed. Details of the twoprogramming APIs are listed in Section 2.5.1 and Section 2.5.2 .

2.5.1 CUDA

Compute Unified Device Architecture (CUDA) is an API introduced by NVIDIAin 2006, for general purpose GPU programming. It has been widely adoptedin the industry for graphic processing as well as scientific computation. CUDAis designed to work with C, C++ and Fortran (Owens 2013). It providesa number of extensions and performance enhancements over other paral-lel programming APIs, for Nvidia devices. The execution model of CUDAis divided into Grids, Blocks and Threads. A grid is a collection of blocks.Blocks are subdivided into threads, where each block maps to a multiproces-sor within the GPU. A thread is a single execution instance and one or morethreads may be handled by a streaming multiprocessor.

2.5.2 OpenCL

Open Computing Language (OpenCL) is a framework for writing programsthat execute across heterogeneous platforms consisting of central processingunits (CPUs), graphics processing units (GPUs) and digital signal proces-sors (DSPs). OpenCL is an open standard maintained by the Khronos Group


§2.6 Related HPCG Publications 11

Figure 2.3: Programming model employed by CUDA and OpenCL.

(David Jin 2011). The execution model of OpenCL is divided into Workitems,Workgroups and Threads. A work item is a single kernel execution instance. Awork group is a set of work items which in turn consists of multiple threads.

Figure 2.3 shows a visual concept of the OpenCL and CUDA constructs. Themajor features and differences of OpenCl and CUDA are highlighted in Ta-ble 2.1.

2.6 Related HPCG Publications

A study was published by Nvidia, titled, "Optimizing the High PerformanceConjugate Gradient Benchmark on GPUs" (Everett Phillips 2014). The bench-mark was converted to a Cuda 6.5 compatible version, using the proprietarycudaSparse library extension. Nvidia adopted a three step process for theoptimization of the HPCG benchmark which included, matrix reordering,creating custom kernels and converting the matrix to Ellpack format, usingCUDA extensions. They reported minor gains for the SPMV, Dot Productand WAXPBY algorithms and a large performance gain for the SYMGS pro-cedure, for a single iteration. The overhead for the parallel version was thegreatest for Dot product and least for SPMV. The source code and detailedmethodology of the implementation was not released by Nvidia and there-fore a comparative analysis could not be done for SoC devices.



Feature OpenCL Cuda

Pinned MemoryAvailable through Map

BufferNative support

CPU Support OpenCL CPU Device No CPU Device

Vendor Support Industry Wide Nvidia Only

C/C++ Language

SupportYes Yes

DriversNot all vendor drivers are

matureMature drivers

PackagesFewer built in functions

compared to Cuda

More built in

functions and libraries

+ Developer support

Table 2.1: Features and differences of OpenCL and CUDA.

Intel and IBM, in collaboration published a paper titled, "Optimizations ina high-performance conjugate gradient benchmark for IA-based multi andmany-core processors" (Park 2015). The HPCG benchmark was optimized formulti core Intel Xeon processors and many core Xeon Phi coprocessors. Theoptimizations included fusion of multiple algorithms to achieve higher band-width and enhanced task scheduling. The implementation was evaluated inclusters for various configurations and high performance efficiency was seenfor low-diameter high-radix network topologies.

2.7 Summary

HPL is a widely used benchmark which has lost relevance due to the diver-gence between the practical and actual performance results. This has pro-vided grounds for the proposal of an enhanced benchmark. HPCG is anemerging benchmark which has shown promising results, however its cur-rent implementations are limited to a small number of hardware architec-tures. Due to the emergence and increasing importance of SoC devices, weaim to implement parallel Cuda and OpenCl versions of HPCG, in accordancewith the SoC programming model.


Chapter 3

Design and Implementation

We discuss the steps towards building parallel Cuda and OpenCl versionsof HPCG in detail. Section 3.2 constructs a theoretical performance modelfor HPCG, followed by a detailed explanation of the implementation in Sec-tion 3.3. The optimizations introduced to the algorithms are highlighted inSection 3.4.

3.1 Design

An iterative design methodology was used for developing the parallel HPCGsoftware. A base version of the algorithm was developed and incrementalperformance improvements were introduced with each iteration. The imple-mentation was completed in five major steps:

1. Modeling the problem.

2. Implementation of parallel HPCG algorithms.

3. Integration of algorithms with HPCG benchmark.

4. Unit and Integration testing.

5. Introducing optimizations.

The implementation of parallel algorithm focused on the four main modules,namely SPMV, WAXPBY, Dot Product and MG. The MG algorithm internallyinvokes three additional modules that can be implemented on the GPU (Re-striction, Prolongation and SYMGS). Refer to Section 3.3 for details.

13


14 Design and Implementation

3.2 Modeling the Problem

HPCG is not computationally expensive but requires large number of mem-ory transfers due to value gathering and reduction operations. We want to in-vestigate how HPCG performs theoretically on different hardware platformsand if it is a good candidate for parallelization.

Three models have been presented below. The first model, as shown in Fig-ure 3.1, depicts an architecture which does not have a GPU unit. This modelassumes the execution of the serial HPCG algorithm. Model two (refer toFigure 3.2) is for a conventional system architecture, where the GPU is con-nected to the CPU via the PCI express bus. The third architecture as shown inFigure 3.3, considers the case where the GPU is connected to the CPU via themain memory. This design is closest to what a SoC architecture may look like.Model two and three assume that the CUDA/OpenCL HPCG benchmark willbe executed on them. The listed models evaluate the execution time for DotProduct, WAXPBY and SPMV algorithms. The MG algorithm has not beenincluded due to the complexities involved in parallelization.

3.2.1 Assumptions/Data Access rates

1. CPU = Clock: 2Ghz, Cores: 1

2. CPU Memory = Type : DDR3, Clock: 1066Mhz, Bus width: 64 bits,Channels: 2, *Bandwidth: 17.06 GB/s, Latency = 7 ns

3. GPU = Multiprocessors: 16, Cores: 32 per SM, Clock : 1.54 Ghz, 2 Multunits / SM

4. GPU Memory = Type: GDDR5, Clock: 1375 Mhz, Bus Width: 256 bits,Channels 2, *Bandwidth: 88 GB/s, Latency = 15 ns

5. PCI Express = Type : PCI-E 3, *Bandwidth : 15.754 GB/s, Latency :(Read request + Read Completion) = 240 ns (2)

*Bandwidth = Clock x Channels x Bus WidthFor simplicity we assume blocking read/write to memory.

3.2.2 Problem Size

A domain size of 323 has been considered for modeling the problem. Table 3.1shows the associated operations and data transfers required.


§3.2 Modeling the Problem 15

Type SPMV WAXPBY Dot Product

Vectors/MatrixRead(A, A index,

A length), Write(y)Read (x,y) Write(z)

Read (x,y),

Write (z)

Memory Read26,2144 + 78,6432

bytes52,4288 bytes 52,4288 bytes

Memory Write 26,2144 bytes 26,2144 bytes 8192 bytes

Operations27 ops x (3 LD,

1ST, 2Add 1 Mult),

Exchange Halo

2 LD, 1 ST, 1 Mult2 LD, 1ST,

1 Mult, 1 Add,

Reduce

Table 3.1: Computations and memory transfers required by HPCG for domainsize of 323

3.2.3 Hardware Architecture A

We consider the architecture where the CPU is not integrated with a GPUunit. The processing of the HPCG algorithm is done serially. Figure 3.1 pro-vides a visual representation of the architecture. Table 3.2 lists the theoreticalexecution time in milliseconds, for each phase of the transfer/calculation.

Figure 3.1: Architecture A: Serial computation.



Transfer Type SPMV (ms) WAXPBY (ms) Dot Product (ms)

Memory to CPU 0.0461 0.0307 0.0307

CPU to Memory 0.0154 0.0154 0.0005

Computation 0.1016 0.0492 0.0743

Total Time 0.1631 0.0953 0.1055

Table 3.2: Calculated/theoretical execution times for Architecture A.

3.2.4 Hardware Architecture B

The second models considers the case where the CPU and GPU are connectedvia PCI-Express bus. This model represents the architecture of most conven-tional computing devices. A memory copy to the GPU has to be transferredover 3 busses, as shown in Figure 3.2. Table 3.3 provides the calculated exe-cution time for each phase of the process.

Figure 3.2: Architecture B: GPU access via PCI-Express bus.


§3.2 Modeling the Problem 17


CPU to GPU 0.1056 0.0701 0.0701

GPU to CPU 0.0352 0.03520.0013 + 0.035

(reduction)

GPU to GPU memory 0.0150 0.0089 0.0031

Computation Time 0.0067 0.00010.0042 + 0.03

(reduction)

Total Time 0.1625 0.1144 0.1106

Table 3.3: Calculated/theoretical execution times for Architecture B.

3.2.5 Hardware Architecture C

Architecture C, examines the case where the CPU and GPU are connectedthrough the main memory. A direct transfer between the GPU and mainmemory can occur using DMA controllers, reducing transfer time overhead.This can also result in gains due to the high transfer rate between the twomemory devices. Figure 3.3 presents this model in detail. The execution timecalculations are shown in Table 3.4.

Figure 3.3: Architecture C: GPU access via main memory.

3.2.6 Model Summary and Deductions

It can be seen from Table 3.5, that the parallel HPCG algorithm does not re-sult in any performance gains for system architectures where the CPU andGPU are connected via the PCI bus. The large number of memory transferson the PCI bus act as a bottleneck due to limited bandwidth and high latency




CPU to GPU 0.0461 0.0308 0.0308

GPU to CPU 0.0154 0.01540.0005 + 0.017

(reduction)

GPU to GPU memory 0.0150 0.0089 0.0031

Computation Time 0.0067 0.00010.0042 + 0.03

(reduction)

Total Time 0.0832 0.0552 0.0856

Table 3.4: Calculated/theoretical execution times for Architecture C

Architecture SPMV (ms) WAXPBY (ms) Dot Product (ms)

A (CPU only) 0.1631 0.0953 0.1055

B (GPU via PCI) 0.1625 0.1144 0.1106

C (GPU via Memory) 0.0832 0.0552 0.0856

Table 3.5: Comparison of execution times on different architectures.

rates, resulting in parallel execution time that is worse than the serial execu-tion time.

From the provided model we can estimate that the HPCG benchmark willpotentially benefit from the use of an accelerator in limited number of casesas highlighted below:

1. The host system is compute performance limited, such that the compu-tational time exceeds the memory transfer time from the host to device,by a factor greater than 1. In this case, performing computations locallyon the CPU will be significantly more expensive and we can benefitfrom the faster parallel processing on the GPU.

2. The HPCG algorithm is executed for a large problem size, given thetime for benchmark execution remains constant. The number of execu-tion sets will decrease as a single iteration will consume more time forcompletion. Therefore fewer overall memory transfers from the CPU toGPU and back will be required, essentially making the problem morecomputationally bound.

3. The host system and the GPU are directly connected to the memory


§3.3 Implementation of parallel HPCG algorithm 19

interface therefore enabling the GPU to have direct access to the hostmemory. This eliminates the need for expensive Host to GPU and GPUto Host memory transfers over the bandwidth limited PCI bus.

3.3 Implementation of parallel HPCG algorithm

The main focus of the parallelization are the procedures inside the conju-gate gradient iterations, as listed in the HPCG algorithm description (refer toSection 2.3.4). The details and implementations of the SPMV, WAXPBY, Dot-Product, Prolongation and Restriction kernels are highlighted in this section.The SYMGS procedure or the Gauss Seidel preconditioning, serves as an in-teresting problem for parallelization, however the complete implementationof the module was out of the scope of this project. A description of the prob-lem and potential solutions for parallelizing SYMGS have been provided, as areference for future work and implementations. The algorithms listed belowwere implemented using Cuda and OpenCL APIs, resulting in two separateversions of the HPCG benchmark.

3.3.1 Sparse matrix vector multiplication (SPMV)

SPMV solves the equation y = Ax, given the sparse matrix A stored in com-pressed row format. Input vector x and output vector y are both dense ma-trices. SPMV consists of nested loops where the inner loop traverses over thenumber of items per row, which in the case of HPCG will always be 27 orless. The outer loop traverses over the size of the input vector. The productAx for each value of the row is summed up and assigned to the y index.Since there are no row based dependencies, the SPMV algorithm can be par-allelized. Each row is assigned to a separate thread, summed up over a localshared variable and assigned to the y matrix index, matching the thread id.

A number of optimizations were made to make the computation more ef-ficient. Matrix A and corresponding CSR tables were converted to singledimensional matrices for faster and aligned access to data. The code wasmodified for vector calculations to achieve higher level of concurrency. Thiswas done by processing the elements of a row in chunks of four vector op-erations and by loop unrolling. Since the number of elements of a row canvary and will not always be divisible by 4, the left over rows were identifiedand summed up separately before assigning the final value to the output, Ymatrix.



The sparse matrix A and CSR tables were also pre-loaded into the GPU mem-ory, to avoid repetitive communication time from copying data to the GPUfor every iteration. This enabled the SPMV to run with one Device to Hostand Host to Device data transfer. Since Matrix A is never modified duringthe iterations, this did not affect the results. The Device to Host transfer wasdone in non blocking mode to avoid excessive communication time overhead.The Host to Device data transfer however had to be done in blocking modeso that the data was available immediately for further calculations.

Pinned memory, can be used in OpenCL which prevents memory from be-ing swapped out, providing higher throughput. However the use of pinnedmemory was not found beneficial in this case, as the frequent memory alloca-tion operations, were found to be more expensive compared to conventionalallocations. A modification was also made to allow for a single thread to pro-cess more than one row by setting the elements per thread variable, to controlthe level of parallelism on the GPU. A simplified representation of SPMV hasbeen presented below:

1. Assign global index as thread numbers : i = getGlobal Id

2. Calculate and assign offset : o f f set = i ∗ itemsPerThread

3. Loop over the offset : xo f f

4. Get matrix A, row values : ∗rowVals = A.matrixValues[xo f f ]

5. Get matrix A, element indices : ∗rowIndex = A.mtxIndL[xo f f ]

6. Get matrix A, no of non zero elements : j = A.nonzerosInRow[xo f f ]

7. Iterate over j and sum up values (A.x) : sum = sum + rowVals[j] ∗x[rowIndex[j]]

8. Assign to matrix y: y[xo f f ] = sum

3.3.2 Scaled Vector Addition (WAXPBY)

Scaled vector addition is a simple operation that computes the output vectorby scaling the input vectors with a constant and performing an addition onthe values of the same index. Consider alpha and beta as double precisionscaling values, then the WAXPBY operation is represented by the equation:w(i) = alpha ∗ x(i) + beta ∗ y(i). There is no dependency between successiveiterations, as can be seen from the equation and therefore the algorithm canbe parallelized.



Most of the HPCG references to WAXPBY are made with an alpha or betavalue of 1.0. As an optimization, we can save up on a double precision multi-plication in the case where either of the alpha or beta values equate to 1.0, byhaving conditional branches in the algorithm. Since this is a computationallysimple operation we can use 8 vector operations per thread to compute theresult. A global items per thread variable was also used to allow for com-puting multiple iterations per thread. The algorithm requires two vectors xand y to be copied to the GPU in non blocking mode. The resulting outputvector is copied back to the host in blocking mode. The WAXPBY algorithmis presented below :


2. Calculate and assign offset : offset = i ∗ itemsPerThread


4. case alpha = 1 : x[xo f f ] + beta ∗ y[xo f f ]

5. case beta = 1 : alpha ∗ x[xo f f ] + y[xo f f ]

6. else : alpha ∗ x[xo f f ] + beta ∗ y[xo f f ]

3.3.3 Dot Product

Dot product is the sum of the products of the corresponding entries of twomatrices. It can be simply expressed as x.y = Summation(x(i) ∗ y(i)). Thereare no dependencies in the calculation of dot product.

Dot Product requires two reduction operations for the parallel implementa-tion. In the first phase, a local array is allocated to each work item/block. Thethreads multiply the x and y vectors and store them in corresponding indicesof the local array. The second phase reduces the local values into a singlevalue. This step can be better explained by the diagram shown in Figure 3.4,which shows an array of 14 items as an example. In each successive iterationthe threads sum the values such that the size of the array is decreased by halfand eventually is reduced to a single value. However it should be noted thatfor each iteration only half the threads are active, therefore the reduction timeis equivalent to the worst time taken by the last thread to complete execution.Multiple synchronization points have to be used for parallel computation ofthe dot product to ensure correctness, limiting the performance.

After the reduction phase we have an array of items equivalent to the size of



Figure 3.4: Local reduction of Dot product on GPU

the number of work items/blocks allocated on the GPU. The array is trans-ferred to the host. On the host, the final reduction has to be done in serial toproduce the final output.

Dot product requires a Host to Device transfer of the two x and y vectors,which is performed in non blocking mode. The Device to Host memory trans-fer size is limited to the maximum number of permissible blocks/work items,supported by the device and is set on initialization of HPCG. An improve-ment in device to host transfer time was achieved by using pinned memory,as frequent memory allocations were not required. Vector data types of size 4were also used. However this required the 4 vector values to be reduced intoa single summed value, before being copied back to the host.

In order to shrink the local reduction phase, multiple iterations were pro-cessed by each thread in the first phase by using an item per thread variableand calculating the value offsets. Therefore for the reduction phase, fewer val-ues needed to be combined. The value was tweaked for different hardwaresystems by hit and trial, to achieve the best possible results.

3.3.4 Multi Grid Preconditioner (Compute MG)

The HPCG benchmark uses a multigrid preconditioner to produce a dampingeffect and therefore reduce the error rates. The idea of the multigrid precon-ditioner is to represent the error from the initial grid on to a coarser grid,where the low frequency components of the vector become high frequencycomponents of the coarser vector x. It is completed in a number of steps :

1. Perform Gauss Seidel Preconditioning and compute the residual vector.

2. Restrict the residual vector to a grid spacing of 2s.

3. Perform Gauss Sidel Preconditioning to smooth the error.



4. Prolong the error correction vector back to grid spacing of s.

3.3.4.1 Restriction

The restriction operation is executed as part of the multi grid preconditioner.It transfers the residual matrix onto a coarse residual vector of spacing 2s.Given the fine grid matrix-vector product a f , fine grid vector r f and assumingan initial grid spacing of s, the coarse grid vector can be represented by rc(i) =a f (j) − r f (j). The operation is not computationally expensive but requireslarger number of data accesses. A disadvantage of the algorithm is that wecannot effectively use vector data types, as the index needs to be computedfrom an external array. However an offset was used to allow computation ofmultiple values per thread. The algorithm can be represented as follows :

1. Get MG fine grid vector : ∗a f = A.mgData− > Ax f− > values

2. Get MG coarse grid vector : ∗rc = A.mgData− > rc− > values

3. Get MG data indexes : ∗idx = A.mgData− > f 2cOperator




7. Calculate coarse grid vector : rc[i] = r f [idx[xo f f ]]− a f [idx[xo f f ]]

3.3.4.2 Prolongation

The prolongation operation is executed in succession to restriction and GaussSeidel smoothing. It transfers the error correction matrix, as computed fromthe Gauss Seidel conditioning, back to fine grid space of s. Given the finegrid sparse matrix object x f and the coarse grid correction vector xc, prolon-gation can be computed as x f (i) = x f (i) + xc(j). Similar to the restrictionoperation, the prolongation operation is memory bound. An explanation ofthe algorithm is provided below :

1. Get MG data indexes : ∗idx = A.mgData-> f 2cOperator

2. Get MG coarse grid correction vector : ∗xc = A.mgData->xc->values






6. Calculate the update : x f [idx[xo f f ]]+ = xc[xo f f ];

3.3.4.3 Gauss Seidel Preconditioner

Preconditioning is used by iterative solvers to obtain an equivalent systemfor the solution of A.x = b. The system should converge faster with thematrix M−1.A, compared to the original system, where matrix M−1 is an ap-proximation to A−1. HPCG uses a symmetric pre-conditioner and thereforecompletes the process in forward and backward sweeps. The matrix is de-constructed such that A = L + U + D, where D represents the diagonal, Lis the lower triangular component and U is upper triangular component ofmatrix A. The forward sweep then estimates: M f = D + L and the backwardsweep Mb = D + U. The symmetric Gauss Seidel preconditioner then can beexpressed as sgs = (D + L)D−1(D + U). The pseudo code for the forwarditerations is shown below. For a backward sweep the direction of the outerloop is reversed.

for i = 1 : N dosum = b[i];for j = 1 : NumberO f NonZerosInRow, do

sum = sum − A[i, j] ∗ x[columnIndex[j]]end forsum = sum + x[i] ∗ Diagonal[i] (Diagonal contribution)x[i] = sum/Diagonal[i]

end for

It can be seen that there is a direct dependency on the ordering of the matrixrows and therefore the problem cannot be directly parallelized. A parallelOpenCl version of SYMGS was created, however the parallel kernel could notbe executed without reordering of dependencies and would lead to incorrectresults. Two potential solutions have been proposed for achieving parallelism.Level scheduling exposes a small degree of parallelism and does not affect theconvergence rate of the preconditioner. Multigrid coloring on the other handexposes a larger degree of parallelism at the cost of higher convergence rates.

3.3.4.4 Level Scheduling

Level Scheduling relies on building a task dependency graph in order toachieve parallelism. A task dependency graph can be built by consideringthat for any non zero element (i, j), solving for an unknown in the i’th row,



will depend on non zero elements in the j’th column (Martin 2014). For ex-ecuting level scheduling, the level of a task is defined by its distance fromthe entry point. Each level of the task dependency graph can be executedin parallel or evenly partitioned among available threads. The levels will beexecuted one at a time with barrier synchronizations. Figure 3.5 shows howtasks are distributed into levels and allocated to threads.

Figure 3.5: Level Scheduling on two threads with barrier synchronization.

3.3.4.5 Multigrid coloring

Multigrid coloring is more efficient than level scheduling and can result inhigher degree of parallelism. The techniques explored draw inspiration fromLuby / Jones-Plassmans work on graph coloring (Jones and Plassman 1993).The objective of graph coloring is to assign colors to a matrix such that no twoconnecting elements have the same color. Given a matrix, we can attempt thisby assigning all the elements in the matrix a random number. We proceedby traversing over the elements and checking if an element is a local maxi-mum i.e has the highest assigned value among all the immediate connectedelements. If the element has the maximum value, we assign it a color, if notwe proceed to the next node. We only assign one distinct color per traversal,and perform multiple iterations until all the elements have been assigned acolor.

The algorithm can be made more efficient by assigning two colors per traver-sal, one for the local maximum and one for the local minimum. Using efficienthash functions can result in complete coloring of the matrix with further re-duced overhead. An example of graph coloring using min-max approach hasbeen presented in Figure 3.6 and can be completed in two iterations. Once the



Figure 3.6: Graph coloring using min-max approach.

colors have been assigned, we can proceed to process the values representedby each individual color in parallel.

For HPCG, since we have row based dependencies, we will consider eachrow of the matrix as a single node, for the purposes of graph coloring. Toreorder the rows, a permutation matrix is created from the colored matrixsuch that each color is treated as a separate key. Now the algorithm can beexecuted in parallel, one color at a time, with barrier synchronization betweenindividual colors, therefore exposing parallelism.

3.3.5 Integration with HPCG

The parallelized algorithms after being tested for accuracy had to be inte-grated with the HPCG algorithm so that they would be called and timedby the benchmark without inducing any irregularities. HPCG is written inC++ however Cuda and OpenCl libraries are more readily available in C andtherefore the parallel algorithms were coded in C.

The files were linked using object files to produce a single executable. C func-tion name linkages were used inside the C++ HPCG code. Two Makefileswere used for the project. The make file for the Cuda/OpenCl code compilesand copies the object files in a common HPCG directory. The main HPCGMakefile, initially invokes the external Cuda/OpenCl Makefile, compiles theHPCG code and produces the executable.


§3.4 Optimization Summary 27

3.3.6 Unit/Integration Testing

The modules were developed independently and tested, before integrationwith HPCG. The following unit test procedures were used to ensure correct-ness :

1. Initializing the input matrix to simple unit values and comparing theactual with expected output

2. Comparing the output with the result from equivalent serial version ofalgorithm. Diff checker was used to identify and inspect errors.

Integration testing was done in order to maintain accuracy and achieve pre-dicted HPCG convergence rates. The calculated residual value, from thereference phase, was compared with the actual value computed during thebenchmarking phase. For an error free run, the residual deviation should benegligible and a single set of computations should terminate after approxi-mately fifty iterations. The residual deviation for the presented results hasbeen listed in Appendix A for reference.

3.4 Optimization Summary

The following general optimization techniques were used to improve the per-formance and efficiency of the benchmarking algorithm:

1. Using vector operations: Vector data types can be used in OpenCl tofully utilize multiple compute units on the hardware. However thesedata types can only be used for simple computations such as divi-sion, subtraction and multiplication without invoking excessive over-head. The use of vector operations may require an additional step ofreduction, to combine the multiple data types into a single variable.

2. Moving the declaration of variables outside the loop: If a variable decla-ration is inside a loop, it may cause excessive overhead, as the variablehas to be reallocated on each loop iteration. Therefore moving the vari-able outside the loop can improve performance.

3. Pre-loading Matrices: Some matrices will not be updated for succes-sive iterations. Identifying and pre-loading matrices to GPU memory,instead of transferring them on each iteration can result in bus band-width savings and reduce memory transfer penalties, resulting in largeperformance gains.



4. Rearranging Matrices: Rearranging the matrices to allow unit stride ac-cess to memory, can greatly enhance the speed of load, store operations.This can result in better memory locality and eliminate the need for con-stant memory swapping, therefore improving efficiency. In the case ofHPCG rearranging the matrices to ensure row based access resulted inimproved performance.

5. Non Blocking Memory Transfers: We can mask the memory transferpenalties by triggering non blocking memory transfers and continuingcomputation. However it can only be used in cases where the trans-ferred data is not required for immediate use and will otherwise resultin incorrect results.

6. Storing frequently executed operations in a variable: if an operation a ∗ bis computed x number of times within a loop iteration, then instead ofperforming ten multiplications, we can store the outcome of a ∗ b andreplace the subsequent operations with the variable, therefore savingx − 1 multiplication operations.

7. Removing un-utilized loops: If an array is being initialized and subse-quently being assigned values, then the initialization is not required andits removal can save computation time.

8. Using pinned memory : Pinned memory can provide faster transferspeeds by not allowing the memory to be swapped out. However thiscomes at a cost, as it requires more time for the initial memory alloca-tion. Additionally, the completion of the memory transfer can only beguaranteed on the deallocation operation due to its non blocking nature,enforcing the need for repetitive allocation/deallocation operations.

9. Merging Loops: If two loops iterate over the same limits and the param-eters inside the loop are independent of each other, then the loops canbe merged, saving N branching operations.

10. Looped Initializations: Initializing arrays by looping over individualvalues can be computationally expensive. Memory copy operations canachieve the same initialization by using direct memory transfers andtherefore saving CPU cycles.

11. Unified memory : Unified Memory creates a pool of managed memorythat is shared between the CPU and GPU, bridging the CPU-GPU divide(Harris, 2013). It is accessible to both the Host and Device using a singlepointer. The availability of unified memory is architecture dependent.


§3.5 CUDA vs OpenCL development 29

12. Function in lining: Function in lining can reduce the overhead of call-ing another function, especially if the function is being called multipletimes. In HPCG, container functions are used to allow for convenienceof modifications. Removing the container functions and directly callingthe intended procedures can result in performance gain.

13. Branching vs Multiplication: If the output of a double precision multi-plication operation is expected to be constant value (multiplication by1 or 0), then we can save on the computations by using conditionalbranches. The gain on performance will be dependent on the architec-ture as pipelined multiplication can be faster to compute than a branch-ing operation on some hardware systems.

14. Register Variables: The register keyword can be used so that the pro-grammer can suggest to the compiler that particular variables shouldbe allocated to CPU registers for faster performance. However this doesnot guarantee faster execution, for example, if too many register vari-ables are declared and there are not enough registers available to storeall of them, then values of some registers will be moved to lower levelmemory.

15. Compiler Optimizations: C/C++ compilers will probe the code andattempt to perform thorough compiler optimizations. For the case ofHPCG, level O3 optimizations were used.

3.5 CUDA vs OpenCL development

We have already seen the differences between the CUDA and OpenCL APIsin Section 2.5. In context of HPCG, development on OpenCL was foundto be relatively more difficult compared to CUDA. This is primarily due tothe reason that the programmer has to keep track of low level constructsin OpenCL and efficiently utilize them. While this provides more controlover the hardware, it can also lead to hard to resolve errors. CUDA has asimpler instruction model. It also has a wider range of available tools andbetter vendor support. However the most significant drawback of CUDA isits limited hardware compatibility.

3.6 Summary

We developed a theoretical model for selected HPCG algorithms to show thatHPCG is a memory bound problem. We realized that performance gain from



a parallel implementation can be achieved, given some desirable features areavailable on the system hardware. A parallel CUDA and OpenCL basedversion of HPCG was implemented after thorough testing and integration.A number of optimizations were introduced to obtain higher performancegains.


Chapter 4

Results and Findings

In this chapter, we provide the performance results for the parallel HPCGbenchmark. The experiments allows us to have a better understanding ofhow the implementations perform on SoC devices. Section 4.1 provides thedetails of the selected devices and introduces the test case. We evaluate theoverall and individual performance of the HPCG algorithms in Section 4.2and Section 4.3. Section 4.4 provides a comparison of the energy consumedby the systems, followed by the conclusion.

Note: For convenience, the problem domain size in this chapter is indicatedas the cube root of the actual problem size (actual domain size = n3).

4.1 Experimental Setup

4.1.1 Hardware Systems

Two SoC devices were chosen for the performance evaluations. Jetson TK1 isan embedded platform, produced by Nvidia, with high memory bandwidthand CPU processing power. Firefly in comparison has lower CPU/GPU clockrate and bandwidth. A summary of the system hardware specifications isprovided in Table 4.1.

4.1.2 Test Case

We built the test case, to best fit the performance capabilities of the targethardware. Testing on Jetson was done with a domain range of 16 to 80 andsuccessive increments of 16. The iterations were terminated after the execu-tion time exceeded 60 seconds. The total number of sets required to fullyconsume the allocated time varied from 188 to 2 sets.

After initial sample run of HPCG on FireFly and realizing lower performance

31


32 Results and Findings

Name CPU GPU MemoryGPU

Driver

Jetson

TK1

Model: Cortex A15

Clock: 2.32GHz

Cores: 4 + 1

Model: GK20A

Clock: 852MHz

Type: DDR3L

Clock: 933MHzCUDA

FireFlyModel: Cortex A17

Clock: 1.8GHz

Cores: 4

Model: Mali-T764

Clock: 650MHz

Type: DDR3L

Clock: 792MHzOpenCL

Table 4.1: Hardware specification for selected SoCs

compared to Jetson, a smaller domain range of 16 to 48 was chosen, withsuccessive increments of 8. The iterations were concluded after 240 seconds.The run time was set such that at least 2 sets of 50 iterations were executedfor the largest domain size. The range of sets varied between 49 and 2.

The presented results were normalized against the number of sets for eachexecution.

4.2 Results

This section presents and analyzes the results obtained on the Jetson andFireFly hardware systems. We start by looking at the overall performancefollowed by a detailed examination of the individual HPCG algorithms.

4.2.1 Performance Comparison

The execution time for the serial and parallel algorithms had small observabledivergence on both hardware systems. A comparison of the computationtime for Jetson can be seen in Figure 4.1. The time to completion for theparallel version, is initially worse than the serial time due to the added cost formemory transfers. At a domain size of 56 the curves overlap i.e the memorytransfer penalties are averaged out by the computational time savings. For adomain size greater than 56 the time to completion increases at a slower ratein comparison to the serial version, therefore a divergence between the curvesand increase in performance can be observed.


§4.2 Results 33

Figure 4.1: Time to completion for increasing domain size (Jetson).

Figure 4.2 shows the graph for the serial and parallel computation time onthe FireFly system. Similar to Jetson, the time to completion is initially higherthan the serial time. For a domain size greater than 32 we can observe a gainin performance for the parallel version. Even though it may not be evidentfrom the Figure 4.2, because of the large range of the y axis but FireFly showsa higher initial divergence compared to Jetson. This is due to the lower mem-ory bandwidth of the system, which results in higher memory transfer delays.For larger domain sizes, the number of required memory transfers decreaseas the overall time to completion remains constant. This makes the problemmore computationally bound, resulting in performance gain.

4.2.2 Run time contribution of Algorithms

Before looking at the detailed performance comparisons we look at how eachalgorithm contributes towards the overall run time of the benchmark. Therun time contributions are shown in Figure 4.3 as a percentage of the time tocompletion, for a domain size of 32. It can be seen that the multi grid pre-conditioner has the highest execution time followed by sparse matrix multi-plication. Dot product and scaled vector addition only have a minor effecton the overall run time of the benchmark. The percentage contribution of thealgorithms approximately stay the same over different devices, for a domainsize of 32. From this, we can conclude that the MG algorithm is the a goodcandidate for computation over the GPU and can potentially lead to higher



Figure 4.2: Time to completion for increasing domain size (FireFly)

overall performance gain.

Figure 4.3: Time distribution on Jetson for domain size of 32.

4.2.3 Individual Speedup of Algorithms

We want to know how the individual algorithms perform in comparison tothe serial versions. This can help us analyze the performance bottlenecks and


§4.2 Results 35

further optimize the HPCG benchmark. A better and balanced utilization ofresources can therefore be achieved, by shifting memory bound algorithmsto the CPU. A speedup of 1.13 and 1.03 is achieved for MG and SPMV algo-rithms on Jetson which makes them viable for execution on the GPU. How-ever a speedup of 0.31 and 0.26 is obtained for WAXPBY and Dot Product,indicating that memory transfer penalties for these algorithms outweigh thecomputation time. This can be observed in Figure 4.4.

Figure 4.4: Run-time comparison of algorithms on Jetson, for max domainsize of 80.

Similar results are obtained on FireFly, with larger gains for SPMV. A speedupof 1.05 and 2.34 was achieved for MG and SPMV algorithms. The calculatedspeed up for WAXPBY and Dot Product was 0.09 and 0.12 respectively, whichis significantly lower than Jetson due to the lower performance specificationsof FireFly. A comparison of the time to completion for the algorithms on theFireFly system can be seen in Figure 4.5.

From the presented results, we conclude that the SPMV and MG algorithmsperform relatively well on the GPU. WAXPBY and Dot Product are highlymemory bound and do not scale well for GPU processing. We can ob-serve variances in the presented speedup figures, with change in domainsize. However the Dot Product and WAXPBY algorithms still maintain aspeedup smaller than one, indicating slower than serial performance, whilelarger gains for SPMV and MG can be seen.



Figure 4.5: Run-time comparison of algorithms on FireFly, for max domainsize of 48.

4.2.4 Problem Size Scaling

Measuring how the problem size scales can provide us information on howwell the individual algorithms perform and grow with increasing dimensions.It also provides us with a deeper understanding of how Jetson and Fireflysystems perform in comparison.

4.2.4.1 Relative Scaling of the Benchmark

The execution time of the benchmark on Jetson was approximately 15 - 18times faster than that on Firefly. However the problem scales better on Fire-fly, resulting in higher speedup. Table 4.2 shows the time to completion ofa single set on Jetson and Firefly, and the corresponding scaling factor. Thescaling factor is calculated as *SF = T/(T − 1). The theoretical scaling factorindicates the amount by which the problem size should ideally grow at eachstep.

It can be seen that for a problem size of 48, the time to completion grew bya factor of 4.64 for Jetson and only 3.61 for FireFly. Both devices scale worsethan the theoretical scaling, especially at higher domain sizes. This occursdue to the loss of performance resulting from caching effects and nonlinearscaling of the memory transfer time.


§4.3 Overall Results 37

Problem

Size

*Theoretical

Scaling

Factor

Wall Time

Jetson

(sec)

*Scaling

Factor

Jetson

Wall Time

FireFly

(sec)

*Scaling

Factor

FireFly

16 - 0.33 - 5.353 -

32 8 2.69 8.28 45.69 8.53

48 3.375 9.77 4.64 164.91 3.61

Table 4.2: Problem Scaling for Jetson and Firefly systems.

4.2.4.2 Relative Scaling of Algorithms

We now look at how the individual algorithms of the HPCG benchmark scale.On Jetson as the domain size grows, SPMV remains approximately constantas a proportion of the time to completion, whereas both Dot Product andWAXPBY decreased significantly (from 10% to 3%). MG increases in propor-tion from 60% to 75%, as seen in Figure 4.6.

The problem changes similarly on Firefly. Figure 4.7 shows a higher over-all contribution of the MG algorithm. SPMV decreases from 5% to 3% whileboth Dot product and WAXPBY decrease from 7% to 2%. Due to the higherrepresentation of the MG algorithm, a higher speedup and therefore betterscaling can be seen at larger domain sizes, as it negates the performance lossresulting from the WAXPBY and Dot Product algorithms.

4.3 Overall Results

To present the consolidated results, the Dot Product and WAXPBY moduleswere reverted to CPU computation, as no performance gain was observedwith CPU offloading (refer to Section 4.2.3). A comparison of the serial andparallel Floating Point Operations per Second (FLOPS), for Jetson have beenprovided in Figure 4.8. There was little deviation for the GFLOPS across dif-ferent domain sizes, however the rates still follow the observed trend of thenormalized time to completion. The performance rates were measured withinthe range of 0.2 and 0.25 GFLOPS.

It is interesting to note that while Jetson has a single precision peak per-formance of 300 GFLOPS, it can only attain a theoretical peak of 13 GFLOPSfor double precision operations. The obtained results were still a fraction ofthe peak value, due to the focus of HPCG on memory bound operations.



Figure 4.6: Time to completion with algorithm wise breakdown (Jetson).

Figure 4.7: Time to completion with algorithm wise breakdown (FireFly).

The performance measured on Firefly was lower, approximately by a factor of15, in comparison to Jetson. This can be seen in Figure 4.9. The floating pointoperations per second, varied between 11 to 15 MFLOPS. There is howevera higher performance improvement, for larger domain sizes. This is primar-ily due to the fact that FireFly has a lower performance CPU and thereforeoffloading operations to the GPU results in higher performance gain.


§4.4 Power Comparison 39

Figure 4.8: Serial vs parallel performance comparison (Jetson).

Figure 4.9: Serial vs parallel performance comparison (FireFly).

4.4 Power Comparison

SoC systems have a high desirability for power efficiency. It will be advan-tageous to see if computation on the GPU can also result in critical systempower savings. A comparison of the energy consumed for serial and parallelversions, with an algorithm wise breakdown have been presented in Table 4.3.The power readings were taken against a problem size of 80.



Energy savings of approximately 110 percent, were obtained on Jetson and44.5 percent on Fifefly. The MG algorithm, being the most computationallyexpensive, consumes the most energy.

The divergence between Jetson and Firefly can be explained by the hard-ware features of the CPU. The Cortex A15, quad core processor on Jetson iscoupled with a fifth power saving core. When the processor is not being fullyutilized, the quad-core cluster is deactivated, resulting in large savings. Thisis best observed from the large difference in readings for the SPMV algorithmon Jetson, compared to the relatively small gain obtained on FireFly.

The Cortex A17 CPU on FireFly is a lower performance chip, and consumesless power. However the overall energy consumption is higher compared toJetson, since the computation of the problem on FireFly takes longer.

System TypeSPMV

(J)

Dot

Product (J)

WAXPBY

(J)

MG

(J)

Total

(J)

JetsonSerial 4.93 0.13 0.17 25.4 30.6

Parallel 1.06 0.07 0.06 13.2 14.3

FireFlySerial 1.37 0.12 0.15 41.1 42.7

Parallel 1.30 0.03 0.03 28.2 29.6

Table 4.3: Energy consumed in Joules, for a domain size of 80

4.5 Summary and Discussion

We observed a small degree of performance gain from employing the parallelprograming model to the HPCG benchmark, which suggests that most cur-rent SoC devices are bandwidth limited. A higher gain in performance cantherefore be achieved for the next generation of devices, if higher memoryand bus bandwidth technologies are utilized compared to faster processingunits. Additionally SoCs can only attain a fraction of peak performance fordouble precision calculations, therefore changes in the architecture to supportfaster double precision operations would also be beneficial.

Even though small gains in performance were observed, a high level of powerefficiency was obtained through the parallel approach which makes the im-plementation highly desirable and relevant for SoCs.


Chapter 5

Conclusion and Future Work

In this final chapter, we will conclude by describing the progress made to-wards the stated goal in terms of development and achieved results. Sugges-tions for some future directions that could provide the next steps for furtherwork on the subject have also been provided.

5.1 Conclusion

We introduced the HPCG benchmark as an emerging tool for ranking com-puter systems. Due to the increasing importance of SoCs our aim was todevelop a version of HPCG, to fully utilize the available hardware resources,through the parallel programming model. We employed CUDA and OpenCL,which are two competing APIs for exploiting processing over DSP/GPUs.

We developed an initial model and concluded that performance gain for theparallel implementation can be beneficial depending on a number of desir-able hardware features. We identified multigrid conditioning (MG), sparsematrix vector multiplication (SPMV) , scaled vector addition (WAXPBY) andDot Product as the four major steps in the development process and pro-ceeded to implement and optimize them for parallel processing.

The two developed versions of HPCG (CUDA, OpenCL) were evaluated onthe Jetson and FireFly hardware systems. We observed small overall perfor-mance improvements and concluded that this was due to the limited band-width of the hardware systems and memory bound nature of the HPCG al-gorithm. However we noticed large power savings for computation over theGPU, which makes the parallel implementation relevant and desirable forevaluating SoCs.

41


42 Conclusion and Future Work

5.2 Recommendations for future work

A number of recommendations can be made for conducting further evalua-tions and improving the HPCG implementations.

We can further optimize the multigrid conditioner by implementing a systemwhich is able to effectively parallelize the Gauss Seidel pre-conditioner. Twomethods of achieving this have been mentioned in Section 3.3.4. It would beinteresting to explore how the two algorithms perform in comparison whenapplied to HPCG.

Another approach is to employ an enhanced matrix reordering techniquewhich allows for faster memory access of data elements. Existing or novelapproaches to modify the matrix format can be explored and relative perfor-mance changes can be measured, serving as an interesting problem.

Finally we can also compare the performance of the OpenCL vs CUDA bench-mark and explore implementation specific performance improvements.


Chapter 6

References

[1] Banger, Ravishekhar, and Koushik Bhattacharyya. OpenCL Programmingby Example. Packt Publishing Ltd, 2013.

[2] Cevahir, Ali, Akira Nukada, and Satoshi Matsuoka. "High performanceconjugate gradient solver on multi-GPU clusters using hypergraph partition-ing."Computer Science-Research and Development 25.1-2 (2010): 83-91.

[3] Chen, R. F., and Z. J. Wang. "Fast, block lower-upper symmetric Gauss-Seidel scheme for arbitrary grids." AIAA journal 38.12 (2000): 2238-2245.

[4] Cohen, Jonathan and Castonguay, Patrice, Efficient Graph Matching andColoring on the GPU, Available from:<http://on-demand.gputechconf.com/gtc/2012/presentations/S0332-Efficient-Graph-Matching-and-Coloring-on-GPUs.pdf>. [26 October 2015].

[5] Cumming, Ben, et al. "Application centric energy-efficiency study of dis-tributed multi-core and hybrid CPU-GPU systems." Proceedings of the Inter-national Conference for High Performance Computing, Networking, Storageand Analysis. IEEE Press, 2014.

[6] Dongarra , Luszczek , High-Performance Linpack benchmark. Availablefrom: <http://icl.cs.utk.edu/graphics/posters/files/SC14-HPL.pdf>. [26 Oc-tober 2015].

[7] Dongarra, Jack, and Michael A. Heroux. "Toward a new metric for rankinghigh performance computing systems." Sandia Report, SAND2013-4744 312(2013).

[8] Dongarra, Jack, Bunch, C. Moler and G. W. Stewart, 1979, LINPACK UsersGuide,SIAM, Philadelphia, PA.

43


44 References

[9] Harris, Mark, Unified Memory in CUDA 6, Available from:<http://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/>. [26October 2015].

[10] Heroux, M. A., J. Dongarra, and P. Luszczek. "HPCG Technical speci-fication."Sandia report SAND2013-8752 (2013).

[11] Holden, Brian. "Latency comparison between HyperTransportTM andPCI-ExpressTM in communications systems." HyperTransportT M Consor-tium(2006).

[12] Kunkel, Julian M., Thomas Ludwig, and Hans Meuer, eds. Supercomput-ing: 29th International Conference, ISC 2014, Leipzig, Germany, June 22-26,2014, Proceedings. Vol. 8488. Springer, 2014.

[13] Lin, David, and Sally Lin, eds. Advances in Multimedia, Software Engi-neering and Computing Vol. 1: Proceedings of the 2011 MESC InternationalConference on Multimedia, Software Engineering and Computing, Novem-ber 26-27, Wuhan, China. Vol. 128. Springer Science & Business Media, 2011.

[14] M. T. Jones and P. E. Plassmann, A Parallel Graph Coloring Heuristic,SIAM Journal of Scientific Computing 14 (1993) 654.

[15] Oliver Ruebel, Cameron G R Geddes, Estelle Cormier-Michel, KeshengWu, Prabhat, Gunther H Weber, Daniela M Ushizima, Peter Messmer, HansHagen, Bernd Hamann, Wes Bethel. Automatic beam path analysis of laserwakefield particle acceleration data. Computational Science & Discovery, vol.2, no. 1, 2009.

[16] Owens, John and Luebke, David, Intro to Parallel Programming. Avail-able from:<http://www.nvidia.com/object/cuda_home_new.html>. [26 Oc-tober 2015]

[17] Park, Jongsoo, et al. Efficient shared-memory implementation of high-performance conjugate gradient benchmark and its application to unstruc-tured matrices. Proceedings of the International Conference for High Perfor-mance Computing, Networking, Storage and Analysis. IEEE Press, 2014.

[18]Park, Jongsoo, et al. Optimizations in a high-performance conjugate gra-dient benchmark for IA-based multi-and many-core processors. International


45

Journal of High Performance Computing Applications (2015): 1094342015593157.

[19] Phillips, Everett, and Massimiliano Fatica. "A CUDA Implementationof the High Performance Conjugate Gradient Benchmark." High PerformanceComputing Systems. Performance Modeling, Benchmarking, and Simulation.Springer International Publishing, 2014. 68-84.

[20] Richard F. Barrett, Michael A. Heroux, Paul T. Lin, Courtenay T. Vaughan,and Alan B. Williams. "Poster: mini-applications: vehicles for co-design.", inProceedings of the 2011 Companion on High Performance Computing Net-working, Storage and Analysis Companion (SC 11 Companion). ACM, NewYork, NY, USA, 1-2 (2011).

[21] Sridevi, Intel Optimized Technology Preview for High Performance Con-jugate Gradient Benchmark, Available from: <https://software.intel.com/en-us/articles/intel-optimized-technology-preview-for-high-performance-conjugate-gradient-benchmark>. [26 October 2015].

[22] Strohmaier,Erich and Dongarra, Jack, The Linpack Benchmark, Availablefrom: <http://www.top500.org/project/linpack/>. [26 October 2015].

[23] Yamanaka, N., et al. "A parallel algorithm for accurate dot product."Parallel Computing 34.6 (2008): 392-410.

[24] Zhang, Xianyi, et al. "Optimizing and scaling HPCG on tianhe-2: Earlyexperience." Algorithms and Architectures for Parallel Processing. SpringerInternational Publishing, 2014. 28-41.


Appendix A

Software Artifacts and Results

A.1 Software Artifacts

Figure A.1: CUDA Artifacts

46


§A.1 Software Artifacts 47

Figure A.2: OpenCL Artifacts


Matrix Size Cube MG (s) 50 Iters SPMV (s) 50 Iters WAXPBY (s) 50 Iter Dot Product (s) 50 Iter Difference from Actua Sets Benchmark Time Summary (s Floating Point Operations Summary GFLOP/s Summary

16 0.194153 0.055916 0.032426 0.030147 8.88E-16 188

DDOT: 5.97511 WAXPBY: 6.29936 SpMV: 9.54076 MG: 36.7595 Total: 61.0231

Raw DDOT: 2.32554e+08 Raw WAXPBY: 2.32554e+08 Raw SpMV: 1.86652e+09 Raw MG: 1.02469e+10 Total: 1.25785e+10

Raw DDOT: 0.0389205 Raw WAXPBY: 0.036917 Raw SpMV: 0.175864 Raw MG: 0.278755 Raw Total: 0.206127

32 2.020782 0.505701 0.107238 0.092962 1.55E-15 23




48 7.341133 1.823882 0.339721 0.28979 1.07E-14 7




64 18.855926 3.209015 0.751762 0.642552 2.12E-10 3




80 35.766502 7.774297 1.430135 1.208055 5.41E-09 2





16 0.13447 0.02371 0.001475 0.001491 9.99E-16 371




32 2.048975 0.371418 0.018702 0.017617 1.55E-15 25




48 7.051137 1.151115 0.077251 0.064792 4.96E-14 8




64 20.04783 3.671383 0.212781 0.159862 1.75E-10 3




80 40.557144 7.977185 0.437673 0.308116 6.88E-09 2




Parallel HPCG (Jetson)

Serial HPCG (Jetson)

A.2 Results (Jetson)



16 4.307976 0.254597 0.315318 0.366261 8.88E-16 46


Raw DDOT: 5.57711e+07Raw WAXPBY: 5.57711e+07Raw SpMV: 4.47746e+08Raw MG: 2.45708e+09Total: 3.01636e+09


24 17.264816 0.549339 0.718061 0.782925 1.11E-15 13




32 41.566034 1.175397 1.199361 1.235029 1.33E-15 6




40 88.494947 2.791359 2.049761 2.035296 1.67E-15 3




48 151.871517 5.14493 3.801919 3.499803 4.92E-14 2





16 4.279157 0.599189 0.023504 0.017209 9.99E-16 49




24 17.614337 2.116442 0.070665 0.058378 1.11E-15 12




32 44.946871 4.054921 0.149727 0.146932 1.55E-15 5




40 90.498102 7.772154 0.230257 0.25472 1.55E-15 3




48 160.416834 12.037678 0.375603 0.417012 4.96E-14 2




Parallel HPCG (FireFly)

Serial HPCG (FireFly)

A.3 Results (FireFly)


Appendix B

Project Proposal

B.1 Project Description

Project Title: Development and application of an HPCG benchmark for char-acterising the performance and energy efficiency of low power system on achip

Project Description: For many years the performance of high performancecomputer (HPC) systems has been measured and compared using the HighPerformance Linpack benchmark (HPL). This has been extremely valuable indriving system performance for computations that rely on dense linear alge-bra. Modern HPC applications are, however, increasingly based on sparsematrix operations. As such HPL poorly represents the requirements of theseapplications. In recognition of this Dongarra and Heroux recently proposeda new High Performance Conjugate Gradient (HPCG) benchmark. Currentlythere are only limited implementations of the HPCG benchmark. The pur-pose of this project is to develop robust implementations of HPCG that canbe used to assess the performance and power usage of a variety of low powersystem on a chip (LPSoC). This is of interest since LPSoCs are widely expectedto form the building blocks for the next generation of high end supercomput-ers. The project will require a literature search related to HPCG and HPL,identification of a range of LPSoC systems for assessment, development ofHPCG code to run over these range of systems, measurement of performanceand energy usage, analysis of results and identification of future work.

50


§B.2 Project Contract 51

B.2 Project Contract


52 Project Proposal


Appendix C

ReadMe File

——————–Acknowledgments———————–

The HPCG code was adopted from the official HPCG source websiteand can be accessed fromhttps://software.sandia.gov/hpcg/download.php

Please refer to Appendix A, for the list of files that were created and adoptedfrom HPCG for modifications.

A complete list of references is available in the project report document

———————————————————–

Two versions of the code have been included in the respective folders. Theinstructions and requirements for executing the OpenCL and CUDA versionsof HPCG are mentioned below.

The Directory structure for both versions is similar:

src : HPCG source filesbuild2: Makefile for projectbuild2/bin: Contains the compiled binariesbuild2/bin/xhpcg: Compiled executablebuild2/bin/hpcg.dat: File specifying problem dimensions and run time

OCL_src (OpenCL only): OpenCL source files and kernelsCUDA_src (CUDA only): CUDA source files and kernels

————————-CUDA-HPCG————————

53


54 ReadMe File

How to run the code:

Navigate to \CUDA-HPCG \build2Edit Makefile to include the correct CUDA library directoryCheck PATH system variable to include CUDA bin directoryCheck LD_LIBRARY_PATH system variable to include CUDA lib directoryExecute terminal command :make cleanExecute terminal command :make

Navigate to \CUDA-HPCG \build2 \bin

Edit hcpg.dat to specify the run time and problem sizeRun the benchmark : ./xhpcg

————————OpenCL-HPCG———————–

How to run the code:

Navigate to \OpenCl-HPCG \build2Edit Makefile to include the correct OpenCl library directoryCheck PATH system variable to include OpenCl bin directoryCheck LD_LIBRARY_PATH system variable to include OpenCl lib directoryExecute terminal command :make cleanExecute terminal command :make

Navigate to \OpenCl-HPCG \build2 \bin

Edit hcpg.dat to specify the run time and problem sizeRun the benchmark : ./xhpcg


HPCG benchmark for characterising performance...

Documents

Transcript of HPCG benchmark for characterising performance...