
OpenCL-Based Mobile GPGPU Benchmarking: Methods and Challenges

Rotem Aviv
Qualcomm Technologies, Inc.
5775 Morehouse Dr., San Diego, CA 92121, USA
[email protected]

Guohui Wang
Qualcomm Technologies, Inc.
3165 Kifer Rd., Santa Clara, CA 95051, USA
[email protected]

ABSTRACT
Benchmarking general-purpose computing on graphics processing units (GPGPU) aims to profile and compare performance across different devices. Due to the low-level nature of most GPGPU APIs, GPGPU benchmarks are also useful for architectural exploration and program optimization. This can be challenging on mobile devices due to the lack of underlying hardware details and limited profiling capabilities on some platforms. Measuring the performance of a mobile GPU by executing benchmarks that cover major hardware and software features can reveal the strengths and weaknesses of a GPGPU system, enable better program optimization, and make automatic performance tuning possible.

In this paper, we will describe several design methods for OpenCL-based mobile GPGPU benchmarking and discuss key issues that one may encounter during development. We will also present design tips and guidelines to achieve more "fair" and accurate benchmarking results.

Keywords
Mobile GPGPU; Benchmarking; OpenCL; Performance analysis; Optimization

1. INTRODUCTION
As image processing, computer vision, and multimedia applications became more popular on mobile devices during the past several years, general-purpose computing on graphics processing units (GPGPU) on mobile devices has attracted much attention in both the academic and industrial communities and has been successfully applied to many use cases, accelerating computationally intensive algorithms [1, 5].

However, due to the lack of low-level hardware details and limited programming guidelines for some platforms, developing optimized mobile GPGPU applications remains a challenge. Benchmarking has become a crucial tool for developers, not only for performance comparison, but also to better understand hardware and software capabilities.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

IWOCL '16, April 19-21, 2016, Vienna, Austria
© 2016 Copyright held by the owner/author(s).

ACM ISBN 978-1-4503-4338-1/16/04.

DOI: http://dx.doi.org/10.1145/2909437.2909441

GPGPU benchmark workloads fall into two categories based on their characteristics and purpose: low-level benchmarks and high-level benchmarks. The SHOC benchmark suite uses a similar categorization into benchmarking levels [2]. Low-level benchmarks (or micro-benchmarks) are mostly synthetic workloads containing kernels specifically designed to measure the performance of a certain hardware or software feature; examples include an integer/floating-point ALU performance test and a global memory cached read/write bandwidth test. In contrast, high-level benchmarks include implementations of frequently used algorithms that usually serve as building blocks in larger-scale applications, such as histogram, box filter, face detection, and fluid dynamics simulation.

Previous studies in the field of GPGPU benchmarking [2, 3] have focused mostly on desktop and server GPUs. In this paper, we discuss mainly mobile GPGPU benchmarking, although a few topics may also be applicable to non-mobile GPU devices. For low-level benchmarks, we focus on the design challenges and solutions that improve benchmark validity and ensure that the correct feature is being profiled. For high-level benchmarks, we discuss result verification as a key step in making sure the benchmark runs as expected, and we present the different considerations one has to weigh when balancing benchmark performance with fairness and cross-platform design.

2. CHALLENGES AND SOLUTIONS
In this section, we discuss the challenges in developing mobile GPGPU benchmarks, as well as solutions that improve benchmarking validity and accuracy.

2.1 Performance Variance
Benchmarks typically base their results on measuring the time required to complete a certain operation. Time measurement in mobile systems may have a certain degree of variance for several reasons. First, mobile systems employ sophisticated power management controllers, which throttle the clock rates of different modules in the system. Second, thermal protection designs limit performance to protect hardware from physical damage. Third, system memory in mobile systems serves multiple clients and therefore experiences dynamic load; as a result, the memory bandwidth available to the GPU may change over time, adding to the performance variance of memory-bound benchmarks. Other factors contributing to performance variance include timer accuracy, as well as timers that run at a variable rate in certain systems.


Items                 Problem size
                      256×256       1024×1024
Min time              26.55 msec    87.30 msec
Max time              33.66 msec    90.89 msec
Average time          29.79 msec    89.13 msec
Time variance          2.46 msec     1.09 msec
Variance/Average %     8.25%         1.22%

Table 1: Performance Variance for 2D-FFT.

In this subsection, we describe methods to mitigate performance variance. During performance measurement, it is recommended to disable any task that is not necessary for the workload execution. To effectively mitigate performance variance, we can design benchmarks with longer workloads and record the performance of multiple iterations. However, increasing the workload length may trigger the thermal protection mechanism of the device. Running the benchmark in a temperature-controlled environment is one option; if such an option is not available, adding idle periods between workloads may reduce the chance of high system temperature. The benchmark can self-monitor the variance to make sure it does not exceed a reasonable level. The benchmark can also monitor CPU and memory load before and during the benchmark run, to make sure the benchmark result is obtained at a reasonable system load.
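As a concrete illustration, the following host-side sketch measures multiple iterations and lets the benchmark self-monitor its variance. It assumes a hypothetical, benchmark-specific run_workload() helper that enqueues one iteration of the workload; error handling is omitted for brevity.

#include <CL/cl.h>
#include <math.h>
#include <time.h>

void run_workload(cl_command_queue q);   /* hypothetical helper */

static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts); /* constant-rate CPU timer */
    return ts.tv_sec * 1e3 + ts.tv_nsec * 1e-6;
}

/* Returns 0 if the coefficient of variation stays within max_cv
 * (e.g. 0.05 for 5%); otherwise the run should be discarded. */
int measure_with_variance_check(cl_command_queue q, int iters,
                                double max_cv, double *avg_ms)
{
    double sum = 0.0, sum_sq = 0.0;
    for (int i = 0; i < iters; i++) {
        double t0 = now_ms();
        run_workload(q);
        clFinish(q);                     /* wait for GPU completion */
        double dt = now_ms() - t0;
        sum += dt;
        sum_sq += dt * dt;
    }
    double mean = sum / iters;
    double var  = sum_sq / iters - mean * mean;
    *avg_ms = mean;
    return (sqrt(var) / mean <= max_cv) ? 0 : -1;
}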

It is good practice to use both a CPU timer and a GPU timer, as well as any other available system timer, and to correlate the measurements taken by all timers. OpenCL 2.1¹ includes the clGetDeviceAndHostTimer function, which allows synchronizing host and device timers and can be used to correlate host and device time measurements. Timers that run at a constant rate are preferred, since they eliminate the effects of clock throttling. It is also helpful to refer to the platform hardware specification or consult with the hardware vendor to understand timer design features. Table 1 shows an example of a 2D-FFT workload tested on the Qualcomm® Adreno™ A430 GPU². With an input size of 256×256, we observe a variance of 8.25%. To reduce the variance, we increase the workload by changing the input size to 1024×1024. We also reduce host-device synchronization by running 10 frames consecutively and calling clFinish once. As a result, the performance variance is effectively reduced to 1.22%.
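A minimal sketch of such timer correlation, assuming an OpenCL 2.1 platform (both timestamps returned by clGetDeviceAndHostTimer are in nanoseconds; error handling omitted):

#include <CL/cl.h>
#include <stdio.h>

/* Sample host and device clocks around a workload and compare the
 * elapsed times measured by each. */
void correlate_timers(cl_device_id dev, cl_command_queue q)
{
    cl_ulong dev0, host0, dev1, host1;

    clGetDeviceAndHostTimer(dev, &dev0, &host0);
    /* ... enqueue the benchmark workload here ... */
    clFinish(q);
    clGetDeviceAndHostTimer(dev, &dev1, &host1);

    double dev_ms  = (dev1  - dev0)  * 1e-6;
    double host_ms = (host1 - host0) * 1e-6;
    /* Large disagreement between the two suggests clock throttling
     * or a variable-rate timer on one side. */
    printf("device: %.3f ms, host: %.3f ms\n", dev_ms, host_ms);
}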

2.2 GPU Driver Overhead
The GPU driver, which runs on the host, may have a significant impact on benchmark performance. Frequent compute API calls may, in some cases, increase the driver workload and shift the overall workload bottleneck from the GPU to the host. To minimize driver overhead, we recommend developing longer kernels that process more data while reducing the number of kernel launches. Merging the functionality of several small kernels into one larger kernel can also reduce the number of kernel launches.

In Listing 1, we show OpenCL kernel sample code that implements a floating-point ALU performance test with MAD operations. By tuning the iteration count of the loop in the kernel (N_iteration_count), we can control the length of the workload performed by each work-item.

¹ OpenCL 2.1 API Specification. https://www.khronos.org/registry/cl/specs/opencl-2.1.pdf
² Qualcomm® Adreno™ is a product of Qualcomm Technologies, Inc. https://www.qualcomm.com/products/snapdragon/processors

Listing 1: MAD ALU performance kernel code sample.

__kernel void mad_float(...)
{
    ...
    for (int j = 0; j < iterations; j++) {
        s0 = s0 * s1 + s2;
        s0 = s0 * s1 + s2;
        s0 = s0 * s1 + s2;
        ...
    }
    ...
}

Figure 1: MAD floating-point ALU Performance. [Plot: GFLOPS (measured with a CPU timer, left axis) and number of enqueues (right axis) versus the intra-loop count inside the kernel.]

Repeated enqueues of this kernel with too few iterations will increase the driver overhead and reduce the measured performance; the goal of revealing the true peak performance of the device then cannot be achieved. In contrast, increasing the iteration count and reducing the number of kernel enqueues (N_enqueue) lowers the driver overhead while maintaining the same overall workload for the whole test. As a result, we measure higher performance, close to the true peak performance of the device's ALU hardware. Figure 1 shows performance data in GFLOPS on the Qualcomm® Adreno™ A430 GPU as we increase the iteration count. Note that the total workload size remains constant (N_enqueue × N_iteration_count is a constant value). As can be seen, a large iteration count with a lower number of kernel enqueue API calls leads to better performance.
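A minimal host-side sketch of this trade-off, under stated assumptions: the iteration count is kernel argument 0 (an assumption about the Listing 1 signature), TOTAL_WORK is an arbitrary illustrative constant, and error handling is omitted. The product N_enqueue × N_iteration_count is held constant while the split between the two varies.

#include <CL/cl.h>

#define TOTAL_WORK (1 << 20)   /* assumed constant total workload */

/* Run the MAD kernel with a given per-kernel iteration count,
 * holding n_enqueue * n_iter constant across configurations. */
void run_mad_test(cl_command_queue q, cl_kernel k,
                  size_t global, int n_iter)
{
    int n_enqueue = TOTAL_WORK / n_iter;
    clSetKernelArg(k, 0, sizeof(int), &n_iter); /* assumed arg index */
    for (int i = 0; i < n_enqueue; i++)
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL,
                               0, NULL, NULL);
    clFinish(q);   /* single sync instead of per-enqueue waits */
}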

It is recommended to limit kernel running time to a reasonable length to avoid unexpected behavior caused by the GPU watchdog timer. In addition, benchmarks with long running times may trigger thermal gating in certain cases, resulting in lower measured performance.

2.3 GPU Context Switch
Another challenge during benchmark development is GPU context switching. Typically, a GPU compute benchmark launches its workload through a single context. With OpenCL, the workload context is used for compute only; graphics-related tasks, such as rendering 2D/3D content and displaying images or video, are handled through a separate graphics context. The user interface (UI) may also run through a separate context. If a context switch occurs during workload execution, the workload may be interrupted in some cases, affecting the measured performance.

We have experimented with a 2D-FFT application in which 4 kernels are enqueued sequentially on the Qualcomm® Adreno™ A430 GPU. After the kernels are enqueued, the application waits for clFinish to return; performance is determined by measuring the time from the first enqueue to the return of clFinish. For this experiment, we synthetically trigger UI events, in the form of prints to the screen, during the execution. Since printing to the screen is handled by a different context, a context switch occurs, and as a result the measured performance is lower. As can be seen in Figure 2, performance slows down as the number of UI events increases. In general, benchmark developers should avoid UI activity and context switches during experiments.

Figure 2: 2D-FFT Performance with UI Context. [Plot: running time (msec) over 10 iterations for "CL only", "CL + 1 UI event", and "CL + 2 UI events", with their averages.]

2.4 Work Group Size and Shape Tuning
The performance of many use cases may depend on work-group size and shape. Different work-group sizes and shapes can change the memory access pattern, affecting aspects such as memory bandwidth and cache utilization. There is no simple method for predicting the best-performing work-group size and shape. A good practice for developers is to try different size and shape options on each device and choose the best-performing one. The work-group tuning process can be automated as part of the benchmark, as sketched below.
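A minimal autotuning sketch: the candidate shapes below are arbitrary examples, and shapes rejected by the runtime are simply skipped. A real tuner would also repeat each measurement to average out variance (Section 2.1).

#include <CL/cl.h>
#include <time.h>

static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec * 1e-6;
}

/* Time one launch with a given 2D local size; returns a negative
 * value if the shape is rejected by the runtime. */
static double time_kernel_ms(cl_command_queue q, cl_kernel k,
                             const size_t global[2],
                             const size_t local[2])
{
    double t0 = now_ms();
    if (clEnqueueNDRangeKernel(q, k, 2, NULL, global, local,
                               0, NULL, NULL) != CL_SUCCESS)
        return -1.0;
    clFinish(q);
    return now_ms() - t0;
}

/* Try candidate shapes and keep the fastest valid one in best[]. */
void tune_work_group(cl_command_queue q, cl_kernel k,
                     const size_t global[2], size_t best[2])
{
    const size_t candidates[][2] = {
        {64, 1}, {32, 2}, {16, 4}, {8, 8}, {4, 16}, {2, 32}, {1, 64}
    };
    double best_ms = -1.0;
    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        double ms = time_kernel_ms(q, k, global, candidates[i]);
        if (ms >= 0.0 && (best_ms < 0.0 || ms < best_ms)) {
            best_ms = ms;
            best[0] = candidates[i][0];
            best[1] = candidates[i][1];
        }
    }
}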

Figure 3: Matrix multiply execution time (the lower, the better) with different work group shapes. [Two panels, measured on the Qualcomm Adreno A430 GPU: relative run time of FFT1D for work-group sizes from (448,1) to (1,448), and of matrix multiply for different 2-dimensional work-group sizes.]

As shown in Figure 3, matrix multiply performance on the Qualcomm® Adreno™ A430 GPU depends heavily on the work-group shape. The matrix multiply is memory bound, so it is sensitive to work-group size and shape. Although work-group tuning may prove unnecessary in certain cases, given the simplicity of the process and the potential benefit it can generate, we strongly recommend performing work-group size and shape tuning for every kernel in the benchmark.

2.5 Compiler Optimization
When developing low-level benchmarks aimed at measuring the performance of a specific hardware feature, it is important that the kernel functionality is not modified by compiler optimization. For example, if a kernel contains a series of repeated ADD operations, the compiler may optimize these operations into a single MUL operation. Another common case is that if an intermediate variable is not used or stored by the end of the kernel, all operations performed on that variable may be eliminated by compiler optimization. Careful attention is required to make sure the code is not optimized away in these cases, and detecting compiler optimization often requires multiple experiments using different code versions.
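As an illustrative sketch (not guaranteed to defeat every compiler; results should be verified per platform as described above), chaining dependencies between operations and storing the final result to global memory are common ways to keep the measured work live:

__kernel void add_bench(__global float *out, int iterations)
{
    int gid = get_global_id(0);
    float s0 = (float)gid, s1 = 1.0f;
    for (int j = 0; j < iterations; j++) {
        /* Chained dependencies make it harder for the compiler to
         * collapse the repeated ADDs into a single multiply. */
        s0 += s1;
        s1 += s0;
    }
    /* Store the result so the loop is not removed as dead code. */
    out[gid] = s0 + s1;
}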

2.6 Math Function Benchmarking
Many math functions are implemented by a complex set of instructions, which may check corner cases and have multiple execution paths. As a result, math function performance may depend on the input values. Since benchmarking requires a large number of repeated executions of a function, the input value may converge, for some functions, to a corner case, yielding different performance than that of common use cases. Listing 2a contains sample OpenCL kernel code from a naïve implementation of a log benchmark. For most possible initial input values, the log output converges to a constant value within a small number of log calls. In comparison, the improved implementation in Listing 2b effectively prevents the input values to the log function from converging to constant values. The performance overhead of a single multiplication is negligible in this case.

Listing 2: Two different implementations of OpenCL kernels to benchmark the log function.

// Code sample a: Naive implementation
data = log(data);
data = log(data);
data = log(data);

// Code sample b: New implementation
data = log(data);
data *= data;
data = log(data);
data *= data;
data = log(data);
data *= data;

Figure 4 shows the measured performance of the log function on the Qualcomm® Adreno™ A430 GPU. As can be seen, the naïve implementation yields higher performance than the new implementation, since the execution of the log function with converged input values is faster. The random case, which may closely represent a real use case, shows a 4× performance difference between the naïve implementation and the new implementation. In general, math benchmarking requires special attention to functions that may perform differently depending on the input values.

Figure 4: GFLOPS of the log function on the Qualcomm® Adreno™ A430 GPU (OpenCL strict-math). [Bar chart: naïve vs. new implementation for initial values 0.0, 1.0, 1.0E+10, inf, -1.0, and random.]

2.7 Performance Metrics for Profiling
Low-level benchmarking is a useful tool for developers to understand the strengths and weaknesses of a device. By measuring the performance of multiple low-level features of a device, one can generate a "device profile", as shown in Figure 5.

Figure 5a shows the throughput performance of selected floating-point ALU and math functions, and Figure 5b shows the memory bandwidth performance. We define relative performance metrics by dividing the MAD ALU performance by the memory bandwidth, as depicted in Figure 5c. These relative metrics reveal the performance ratio between the compute resources (ALU throughput) and the memory resources (memory bandwidth) inside a GPU. Relative metrics can help in estimating device utilization for some workloads. By setting up such a "device profile", we can choose the right optimization strategy for either compute-bound or memory-bound workloads based on the relative performance metrics.

Figure 5: Building a "Device Profile" with Low-level Benchmarks. [Three panels comparing the Qualcomm Adreno A430 GPU with a non-Qualcomm mobile GPU: (a) ALU & math performance in GFLOPS, (b) memory performance in GB/s, (c) relative performance metrics, float4 MAD ALU / memory BW, in GFLOP/GB.]
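As a worked illustration with purely hypothetical numbers (not measured data from the paper), the relative metric can be compared against a kernel's arithmetic intensity to pick an optimization strategy:

#include <stdio.h>

int main(void)
{
    /* Hypothetical device-profile values, for illustration only. */
    double dev_gflops = 300.0;  /* float MAD throughput (GFLOPS)  */
    double dev_gbps   = 20.0;   /* memory bandwidth (GB/s)        */
    double balance    = dev_gflops / dev_gbps;  /* 15 FLOP/byte   */

    /* Hypothetical kernel: FLOPs per byte of memory traffic. */
    double kernel_intensity = 4.0;

    /* Below the device balance point, memory-oriented
     * optimizations are likely to pay off first. */
    printf("kernel is likely %s bound\n",
           kernel_intensity < balance ? "memory" : "compute");
    return 0;
}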

2.8 Result Verification

Figure 6: Diagram of Viola-Jones face detection algorithm. [Input image sub-windows are processed by a cascade of stages 1, 2, ..., N; each stage either passes a sub-window to the next stage (T) or rejects it (F).]

Result verification allows the benchmark to check whether the device executes the functionality as expected, so as to validate the performance measurement. This is especially important for benchmarks running across devices, since execution errors can terminate kernel execution early, generating "fake" high performance numbers. In addition, performance may be data dependent in some use cases. For example, the Viola-Jones face detection algorithm [4] is based on a cascade of logic modules, as illustrated in Figure 6. Hardware differences across devices may cause differences in intermediate results, leading to workload inconsistency and performance differences. To ensure workload consistency across devices, we recommend comparing the workload output with reference data not only at the final stage of computation but also at the intermediate stages.
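A minimal sketch of such stage-wise verification, assuming reference buffers have been precomputed offline for each stage; the tolerance is workload-specific and accommodates legitimate floating-point differences across devices:

#include <math.h>
#include <stddef.h>

/* Compare one stage's output against precomputed reference data;
 * returns the number of mismatches beyond a relative tolerance. */
size_t verify_stage(const float *out, const float *ref,
                    size_t n, float rel_tol)
{
    size_t errors = 0;
    for (size_t i = 0; i < n; i++) {
        float denom = fabsf(ref[i]) > 1.0f ? fabsf(ref[i]) : 1.0f;
        if (fabsf(out[i] - ref[i]) / denom > rel_tol)
            errors++;
    }
    /* A nonzero count should invalidate this run's timing result. */
    return errors;
}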

3. DESIGN TRADE-OFFS
Developing a high-level benchmark requires balancing several considerations, or more specifically, the trade-off between peak performance on one side and complexity and "fairness" on the other. On the one hand, a correct assessment of performance requires running a workload that can maximize device utilization and efficiency. On the other hand, the benchmark has to be designed with a cross-platform approach.

Improving benchmark performance often requires careful optimization, and even the use of vendor libraries where possible. In some cases, using dedicated hardware features for better performance generates differences in results, mainly due to hardware differences in features not defined in the API specification. These differences may expose variations in workload across devices, making performance comparison less "fair". Moreover, maximizing performance on multiple devices is very challenging, given that hardware features, driver/compiler optimizations, and tool support differ across devices.

In order for all devices to run exactly the same workload and to make the benchmark a cross-platform tool, a developer may compromise on performance, for example by using only common optimization techniques supported by most devices. This approach is beneficial in terms of development time and cross-platform capability, but it will under-utilize the devices, making the benchmark result poorly correlated with the true device capability. Benchmark developers should find the right balance between these considerations and achieve "fairness" by maintaining workload consistency across devices, while trying to improve performance through both common and device-specific optimizations. The advancement of compilers and drivers, as well as the gradually increasing availability of optimized libraries, can be leveraged by benchmark developers to create more accurate and balanced benchmarks.

4. CONCLUSIONS
GPGPU benchmarking has proven important for performance comparison, architecture exploration, and algorithm development and optimization. Effective mobile GPGPU benchmarking requires an extensive amount of experience from the developer. In addition, mobile GPGPU performance is sensitive to many system factors, making it even more difficult to develop a high-quality mobile GPGPU benchmark. In this paper, we have discussed key issues that mobile GPGPU benchmark developers may encounter and described potential solutions to these problems. Most importantly, we believe that a good mobile benchmarking strategy should carefully consider the balance between fairness and true peak performance when using benchmarking results to analyze and compare different platforms.

Acknowledgments
The authors would like to thank Alex Bourd and Jay Yun for the in-depth discussions and constructive suggestions.

5. REFERENCES
[1] K.-T. Cheng and Y. Wang. Using mobile GPU for general-purpose computing - a case study of face recognition on smartphones. In VLSI-DAT, pages 1-4, April 2011.
[2] A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter. The scalable heterogeneous computing (SHOC) benchmark suite. In GPGPU, pages 63-74. ACM, 2010.
[3] N. Goswami, R. Shankar, M. Joshi, and T. Li. Exploring GPGPU workloads: Characterization methodology, analysis and microarchitecture evaluation implications. In IISWC, pages 1-10. IEEE, 2010.
[4] P. Viola and M. J. Jones. Robust real-time face detection. IJCV, 57(2):137-154, 2004.
[5] G. Wang, Y. Xiong, J. Yun, and J. R. Cavallaro. Computer vision accelerators for mobile systems based on OpenCL GPGPU co-processing. Journal of Signal Processing Systems (JSPS), 76(3):283-299, Apr. 2014.