

STATuner: Efficient Tuning of CUDA Kernels Parameters

Ravi Gupta 1; Ignacio Laguna 2; Dong H. Ahn 2; Todd Gamblin 2; Saurabh Bagchi 1; Felix Lin 1

1 Purdue University, 2 Lawrence Livermore National Lab

Figure 3. Occupancy Calculator Predicting Block_Size

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

RESULTS

•  SVM classifier with kernel = 'rbf'
•  'Classification' performs better than 'Regression'
•  Multi-label classification performs better than single-label classification
•  Training set = 13 applications, test set = 5, features = 6

Occ_1: Prediction 1 from the Occupancy Calculator
Occ_2: Prediction 2 from the Occupancy Calculator
K=rbf,3: SVM prediction with 'rbf' kernel, 3 labels
K=rbf,1: SVM prediction with 'rbf' kernel, 1 label

METHODOLOGY

We extract the kernel code from the application code and use LLVM to compute static metrics that characterize it. Only the kernel code is relevant, since that is the part of the code that runs on the GPU. Some of the static features we take into consideration:
•  Total instructions
•  Branch instructions
•  Memory instructions
•  Loops
•  Shared memory
•  Registers

To use a supervised machine-learning technique we first need to build a labeled training and test set. We use applications from the Rodinia benchmark suite, a well-known benchmark suite for evaluating heterogeneous computing.
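The poster does not say how its LLVM analysis is implemented, so the sketch below is only an illustration: it assumes the device-side IR has already been dumped to a .ll file (for example with clang's CUDA front end and -emit-llvm) and uses llvmlite to walk the IR. Only the instruction-count features are shown; loop, register, and shared-memory metrics need additional analysis.

    from collections import Counter
    import llvmlite.binding as llvm

    llvm.initialize()
    llvm.initialize_native_target()
    llvm.initialize_native_asmprinter()

    def static_metrics(ll_path):
        """Count a few static features of a kernel: total, branch, and memory instructions."""
        module = llvm.parse_assembly(open(ll_path).read())
        counts = Counter()
        for fn in module.functions:
            if fn.is_declaration:
                continue
            for block in fn.blocks:
                for inst in block.instructions:
                    counts['total_insts'] += 1
                    if inst.opcode in ('br', 'switch'):
                        counts['branch_insts'] += 1
                    elif inst.opcode in ('load', 'store'):
                        counts['memory_insts'] += 1
        return counts

    # Hypothetical input file produced earlier in the workflow (see Figure 4):
    # metrics = static_metrics('kernel.ll')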

CONCLUSIONS

•  Occupancy does not map very well to performance.
•  Static metrics can characterize applications.
•  With enough metrics and a large enough training dataset, it is possible to define a good model.
•  Static metrics can be leveraged to avoid multiple tuning trials.
•  The percent overshoot in execution time can be bounded within a limit using a single trial.

INTRODUCTION

Heterogeneous computing systems consist of numerous GPUs and CPUs: the computation-intensive, parallelizable regions of the code run on the GPU with very high throughput, while the relatively sequential parts run on the CPU.

The code region designated to run on the GPU is called a kernel, and each instance of the kernel code is called a thread. The number of threads that are launched depends on how the data set is partitioned. One important parameter when launching CUDA applications on a GPU is the grouping of threads into blocks (Block_Size), which determines how effectively the resources of a GPU multiprocessor (SM) are used and hence the throughput.

We propose a solution that predicts the optimal Block_Size using supervised machine learning, reducing the effort required to tune CUDA programs.
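The sketch below only illustrates what the tunable parameter is. It uses Numba's CUDA dialect in Python rather than the CUDA C of the Rodinia benchmarks, and saxpy is a made-up kernel, not one taken from the poster.

    import numpy as np
    from numba import cuda

    @cuda.jit
    def saxpy(a, x, y, out):
        i = cuda.grid(1)               # absolute thread index across the whole grid
        if i < out.size:
            out[i] = a * x[i] + y[i]

    n = 1 << 20
    x = np.random.rand(n).astype(np.float32)
    y = np.random.rand(n).astype(np.float32)
    out = np.zeros_like(x)

    block_size = 256                                # Block_Size: the parameter STATuner predicts
    grid_size = (n + block_size - 1) // block_size  # enough blocks to cover all n elements
    saxpy[grid_size, block_size](np.float32(2.0), x, y, out)

The same data set can be covered with many different Block_Size choices; which one runs fastest depends on the kernel, which is what makes the parameter worth tuning.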

Table 1. SVM Training Input Matrix

         S1    S2    S3    S4    S5    S6    S7    Label (supervised)
App 1    X11   X12   X13   X14   X15   X16   X17   Y1
App 2    X21   X22   X23   X24   X25   X26   X27   Y2
App 3    X31   X32   X33   X34   X35   X36   X37   Y3
...
App n    Xn1   Xn2   Xn3   Xn4   Xn5   Xn6   Xn7   Yn
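A minimal sketch of the training and prediction steps behind Table 1, assuming a scikit-learn SVC with the 'rbf' kernel. The metric values and Block_Size labels below are placeholders for illustration, not the poster's actual Rodinia measurements.

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Rows = applications, columns = static metrics (placeholder values):
    # total insts, branch insts, memory insts, loops, registers, shared memory
    X_train = np.array([
        [1200,  80, 300, 2, 24, 2048],
        [ 450,  15, 120, 1, 18,    0],
        [3100, 210, 900, 4, 40, 4096],
        # ... 13 training applications in total
    ], dtype=float)
    y_train = [256, 128, 512]        # best Block_Size found by manual runs (see Figure 5)

    model = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
    model.fit(X_train, y_train)

    X_new = np.array([[900, 60, 250, 2, 28, 1024]], dtype=float)  # metrics of an unseen kernel
    print(model.predict(X_new))      # predicted Block_Size, used for the single trial

Treating Block_Size as a class label rather than a continuous target matches the poster's observation that classification works better than regression here.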

BACKGROUND

In the past, the number of active blocks per SM has been linked to performance: the number of active warps should be high so that memory latencies can be hidden. NVIDIA's "Occupancy Calculator" is one such tool; it predicts the number of active warps that will run on a GPU as limited by three physical resources:
•  Registers
•  Shared memory
•  Blocks per SM
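As a rough illustration of what the Occupancy Calculator computes, the sketch below estimates how many blocks (and hence warps) can be resident on one SM given the three limiting resources. The per-SM limits are assumed, Kepler-like values, and the real calculator also accounts for allocation granularities that are ignored here.

    def occupancy(block_size, regs_per_thread, smem_per_block,
                  regs_per_sm=65536, smem_per_sm=49152,
                  max_warps_per_sm=64, max_blocks_per_sm=16, warp_size=32):
        """Fraction of the SM's warp slots kept busy by resident blocks (simplified)."""
        warps_per_block = (block_size + warp_size - 1) // warp_size
        # Each physical resource caps the number of blocks resident on one SM.
        limit_warps = max_warps_per_sm // warps_per_block
        limit_regs  = regs_per_sm // max(1, regs_per_thread * block_size)
        limit_smem  = smem_per_sm // smem_per_block if smem_per_block else max_blocks_per_sm
        blocks = min(max_blocks_per_sm, limit_warps, limit_regs, limit_smem)
        return blocks * warps_per_block / max_warps_per_sm

    # e.g. 256 threads per block, 32 registers per thread, 4 KB of shared memory per block
    print(occupancy(256, 32, 4096))   # -> 1.0 with the assumed limits

As Figure 2 and the conclusions indicate, a higher occupancy estimate does not necessarily mean lower execution time, which is what motivates STATuner's metric-based model.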

Figure 4. Prediction Workflow: CUDA code → kernel code → kernel LLVM IR → static metrics → ML model → predicted Block_Size → compile & run.

Figure 5. Training the Support Vector Machine (SVM) classifier: for each application, collect static metrics and find the optimal Block_Size through manual runs; add the pair to the training set and retrain the SVM while more applications remain.

Figure 2. Execution Time vs. Occupancy Correlation

Average percent error of the models: Occupancy Calculator 6.6%, STATuner 4.4%.


Figure 6. Percent Overshoot in Execution Time
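The poster does not spell out how percent overshoot is computed; a natural reading, assumed here, is the slowdown of the run that uses the predicted Block_Size relative to the run that uses the best Block_Size found by exhaustive search.

    def percent_overshoot(t_predicted, t_best):
        """Assumed definition: how much slower the predicted configuration is than the best one."""
        return 100.0 * (t_predicted - t_best) / t_best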

FURTHER WORK

This is a very basic model with very basic features. We would like to find a more representative set of applications on which to base our training set. We will improve our machine-learning technique and perform cross-validation to fine-tune the model. We also plan to add more derived static features to characterize an application, as well as dynamic metrics collected with NVIDIA's 'nvprof' profiler. Finally, we would like to extend our work to other heterogeneous programming platforms, such as OpenCL, and study its feasibility.

ABSTRACT

Our goal is to predict the parameters used for launching kernels on the accelerator in a heterogeneous system. We study the well-known "Occupancy Calculator" tool to assess its effectiveness and identify its pitfalls.

We identify other static metrics that can be used to characterize an application and use them as the feature set to build a Support Vector Classifier. Finally, we compare our model with the "Occupancy Calculator" and find that we obtain good predictions on the first trial itself: we are able to predict parameters with an upper bound of 15% overshoot, compared with a 32% bound for the Occupancy Calculator.


Figure 1. Inside a GPU Multiprocessor