

STATuner: Efficient Tuning of CUDA Kernels Parameters

Ravi Gupta 1; Ignacio Laguna 2; Dong H. Ahn 2; Todd Gamblin 2; Saurabh Bagchi 1; Felix Lin 1

1 Purdue University, 2 Lawrence Livermore National Lab

Figure 3. Occupancy Calculator Predicting Block_Size

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

RESULTS

•  SVM classifier with kernel = 'rbf'
•  'Classification' performs better than 'Regression'
•  Multi-label classification performs better than single-label classification
•  Training set = 13 applications, test set = 5, features = 6

Occ_1: Prediction 1 from the Occupancy Calculator
Occ_2: Prediction 2 from the Occupancy Calculator
K=rbf,3: SVM prediction with 'rbf' kernel, 3 labels
K=rbf,1: SVM prediction with 'rbf' kernel, 1 label

METHODOLOGY

We extract the kernel code from the application code and use LLVM to compute static metrics that characterize it. Only the kernel code is relevant, since that is the part of the code that runs on the GPU. Some of the static features we take into consideration:
•  Total instructions
•  Branch instructions
•  Memory instructions
•  Loops
•  Shared memory
•  Registers

To use a supervised machine-learning technique we first need to build a labeled training and test set. We use applications from the Rodinia benchmark suite, a well-known benchmark suite for evaluating heterogeneous computing.
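The poster does not say how its LLVM analysis is implemented, so the sketch below is only an illustration: it assumes the device-side IR has already been dumped to a .ll file (for example with clang's CUDA front end and -emit-llvm) and uses llvmlite to walk the IR. Only the instruction-count features are shown; loop, register, and shared-memory metrics need additional analysis.

    from collections import Counter
    import llvmlite.binding as llvm

    llvm.initialize()
    llvm.initialize_native_target()
    llvm.initialize_native_asmprinter()

    def static_metrics(ll_path):
        """Count a few static features of a kernel: total, branch, and memory instructions."""
        module = llvm.parse_assembly(open(ll_path).read())
        counts = Counter()
        for fn in module.functions:
            if fn.is_declaration:
                continue
            for block in fn.blocks:
                for inst in block.instructions:
                    counts['total_insts'] += 1
                    if inst.opcode in ('br', 'switch'):
                        counts['branch_insts'] += 1
                    elif inst.opcode in ('load', 'store'):
                        counts['memory_insts'] += 1
        return counts

    # Hypothetical input file produced earlier in the workflow (see Figure 4):
    # metrics = static_metrics('kernel.ll')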

CONCLUSIONS

•  Occupancy does not map very well to performance.
•  Static metrics can characterize applications.
•  With enough metrics and a large enough training dataset, it is possible to define a good model.
•  Static metrics can be leveraged to avoid multiple tuning trials.
•  The percent overshoot in execution time can be bounded within a limit using a single trial.

INTRODUCTION

Heterogeneous computing systems consist of numerous GPUs and CPUs: the computation-intensive, parallelizable regions of the code run on the GPU with very high throughput, while the relatively sequential parts run on the CPU.

The code region designated to run on the GPU is called a kernel, and each instance of the kernel code is called a thread. The number of threads that are launched depends on how the data set is partitioned. One important parameter when launching CUDA applications on a GPU is the grouping of threads into blocks (Block_Size), which determines how effectively the resources of a GPU multiprocessor (SM) are used and hence the throughput.

We propose a solution that predicts the optimal Block_Size using supervised machine learning, reducing the effort required to tune CUDA programs.
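The sketch below only illustrates what the tunable parameter is. It uses Numba's CUDA dialect in Python rather than the CUDA C of the Rodinia benchmarks, and saxpy is a made-up kernel, not one taken from the poster.

    import numpy as np
    from numba import cuda

    @cuda.jit
    def saxpy(a, x, y, out):
        i = cuda.grid(1)               # absolute thread index across the whole grid
        if i < out.size:
            out[i] = a * x[i] + y[i]

    n = 1 << 20
    x = np.random.rand(n).astype(np.float32)
    y = np.random.rand(n).astype(np.float32)
    out = np.zeros_like(x)

    block_size = 256                                # Block_Size: the parameter STATuner predicts
    grid_size = (n + block_size - 1) // block_size  # enough blocks to cover all n elements
    saxpy[grid_size, block_size](np.float32(2.0), x, y, out)

The same data set can be covered with many different Block_Size choices; which one runs fastest depends on the kernel, which is what makes the parameter worth tuning.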

Table 1. SVM Training Input Matrix

         S1    S2    S3    S4    S5    S6    S7    Label (supervised)
App 1    X11   X12   X13   X14   X15   X16   X17   Y1
App 2    X21   X22   X23   X24   X25   X26   X27   Y2
App 3    X31   X32   X33   X34   X35   X36   X37   Y3
...
App n    Xn1   Xn2   Xn3   Xn4   Xn5   Xn6   Xn7   Yn
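A minimal sketch of the training and prediction steps behind Table 1, assuming a scikit-learn SVC with the 'rbf' kernel. The metric values and Block_Size labels below are placeholders for illustration, not the poster's actual Rodinia measurements.

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Rows = applications, columns = static metrics (placeholder values):
    # total insts, branch insts, memory insts, loops, registers, shared memory
    X_train = np.array([
        [1200,  80, 300, 2, 24, 2048],
        [ 450,  15, 120, 1, 18,    0],
        [3100, 210, 900, 4, 40, 4096],
        # ... 13 training applications in total
    ], dtype=float)
    y_train = [256, 128, 512]        # best Block_Size found by manual runs (see Figure 5)

    model = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
    model.fit(X_train, y_train)

    X_new = np.array([[900, 60, 250, 2, 28, 1024]], dtype=float)  # metrics of an unseen kernel
    print(model.predict(X_new))      # predicted Block_Size, used for the single trial

Treating Block_Size as a class label rather than a continuous target matches the poster's observation that classification works better than regression here.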

BACKGROUND

In the past, the number of active blocks per SM has been linked to performance: the number of active warps should be high so that memory latencies can be hidden. NVIDIA's "Occupancy Calculator" is one such tool; it predicts the number of active warps that will run on a GPU as limited by three physical resources:
•  Registers
•  Shared memory
•  Blocks per SM
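As a rough illustration of what the Occupancy Calculator computes, the sketch below estimates how many blocks (and hence warps) can be resident on one SM given the three limiting resources. The per-SM limits are assumed, Kepler-like values, and the real calculator also accounts for allocation granularities that are ignored here.

    def occupancy(block_size, regs_per_thread, smem_per_block,
                  regs_per_sm=65536, smem_per_sm=49152,
                  max_warps_per_sm=64, max_blocks_per_sm=16, warp_size=32):
        """Fraction of the SM's warp slots kept busy by resident blocks (simplified)."""
        warps_per_block = (block_size + warp_size - 1) // warp_size
        # Each physical resource caps the number of blocks resident on one SM.
        limit_warps = max_warps_per_sm // warps_per_block
        limit_regs  = regs_per_sm // max(1, regs_per_thread * block_size)
        limit_smem  = smem_per_sm // smem_per_block if smem_per_block else max_blocks_per_sm
        blocks = min(max_blocks_per_sm, limit_warps, limit_regs, limit_smem)
        return blocks * warps_per_block / max_warps_per_sm

    # e.g. 256 threads per block, 32 registers per thread, 4 KB of shared memory per block
    print(occupancy(256, 32, 4096))   # -> 1.0 with the assumed limits

As Figure 2 and the conclusions indicate, a higher occupancy estimate does not necessarily mean lower execution time, which is what motivates STATuner's metric-based model.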

Figure 4. Prediction Workflow: CUDA code → kernel code → kernel LLVM IR → static metrics → ML model → predicted Block_Size → compile & run.

Figure 5. Training the Support Vector Machine (SVM) classifier: for each application, collect static metrics and find the optimal Block_Size through manual runs; add the pair to the training set and retrain the SVM while more applications remain.

Figure 2. Execution Time vs. Occupancy Correlation

Average percent error of the models: Occupancy Calculator 6.6%, STATuner 4.4%.


Figure 6. Percent Overshoot in Execution Time
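The poster does not spell out how percent overshoot is computed; a natural reading, assumed here, is the slowdown of the run that uses the predicted Block_Size relative to the run that uses the best Block_Size found by exhaustive search.

    def percent_overshoot(t_predicted, t_best):
        """Assumed definition: how much slower the predicted configuration is than the best one."""
        return 100.0 * (t_predicted - t_best) / t_best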

FURTHER WORK

This is a very basic model with very basic features. We would like to find a more representative set of applications on which to base our training set. We will improve our machine-learning technique and perform cross-validation to fine-tune the model. We also plan to add more derived static features to characterize an application, as well as dynamic metrics collected with NVIDIA's 'nvprof' profiler. Finally, we would like to extend our work to other heterogeneous programming platforms, such as OpenCL, and study its feasibility.

ABSTRACT

Our goal is to predict the parameters used for launching kernels on the accelerator in a heterogeneous system. We study the well-known "Occupancy Calculator" tool to assess its effectiveness and identify its pitfalls.

We identify other static metrics that can be used to characterize an application and use them as the feature set to build a Support Vector Classifier. Finally, we compare our model with the "Occupancy Calculator" and find that we obtain good predictions on the first trial itself: we are able to predict parameters with an upper bound of 15% overshoot, compared with a 32% bound for the Occupancy Calculator.


Figure 1. Inside a GPU Multiprocessor