Transcript of CLTune: A Generic Auto-Tuner for OpenCL Kernels
CLTune: A Generic Auto-Tuner for OpenCL Kernels
Cedric Nugteren (presenter), Valeriu Codreanu
IEEE MCSoC, September 24, 2015
HPC center of the Netherlands
Example: convolution
Targets:
• GPUs
• Multi-core CPUs
• Other OpenCL-capable devices
Example: blur filter
[Figure: 2D convolution — input A of size X by Y, a filter of size Xf by Yf with filter coefficients, output B; example: a 3 by 3 filter]
OpenCL 2D convolution
Kernel structure: a double for-loop over the filter; each thread computes one output pixel.

Tuning parameters:
• Thread coarsening (2D)?
• Unroll loops?
• OpenCL work-group size?
• Vector data-types?
• Caching in local memory?
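The kernel's core computation can be sketched as plain C++: each OpenCL thread computes one output pixel with a double for-loop over the filter. Here the two outer loops stand in for the 2D thread grid; the function name and zero-padded boundary handling are illustrative assumptions, not taken from the talk.

```cpp
#include <cassert>
#include <vector>

// Reference 2D convolution: input A (X by Y), filter coefficients (Xf by Yf),
// output B. The two outer loops emulate the 2D grid of OpenCL threads; the two
// inner loops are the per-thread double for-loop from the slide.
std::vector<float> Convolve2D(const std::vector<float>& A, int X, int Y,
                              const std::vector<float>& coeff, int Xf, int Yf) {
  std::vector<float> B(X * Y, 0.0f);
  for (int y = 0; y < Y; ++y) {    // thread index in Y
    for (int x = 0; x < X; ++x) {  // thread index in X
      float acc = 0.0f;
      for (int fy = 0; fy < Yf; ++fy) {    // the double for-loop
        for (int fx = 0; fx < Xf; ++fx) {  // over the filter
          int ix = x + fx - Xf / 2;
          int iy = y + fy - Yf / 2;
          if (ix >= 0 && ix < X && iy >= 0 && iy < Y) {  // zero padding
            acc += A[iy * X + ix] * coeff[fy * Xf + fx];
          }
        }
      }
      B[y * X + x] = acc;  // one output pixel per thread
    }
  }
  return B;
}
```

Each of the listed tuning parameters transforms this loop nest (coarsening merges threads, unrolling flattens the filter loops, local-memory caching stages A) without changing its result.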
Search-space explosion
16 x 2 x 16 x 4 x 5 = 10240 combinations of the five parameters (thread coarsening, loop unrolling, work-group size, vector data-types, local-memory caching); after filtering out illegal configurations, 3424 remain.

Large search-space:
• Not feasible to explore manually
• Perhaps not even feasible automatically?
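How the space explodes is simple arithmetic: the product of the per-parameter option counts. A minimal sketch, where the legality predicate is a hypothetical stand-in for the tuner's real constraints (e.g. work-group sizes exceeding device limits), which prune the 10240 combinations down to 3424:

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Enumerate every combination of the parameter options (odometer-style) and
// count those passing a legality check. With an always-true predicate this
// simply returns the product of the option counts.
int CountLegal(const std::vector<int>& counts,
               const std::function<bool(const std::vector<int>&)>& legal) {
  std::vector<int> cfg(counts.size(), 0);
  int n_legal = 0;
  while (true) {
    if (legal(cfg)) ++n_legal;
    // Increment the configuration like an odometer, carrying on overflow.
    size_t i = 0;
    while (i < cfg.size() && ++cfg[i] == counts[i]) { cfg[i] = 0; ++i; }
    if (i == cfg.size()) break;  // all digits wrapped: enumeration done
  }
  return n_legal;
}
```

With the slide's option counts {16, 2, 16, 4, 5} and no constraints this yields 10240; a real legality predicate would bring it to the 3424 quoted above.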
Why do we need an auto-tuner?
Large search-space:
• Not feasible to explore manually
• Perhaps not even feasible automatically?

Wide variety of devices:
• Different optimal kernels
• Even from the same vendor

User-parameter dependent:
• Examples: matrix sizes, image size, filter sizes, etc.
Search strategies
Option 0: Full search
☺ Finds optimal solution
☹ Explores all options
3424 configurations on Tesla K40m GPU
[Figure: search-space distribution on the K40m — performance (% of best-known) shown as a rotated histogram with its mean]
Search strategies
Option 1: Random search
☺ Explores arbitrary fraction
☹ Performance varies
Colours: 3 example runs
[Figure: random search on the K40m — performance (% of best-known) vs. search progress (steps); three example runs end at 56%, 74%, and 94%]
Example: 107 out of 3424 configurations (1/32nd)
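Random search needs almost no machinery: draw a fixed budget of configurations uniformly at random and keep the best. A minimal sketch, where `evaluate` is a hypothetical stand-in for compiling and timing a kernel:

```cpp
#include <cassert>
#include <random>
#include <vector>

// Random search over a discrete configuration space: `counts[i]` is the number
// of options for parameter i; `budget` is the fraction explored (the slides use
// 107 of 3424, i.e. 1/32nd). Higher score is better.
template <typename Objective>
std::vector<int> RandomSearch(const std::vector<int>& counts, int budget,
                              Objective evaluate, std::mt19937& rng) {
  std::vector<int> best;
  double best_score = -1e300;
  for (int step = 0; step < budget; ++step) {
    // Draw one configuration uniformly at random.
    std::vector<int> cfg(counts.size());
    for (size_t i = 0; i < counts.size(); ++i) {
      cfg[i] = std::uniform_int_distribution<int>(0, counts[i] - 1)(rng);
    }
    double score = evaluate(cfg);
    if (score > best_score) { best_score = score; best = cfg; }
  }
  return best;
}
```

The run-to-run spread the figure shows (56% to 94%) is inherent: each seed draws a different sample of the space.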
Search strategies
Option 2: Simulated annealing
☺ Explores arbitrary fraction
☹ Performance varies
☹ Meta-parameter (temperature T)
☹ Local optima
[Figure: simulated annealing (T=4) on the K40m — performance (% of best-known) vs. search progress (steps); three example runs end at 64%, 84%, and 91%]
Example: 107 out of 3424 configurations (1/32nd)
Colours: 3 example runs
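Simulated annealing walks the space via single-parameter moves, accepting a worse neighbour with probability exp(-delta/T); T is the meta-parameter the slides sweep (T = 2, 4, 6). A sketch under assumptions: the linear cooling schedule and random-restart-free neighbour move are illustrative choices, not taken from the talk, and `evaluate` again stands in for timing a kernel.

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// Simulated annealing over a discrete parameter space; higher score is better.
template <typename Objective>
std::vector<int> Anneal(const std::vector<int>& counts, int budget,
                        double temperature, Objective evaluate,
                        std::mt19937& rng) {
  auto rand_int = [&](int n) {
    return std::uniform_int_distribution<int>(0, n - 1)(rng);
  };
  std::uniform_real_distribution<double> unit(0.0, 1.0);

  // Random starting configuration.
  std::vector<int> current(counts.size());
  for (size_t i = 0; i < counts.size(); ++i) current[i] = rand_int(counts[i]);
  double current_score = evaluate(current);
  std::vector<int> best = current;
  double best_score = current_score;

  for (int step = 0; step < budget; ++step) {
    double T = temperature * (1.0 - double(step) / budget);  // linear cooling
    // Neighbour move: re-draw one randomly chosen parameter.
    std::vector<int> next = current;
    int p = rand_int(int(counts.size()));
    next[p] = rand_int(counts[p]);
    double next_score = evaluate(next);
    double delta = current_score - next_score;  // positive if neighbour is worse
    if (delta <= 0.0 || unit(rng) < std::exp(-delta / std::max(T, 1e-9))) {
      current = next;           // accept: better, or worse with probability
      current_score = next_score;  //         exp(-delta / T)
    }
    if (current_score > best_score) { best = current; best_score = current_score; }
  }
  return best;
}
```

Accepting occasional downhill moves is what lets the walk escape some local optima; the cons listed above remain, since T must be picked per problem.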
Search strategies
Option 3: Particle swarm optimisation
☺ Explores arbitrary fraction
☹ Performance varies
☹ Meta-parameter (swarm size S)
☹ Local optima
Colours: 3 example runs
[Figure: particle swarm optimisation (S=3) on the K40m — performance (% of best-known) vs. search progress (steps); three example runs end at 43%, 44%, and 66%]
Example: 107 out of 3424 configurations (1/32nd)
Line-types: 3 swarms
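Particle swarm optimisation moves S particles (the slides use S = 3 and S = 6) through a continuous relaxation of the parameter space, blending each particle's velocity with pulls toward its personal best and the swarm's global best, then rounding to the nearest legal index. A sketch under assumptions: the coefficients (w = 0.7, c1 = c2 = 1.5) are conventional defaults, not values from the talk.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

template <typename Objective>
std::vector<int> ParticleSwarm(const std::vector<int>& counts, int swarm_size,
                               int steps, Objective evaluate, std::mt19937& rng) {
  std::uniform_real_distribution<double> unit(0.0, 1.0);
  const size_t dims = counts.size();
  auto clamp_round = [&](double v, size_t i) {  // continuous -> legal index
    return std::max(0, std::min(counts[i] - 1, int(std::lround(v))));
  };
  auto score_of = [&](const std::vector<double>& pos) {
    std::vector<int> cfg(dims);
    for (size_t i = 0; i < dims; ++i) cfg[i] = clamp_round(pos[i], i);
    return evaluate(cfg);
  };

  struct Particle { std::vector<double> pos, vel, best_pos; double best_score; };
  std::vector<Particle> swarm(swarm_size);
  std::vector<double> g_pos(dims);  // global best position
  double g_score = -1e300;
  for (auto& p : swarm) {  // random initial positions, zero velocities
    p.pos.resize(dims); p.vel.assign(dims, 0.0);
    for (size_t i = 0; i < dims; ++i) p.pos[i] = unit(rng) * (counts[i] - 1);
    p.best_pos = p.pos; p.best_score = score_of(p.pos);
    if (p.best_score > g_score) { g_score = p.best_score; g_pos = p.pos; }
  }

  const double w = 0.7, c1 = 1.5, c2 = 1.5;  // inertia, cognitive, social
  for (int s = 0; s < steps; ++s) {
    for (auto& p : swarm) {
      for (size_t i = 0; i < dims; ++i) {
        p.vel[i] = w * p.vel[i] + c1 * unit(rng) * (p.best_pos[i] - p.pos[i])
                                + c2 * unit(rng) * (g_pos[i] - p.pos[i]);
        p.pos[i] += p.vel[i];
      }
      double sc = score_of(p.pos);
      if (sc > p.best_score) { p.best_score = sc; p.best_pos = p.pos; }
      if (sc > g_score) { g_score = sc; g_pos = p.pos; }
    }
  }
  std::vector<int> best(dims);
  for (size_t i = 0; i < dims; ++i) best[i] = clamp_round(g_pos[i], i);
  return best;
}
```

The continuous relaxation is one reason PSO fits auto-tuning poorly: neighbouring indices of a tuning parameter need not have neighbouring performance, which matches the weak results on the next slides.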
Search strategies evaluation
[Figure: best-found performance (% of best-known) on the K40m for random search, SA (T=2, 4, 6), and PSO (S=3, 6), compared against the full search-space distribution]
Each search: 107 out of 3424 configurations (1/32nd)
Shown: average best result of 128 searches; T and S are the meta-parameters for SA and PSO
Search strategies evaluation
[Figure: the same strategy comparison (performance as % of best-known) for three devices: K40m, HD7970, and Iris]
Conclusions:
• Different per device
• PSO performs poorly
• Random search and SA perform well
Convolution case-study
[Figure: tuned convolution performance (% of device peak) on the GTX480 — 3x3 filter: 298 GFLOPS, 125 GB/s; 7x7 filter: 572 GFLOPS, 46 GB/s; 11x11 filter: 787 GFLOPS, 26 GB/s]
Conclusions:
• Different best parameters for different devices (see paper) and filter-sizes
• Performance equal or better than the state-of-the-art [1]
[1]: B. Van Werkhoven, J. Maassen, H.E. Bal, and F.J. Seinstra. Optimizing Convolution Operations on GPUs Using Adaptive Tiling.
GEMM case-study
[Figure: GEMM performance comparison (% of device peak, absolute GFLOPS annotated) for this work, clBLAS, and cuBLAS on the K40m, GTX480, HD7970, and Iris; cuBLAS is N/A on the HD7970 and Iris]
[Figure: GEMM tiling for C += A x B — per-workgroup tiles of Mwg by Nwg (stepping over K by Kwg), per-thread tiles of Mwi by Nwi, inner loop unrolled by a factor Kwi; constraint: Mwi x MdimC = Mwg]
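The two tiling levels from the figure can be sketched as a loop nest using the slide's parameter names. This is a readability sketch, not the tuned kernel: a real OpenCL version would stage the A and B tiles in local memory and unroll the K loop by Kwi, and this sketch assumes the matrix dimensions divide evenly by the tile sizes.

```cpp
#include <cassert>
#include <vector>

// Blocked GEMM, C += A * B (row-major): the outer three loops tile C into
// Mwg-by-Nwg workgroup tiles while stepping over K by Kwg; inside, each
// "thread" owns an Mwi-by-Nwi sub-tile. Assumes M, N, K are multiples of the
// tile sizes.
void GemmTiled(int M, int N, int K, const std::vector<float>& A,
               const std::vector<float>& B, std::vector<float>& C,
               int Mwg, int Nwg, int Kwg, int Mwi, int Nwi) {
  for (int mw = 0; mw < M; mw += Mwg)
    for (int nw = 0; nw < N; nw += Nwg)
      for (int kw = 0; kw < K; kw += Kwg)              // per-workgroup tiling
        for (int mi = mw; mi < mw + Mwg; mi += Mwi)
          for (int ni = nw; ni < nw + Nwg; ni += Nwi)  // per-thread tiling
            for (int m = mi; m < mi + Mwi; ++m)
              for (int n = ni; n < ni + Nwi; ++n) {
                float acc = 0.0f;
                for (int k = kw; k < kw + Kwg; ++k)    // unroll candidate (Kwi)
                  acc += A[m * K + k] * B[k * N + n];
                C[m * N + n] += acc;
              }
}
```

Every tile size here is a tuning parameter, which is exactly why the GEMM search-space dwarfs the convolution one.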
Conclusions:
• Different best parameters for different devices
• Performance better than clBLAS, but not as good as assembly-tuned cuBLAS
CLTune: A Generic Auto-Tuner for OpenCL Kernels
Auto-tuning OpenCL kernels:
• Large search-space
• Wide variety of devices
• User-parameter dependent

Advanced search strategies:
• Simulated annealing
• Particle swarm optimisation

Case-studies:
• Fastest 2D convolution
• Fast matrix-multiplication
[2]: T.L. Falch and A.C. Elster. Machine Learning Based Auto-tuning for Enhanced OpenCL Performance Portability.
Future: machine-learning [2]
• Train a model on a small subset
• Use the model to predict the remainder
Source-code on GitHub: https://github.com/CNugteren/CLTune