Transcript of CLTune: A Generic Auto-Tuner for OpenCL Kernels
CLTune: A Generic Auto-Tuner for OpenCL Kernels
Cedric Nugteren (presenter), Valeriu Codreanu
IEEE MCSoC, September 24, 2015
HPC center of the Netherlands
Example: convolution
Targets:
• GPUs
• Multi-core CPUs
• Other OpenCL-capable devices
Example: blur filter
[Figure: 2D convolution — input A of size X by Y, a filter of size Xf by Yf with filter coefficients, output B; example: a 3 by 3 filter]
OpenCL 2D convolution
Kernel structure: a double for-loop over the filter; each thread computes one output pixel.

Tuning parameters:
• Thread coarsening (2D)?
• Unroll loops?
• OpenCL work-group size?
• Vector data-types?
• Caching in local memory?
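The kernel's core computation can be sketched as plain C++: each OpenCL thread computes one output pixel with a double for-loop over the filter. Here the two outer loops stand in for the 2D thread grid; the function name and zero-padded boundary handling are illustrative assumptions, not taken from the talk.

```cpp
#include <cassert>
#include <vector>

// Reference 2D convolution: input A (X by Y), filter coefficients (Xf by Yf),
// output B. The two outer loops emulate the 2D grid of OpenCL threads; the two
// inner loops are the per-thread double for-loop from the slide.
std::vector<float> Convolve2D(const std::vector<float>& A, int X, int Y,
                              const std::vector<float>& coeff, int Xf, int Yf) {
  std::vector<float> B(X * Y, 0.0f);
  for (int y = 0; y < Y; ++y) {    // thread index in Y
    for (int x = 0; x < X; ++x) {  // thread index in X
      float acc = 0.0f;
      for (int fy = 0; fy < Yf; ++fy) {    // the double for-loop
        for (int fx = 0; fx < Xf; ++fx) {  // over the filter
          int ix = x + fx - Xf / 2;
          int iy = y + fy - Yf / 2;
          if (ix >= 0 && ix < X && iy >= 0 && iy < Y) {  // zero padding
            acc += A[iy * X + ix] * coeff[fy * Xf + fx];
          }
        }
      }
      B[y * X + x] = acc;  // one output pixel per thread
    }
  }
  return B;
}
```

Each of the listed tuning parameters transforms this loop nest (coarsening merges threads, unrolling flattens the filter loops, local-memory caching stages A) without changing its result.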
Search-space explosion
16 x 2 x 16 x 4 x 5 = 10240 combinations of the five parameters (thread coarsening, loop unrolling, work-group size, vector data-types, local-memory caching); after filtering out illegal configurations, 3424 remain.

Large search-space:
• Not feasible to explore manually
• Perhaps not even feasible automatically?
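How the space explodes is simple arithmetic: the product of the per-parameter option counts. A minimal sketch, where the legality predicate is a hypothetical stand-in for the tuner's real constraints (e.g. work-group sizes exceeding device limits), which prune the 10240 combinations down to 3424:

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Enumerate every combination of the parameter options (odometer-style) and
// count those passing a legality check. With an always-true predicate this
// simply returns the product of the option counts.
int CountLegal(const std::vector<int>& counts,
               const std::function<bool(const std::vector<int>&)>& legal) {
  std::vector<int> cfg(counts.size(), 0);
  int n_legal = 0;
  while (true) {
    if (legal(cfg)) ++n_legal;
    // Increment the configuration like an odometer, carrying on overflow.
    size_t i = 0;
    while (i < cfg.size() && ++cfg[i] == counts[i]) { cfg[i] = 0; ++i; }
    if (i == cfg.size()) break;  // all digits wrapped: enumeration done
  }
  return n_legal;
}
```

With the slide's option counts {16, 2, 16, 4, 5} and no constraints this yields 10240; a real legality predicate would bring it to the 3424 quoted above.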
Why do we need an auto-tuner?
Large search-space:
• Not feasible to explore manually
• Perhaps not even feasible automatically?

Wide variety of devices:
• Different optimal kernels
• Even from the same vendor

User-parameter dependent:
• Examples: matrix sizes, image size, filter sizes, etc.
Search strategies
Option 0: Full search
☺ Finds optimal solution
☹ Explores all options
3424 configurations on Tesla K40m GPU
[Figure: search-space distribution on the K40m — performance (% of best-known) shown as a rotated histogram with its mean]
Search strategies
Option 1: Random search
☺ Explores arbitrary fraction
☹ Performance varies
Colours: 3 example runs
[Figure: random search on the K40m — performance (% of best-known) vs. search progress (steps); three example runs end at 56%, 74%, and 94%]
Example: 107 out of 3424 configurations (1/32nd)
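Random search needs almost no machinery: draw a fixed budget of configurations uniformly at random and keep the best. A minimal sketch, where `evaluate` is a hypothetical stand-in for compiling and timing a kernel:

```cpp
#include <cassert>
#include <random>
#include <vector>

// Random search over a discrete configuration space: `counts[i]` is the number
// of options for parameter i; `budget` is the fraction explored (the slides use
// 107 of 3424, i.e. 1/32nd). Higher score is better.
template <typename Objective>
std::vector<int> RandomSearch(const std::vector<int>& counts, int budget,
                              Objective evaluate, std::mt19937& rng) {
  std::vector<int> best;
  double best_score = -1e300;
  for (int step = 0; step < budget; ++step) {
    // Draw one configuration uniformly at random.
    std::vector<int> cfg(counts.size());
    for (size_t i = 0; i < counts.size(); ++i) {
      cfg[i] = std::uniform_int_distribution<int>(0, counts[i] - 1)(rng);
    }
    double score = evaluate(cfg);
    if (score > best_score) { best_score = score; best = cfg; }
  }
  return best;
}
```

The run-to-run spread the figure shows (56% to 94%) is inherent: each seed draws a different sample of the space.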
Search strategies
Option 2: Simulated annealing
☺ Explores arbitrary fraction
☹ Performance varies
☹ Meta-parameter (temperature T)
☹ Local optima
[Figure: simulated annealing (T=4) on the K40m — performance (% of best-known) vs. search progress (steps); three example runs end at 64%, 84%, and 91%]
Example: 107 out of 3424 configurations (1/32nd)
Colours: 3 example runs
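Simulated annealing walks the space via single-parameter moves, accepting a worse neighbour with probability exp(-delta/T); T is the meta-parameter the slides sweep (T = 2, 4, 6). A sketch under assumptions: the linear cooling schedule and random-restart-free neighbour move are illustrative choices, not taken from the talk, and `evaluate` again stands in for timing a kernel.

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// Simulated annealing over a discrete parameter space; higher score is better.
template <typename Objective>
std::vector<int> Anneal(const std::vector<int>& counts, int budget,
                        double temperature, Objective evaluate,
                        std::mt19937& rng) {
  auto rand_int = [&](int n) {
    return std::uniform_int_distribution<int>(0, n - 1)(rng);
  };
  std::uniform_real_distribution<double> unit(0.0, 1.0);

  // Random starting configuration.
  std::vector<int> current(counts.size());
  for (size_t i = 0; i < counts.size(); ++i) current[i] = rand_int(counts[i]);
  double current_score = evaluate(current);
  std::vector<int> best = current;
  double best_score = current_score;

  for (int step = 0; step < budget; ++step) {
    double T = temperature * (1.0 - double(step) / budget);  // linear cooling
    // Neighbour move: re-draw one randomly chosen parameter.
    std::vector<int> next = current;
    int p = rand_int(int(counts.size()));
    next[p] = rand_int(counts[p]);
    double next_score = evaluate(next);
    double delta = current_score - next_score;  // positive if neighbour is worse
    if (delta <= 0.0 || unit(rng) < std::exp(-delta / std::max(T, 1e-9))) {
      current = next;           // accept: better, or worse with probability
      current_score = next_score;  //         exp(-delta / T)
    }
    if (current_score > best_score) { best = current; best_score = current_score; }
  }
  return best;
}
```

Accepting occasional downhill moves is what lets the walk escape some local optima; the cons listed above remain, since T must be picked per problem.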
Search strategies
Option 3: Particle swarm optimisation
☺ Explores arbitrary fraction
☹ Performance varies
☹ Meta-parameter (swarm size S)
☹ Local optima
Colours: 3 example runs
[Figure: particle swarm optimisation (S=3) on the K40m — performance (% of best-known) vs. search progress (steps); three example runs end at 43%, 44%, and 66%]
Example: 107 out of 3424 configurations (1/32nd)
Line-types: 3 swarms
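Particle swarm optimisation moves S particles (the slides use S = 3 and S = 6) through a continuous relaxation of the parameter space, blending each particle's velocity with pulls toward its personal best and the swarm's global best, then rounding to the nearest legal index. A sketch under assumptions: the coefficients (w = 0.7, c1 = c2 = 1.5) are conventional defaults, not values from the talk.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

template <typename Objective>
std::vector<int> ParticleSwarm(const std::vector<int>& counts, int swarm_size,
                               int steps, Objective evaluate, std::mt19937& rng) {
  std::uniform_real_distribution<double> unit(0.0, 1.0);
  const size_t dims = counts.size();
  auto clamp_round = [&](double v, size_t i) {  // continuous -> legal index
    return std::max(0, std::min(counts[i] - 1, int(std::lround(v))));
  };
  auto score_of = [&](const std::vector<double>& pos) {
    std::vector<int> cfg(dims);
    for (size_t i = 0; i < dims; ++i) cfg[i] = clamp_round(pos[i], i);
    return evaluate(cfg);
  };

  struct Particle { std::vector<double> pos, vel, best_pos; double best_score; };
  std::vector<Particle> swarm(swarm_size);
  std::vector<double> g_pos(dims);  // global best position
  double g_score = -1e300;
  for (auto& p : swarm) {  // random initial positions, zero velocities
    p.pos.resize(dims); p.vel.assign(dims, 0.0);
    for (size_t i = 0; i < dims; ++i) p.pos[i] = unit(rng) * (counts[i] - 1);
    p.best_pos = p.pos; p.best_score = score_of(p.pos);
    if (p.best_score > g_score) { g_score = p.best_score; g_pos = p.pos; }
  }

  const double w = 0.7, c1 = 1.5, c2 = 1.5;  // inertia, cognitive, social
  for (int s = 0; s < steps; ++s) {
    for (auto& p : swarm) {
      for (size_t i = 0; i < dims; ++i) {
        p.vel[i] = w * p.vel[i] + c1 * unit(rng) * (p.best_pos[i] - p.pos[i])
                                + c2 * unit(rng) * (g_pos[i] - p.pos[i]);
        p.pos[i] += p.vel[i];
      }
      double sc = score_of(p.pos);
      if (sc > p.best_score) { p.best_score = sc; p.best_pos = p.pos; }
      if (sc > g_score) { g_score = sc; g_pos = p.pos; }
    }
  }
  std::vector<int> best(dims);
  for (size_t i = 0; i < dims; ++i) best[i] = clamp_round(g_pos[i], i);
  return best;
}
```

The continuous relaxation is one reason PSO fits auto-tuning poorly: neighbouring indices of a tuning parameter need not have neighbouring performance, which matches the weak results on the next slides.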
Search strategies evaluation
[Figure: best-found performance (% of best-known) on the K40m for random search, SA (T=2, 4, 6), and PSO (S=3, 6), compared against the full search-space distribution]
Each search: 107 out of 3424 configurations (1/32nd)
Shown: average best result of 128 searches; T and S are the meta-parameters for SA and PSO
Search strategies evaluation
[Figure: the same strategy comparison (performance as % of best-known) for three devices: K40m, HD7970, and Iris]
Conclusions:
• Different per device
• PSO performs poorly
• Random search and SA perform well
Convolution case-study
[Figure: tuned convolution performance (% of device peak) on the GTX480 — 3x3 filter: 298 GFLOPS, 125 GB/s; 7x7 filter: 572 GFLOPS, 46 GB/s; 11x11 filter: 787 GFLOPS, 26 GB/s]
Conclusions:
• Different best parameters for different devices (see paper) and filter-sizes
• Performance equal or better than the state-of-the-art [1]
[1]: B. Van Werkhoven, J. Maassen, H.E. Bal, and F.J. Seinstra. Optimizing Convolution Operations on GPUs Using Adaptive Tiling.
GEMM case-study
[Figure: GEMM performance comparison (% of device peak, absolute GFLOPS annotated) for this work, clBLAS, and cuBLAS on the K40m, GTX480, HD7970, and Iris; cuBLAS is N/A on the HD7970 and Iris]
[Figure: GEMM tiling for C += A x B — per-workgroup tiles of Mwg by Nwg (stepping over K by Kwg), per-thread tiles of Mwi by Nwi, inner loop unrolled by a factor Kwi; constraint: Mwi x MdimC = Mwg]
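The two tiling levels from the figure can be sketched as a loop nest using the slide's parameter names. This is a readability sketch, not the tuned kernel: a real OpenCL version would stage the A and B tiles in local memory and unroll the K loop by Kwi, and this sketch assumes the matrix dimensions divide evenly by the tile sizes.

```cpp
#include <cassert>
#include <vector>

// Blocked GEMM, C += A * B (row-major): the outer three loops tile C into
// Mwg-by-Nwg workgroup tiles while stepping over K by Kwg; inside, each
// "thread" owns an Mwi-by-Nwi sub-tile. Assumes M, N, K are multiples of the
// tile sizes.
void GemmTiled(int M, int N, int K, const std::vector<float>& A,
               const std::vector<float>& B, std::vector<float>& C,
               int Mwg, int Nwg, int Kwg, int Mwi, int Nwi) {
  for (int mw = 0; mw < M; mw += Mwg)
    for (int nw = 0; nw < N; nw += Nwg)
      for (int kw = 0; kw < K; kw += Kwg)              // per-workgroup tiling
        for (int mi = mw; mi < mw + Mwg; mi += Mwi)
          for (int ni = nw; ni < nw + Nwg; ni += Nwi)  // per-thread tiling
            for (int m = mi; m < mi + Mwi; ++m)
              for (int n = ni; n < ni + Nwi; ++n) {
                float acc = 0.0f;
                for (int k = kw; k < kw + Kwg; ++k)    // unroll candidate (Kwi)
                  acc += A[m * K + k] * B[k * N + n];
                C[m * N + n] += acc;
              }
}
```

Every tile size here is a tuning parameter, which is exactly why the GEMM search-space dwarfs the convolution one.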
Conclusions:
• Different best parameters for different devices
• Performance better than clBLAS, but not as good as assembly-tuned cuBLAS
CLTune: A Generic Auto-Tuner for OpenCL Kernels
Auto-tuning OpenCL kernels:
• Large search-space
• Wide variety of devices
• User-parameter dependent

Advanced search strategies:
• Simulated annealing
• Particle swarm optimisation

Case-studies:
• Fastest 2D convolution
• Fast matrix-multiplication
[2]: T.L. Falch and A.C. Elster. Machine Learning Based Auto-tuning for Enhanced OpenCL Performance Portability.
Future: machine-learning [2]
• Train a model on a small subset
• Use the model to predict the remainder
Source-code on GitHub: https://github.com/CNugteren/CLTune