CLTune: A Generic Auto-Tuner for OpenCL Kernels Cedric Nugteren (presenter), Valeriu Codreanu IEEE MCSoC September 24, 2015 HPC center of the Netherlands

Page 1

CLTune: A Generic Auto-Tuner for OpenCL Kernels

Cedric Nugteren (presenter), Valeriu Codreanu

IEEE MCSoC, September 24, 2015

HPC center of the Netherlands

Page 2

Example: convolution

Targets:
• GPUs
• Multi-core CPUs
• Other OpenCL-capable devices

Example: blur filter

[Figure: blur filter applied to an X by Y image; labels: input, output, A, B, filter coefficients; filter size: Xf by Yf; example: 3 by 3 filter]

Page 3

OpenCL 2D convolution

double for-loop

each thread: one output pixel

Thread coarsening (2D)?

Unroll loops?

OpenCL work-group size?

Vector data-types?

Caching in local memory?
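The structure annotated on this slide (a double for-loop, one output pixel per thread) can be sketched sequentially as follows. This is a minimal Python reference rather than the OpenCL kernel from the talk; the name `convolve2d`, the clamped image borders and the uniform blur coefficients are assumptions made for illustration.

```python
def convolve2d(image, coeffs):
    """Naive 2D convolution: one output pixel per (x, y), double for-loop over the filter."""
    height, width = len(image), len(image[0])
    fh, fw = len(coeffs), len(coeffs[0])
    half_y, half_x = fh // 2, fw // 2
    output = [[0.0] * width for _ in range(height)]
    for y in range(height):              # in OpenCL, each (x, y) would be one thread
        for x in range(width):
            acc = 0.0
            for fy in range(fh):         # the double for-loop over the filter
                for fx in range(fw):
                    # Assumption: coordinates are clamped at the image border.
                    sy = min(max(y + fy - half_y, 0), height - 1)
                    sx = min(max(x + fx - half_x, 0), width - 1)
                    acc += coeffs[fy][fx] * image[sy][sx]
            output[y][x] = acc
    return output


# 3 by 3 blur filter with uniform coefficients (assumed for illustration).
blur = [[1.0 / 9.0] * 3 for _ in range(3)]
flat = [[5.0] * 4 for _ in range(4)]
blurred = convolve2d(flat, blur)
```

Each of the tuning questions on this slide corresponds to a different way of restructuring these loops on the device, which is why the kernel is a natural auto-tuning target.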

Page 4

Search-space explosion

16 x 2 x 16 x 4 x 5 = 10240 configurations

filter illegal configurations

3424 configurations

Large search-space:
• Not feasible to explore manually
• Perhaps not even feasible automatically?

Thread coarsening (2D)?

Unroll loops?

OpenCL work-group size?

Vector data-types?

Caching in local memory?
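The product and the filtering step above can be sketched as a Cartesian product followed by a legality check. Only the list sizes (16, 2, 16, 4, 5) come from the slide; the parameter names, their values and the two constraints in `is_legal` are hypothetical, so the filtered count here is not expected to equal 3424.

```python
from itertools import product

# Hypothetical parameter lists; only their sizes (16, 2, 16, 4, 5) come from the slide.
coarsening = [(x, y) for x in (1, 2, 4, 8) for y in (1, 2, 4, 8)]       # 16 options
unroll = [False, True]                                                    # 2 options
workgroup = [(x, y) for x in (8, 16, 32, 64) for y in (8, 16, 32, 64)]  # 16 options
vector_width = [1, 2, 4, 8]                                               # 4 options
local_caching = [0, 1, 2, 3, 4]                                           # 5 options


def is_legal(config):
    """Assumed example constraints; real ones depend on the kernel and the device."""
    (cx, _cy), _unroll, (wx, wy), vw, _cache = config
    fits_device = wx * wy <= 1024    # assumed work-group size limit
    vector_fits = cx % vw == 0       # vector width must divide the work per thread in X
    return fits_device and vector_fits


all_configs = list(product(coarsening, unroll, workgroup, vector_width, local_caching))
legal_configs = [c for c in all_configs if is_legal(c)]
print(len(all_configs), len(legal_configs))
```

Filtering before tuning keeps the search strategies from spending their budget on configurations that would fail to compile or run.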

Page 5

Why do we need an auto-tuner?

Large search-space:
• Not feasible to explore manually
• Perhaps not even feasible automatically?

Wide variety of devices:
• Different optimal kernels
• Even from the same vendor

User-parameter dependent:
• Examples: matrix sizes, image size, filter sizes, etc.

Vector data-types?

Thread coarsening (2D)?

Unroll loops?

OpenCL work-group size?

Caching in local memory?

[Figure: blur filter example with labels X, Y, Xf, Yf, A, B]

Page 6

Search strategies

Option 0: Full search
☺ Finds optimal solution
☹ Explores all options

3424 configurations on Tesla K40m GPU

[Figure: performance (% of best-known, 0 to 100) on device K40m; x-axis categories: searchspace, random, SA T=2, SA T=4, SA T=6, PSO S=3, PSO S=6; annotations: mean, rotated histogram]

Page 7

Search strategies

Option 0: Full search
Option 1: Random search
☺ Explores arbitrary fraction
☹ Performance varies

Colours: 3 example runs

[Figure: random on K40m; performance (% of best-known, 0 to 100) against search progress (steps, 0 to 100); labelled values: 56%, 94%, 74%]

Example: 107 out of 3424 configurations (1/32nd)
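Random search with a fixed budget can be sketched as sampling without replacement and keeping the fastest configuration. This is an illustrative Python sketch, not CLTune's implementation; `random_search` and `fake_runtime` are hypothetical names, and the cost function is a made-up stand-in for compiling and timing a kernel.

```python
import random


def random_search(configs, measure, fraction, rng):
    """Measure a random sample of the configurations and return the fastest one."""
    budget = max(1, int(len(configs) * fraction))
    sample = rng.sample(configs, budget)          # sampling without replacement
    timed = [(measure(c), c) for c in sample]
    best_time, best_config = min(timed, key=lambda t: t[0])
    return best_config, best_time, budget


def fake_runtime(config):
    """Made-up cost standing in for a real kernel measurement."""
    return 1.0 + ((config * 37) % 101) / 10.0


configs = list(range(3424))                       # indices of the legal configurations
rng = random.Random(0)
best, runtime, explored = random_search(configs, fake_runtime, 1 / 32, rng)
print(explored, best, runtime)
```

Because the sample differs from run to run, so does the best configuration found, which matches the spread between the three example runs.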

Page 8

Search strategies

Option 0: Full search
Option 1: Random search
Option 2: Simulated annealing
☺ Explores arbitrary fraction
☹ Performance varies
☹ Meta-parameter
☹ Local optima

[Figure: SA T=4 on K40m; performance (% of best-known, 0 to 100) against search progress (steps, 0 to 100); labelled values: 91%, 64%, 84%]

Example: 107 out of 3424 configurations (1/32nd)

Colours: 3 example runs
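Simulated annealing over a discrete parameter space can be sketched as a walk between neighbouring configurations that sometimes accepts a slower one. The sketch below is a generic textbook variant, not CLTune's code; the neighbourhood definition (one parameter changed), the linear cooling schedule, the parameter space and `fake_runtime` are all assumptions.

```python
import math
import random


def neighbours(config, space):
    """All configurations that differ from `config` in exactly one parameter."""
    result = []
    for i, values in enumerate(space):
        for v in values:
            if v != config[i]:
                result.append(config[:i] + (v,) + config[i + 1:])
    return result


def simulated_annealing(space, measure, budget, temperature, rng):
    current = tuple(rng.choice(values) for values in space)
    current_cost = measure(current)               # first of `budget` measurements
    best, best_cost = current, current_cost
    for step in range(1, budget):
        # Assumed cooling schedule: linear decrease, always strictly positive.
        t = temperature * (1.0 - step / budget)
        candidate = rng.choice(neighbours(current, space))
        cost = measure(candidate)
        delta = cost - current_cost
        # Accept improvements; accept a worse point with probability exp(-delta / t).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_cost = candidate, cost
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best, best_cost


# Hypothetical parameter space and made-up cost function.
space = [(1, 2, 4, 8), (8, 16, 32, 64), (1, 2, 4, 8), (0, 1)]


def fake_runtime(config):
    wpt, wg, vw, unrolled = config
    return 1.0 + abs(wpt - 4) + abs(wg - 32) / 8 + abs(vw - 2) + (0 if unrolled else 1)


best, cost = simulated_annealing(space, fake_runtime, budget=107,
                                 temperature=4.0, rng=random.Random(1))
```

The starting temperature is the meta-parameter listed among the drawbacks: too low and the walk gets stuck in a local optimum, too high and it behaves much like random search.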

Page 9

Search strategies

Option 0: Full search
Option 1: Random search
Option 2: Simulated annealing
Option 3: Particle swarm optimisation
☺ Explores arbitrary fraction
☹ Performance varies
☹ Meta-parameter
☹ Local optima

Colours: 3 example runs

[Figure: PSO S=3 on K40m; performance (% of best-known, 0 to 100) against search progress (steps, 0 to 100); labelled values: 43%, 66%, 44%]

Example: 107 out of 3424 configurations (1/32nd)

Line-types: 3 swarms
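Particle swarm optimisation needs adapting to a discrete parameter space. One possible adaptation, sketched below, lets each particle pick per dimension between a random value, its own best, the swarm's best, or its current value. This is a simplified illustration and may differ from what CLTune does; the probabilities, the parameter space and `fake_runtime` are assumptions.

```python
import random


def particle_swarm(space, measure, swarm_size, steps, rng,
                   p_random=0.2, p_local=0.3, p_global=0.3):
    """Discrete PSO sketch with `swarm_size * (steps + 1)` measurements in total."""
    particles = [tuple(rng.choice(v) for v in space) for _ in range(swarm_size)]
    local_best = [(measure(p), p) for p in particles]
    global_best = min(local_best, key=lambda t: t[0])
    for _ in range(steps):
        for i, position in enumerate(particles):
            moved = []
            for d, values in enumerate(space):
                r = rng.random()
                if r < p_random:
                    moved.append(rng.choice(values))       # explore randomly
                elif r < p_random + p_local:
                    moved.append(local_best[i][1][d])      # follow own best
                elif r < p_random + p_local + p_global:
                    moved.append(global_best[1][d])        # follow swarm best
                else:
                    moved.append(position[d])              # stay in place
            particles[i] = tuple(moved)
            cost = measure(particles[i])
            if cost < local_best[i][0]:
                local_best[i] = (cost, particles[i])
            if cost < global_best[0]:
                global_best = (cost, particles[i])
    return global_best[1], global_best[0]


# Hypothetical parameter space and made-up cost function.
space = [(1, 2, 4, 8), (8, 16, 32, 64), (1, 2, 4, 8), (0, 1)]


def fake_runtime(config):
    wpt, wg, vw, unrolled = config
    return 1.0 + abs(wpt - 4) + abs(wg - 32) / 8 + abs(vw - 2) + (0 if unrolled else 1)


# Swarm of 3 particles, 3 * (34 + 1) = 105 measurements.
best, cost = particle_swarm(space, fake_runtime, swarm_size=3, steps=34,
                            rng=random.Random(2))
```

The swarm size and the three probabilities are meta-parameters in this sketch, so the method has more knobs to set than simulated annealing.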

Page 10

Search strategies evaluation

[Figure: performance (% of best-known, 0 to 100) on device K40m; x-axis categories: searchspace, random, SA T=2, SA T=4, SA T=6, PSO S=3, PSO S=6]

Each search: 107 out of 3424 configurations (1/32nd)

average best result of 128 searches

meta-parameters for SA and PSO

Page 11

Search strategies evaluation

[Figure: three panels, one per device (K40m, HD7970, Iris); performance (% of best-known, 0 to 100); x-axis categories in each panel: searchspace, random, SA T=2, SA T=4, SA T=6, PSO S=3, PSO S=6]

Conclusions:
• Different per device
• PSO performs poorly
• Random search and SA perform well

Page 12

Convolution case-study

[Figure: performance (% of device-peak, 0 to 100) per filter size, device: GTX480]

Filter          GFLOPS   GB/s
3x3 filter      298      125
7x7 filter      572      46
11x11 filter    787      26

Conclusions:
• Different best parameters for different:
  • devices (see paper)
  • filter-sizes
• Performance equal or better than the state-of-the-art [1]

[1]: B. Van Werkhoven, J. Maassen, H.E. Bal, and F.J. Seinstra. Optimizing Convolution Operations on GPUs Using Adaptive Tiling.

Page 13

GEMM case-study

[Figure: GEMM performance comparison; performance (% of device-peak, 0 to 100) per device; bar labels in GFLOPS]

Device    this work   clBLAS   cuBLAS
K40m      1933        1197     2870
GTX480    845         318      882
HD7970    3144        2409     N/A
Iris      240         56       N/A

[Figure: GEMM tiling for C += A^T x B, with matrix dimensions M, N and K; per workgroup tiling: Mwg, Nwg, Kwg; per thread tiling: Mwi, Nwi]

Unrolled by a factor Kwi

Mwi x MdimC = Mwg
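The relation Mwi x MdimC = Mwg acts as a constraint on the search space: a per-workgroup tile size is only usable with an MdimC that divides it. A minimal sketch of that check follows; the function name `per_thread_tile` and the candidate values are made up for illustration.

```python
def per_thread_tile(mwg, mdimc):
    """Return Mwi such that Mwi x MdimC = Mwg, or None if MdimC does not divide Mwg."""
    if mwg % mdimc != 0:
        return None
    return mwg // mdimc


# Candidate values are made up for illustration.
candidates = [(mwg, mdimc) for mwg in (16, 32, 64, 128) for mdimc in (8, 16, 32)]
legal = {}
for mwg, mdimc in candidates:
    mwi = per_thread_tile(mwg, mdimc)
    if mwi is not None:
        legal[(mwg, mdimc)] = mwi
```

Here `legal` maps each valid (Mwg, MdimC) pair to its derived Mwi, so Mwi does not need to be tuned as an independent parameter.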

Conclusions:
• Different best parameters for different devices
• Performance better than clBLAS, but not as good as assembly-tuned cuBLAS

Page 14

CLTune: A Generic Auto-Tuner for OpenCL Kernels

Auto-tuning OpenCL kernels:
• Large search-space
• Wide variety of devices
• User-parameter dependent

Advanced search strategies:
• Simulated annealing
• Particle swarm optimisation

Case-studies:
• Fastest 2D convolution
• Fast matrix-multiplication


[2]: T.L. Falch and A.C. Elster. Machine Learning Based Auto-tuning for Enhanced OpenCL Performance Portability.


Future: machine-learning [2]
• Train a model on a small subset
• Use the model to predict the remainder

Source-code on GitHub: https://github.com/CNugteren/CLTune