[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Lecture #6: CUDA Ninja Tricks | March 1st, 2011 | Nicolas Pinto (MIT, Harvard) | Massively Parallel Computing, CS 264 / CSCI E-292

http://cs264.org

Transcript of [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Page 1: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Lecture #6: CUDA Ninja Tricks | March 1st, 2011

Nicolas Pinto (MIT, Harvard)

Massively Parallel Computing, CS 264 / CSCI E-292

Page 2: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning
Page 3: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

GPU “Scripting”, Meta-programming, Auto-tuning

Page 4: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

News

Page 5: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

During this course, we’ll try to “ ” and use existing material ;-)

adapted for CS264

Page 6: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Todayyey!!

Page 7: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Outline

1. Scripting GPUs with PyCUDA

2. Meta-programming and RTCG

3. Case study in brain-inspired AI

Page 9: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Why do Scripting for GPUs?

GPUs are everything that scripting languages are not.

Highly parallel

Very architecture-sensitive

Built for maximum compute/memory throughput

→ complement each other

CPU: largely restricted to control tasks (∼1000/sec)

Scripting fast enough

Realize a promise: Use Scripting... from first prototype to full-scale production code.

Nicolas Pinto (MIT) and Andreas Klöckner (Brown), PyCuda Tutorial; slide by Andreas Klöckner (NYU)

Page 12: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Why do Scripting for GPUs?

GPUs are everything that scripting languages are not.

Highly parallel

Very architecture-sensitive

Built for maximum FP/memory throughput

→ complement each other

CPU: largely restricted to control tasks (∼1000/sec)

Scripting fast enough

Python + CUDA = PyCUDA

Python + OpenCL = PyOpenCL

Andreas Klöckner, PyCUDA: Even Simpler GPU Programming with Python; slide by Andreas Klöckner (NYU)

Page 13: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

How are High-Performance Codes constructed?

“Traditional” Construction of High-Performance Codes:

C/C++/Fortran

Libraries

“Alternative” Construction of High-Performance Codes:

Scripting for ‘brains’

GPUs for ‘inner loops’

Play to the strengths of each programming environment.

Page 14: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Scripting: Python

One example of a scripting language: Python

Mature

Large and active community

Emphasizes readability

Written in widely-portable C

A ‘multi-paradigm’ language

Rich ecosystem of sci-comp related software

Page 15: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Scripting Languages

Python:

is discoverable and interactive.

has comprehensive built-in functionality.

manages resources automatically.

uses run-time typing.

works well for “gluing” lower-level blocks together.

Page 16: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Scripting: Goals

Scripting languages aim to reduce the load on the programmer:

Reduce required knowledge

Encourage experimentation

Eliminate sources of error

Encourage abstraction wherever possible

Value programmer time over computer time

Think about the tools you use. Use the right tool for the job.

Page 19: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Scripting: Speed

Usual answer to the “Speed Question”: Hybrid (“mixed”) Code.

Plays to the strengths of each language.

But: Introduces (some) complexity.

Observation: GPU code is already hybrid.

Consequence: No added complexity through hybrid code.

Page 20: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Whetting your appetite

    import pycuda.driver as cuda
    import pycuda.autoinit, pycuda.compiler
    import numpy

    # Create a 4x4 float32 array on the host and copy it to the GPU.
    a = numpy.random.randn(4,4).astype(numpy.float32)
    a_gpu = cuda.mem_alloc(a.nbytes)
    cuda.memcpy_htod(a_gpu, a)

[This is examples/demo.py in the PyCUDA distribution.]

Page 21: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Whetting your appetite

    # Compute kernel: CUDA C source, compiled at run time.
    mod = pycuda.compiler.SourceModule("""
    __global__ void twice(float *a)
    {
      int idx = threadIdx.x + threadIdx.y*4;
      a[idx] *= 2;
    }
    """)

    func = mod.get_function("twice")
    func(a_gpu, block=(4,4,1))

    # Copy the result back to the host and compare with the input.
    a_doubled = numpy.empty_like(a)
    cuda.memcpy_dtoh(a_doubled, a_gpu)
    print(a_doubled)
    print(a)
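The index expression in the kernel above maps each thread of the 4×4 block to one element of the flattened array. The following is a minimal CPU-only sketch of that mapping; `run_block` is a hypothetical helper written for illustration, and the threads run one after another rather than in parallel.

```python
# Pure-Python emulation of the `twice` kernel for one 4x4 thread block.
# No GPU is involved; `run_block` is a hypothetical helper for illustration.

def twice(a, thread_idx):
    """Body of the kernel: each thread doubles one element."""
    x, y = thread_idx
    idx = x + y * 4          # same flattening as threadIdx.x + threadIdx.y*4
    a[idx] *= 2


def run_block(kernel, a, block):
    """Call `kernel` once per thread of a (bx, by) block, sequentially."""
    bx, by = block
    for y in range(by):
        for x in range(bx):
            kernel(a, (x, y))


a = [float(i) for i in range(16)]    # a flattened 4x4 array
a_doubled = list(a)                  # work on a copy, keep `a` for comparison
run_block(twice, a_doubled, block=(4, 4))
print(a_doubled)
print(a)
```

Because every (x, y) pair produces a distinct index, no two threads write the same element, which is why the real kernel needs no synchronization.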

Page 23: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Whetting your appetite, Part II

Did somebody say “Abstraction is good”?

Page 24: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Whetting your appetite, Part II

    import numpy
    import pycuda.autoinit
    from pycuda import gpuarray

    a_cpu = numpy.random.randn(4,4).astype(numpy.float32)
    b_cpu = numpy.random.randn(4,4).astype(numpy.float32)
    c_cpu = a_cpu * b_cpu

    # Same elementwise product, computed on the GPU.
    a_gpu = gpuarray.to_gpu(a_cpu)
    b_gpu = gpuarray.to_gpu(b_cpu)
    c_gpu = (a_gpu * b_gpu).get()

    print(c_cpu - c_gpu)

Page 25: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Remember me?

    // trivia
    #include <stdio.h>
    #include <stdlib.h>

    #define CUDA_CHK(NAME, ARGS) { \
      cudaError_t cuda_err_code = NAME ARGS; \
      if (cuda_err_code != cudaSuccess) { \
        printf("%s failed with code %d\n", #NAME, cuda_err_code); \
        abort(); \
      } \
    }
    // end

    // kernel (multiplies a by b elementwise, despite its name)
    __global__ void square_array(float *a, float *b, int n)
    {
      int i = (blockIdx.x * blockDim.y + threadIdx.y)
        * blockDim.x + threadIdx.x;
      if (i < n)
        a[i] = a[i] * b[i];
    }
    // end

    // main1
    int main()
    {
      cudaSetDevice(0); // EDIT ME

      const int n = 4096;

      float *a_host = (float *) malloc(n*sizeof(float));
      float *b_host = (float *) malloc(n*sizeof(float));

      float *a_device, *b_device;
      CUDA_CHK(cudaMalloc, ((void **) &a_device, n*sizeof(float)));
      CUDA_CHK(cudaMalloc, ((void **) &b_device, n*sizeof(float)));
      // end

      // main2
      for (int i = 0; i < n; i++) { a_host[i] = i; b_host[i] = i+1; }

      CUDA_CHK(cudaMemcpy, (a_device, a_host, n*sizeof(float),
            cudaMemcpyHostToDevice));
      CUDA_CHK(cudaMemcpy, (b_device, b_host, n*sizeof(float),
            cudaMemcpyHostToDevice));

      dim3 block_dim(16, 16);
      int block_size = block_dim.x*block_dim.y;
      int n_blocks = (n + block_size-1) / block_size;
      square_array<<<n_blocks, block_dim>>>(a_device, b_device, n);
      // end

      // main3
      CUDA_CHK(cudaMemcpy, (a_host, a_device, n*sizeof(float),
            cudaMemcpyDeviceToHost));

      for (int i = 0; i < n; i++)
        printf("%.0f ", a_host[i]);
      puts("\n");

      // Release both host buffers and both device buffers.
      free(a_host);
      free(b_host);
      CUDA_CHK(cudaFree, (a_device));
      CUDA_CHK(cudaFree, (b_device));
    }
    // end
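Two pieces of arithmetic in the listing above are easy to get wrong: the rounded-up block count and the flattened global thread index. The sketch below checks both in plain Python for the listing's values (n = 4096, 16×16 blocks). The function names are hypothetical and nothing here runs on a GPU.

```python
# Host-side arithmetic from the listing above, checked in plain Python.
# `global_index` mirrors the kernel's index expression; it is not a CUDA API.

def n_blocks_for(n, block_size):
    """Ceiling division: the smallest block count that covers n elements."""
    return (n + block_size - 1) // block_size


def global_index(block_idx_x, thread_idx, block_dim):
    """(blockIdx.x*blockDim.y + threadIdx.y)*blockDim.x + threadIdx.x"""
    tx, ty = thread_idx
    bdx, bdy = block_dim
    return (block_idx_x * bdy + ty) * bdx + tx


n = 4096
block_dim = (16, 16)
block_size = block_dim[0] * block_dim[1]
n_blocks = n_blocks_for(n, block_size)

touched = []
for b in range(n_blocks):
    for ty in range(block_dim[1]):
        for tx in range(block_dim[0]):
            i = global_index(b, (tx, ty), block_dim)
            if i < n:               # the kernel's bounds guard
                touched.append(i)

print(n_blocks)                              # number of blocks launched
print(sorted(touched) == list(range(n)))     # is every element hit exactly once?
```

The `i < n` guard matters whenever n is not a multiple of the block size: the last block then contains threads whose index falls past the end of the array.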

Page 26: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

PyCUDA Philosophy

Provide complete access

Automatically manage resources

Provide abstractions

Check for and report errors automatically

Full documentation

Integrate tightly with numpy

Page 27: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

PyCuda: Workflow

[Diagram (PyCuda): Edit → Run → SourceModule("...") → Cache! → nvcc → .cubin → Upload to GPU → Run on GPU]

Page 28: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Automatic Cleanup

Reachable objects (memory, streams, ...) are never destroyed.

Once unreachable, released at an unspecified future time.

Scarce resources (memory) can be explicitly freed. (obj.free())

Correctly deals with multiple contexts and dependencies.
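The cleanup rules above can be modeled without a GPU. This is a toy sketch, assuming a made-up `FakeAllocation` class and a `live` set that stand in for device memory. It is not how PyCUDA implements cleanup; it only shows the difference between waiting for a finalizer and calling `free()` explicitly.

```python
# Toy model of the cleanup rules above. `FakeAllocation` and `live` are
# hypothetical stand-ins; no GPU memory is actually allocated.

live = set()     # handles of "device" allocations that are currently held


class FakeAllocation(object):
    _next_handle = 0

    def __init__(self, nbytes):
        self.nbytes = nbytes
        self.handle = FakeAllocation._next_handle
        FakeAllocation._next_handle += 1
        live.add(self.handle)

    def free(self):
        """Explicitly release the resource; safe to call more than once."""
        if self.handle is not None:
            live.discard(self.handle)
            self.handle = None

    def __del__(self):
        # Runs once the object is unreachable; the exact moment is up to
        # the interpreter, so scarce resources should not rely on it.
        self.free()


a = FakeAllocation(1024)
b = FakeAllocation(2048)
print(sorted(live))      # both allocations are held while reachable

a.free()                 # explicit release of a scarce resource
print(sorted(live))

del b                    # unreachable: released whenever the finalizer runs
```

Making `free()` idempotent lets the finalizer call it unconditionally, so an object that was released by hand is not released a second time.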

Page 29: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

gpuarray: Simple Linear Algebra

pycuda.gpuarray: Meant to look and feel just like numpy.

gpuarray.to_gpu(numpy_array)

numpy_array = gpuarray.get()

No: nd indexing, slicing, etc. (yet!)

Yes: +, -, *, /, fill, sin, exp, rand, take, ...

Random numbers using pycuda.curandom

Mixed types (int32 + float32 = float64)

print gpuarray for debugging.

Memory behind gpuarray available as .gpudata attribute.

Use as kernel arguments, textures, etc.

Page 30: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

What’s this “numpy”, anyway?

Numpy: package for large, multi-dimensional arrays.

Vectors, Matrices, ...

A+B, sin(A), dot(A,B)

la.solve(A, b), la.eig(A)

cube[:, :, n-k:n+k], cube+5

All much faster than functional equivalents in Python.

“Python’s MATLAB”: Basis for SciPy, plotting, ...

Page 31: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

gpuarray: Elementwise expressions

Avoiding extra store-fetch cycles for elementwise math:

    import numpy.linalg as la
    import pycuda.autoinit
    from pycuda import gpuarray

    from pycuda.curandom import rand as curand
    a_gpu = curand((50,))
    b_gpu = curand((50,))

    from pycuda.elementwise import ElementwiseKernel
    lin_comb = ElementwiseKernel(
            "float a, float *x, float b, float *y, float *z",
            "z[i] = a*x[i] + b*y[i]")

    c_gpu = gpuarray.empty_like(a_gpu)
    lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

    assert la.norm((c_gpu - (5*a_gpu+6*b_gpu)).get()) < 1e-5
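The two strings passed to the elementwise wrapper above (an argument list and a per-element statement) are enough to assemble a complete kernel. The sketch below shows one way such an expansion could look. The template and the function `make_elementwise_source` are assumptions made for illustration, not the source that PyCUDA actually generates.

```python
# Sketch of how an elementwise wrapper might expand its two strings into
# a full kernel. The template below is an assumption made for illustration;
# it is not the source that PyCUDA actually generates.

KERNEL_TEMPLATE = """\
__global__ void %(name)s(%(arguments)s, long n)
{
  unsigned tid = threadIdx.x;
  unsigned total_threads = gridDim.x * blockDim.x;
  unsigned start = blockDim.x * blockIdx.x;

  for (long i = start + tid; i < n; i += total_threads)
  {
    %(operation)s;
  }
}
"""


def make_elementwise_source(arguments, operation, name="kernel"):
    """Wrap a per-element statement in a grid-stride loop over i."""
    return KERNEL_TEMPLATE % {
        "name": name,
        "arguments": arguments,
        "operation": operation,
    }


source = make_elementwise_source(
    "float a, float *x, float b, float *y, float *z",
    "z[i] = a*x[i] + b*y[i]",
    name="lin_comb")
print(source)
```

Since the whole expression lands in one loop body, `a*x[i] + b*y[i]` is evaluated in a single pass over memory, with no temporary array for the intermediate products.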

Page 32: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

gpuarray: Reduction made easy

Example: A scalar product calculation

    import numpy
    import pycuda.autoinit

    from pycuda.reduction import ReductionKernel
    dot = ReductionKernel(dtype_out=numpy.float32, neutral="0",
            reduce_expr="a+b", map_expr="x[i]*y[i]",
            arguments="const float *x, const float *y")

    from pycuda.curandom import rand as curand
    x = curand((1000*1000), dtype=numpy.float32)
    y = curand((1000*1000), dtype=numpy.float32)

    x_dot_y = dot(x, y).get()
    x_dot_y_cpu = numpy.dot(x.get(), y.get())
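The reduction above is specified by three pieces: a map expression applied at every index, a binary reduce expression, and a neutral element. A sequential CPU reference makes those semantics concrete. The helper `reduction` below is hypothetical and exists only for this illustration.

```python
# CPU reference for the map/reduce split used by the scalar product above.
# `reduction` is a hypothetical helper, written only to show the semantics.
from functools import reduce


def reduction(neutral, reduce_expr, map_expr, *arrays):
    """Apply map_expr at every index i, then fold with reduce_expr."""
    n = len(arrays[0])
    mapped = (map_expr(i, *arrays) for i in range(n))
    return reduce(reduce_expr, mapped, neutral)


x = [1.0, 2.0, 3.0, 4.0]
y = [0.5, 0.25, 2.0, 1.0]

x_dot_y = reduction(
    0.0,                              # neutral element of the reduction
    lambda a, b: a + b,               # reduce_expr="a+b"
    lambda i, x, y: x[i] * y[i],      # map_expr="x[i]*y[i]"
    x, y)
print(x_dot_y)
```

A parallel reduction combines partial results in a different order than this left-to-right fold, so the reduce expression should be associative, and floating-point results may differ slightly from a sequential sum.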

Page 33: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Step 3: Usage

Complex numbers

... in GPUArray

... in user code (pycuda-complex.hpp)

If/then/else for GPUArrays

Support for custom device pointers

Smarter device picking/context creation

PyFFT: FFT for PyOpenCL and PyCUDA

scikits.cuda: CUFFT, CUBLAS, CULA

Page 34: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Sparse Matrix-Vector on the GPU

New feature in 0.94: Sparse matrix-vector multiplication

Uses “packeted format” by Garland and Bell (also includes parts of their code)

Integrates with scipy.sparse.

Conjugate-gradients solver included

Deferred convergence checking

Page 35: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Kernel Invocation: Automatic Copies

    mod = pycuda.compiler.SourceModule(
        "__global__ void my_func(float *out, float *in){...}")
    func = mod.get_function("my_func")

    src = numpy.random.randn(400).astype(numpy.float32)
    dest = numpy.empty_like(src)

    # Out/In wrap host arrays; the copies happen around the launch.
    func(
        cuda.Out(dest),
        cuda.In(src),
        block=(400,1,1))

“InOut” exists, too.

Only for immediate invocation style.

Page 36: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Step 4: Debugging

New in 0.94.1: Support for CUDA gdb:

$ cuda-gdb --args python -m pycuda.debug demo.py

Automatically:

Sets Compiler flags

Retains source code

Disables compiler cache

Page 37: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

CUDA APIs

[Diagram, layers: Hardware; Kernel Driver; Driver API; Runtime API / PyCuda; C/C++ / Python]

CUDA has two Programming Interfaces:

“Runtime” high-level (libcudart.so, in the “toolkit”)

“Driver” low-level (libcuda.so, comes with GPU driver)

(mutually exclusive)

Page 38: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Runtime vs. Driver API

Runtime ↔ Driver differences:

Explicit initialization.

Code objects (“Modules”) become programming language objects.

Texture handling requires slightly more work.

Only needs nvcc for compiling GPU code.

Driver API:

Conceptually cleaner

Less sugar-coating (provide in Python)

Not very different otherwise

Page 39: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

PyCuda: API Tracing

With ./configure --cuda-trace=1:

    import pycuda.driver as cuda
    import pycuda.autoinit
    import pycuda.compiler
    import numpy

    a = numpy.random.randn(4,4).astype(numpy.float32)
    a_gpu = cuda.mem_alloc(a.nbytes)
    cuda.memcpy_htod(a_gpu, a)

    mod = pycuda.compiler.SourceModule("""
        __global__ void doublify(float *a)
        {
          int idx = threadIdx.x + threadIdx.y*4;
          a[idx] *= 2;
        }
        """)

    func = mod.get_function("doublify")
    func(a_gpu, block=(4,4,1))

    a_doubled = numpy.empty_like(a)
    cuda.memcpy_dtoh(a_doubled, a_gpu)
    print(a_doubled)
    print(a)

Traced driver API calls:

    cuInit
    cuDeviceGetCount
    cuDeviceGet
    cuCtxCreate
    cuMemAlloc
    cuMemcpyHtoD
    cuCtxGetDevice
    cuDeviceComputeCapability
    cuModuleLoadData
    cuModuleGetFunction
    cuFuncSetBlockShape
    cuParamSetv
    cuParamSetSize
    cuLaunchGrid
    cuMemcpyDtoH
    cuCtxPopCurrent
    cuCtxPushCurrent
    cuMemFree
    cuCtxPopCurrent
    cuCtxPushCurrent
    cuModuleUnload
    cuCtxPopCurrent
    cuCtxDestroy

Page 40: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

PyCUDA: Vital Information

http://mathema.tician.de/software/pycuda

Complete documentation

MIT License (no warranty, free for all use)

Requires: numpy, Python 2.4+ (Win/OS X/Linux)

Support via mailing list

Page 41: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Sleepy ?

Page 42: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning
Page 43: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Outline

1. Scripting GPUs with PyCUDA

2. Meta-programming and RTCG

3. Case study in brain-inspired AI

Page 44: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

... too much ?

caching

bank conflicts

coalescing

partition camping

clamping

mixed precision

broadcasting

streams

zero-copy

Page 45: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

can’t decide ?

Page 46: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

GPU Programming: Implementation Choices

Many difficult questions

Insufficient heuristics

Answers are hardware-specific and have no lasting value

Proposed Solution: Tune automatically for hardware at run time, cache tuning results.

Decrease reliance on knowledge of hardware internals

Shift emphasis from tuning results to tuning ideas

Page 48: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Metaprogramming

[Diagram: Idea → Python Code → GPU Code → GPU Compiler → GPU Binary → GPU → Result; stages labeled Human and Machine]

In GPU scripting, GPU code does not need to be a compile-time constant.

(Key: Code is data; it wants to be reasoned about at run time)

Good for code generation

PyCUDA, PyOpenCL
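Because the GPU code is just a string in the host program, it can be assembled from values that only become known while the program runs. The sketch below uses the standard library's `string.Template`. The kernel, the `C_TYPES` table, and `render_scale_kernel` are hypothetical, and the compile step is left out so that the example runs without a GPU.

```python
# Run-time code generation in miniature: the kernel text is built from
# values that are only known once the program is running. The kernel and
# helper names are hypothetical; compiling the string is left out.
from string import Template

KERNEL = Template("""\
__global__ void scale(${ctype} *a, ${ctype} factor)
{
  int idx = threadIdx.x + threadIdx.y*${width};
  a[idx] *= factor;
}
""")

C_TYPES = {"float32": "float", "float64": "double", "int32": "int"}


def render_scale_kernel(dtype_name, width):
    """Specialize the kernel source for one element type and row width."""
    return KERNEL.substitute(ctype=C_TYPES[dtype_name], width=width)


# Pretend these were discovered from the input data at run time.
source = render_scale_kernel("float64", width=4)
print(source)
```

In a PyCUDA program the rendered string would be handed to `SourceModule`, exactly like the hand-written kernels on the earlier slides.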

Page 57: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Machine-generated Code

Why machine-generate code?

Automated Tuning (cf. ATLAS, FFTW)

Data types

Specialize code for given problem

Constants faster than variables (→ register pressure)

Loop Unrolling
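Two items from the list above, constants instead of variables and loop unrolling, combine naturally in a generator. This sketch emits a fully unrolled weighted sum with the coefficients written into the source as literals. `unrolled_weighted_sum` is a hypothetical function, and the generated CUDA C is only printed, not compiled.

```python
# Generate a fully unrolled weighted sum with the coefficients baked in as
# literals. `unrolled_weighted_sum` is a hypothetical generator; a real
# tuner would also decide whether unrolling pays off.

def unrolled_weighted_sum(coeffs, name="weighted_sum"):
    """Emit CUDA C computing out[i] = sum over k of coeffs[k]*in[i + k]."""
    terms = ["%sf*in[i + %d]" % (repr(float(c)), k)
             for k, c in enumerate(coeffs)]
    body = "\n      + ".join(terms)
    return (
        "__global__ void %s(const float *in, float *out, int n)\n"
        "{\n"
        "  int i = blockIdx.x*blockDim.x + threadIdx.x;\n"
        "  if (i + %d <= n)\n"
        "    out[i] = %s;\n"
        "}\n" % (name, len(coeffs), body))


source = unrolled_weighted_sum([0.25, 0.5, 0.25])
print(source)
```

The generated kernel contains no loop counter and no coefficient array, so neither occupies a register or needs a memory fetch. The price is one compiled kernel per coefficient set.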

Page 58: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

PyCuda: Support for Metaprogramming

Access properties of compiled code: func.{num_regs,shared_size_bytes,local_size_bytes}

Exact GPU timing via events

Can calculate hardware-dependent MP occupancy

codepy (by Andreas):

Build C syntax trees from Python

Generates readable, indented C

Or use a templating engine (many available, e.g. Cheetah)
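The register and shared-memory figures listed above are the inputs to an occupancy estimate. The sketch below is a rough version of that calculation. The function `occupancy` is hypothetical, the hardware limits are illustrative placeholders rather than the figures of any particular GPU, and real allocation-granularity rules are ignored. In PyCUDA the two per-kernel inputs would come from `func.num_regs` and `func.shared_size_bytes`.

```python
# Rough multiprocessor-occupancy estimate from per-kernel resource usage.
# The hardware limits are illustrative placeholders, not the figures of any
# particular GPU, and real allocation-granularity rules are ignored.

def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              max_threads=1024, max_blocks=8,
              max_regs=16384, max_smem=16384):
    """Fraction of the thread slots on one multiprocessor that are filled."""
    limits = [max_blocks, max_threads // threads_per_block]
    if regs_per_thread:
        limits.append(max_regs // (regs_per_thread * threads_per_block))
    if smem_per_block:
        limits.append(max_smem // smem_per_block)
    blocks = min(limits)     # the scarcest resource decides
    return blocks * threads_per_block / float(max_threads)


for regs in (10, 16, 32):
    print(regs, occupancy(threads_per_block=256, regs_per_thread=regs,
                          smem_per_block=4096))
```

An auto-tuner can use such an estimate to discard kernel variants that are bound to run with few resident threads before spending time on measuring them.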

Page 59: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Outline

1. Scripting GPUs with PyCUDA

2. Meta-programming and RTCG

3. Case study in brain-inspired AI (vision)

Page 60: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Motivation

Page 61: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

The Problem: Visual Object Recognition

fast

accurate

tolerant to variations

effortless

critical to survival

Page 63: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

The Approach: Reverse and Forward Engineering the Brain

REVERSE: Study Natural System

FORWARD: Build Artificial System

Page 64: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Why is modeling challenging?

The brain is a massively parallel computer

➡ Big models are paralyzingly slow to run

Advice from Dave Cox:

“Don’t run anything that takes longer than a week to complete, because it will just crash halfway through anyways (or you’ll discover a bug) and you’ll never finish your Ph.D.”

Neural data only provides weak constraints

➡ Lots of parameters – hard to explore

Page 66: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Visual Cortex

brain = 20 petaflops !

Page 67: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

GPUs (since 2006)

7800 GTX (2006): OpenGL/Cg, C++/Python

Monster16GPU (2008): CUDA, Python

Tesla Cluster (2009): CUDA/OpenCL, Python

Page 68: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Build your own!

Page 69: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Cell Broadband Engine (since 2007)

Teraflop Playstation3 clusters: DiCarlo Lab / MIT, Cox Lab / Harvard

Page 70: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

A Match Made in Heaven: Brains are parallel, GPUs are parallel

Multiple scales of parallelism:

- “Embarrassingly” parallel: video frames, regions

- Fine-grained: independent “neurons,” operating on overlapping inputs

Page 71: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

A Match Made in Heaven: Images In, Images Out

Image processing particularly well-suited:

- Excellent arithmetic intensity: very natural to load image patches into shared memory

- Data: 2D / 3D locality

Page 72: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Why is modeling challenging?

The brain is a massively parallel computer

➡ Big models are paralyzingly slow to run

Neural data only provides weak constraints

➡ Lots of parameters – hard to explore

Page 73: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Fukushima (1980)

Page 74: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

LeCun et al. (1989)

Page 75: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Riesenhuber & Poggio (1999)

Page 76: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Serre & Poggio (2007)

Page 77: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

[Figure: three-layer model, input → L1 → L2 → L3 → read-out. Parameters per layer: number of filters, kernel size, normalization neighborhood, norm strength, thresh/sat. Learning parameters per layer: Rate, Trace, “Temp. Adv.”, “Auto-reset”, ...]

Page 78: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

[Figure: three-layer model (L1, L2, L3) with per-layer parameters: number of filters, kernel size, normalization neighborhood, norm strength, thresh/sat. Learning parameters: Rate, Trace, “Temp. Adv.”, “Auto-reset”, ...]

Page 79: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

The brain is a massively parallel computer

➡ Big models are paralyzingly slow to run

Neural data only provides weak constraints

➡ Lots of parameters – hard to explore

How to optimize?

Two conflicting requirements

FAST

FLEXIBLE

Page 80: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

What’s the bottleneck?

Page 81: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

3D Filterbank Convolutions!
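As a reference point for the bottleneck named here, the following is a minimal pure-Python sketch of a 3D filterbank operation. The function name, the `image[row][col][depth]` and `filters[n][row][col][depth]` layouts, and the "valid" output region are assumptions for illustration; the filters are applied without flipping (strictly a correlation), which is how the term "convolution" is commonly used in this setting.

```python
def filterbank_correlate(image, filters):
    """Apply each 3D filter at every valid position of a 3D input.

    image:   nested lists, shape [H][W][D]
    filters: nested lists, shape [N][FH][FW][D]
    returns: nested lists, shape [H-FH+1][W-FW+1][N]
    """
    H, W, D = len(image), len(image[0]), len(image[0][0])
    N, FH, FW = len(filters), len(filters[0]), len(filters[0][0])
    OH, OW = H - FH + 1, W - FW + 1
    out = [[[0.0] * N for _ in range(OW)] for _ in range(OH)]
    for y in range(OH):
        for x in range(OW):
            for n in range(N):
                acc = 0.0
                # Dot product between one filter and one input patch.
                for j in range(FH):
                    for i in range(FW):
                        for d in range(D):
                            acc += image[y + j][x + i][d] * filters[n][j][i][d]
                out[y][x][n] = acc
    return out


# Tiny example: a 3x3x1 input of ones and a single 2x2x1 filter of ones.
ones_image = [[[1.0] for _ in range(3)] for _ in range(3)]
ones_filter = [[[[1.0] for _ in range(2)] for _ in range(2)]]
result = filterbank_correlate(ones_image, ones_filter)
```

The six nested loops make the cost explicit: every output value is an independent dot product, which is what makes the operation both expensive and easy to parallelize.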

Page 82: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Fast vs Flexible: what can you do?

MATLAB/CUDA by Jim Mutch (2010)

- Make your code accessible

- No focus on raw performance

Examples:

by John Moore (1995)

Page 83: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Fast vs Flexible: what can you do?

- Use standard libraries (e.g. CUBLAS, CUFFT, Jacket)

- But: “remap” problem to fit?

- Memory issues (not always optimal)

Page 84: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Fast vs Flexible: what can you do?

- Fully optimized, by hand

- But for only a few input configurations...

Page 85: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Fast vs Flexible: what can you do?

- Focus on flexibility/accessibility first

- But add strong foundations for raw performance from the beginning

Example:

http://deeplearning.net by James Bergstra & Yoshua Bengio (2010)

Python/C/CUDA

(OpenCL*)

Page 86: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Our answer?

Page 87: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Meta-programming and Auto-tuning

Page 88: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

What?

Page 89: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Meta-programming!

Leave the grunt-programming to the computer (i.e. auto-tuning like ATLAS or FFTW)

• Dynamically compile specialized versions of the same kernel for different conditions

• Empirical run-time tuning

• For free: smooth syntactic ugliness: unroll loops, index un-indexable registers, etc.
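The idea of compiling specialized versions of one kernel can be sketched with the Python standard library alone. Everything named here (`KERNEL_TEMPLATE`, `specialized_source`, the placeholder kernel body) is hypothetical; a real system would hand the returned string to a run-time compiler such as PyCUDA's `SourceModule` and cache the compiled module instead of the source text.

```python
from functools import lru_cache
from string import Template

# Hypothetical kernel skeleton: problem sizes become compile-time constants.
KERNEL_TEMPLATE = Template("""\
#define BLOCK_W $block_w
#define FILTER_W $filter_w
__global__ void example_kernel(float *data)
{
    // Tile size is fixed when the source is generated, not at run time.
    __shared__ float tile[BLOCK_W + FILTER_W - 1];
}
""")


@lru_cache(maxsize=None)
def specialized_source(block_w, filter_w):
    """Return CUDA source specialized for one (block_w, filter_w) setting."""
    return KERNEL_TEMPLATE.substitute(block_w=block_w, filter_w=filter_w)


source_a = specialized_source(128, 4)
source_b = specialized_source(64, 8)
```

Caching by parameter tuple means each distinct condition is generated only once, which matters when generation is followed by a slow compile step.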

Page 90: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Meta-programming!

“Instrument” your solutions:

• Block size

• Work size

• Loop unrolling

• Pre-fetching

• Spilling

• etc.
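One way to read "instrumenting" a solution is to expose each of these knobs as a parameter and enumerate the combinations. The sketch below does that in plain Python; the parameter names, the candidate values, and the validity rule (the unroll factor must divide the filter width) are all assumptions chosen for illustration.

```python
from itertools import product


def search_space(filter_w=4):
    """Yield candidate kernel configurations as dictionaries."""
    block_sizes = [32, 64, 128, 256]
    unroll_factors = [1, 2, 3, 4]
    prefetch_options = [False, True]
    for block_w, unroll, prefetch in product(block_sizes, unroll_factors,
                                             prefetch_options):
        # Hypothetical constraint: skip unroll factors that leave a remainder.
        if filter_w % unroll != 0:
            continue
        yield {"block_w": block_w, "unroll": unroll, "prefetch": prefetch}


configs = list(search_space())
```

Each dictionary can then be fed to a code generator, so that every point in the space corresponds to one concrete kernel variant.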

Page 91: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Meta-programming!

Let the computer generate and find the optimal code:

• brute-force search with a global objective

• machine-learning approach with local objectives and hidden variables (advanced)

• e.g. PyCuda makes this easy:
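The brute-force option can be sketched as an exhaustive loop over candidates with a single cost to minimize. This is a generic illustration, not the PyCUDA code from the slide: `autotune` and `fake_measure` are hypothetical names, and the cost function is a stand-in. In a real tuner `measure` would compile the variant and time it on the GPU.

```python
def autotune(candidates, measure, repeats=3):
    """Return (best_candidate, best_cost) found by exhaustive search."""
    best, best_cost = None, float("inf")
    for cand in candidates:
        # Keep the minimum over a few repeats to damp measurement noise.
        cost = min(measure(cand) for _ in range(repeats))
        if cost < best_cost:
            best, best_cost = cand, cost
    return best, best_cost


def fake_measure(config):
    # Stand-in cost model (assumption): 128-wide blocks with unroll 4 are ideal.
    return abs(config["block_w"] - 128) + (0 if config["unroll"] == 4 else 1)


candidates = [{"block_w": b, "unroll": u}
              for b in (32, 64, 128, 256) for u in (1, 2, 4)]
best, best_cost = autotune(candidates, fake_measure)
```

Passing the measurement function in as an argument keeps the search logic independent of how a variant is built and timed.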

Page 92: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Basic GPU Meta-programming System

Pinto N, Cox DD. GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision. [GPU Computing Gems]

Page 93: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];

#define IMUL(a, b) __mul24(a, b)
extern "C" {

#for j in xrange($FILTER_H)

  __global__ void convolve_beta_j${j}(float4 *input, float4 *output) {

#set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
    __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

    // -- input/output offsets
    const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    float4 input_v4;

    // -- load input to shared memory
#for i in xrange($LOAD_ITERATIONS)
#if $i==($LOAD_ITERATIONS-1)
    if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W)
#end if
    {
      input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
      shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
      shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
      shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
      shared_in[threadIdx.x+$BLOCK_W*$i][3] = input_v4.w;
    }
#end for

Cheetah
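To make the template's control flow concrete, here is a plain-Python sketch (not Cheetah itself) that mirrors the `#for` and `#if` directives of the shared-memory load loop. The function name `render_load_loop` is hypothetical, and the formula for the number of load iterations is an assumption: the template receives `$LOAD_ITERATIONS` from outside and the slide does not show how it is computed.

```python
def render_load_loop(block_w, filter_w):
    """Emit the unrolled shared-memory load section as CUDA source text."""
    input_block_w = block_w + filter_w - 1  # same as '#set INPUT_BLOCK_W'
    # Assumption: enough passes for block_w threads to cover the whole tile.
    load_iterations = -(-input_block_w // block_w)  # ceiling division
    lines = []
    for i in range(load_iterations):
        # Only the last pass can run past the tile, so only it gets a guard.
        if i == load_iterations - 1:
            lines.append("if((threadIdx.x+%d*%d)<%d)" % (block_w, i, input_block_w))
        lines.append("{")
        lines.append("  input_v4 = tex1Dfetch(tex_float4, in_idx+%d*%d);" % (block_w, i))
        for c, field in enumerate("xyzw"):
            lines.append("  shared_in[threadIdx.x+%d*%d][%d] = input_v4.%s;"
                         % (block_w, i, c, field))
        lines.append("}")
    return "\n".join(lines)


load_code = render_load_loop(block_w=128, filter_w=4)
```

The bounds check is decided while generating the source, so the first pass carries no branch at all in the emitted CUDA code.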

Page 94: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

#include <stdio.h>

texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[4][4][4];

#define IMUL(a, b) __mul24(a, b)
extern "C" {

__global__ void convolve_beta_j0(float4 *input, float4 *output) {

    __shared__ float shared_in[131][4+1];

    // -- input/output offsets
    const uint in_idx = (blockIdx.y+0)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    float4 input_v4;

    // -- load input to shared memory
    {
        input_v4 = tex1Dfetch(tex_float4, in_idx+128*0);
        shared_in[threadIdx.x+128*0][0] = input_v4.x;
        shared_in[threadIdx.x+128*0][1] = input_v4.y;
        shared_in[threadIdx.x+128*0][2] = input_v4.z;
        shared_in[threadIdx.x+128*0][3] = input_v4.w;
    }
    if((threadIdx.x+128*1)<131)
    {
        input_v4 = tex1Dfetch(tex_float4, in_idx+128*1);
        shared_in[threadIdx.x+128*1][0] = input_v4.x;
        shared_in[threadIdx.x+128*1][1] = input_v4.y;
        shared_in[threadIdx.x+128*1][2] = input_v4.z;
        shared_in[threadIdx.x+128*1][3] = input_v4.w;
    }
    __syncthreads();

    // -- compute dot products
    float v, w;

    float sum0 = 0; float sum1 = 0; float sum2 = 0; float sum3 = 0;
    v = shared_in[threadIdx.x+0][0]; w = constant[0][0][0]; sum0 += v*w; w = constant[0][0][1]; sum1 += v*w; w = constant[0][0][2]; sum2 += v*w; w = constant[0][0][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+1][0]; w = constant[0][1][0]; sum0 += v*w; w = constant[0][1][1]; sum1 += v*w; w = constant[0][1][2]; sum2 += v*w; w = constant[0][1][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+2][0]; w = constant[0][2][0]; sum0 += v*w; w = constant[0][2][1]; sum1 += v*w; w = constant[0][2][2]; sum2 += v*w; w = constant[0][2][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+3][0]; w = constant[0][3][0]; sum0 += v*w; w = constant[0][3][1]; sum1 += v*w; w = constant[0][3][2]; sum2 += v*w; w = constant[0][3][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+0][1]; w = constant[1][0][0]; sum0 += v*w; w = constant[1][0][1]; sum1 += v*w; w = constant[1][0][2]; sum2 += v*w; w = constant[1][0][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+1][1]; w = constant[1][1][0]; sum0 += v*w; w = constant[1][1][1]; sum1 += v*w; w = constant[1][1][2]; sum2 += v*w; w = constant[1][1][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+2][1]; w = constant[1][2][0]; sum0 += v*w; w = constant[1][2][1]; sum1 += v*w; w = constant[1][2][2]; sum2 += v*w; w = constant[1][2][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+3][1]; w = constant[1][3][0]; sum0 += v*w; w = constant[1][3][1]; sum1 += v*w; w = constant[1][3][2]; sum2 += v*w; w = constant[1][3][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+0][2]; w = constant[2][0][0]; sum0 += v*w; w = constant[2][0][1]; sum1 += v*w; w = constant[2][0][2]; sum2 += v*w; w = constant[2][0][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+1][2]; w = constant[2][1][0]; sum0 += v*w; w = constant[2][1][1]; sum1 += v*w; w = constant[2][1][2]; sum2 += v*w; w = constant[2][1][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+2][2]; w = constant[2][2][0]; sum0 += v*w; w = constant[2][2][1]; sum1 += v*w; w = constant[2][2][2]; sum2 += v*w; w = constant[2][2][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+3][2]; w = constant[2][3][0]; sum0 += v*w; w = constant[2][3][1]; sum1 += v*w; w = constant[2][3][2]; sum2 += v*w; w = constant[2][3][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+0][3]; w = constant[3][0][0]; sum0 += v*w; w = constant[3][0][1]; sum1 += v*w; w = constant[3][0][2]; sum2 += v*w; w = constant[3][0][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+1][3]; w = constant[3][1][0]; sum0 += v*w; w = constant[3][1][1]; sum1 += v*w; w = constant[3][1][2]; sum2 += v*w; w = constant[3][1][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+2][3]; w = constant[3][2][0]; sum0 += v*w; w = constant[3][2][1]; sum1 += v*w; w = constant[3][2][2]; sum2 += v*w; w = constant[3][2][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+3][3]; w = constant[3][3][0]; sum0 += v*w; w = constant[3][3][1]; sum1 += v*w; w = constant[3][3][2]; sum2 += v*w; w = constant[3][3][3]; sum3 += v*w;

    // -- store output
    output[out_idx].x += sum0;
    output[out_idx].y += sum1;
    output[out_idx].z += sum2;
    output[out_idx].w += sum3;
}

__global__ void convolve_beta_j1(float4 *input, float4 *output) {

    __shared__ float shared_in[131][4+1];

    // -- input/output offsets
    const uint in_idx = (blockIdx.y+1)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    float4 input_v4;

    // -- load input to shared memory
    {
        input_v4 = tex1Dfetch(tex_float4, in_idx+128*0);
        shared_in[threadIdx.x+128*0][0] = input_v4.x;
        shared_in[threadIdx.x+128*0][1] = input_v4.y;
        shared_in[threadIdx.x+128*0][2] = input_v4.z;
        shared_in[threadIdx.x+128*0][3] = input_v4.w;
    }
    if((threadIdx.x+128*1)<131)
    {
        input_v4 = tex1Dfetch(tex_float4, in_idx+128*1);
        shared_in[threadIdx.x+128*1][0] = input_v4.x;
        shared_in[threadIdx.x+128*1][1] = input_v4.y;
        shared_in[threadIdx.x+128*1][2] = input_v4.z;
        shared_in[threadIdx.x+128*1][3] = input_v4.w;
    }
    __syncthreads();

    // -- compute dot products
    float v, w;

    float sum0 = 0; float sum1 = 0; float sum2 = 0; float sum3 = 0;
    v = shared_in[threadIdx.x+0][0]; w = constant[0][0][0]; sum0 += v*w; w = constant[0][0][1]; sum1 += v*w; w = constant[0][0][2]; sum2 += v*w; w = constant[0][0][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+1][0]; w = constant[0][1][0]; sum0 += v*w; w = constant[0][1][1]; sum1 += v*w; w = constant[0][1][2]; sum2 += v*w; w = constant[0][1][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+2][0]; w = constant[0][2][0]; sum0 += v*w; w = constant[0][2][1]; sum1 += v*w; w = constant[0][2][2]; sum2 += v*w; w = constant[0][2][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+3][0]; w = constant[0][3][0]; sum0 += v*w; w = constant[0][3][1]; sum1 += v*w; w = constant[0][3][2]; sum2 += v*w; w = constant[0][3][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+0][1]; w = constant[1][0][0]; sum0 += v*w; w = constant[1][0][1]; sum1 += v*w; w = constant[1][0][2]; sum2 += v*w; w = constant[1][0][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+1][1]; w = constant[1][1][0]; sum0 += v*w; w = constant[1][1][1]; sum1 += v*w; w = constant[1][1][2]; sum2 += v*w; w = constant[1][1][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+2][1]; w = constant[1][2][0]; sum0 += v*w; w = constant[1][2][1]; sum1 += v*w; w = constant[1][2][2]; sum2 += v*w; w = constant[1][2][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+3][1]; w = constant[1][3][0]; sum0 += v*w; w = constant[1][3][1]; sum1 += v*w; w = constant[1][3][2]; sum2 += v*w; w = constant[1][3][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+0][2]; w = constant[2][0][0]; sum0 += v*w; w = constant[2][0][1]; sum1 += v*w; w = constant[2][0][2]; sum2 += v*w; w = constant[2][0][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+1][2]; w = constant[2][1][0]; sum0 += v*w; w = constant[2][1][1]; sum1 += v*w; w = constant[2][1][2]; sum2 += v*w; w = constant[2][1][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+2][2]; w = constant[2][2][0]; sum0 += v*w; w = constant[2][2][1]; sum1 += v*w; w = constant[2][2][2]; sum2 += v*w; w = constant[2][2][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+3][2]; w = constant[2][3][0]; sum0 += v*w; w = constant[2][3][1]; sum1 += v*w; w = constant[2][3][2]; sum2 += v*w; w = constant[2][3][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+0][3]; w = constant[3][0][0]; sum0 += v*w; w = constant[3][0][1]; sum1 += v*w; w = constant[3][0][2]; sum2 += v*w; w = constant[3][0][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+1][3]; w = constant[3][1][0]; sum0 += v*w; w = constant[3][1][1]; sum1 += v*w; w = constant[3][1][2]; sum2 += v*w; w = constant[3][1][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+2][3]; w = constant[3][2][0]; sum0 += v*w; w = constant[3][2][1]; sum1 += v*w; w = constant[3][2][2]; sum2 += v*w; w = constant[3][2][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+3][3]; w = constant[3][3][0]; sum0 += v*w; w = constant[3][3][1]; sum1 += v*w; w = constant[3][3][2]; sum2 += v*w; w = constant[3][3][3]; sum3 += v*w;

    // -- store output
    output[out_idx].x += sum0;
    output[out_idx].y += sum1;
    output[out_idx].z += sum2;
    output[out_idx].w += sum3;
}

__global__ void convolve_beta_j2(float4 *input, float4 *output) {

    __shared__ float shared_in[131][4+1];

    // -- input/output offsets
    const uint in_idx = (blockIdx.y+2)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    float4 input_v4;

    // -- load input to shared memory
    {
        input_v4 = tex1Dfetch(tex_float4, in_idx+128*0);
        shared_in[threadIdx.x+128*0][0] = input_v4.x;
        shared_in[threadIdx.x+128*0][1] = input_v4.y;
        shared_in[threadIdx.x+128*0][2] = input_v4.z;
        shared_in[threadIdx.x+128*0][3] = input_v4.w;
    }
    if((threadIdx.x+128*1)<131)
    {
        input_v4 = tex1Dfetch(tex_float4, in_idx+128*1);
        shared_in[threadIdx.x+128*1][0] = input_v4.x;
        shared_in[threadIdx.x+128*1][1] = input_v4.y;
        shared_in[threadIdx.x+128*1][2] = input_v4.z;
        shared_in[threadIdx.x+128*1][3] = input_v4.w;
    }
    __syncthreads();

    // -- compute dot products
    float v, w;

    float sum0 = 0; float sum1 = 0; float sum2 = 0; float sum3 = 0;
    v = shared_in[threadIdx.x+0][0]; w = constant[0][0][0]; sum0 += v*w; w = constant[0][0][1]; sum1 += v*w; w = constant[0][0][2]; sum2 += v*w; w = constant[0][0][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+1][0]; w = constant[0][1][0]; sum0 += v*w; w = constant[0][1][1]; sum1 += v*w; w = constant[0][1][2]; sum2 += v*w; w = constant[0][1][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+2][0]; w = constant[0][2][0]; sum0 += v*w; w = constant[0][2][1]; sum1 += v*w; w = constant[0][2][2]; sum2 += v*w; w = constant[0][2][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+3][0]; w = constant[0][3][0]; sum0 += v*w; w = constant[0][3][1]; sum1 += v*w; w = constant[0][3][2]; sum2 += v*w; w = constant[0][3][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+0][1]; w = constant[1][0][0]; sum0 += v*w; w = constant[1][0][1]; sum1 += v*w; w = constant[1][0][2]; sum2 += v*w; w = constant[1][0][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+1][1]; w = constant[1][1][0]; sum0 += v*w; w = constant[1][1][1]; sum1 += v*w; w = constant[1][1][2]; sum2 += v*w; w = constant[1][1][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+2][1]; w = constant[1][2][0]; sum0 += v*w; w = constant[1][2][1]; sum1 += v*w; w = constant[1][2][2]; sum2 += v*w; w = constant[1][2][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+3][1]; w = constant[1][3][0]; sum0 += v*w; w = constant[1][3][1]; sum1 += v*w; w = constant[1][3][2]; sum2 += v*w; w = constant[1][3][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+0][2]; w = constant[2][0][0]; sum0 += v*w; w = constant[2][0][1]; sum1 += v*w; w = constant[2][0][2]; sum2 += v*w; w = constant[2][0][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+1][2]; w = constant[2][1][0]; sum0 += v*w; w = constant[2][1][1]; sum1 += v*w; w = constant[2][1][2]; sum2 += v*w; w = constant[2][1][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+2][2]; w = constant[2][2][0]; sum0 += v*w; w = constant[2][2][1]; sum1 += v*w; w = constant[2][2][2]; sum2 += v*w; w = constant[2][2][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+3][2]; w = constant[2][3][0]; sum0 += v*w; w = constant[2][3][1]; sum1 += v*w; w = constant[2][3][2]; sum2 += v*w; w = constant[2][3][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+0][3]; w = constant[3][0][0]; sum0 += v*w; w = constant[3][0][1]; sum1 += v*w; w = constant[3][0][2]; sum2 += v*w; w = constant[3][0][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+1][3]; w = constant[3][1][0]; sum0 += v*w; w = constant[3][1][1]; sum1 += v*w; w = constant[3][1][2]; sum2 += v*w; w = constant[3][1][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+2][3]; w = constant[3][2][0]; sum0 += v*w; w = constant[3][2][1]; sum1 += v*w; w = constant[3][2][2]; sum2 += v*w; w = constant[3][2][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+3][3]; w = constant[3][3][0]; sum0 += v*w; w = constant[3][3][1]; sum1 += v*w; w = constant[3][3][2]; sum2 += v*w; w = constant[3][3][3]; sum3 += v*w;

    // -- store output
    output[out_idx].x += sum0;
    output[out_idx].y += sum1;
    output[out_idx].z += sum2;
    output[out_idx].w += sum3;
}

__global__ void convolve_beta_j3(float4 *input, float4 *output) {

    __shared__ float shared_in[131][4+1];

    // -- input/output offsets
    const uint in_idx = (blockIdx.y+3)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    float4 input_v4;

    // -- load input to shared memory
    {
        input_v4 = tex1Dfetch(tex_float4, in_idx+128*0);
        shared_in[threadIdx.x+128*0][0] = input_v4.x;
        shared_in[threadIdx.x+128*0][1] = input_v4.y;
        shared_in[threadIdx.x+128*0][2] = input_v4.z;
        shared_in[threadIdx.x+128*0][3] = input_v4.w;
    }
    if((threadIdx.x+128*1)<131)
    {
        input_v4 = tex1Dfetch(tex_float4, in_idx+128*1);
        shared_in[threadIdx.x+128*1][0] = input_v4.x;
        shared_in[threadIdx.x+128*1][1] = input_v4.y;
        shared_in[threadIdx.x+128*1][2] = input_v4.z;
        shared_in[threadIdx.x+128*1][3] = input_v4.w;
    }
    __syncthreads();

    // -- compute dot products
    float v, w;

    float sum0 = 0; float sum1 = 0; float sum2 = 0; float sum3 = 0;
    v = shared_in[threadIdx.x+0][0]; w = constant[0][0][0]; sum0 += v*w; w = constant[0][0][1]; sum1 += v*w; w = constant[0][0][2]; sum2 += v*w; w = constant[0][0][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+1][0]; w = constant[0][1][0]; sum0 += v*w; w = constant[0][1][1]; sum1 += v*w; w = constant[0][1][2]; sum2 += v*w; w = constant[0][1][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+2][0]; w = constant[0][2][0]; sum0 += v*w; w = constant[0][2][1]; sum1 += v*w; w = constant[0][2][2]; sum2 += v*w; w = constant[0][2][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+3][0]; w = constant[0][3][0]; sum0 += v*w; w = constant[0][3][1]; sum1 += v*w; w = constant[0][3][2]; sum2 += v*w; w = constant[0][3][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+0][1]; w = constant[1][0][0]; sum0 += v*w; w = constant[1][0][1]; sum1 += v*w; w = constant[1][0][2]; sum2 += v*w; w = constant[1][0][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+1][1]; w = constant[1][1][0]; sum0 += v*w; w = constant[1][1][1]; sum1 += v*w; w = constant[1][1][2]; sum2 += v*w; w = constant[1][1][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+2][1]; w = constant[1][2][0]; sum0 += v*w; w = constant[1][2][1]; sum1 += v*w; w = constant[1][2][2]; sum2 += v*w; w = constant[1][2][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+3][1]; w = constant[1][3][0]; sum0 += v*w; w = constant[1][3][1]; sum1 += v*w; w = constant[1][3][2]; sum2 += v*w; w = constant[1][3][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+0][2]; w = constant[2][0][0]; sum0 += v*w; w = constant[2][0][1]; sum1 += v*w; w = constant[2][0][2]; sum2 += v*w; w = constant[2][0][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+1][2]; w = constant[2][1][0]; sum0 += v*w; w = constant[2][1][1]; sum1 += v*w; w = constant[2][1][2]; sum2 += v*w; w = constant[2][1][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+2][2]; w = constant[2][2][0]; sum0 += v*w; w = constant[2][2][1]; sum1 += v*w; w = constant[2][2][2]; sum2 += v*w; w = constant[2][2][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+3][2]; w = constant[2][3][0]; sum0 += v*w; w = constant[2][3][1]; sum1 += v*w; w = constant[2][3][2]; sum2 += v*w; w = constant[2][3][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+0][3]; w = constant[3][0][0]; sum0 += v*w; w = constant[3][0][1]; sum1 += v*w; w = constant[3][0][2]; sum2 += v*w; w = constant[3][0][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+1][3]; w = constant[3][1][0]; sum0 += v*w; w = constant[3][1][1]; sum1 += v*w; w = constant[3][1][2]; sum2 += v*w; w = constant[3][1][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+2][3]; w = constant[3][2][0]; sum0 += v*w; w = constant[3][2][1]; sum1 += v*w; w = constant[3][2][2]; sum2 += v*w; w = constant[3][2][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+3][3]; w = constant[3][3][0]; sum0 += v*w; w = constant[3][3][1]; sum1 += v*w; w = constant[3][3][2]; sum2 += v*w; w = constant[3][3][3]; sum3 += v*w;

    // -- store output
    output[out_idx].x += sum0;
    output[out_idx].y += sum1;
    output[out_idx].z += sum2;
    output[out_idx].w += sum3;
}

}

conv_kernel_template.cu (template) and conv_kernel_4x4x4.cu (generated)

Page 95: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

conv_kernel_template.cu

conv_kernel_4x4x4.cu (20 kB)

conv_kernel_8x8x4.cu (64 kB)

Page 96: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Benefits?

Page 97: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Smooth syntactic ugliness

Page 98: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Smooth syntactic ugliness

Manipulations that are not easily accessible in CUDA C code:

• variable-length argument lists
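As a small illustration of variable-length argument lists, the sketch below generates a kernel whose number of input pointers is chosen at generation time. The function name `sum_kernel_source` and the element-wise sum it emits are hypothetical examples, not code from the lecture.

```python
def sum_kernel_source(n_inputs, name="sum_arrays"):
    """Generate a CUDA kernel that adds n_inputs arrays element-wise."""
    # One pointer argument per input array, then the output and the length.
    args = ["const float *in%d" % k for k in range(n_inputs)]
    args += ["float *out", "int n"]
    terms = " + ".join("in%d[i]" % k for k in range(n_inputs))
    return "\n".join([
        "__global__ void %s(%s)" % (name, ", ".join(args)),
        "{",
        "    int i = blockIdx.x*blockDim.x + threadIdx.x;",
        "    if (i < n) out[i] = %s;" % terms,
        "}",
    ])


three_input_kernel = sum_kernel_source(3)
```

Each generated kernel has an ordinary fixed signature, so the variability lives entirely in the generator.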

Page 99: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Smooth syntactic ugliness

Manipulations that are not easily accessible in CUDA C code:

• syntax-level code control (e.g. conditionals)

Page 100: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Smooth syntactic ugliness

Manipulations that are not easily accessible in CUDA C code:

• loop unrolling (possibly fine-controlled)
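Loop unrolling by code generation can be sketched in a few lines of Python. The generator below emits straight-line statements in the same style as the generated convolution kernels in this lecture (`v`, `w`, `sumN`, `shared_in`, `constant`); the function name `unrolled_dot_products` is an assumption, and the original template section that produces these statements is not shown on the slides.

```python
def unrolled_dot_products(filter_d, filter_w, n_filters):
    """Return fully unrolled multiply-accumulate statements as a list of lines."""
    lines = ["float sum%d = 0;" % n for n in range(n_filters)]
    for d in range(filter_d):          # depth index of the input patch
        for i in range(filter_w):      # horizontal offset within the filter
            # Load one input value, then reuse it for every filter.
            lines.append("v = shared_in[threadIdx.x+%d][%d];" % (i, d))
            for n in range(n_filters):
                lines.append("w = constant[%d][%d][%d];" % (d, i, n))
                lines.append("sum%d += v*w;" % n)
    return lines


statements = unrolled_dot_products(filter_d=4, filter_w=4, n_filters=4)
```

Because all indices are literals in the emitted code, the accumulators can live in registers and no loop counters or index arithmetic remain at run time.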

Page 101: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Smooth syntactic ugliness

Manipulations that are not easily accessible in CUDA C code:

• fine-controlled loop unrolling

(...)
    v = shared_in[threadIdx.x+0][0]; w = constant[0][0][0]; sum0 += v*w; w = constant[0][0][1]; sum1 += v*w; w = constant[0][0][2]; sum2 += v*w; w = constant[0][0][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+1][0]; w = constant[0][1][0]; sum0 += v*w; w = constant[0][1][1]; sum1 += v*w; w = constant[0][1][2]; sum2 += v*w; w = constant[0][1][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+2][0]; w = constant[0][2][0]; sum0 += v*w; w = constant[0][2][1]; sum1 += v*w; w = constant[0][2][2]; sum2 += v*w; w = constant[0][2][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+3][0]; w = constant[0][3][0]; sum0 += v*w; w = constant[0][3][1]; sum1 += v*w; w = constant[0][3][2]; sum2 += v*w; w = constant[0][3][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+0][1]; w = constant[1][0][0]; sum0 += v*w; w = constant[1][0][1]; sum1 += v*w; w = constant[1][0][2]; sum2 += v*w; w = constant[1][0][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+1][1]; w = constant[1][1][0]; sum0 += v*w; w = constant[1][1][1]; sum1 += v*w; w = constant[1][1][2]; sum2 += v*w; w = constant[1][1][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+2][1]; w = constant[1][2][0]; sum0 += v*w; w = constant[1][2][1]; sum1 += v*w; w = constant[1][2][2]; sum2 += v*w; w = constant[1][2][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+3][1]; w = constant[1][3][0]; sum0 += v*w; w = constant[1][3][1]; sum1 += v*w; w = constant[1][3][2]; sum2 += v*w; w = constant[1][3][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+0][2]; w = constant[2][0][0]; sum0 += v*w; w = constant[2][0][1]; sum1 += v*w; w = constant[2][0][2]; sum2 += v*w; w = constant[2][0][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+1][2]; w = constant[2][1][0]; sum0 += v*w; w = constant[2][1][1]; sum1 += v*w; w = constant[2][1][2]; sum2 += v*w; w = constant[2][1][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+2][2]; w = constant[2][2][0]; sum0 += v*w; w = constant[2][2][1]; sum1 += v*w; w = constant[2][2][2]; sum2 += v*w; w = constant[2][2][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+3][2]; w = constant[2][3][0]; sum0 += v*w; w = constant[2][3][1]; sum1 += v*w; w = constant[2][3][2]; sum2 += v*w; w = constant[2][3][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+0][3]; w = constant[3][0][0]; sum0 += v*w; w = constant[3][0][1]; sum1 += v*w; w = constant[3][0][2]; sum2 += v*w; w = constant[3][0][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+1][3]; w = constant[3][1][0]; sum0 += v*w; w = constant[3][1][1]; sum1 += v*w; w = constant[3][1][2]; sum2 += v*w; w = constant[3][1][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+2][3]; w = constant[3][2][0]; sum0 += v*w; w = constant[3][2][1]; sum1 += v*w; w = constant[3][2][2]; sum2 += v*w; w = constant[3][2][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+3][3]; w = constant[3][3][0]; sum0 += v*w; w = constant[3][3][1]; sum1 += v*w; w = constant[3][3][2]; sum2 += v*w; w = constant[3][3][3]; sum3 += v*w;

    // -- store output
    output[out_idx].x += sum0;
    output[out_idx].y += sum1;
    output[out_idx].z += sum2;
    output[out_idx].w += sum3;
}

__global__ void convolve_beta_j1(float4 *input, float4 *output) {

    __shared__ float shared_in[131][4+1];

    // -- input/output offsets
    const uint in_idx = (blockIdx.y+1)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    float4 input_v4;

    // -- load input to shared memory
    {
        input_v4 = tex1Dfetch(tex_float4, in_idx+128*0);
        shared_in[threadIdx.x+128*0][0] = input_v4.x;
        shared_in[threadIdx.x+128*0][1] = input_v4.y;
        shared_in[threadIdx.x+128*0][2] = input_v4.z;
        shared_in[threadIdx.x+128*0][3] = input_v4.w;
    }
    if((threadIdx.x+128*1)<131)
    {
        input_v4 = tex1Dfetch(tex_float4, in_idx+128*1);
        shared_in[threadIdx.x+128*1][0] = input_v4.x;
        shared_in[threadIdx.x+128*1][1] = input_v4.y;
        shared_in[threadIdx.x+128*1][2] = input_v4.z;
        shared_in[threadIdx.x+128*1][3] = input_v4.w;
    }
    __syncthreads();

    // -- compute dot products
    float v, w;

    float sum0 = 0; float sum1 = 0; float sum2 = 0; float sum3 = 0;
    v = shared_in[threadIdx.x+0][0]; w = constant[0][0][0]; sum0 += v*w; w = constant[0][0][1]; sum1 += v*w; w = constant[0][0][2]; sum2 += v*w; w = constant[0][0][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+1][0]; w = constant[0][1][0]; sum0 += v*w; w = constant[0][1][1]; sum1 += v*w; w = constant[0][1][2]; sum2 += v*w; w = constant[0][1][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+2][0]; w = constant[0][2][0]; sum0 += v*w; w = constant[0][2][1]; sum1 += v*w; w = constant[0][2][2]; sum2 += v*w; w = constant[0][2][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+3][0]; w = constant[0][3][0]; sum0 += v*w; w = constant[0][3][1]; sum1 += v*w; w = constant[0][3][2]; sum2 += v*w; w = constant[0][3][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+0][1]; w = constant[1][0][0]; sum0 += v*w; w = constant[1][0][1]; sum1 += v*w; w = constant[1][0][2]; sum2 += v*w; w = constant[1][0][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+1][1]; w = constant[1][1][0]; sum0 += v*w; w = constant[1][1][1]; sum1 += v*w; w = constant[1][1][2]; sum2 += v*w; w = constant[1][1][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+2][1]; w = constant[1][2][0]; sum0 += v*w; w = constant[1][2][1]; sum1 += v*w; w = constant[1][2][2]; sum2 += v*w; w = constant[1][2][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+3][1]; w = constant[1][3][0]; sum0 += v*w; w = constant[1][3][1]; sum1 += v*w; w = constant[1][3][2]; sum2 += v*w; w = constant[1][3][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+0][2]; w = constant[2][0][0]; sum0 += v*w; w = constant[2][0][1]; sum1 += v*w; w = constant[2][0][2]; sum2 += v*w; w = constant[2][0][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+1][2]; w = constant[2][1][0]; sum0 += v*w; w = constant[2][1][1]; sum1 += v*w; w = constant[2][1][2]; sum2 += v*w; w = constant[2][1][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+2][2]; w = constant[2][2][0]; sum0 += v*w; w = constant[2][2][1]; sum1 += v*w; w = constant[2][2][2]; sum2 += v*w; w = constant[2][2][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+3][2]; w = constant[2][3][0]; sum0 += v*w; w = constant[2][3][1]; sum1 += v*w; w = constant[2][3][2]; sum2 += v*w; w = constant[2][3][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+0][3]; w = constant[3][0][0]; sum0 += v*w; w = constant[3][0][1]; sum1 += v*w; w = constant[3][0][2]; sum2 += v*w; w = constant[3][0][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+1][3]; w = constant[3][1][0]; sum0 += v*w; w = constant[3][1][1]; sum1 += v*w; w = constant[3][1][2]; sum2 += v*w; w = constant[3][1][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+2][3]; w = constant[3][2][0]; sum0 += v*w; w = constant[3][2][1]; sum1 += v*w; w = constant[3][2][2]; sum2 += v*w; w = constant[3][2][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+3][3]; w = constant[3][3][0]; sum0 += v*w; w = constant[3][3][1]; sum1 += v*w; w = constant[3][3][2]; sum2 += v*w; w = constant[3][3][3]; sum3 += v*w;

    // -- store output
    output[out_idx].x += sum0;
    output[out_idx].y += sum1;
    output[out_idx].z += sum2;
    output[out_idx].w += sum3;
}

__global__ void convolve_beta_j2(float4 *input, float4 *output) {

  __shared__ float shared_in[131][4+1];

  // -- input/output offsets
  const uint in_idx = (blockIdx.y+2)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
  const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
  float4 input_v4;

  // -- load input to shared memory
  {
    input_v4 = tex1Dfetch(tex_float4, in_idx+128*0);
    shared_in[threadIdx.x+128*0][0] = input_v4.x;
    shared_in[threadIdx.x+128*0][1] = input_v4.y;
    shared_in[threadIdx.x+128*0][2] = input_v4.z;
    shared_in[threadIdx.x+128*0][3] = input_v4.w;
  }
  if((threadIdx.x+128*1)<131)
  {
    input_v4 = tex1Dfetch(tex_float4, in_idx+128*1);
    shared_in[threadIdx.x+128*1][0] = input_v4.x;
    shared_in[threadIdx.x+128*1][1] = input_v4.y;
    shared_in[threadIdx.x+128*1][2] = input_v4.z;
    shared_in[threadIdx.x+128*1][3] = input_v4.w;
  }
  __syncthreads();

  // -- compute dot products
  float v, w;

  float sum0 = 0; float sum1 = 0; float sum2 = 0; float sum3 = 0;

  v = shared_in[threadIdx.x+0][0]; w = constant[0][0][0]; sum0 += v*w; w = constant[0][0][1]; sum1 += v*w; w = constant[0][0][2]; sum2 += v*w; w = constant[0][0][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+1][0]; w = constant[0][1][0]; sum0 += v*w; w = constant[0][1][1]; sum1 += v*w; w = constant[0][1][2]; sum2 += v*w; w = constant[0][1][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+2][0]; w = constant[0][2][0]; sum0 += v*w; w = constant[0][2][1]; sum1 += v*w; w = constant[0][2][2]; sum2 += v*w; w = constant[0][2][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+3][0]; w = constant[0][3][0]; sum0 += v*w; w = constant[0][3][1]; sum1 += v*w; w = constant[0][3][2]; sum2 += v*w; w = constant[0][3][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+0][1]; w = constant[1][0][0]; sum0 += v*w; w = constant[1][0][1]; sum1 += v*w; w = constant[1][0][2]; sum2 += v*w; w = constant[1][0][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+1][1]; w = constant[1][1][0]; sum0 += v*w; w = constant[1][1][1]; sum1 += v*w; w = constant[1][1][2]; sum2 += v*w; w = constant[1][1][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+2][1]; w = constant[1][2][0]; sum0 += v*w; w = constant[1][2][1]; sum1 += v*w; w = constant[1][2][2]; sum2 += v*w; w = constant[1][2][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+3][1]; w = constant[1][3][0]; sum0 += v*w; w = constant[1][3][1]; sum1 += v*w; w = constant[1][3][2]; sum2 += v*w; w = constant[1][3][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+0][2]; w = constant[2][0][0]; sum0 += v*w; w = constant[2][0][1]; sum1 += v*w; w = constant[2][0][2]; sum2 += v*w; w = constant[2][0][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+1][2]; w = constant[2][1][0]; sum0 += v*w; w = constant[2][1][1]; sum1 += v*w; w = constant[2][1][2]; sum2 += v*w; w = constant[2][1][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+2][2]; w = constant[2][2][0]; sum0 += v*w; w = constant[2][2][1]; sum1 += v*w; w = constant[2][2][2]; sum2 += v*w; w = constant[2][2][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+3][2]; w = constant[2][3][0]; sum0 += v*w; w = constant[2][3][1]; sum1 += v*w; w = constant[2][3][2]; sum2 += v*w; w = constant[2][3][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+0][3]; w = constant[3][0][0]; sum0 += v*w; w = constant[3][0][1]; sum1 += v*w; w = constant[3][0][2]; sum2 += v*w; w = constant[3][0][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+1][3]; w = constant[3][1][0]; sum0 += v*w; w = constant[3][1][1]; sum1 += v*w; w = constant[3][1][2]; sum2 += v*w; w = constant[3][1][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+2][3]; w = constant[3][2][0]; sum0 += v*w; w = constant[3][2][1]; sum1 += v*w; w = constant[3][2][2]; sum2 += v*w; w = constant[3][2][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+3][3]; w = constant[3][3][0]; sum0 += v*w; w = constant[3][3][1]; sum1 += v*w; w = constant[3][3][2]; sum2 += v*w; w = constant[3][3][3]; sum3 += v*w;

  // -- store output
  output[out_idx].x += sum0;
  output[out_idx].y += sum1;
  output[out_idx].z += sum2;
  output[out_idx].w += sum3;
}

__global__ void convolve_beta_j3(float4 *input, float4 *output) {

  __shared__ float shared_in[131][4+1];

  // -- input/output offsets
  const uint in_idx = (blockIdx.y+3)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
  const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
  float4 input_v4;

  // -- load input to shared memory
  {
    input_v4 = tex1Dfetch(tex_float4, in_idx+128*0);
    shared_in[threadIdx.x+128*0][0] = input_v4.x;
    shared_in[threadIdx.x+128*0][1] = input_v4.y;
    shared_in[threadIdx.x+128*0][2] = input_v4.z;
    shared_in[threadIdx.x+128*0][3] = input_v4.w;
  }
  if((threadIdx.x+128*1)<131)
  {
    input_v4 = tex1Dfetch(tex_float4, in_idx+128*1);
    shared_in[threadIdx.x+128*1][0] = input_v4.x;
    shared_in[threadIdx.x+128*1][1] = input_v4.y;
    shared_in[threadIdx.x+128*1][2] = input_v4.z;
    shared_in[threadIdx.x+128*1][3] = input_v4.w;
  }
  __syncthreads();

  // -- compute dot products
  float v, w;

  float sum0 = 0; float sum1 = 0; float sum2 = 0; float sum3 = 0;

  v = shared_in[threadIdx.x+0][0]; w = constant[0][0][0]; sum0 += v*w; w = constant[0][0][1]; sum1 += v*w; w = constant[0][0][2]; sum2 += v*w; w = constant[0][0][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+1][0]; w = constant[0][1][0]; sum0 += v*w; w = constant[0][1][1]; sum1 += v*w; w = constant[0][1][2]; sum2 += v*w; w = constant[0][1][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+2][0]; w = constant[0][2][0]; sum0 += v*w; w = constant[0][2][1]; sum1 += v*w; w = constant[0][2][2]; sum2 += v*w; w = constant[0][2][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+3][0]; w = constant[0][3][0]; sum0 += v*w; w = constant[0][3][1]; sum1 += v*w; w = constant[0][3][2]; sum2 += v*w; w = constant[0][3][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+0][1]; w = constant[1][0][0]; sum0 += v*w; w = constant[1][0][1]; sum1 += v*w; w = constant[1][0][2]; sum2 += v*w; w = constant[1][0][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+1][1]; w = constant[1][1][0]; sum0 += v*w; w = constant[1][1][1]; sum1 += v*w; w = constant[1][1][2]; sum2 += v*w; w = constant[1][1][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+2][1]; w = constant[1][2][0]; sum0 += v*w; w = constant[1][2][1]; sum1 += v*w; w = constant[1][2][2]; sum2 += v*w; w = constant[1][2][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+3][1]; w = constant[1][3][0]; sum0 += v*w; w = constant[1][3][1]; sum1 += v*w; w = constant[1][3][2]; sum2 += v*w; w = constant[1][3][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+0][2]; w = constant[2][0][0]; sum0 += v*w; w = constant[2][0][1]; sum1 += v*w; w = constant[2][0][2]; sum2 += v*w; w = constant[2][0][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+1][2]; w = constant[2][1][0]; sum0 += v*w; w = constant[2][1][1]; sum1 += v*w; w = constant[2][1][2]; sum2 += v*w; w = constant[2][1][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+2][2]; w = constant[2][2][0]; sum0 += v*w; w = constant[2][2][1]; sum1 += v*w; w = constant[2][2][2]; sum2 += v*w; w = constant[2][2][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+3][2]; w = constant[2][3][0]; sum0 += v*w; w = constant[2][3][1]; sum1 += v*w; w = constant[2][3][2]; sum2 += v*w; w = constant[2][3][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+0][3]; w = constant[3][0][0]; sum0 += v*w; w = constant[3][0][1]; sum1 += v*w; w = constant[3][0][2]; sum2 += v*w; w = constant[3][0][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+1][3]; w = constant[3][1][0]; sum0 += v*w; w = constant[3][1][1]; sum1 += v*w; w = constant[3][1][2]; sum2 += v*w; w = constant[3][1][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+2][3]; w = constant[3][2][0]; sum0 += v*w; w = constant[3][2][1]; sum1 += v*w; w = constant[3][2][2]; sum2 += v*w; w = constant[3][2][3]; sum3 += v*w;
  v = shared_in[threadIdx.x+3][3]; w = constant[3][3][0]; sum0 += v*w; w = constant[3][3][1]; sum1 += v*w; w = constant[3][3][2]; sum2 += v*w; w = constant[3][3][3]; sum3 += v*w;

  // -- store output
  output[out_idx].x += sum0;
  output[out_idx].y += sum1;
  output[out_idx].z += sum2;
  output[out_idx].w += sum3;
}

}

Page 102: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

How about #pragma unroll? (Why don't you trust the compiler?)

Page 103: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Using GPUs for Signal Correlation

Michael Clark, with Paul La Plante and Lincoln Greenhill

The Murchison Widefield Array

Daniel A. Mitchell



Thursday, 27 January 2011

IICS'2011

we are not alone....

Don’t trust compilers

• Compare these “identical” code fragments

a += b*c + d*c + e*f + g*h;

a += b*c;

a += d*c;

a += e*f;

a += g*h;

1020 GFLOPS

770 GFLOPS


Page 104: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Smooth syntactic ugliness

Manipulations that are not easily accessible in CUDA C code:

• index un-indexable resources (e.g. regs)
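Registers cannot be indexed at run time in CUDA C, so the usual workaround is to do the "indexing" when the code is generated. A minimal sketch of that idea in plain Python (the helper name `emit_accumulators` is hypothetical, and the `constant[0][0][n]` operand is only borrowed from the generated kernels shown earlier for illustration):

```python
def emit_accumulators(n_filters):
    """Return CUDA C lines that declare and update one scalar per filter.

    The loop index n only exists at generation time; the emitted code
    refers to plain named scalars (sum0, sum1, ...) that the compiler
    can keep in registers.
    """
    lines = []
    for n in range(n_filters):
        lines.append("float sum%d = 0;" % n)
    for n in range(n_filters):
        lines.append("w = constant[0][0][%d]; sum%d += v*w;" % (n, n))
    return lines


lines = emit_accumulators(4)
print("\n".join(lines))
```

Changing `n_filters` changes how many "registers" the generated kernel uses, without any hand editing of the CUDA source.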

Page 105: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Explore design decision space more freely

Page 106: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Basic GPU Meta-programming System

GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision

[GPU Computing Gems]

Pinto N, Cox DD

Page 107: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Exploring design decision space more freely

Meta-programming:

• enables efficient learning of the GPU hardware/software

• allows full exploitation of the GPU architecture

Page 108: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];

#define IMUL(a, b) __mul24(a, b)
extern "C" {

#for j in xrange($FILTER_H)

  __global__ void convolve_beta_j${j}(float4 *input, float4 *output) {

#set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
    __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

    // -- input/output offsets
    const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    float4 input_v4;

    // -- load input to shared memory
#for i in xrange($LOAD_ITERATIONS)
#if $i==($LOAD_ITERATIONS-1)
    if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W)
#end if
    {
      input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
      shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
      shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
      shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
      shared_in[threadIdx.x+$BLOCK_W*$i][3] = input_v4.w;
    }
#end for

conv_kernel_beta_template.cu

...
mad.rn.f32 $r4, s[$ofs3+0x0000], $r4, $r1
mov.b32 $r1, c0[$ofs2+0x0008]
mad.rn.f32 $r4, s[$ofs3+0x0008], $r1, $r4
mov.b32 $r1, c0[$ofs2+0x000c]
mad.rn.f32 $r4, s[$ofs3+0x000c], $r1, $r4
mov.b32 $r1, c0[$ofs2+0x0010]
mad.rn.f32 $r4, s[$ofs3+0x0010], $r1, $r4
...

version A

version B

...
mad.rn.f32 $r1, s[$ofs1+0x007c], c0[$ofs1+0x0078], $r1
mad.rn.f32 $r1, s[$ofs2+0x0000], c0[$ofs2+0x007c], $r1
mad.rn.f32 $r1, s[$ofs2+0x0008], c0[$ofs2+0x0080], $r1
mad.rn.f32 $r1, s[$ofs2+0x000c], c0[$ofs2+0x0084], $r1
mad.rn.f32 $r1, s[$ofs2+0x0010], c0[$ofs2+0x0088], $r1
...

2x faster... Why?

using decuda by Wladimir J. van der Laan
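To make the expansion of conv_kernel_beta_template.cu concrete, here is a rough sketch that mimics the template's two loops with plain Python string formatting instead of Cheetah. It is an illustration under assumptions, not the lecture's code: the function name `render_kernels` is made up, `LOAD_ITERATIONS` is assumed to be the ceiling of `INPUT_BLOCK_W / BLOCK_W`, and the dot-product and output-store parts are left out.

```python
def render_kernels(filter_h, filter_w, block_w):
    """Expand the shared-memory load part of the template, one kernel per row j."""
    input_block_w = block_w + filter_w - 1
    # Assumption: enough block-wide passes to cover the shared tile (ceiling division).
    load_iterations = (input_block_w + block_w - 1) // block_w
    out = ['extern "C" {']
    for j in range(filter_h):
        out.append("__global__ void convolve_beta_j%d(float4 *input, float4 *output) {" % j)
        out.append("  __shared__ float shared_in[%d][4+1];" % input_block_w)
        out.append("  const uint in_idx = (blockIdx.y+%d)*INPUT_W"
                   " + blockIdx.x*blockDim.x + threadIdx.x;" % j)
        out.append("  float4 input_v4;")
        for i in range(load_iterations):
            # Only the last pass can run past the end of the tile, so only it is guarded.
            if i == load_iterations - 1:
                out.append("  if((threadIdx.x+%d*%d)<%d)" % (block_w, i, input_block_w))
            out.append("  {")
            out.append("    input_v4 = tex1Dfetch(tex_float4, in_idx+%d*%d);" % (block_w, i))
            for c, field in enumerate("xyzw"):
                out.append("    shared_in[threadIdx.x+%d*%d][%d] = input_v4.%s;"
                           % (block_w, i, c, field))
            out.append("  }")
        out.append("  __syncthreads();")
        out.append("  // ... dot products and output store omitted in this sketch ...")
        out.append("}")
    out.append("}")
    return "\n".join(out)


source = render_kernels(filter_h=4, filter_w=4, block_w=128)
print(source[:300])
```

With `block_w=128` and `filter_w=4` this yields a 131-wide tile and two load passes per kernel, which matches the constants visible in the generated `convolve_beta_j*` kernels.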

Page 109: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Exploring design decision space more freely

Page 110: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Exploring design decision space more freely

When USE_THREAD_PER_FILTER is True

• each thread will access different cmem locations (in order)

using the decuda disassembler by Wladimir J. van der Laan (Python-based)

Page 111: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Exploring design decision space more freely

When USE_THREAD_PER_FILTER is False

• each thread will access the same cmem locations (broadcast)

using the decuda disassembler by Wladimir J. van der Laan (Python-based)
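A single template flag is enough to keep both constant-memory access patterns alive in one code base. The sketch below is purely illustrative: the helper `emit_weight_load` and the thread-dependent index expression are assumptions chosen to show the idea, not the index arithmetic used in the actual kernel.

```python
def emit_weight_load(d, i, n, use_thread_per_filter, n_filters):
    """Emit the CUDA C statement that loads one filter weight from cmem."""
    if use_thread_per_filter:
        # Hypothetical thread-dependent index: neighbouring threads read
        # different cmem addresses.
        return "w = constant[%d][%d][threadIdx.x %% %d];" % (d, i, n_filters)
    # Literal index: every thread reads the same address (broadcast).
    return "w = constant[%d][%d][%d];" % (d, i, n)


per_thread = emit_weight_load(0, 1, 2, use_thread_per_filter=True, n_filters=4)
broadcast = emit_weight_load(0, 1, 2, use_thread_per_filter=False, n_filters=4)
print(per_thread)
print(broadcast)
```

Generating both variants and disassembling them is what makes a comparison like version A versus version B cheap to repeat.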

Page 112: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Exploring design decision space more freely

2x faster... Why?

more registers vs. thread-dependent data movement

Page 113: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Strategy

• intermediate design decisions can be made explicit

• multiple “forks” in the path can be kept in place

• frees up the developer to revisit past choices (without incurring a combinatorial explosion of separate pieces of code)

• retesting sets of assumptions can be done frequently and programmatically from the “outer” framework of code
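One way to read the strategy above: every design decision becomes a named parameter, and the outer Python framework enumerates the combinations instead of the developer maintaining separate source files. A minimal sketch, where the parameter names and value lists in `DESIGN_SPACE` are illustrative assumptions:

```python
import itertools

# Hypothetical design decisions kept explicit as data.
DESIGN_SPACE = {
    "BLOCK_W": [64, 128, 256],
    "USE_THREAD_PER_FILTER": [False, True],
    "USE_TEXTURE_FETCH": [False, True],
}


def enumerate_variants(space):
    """Yield one dict per combination of design decisions."""
    names = sorted(space)
    for values in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, values))


variants = list(enumerate_variants(DESIGN_SPACE))
print(len(variants), "variants, first:", variants[0])
```

Each dict can be handed to the template engine as its namespace, so re-testing an old assumption is a matter of adding a value to a list.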

Page 114: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

http://wiki.tiker.net/PyCuda/Examples/DemoMetaMatrixmulCheetah

Matmul Toy Example

Page 115: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Summary

Meta-programming:

• can assist exploration and manual optimization

• can de-clutter code

• is easy and flexible with the right tools (e.g. Python, Py{CUDA,CL}, Cheetah, decuda)

➡ facilitates auto-tuning!

Page 116: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Need a pause?

Page 117: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

How to get to the ninja level?

Page 118: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Practice, practice, practice...

Page 119: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Auto-tuning

Page 120: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Basic GPU Meta-programming System

GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision

[GPU Computing Gems]

Pinto N, Cox DD

Page 121: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Auto-tuning

The goal is to empirically optimize execution time given:

• the environment

- hardware (GPU, CPU, Memory, Mobo)

- software (SDK, Compiler suite)

• the data (input dimensions, repetitions, etc.)
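"Empirically optimize" means: run every candidate on the real data in the real environment and keep the fastest. A minimal sketch of that loop, using two CPU functions as stand-ins for compiled kernel variants (the names `autotune`, `builtin_sum` and `python_loop` are made up for the example):

```python
import time


def autotune(candidates, data, repetitions=3):
    """Time each candidate on the given data; return (best_name, timings)."""
    timings = {}
    for name, func in candidates.items():
        best = float("inf")
        for _ in range(repetitions):
            start = time.perf_counter()
            func(data)
            # Keep the best of several runs to reduce timing noise.
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    best_name = min(timings, key=timings.get)
    return best_name, timings


def python_loop(xs):
    total = 0
    for x in xs:
        total += x
    return total


# Stand-ins for kernel variants; on a GPU these would be compiled functions.
candidates = {"builtin_sum": sum, "python_loop": python_loop}
best_name, timings = autotune(candidates, list(range(10000)))
print(best_name, timings)
```

The winner is only valid for the environment and data it was measured on, which is why both appear in the goal statement above.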

Page 122: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Basic auto-tuning: pseudo-code (1/3)

Filter-bank Convolution / Correlation

Scripting, Py{CUDA,CL}

NoSQL (CouchDB, MongoDB) ?

Page 123: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Basic auto-tuning: pseudo-code (2/3)

PyCUDA/CL

Cheetah, Jinja, Mako

Page 124: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Basic auto-tuning: pseudo-code (3/3)

PyCUDA/CL

NoSQL (CouchDB, MongoDB)
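The pseudo-code slides themselves did not survive in this transcript, so the following is only a guess at the role of the result store they mention: remember the best configuration per (environment, data) pair. The sketch uses an in-memory dict with JSON keys as a stand-in for a document database such as CouchDB or MongoDB; the class `TuningStore`, its methods, and the timings in the usage lines are all hypothetical.

```python
import json


def make_key(environment, data_shape):
    """Build a canonical key from the environment and the data description."""
    return json.dumps({"env": environment, "data": data_shape}, sort_keys=True)


class TuningStore:
    """In-memory stand-in for a document store of auto-tuning results."""

    def __init__(self):
        self._docs = {}

    def record(self, environment, data_shape, config, seconds):
        """Keep a result only if it beats the best one stored for this key."""
        key = make_key(environment, data_shape)
        best = self._docs.get(key)
        if best is None or seconds < best["seconds"]:
            self._docs[key] = {"config": config, "seconds": seconds}

    def best_config(self, environment, data_shape):
        doc = self._docs.get(make_key(environment, data_shape))
        return None if doc is None else doc["config"]


store = TuningStore()
env = {"gpu": "GTX480", "sdk": "CUDA3.1"}
# Made-up timings, for illustration only.
store.record(env, [256, 256, 8], {"BLOCK_W": 64}, seconds=0.012)
store.record(env, [256, 256, 8], {"BLOCK_W": 128}, seconds=0.009)
store.record(env, [256, 256, 8], {"BLOCK_W": 256}, seconds=0.015)
print(store.best_config(env, [256, 256, 8]))
```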

Page 125: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Optimizing what?

Page 126: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Optimizing strategy

• Like many operations, filter-bank convolution is usually "communication bound" on the GPU:

- compute is cheap

- communication is expensive

• We must take advantage of all types of memory:

- explicit: gmem (global), smem (shared), cmem (constant), tmem (texture)

- implicit: rmem (registers), bmem (bin-code?) *

• Different optimal access patterns

Page 127: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Example: thread gmem output size

stupid float4 xyzw trick

Page 128: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Example: multiple smem loads

Page 129: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Example: using texture fetches

Page 130: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Example: register spilling

Page 131: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Example: register pressure (nvcc)

Page 132: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Example: capitalizing on bmem (bin code) ??

input offset in cubin code?

multiple versions of the same function with different input offsets

Page 133: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Results

Page 134: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Results

GPU / SDK            Input         Filter-bank   Meta-prog default (gflops)   Meta-prog auto-tuned (gflops)   Boost

9600M GT / CUDA3.1   256x256x8     64x9x9x8      6.710 ± 0.005                36.584 ± 0.023                  445.2 %
9600M GT / CUDA3.1   512x512x4     32x13x13x4    13.606 ± 0.002               35.582 ± 0.003                  161.5 %
9600M GT / CUDA3.1   1024x1024x8   16x5x5x8      20.034 ± 0.113               26.084 ± 6.243                  30.2 %
9600M GT / CUDA3.1   2048x2048x4   4x8x8x4       25.781 ± 0.044               46.945 ± 0.100                  82.1 %

C1060 / CUDA2.3      256x256x8     64x9x9x8      104.188 ± 0.051              168.083 ± 0.372                 61.3 %
C1060 / CUDA2.3      512x512x4     32x13x13x4    125.739 ± 0.109              234.053 ± 0.266                 86.1 %
C1060 / CUDA2.3      1024x1024x8   16x5x5x8      144.279 ± 0.764              243.697 ± 0.346                 68.9 %
C1060 / CUDA2.3      2048x2048x4   4x8x8x4       180.060 ± 0.018              322.328 ± 0.348                 79.0 %

GTX285 / CUDA2.3     256x256x8     64x9x9x8      123.396 ± 0.016              197.006 ± 0.219                 59.7 %
GTX285 / CUDA2.3     512x512x4     32x13x13x4    143.277 ± 0.044              270.206 ± 0.209                 88.6 %
GTX285 / CUDA2.3     1024x1024x8   16x5x5x8      148.841 ± 0.465              310.276 ± 0.538                 108.5 %
GTX285 / CUDA2.3     2048x2048x4   4x8x8x4       205.152 ± 0.015              376.685 ± 0.070                 83.6 %

GTX480 / CUDA3.1     256x256x8     64x9x9x8      467.631 ± 19.100             471.902 ± 11.419                0.9 %
GTX480 / CUDA3.1     512x512x4     32x13x13x4    834.838 ± 8.275              974.266 ± 3.809                 16.7 %
GTX480 / CUDA3.1     1024x1024x8   16x5x5x8      542.808 ± 1.135              614.019 ± 0.904                 13.1 %
GTX480 / CUDA3.1     2048x2048x4   4x8x8x4       378.165 ± 0.537              806.628 ± 0.168                 113.3 %

Pinto, Cox (Submitted)
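The Boost column is not defined on the slide, but the table's own numbers are consistent with it being the relative gain of the auto-tuned version over the default one. A small check of that reading (the function name `boost_percent` is made up; the gflops values are copied from the table):

```python
def boost_percent(default_gflops, tuned_gflops):
    """Relative gain of the auto-tuned version over the default, in percent."""
    return (tuned_gflops / default_gflops - 1.0) * 100.0


rows = [
    ("9600M GT", "256x256x8", 6.710, 36.584),
    ("GTX480", "256x256x8", 467.631, 471.902),
    ("GTX480", "512x512x4", 834.838, 974.266),
]
for gpu, shape, default, tuned in rows:
    print("%-9s %-10s %6.1f %%" % (gpu, shape, boost_percent(default, tuned)))
```

Under this reading, a boost of 100 % means the auto-tuned kernel sustains twice the throughput of the default one.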

Page 135: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Analysis

Page 136: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Analysis

Page 137: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Empirical results...

Performance (gflops)

Q9450 (Matlab/C) [2008]      0.3
Q9450 (C/SSE) [2008]         9.0
7900GTX (Cg) [2006]         68.2
PS3/Cell (C/ASM) [2007]    111.4
8800GTX (CUDA1.x) [2007]   192.7
GTX280 (CUDA2.x) [2008]    339.3
GTX480 (CUDA3.x) [2010]    974.3

>1000X speedup is game changing...

Page 138: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Summary

Page 139: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Summary

• Meta-programming makes developing high-performing code for GPU easier

• Fantastic tools exist (e.g. PyCUDA) to help

• Interesting way to explore/learn about GPUs (hw/sw)

• Coarse auto-tuning yields good results

Page 140: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Future

• More Fermi optimizations (L1 cache, concurrent kernels)

• OpenCL to optimize across vendors

• Smarter auto-tuning techniques (ML)

- (boosted) decision trees

- evolutionary programming strategies
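As a taste of what a smarter search could look like, here is a very small mutate-and-keep-the-best loop in the spirit of evolutionary strategies. Everything in it is an assumption made for illustration: the parameter space `SPACE`, the synthetic `toy_cost` function (a stand-in for a measured kernel time), and the function names.

```python
import random

# Hypothetical tuning parameters and their allowed values.
SPACE = {
    "BLOCK_W": [32, 64, 128, 256],
    "UNROLL": [1, 2, 4, 8],
    "USE_TEXTURE": [False, True],
}


def toy_cost(config):
    """Synthetic stand-in for a measured execution time (lower is better)."""
    penalty = 0.0 if config["USE_TEXTURE"] else 0.5
    return abs(config["BLOCK_W"] - 128) / 128.0 + 1.0 / config["UNROLL"] + penalty


def mutate(config, rng):
    """Copy the config and re-draw one randomly chosen parameter."""
    child = dict(config)
    name = rng.choice(sorted(SPACE))
    child[name] = rng.choice(SPACE[name])
    return child


def evolve(cost, start, generations=50, seed=0):
    """Keep a single parent; replace it whenever a mutated child is cheaper."""
    rng = random.Random(seed)
    best, best_cost = start, cost(start)
    for _ in range(generations):
        child = mutate(best, rng)
        child_cost = cost(child)
        if child_cost < best_cost:
            best, best_cost = child, child_cost
    return best, best_cost


start = {"BLOCK_W": 32, "UNROLL": 1, "USE_TEXTURE": False}
best, best_cost = evolve(toy_cost, start)
print(best, best_cost)
```

Unlike the exhaustive enumeration of a coarse grid, this evaluates only `generations + 1` configurations, which matters once each evaluation involves a compile and a timed run.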

Page 141: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

• Thu 3/31/11: PyOpenCL (A. Klöckner, NYU), ahh (C. Omar, CMU)

• Tue 3/29/11: Algorithm Strategies (W. Hwu, UIUC)

• Tue 4/5/11: Analysis-driven Optimization (C. Woolley, NVIDIA)

• Thu 4/7/11: Irregular Parallelism & Efficient Data Structures (J. Owens, UCDavis)

• Thu 4/14/11: Optimization for Ninjas (D. Merrill, UVirg)

• ...

More ?

Page 142: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

iPhD: one more thing or two...

Page 143: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Life/Code Hacking #2.x: Speed {listen,read,writ}ing

accelerated e-learning (c) / massively parallel {learn,programm}ing (c)

Page 144: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Life/Code Hacking #2.2b: Speed writing


Page 145: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Life/Code Hacking #2.2b: Speed writing


RSI ?

Page 146: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Life/Code Hacking #2.2b: Speed writing


RSI ?

Page 147: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Life/Code Hacking #2.2b: Speed writing

Page 148: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Life/Code Hacking #2.3: Speed reading


Page 149: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Life/Code Hacking #2.3: Speed reading


1. Collect many papers, docs, chapters, etc. (100)

2. Skim through them quickly / select (50)

3. Read w/o full understanding / select (25)

4. Read completely w/ full understanding / select (10)

5. Complete mastery + reproduction (5)

Page 150: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Life/Code Hacking #2.3: Speed reading


http://readerssoft.com/speed_reading_obstacles.php

Page 151: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Life/Code Hacking #2.3: Speed reading

normal reading vs. speed reading

http://readerssoft.com/speed_reading_obstacles.php

Page 152: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

Life/Code Hacking #2.3: Speed reading

like David Guetta, use one finger!

Page 153: [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

COME