Transcript of CUDA Performance Considerations, Patrick Cozzi, University of Pennsylvania, CIS 565 - Spring 2011.

Page 1:

CUDA Performance Considerations

Patrick Cozzi, University of Pennsylvania, CIS 565 - Spring 2011

Page 2:

Agenda

Data Prefetching
Loop Unrolling
Thread Granularity

Page 3:

Data Prefetching

Independent instructions between a global memory read and its use can hide memory latency

float m = Md[i];
float f = a * b + c * d;
float f2 = m * f;

Page 4:

Data Prefetching

Independent instructions between a global memory read and its use can hide memory latency

float m = Md[i];           // Read global memory
float f = a * b + c * d;
float f2 = m * f;

Page 5:

Data Prefetching

Independent instructions between a global memory read and its use can hide memory latency

float m = Md[i];
float f = a * b + c * d;   // Execute instructions that are not dependent on the memory read
float f2 = m * f;

Page 6:

Data Prefetching

Independent instructions between a global memory read and its use can hide memory latency

float m = Md[i];
float f = a * b + c * d;
float f2 = m * f;          // Use the value read from global memory; when enough warps execute the line above, the memory latency is hidden

Page 7:

Data Prefetching

Prefetching data from global memory can effectively increase the number of independent instructions between a global memory read and its use

Page 8:

Data Prefetching

Recall tiled matrix multiply:

for (/* ... */)
{
    // Load current tile into shared memory
    __syncthreads();

    // Accumulate dot product
    __syncthreads();
}
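For reference, a minimal sketch of the kernel this skeleton describes, written against the course's usual names (Md, Nd, Pd, Ms, Ns, BLOCK_SIZE, Width) and assuming Width is a multiple of BLOCK_SIZE; an illustration, not the assignment's exact code:

#define BLOCK_SIZE 16

__global__ void MatrixMulTiled(const float* Md, const float* Nd, float* Pd, int Width)
{
    __shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * BLOCK_SIZE + ty;
    int col = blockIdx.x * BLOCK_SIZE + tx;

    float Pvalue = 0.0f;
    for (int m = 0; m < Width / BLOCK_SIZE; ++m)
    {
        // Load current tile into shared memory
        Ms[ty][tx] = Md[row * Width + (m * BLOCK_SIZE + tx)];
        Ns[ty][tx] = Nd[(m * BLOCK_SIZE + ty) * Width + col];
        __syncthreads();

        // Accumulate dot product
        for (int k = 0; k < BLOCK_SIZE; ++k)
        {
            Pvalue += Ms[ty][k] * Ns[k][tx];
        }
        __syncthreads();
    }
    Pd[row * Width + col] = Pvalue;
}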

Page 9:

Data Prefetching

Tiled matrix multiply with prefetch:

// Load first tile into registers
for (/* ... */)
{
    // Deposit registers into shared memory
    __syncthreads();

    // Load next tile into registers
    // Accumulate dot product
    __syncthreads();
}


Page 11:

Data Prefetching

Tiled matrix multiply with prefetch:

// Load first tile into registers
for (/* ... */)
{
    // Deposit registers into shared memory
    __syncthreads();

    // Load next tile into registers
    // Accumulate dot product
    __syncthreads();
}

Prefetch for the next iteration of the loop

Page 12:

Data Prefetching

Tiled matrix multiply with prefetch:

// Load first tile into registers
for (/* ... */)
{
    // Deposit registers into shared memory
    __syncthreads();

    // Load next tile into registers
    // Accumulate dot product
    __syncthreads();
}

When enough warps execute these instructions, the memory latency of the prefetch is hidden
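A hedged sketch of how the prefetch comments might translate into code, using the same naming assumptions as the tiled kernel above; the register pair mReg/nReg is illustrative:

__global__ void MatrixMulPrefetch(const float* Md, const float* Nd, float* Pd, int Width)
{
    __shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * BLOCK_SIZE + ty;
    int col = blockIdx.x * BLOCK_SIZE + tx;
    int numTiles = Width / BLOCK_SIZE;

    // Load first tile into registers
    float mReg = Md[row * Width + tx];
    float nReg = Nd[ty * Width + col];

    float Pvalue = 0.0f;
    for (int m = 0; m < numTiles; ++m)
    {
        // Deposit registers into shared memory
        Ms[ty][tx] = mReg;
        Ns[ty][tx] = nReg;
        __syncthreads();

        // Load next tile into registers; the dot-product work below is
        // independent of these reads and hides their latency
        if (m + 1 < numTiles)
        {
            mReg = Md[row * Width + ((m + 1) * BLOCK_SIZE + tx)];
            nReg = Nd[((m + 1) * BLOCK_SIZE + ty) * Width + col];
        }

        // Accumulate dot product
        for (int k = 0; k < BLOCK_SIZE; ++k)
        {
            Pvalue += Ms[ty][k] * Ns[k][tx];
        }
        __syncthreads();
    }
    Pd[row * Width + col] = Pvalue;
}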

Page 13:

Data Prefetching

Cost
Added complexity
More registers – what does this imply?

Page 14:

Loop Unrolling

for (int k = 0; k < BLOCK_SIZE; ++k)
{
    Pvalue += Ms[ty][k] * Ns[k][tx];
}

Instructions per iteration
One floating-point multiply
One floating-point add
What else?

Page 15:

Loop Unrolling

for (int k = 0; k < BLOCK_SIZE; ++k)
{
    Pvalue += Ms[ty][k] * Ns[k][tx];
}

Other instructions per iteration
Update loop counter

Page 16:

Loop Unrolling

for (int k = 0; k < BLOCK_SIZE; ++k)
{
    Pvalue += Ms[ty][k] * Ns[k][tx];
}

Other instructions per iteration
Update loop counter
Branch

Page 17:

Loop Unrolling

for (int k = 0; k < BLOCK_SIZE; ++k)
{
    Pvalue += Ms[ty][k] * Ns[k][tx];
}

Other instructions per iteration
Update loop counter
Branch
Address arithmetic

Page 18:

Loop Unrolling

for (int k = 0; k < BLOCK_SIZE; ++k)
{
    Pvalue += Ms[ty][k] * Ns[k][tx];
}

Instruction Mix
2 floating-point arithmetic instructions
1 loop branch instruction
2 address arithmetic instructions
1 loop counter increment instruction

Page 19:

Loop Unrolling

Only 1/3 are floating-point calculations
But I want my full theoretical 346.5 GFLOPS (G80)
Consider loop unrolling

Page 20:

Loop Unrolling

Pvalue += Ms[ty][0]  * Ns[0][tx]  +
          Ms[ty][1]  * Ns[1][tx]  +
          ...
          Ms[ty][15] * Ns[15][tx]; // BLOCK_SIZE = 16

No more loop
No loop count update
No branch
Constant indices – no address arithmetic instructions
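A hedged aside: nvcc can perform the same transformation when asked, since BLOCK_SIZE is a compile-time constant; a sketch:

    // Equivalent to the manual unrolling above: with a compile-time
    // trip count, nvcc emits the 16 multiply-adds with constant
    // indices and no counter, branch, or address arithmetic.
    #pragma unroll
    for (int k = 0; k < BLOCK_SIZE; ++k)
    {
        Pvalue += Ms[ty][k] * Ns[k][tx];
    }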

Page 21:

Thread Granularity

How much work should one thread do?
Parallel Reduction – reduce two elements?
Matrix multiply – compute one element of Pd?

Page 22:

Thread Granularity

Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf

Matrix Multiply

Page 23:

Thread Granularity

Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf

Matrix Multiply
Both elements of Pd require the same row of Md

Page 24:

Thread Granularity

Matrix Multiply
Compute both Pd elements in the same thread
Reduces global memory access by ¼
Increases the number of independent instructions – what is the benefit?
New kernel uses more registers and shared memory – what does that imply?
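A hedged sketch of the coarser granularity: each thread computes two horizontally adjacent Pd elements, so the Md tile is read from global memory once but used for both dot products. Names follow the earlier kernels; the second Ns tile and the halved grid width in x are assumptions of this illustration.

__global__ void MatrixMulGranularity2(const float* Md, const float* Nd, float* Pd, int Width)
{
    __shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Ns0[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Ns1[BLOCK_SIZE][BLOCK_SIZE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row  = blockIdx.y * BLOCK_SIZE + ty;
    int col0 = (2 * blockIdx.x)     * BLOCK_SIZE + tx;  // first Pd element
    int col1 = (2 * blockIdx.x + 1) * BLOCK_SIZE + tx;  // second Pd element

    float Pvalue0 = 0.0f, Pvalue1 = 0.0f;
    for (int m = 0; m < Width / BLOCK_SIZE; ++m)
    {
        // The Md tile is loaded once and shared by both outputs
        Ms[ty][tx]  = Md[row * Width + (m * BLOCK_SIZE + tx)];
        Ns0[ty][tx] = Nd[(m * BLOCK_SIZE + ty) * Width + col0];
        Ns1[ty][tx] = Nd[(m * BLOCK_SIZE + ty) * Width + col1];
        __syncthreads();

        for (int k = 0; k < BLOCK_SIZE; ++k)
        {
            float mval = Ms[ty][k];   // read once, used for both dot products
            Pvalue0 += mval * Ns0[k][tx];
            Pvalue1 += mval * Ns1[k][tx];
        }
        __syncthreads();
    }
    Pd[row * Width + col0] = Pvalue0;
    Pd[row * Width + col1] = Pvalue1;
}

The kernel would be launched with half as many blocks in x; the extra shared memory and registers per block are exactly the cost the slide asks about.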

Page 25:

Matrix Multiply

What improves performance?
Prefetching?
Loop unrolling?
Thread granularity?
For what inputs?

Page 26:

Matrix Multiply

Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf

Page 27:

Matrix Multiply

Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf

8x8 Tiles
• Coarser thread granularity helps
• Prefetching doesn't
• Loop unrolling doesn't

Page 28:

Matrix Multiply

Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf

16x16 Tiles
• Coarser thread granularity helps

Page 29:

Matrix Multiply

Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf

16x16 Tiles
• Full loop unrolling can help

Page 30:

Matrix Multiply

Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf

16x16 Tiles
• Prefetch helps for 1x1 tiling

Page 31:

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, University of Illinois, Urbana-Champaign

Floating-Point Considerations

Page 32:

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, University of Illinois, Urbana-Champaign

What is the IEEE floating-point format?

A floating-point binary number consists of three parts: sign (S), exponent (E), and mantissa (M). Each (S, E, M) pattern uniquely identifies a floating-point number.

For each bit pattern, its IEEE floating-point value is derived as:

value = (-1)^S * M * 2^E, where 1.0 ≤ M < 10.0B (binary 10.0, i.e., 1 ≤ M < 2)

The interpretation of S is simple: S = 0 gives a positive number and S = 1 a negative number.

Page 33:

IEEE 754 Format

http://kipirvine.com/asm/workbook/floating_tut.htm

Single precision: 1-bit sign, 8-bit exponent (bias 127), 23-bit fraction

Double precision: 1-bit sign, 11-bit exponent (bias 1023), 52-bit fraction

Page 34:

Mantissa

Take -3.154 x 10^5 as an example: the sign is negative, the mantissa is 3.154, and the exponent is 5. The fractional portion of the mantissa is the sum of each digit multiplied by a power of 10:

.154 = 1/10 + 5/100 + 4/1000

A binary floating-point number is similar. For example, in the number +11.1011 x 2^3, the sign is positive, the mantissa is 11.1011, and the exponent is 3. The fractional portion of the mantissa is the sum of successive powers of 2; in our example:

.1011 = 1/2 + 0/4 + 1/8 + 1/16 = 0.6875

Combined with the integer part (binary 11 = decimal 3), the decimal value of the number is 3.6875.

http://kipirvine.com/asm/workbook/floating_tut.htm

Page 35:

Normalizing the Mantissa

Before a floating-point binary number can be stored correctly, its mantissa must be normalized. The process is basically the same as when normalizing a floating-point decimal number.

For example, decimal 1234.567 is normalized as 1.234567 x 10^3 by moving the decimal point so that only one digit appears before the decimal.

http://kipirvine.com/asm/workbook/floating_tut.htm

Page 36:

The Exponent

The exponent is stored as an 8-bit unsigned integer with a bias of 127 (2^(n-1) - 1 for n = 8). An example: for 1.101 x 2^5, the exponent (5) is added to 127 and the sum (132) is stored as binary 10000100.

http://kipirvine.com/asm/workbook/floating_tut.htm

Page 37:

Creating the IEEE Bit Representation

1.101 x 2^0 is stored as sign = 0 (positive), mantissa (fraction field) = 101 followed by zeros, and exponent = 01111111 (the exponent value 0 is added to the bias 127). The "1" to the left of the binary point is implicit and is dropped from the stored mantissa.

http://kipirvine.com/asm/workbook/floating_tut.htm
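A small host-side sketch (plain C++, not part of the slides) that unpacks the S, E, and M fields described above from a 32-bit float; the helper name is just for illustration:

#include <cstdio>
#include <cstring>
#include <cstdint>

// Decode the sign (1 bit), biased exponent (8 bits), and fraction (23 bits)
// of an IEEE 754 single-precision value.
void decodeFloat(float f)
{
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);        // reinterpret the bit pattern
    uint32_t sign     = bits >> 31;
    uint32_t exponent = (bits >> 23) & 0xFFu;   // stored with bias 127
    uint32_t fraction = bits & 0x7FFFFFu;       // implicit leading 1 not stored
    std::printf("%g: S=%u E=%u (unbiased %d) fraction=0x%06X\n",
                f, sign, exponent, (int)exponent - 127, fraction);
}

int main()
{
    decodeFloat(1.0f);       // S=0, E=127 (01111111), fraction=0
    decodeFloat(-3.154e5f);  // the decimal example from the earlier slide
    return 0;
}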

Page 38:

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, University of Illinois, Urbana-Champaign

Arithmetic Instruction Throughput

int and float add, shift, min, max, and float mul, mad: 4 cycles per warp

int multiply (*) is by default 32-bit and requires multiple cycles per warp
Use the __mul24() / __umul24() intrinsics for a 4-cycle 24-bit int multiply

Integer divide and modulo are expensive
The compiler will convert literal power-of-2 divides to shifts
Be explicit in cases where the compiler can't tell that the divisor is a power of 2!
Useful trick: foo % n == foo & (n-1) if n is a power of 2
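A hedged device-code sketch of the two points above; __umul24() is a real CUDA intrinsic, while the kernel and its names are only illustrative and assume n is a power of 2:

__global__ void wrapCopy(const int* in, int* out, int n)  // n assumed to be a power of 2
{
    // 24-bit multiply is enough for typical index math on G80 and
    // avoids the multi-cycle 32-bit integer multiply.
    int i = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;

    // Power-of-2 modulo as a mask: i % n == i & (n - 1)
    int wrapped = i & (n - 1);

    out[i] = in[wrapped];
}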

Page 39:

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, University of Illinois, Urbana-Champaign

Arithmetic Instruction Throughput

Reciprocal, reciprocal square root, sin/cos, log, exp: 16 cycles per warp
These are the versions prefixed with "__"
Examples: __rcp(), __sin(), __exp()

Other functions are combinations of the above:
y / x == rcp(x) * y == 20 cycles per warp
sqrt(x) == rcp(rsqrt(x)) == 32 cycles per warp

Page 40:

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, University of Illinois, Urbana-Champaign

Runtime Math Library

There are two types of runtime math operations:

__func(): direct mapping to hardware ISA
Fast but lower accuracy (see the programming guide for details)
Examples: __sinf(x), __expf(x), __powf(x,y)

func(): compiles to multiple instructions
Slower but higher accuracy (5 ulp, units in the last place, or less)
Examples: sinf(x), expf(x), powf(x,y)

The -use_fast_math compiler option forces every func() to compile to __func()
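A hedged sketch of the contrast; __sinf() and sinf() are real CUDA device functions, the kernel itself is illustrative:

__global__ void compareSin(const float* x, float* fast, float* accurate, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        fast[i]     = __sinf(x[i]);  // hardware special-function unit: fast, lower accuracy
        accurate[i] = sinf(x[i]);    // multi-instruction version: slower, higher accuracy
        // Compiling with -use_fast_math would turn the sinf() call into __sinf() too
    }
}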

Page 41:

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, University of Illinois, Urbana-Champaign

Make your program float-safe!

Future hardware will have double-precision support
G80 is single-precision only
Double precision will have an additional performance cost
Careless use of double or undeclared types may run more slowly on G80+

It is important to be float-safe (be explicit whenever you want single precision) to avoid using double precision where it is not needed

Add the 'f' specifier on float literals:
foo = bar * 0.123;  // double assumed
foo = bar * 0.123f; // float explicit

Use the float version of standard library functions:
foo = sin(bar);  // double assumed
foo = sinf(bar); // single precision explicit
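Putting the two rules together, a tiny illustrative kernel (names are made up) that stays in single precision on G80:

__global__ void scaleAndSin(float* data, float bar, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        // 0.123f and sinf() keep every operation single precision;
        // 0.123 and sin() would pull in double-precision semantics.
        data[i] = sinf(data[i]) * bar * 0.123f;
    }
}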

Page 42:

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, University of Illinois, Urbana-Champaign

Deviations from IEEE-754

Addition and multiplication are IEEE 754 compliant
Maximum 0.5 ulp (units in the last place) error
However, they are often combined into a multiply-add (FMAD) whose intermediate result is truncated

Division is non-compliant (2 ulp)
Not all rounding modes are supported
Denormalized numbers are not supported
There is no mechanism to detect floating-point exceptions

Page 43:

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, University of Illinois, Urbana-Champaign

GPU Floating Point Features

| Feature                            | G80                                | SSE                                           | IBM Altivec                 | Cell SPE                    |
| Precision                          | IEEE 754                           | IEEE 754                                      | IEEE 754                    | IEEE 754                    |
| Rounding modes for FADD and FMUL   | Round to nearest and round to zero | All 4 IEEE: round to nearest, zero, inf, -inf | Round to nearest only       | Round to zero/truncate only |
| Denormal handling                  | Flush to zero                      | Supported, 1000's of cycles                   | Supported, 1000's of cycles | Flush to zero               |
| NaN support                        | Yes                                | Yes                                           | Yes                         | No                          |
| Overflow and Infinity support      | Yes, only clamps to max norm       | Yes                                           | Yes                         | No, infinity                |
| Flags                              | No                                 | Yes                                           | Yes                         | Some                        |
| Square root                        | Software only                      | Hardware                                      | Software only               | Software only               |
| Division                           | Software only                      | Hardware                                      | Software only               | Software only               |
| Reciprocal estimate accuracy       | 24 bit                             | 12 bit                                        | 12 bit                      | 12 bit                      |
| Reciprocal sqrt estimate accuracy  | 23 bit                             | 12 bit                                        | 12 bit                      | 12 bit                      |
| log2(x) and 2^x estimates accuracy | 23 bit                             | No                                            | 12 bit                      | No                          |