CUDA Performance Considerations (2 of 2)
Patrick Cozzi
University of Pennsylvania
CIS 565 - Spring 2011

Administrivia

Friday 03/04, 11:59pm: Assignment 4 due
  Not bonus day eligible
Presentation date change due via email
Course Networking: Monday 03/14 or Friday 04/29

Survey

What are you interested in?
  More performance
  More parallel algorithms, e.g., sorting
  OpenCL
  Fermi, e.g., NVIDIA GeForce GTX 480

Agenda

Data Prefetching
Loop Unrolling
Thread Granularity
Bank Conflicts
Review Final Project

Data Prefetching

Independent instructions between a global memory read and its use can hide memory latency:

float m = Md[i];          // read from global memory
float f = a * b + c * d;  // execute instructions that do not depend on the read
float f2 = m * f;         // use the value read from global memory

If enough warps execute the independent instructions while the read is outstanding, the memory latency is hidden.

Prefetching data from global memory can effectively increase the number of independent instructions between the global memory read and its use.

Recall tiled matrix multiply:

for (/* ... */)
{
  // Load current tile into shared memory
  __syncthreads();
  // Accumulate dot product
  __syncthreads();
}
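For reference, a minimal sketch of what that loop looks like as a full kernel, assuming square TILE_SIZE x TILE_SIZE tiles, a matrix width that is a multiple of TILE_SIZE, and the Md/Nd/Pd/width names used in the course's tiled multiply (the kernel name and exact indexing are illustrative):

#define TILE_SIZE 16

__global__ void MatrixMulTiled(const float* Md, const float* Nd,
                               float* Pd, int width)
{
    __shared__ float Ms[TILE_SIZE][TILE_SIZE];
    __shared__ float Ns[TILE_SIZE][TILE_SIZE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE_SIZE + ty;   // output row this thread computes
    int col = blockIdx.x * TILE_SIZE + tx;   // output column this thread computes

    float Pvalue = 0.0f;

    for (int t = 0; t < width / TILE_SIZE; ++t)
    {
        // Load current tile into shared memory
        Ms[ty][tx] = Md[row * width + t * TILE_SIZE + tx];
        Ns[ty][tx] = Nd[(t * TILE_SIZE + ty) * width + col];
        __syncthreads();

        // Accumulate dot product
        for (int k = 0; k < TILE_SIZE; ++k)
            Pvalue += Ms[ty][k] * Ns[k][tx];
        __syncthreads();
    }

    Pd[row * width + col] = Pvalue;
}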

Tiled matrix multiply with prefetch:

// Load first tile into registers
for (/* ... */)
{
  // Deposit registers into shared memory
  __syncthreads();
  // Load next tile into registers (prefetch for the next iteration of the loop)
  // Accumulate dot product
  __syncthreads();
}

The dot-product accumulation, executed by enough warps, hides the memory latency of the prefetch.
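Filled in as real code, the prefetch version might look like the sketch below. It assumes the same tile layout and names as the plain tiled kernel above; placing the prefetch right after the first __syncthreads() is one reasonable choice, not the only one.

#define TILE_SIZE 16

__global__ void MatrixMulPrefetch(const float* Md, const float* Nd,
                                  float* Pd, int width)
{
    __shared__ float Ms[TILE_SIZE][TILE_SIZE];
    __shared__ float Ns[TILE_SIZE][TILE_SIZE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE_SIZE + ty;
    int col = blockIdx.x * TILE_SIZE + tx;

    // Load first tile into registers
    float mReg = Md[row * width + tx];
    float nReg = Nd[ty * width + col];

    float Pvalue = 0.0f;
    int numTiles = width / TILE_SIZE;

    for (int t = 0; t < numTiles; ++t)
    {
        // Deposit registers into shared memory
        Ms[ty][tx] = mReg;
        Ns[ty][tx] = nReg;
        __syncthreads();

        // Load next tile into registers; the global reads are issued now and
        // their latency is hidden by the dot-product work below
        if (t + 1 < numTiles)
        {
            mReg = Md[row * width + (t + 1) * TILE_SIZE + tx];
            nReg = Nd[((t + 1) * TILE_SIZE + ty) * width + col];
        }

        // Accumulate dot product
        for (int k = 0; k < TILE_SIZE; ++k)
            Pvalue += Ms[ty][k] * Ns[k][tx];

        __syncthreads();
    }

    Pd[row * width + col] = Pvalue;
}

The two extra registers per thread (mReg and nReg here) are part of the cost noted below.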

Cost:
  Added complexity
  More registers – what does this imply?

Loop Unrolling

for (int k = 0; k < BLOCK_SIZE; ++k)
{
  Pvalue += Ms[ty][k] * Ns[k][tx];
}

Instructions per iteration:
  One floating-point multiply
  One floating-point add
  What else?

Other instructions per iteration:
  Update loop counter
  Branch
  Address arithmetic

Instruction mix per iteration:
  2 floating-point arithmetic instructions
  1 loop branch instruction
  2 address arithmetic instructions
  1 loop counter increment instruction

Only 1/3 of the instructions are floating-point calculations, but we want the full theoretical 346.5 GFLOPS (G80). Consider loop unrolling.

Pvalue += Ms[ty][0]  * Ns[0][tx]  +
          Ms[ty][1]  * Ns[1][tx]  +
          ...
          Ms[ty][15] * Ns[15][tx]; // BLOCK_SIZE = 16

No more loop:
  No loop counter update
  No branch
  Constant indices – no address arithmetic instructions
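In practice the compiler can often do this unrolling for you when the trip count is a compile-time constant; a minimal sketch using nvcc's #pragma unroll directive on the original loop:

#pragma unroll
for (int k = 0; k < BLOCK_SIZE; ++k)  // BLOCK_SIZE known at compile time
{
    Pvalue += Ms[ty][k] * Ns[k][tx];
}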

Thread Granularity

How much work should one thread do?
  Parallel reduction: reduce two elements?
  Matrix multiply: compute one element of Pd?

Matrix multiply example (image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf): both elements of Pd require the same row of Md.

Compute both Pd elements in the same thread:
  Reduces global memory access by 1/4
  Increases the number of independent instructions
  What is the benefit?
  The new kernel uses more registers and shared memory – what does that imply?
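A minimal sketch of what "two Pd elements per thread" can look like, assuming the same tile layout and names as the earlier kernels. The 1x2 arrangement here (each block produces two horizontally adjacent output tiles, so the grid launches width / (2 * TILE_SIZE) blocks in x) is one possible choice:

#define TILE_SIZE 16

__global__ void MatrixMulTwoElements(const float* Md, const float* Nd,
                                     float* Pd, int width)
{
    __shared__ float Ms[TILE_SIZE][TILE_SIZE];
    __shared__ float Ns0[TILE_SIZE][TILE_SIZE];
    __shared__ float Ns1[TILE_SIZE][TILE_SIZE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row  = blockIdx.y * TILE_SIZE + ty;
    int col0 = (2 * blockIdx.x)     * TILE_SIZE + tx;  // first output column
    int col1 = (2 * blockIdx.x + 1) * TILE_SIZE + tx;  // second output column

    float Pvalue0 = 0.0f, Pvalue1 = 0.0f;

    for (int t = 0; t < width / TILE_SIZE; ++t)
    {
        // One Md tile load now serves both output elements
        Ms[ty][tx]  = Md[row * width + t * TILE_SIZE + tx];
        Ns0[ty][tx] = Nd[(t * TILE_SIZE + ty) * width + col0];
        Ns1[ty][tx] = Nd[(t * TILE_SIZE + ty) * width + col1];
        __syncthreads();

        for (int k = 0; k < TILE_SIZE; ++k)
        {
            Pvalue0 += Ms[ty][k] * Ns0[k][tx];
            Pvalue1 += Ms[ty][k] * Ns1[k][tx];
        }
        __syncthreads();
    }

    Pd[row * width + col0] = Pvalue0;
    Pd[row * width + col1] = Pvalue1;
}

Each thread now holds two accumulators and the block uses three shared tiles instead of two – the extra registers and shared memory the question above refers to.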

Matrix Multiply

What improves performance?
  Prefetching?
  Loop unrolling?
  Thread granularity?
For what inputs?

Performance results (images from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf):

8x8 tiles:
  Coarser thread granularity helps
  Prefetching doesn't
  Loop unrolling doesn't

16x16 tiles:
  Coarser thread granularity helps
  Full loop unrolling can help
  Prefetch helps for 1x1 tiling

Bank Conflicts

Shared memory is the same speed as registers… usually.
  Registers – per thread
  Shared memory – per block

G80 limits per SM:
  Thread block slots: 8
  Thread slots: 768
  Registers: 8K
  Shared memory: 16K

Shared memory access patterns can affect performance. Why?

Shared memory is sometimes called a parallel data cache:
  Multiple threads can access shared memory at the same time
  To achieve high bandwidth, memory is divided into banks
Image from http://courses.engr.illinois.edu/ece498/al/Syllabus.html

G80 banks:
  16 banks. Why?
  Per-bank bandwidth: 32 bits per two cycles
  Successive 32-bit words are assigned to successive banks
  Bank = address % 16. Why?
Image from http://courses.engr.illinois.edu/ece498/al/Syllabus.html

Each bank can service one address per two cycles.
Bank conflict: two simultaneous accesses to the same bank, but not the same address.
  Conflicting accesses are serialized.
Image from http://courses.engr.illinois.edu/ece498/al/Syllabus.html

Bank conflicts? Linear addressing, stride == 1
Bank conflicts? Random 1:1 permutation
(Figure: threads 0–15 mapped to banks 0–15; image from http://courses.engr.illinois.edu/ece498/al/Syllabus.html)

Bank conflicts? Linear addressing, stride == 2
Bank conflicts? Linear addressing, stride == 8
(Figure: thread-to-bank mappings for the two strides; the x8 markers indicate eight threads mapping to a single bank; image from http://courses.engr.illinois.edu/ece498/al/Syllabus.html)

Fast path 1: all threads in a half-warp access different banks.
(Figure: threads 0–15 each mapped to a distinct bank; image from http://courses.engr.illinois.edu/ece498/al/Syllabus.html)

Fast path 2: all threads in a half-warp access the same address.
(Figure: threads 0–15 all reading the same address in one bank; image from http://courses.engr.illinois.edu/ece498/al/Syllabus.html)

Slow path: multiple threads in a half-warp access the same bank (but not the same address).
  Access is serialized.
  What is the cost?
(Figure: several threads mapped to the same bank; image from http://courses.engr.illinois.edu/ece498/al/Syllabus.html)
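A common way to hit this slow path is reading a shared tile by column; the sketch below (my own illustration, not from the slides) shows the conflicted access and the usual padding workaround, assuming a 16x16 thread block:

__global__ void ColumnReadExample(const float* in, float* out)
{
    __shared__ float tile[16][16];    // row length 16: a column lives in one bank
    __shared__ float padded[16][17];  // row length 17: a column spreads across banks

    int tx = threadIdx.x, ty = threadIdx.y;

    tile[ty][tx]   = in[ty * 16 + tx];
    padded[ty][tx] = in[ty * 16 + tx];
    __syncthreads();

    // Column read: word address tx * 16 + ty -> bank ty for every tx,
    // so all 16 threads of the half-warp hit the same bank (16-way conflict).
    float slow = tile[tx][ty];

    // Padded column read: word address tx * 17 + ty -> bank (tx + ty) % 16,
    // a different bank for each tx, so the access is conflict free.
    float fast = padded[tx][ty];

    out[ty * 16 + tx] = slow + fast;
}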

__shared__ float shared[256];
// ...
float f = shared[index + s * threadIdx.x];

For what values of s is this conflict free?
Hint: The G80 has 16 banks.

(Figure: for the access above with s = 1 and with s = 3, each thread in the half-warp maps to a distinct bank, so both are conflict free; image from http://courses.engr.illinois.edu/ece498/al/Syllabus.html)
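One way to convince yourself which strides are conflict free is to count, for each stride, how many threads of a half-warp land in each bank. A small host-side sketch (my own illustration, assuming the G80's 16 banks and 16-thread half-warps):

#include <stdio.h>

int main(void)
{
    const int NUM_BANKS = 16;
    const int HALF_WARP = 16;
    int strides[] = { 1, 2, 3, 8 };

    for (int i = 0; i < 4; ++i)
    {
        int s = strides[i];
        int count[16] = { 0 };

        // bank = (32-bit word address) % 16; threadIdx.x runs 0..15
        for (int t = 0; t < HALF_WARP; ++t)
            ++count[(s * t) % NUM_BANKS];

        int worst = 0;
        for (int b = 0; b < NUM_BANKS; ++b)
            if (count[b] > worst) worst = count[b];

        printf("stride %d: worst case %d thread(s) per bank%s\n",
               s, worst, worst == 1 ? " (conflict free)" : "");
    }
    return 0;
}

Odd strides (coprime with 16) are conflict free; even strides put multiple threads in the same bank.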

Without using a profiler, how can we tell what kind of speedup we can expect by removing bank conflicts?
What happens if more than one thread in a warp writes to the same shared memory address (non-atomic instruction)?