Optimization Strategies - Virginia Techaccel.cs.vt.edu/files/lecture12.pdf · N ElE xample:...

27
Optimization Strategies Optimization Strategies Optimization Strategies Global Memory Access Pattern and Control Flow Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes

Transcript of Optimization Strategies - Virginia Techaccel.cs.vt.edu/files/lecture12.pdf · N ElE xample:...

Page 1: Optimization Strategies - Virginia Techaccel.cs.vt.edu/files/lecture12.pdf · N ElE xample: Nonon--coacoa 3d t l f ldl esced float3 read __global__ void accessFloat3(float3 *d_in,

Optimization Strategies

Optimization StrategiesOptimization Strategies─ Global Memory Access Pattern and Control Flow

Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes

Page 2: Optimization Strategies - Virginia Techaccel.cs.vt.edu/files/lecture12.pdf · N ElE xample: Nonon--coacoa 3d t l f ldl esced float3 read __global__ void accessFloat3(float3 *d_in,

Optimization Strategies

Obj tiObj tiObjectivesObjectives

Optimization StrategiesOptimization StrategiesGlobal Memory Access Pattern (Coalescing)Control Flow (Divergent branch)Control Flow (Divergent branch)

Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes

Page 3: Optimization Strategies - Virginia Techaccel.cs.vt.edu/files/lecture12.pdf · N ElE xample: Nonon--coacoa 3d t l f ldl esced float3 read __global__ void accessFloat3(float3 *d_in,

Optimization Strategies

Gl b l M AGl b l M AGlobal Memory AccessGlobal Memory AccessHighest latency instructions: 400-600 clock cycles

Likely to be performance bottleneck

Optimizations can greatly increase performanceBest access pattern: CoalescingBest access pattern: CoalescingUp to 10x speedup

Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes

Page 4: Optimization Strategies - Virginia Techaccel.cs.vt.edu/files/lecture12.pdf · N ElE xample: Nonon--coacoa 3d t l f ldl esced float3 read __global__ void accessFloat3(float3 *d_in,

Optimization Strategies

C l d M AC l d M ACoalesced Memory AccessCoalesced Memory AccessA coordinated read by a half-warp (16 threads)A contiguous region of global memory:

64 bytes - each thread reads a word: int, float, …128 bytes - each thread reads a double-word: int2, float2, …128 bytes each thread reads a double word: int2, float2, …256 bytes – each thread reads a quad-word: int4, float4, …

Additional restrictions on G8X architecture:Starting address for a region must be a multiple of region sizeThe kth thread in a half-warp must access the kth element in a block being read

Exception: not all threads must be participatingPredicated access divergence within a halfwarp

Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes

Predicated access, divergence within a halfwarp

Page 5: Optimization Strategies - Virginia Techaccel.cs.vt.edu/files/lecture12.pdf · N ElE xample: Nonon--coacoa 3d t l f ldl esced float3 read __global__ void accessFloat3(float3 *d_in,

Optimization Strategies

C l d A R di g fl tC l d A R di g fl tCoalesced Access: Reading floatsCoalesced Access: Reading floats

All threads participate

Some threads do not participate

Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes

Page 6: Optimization Strategies - Virginia Techaccel.cs.vt.edu/files/lecture12.pdf · N ElE xample: Nonon--coacoa 3d t l f ldl esced float3 read __global__ void accessFloat3(float3 *d_in,

Optimization Strategies

NN C l d A R di g fl tC l d A R di g fl tNonNon--Coalesced Access: Reading floatsCoalesced Access: Reading floats

Permuted access by threads

Misaligned starting address (not a multiple of 64)

Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes

Page 7: Optimization Strategies - Virginia Techaccel.cs.vt.edu/files/lecture12.pdf · N ElE xample: Nonon--coacoa 3d t l f ldl esced float3 read __global__ void accessFloat3(float3 *d_in,

Optimization Strategies

E l NE l N l d fl t3 dl d fl t3 dExample: NonExample: Non--coalesced float3 readcoalesced float3 read__global__ void accessFloat3(float3 *d_in, float3 d_out)

{{int index = blockIdx.x * blockDim.x + threadIdx.x;float3 a = d_in[index];a.x += 2;a.y += 2;a.z += 2;d_out[index] = a;

}

Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes

Page 8: Optimization Strategies - Virginia Techaccel.cs.vt.edu/files/lecture12.pdf · N ElE xample: Nonon--coacoa 3d t l f ldl esced float3 read __global__ void accessFloat3(float3 *d_in,

Optimization Strategies

E l NE l N l d fl t3 d (C t’)l d fl t3 d (C t’)Example: NonExample: Non--coalesced float3 read (Cont’)coalesced float3 read (Cont’)

float3 is 12 bytesEach thread ends up executing 3 reads

sizeof(float3) ≠ 4, 8, or 12Half-warp reads three 64B non-contiguous regions

Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes

Page 9: Optimization Strategies - Virginia Techaccel.cs.vt.edu/files/lecture12.pdf · N ElE xample: Nonon--coacoa 3d t l f ldl esced float3 read __global__ void accessFloat3(float3 *d_in,

Optimization Strategies

E l NE l N l d fl t3 d (2)l d fl t3 d (2)Example: NonExample: Non--coalesced float3 read (2)coalesced float3 read (2)

Similarly, step 3 start at offset 512

Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes

Page 10: Optimization Strategies - Virginia Techaccel.cs.vt.edu/files/lecture12.pdf · N ElE xample: Nonon--coacoa 3d t l f ldl esced float3 read __global__ void accessFloat3(float3 *d_in,

Optimization Strategies

E l NE l N l d fl t3 d (3)l d fl t3 d (3)Example: NonExample: Non--coalesced float3 read (3)coalesced float3 read (3)

Use shared memory to allow coalescingy gNeed sizeof(float3)*(threads/block) bytes of SMEMEach thread reads 3 scalar floats:

Offsets: 0, (threads/block), 2*(threads/block)These will likely be processed by other threads, so sync

ProcessingEach thread retrieves its float3 from SMEM arrayy

Cast the SMEM pointer to (float3*)Use thread ID as index

Rest of the compute code does not change!

Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes

Page 11: Optimization Strategies - Virginia Techaccel.cs.vt.edu/files/lecture12.pdf · N ElE xample: Nonon--coacoa 3d t l f ldl esced float3 read __global__ void accessFloat3(float3 *d_in,

Optimization Strategies

E l Fi l C l d C dE l Fi l C l d C dExample: Final Coalesced CodeExample: Final Coalesced Code

Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes

Page 12: Optimization Strategies - Virginia Techaccel.cs.vt.edu/files/lecture12.pdf · N ElE xample: Nonon--coacoa 3d t l f ldl esced float3 read __global__ void accessFloat3(float3 *d_in,

Optimization Strategies

Coalescing: Structure of Size ≠ 4 8 16 BytesCoalescing: Structure of Size ≠ 4 8 16 BytesCoalescing: Structure of Size ≠ 4, 8, 16 BytesCoalescing: Structure of Size ≠ 4, 8, 16 Bytes

Use a structure of arrays instead of Array of Structure y yIf Array of Structure is not viable:

Force structure alignment: __align(X), where X = 4, 8, or 16Use SMEM to achieve coalescing

Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes

Page 13: Optimization Strategies - Virginia Techaccel.cs.vt.edu/files/lecture12.pdf · N ElE xample: Nonon--coacoa 3d t l f ldl esced float3 read __global__ void accessFloat3(float3 *d_in,

Optimization Strategies

C t l Fl C t l Fl I t ti i GPUI t ti i GPUControl Flow Control Flow Instructions in GPUsInstructions in GPUs

Main performance concern with branching is Main performance concern with branching is divergence

Threads within a single warp take different pathsDifferent execution paths are serializedDifferent execution paths are serialized

The control paths taken by the threads in a warp are traversed one at a time until there is no more.

Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes

Page 14: Optimization Strategies - Virginia Techaccel.cs.vt.edu/files/lecture12.pdf · N ElE xample: Nonon--coacoa 3d t l f ldl esced float3 read __global__ void accessFloat3(float3 *d_in,

Optimization Strategies

Di g t B hDi g t B hDivergent BranchDivergent Branch

A common case: avoid divergence when branch condition i f i f h d IDis a function of thread ID

Example with divergence: If (threadIdx.x > 2) { }This creates two different control paths for threads in a blockThis creates two different control paths for threads in a blockBranch granularity < warp size; threads 0 and 1 follow different path than the rest of the threads in the first warp

Example without divergence:If (threadIdx.x / WARP_SIZE > 2) { }Also creates two different control paths for threads in a blockBranch granularity is a whole multiple of warp size; all threads in any given warp follow the same pathy g p p

Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes

14

Page 15: Optimization Strategies - Virginia Techaccel.cs.vt.edu/files/lecture12.pdf · N ElE xample: Nonon--coacoa 3d t l f ldl esced float3 read __global__ void accessFloat3(float3 *d_in,

Optimization Strategies

P ll l P ll l R d tiR d tiParallel Parallel ReductionReduction

Given an array of values, “reduce” th t i l l i ll lthem to a single value in parallelExamples

sum reduction: sum of all values in the arrayMax reduction: maximum of all values in the array

T i ll ll l i l t tiTypically parallel implementation:Recursively halve # threads, add two values per threadTakes log(n) steps for n elements Takes log(n) steps for n elements, requires n/2 threads

Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes

15

Page 16: Optimization Strategies - Virginia Techaccel.cs.vt.edu/files/lecture12.pdf · N ElE xample: Nonon--coacoa 3d t l f ldl esced float3 read __global__ void accessFloat3(float3 *d_in,

Optimization Strategies

A V t R d ti E lA V t R d ti E lA Vector Reduction ExampleA Vector Reduction Example

Assume an in place reduction using shared Assume an in-place reduction using shared memory

The original vector is in device global memoryThe original vector is in device global memoryThe shared memory used to hold a partial sum vectorEach iteration brings the partial sum vector closer to the final sumThe final solution will be in element 0The final solution will be in element 0

Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes

16

Page 17: Optimization Strategies - Virginia Techaccel.cs.vt.edu/files/lecture12.pdf · N ElE xample: Nonon--coacoa 3d t l f ldl esced float3 read __global__ void accessFloat3(float3 *d_in,

Optimization Strategies

V t R d tiV t R d tiVector ReductionVector ReductionArray elements

0 1 2 3 4 5 76 1098 11

0+1 2+3 4+5 6+7 10+118+91

0...3 4..7 8..112

0..7 8..153

iterations

Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes

iterations

Page 18: Optimization Strategies - Virginia Techaccel.cs.vt.edu/files/lecture12.pdf · N ElE xample: Nonon--coacoa 3d t l f ldl esced float3 read __global__ void accessFloat3(float3 *d_in,

Optimization Strategies

A i l i l t tiA i l i l t tiA simple implementationA simple implementation

Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes

18

Page 19: Optimization Strategies - Virginia Techaccel.cs.vt.edu/files/lecture12.pdf · N ElE xample: Nonon--coacoa 3d t l f ldl esced float3 read __global__ void accessFloat3(float3 *d_in,

Optimization Strategies

I t l d R d tiI t l d R d tiInterleaved ReductionInterleaved Reduction

2 4 6 8 10 12 14

4 8 12

8

Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes

Page 20: Optimization Strategies - Virginia Techaccel.cs.vt.edu/files/lecture12.pdf · N ElE xample: Nonon--coacoa 3d t l f ldl esced float3 read __global__ void accessFloat3(float3 *d_in,

Optimization Strategies

S Ob tiS Ob tiSome ObservationsSome ObservationsIn each iterations, two control flow paths will be sequentially traversed for each warptraversed for each warp

Threads that perform addition and threads that do notThreads that do not perform addition may cost extra cycles depending on the implementation of divergence

Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes

20

Page 21: Optimization Strategies - Virginia Techaccel.cs.vt.edu/files/lecture12.pdf · N ElE xample: Nonon--coacoa 3d t l f ldl esced float3 read __global__ void accessFloat3(float3 *d_in,

Optimization Strategies

S S Ob ti (C t’)Ob ti (C t’)Some Some Observations (Cont’)Observations (Cont’)No more than half of threads will be executing at any time

All odd index threads are disabled right from the beginning!All odd index threads are disabled right from the beginning!On average, less than ¼ of the threads will be activated for all warps over time.After the 5th iteration, entire warps in each block will be disabled, poor resource utilization but no divergence.

This can go on for a while, up to 4 more iterations (512/32=16= 24), where each iteration only has one thread activated until all warps retire

Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes

21

Page 22: Optimization Strategies - Virginia Techaccel.cs.vt.edu/files/lecture12.pdf · N ElE xample: Nonon--coacoa 3d t l f ldl esced float3 read __global__ void accessFloat3(float3 *d_in,

Optimization Strategies

O ti i ti 1O ti i ti 1Optimization 1:Optimization 1:

Replace divergent branch

With strided index and non divergent branchWith strided index and non-divergent branch

Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes

22

Page 23: Optimization Strategies - Virginia Techaccel.cs.vt.edu/files/lecture12.pdf · N ElE xample: Nonon--coacoa 3d t l f ldl esced float3 read __global__ void accessFloat3(float3 *d_in,

Optimization Strategies

O ti i ti 1 (C t’)O ti i ti 1 (C t’)Optimization 1: (Cont’)Optimization 1: (Cont’)

Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes

No divergence until less than 16 sub sum.

Page 24: Optimization Strategies - Virginia Techaccel.cs.vt.edu/files/lecture12.pdf · N ElE xample: Nonon--coacoa 3d t l f ldl esced float3 read __global__ void accessFloat3(float3 *d_in,

Optimization Strategies

O ti i ti 1 B k C fli t IO ti i ti 1 B k C fli t IOptimization 1: Bank Conflict IssueOptimization 1: Bank Conflict Issue

Bank Conflict due to the Strided Addressing

Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes

24

Page 25: Optimization Strategies - Virginia Techaccel.cs.vt.edu/files/lecture12.pdf · N ElE xample: Nonon--coacoa 3d t l f ldl esced float3 read __global__ void accessFloat3(float3 *d_in,

Optimization Strategies

O ti i ti 2 S ti l Add i gO ti i ti 2 S ti l Add i gOptimization 2: Sequential AddressingOptimization 2: Sequential Addressing

Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes

Page 26: Optimization Strategies - Virginia Techaccel.cs.vt.edu/files/lecture12.pdf · N ElE xample: Nonon--coacoa 3d t l f ldl esced float3 read __global__ void accessFloat3(float3 *d_in,

Optimization Strategies

O ti i ti 2 (C t’)O ti i ti 2 (C t’)Optimization 2: (Cont’)Optimization 2: (Cont’)

Replace strided indexing

With reversed loop and threadID based indexingWith reversed loop and threadID-based indexing

Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes

Page 27: Optimization Strategies - Virginia Techaccel.cs.vt.edu/files/lecture12.pdf · N ElE xample: Nonon--coacoa 3d t l f ldl esced float3 read __global__ void accessFloat3(float3 *d_in,

Optimization Strategies

Some Observations About the New ImplementationSome Observations About the New Implementation

Only the last 5 iterations will have divergenceOnly the last 5 iterations will have divergenceEntire warps will be shut down as iterations progressprogress

For a 512-thread block, 4 iterations to shut down all but one warps in each blockpBetter resource utilization, will likely retire warps and thus blocks faster

Recall, no bank conflicts either

Copyright © 2009 by Yong Cao, Referencing UIUC ECE498AL Course Notes

27