
ECE 8823A GPU Architectures

Module 5: Execution and Resources - I

Reading Assignment

• Kirk and Hwu, “Programming Massively Parallel Processors: A Hands-on Approach,” Chapter 6

• CUDA Programming Guide – http://docs.nvidia.com/cuda/cuda-c-programming-guide/#abstract


Objective

• To understand the implications of programming model constructs on demand for execution resources

• To be able to reason about performance consequences of programming model parameters
  – Thread blocks, warps, memory behaviors, etc.
  – Need deeper understanding of the architecture to be really valuable (later)
• To understand DRAM bandwidth
  – Cause of the DRAM bandwidth problem
  – Programming techniques that address the problem: memory coalescing, corner turning

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, 2007-2012


Closer Look: Formation of Warps


• How do you form warps out of multidimensional arrays of threads?
  – Linearize thread IDs

[Figure: Grid 1 contains Blocks (0,0), (0,1), (1,0), and (1,1). Within Block (1,1), the threads of a 1D and a 3D thread block (Thread(0,0,0)…Thread(0,0,3), Thread(0,1,0)…Thread(0,1,3), Thread(1,0,0)…Thread(1,0,3)) are linearized and packed into warps.]


Formation of Warps


[Figure: the same grid and Block (1,1) as before; the threads of a 2D/3D thread block are linearized in row-major order: T0,0,0 T0,0,1 T0,0,2 T0,0,3 T0,1,0 T0,1,1 T0,1,2 T0,1,3 T1,0,0 T1,0,1 T1,0,2 T1,0,3 T1,1,0 T1,1,1 T1,1,2 T1,1,3.]

Mapping Thread Blocks to Warps


[Figure: a 2D thread block (T0,0 … T7,3) split into Warp 0 and Warp 1.]

An example with a warp size of 16 threads:
• Follow row-major order through the Z-dimension
• Linearize and then split into warps
• Understanding this becomes important when optimizing global memory accesses (see the sketch below)
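The same linearization can be computed inside a kernel (a minimal sketch; note that warpSize is 32 on current NVIDIA GPUs, not the 16 used in the example above):

// Minimal sketch: which warp of its thread block a thread belongs to.
__device__ unsigned int warpOfThread(void)
{
    // Row-major linearization: x varies fastest, then y, then z.
    unsigned int linear = threadIdx.z * blockDim.y * blockDim.x
                        + threadIdx.y * blockDim.x
                        + threadIdx.x;
    return linear / warpSize;   // the lane within the warp is linear % warpSize
}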


Execution of Warps

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, 2007-2012


• Each warp is executed as a SIMD bundle
• How do we handle divergent control flow among threads in a warp?
  – Execution semantics
  – How is it implemented? (later)
  – How can we optimize against it?

Impact of Control Divergence


• Occurs within a warp
• Branches lead to serialization of the branch-dependent code
• Performance issue: low warp utilization

[Figure: a warp executing if(…){…} else {…}. The taken and not-taken paths are serialized; threads on the inactive path sit idle during each, and the warp reconverges after the branch.]


Causes

• Traditional nested branches
• Loops
  – Variable number of iterations per thread
  – Loop condition based on thread ID?
• Switching on thread ID, e.g., if (threadIdx.x > 5) {} (contrast with the warp-uniform branch sketched below)
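For contrast, a branch whose condition is uniform within each warp does not diverge (a sketch; doA and doB are placeholder device functions, and 32 is the usual NVIDIA warp size):

// Diverges: in warp 0, lanes 0-5 and lanes 6-31 take different paths,
// so both sides of the branch are serialized.
if (threadIdx.x > 5) doA(); else doB();

// No divergence for a warp size of 32: threadIdx.x / 32 is constant
// across each warp, so every lane of a warp takes the same path.
if ((threadIdx.x / 32) % 2 == 0) doA(); else doB();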

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, 2007-2012


Control Divergence Mitigation: Algorithmic Approach


Benefits of SIMD execution + flexibility of MIMD control flow

Can algorithmic techniques maximize the utilization achieved by a warp?


Reduction

• A commonly used strategy for processing large input data sets
  – There is no required order of processing elements in a data set (associative and commutative)
  – Partition the data set into smaller chunks
  – Have each thread process a chunk
  – Use a reduction tree to summarize the results from each chunk into the final answer
• We will focus on the reduction tree step for now.
• Google and Hadoop MapReduce frameworks are examples of this pattern

© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012


A parallel reduction tree algorithm performs N-1 operations in log(N) steps

© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012


[Figure: a max-reduction tree. Input 3 1 7 0 4 1 6 3; pairwise max yields 3 7 4 6; the next level yields 7 6; the final result is 7.]


Reduction: Approach 1

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, 2007-2012


extern __shared__ float partialsum[];
// … (each thread loads one input element into partialsum[t]) …
unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
    __syncthreads();
    if (t % (2 * stride) == 0)
        partialsum[t] += partialsum[t + stride];
}

[Figure: threads 0–7 of a thread block reduce data in shared memory in place, indexed by threadIdx.x: step 1 forms 0+1, 2+3, 4+5, 6+7; step 2 forms 0..3 and 4..7; step 3 forms 0..7. Only every (2*stride)-th thread is active in each step.]

• O(N) additions and therefore work efficient?

• Hardware efficiency?
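Both questions can be answered from the loop itself: step s performs N/2^s additions, so the total is N/2 + N/4 + … + 1 = N - 1, exactly the count of a sequential reduction (work efficient). Hardware efficiency is the weak point: after the first step, most threads in every warp fail the t % (2*stride) == 0 test yet still occupy warp execution slots.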


A Better Strategy

• Principle: shift the index usage to ensure high thread utilization in a warp
  – Remap thread indices
• Keep the active threads consecutive

© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012



© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012


[Figure: an example with 16 threads. Thread t (t = 0…15) adds elements t and t+16, i.e., 0+16 through 15+31, so the active threads are always consecutive: no divergence.]

Reduction: Approach 2

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, 2007-2012


extern __shared__ float partialsum[];
// … (each thread loads one input element into partialsum[t]) …
unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x / 2; stride >= 1; stride /= 2) {
    __syncthreads();
    if (t < stride)
        partialsum[t] += partialsum[t + stride];
}

• The difference is in which threads diverge!
• For a thread block of 512 threads
  – Threads 0–255 take the branch, 256–511 do not
• For a warp size of 32, all threads in a warp have identical branch conditions → no divergence!
• When #active threads < warp size → the old problem returns for the final steps (a complete kernel sketch follows)
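Putting the pieces together, a minimal complete kernel built around Approach 2 might look as follows (a sketch, not the textbook's exact listing; the kernel name, the per-block output array d_out, and the assumption that blockDim.x is a power of two are mine):

__global__ void blockReduce(const float* d_in, float* d_out, int n)
{
    extern __shared__ float partialsum[];   // blockDim.x floats, sized at launch
    unsigned int t = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + t;

    // Each thread loads one element (0 if out of range).
    partialsum[t] = (i < n) ? d_in[i] : 0.0f;

    // Approach 2: active threads stay consecutive, so whole warps retire early.
    for (unsigned int stride = blockDim.x / 2; stride >= 1; stride /= 2) {
        __syncthreads();
        if (t < stride)
            partialsum[t] += partialsum[t + stride];
    }
    if (t == 0)
        d_out[blockIdx.x] = partialsum[0];   // one partial sum per block
}

// Launch sketch: the third parameter sets the dynamic shared-memory size.
// blockReduce<<<numBlocks, 512, 512 * sizeof(float)>>>(d_in, d_out, n);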


Global Memory Bandwidth

• How can we map thread access patterns to global memory addresses to maximize bandwidth utilization?

• Need to understand the organization of DRAMs!
  – Hierarchy of latencies

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, 2007-2012


Basic Organization

[Figure: a DRAM core array. A row decoder selects a row; the row's bits are captured by sense amps and a buffer; a mux selects which bits drive the I/O pins. Example: 32x32 = 1024-bit array.]

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012


1Gb Micron DDR2 SDRAM

[Figure: datasheet timing diagram annotated with the row access time and the column access time.]

Technology Trends


Over the past two decades:
• Data rate increase: ~1000x
• RAS/CAS latency decrease: ~56%

Courtesy: Synopsys DesignWare Technical Bulletin

How? → increasing burst length


DRAM Bursting for an 8x2 Bank

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

[Figure: timing comparison for an 8x2 bank. Non-burst timing: every access sends address bits to the decoder and pays the full core-array access delay before 2 bits reach the pins. Burst timing: one core-array access delay, then several back-to-back 2-bit transfers to the pins.]

Modern DRAM systems are designed to always be accessed in burst mode. Burst bytes are transferred, but discarded when accesses are not to sequential locations.


Multiple DRAM Banks

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

[Figure: two DRAM banks (Bank 0 and Bank 1), each with its own decoder, sense amps, and mux, sharing the same data pins.]


DRAM Bursting for the 8x2 Bank

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

[Figure: timing comparison for the 8x2 bank. Single-bank burst timing leaves dead time on the interface between bursts; multi-bank burst timing overlaps one bank's core-array access with another bank's data transfer, reducing the dead time.]

First-order Look at the GPU Off-chip Memory Subsystem

• NVIDIA V100 (Volta) GPU:
  – Peak global memory bandwidth = 900 GB/s
    • Global memory (HBM2) interface @ 4096 bits
    • Prior-generation GPUs (e.g., Kepler): 384-bit GDDR5 @ 224 GB/s
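As a sanity check on these figures (my arithmetic, not from the slide): the 4096-bit HBM2 interface moves 512 bytes per transfer, so 900 GB/s ÷ 512 B ≈ 1.76 × 10^9 transfers/s, about 1.76 Gbit/s per pin. The 384-bit GDDR5 interface delivers its 224 GB/s with far fewer pins by running each pin faster (224 GB/s ÷ 48 B ≈ 4.7 Gbit/s per pin), which is the burst-driven data-rate scaling described on the previous slides.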

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012



Multiple Memory Channels

• Divide the memory address space into N parts
  – N is the number of memory channels
  – Assign each portion to a channel

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

[Figure: the address space spread across Channel 0 through Channel 3, each channel containing multiple banks.]
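How addresses map to channels varies by memory controller; a common scheme is low-order interleaving at burst granularity. The constants below (256-byte interleave, four channels) are illustrative assumptions, not the mapping of any particular GPU:

// Illustrative low-order interleaving: consecutive 256-byte regions rotate
// across 4 channels, so a long sequential access streams from all channels.
unsigned int channel   = (addr >> 8) & 0x3;                   // bits 9:8 pick the channel
unsigned int chan_addr = ((addr >> 10) << 8) | (addr & 0xFF); // remaining bits address within the channel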


“You can buy bandwidth but you can’t bribe God”

-- Unknown

Lessons

• Organize data accesses to maximize burst-mode bandwidth
  – Access consecutive locations
  – Algorithmic strategies + data layout
• Thread blocks issue warp-size load/store instructions
  – 32 addresses for a warp size of 32
  – Coalesce these accesses into a smaller number of memory transactions → maximize memory bandwidth
  – More later as we discuss microarchitecture



Memory Coalescing

• Memory references are coalesced into a sequence of memory transactions
  – Accesses within the same segment are coalesced (e.g., 128-byte segments)
• The ability and extent of coalescing depend on compute capability

[Figure: four LD operations from one warp coalescing into a single transaction.]
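To make the two extremes concrete, here is a hedged sketch (the kernel names and the stride of 32 are mine):

// Coalesced: consecutive threads of a warp touch consecutive addresses, so a
// 32-thread warp's loads merge into one or two 128-byte transactions.
__global__ void copyCoalesced(float* out, const float* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads touch addresses 32 floats (128 bytes) apart,
// so every thread's load lands in a different segment: 32 transactions per warp.
__global__ void copyStrided(float* out, const float* in, int n)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 32;
    if (i < n) out[i] = in[i];
}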

Implications of Memory Coalescing

• Reduce the request rate to L1 and DRAM

• Distinct from CPU optimizations – why?

• Need to be able to re-map entries from each access back to threads


[Figure: an SM with warp schedulers, a register file, 16 SPs, and L1/shared memory, connected to DRAM. Coalescing reduces pressure on both the L1 access bandwidth and the DRAM access bandwidth.]


Placing a 2D C array into linear memory space

[Figure: a 4x4 matrix M. Rows M0,0…M0,3, M1,0…M1,3, M2,0…M2,3, M3,0…M3,3 are laid out one after another in linearized (row-major) order of increasing address.]

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012
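Every kernel in this module relies on this row-major layout; element (row, col) of a Width-wide array lives at the flat offset computed below (a trivial but load-bearing sketch; the helper name is mine):

// Flat offset of element (row, col) in a row-major array of width Width.
__host__ __device__ inline int flatIndex(int row, int col, int Width)
{
    return row * Width + col;   // matches d_M[Row*Width + k] in the kernel below
}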

Base Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    // Calculate the row index of the d_P element and of d_M
    int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    // Calculate the column index of d_P and of d_N
    int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // Each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += d_M[Row*Width + k] * d_N[k*Width + Col];

    d_P[Row*Width + Col] = Pvalue;
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, University of Illinois, 2007-2012
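A plausible launch configuration for this kernel (a sketch assuming Width is a multiple of TILE_WIDTH, with TILE_WIDTH a compile-time constant such as 16):

dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);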


Two Access Patterns


[Figure: (a) Thread 1 and Thread 2 walking along rows of d_M; (b) the same threads walking down columns of d_N; both arrays are WIDTH x WIDTH.]

d_M[Row*Width+k] d_N[k*Width+Col]

k is the loop counter in the inner-product loop of the kernel code

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

Let's look at these access patterns.

N accesses are coalesced.

[Figure: for d_N[k*Width+Col], threads T0–T3 of a warp load consecutive elements of one row of N in each iteration (N0,0…N0,3 in load iteration 0, N1,0…N1,3 in iteration 1). Each thread walks down a column, but across successive threads in a warp the addresses are consecutive in the linearized array, so the accesses coalesce.]


M accesses are not coalesced.


[Figure: for d_M[Row*Width+k], threads T0–T3 of a warp load elements a full row apart in each iteration (M0,0…M3,0 in load iteration 0, M0,1…M3,1 in iteration 1). Each thread walks along a row, but across successive threads in a warp the addresses are Width elements apart in the linearized array, so the accesses do not coalesce.]

Using Shared Memory


[Figure: original access pattern vs. tiled access pattern on d_M and d_N (both WIDTH wide). Tiles are first copied into scratchpad (shared) memory, then the multiplication is performed out of the scratchpad values.]

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012


Shared Memory Accesses


• Shared memory is banked
  – No coalescing
• Data access patterns should be structured to avoid bank conflicts
• Low-order interleaved mapping?

__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the d_P element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    // Loop over the d_M and d_N tiles required to compute the d_P element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative loading of d_M and d_N tiles into shared memory
        Mds[ty][tx] = d_M[Row*Width + m*TILE_WIDTH + tx];
        Nds[ty][tx] = d_N[(m*TILE_WIDTH + ty)*Width + Col];
        __syncthreads();
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }
    d_P[Row*Width + Col] = Pvalue;
}

• Accesses are from shared memory, hence coalescing is not necessary
• Consider bank conflicts (one common mitigation is sketched below)
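A common bank-conflict mitigation (my sketch, not on the slide) is to pad the minor dimension of a shared-memory tile. With 32 four-byte banks and TILE_WIDTH == 32, a column walk such as tile[threadIdx.x][c] would otherwise put all 32 lanes of a warp into the same bank:

__shared__ float tile[TILE_WIDTH][TILE_WIDTH + 1];   // one padding column

// Without padding, tile[threadIdx.x][c] makes 32 lanes access addresses
// 32 words apart: the same bank, a 32-way conflict. With the +1 pad the
// stride becomes 33 words, which is coprime with 32, so the 32 accesses
// land in 32 different banks and proceed without serialization.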


Coalescing Behavior

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012


[Figure: tiled matrix multiplication. d_M, d_N, and d_P are WIDTH x WIDTH; each block computes a TILE_WIDTH x TILE_WIDTH sub-matrix Pdsub at (Row, Col), marching the tile index m across d_M's row of tiles and down d_N's column of tiles at offsets m*TILE_WIDTH.]

Thread Granularity


[Figure: the SM diagram again (warp schedulers, register file, SPs, L1/shared memory, DRAM), now with fetch/decode highlighted: instruction bandwidth is a finite resource alongside memory bandwidth.]

• Consider instruction bandwidth vs. memory bandwidth

• Control the amount of work per thread


Page 20: ECE 8823A GPU Architectures...1 1 ECE 8823A GPU Architectures Module 5: Execution and Resources -I Reading Assignment • Kirk and Hwu, “Programming Massively Parallel Processors:

20

Thread Granularity Tradeoffs

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

[Figure: the same tiled matrix-multiplication diagram; two horizontally adjacent output tiles of d_P use the same tiles of d_M.]

• Preserving instruction bandwidth (and memory bandwidth)
  – Increase thread granularity
  – Merge adjacent tiles: sharing tile data (see the sketch below)
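A minimal sketch of tile merging (my kernel, assuming Width is a multiple of 2*TILE_WIDTH): each thread produces two horizontally adjacent d_P elements, so every Mds tile is loaded once but used twice, halving the d_M load traffic and the associated load instructions:

__global__ void MatrixMulKernel2(float* d_M, float* d_N, float* d_P, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][2 * TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = blockIdx.y * TILE_WIDTH + ty;
    int Col = blockIdx.x * (2 * TILE_WIDTH) + tx;     // left element of the pair

    float Pvalue0 = 0, Pvalue1 = 0;
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // One d_M tile serves both output tiles; each thread loads two d_N elements.
        Mds[ty][tx] = d_M[Row * Width + m * TILE_WIDTH + tx];
        Nds[ty][tx]              = d_N[(m * TILE_WIDTH + ty) * Width + Col];
        Nds[ty][tx + TILE_WIDTH] = d_N[(m * TILE_WIDTH + ty) * Width + Col + TILE_WIDTH];
        __syncthreads();
        for (int k = 0; k < TILE_WIDTH; ++k) {
            Pvalue0 += Mds[ty][k] * Nds[k][tx];
            Pvalue1 += Mds[ty][k] * Nds[k][tx + TILE_WIDTH];
        }
        __syncthreads();
    }
    d_P[Row * Width + Col]              = Pvalue0;
    d_P[Row * Width + Col + TILE_WIDTH] = Pvalue1;
}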

Thread Granularity Tradeoffs (2)

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2012

[Figure: the same tiled matrix-multiplication diagram as on the previous slide.]

• Impact on parallelism
  – #TBs, #registers/thread
  – Need to explore the impact → autotuning


ANY MORE QUESTIONS?
READ CHAPTER 6!
