Transcript of D2MA: Accelerating Coarse-Grained Data Transfer for GPUs

Page 1: D2MA: Accelerating Coarse-Grained Data Transfer for GPUs


D. Anoushe Jamshidi, Mehrzad Samadi, and Scott Mahlke

University of Michigan

PACT-23, August 27th, 2014

Page 2: Achieving Peak GPU Performance: Theory and Practice

[Chart: peak GFLOPS of NVIDIA GPUs from 2007 to 2013 (8800, 9800, GTX 280, GTX 480, GTX 580, GTX 680, GTX 780), on a 0-5000 GFLOPS scale, against the matrix multiplication throughput achieved by the SDK and CUBLAS implementations; achieved performance trails peak by a wide margin]

Not easy to fully utilize GPU capabilities!

Page 3: A Quick Overview of GPUs

[Diagram: a GPU chip contains many SMs connected through an interconnect to the L2 cache and off-chip DRAM. Each SM has a fetch/decode/issue front end, a register file, SPs, LD/ST units, a writeback stage, shared memory, and an L1D cache. Data flows from DRAM through the memory hierarchy to the SPs, and results flow back out]

Page 4: A Quick Overview of GPUs

[Diagram: the same SM pipeline, highlighting that a round trip from the SM through the interconnect and L2 to DRAM costs ~100's of cycles]

Page 5: How do GPUs Achieve Great Performance?

• Effectively use available memory bandwidth
 – Exploit data reuse when possible

[Diagram: four SPs issue four stores that all land in the same cache line, illustrating data reuse]

Page 6: How do GPUs Achieve Great Performance?

• Effectively use available memory bandwidth
 – Exploit data reuse when possible
 – Regular, well-coalesced memory accesses

[Diagram: four SPs issue a single coalesced store covering one cache line]
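To make "well coalesced" concrete, here is an illustrative CUDA sketch (not from the slides; the kernel names are invented):

    __global__ void coalesced(const float *in, float *out) {
        // Thread t touches element t: a 32-thread warp covers one
        // contiguous 128-byte cache line in a single transaction.
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        out[t] = in[t];
    }

    __global__ void strided(const float *in, float *out) {
        // Thread t touches element 32*t: the same warp now touches 32
        // different cache lines and wastes most of each one.
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        out[t] = in[t * 32];
    }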

Page 7: Buffering to Optimize Bandwidth

[Diagram: the SM pipeline as before. DRAM accesses cost ~100's of cycles, while shared memory accesses take <10 cycles. Tiles Tile[0], Tile[1], Tile[2] are staged from DRAM into shared memory]

Buffer data in fast Shared Memory
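The buffering pattern the slide illustrates, as a generic CUDA sketch (not the paper's code; TILE and the kernel name are assumptions):

    #define TILE 32

    __global__ void buffered(const float *src, float *dst, int stride) {
        __shared__ float tile[TILE][TILE];
        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;

        // Stage one tile from slow DRAM (~100's of cycles) into fast
        // shared memory (<10 cycles), then compute out of the buffer.
        tile[threadIdx.y][threadIdx.x] = src[y * stride + x];
        __syncthreads();

        // ... compute on tile[][] at shared-memory latency ...
        dst[y * stride + x] = tile[threadIdx.y][threadIdx.x];
    }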

Page 8: Buffering Problem 1: Wasted Storage

[Diagram: as Tile[0] travels from DRAM to shared memory it is also copied into the L2 cache, the L1D cache, and the register file along the way, while Tile[1] and Tile[2] wait in DRAM]

Duplicated data in Shared Mem, Caches, Reg. File

Roundabout path to Shared Memory

Page 9: Buffering Problem 2: Code Expansion

[PTX listing: the 4-line buffering loop expands to 59 instructions, dominated by address arithmetic (cvt, add, shl, mul) feeding each global load / shared store pair, ending in bar.sync]

[SASS listing: the same loop compiles to 73 machine instructions]

CUDA: 4 lines → PTX: 59 instructions → SASS: 73 instructions

__global__ void CUDAkernel2DCT(float *dst, float *src, int ImgStride)
{
    __shared__ float tile[TILE_HEIGHT * STRIDE];

    // Preliminary address calculations ...
    float *tile_ptr = tile + <offset>;

    // Buffer into shared memory
    #pragma unroll
    for (unsigned int i = 0; i < TILE_SIZE; i++)
        tile_ptr[i * STRIDE] = src[i * ImgStride];
    __syncthreads();

    // Processing data ...
}

Each tile transfer requires many arithmetic ops to calculate addresses

Address generation consumes ~50% of tile transfer cycles

Page 10: Objective

• A tool to help achieve better memory performance
• Inspired by Direct Memory Access (DMA)

[Diagram: in a CPU system, a DMA engine moves data between DRAM and the memory near the CPU without occupying the CPU itself]

Page 11: Objective

• A tool to help achieve better memory performance
• Inspired by Direct Memory Access (DMA)

[Diagram: mapping a CPU-style DMA engine onto a GPU SM raises problems]

Traditional DMA is a poor fit for the GPU: transfers are not interruptible, and they impose heavy bookkeeping!

Page 12: D2MA: The Big Picture

[Diagram: a D2MA engine sits inside each SM, next to its cache, and transfers data directly between DRAM and the SM's memory]

Page 13: D2MA: Data-Parallel Direct Memory Access

• Take advantage of regular memory accesses & unified L1D/Shared Memory space

• Decouple tile transfers from SM resources
 – Simplify address generation
 – Improve memory pipelining

• Direct path to shared memory

[Diagram: the D2MA engine attaches to the SM next to the MSHRs and streams Tile[0] directly into shared memory, bypassing the register file and the LD/ST pipeline]

Page 14: D2MA Programming Model

Original Code (CUDA: 4 lines, PTX: 59 instructions for the transfer):

__global__ void CUDAkernel2DCT(float *dst, float *src, int ImgStride)
{
    __shared__ float tile[T_HEIGHT * T_STRIDE];

    int OffsThreadInRow = threadIdx.y * T_SIZE + threadIdx.x;
    int OffsThreadInCol = threadIdx.z * T_SIZE;
    src += FMUL(blockIdx.y * T_HEIGHT + OffsThreadInCol, ImgStride)
         + blockIdx.x * T_WIDTH + OffsThreadInRow;
    dst += FMUL(blockIdx.y * T_HEIGHT + OffsThreadInCol, ImgStride)
         + blockIdx.x * T_WIDTH + OffsThreadInRow;
    float *tile_ptr = tile + OffsThreadInCol * T_STRIDE + OffsThreadInRow;

    // Buffer into shared memory
    #pragma unroll
    for (unsigned int i = 0; i < T_SIZE; i++)
        tile_ptr[i * T_STRIDE] = src[i * ImgStride];
    __syncthreads();

    // Process rows, then columns
    CUDAsubroutineInplaceDCTvector(tile + (OffsThreadInCol + threadIdx.x) * T_STRIDE
                                   + OffsThreadInRow - threadIdx.x, 1);
    CUDAsubroutineInplaceDCTvector(tile_ptr, T_STRIDE);

    for (unsigned int i = 0; i < T_SIZE; i++)
        dst[i * ImgStride] = tile_ptr[i * T_STRIDE];
}

D2MA-Optimized Code (CUDA: 4 lines, PTX: 12 instructions for the transfer):

__global__ void D2MAkernel2DCT(float *dst, float *src, int ImgStride)
{
    __shared__ float tile[T_HEIGHT * T_STRIDE];

    int OffsThreadInRow = threadIdx.y * T_SIZE + threadIdx.x;
    int OffsThreadInCol = threadIdx.z * T_SIZE;
    src += FMUL(blockIdx.y * T_HEIGHT, ImgStride) + blockIdx.x * T_WIDTH;
    dst += FMUL(blockIdx.y * T_HEIGHT + OffsThreadInCol, ImgStride)
         + blockIdx.x * T_WIDTH + OffsThreadInRow;
    float *tile_ptr = tile + OffsThreadInCol * T_STRIDE + OffsThreadInRow;

    // Buffer into shared memory via D2MA
    d2ma_configure_matrix(tile, src, T_HEIGHT, T_WIDTH, ImgStride);
    d2ma_set_datatype_float();
    d2ma_enable_shmem_blank_col();
    d2ma_ignite_buffer(0);

    // Process rows, then columns
    CUDAsubroutineInplaceDCTvector(tile + (OffsThreadInCol + threadIdx.x) * T_STRIDE
                                   + OffsThreadInRow - threadIdx.x, 1);
    CUDAsubroutineInplaceDCTvector(tile_ptr, T_STRIDE);

    for (unsigned int i = 0; i < T_SIZE; i++)
        dst[i * ImgStride] = tile_ptr[i * T_STRIDE];
}

Page 15: D2MA Overview

[Diagram: the D2MA engine attaches to the SM beside the MSHRs. Each engine contains a controller holding per-buffer descriptors (Glob. Addr, Shr. Addr, # Elements, Elem. Size, Stride) for Buf. 0 through Buf. 3, plus AGEN logic and a consistency checker]
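As a reading aid, the controller state on this slide could be modeled by a C struct like the following (names and types are assumptions, not the paper's):

    #include <stdint.h>

    // One per-buffer transfer descriptor, mirroring the controller
    // fields on the slide; each controller holds four (Buf. 0-3).
    struct d2ma_descriptor {
        uint64_t glob_addr;   // source base address in global memory
        uint32_t shr_addr;    // destination offset in shared memory
        uint32_t n_elements;  // number of elements to transfer
        uint32_t elem_size;   // element size in bytes (4 for float)
        uint32_t stride;      // spacing between consecutive elements/rows
    };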

Page 16: D2MA Operation: Configuration

[Diagram: the configuration instructions flow through the SM front end to the D2MA engine as Config commands. The controller fills Buf. 0's descriptor: Glob. Addr 0x1020, Shr. Addr 0x20, 64 elements, element size 4, stride 1]

d2ma_configure_matrix(tile, src, T_HEIGHT, T_WIDTH, ImgStride);
d2ma_set_datatype_float();
d2ma_enable_shmem_blank_col();
d2ma_ignite_buffer(0);
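In terms of the hypothetical struct sketched on Page 15, the configure calls above would leave Buf. 0 holding the values shown on this slide (a sketch; D2MA's real configuration interface is hardware-level):

    // Buf. 0 after d2ma_configure_matrix(tile, src, ...) on this slide.
    struct d2ma_descriptor buf0 = {
        .glob_addr  = 0x1020,  // global base address of src
        .shr_addr   = 0x20,    // shared-memory offset of tile
        .n_elements = 64,      // elements to transfer
        .elem_size  = 4,       // bytes per element (float)
        .stride     = 1,       // contiguous layout
    };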

Page 17: D2MA Operation: Addr. Generation

[Diagram: d2ma_ignite_buffer(0) reaches the engine as an "Ignite #0" command. The controller hands Buf. 0's descriptor (0x1020, 0x20, 64, 4, 1) to the AGEN logic, which contains a global-memory AGEN, a shared-memory AGEN, and control, and begins emitting the first address pair: global 0x1020, shared 0x20]
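A conceptual model of the AGEN loop, using the descriptor struct from Page 15 (a sketch consistent with the addresses on the next two slides, which advance by one 128-byte cache line per request; emit_request is a hypothetical hook, not real hardware):

    enum { LINE = 128 };  // cache-line size in bytes

    void emit_request(uint64_t glob, uint32_t shr);  // hypothetical

    // Walk Buf. 0's descriptor one cache line at a time, emitting
    // paired global/shared addresses: (0x1020, 0x20), (0x10A0, 0xA0).
    // Simplified for the stride == 1 case shown on the slide.
    void agen(const struct d2ma_descriptor *d) {
        uint32_t bytes = d->n_elements * d->elem_size;  // 64 * 4 = 256
        for (uint32_t off = 0; off < bytes; off += LINE)
            emit_request(d->glob_addr + off, d->shr_addr + off);
    }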

Page 18: D2MA Operation: Memory Transfer

[Diagram: each address pair from the AGEN logic allocates an MSHR entry recording (Glob. Addr, Shr. Addr): first (0x1020, 0x20), then (0x10A0, 0xA0), alongside existing entries such as (0x2000, 0xFF). The requests then travel to the memory system without occupying the SM's LD/ST units]

Page 19: D2MA Operation: Memory Transfer

[Diagram: as each cache line returns, the MSHR pairing routes the data directly into shared memory: the line fetched from global address 0x1020 is written at shared address 0x20, and the line from 0x10A0 at 0xA0, bypassing the register file, LD/ST units, and L1D cache]
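A conceptual model of the fill path (a sketch of the mechanism the slide depicts, not hardware RTL; mshr_lookup, shmem_write, and mshr_release are hypothetical hooks):

    uint32_t mshr_lookup(uint64_t glob);                        // hypothetical
    void shmem_write(uint32_t shr, const uint8_t *d, int n);    // hypothetical
    void mshr_release(uint64_t glob);                           // hypothetical

    // When a cache line returns from memory, route it straight into
    // shared memory at the shared address paired with it at request
    // time, bypassing the register file, LD/ST units, and L1D cache.
    void on_fill(uint64_t glob_addr, const uint8_t line[LINE]) {
        uint32_t shr_addr = mshr_lookup(glob_addr);  // paired shared addr
        shmem_write(shr_addr, line, LINE);
        mshr_release(glob_addr);
    }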

Page 20: D2MA Operation: Enforcing Synchronization

Without D2MA (explicit syncthreads()):
• Thread block 1 starts TX 1, then loads from the buffer after the syncthreads() barrier; TX 1 ends and TX 2 starts.
• Thread block 2 loads from the buffer, re-executing the load if the data is not yet present.
• Programmer must guarantee consistency.

With D2MA (no syncthreads()!):
• Thread block 1 starts TX 1, which implicitly sets a thread barrier; once the barrier is satisfied, TX 1 ends and threads load from the buffer.
• Thread block 2 starts TX 2 with its own barrier; while no warp touching the buffer is ready to schedule, code independent of the buffer executes.
• Synchronization handled transparently by H/W.
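In kernel form the contrast looks like this (a sketch inside a kernel such as the ones on Page 14; the d2ma_* calls follow that slide's API, while use() and independent_work() are hypothetical placeholders):

    // Without D2MA: an explicit barrier guards the shared buffer.
    tile_ptr[i * T_STRIDE] = src[i * ImgStride];  // threads fill the tile
    __syncthreads();                              // programmer-enforced
    use(tile);

    // With D2MA: ignite starts the transfer; hardware stalls only
    // warps that touch the buffer, so independent code keeps running.
    d2ma_configure_matrix(tile, src, T_HEIGHT, T_WIDTH, ImgStride);
    d2ma_ignite_buffer(0);   // implicit, transparent barrier on tile
    independent_work();      // overlaps with the in-flight transfer
    use(tile);               // first touch waits for TX to complete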

Page 21: Experimental Evaluation

• GPGPU-Sim v3.2.1

• Benchmarks from NVIDIA CUDA SDK, Rodinia
 – Must perform shared memory buffering

Simulation configuration:
 Number of SMs: 15
 Thread blocks/SM: 8
 Shared memory/SM: 48 KB
 D2MA engines/SM: 1
 Controllers/engine: 8
 Buffers/controller: 4
 AGEN/engine: 1
 Warp scheduling policy: Greedy-then-oldest
 L1 cache (size/assoc/block size): 16KB/4-way/128B
 L2 cache (size/assoc/block size): 768KB/16-way/128B
 L2/DRAM latency: 240/440 cycles

Page 22: Results: Performance

[Chart: speedup over the baseline for DCT, MM, DWT, TRANS1, TRANS2, FW, GAUSS, MEAN, SOBEL, PATH, SGEMV, LUD1, LUD2, and the geomean, on a 0-2.5x scale]

Geomean speedup: 1.36x

Page 23: Results: Cycle Breakdown

[Chart: breakdown of tile transfer cycles into AGEN and MEM, baseline vs. D2MA, for each benchmark and the average]

Avg TX cycles: ~5x reduction

Addr. Gen: improved by 98%
Mem. TX: reduced by 66%

Page 24: Results: Overheads

• Model of the D2MA Engine synthesized using Synopsys
• Compared to NVIDIA GTX 480
 – Die area: 529 mm²
 – TDP: 250 W
 – One D2MA Engine per SM (15 SMs):
  • Area overhead: 0.016%
  • Power overhead: 0.022%

[Pie chart: engine area breakdown: controller array 92%, AGEN logic 5%, consistency checker 3%]

Page 25: Conclusion

• Programmers must optimize memory traffic to achieve good performance on GPUs
 – Shared memory buffering improves bandwidth utilization
 – Buffering still has overheads

• D2MA decouples tiled data buffering from existing SM resources
 – Reduces the cost of address generation by 98%
 – Improves memory transfer times by 66%
 – Improves performance by 1.36x
 – Reduces dynamic instructions executed by 7%
 – Enforces synchronization transparently
 – Low area and power overheads (<0.03%)

Page 26: Thank You!

• Questions?

Image credits:
 http://www.opengraphicdesign.com/web/ajax-loading-graphics/
 http://www.barriersandbollards.com/html-pages/mb50-1.png

Page 27: D2MA: Accelerating Coarse-Grained Data Transfer for GPUs

D. Anoushe Jamshidi, Mehrzad Samadi, and Scott Mahlke

University of Michigan

PACT-23, August 27th, 2014

Page 28: Special Addressing Modes

Blank Column Mode

Halo Addressing Mode

[Diagram: layouts of the two addressing modes]
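The slides do not define these modes. Blank column mode plausibly corresponds to the padded tile layout in the Page 14 kernel (T_STRIDE wider than T_WIDTH, so a blank column separates rows and column-wise accesses avoid shared-memory bank conflicts); a sketch under that assumption, with all names and sizes invented:

    enum { T_HEIGHT = 8, T_WIDTH = 32, T_STRIDE = T_WIDTH + 1 };

    __global__ void padded(const float *src) {
        // The extra column shifts each row's starting bank, so threads
        // walking a column hit distinct shared-memory banks.
        __shared__ float tile[T_HEIGHT * T_STRIDE];
        int row = threadIdx.y, col = threadIdx.x;
        tile[row * T_STRIDE + col] = src[row * T_WIDTH + col];
    }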

Page 29: Results: Dynamic Instruction Count Reduction

[Chart: dynamic instruction count relative to baseline for DCT, MM, DWT, TRANS1, TRANS2, FW, GAUSS, MEAN, SOBEL, PATH, SGEMV, LUD1, LUD2, and the geomean, on a 0-1.2 scale; D2MA executes about 7% fewer dynamic instructions on average]