Transcript of D2MA: Accelerating Coarse-Grained Data Transfer for GPUs

Page 1: D2MA: Accelerating Coarse-Grained Data Transfer for GPUs


D. Anoushe Jamshidi, Mehrzad Samadi, and Scott Mahlke

University of Michigan

PACT-23, August 27th, 2014

Page 2: Achieving Peak GPU Performance: Theory and Practice

[Chart: peak GFLOPS of NVIDIA GPUs from 2007 to 2013 (8800, 9800, GTX 280, GTX 480, GTX 580, GTX 680, GTX 780), on a 0-5000 GFLOPS scale, against the matrix multiplication throughput achieved by the SDK and CUBLAS implementations; achieved performance trails peak by a wide margin]

Not easy to fully utilize GPU capabilities!

Page 3: A Quick Overview of GPUs

[Diagram: a GPU chip contains many SMs connected through an interconnect to the L2 cache and off-chip DRAM. Each SM has a fetch/decode/issue front end, a register file, SPs, LD/ST units, a writeback stage, shared memory, and an L1D cache. Data flows from DRAM through the memory hierarchy to the SPs, and results flow back out]

Page 4: A Quick Overview of GPUs

[Diagram: the same SM pipeline, highlighting that a round trip from the SM through the interconnect and L2 to DRAM costs ~100's of cycles]

Page 5: How do GPUs Achieve Great Performance?

• Effectively use available memory bandwidth
 – Exploit data reuse when possible

[Diagram: four SPs issue four stores that all land in the same cache line, illustrating data reuse]

Page 6: How do GPUs Achieve Great Performance?

• Effectively use available memory bandwidth
 – Exploit data reuse when possible
 – Regular, well-coalesced memory accesses

[Diagram: four SPs issue a single coalesced store covering one cache line]
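To make "well coalesced" concrete, here is an illustrative CUDA sketch (not from the slides; the kernel names are invented):

    __global__ void coalesced(const float *in, float *out) {
        // Thread t touches element t: a 32-thread warp covers one
        // contiguous 128-byte cache line in a single transaction.
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        out[t] = in[t];
    }

    __global__ void strided(const float *in, float *out) {
        // Thread t touches element 32*t: the same warp now touches 32
        // different cache lines and wastes most of each one.
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        out[t] = in[t * 32];
    }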

Page 7: Buffering to Optimize Bandwidth

[Diagram: the SM pipeline as before. DRAM accesses cost ~100's of cycles, while shared memory accesses take <10 cycles. Tiles Tile[0], Tile[1], Tile[2] are staged from DRAM into shared memory]

Buffer data in fast Shared Memory
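The buffering pattern the slide illustrates, as a generic CUDA sketch (not the paper's code; TILE and the kernel name are assumptions):

    #define TILE 32

    __global__ void buffered(const float *src, float *dst, int stride) {
        __shared__ float tile[TILE][TILE];
        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;

        // Stage one tile from slow DRAM (~100's of cycles) into fast
        // shared memory (<10 cycles), then compute out of the buffer.
        tile[threadIdx.y][threadIdx.x] = src[y * stride + x];
        __syncthreads();

        // ... compute on tile[][] at shared-memory latency ...
        dst[y * stride + x] = tile[threadIdx.y][threadIdx.x];
    }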

Page 8: Buffering Problem 1: Wasted Storage

[Diagram: as Tile[0] travels from DRAM to shared memory it is also copied into the L2 cache, the L1D cache, and the register file along the way, while Tile[1] and Tile[2] wait in DRAM]

Duplicated data in Shared Mem, Caches, Reg. File

Roundabout path to Shared Memory

Page 9: Buffering Problem 2: Code Expansion

[PTX listing: the 4-line buffering loop expands to 59 instructions, dominated by address arithmetic (cvt, add, shl, mul) feeding each global load / shared store pair, ending in bar.sync]

[SASS listing: the same loop compiles to 73 machine instructions]

CUDA: 4 lines → PTX: 59 instructions → SASS: 73 instructions

__global__ void CUDAkernel2DCT(float *dst, float *src, int ImgStride)
{
    __shared__ float tile[TILE_HEIGHT * STRIDE];

    // Preliminary address calculations ...
    float *tile_ptr = tile + <offset>;

    // Buffer into shared memory
    #pragma unroll
    for (unsigned int i = 0; i < TILE_SIZE; i++)
        tile_ptr[i * STRIDE] = src[i * ImgStride];
    __syncthreads();

    // Processing data ...
}

Each tile transfer requires many arithmetic ops to calculate addresses

Address generation consumes ~50% of tile transfer cycles

Page 10: Objective

• A tool to help achieve better memory performance
• Inspired by Direct Memory Access (DMA)

[Diagram: in a CPU system, a DMA engine moves data between DRAM and the memory near the CPU without occupying the CPU itself]

Page 11: Objective

• A tool to help achieve better memory performance
• Inspired by Direct Memory Access (DMA)

[Diagram: mapping a CPU-style DMA engine onto a GPU SM raises problems]

Traditional DMA is a poor fit for the GPU: transfers are not interruptible, and they impose heavy bookkeeping!

Page 12: D2MA: The Big Picture

[Diagram: a D2MA engine sits inside each SM, next to its cache, and transfers data directly between DRAM and the SM's memory]

Page 13: D2MA: Data-Parallel Direct Memory Access

• Take advantage of regular memory accesses & unified L1D/Shared Memory space

• Decouple tile transfers from SM resources
 – Simplify address generation
 – Improve memory pipelining

• Direct path to shared memory

[Diagram: the D2MA engine attaches to the SM next to the MSHRs and streams Tile[0] directly into shared memory, bypassing the register file and the LD/ST pipeline]

Page 14: D2MA Programming Model

Original Code (CUDA: 4 lines, PTX: 59 instructions for the transfer):

__global__ void CUDAkernel2DCT(float *dst, float *src, int ImgStride)
{
    __shared__ float tile[T_HEIGHT * T_STRIDE];

    int OffsThreadInRow = threadIdx.y * T_SIZE + threadIdx.x;
    int OffsThreadInCol = threadIdx.z * T_SIZE;
    src += FMUL(blockIdx.y * T_HEIGHT + OffsThreadInCol, ImgStride)
         + blockIdx.x * T_WIDTH + OffsThreadInRow;
    dst += FMUL(blockIdx.y * T_HEIGHT + OffsThreadInCol, ImgStride)
         + blockIdx.x * T_WIDTH + OffsThreadInRow;
    float *tile_ptr = tile + OffsThreadInCol * T_STRIDE + OffsThreadInRow;

    // Buffer into shared memory
    #pragma unroll
    for (unsigned int i = 0; i < T_SIZE; i++)
        tile_ptr[i * T_STRIDE] = src[i * ImgStride];
    __syncthreads();

    // Process rows, then columns
    CUDAsubroutineInplaceDCTvector(tile + (OffsThreadInCol + threadIdx.x) * T_STRIDE
                                   + OffsThreadInRow - threadIdx.x, 1);
    CUDAsubroutineInplaceDCTvector(tile_ptr, T_STRIDE);

    for (unsigned int i = 0; i < T_SIZE; i++)
        dst[i * ImgStride] = tile_ptr[i * T_STRIDE];
}

D2MA-Optimized Code (CUDA: 4 lines, PTX: 12 instructions for the transfer):

__global__ void D2MAkernel2DCT(float *dst, float *src, int ImgStride)
{
    __shared__ float tile[T_HEIGHT * T_STRIDE];

    int OffsThreadInRow = threadIdx.y * T_SIZE + threadIdx.x;
    int OffsThreadInCol = threadIdx.z * T_SIZE;
    src += FMUL(blockIdx.y * T_HEIGHT, ImgStride) + blockIdx.x * T_WIDTH;
    dst += FMUL(blockIdx.y * T_HEIGHT + OffsThreadInCol, ImgStride)
         + blockIdx.x * T_WIDTH + OffsThreadInRow;
    float *tile_ptr = tile + OffsThreadInCol * T_STRIDE + OffsThreadInRow;

    // Buffer into shared memory via D2MA
    d2ma_configure_matrix(tile, src, T_HEIGHT, T_WIDTH, ImgStride);
    d2ma_set_datatype_float();
    d2ma_enable_shmem_blank_col();
    d2ma_ignite_buffer(0);

    // Process rows, then columns
    CUDAsubroutineInplaceDCTvector(tile + (OffsThreadInCol + threadIdx.x) * T_STRIDE
                                   + OffsThreadInRow - threadIdx.x, 1);
    CUDAsubroutineInplaceDCTvector(tile_ptr, T_STRIDE);

    for (unsigned int i = 0; i < T_SIZE; i++)
        dst[i * ImgStride] = tile_ptr[i * T_STRIDE];
}

Page 15: D2MA Overview

[Diagram: the D2MA engine attaches to the SM beside the MSHRs. Each engine contains a controller holding per-buffer descriptors (Glob. Addr, Shr. Addr, # Elements, Elem. Size, Stride) for Buf. 0 through Buf. 3, plus AGEN logic and a consistency checker]
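As a reading aid, the controller state on this slide could be modeled by a C struct like the following (names and types are assumptions, not the paper's):

    #include <stdint.h>

    // One per-buffer transfer descriptor, mirroring the controller
    // fields on the slide; each controller holds four (Buf. 0-3).
    struct d2ma_descriptor {
        uint64_t glob_addr;   // source base address in global memory
        uint32_t shr_addr;    // destination offset in shared memory
        uint32_t n_elements;  // number of elements to transfer
        uint32_t elem_size;   // element size in bytes (4 for float)
        uint32_t stride;      // spacing between consecutive elements/rows
    };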

Page 16: D2MA Operation: Configuration

[Diagram: the configuration instructions flow through the SM front end to the D2MA engine as Config commands. The controller fills Buf. 0's descriptor: Glob. Addr 0x1020, Shr. Addr 0x20, 64 elements, element size 4, stride 1]

d2ma_configure_matrix(tile, src, T_HEIGHT, T_WIDTH, ImgStride);
d2ma_set_datatype_float();
d2ma_enable_shmem_blank_col();
d2ma_ignite_buffer(0);
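In terms of the hypothetical struct sketched on Page 15, the configure calls above would leave Buf. 0 holding the values shown on this slide (a sketch; D2MA's real configuration interface is hardware-level):

    // Buf. 0 after d2ma_configure_matrix(tile, src, ...) on this slide.
    struct d2ma_descriptor buf0 = {
        .glob_addr  = 0x1020,  // global base address of src
        .shr_addr   = 0x20,    // shared-memory offset of tile
        .n_elements = 64,      // elements to transfer
        .elem_size  = 4,       // bytes per element (float)
        .stride     = 1,       // contiguous layout
    };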

Page 17: D2MA Operation: Addr. Generation

[Diagram: d2ma_ignite_buffer(0) reaches the engine as an "Ignite #0" command. The controller hands Buf. 0's descriptor (0x1020, 0x20, 64, 4, 1) to the AGEN logic, which contains a global-memory AGEN, a shared-memory AGEN, and control, and begins emitting the first address pair: global 0x1020, shared 0x20]
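A conceptual model of the AGEN loop, using the descriptor struct from Page 15 (a sketch consistent with the addresses on the next two slides, which advance by one 128-byte cache line per request; emit_request is a hypothetical hook, not real hardware):

    enum { LINE = 128 };  // cache-line size in bytes

    void emit_request(uint64_t glob, uint32_t shr);  // hypothetical

    // Walk Buf. 0's descriptor one cache line at a time, emitting
    // paired global/shared addresses: (0x1020, 0x20), (0x10A0, 0xA0).
    // Simplified for the stride == 1 case shown on the slide.
    void agen(const struct d2ma_descriptor *d) {
        uint32_t bytes = d->n_elements * d->elem_size;  // 64 * 4 = 256
        for (uint32_t off = 0; off < bytes; off += LINE)
            emit_request(d->glob_addr + off, d->shr_addr + off);
    }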

Page 18: D2MA Operation: Memory Transfer

[Diagram: each address pair from the AGEN logic allocates an MSHR entry recording (Glob. Addr, Shr. Addr): first (0x1020, 0x20), then (0x10A0, 0xA0), alongside existing entries such as (0x2000, 0xFF). The requests then travel to the memory system without occupying the SM's LD/ST units]

Page 19: D2MA Operation: Memory Transfer

[Diagram: as each cache line returns, the MSHR pairing routes the data directly into shared memory: the line fetched from global address 0x1020 is written at shared address 0x20, and the line from 0x10A0 at 0xA0, bypassing the register file, LD/ST units, and L1D cache]
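A conceptual model of the fill path (a sketch of the mechanism the slide depicts, not hardware RTL; mshr_lookup, shmem_write, and mshr_release are hypothetical hooks):

    uint32_t mshr_lookup(uint64_t glob);                        // hypothetical
    void shmem_write(uint32_t shr, const uint8_t *d, int n);    // hypothetical
    void mshr_release(uint64_t glob);                           // hypothetical

    // When a cache line returns from memory, route it straight into
    // shared memory at the shared address paired with it at request
    // time, bypassing the register file, LD/ST units, and L1D cache.
    void on_fill(uint64_t glob_addr, const uint8_t line[LINE]) {
        uint32_t shr_addr = mshr_lookup(glob_addr);  // paired shared addr
        shmem_write(shr_addr, line, LINE);
        mshr_release(glob_addr);
    }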

Page 20: D2MA Operation: Enforcing Synchronization

Without D2MA (explicit syncthreads()):
• Thread block 1 starts TX 1, then loads from the buffer after the syncthreads() barrier; TX 1 ends and TX 2 starts.
• Thread block 2 loads from the buffer, re-executing the load if the data is not yet present.
• Programmer must guarantee consistency.

With D2MA (no syncthreads()!):
• Thread block 1 starts TX 1, which implicitly sets a thread barrier; once the barrier is satisfied, TX 1 ends and threads load from the buffer.
• Thread block 2 starts TX 2 with its own barrier; while no warp touching the buffer is ready to schedule, code independent of the buffer executes.
• Synchronization handled transparently by H/W.
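In kernel form the contrast looks like this (a sketch inside a kernel such as the ones on Page 14; the d2ma_* calls follow that slide's API, while use() and independent_work() are hypothetical placeholders):

    // Without D2MA: an explicit barrier guards the shared buffer.
    tile_ptr[i * T_STRIDE] = src[i * ImgStride];  // threads fill the tile
    __syncthreads();                              // programmer-enforced
    use(tile);

    // With D2MA: ignite starts the transfer; hardware stalls only
    // warps that touch the buffer, so independent code keeps running.
    d2ma_configure_matrix(tile, src, T_HEIGHT, T_WIDTH, ImgStride);
    d2ma_ignite_buffer(0);   // implicit, transparent barrier on tile
    independent_work();      // overlaps with the in-flight transfer
    use(tile);               // first touch waits for TX to complete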

Page 21: Experimental Evaluation

• GPGPU-Sim v3.2.1

• Benchmarks from NVIDIA CUDA SDK, Rodinia
 – Must perform shared memory buffering

Simulation configuration:
 Number of SMs: 15
 Thread blocks/SM: 8
 Shared memory/SM: 48 KB
 D2MA engines/SM: 1
 Controllers/engine: 8
 Buffers/controller: 4
 AGEN/engine: 1
 Warp scheduling policy: Greedy-then-oldest
 L1 cache (size/assoc/block size): 16KB/4-way/128B
 L2 cache (size/assoc/block size): 768KB/16-way/128B
 L2/DRAM latency: 240/440 cycles

Page 22: Results: Performance

[Chart: speedup over the baseline for DCT, MM, DWT, TRANS1, TRANS2, FW, GAUSS, MEAN, SOBEL, PATH, SGEMV, LUD1, LUD2, and the geomean, on a 0-2.5x scale]

Geomean speedup: 1.36x

Page 23: Results: Cycle Breakdown

[Chart: breakdown of tile transfer cycles into AGEN and MEM, baseline vs. D2MA, for each benchmark and the average]

Avg TX cycles: ~5x reduction

Addr. Gen: improved by 98%
Mem. TX: reduced by 66%

Page 24: Results: Overheads

• Model of the D2MA Engine synthesized using Synopsys
• Compared to NVIDIA GTX 480
 – Die area: 529 mm²
 – TDP: 250 W
 – One D2MA Engine per SM (15 SMs):
  • Area overhead: 0.016%
  • Power overhead: 0.022%

[Pie chart: engine area breakdown: controller array 92%, AGEN logic 5%, consistency checker 3%]

Page 25: Conclusion

• Programmers must optimize memory traffic to achieve good performance on GPUs
 – Shared memory buffering improves bandwidth utilization
 – Buffering still has overheads

• D2MA decouples tiled data buffering from existing SM resources
 – Reduces the cost of address generation by 98%
 – Improves memory transfer times by 66%
 – Improves performance by 1.36x
 – Reduces dynamic instructions executed by 7%
 – Enforces synchronization transparently
 – Low area and power overheads (<0.03%)

Page 26: Thank You!

• Questions?

Image credits:
 http://www.opengraphicdesign.com/web/ajax-loading-graphics/
 http://www.barriersandbollards.com/html-pages/mb50-1.png

Page 27: D2MA: Accelerating Coarse-Grained Data Transfer for GPUs

D. Anoushe Jamshidi, Mehrzad Samadi, and Scott Mahlke

University of Michigan

PACT-23, August 27th, 2014

Page 28: Special Addressing Modes

Blank Column Mode

Halo Addressing Mode

[Diagram: layouts of the two addressing modes]
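The slides do not define these modes. Blank column mode plausibly corresponds to the padded tile layout in the Page 14 kernel (T_STRIDE wider than T_WIDTH, so a blank column separates rows and column-wise accesses avoid shared-memory bank conflicts); a sketch under that assumption, with all names and sizes invented:

    enum { T_HEIGHT = 8, T_WIDTH = 32, T_STRIDE = T_WIDTH + 1 };

    __global__ void padded(const float *src) {
        // The extra column shifts each row's starting bank, so threads
        // walking a column hit distinct shared-memory banks.
        __shared__ float tile[T_HEIGHT * T_STRIDE];
        int row = threadIdx.y, col = threadIdx.x;
        tile[row * T_STRIDE + col] = src[row * T_WIDTH + col];
    }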

Page 29: Results: Dynamic Instruction Count Reduction

[Chart: dynamic instruction count relative to baseline for DCT, MM, DWT, TRANS1, TRANS2, FW, GAUSS, MEAN, SOBEL, PATH, SGEMV, LUD1, LUD2, and the geomean, on a 0-1.2 scale; D2MA executes about 7% fewer dynamic instructions on average]