D2MA: Accelerating Coarse-Grained Data Transfer for GPUs
D. Anoushe Jamshidi, Mehrzad Samadi, and Scott Mahlke
University of Michigan
PACT-23, August 27th, 2014
Slide 2: Achieving Peak GPU Performance: Theory and Practice

[Chart: GFLOPS (0–5000) from 2007 to 2013 for NVIDIA GPUs (8800, 9800, GTX 280, GTX 480, GTX 580, GTX 680, GTX 780), comparing peak throughput against SDK and CUBLAS matrix multiplication implementations. Achieved performance falls well short of peak.]

Not easy to fully utilize GPU capabilities!
Slide 3: A Quick Overview of GPUs

[Diagram: A GPU chip contains multiple SMs connected through an interconnect to a shared L2 cache and off-chip DRAM. Each SM has a Fetch/Decode/Issue front end, a register file, SPs, LD/ST units, a writeback stage, and an L1D cache alongside shared memory. Data flows from DRAM through the hierarchy into the SM, and results flow back.]
Slide 4: A Quick Overview of GPUs

[Diagram: The same SM/interconnect/L2/DRAM organization, highlighting that a trip to DRAM costs ~100's of cycles.]
Slide 5: How do GPUs Achieve Great Performance?

• Effectively use available memory bandwidth
  – Exploit data reuse when possible

[Diagram: Four SPs storing to the same cache line, illustrating data reuse.]
Slide 6: How do GPUs Achieve Great Performance?

• Effectively use available memory bandwidth
  – Exploit data reuse when possible
  – Regular, well-coalesced memory accesses

[Diagram: Four SPs issue accesses that coalesce into a single cache-line transaction.]
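The coalescing idea above can be sketched with a toy model. This is a plain-Python illustration, not part of the talk; the 128-byte line size and 32-thread warp are Fermi-era assumptions:

```python
# Model how a GPU coalesces one warp's accesses into cache-line requests.
LINE_SIZE = 128  # bytes per cache line (Fermi-class assumption)

def lines_touched(addresses):
    """Return the set of distinct cache lines a warp's accesses hit."""
    return {addr // LINE_SIZE for addr in addresses}

# 32 threads reading consecutive 4-byte floats: fully coalesced -> 1 line.
coalesced = [tid * 4 for tid in range(32)]
# 32 threads each reading the first float of a different 128-byte row:
# 32 separate cache-line transactions.
strided = [tid * 128 for tid in range(32)]

print(len(lines_touched(coalesced)))  # 1
print(len(lines_touched(strided)))    # 32
```

The well-coalesced pattern needs a single memory transaction to service the whole warp; the strided pattern needs one per thread.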
Slide 7: Buffering to Optimize Bandwidth

[Diagram: The SM organization again, contrasting ~100's of cycles to reach DRAM with <10 cycles to reach shared memory. Tiles of data (Tile[0], Tile[1], Tile[2]) are staged from DRAM into the SM.]

Buffer data in fast Shared Memory
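A toy cost model shows why staging tiles in shared memory pays off. The latencies below are illustrative placeholders in the spirit of the slide (~100's of cycles to DRAM, <10 to shared memory), not measured values:

```python
# Toy model of shared-memory buffering: each element is fetched from slow
# DRAM once, and every subsequent reuse is served from the on-chip buffer.
DRAM_LAT, SHMEM_LAT = 400, 8  # illustrative cycle counts, not measurements

def transfer_cycles(tile_elems, reuses_per_elem, buffered):
    """Total memory cycles to read every tile element reuses_per_elem times."""
    if buffered:
        # One DRAM fill per element, then all reuses hit shared memory.
        return tile_elems * DRAM_LAT + tile_elems * reuses_per_elem * SHMEM_LAT
    # Every access goes all the way to DRAM.
    return tile_elems * reuses_per_elem * DRAM_LAT

print(transfer_cycles(64, 8, buffered=False))  # 204800
print(transfer_cycles(64, 8, buffered=True))   # 29696
```

Even with only eight reuses per element, buffering cuts the modeled memory cycles by roughly 7x.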
Slide 8: Buffering Problem 1: Wasted Storage

[Diagram: As Tile[0] travels from DRAM through the L2, the L1D cache, and the register file into shared memory, copies of it accumulate at every level, while Tile[1] and Tile[2] wait in DRAM.]

Duplicated data in Shared Mem, Caches, Reg. File
Roundabout path to Shared Memory
Slide 9: Buffering Problem 2: Code Expansion

CUDA: 4 lines → PTX: 59 instructions → SASS: 73 instructions

[The slide shows the full 59-instruction PTX listing and 73-instruction SASS listing generated for the 4-line CUDA buffering loop below.]

    __global__ void CUDAkernel2DCT(float *dst, float *src, int ImgStride)
    {
        __shared__ float tile[TILE_HEIGHT * STRIDE];

        // Preliminary address calculations
        ...
        float *tile_ptr = tile + <offset>;

        // Buffer into shared memory
        #pragma unroll
        for (unsigned int i = 0; i < TILE_SIZE; i++)
            tile_ptr[i * STRIDE] = src[i * ImgStride];
        __syncthreads();

        // Processing data
        ...
    }
Each tile transfer requires many arithmetic ops to calculate addresses
Address generation consumes ~50% of tile transfer cycles
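The arithmetic burden can be sketched as follows. This Python model mirrors the address computations the unrolled loop performs in the PTX above (multiplies and adds to form both a global and a shared address per element); the base addresses and strides are made-up example values:

```python
# Sketch of the per-iteration address arithmetic in the buffering loop:
# each element needs roughly a multiply-and-add for the global address and
# another for the shared address before the load/store pair can issue.
def buffering_addresses(src_base, tile_base, img_stride, tile_stride,
                        n_rows, elem_size=4):
    pairs, arith_ops = [], 0
    for i in range(n_rows):
        glob = src_base + (i * img_stride) * elem_size    # mul, mul, add
        shr = tile_base + (i * tile_stride) * elem_size   # mul, mul, add
        arith_ops += 6
        pairs.append((glob, shr))
    return pairs, arith_ops

# 8 rows of a tile: ~48 arithmetic ops just to drive 8 load/store pairs.
pairs, ops = buffering_addresses(0x1000, 0x0, img_stride=512,
                                 tile_stride=33, n_rows=8)
print(ops)  # 48
```

Every one of those ops occupies SM execution resources, which is exactly the overhead D2MA offloads.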
Slide 10: Objective

• A tool to help achieve better memory performance
• Inspired by Direct Memory Access (DMA)

[Diagram: In a CPU system, a DMA engine moves data between DRAM and the cache on the CPU's behalf.]
Slide 11: Objective

• A tool to help achieve better memory performance
• Inspired by Direct Memory Access (DMA)

[Diagram: Dropping a CPU-style DMA engine into a GPU SM does not work directly.]

Not interruptible! Heavy bookkeeping!
Slide 12: D2MA: The Big Picture

[Diagram: A D2MA engine inside each SM transfers data directly from DRAM into the SM's shared memory, bypassing the roundabout path through the cache hierarchy.]
Slide 13: D2MA: Data-Parallel Direct Memory Access

• Take advantage of regular memory accesses & unified L1D/Shared Memory space
• Decouple tile transfers from SM resources
  – Simplify address generation
  – Improve memory pipelining
• Direct path to shared memory

[Diagram: The D2MA engine sits beside the L1D cache's MSHRs and writes Tile[0] straight into shared memory.]
Slide 14: D2MA Programming Model

Original Code (CUDA: 4 lines of buffering, PTX: 59 instructions):

    __global__ void CUDAkernel2DCT(float *dst, float *src, int ImgStride)
    {
        __shared__ float tile[T_HEIGHT * T_STRIDE];

        int OffsThreadInRow = threadIdx.y * T_SIZE + threadIdx.x;
        int OffsThreadInCol = threadIdx.z * T_SIZE;
        src += FMUL(blockIdx.y * T_HEIGHT + OffsThreadInCol, ImgStride) +
               blockIdx.x * T_WIDTH + OffsThreadInRow;
        dst += FMUL(blockIdx.y * T_HEIGHT + OffsThreadInCol, ImgStride) +
               blockIdx.x * T_WIDTH + OffsThreadInRow;
        float *tile_ptr = tile + OffsThreadInCol * T_STRIDE + OffsThreadInRow;

        #pragma unroll
        for (unsigned int i = 0; i < T_SIZE; i++)
            tile_ptr[i * T_STRIDE] = src[i * ImgStride];
        __syncthreads();

        // process rows then columns
        CUDAsubroutineInplaceDCTvector(tile + (OffsThreadInCol + threadIdx.x) * T_STRIDE +
                                       OffsThreadInRow - threadIdx.x, 1);
        CUDAsubroutineInplaceDCTvector(tile_ptr, T_STRIDE);

        for (unsigned int i = 0; i < T_SIZE; i++)
            dst[i * ImgStride] = tile_ptr[i * T_STRIDE];
    }

D2MA-Optimized Code (CUDA: 4 lines of buffering, PTX: 12 instructions):

    __global__ void D2MAkernel2DCT(float *dst, float *src, int ImgStride)
    {
        __shared__ float tile[T_HEIGHT * T_STRIDE];

        int OffsThreadInRow = threadIdx.y * T_SIZE + threadIdx.x;
        int OffsThreadInCol = threadIdx.z * T_SIZE;
        src += FMUL(blockIdx.y * T_HEIGHT, ImgStride) + blockIdx.x * T_WIDTH;
        dst += FMUL(blockIdx.y * T_HEIGHT + OffsThreadInCol, ImgStride) +
               blockIdx.x * T_WIDTH + OffsThreadInRow;
        float *tile_ptr = tile + OffsThreadInCol * T_STRIDE + OffsThreadInRow;

        d2ma_configure_matrix(tile, src, T_HEIGHT, T_WIDTH, ImgStride);
        d2ma_set_datatype_float();
        d2ma_enable_shmem_blank_col();
        d2ma_ignite_buffer(0);

        // process rows then columns
        CUDAsubroutineInplaceDCTvector(tile + (OffsThreadInCol + threadIdx.x) * T_STRIDE +
                                       OffsThreadInRow - threadIdx.x, 1);
        CUDAsubroutineInplaceDCTvector(tile_ptr, T_STRIDE);

        for (unsigned int i = 0; i < T_SIZE; i++)
            dst[i * ImgStride] = tile_ptr[i * T_STRIDE];
    }
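Conceptually, the four d2ma_* calls fill a per-buffer transfer descriptor that the engine's controller later walks. The CPU-side sketch below models that effect; the class, field names, and method signatures are illustrative stand-ins, not the actual hardware or API interface:

```python
# Illustrative model of the D2MA API's effect: the calls populate a
# descriptor (one per buffer) that the engine consumes when ignited.
# All names here are hypothetical, not the real interface.
class D2MABuffer:
    def __init__(self):
        self.glob_addr = self.shr_addr = None
        self.n_elements = self.elem_size = self.stride = None
        self.ignited = False

    def configure_matrix(self, shr_addr, glob_addr, height, width, stride):
        self.shr_addr, self.glob_addr = shr_addr, glob_addr
        self.n_elements = height * width
        self.stride = stride

    def set_datatype_float(self):
        self.elem_size = 4  # 4-byte elements

    def ignite(self):
        # In hardware this hands the descriptor to the AGEN logic.
        self.ignited = True

buf = D2MABuffer()
buf.configure_matrix(0x20, 0x1020, height=8, width=8, stride=1)
buf.set_datatype_float()
buf.ignite()
print(buf.n_elements, buf.elem_size)  # 64 4
```

The key point the slide makes survives in the model: configuration is a handful of register writes rather than per-element address arithmetic in the kernel.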
Slide 15: D2MA Overview

[Diagram: The D2MA engine sits next to the L1D cache's MSHRs inside the SM. Its controller holds, for each of four buffers (Buf. 0–3), the configuration fields Glob. Addr, Shr. Addr, # Elements, Elem. Size, and Stride, and is paired with AGEN logic and a consistency checker.]
Slide 16: D2MA Operation: Configuration

The d2ma_configure_matrix(), d2ma_set_datatype_float(), and d2ma_enable_shmem_blank_col() calls send configuration messages to the D2MA engine's controller, which fills one buffer's entry with the transfer parameters — in this example, Glob. Addr = 0x1020, Shr. Addr = 0x20, # Elements = 64, Elem. Size = 4, Stride = 1.
Slide 17: D2MA Operation: Addr. Generation

d2ma_ignite_buffer(0) sends an "ignite" message for buffer 0. The controller hands that buffer's configuration (Glob. Addr 0x1020, Shr. Addr 0x20, 64 elements, 4-byte elements, stride 1) to the AGEN logic, which generates the global and shared memory addresses for the transfer.
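The AGEN step can be sketched for this slide's example configuration. Assuming a unit-stride, line-aligned transfer over 128-byte cache lines (simplifications for illustration), the logic emits one (global, shared) address pair per line:

```python
# Sketch of D2MA address generation for the example configuration:
# 64 elements x 4 bytes = 256 bytes = two 128-byte cache lines, so the
# AGEN logic emits two (global, shared) request pairs. Assumes unit
# stride and line alignment for simplicity.
LINE = 128

def agen(glob, shr, n_elems, elem_size, stride):
    """Emit one (global addr, shared addr) pair per cache line transferred."""
    total_bytes = n_elems * elem_size * stride
    return [(glob + off, shr + off) for off in range(0, total_bytes, LINE)]

pairs = agen(0x1020, 0x20, n_elems=64, elem_size=4, stride=1)
print([(hex(g), hex(s)) for g, s in pairs])
# [('0x1020', '0x20'), ('0x10a0', '0xa0')]
```

This matches the address pairs shown on the next slide: (0x1020, 0x20) followed by (0x10A0, 0xA0).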
Slide 18: D2MA Operation: Memory Transfer

The generated (Glob. Addr, Shr. Addr) pairs are entered into the MSHRs — here (0x1020, 0x20) followed by (0x10A0, 0xA0) — and the corresponding requests are issued to the memory system.

Slide 19: D2MA Operation: Memory Transfer

As responses return, each MSHR entry directs its cache line to the recorded shared memory destination, and completed entries are freed.
Slide 20: D2MA Operation: Enforcing Synchronization

Without D2MA: the programmer inserts syncthreads() and must guarantee consistency. Thread Block 1 starts transfer 1, loads from the buffer once the transfer ends, then starts transfer 2 for the next tile.

With D2MA: no syncthreads()! Igniting a transfer sets a thread barrier. Code independent of the buffer keeps executing while the transfer is in flight, a load that reaches the buffer too early is simply re-executed, and once the barrier is satisfied the transfer ends and loads from the buffer proceed. If no warp is ready to schedule, the SM waits. Synchronization is handled transparently by hardware.
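The replay behavior can be sketched with a toy timeline. The cycle counts below are invented for illustration; the point is only the mechanism — a load against an in-flight transfer is replayed rather than returning stale data:

```python
# Toy timeline model of D2MA's transparent synchronization: a load that
# targets a buffer whose transfer is still in flight is replayed until the
# engine marks the transfer complete, so no explicit __syncthreads() is
# needed. Cycle numbers are illustrative only.
class IgnitedBuffer:
    def __init__(self, cycles_to_fill):
        self.ready_at = cycles_to_fill  # cycle at which the transfer completes

    def try_load(self, cycle):
        """True if the load succeeds; False means the warp replays it."""
        return cycle >= self.ready_at

buf = IgnitedBuffer(cycles_to_fill=300)
replays, cycle = 0, 0
while not buf.try_load(cycle):
    replays += 1   # this warp replays; independent warps keep executing
    cycle += 100   # retry after being rescheduled
print(replays)  # 3
```

While one warp replays, the scheduler runs code that does not depend on the buffer, which is how D2MA overlaps transfer and compute.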
Slide 21: Experimental Evaluation

• GPGPU-Sim v3.2.1
• Benchmarks from NVIDIA CUDA SDK, Rodinia
  – Must perform shared memory buffering

Number of SMs: 15
Thread blocks/SM: 8
Shared memory/SM: 48 KB
D2MA engines/SM: 1
Controllers/engine: 8
Buffers/controller: 4
AGEN/engine: 1
Warp scheduling policy: Greedy-then-oldest
L1 cache (size/assoc/block size): 16 KB / 4-way / 128 B
L2 cache (size/assoc/block size): 768 KB / 16-way / 128 B
L2/DRAM latency: 240/440 cycles
Slide 22: Results: Performance

[Chart: Speedup (0–2.5x) over the baseline for DCT, MM, DWT, TRANS1, TRANS2, FW, GAUSS, MEAN, SOBEL, PATH, SGEMV, LUD1, LUD2, and the geomean.]

Geomean speedup: 1.36x
Slide 23: Results: Cycle Breakdown

[Chart: Breakdown of tile transfer cycles (0–100%) into address generation (AGEN) and memory transfer (MEM), comparing the baseline against D2MA for DCT, MM, DWT, TRANS1, TRANS2, FW, GAUSS, MEAN, SOBEL, PATH, SGEMV, LUD1, LUD2, and the average.]

Avg TX cycles: ~5x reduction
Addr. Gen: improved by 98%
Mem. TX: reduced by 66%
Slide 24: Results: Overheads

• Model of D2MA Engine synthesized using Synopsys
• Compared to NVIDIA GTX 480
  – Die area: 529 mm²
  – TDP: 250 W
  – One D2MA Engine per SM (15 SMs):
    • Area overhead: 0.016%
    • Power overhead: 0.022%

[Chart: Engine area breakdown — Controller Array 92%, AGEN Logic 5%, Consistency Checker 3%.]
Slide 25: Conclusion

• Programmer must optimize memory traffic to achieve good performance on GPUs
  – Shared memory buffering improves b/w utilization
  – Buffering still has overheads
• D2MA decouples tiled data buffering from existing SM resources
  – Reduces costs of address generation by 98%
  – Improves memory transfer times by 66%
  – Performance improves by 1.36x
  – Dynamic instructions executed reduced by 7%
  – Enforces synchronization transparently
  – Low area and power overheads (<0.03%)
Slide 26: Thank You!

• Questions?

Image credits:
http://www.opengraphicdesign.com/web/ajax-loading-graphics/
http://www.barriersandbollards.com/html-pages/mb50-1.png
Slide 28: Special Addressing Modes

• Blank Column Mode
• Halo Addressing Mode
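Blank column mode pads each row of the shared-memory tile with one unused column (visible in the kernel's T_STRIDE layout, where rows are 132 bytes = 33 floats apart for a 32-float tile row). A standard motivation for such padding is avoiding shared-memory bank conflicts on column accesses; the sketch below illustrates that effect under the usual 32-bank assumption:

```python
# Why a blank (padding) column helps: with 32 shared-memory banks, reading
# a column of a row-major tile with a row stride of 32 words sends every
# thread to the same bank; padding the stride to 33 spreads the accesses
# across all 32 banks. The 32-bank count is a common-GPU assumption.
BANKS = 32

def column_banks(stride_words):
    """Banks hit when 32 threads each read one element of a tile column."""
    return {(t * stride_words) % BANKS for t in range(32)}

print(len(column_banks(32)))  # 1  -> 32-way bank conflict
print(len(column_banks(33)))  # 32 -> conflict-free
```

With the blank column, the column-wise DCT pass touches all banks in parallel instead of serializing on one.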
Slide 29: Results: Dynamic Instruction Count Reduction

[Chart: Dynamic instruction count relative to baseline (0–1.2) for DCT, MM, DWT, TRANS1, TRANS2, FW, GAUSS, MEAN, SOBEL, PATH, SGEMV, LUD1, LUD2, and the geomean.]