Automated Dynamic Analysis of CUDA Programs
Transcript of Automated Dynamic Analysis of CUDA Programs
UNIVERSITY OF VIRGINIA
Automated Dynamic Analysis of CUDA Programs
Michael Boyer, Kevin Skadron*, and Westley Weimer
University of Virginia
{boyer,skadron,weimer}@cs.virginia.edu
* currently on sabbatical with NVIDIA Research
Outline
GPGPU
CUDA
Automated analyses
– Correctness: race conditions
– Performance: bank conflicts
Preliminary results
Future work
Conclusion
Why GPGPU?
From: NVIDIA CUDA Programming Guide, Version 1.1
CPU vs. GPU Design
CPU: optimized for single-thread latency
GPU: optimized for aggregate throughput
From: NVIDIA CUDA Programming Guide, Version 1.1
GPGPU Programming
Traditional approach: graphics APIs
ATI/AMD: Close-to-the-Metal (CTM)
NVIDIA: Compute Unified Device Architecture (CUDA)
CUDA: Abstractions
Kernel functions
Scratchpad memory
Barrier synchronization
CUDA: Example Program

```cuda
__host__ void example(int *cpu_mem) {
    cudaMalloc(&gpu_mem, mem_size);
    cudaMemcpy(gpu_mem, cpu_mem, mem_size, cudaMemcpyHostToDevice);
    kernel<<<grid, threads, mem_size>>>(gpu_mem);
    cudaMemcpy(cpu_mem, gpu_mem, mem_size, cudaMemcpyDeviceToHost);
}

__global__ void kernel(int *mem) {
    int thread_id = threadIdx.x;
    mem[thread_id] = thread_id;
}
```
CUDA: Hardware
[Diagram] A GPU contains Multiprocessors 1…N, which share Global Device Memory. Each multiprocessor contains Processing Elements 1…M (each with its own registers), a single Instruction Unit, and Per-Block Shared Memory (PBSM).
Outline
GPGPU
CUDA
Automated analyses
– Correctness: race conditions
– Performance: bank conflicts
Preliminary results
Future work
Conclusion
Race Conditions
Ordering of instructions among multiple threads is arbitrary
Relaxed memory consistency model
Synchronization: __syncthreads()
– Barrier / memory fence
Race Conditions: Example
```cuda
1  extern __shared__ int s[];
2
3  __global__ void kernel(int *out) {
4      int id = threadIdx.x;
5      int nt = blockDim.x;
6
7      s[id] = id;
8      out[id] = s[(id + 1) % nt];
9  }
```

[Diagram] Threads 0–5 each write (W) their own entry s[id] at line 7, then each read (R) a neighbor's entry s[(id + 1) % nt] at line 8. With no barrier between the two lines, each read races with another thread's write.
Automatic Instrumentation
Original CUDA Source Code → Intermediate Representation → Instrumentation → Instrumented CUDA Source Code → Compile → Execute → Output: Race Conditions Detected?
Race Condition Instrumentation
Two global bookkeeping arrays:
– Reads & writes of all threads
Two per-thread bookkeeping arrays:
– Reads & writes of a single thread
After each shared memory access:
– Update bookkeeping arrays
– Detect & report race conditions
Race Condition Detection
Original code:
RAW hazard at expression:
#line 8 out[id] = s[(id + 1) % nt];
Add synchronization between lines 7 and 8:
No race conditions detected
Outline
GPGPU
CUDA
Automated analyses
– Correctness: race conditions
– Performance: bank conflicts
Preliminary results
Future work
Conclusion
Bank Conflicts
PBSM is fast
– Much faster than global memory
– Potentially as fast as register access
…assuming no bank conflicts
– Bank conflicts cause serialized access
Non-Conflicting Access Patterns
[Diagram] Stride = 1: threads 0–7 map one-to-one onto banks 0–7; no conflicts.
[Diagram] Stride = 3: threads 0–7 still land in distinct banks; no conflicts.
Conflicting Access Patterns
[Diagram] Stride = 4: groups of threads collide in the same banks; conflicting accesses are serialized.
[Diagram] Stride = 16: every thread maps to the same bank; maximal conflict.
Impact of Bank Conflicts
[Chart] Runtime (seconds, 0–8) vs. iterations (0–1 million) for three versions of the kernel: No Bank Conflicts, Maximal Bank Conflicts, and Global Memory. Maximal bank conflicts slow the kernel down severalfold relative to the conflict-free version.
Automatic Instrumentation
Original CUDA Source Code → Intermediate Representation → Instrumentation → Instrumented CUDA Source Code → Compile → Execute → Output: Bank Conflicts Detected?
Bank Conflict Instrumentation
Global bookkeeping array:
– Tracks address accessed by each thread
After each PBSM access:
– Each thread updates its entry
– One thread computes and reports bank conflicts
Bank Conflict Detection
CAUSE_BANK_CONFLICTS = false:
No bank conflicts at: #line 14 mem[j]++
CAUSE_BANK_CONFLICTS = true:
Bank conflicts at: #line 14 mem[j]++
Bank:     0  1  2  3  4  5  6  7  8  9 …
Accesses: 16 0  0  0  0  0  0  0  0  0 …
Preliminary Results
Scan
– Included in CUDA SDK
– All-prefix-sums operation
– 400 lines of code
– Explicitly prevents race conditions and bank conflicts
Preliminary Results: Race Condition Detection
Original code:
– No race conditions detected
Remove any synchronization calls:
– Race conditions detected
Preliminary Results: Bank Conflict Detection
Original code:
– Small number of minor bank conflicts
Enable bank conflict avoidance macro:
– Bank conflicts increased!
– Confirmed by manual analysis
– Culprit: incorrect emulation mode
Instrumentation Overhead
Two sources:
– Emulation
– Instrumentation
Assumption: for debugging, programmers will already use emulation mode
Instrumentation Overhead
| Code Version | Execution Environment | Average Runtime | Slowdown (vs. Native) | Slowdown (vs. Emulation) |
| --- | --- | --- | --- | --- |
| Original | Native | 0.4 ms | | |
| Original | Emulation | 27 ms | 62x | |
| Instrumented (bank conflicts) | Emulation | 71 ms | 163x | 2.6x |
| Instrumented (race conditions) | Emulation | 324 ms | 739x | 12x |
Future Work
Find more types of bugs
– Correctness: array bounds checking
– Performance: memory coalescing
Reduce instrumentation overhead
– Execute instrumented code natively
Conclusion
GPGPU: enormous performance potential
– But parallel programming is challenging
Automated instrumentation can help
– Find synchronization bugs
– Identify inefficient memory accesses
– And more…
Questions?
Instrumentation tool will be available at:
http://www.cs.virginia.edu/~mwb7w/cuda
Domain Mapping
From: NVIDIA CUDA Programming Guide, Version 1.1
Coalesced Accesses
From: NVIDIA CUDA Programming Guide, Version 1.1
Non-Coalesced Accesses
From: NVIDIA CUDA Programming Guide, Version 1.1
Race Condition Detection Algorithm
A thread t knows a race condition exists at shared memory location m if:
– Location m has been read from and written to
– One of the accesses to m came from t
– One of the accesses to m came from a thread other than t
Note that we are only checking for RAW and WAR hazards
Bank Conflicts: Example
```cuda
extern __shared__ int mem[];

__global__ void kernel(int iters) {
    int min, stride, max, id = threadIdx.x;

    if (CAUSE_BANK_CONFLICTS) {
        // Set stride to cause bank conflicts
    } else {
        // Set stride to avoid bank conflicts
    }

    for (int i = 0; i < iters; i++)
        for (int j = min; j < max; j += stride)
            mem[j]++;
}
```
Instrumented Code Example
Original code:

```cuda
extern __shared__ int s[];

__global__ void kernel() {
    int id = threadIdx.x;
    int nt = blockDim.x * blockDim.y * blockDim.z;

    s[id] = id;
    int temp = s[(nt + id - 1) % nt];
}
```

Instrumented code:

```cuda
extern __shared__ int s[];

__global__ void kernel(void);
void kernel(void) {
    // Instrumentation code
    int block_size = blockDim.x * blockDim.y * blockDim.z;
    int thread_id = threadIdx.x + (threadIdx.y * blockDim.x) +
                    (threadIdx.z * blockDim.x * blockDim.y);
    __shared__ char mem_reads[PUT_ARRAY_SIZE_HERE];
    __shared__ char mem_writes[PUT_ARRAY_SIZE_HERE];
    if (thread_id == 0) {
        for (int i = 0; i < block_size; i++) {
            mem_reads[i] = 0;
            mem_writes[i] = 0;
        }
    }
    __syncthreads();
    char hazard = 0;

    int id; int nt; int temp;
    {
        id = (int)threadIdx.x;
        nt = (int)((blockDim.x * blockDim.y) * blockDim.z);

        //#line 9
        s[id] = id;
        // Instrumentation code
        mem_writes[id] = 1;
        __syncthreads();
        if (thread_id == 0) {
            for (int i = 0; i < block_size; i++) {
                if (mem_reads[i] && mem_writes[i]) { hazard = 1; break; }
            }
            if (hazard)
                printf("WAR hazard at expression: #line 9 s[id] = id;\n");
            hazard = 0;
        }

        //#line 10
        temp = s[((nt + id) - 1) % nt];
        // Instrumentation code
        mem_reads[((nt + id) - 1) % nt] = 1;
        __syncthreads();
        if (thread_id == 0) {
            for (int i = 0; i < block_size; i++) {
                if (mem_reads[i] && mem_writes[i]) { hazard = 1; break; }
            }
            if (hazard)
                printf("RAW hazard at expression: #line 10 temp = s[((nt + id) - 1) %% nt];\n");
            hazard = 0;
        }

        //#line 11
        return;
    }
}
```

Instrumentation output:
RAW hazard at expression:
#line 10 temp = s[((nt + id) - 1) % nt];