Automated Dynamic Analysis of CUDA Programs
Transcript of Automated Dynamic Analysis of CUDA Programs
UNIVERSITY OF VIRGINIA
Automated Dynamic Analysis of CUDA Programs
Michael Boyer, Kevin Skadron*, and Westley Weimer
University of Virginia
{boyer,skadron,weimer}@cs.virginia.edu
* currently on sabbatical with NVIDIA Research
Outline
GPGPU
CUDA
Automated analyses
– Correctness: race conditions
– Performance: bank conflicts
Preliminary results
Future work
Conclusion
Why GPGPU?
From: NVIDIA CUDA Programming Guide, Version 1.1
CPU vs. GPU Design
CPU: optimized for single-thread latency
GPU: optimized for aggregate throughput
From: NVIDIA CUDA Programming Guide, Version 1.1
GPGPU Programming
Traditional approach: graphics APIs
ATI/AMD: Close-to-the-Metal (CTM)
NVIDIA: Compute Unified Device Architecture (CUDA)
CUDA: Abstractions
Kernel functions
Scratchpad memory
Barrier synchronization
CUDA: Example Program

```cuda
__host__ void example(int *cpu_mem) {
    cudaMalloc(&gpu_mem, mem_size);
    cudaMemcpy(gpu_mem, cpu_mem, mem_size, cudaMemcpyHostToDevice);
    kernel<<<grid, threads, mem_size>>>(gpu_mem);
    cudaMemcpy(cpu_mem, gpu_mem, mem_size, cudaMemcpyDeviceToHost);
}

__global__ void kernel(int *mem) {
    int thread_id = threadIdx.x;
    mem[thread_id] = thread_id;
}
```
CUDA: Hardware
[Diagram] A GPU contains Multiprocessors 1…N, which share Global Device Memory. Each multiprocessor contains Processing Elements 1…M (each with its own registers), a single Instruction Unit, and Per-Block Shared Memory (PBSM).
Outline
GPGPU
CUDA
Automated analyses
– Correctness: race conditions
– Performance: bank conflicts
Preliminary results
Future work
Conclusion
Race Conditions
Ordering of instructions among multiple threads is arbitrary
Relaxed memory consistency model
Synchronization: __syncthreads()
– Barrier / memory fence
Race Conditions: Example
```cuda
1  extern __shared__ int s[];
2
3  __global__ void kernel(int *out) {
4      int id = threadIdx.x;
5      int nt = blockDim.x;
6
7      s[id] = id;
8      out[id] = s[(id + 1) % nt];
9  }
```

[Diagram] Threads 0–5 each write (W) their own entry s[id] at line 7, then each read (R) a neighbor's entry s[(id + 1) % nt] at line 8. With no barrier between the two lines, each read races with another thread's write.
Automatic Instrumentation
Original CUDA Source Code → Intermediate Representation → Instrumentation → Instrumented CUDA Source Code → Compile → Execute → Output: Race Conditions Detected?
Race Condition Instrumentation
Two global bookkeeping arrays:
– Reads & writes of all threads
Two per-thread bookkeeping arrays:
– Reads & writes of a single thread
After each shared memory access:
– Update bookkeeping arrays
– Detect & report race conditions
Race Condition Detection
Original code:
RAW hazard at expression:
#line 8 out[id] = s[(id + 1) % nt];
Add synchronization between lines 7 and 8:
No race conditions detected
Outline
GPGPU
CUDA
Automated analyses
– Correctness: race conditions
– Performance: bank conflicts
Preliminary results
Future work
Conclusion
Bank Conflicts
PBSM is fast
– Much faster than global memory
– Potentially as fast as register access
…assuming no bank conflicts
– Bank conflicts cause serialized access
Non-Conflicting Access Patterns
[Diagram] Stride = 1: threads 0–7 map one-to-one onto banks 0–7; no conflicts.
[Diagram] Stride = 3: threads 0–7 still land in distinct banks; no conflicts.
Conflicting Access Patterns
[Diagram] Stride = 4: groups of threads collide in the same banks; conflicting accesses are serialized.
[Diagram] Stride = 16: every thread maps to the same bank; maximal conflict.
Impact of Bank Conflicts
[Chart] Runtime (seconds, 0–8) vs. iterations (0–1 million) for three versions of the kernel: No Bank Conflicts, Maximal Bank Conflicts, and Global Memory. Maximal bank conflicts slow the kernel down severalfold relative to the conflict-free version.
Automatic Instrumentation
Original CUDA Source Code → Intermediate Representation → Instrumentation → Instrumented CUDA Source Code → Compile → Execute → Output: Bank Conflicts Detected?
Bank Conflict Instrumentation
Global bookkeeping array:
– Tracks address accessed by each thread
After each PBSM access:
– Each thread updates its entry
– One thread computes and reports bank conflicts
Bank Conflict Detection
CAUSE_BANK_CONFLICTS = false:
No bank conflicts at: #line 14 mem[j]++
CAUSE_BANK_CONFLICTS = true:
Bank conflicts at: #line 14 mem[j]++
Bank:     0  1  2  3  4  5  6  7  8  9 …
Accesses: 16 0  0  0  0  0  0  0  0  0 …
Preliminary Results
Scan
– Included in CUDA SDK
– All-prefix-sums operation
– 400 lines of code
– Explicitly prevents race conditions and bank conflicts
Preliminary Results: Race Condition Detection
Original code:
– No race conditions detected
Remove any synchronization calls:
– Race conditions detected
Preliminary Results: Bank Conflict Detection
Original code:
– Small number of minor bank conflicts
Enable bank conflict avoidance macro:
– Bank conflicts increased!
– Confirmed by manual analysis
– Culprit: incorrect emulation mode
Instrumentation Overhead
Two sources:
– Emulation
– Instrumentation
Assumption: for debugging, programmers will already use emulation mode
Instrumentation Overhead
| Code Version | Execution Environment | Average Runtime | Slowdown (vs. Native) | Slowdown (vs. Emulation) |
| --- | --- | --- | --- | --- |
| Original | Native | 0.4 ms | | |
| Original | Emulation | 27 ms | 62x | |
| Instrumented (bank conflicts) | Emulation | 71 ms | 163x | 2.6x |
| Instrumented (race conditions) | Emulation | 324 ms | 739x | 12x |
Future Work
Find more types of bugs
– Correctness: array bounds checking
– Performance: memory coalescing
Reduce instrumentation overhead
– Execute instrumented code natively
Conclusion
GPGPU: enormous performance potential
– But parallel programming is challenging
Automated instrumentation can help
– Find synchronization bugs
– Identify inefficient memory accesses
– And more…
Questions?
Instrumentation tool will be available at:
http://www.cs.virginia.edu/~mwb7w/cuda
Domain Mapping
From: NVIDIA CUDA Programming Guide, Version 1.1
Coalesced Accesses
From: NVIDIA CUDA Programming Guide, Version 1.1
Non-Coalesced Accesses
From: NVIDIA CUDA Programming Guide, Version 1.1
Race Condition Detection Algorithm
A thread t knows a race condition exists at shared memory location m if:
– Location m has been read from and written to
– One of the accesses to m came from t
– One of the accesses to m came from a thread other than t
Note that we are only checking for RAW and WAR hazards
Bank Conflicts: Example
```cuda
extern __shared__ int mem[];

__global__ void kernel(int iters) {
    int min, stride, max, id = threadIdx.x;

    if (CAUSE_BANK_CONFLICTS) {
        // Set stride to cause bank conflicts
    } else {
        // Set stride to avoid bank conflicts
    }

    for (int i = 0; i < iters; i++)
        for (int j = min; j < max; j += stride)
            mem[j]++;
}
```
Instrumented Code Example
Original code:

```cuda
extern __shared__ int s[];

__global__ void kernel() {
    int id = threadIdx.x;
    int nt = blockDim.x * blockDim.y * blockDim.z;

    s[id] = id;
    int temp = s[(nt + id - 1) % nt];
}
```

Instrumented code:

```cuda
extern __shared__ int s[];

__global__ void kernel(void);
void kernel(void) {
    // Instrumentation code
    int block_size = blockDim.x * blockDim.y * blockDim.z;
    int thread_id = threadIdx.x + (threadIdx.y * blockDim.x) +
                    (threadIdx.z * blockDim.x * blockDim.y);
    __shared__ char mem_reads[PUT_ARRAY_SIZE_HERE];
    __shared__ char mem_writes[PUT_ARRAY_SIZE_HERE];
    if (thread_id == 0) {
        for (int i = 0; i < block_size; i++) {
            mem_reads[i] = 0;
            mem_writes[i] = 0;
        }
    }
    __syncthreads();
    char hazard = 0;

    int id; int nt; int temp;
    {
        id = (int)threadIdx.x;
        nt = (int)((blockDim.x * blockDim.y) * blockDim.z);

        //#line 9
        s[id] = id;
        // Instrumentation code
        mem_writes[id] = 1;
        __syncthreads();
        if (thread_id == 0) {
            for (int i = 0; i < block_size; i++) {
                if (mem_reads[i] && mem_writes[i]) { hazard = 1; break; }
            }
            if (hazard)
                printf("WAR hazard at expression: #line 9 s[id] = id;\n");
            hazard = 0;
        }

        //#line 10
        temp = s[((nt + id) - 1) % nt];
        // Instrumentation code
        mem_reads[((nt + id) - 1) % nt] = 1;
        __syncthreads();
        if (thread_id == 0) {
            for (int i = 0; i < block_size; i++) {
                if (mem_reads[i] && mem_writes[i]) { hazard = 1; break; }
            }
            if (hazard)
                printf("RAW hazard at expression: #line 10 temp = s[((nt + id) - 1) %% nt];\n");
            hazard = 0;
        }

        //#line 11
        return;
    }
}
```

Instrumentation output:
RAW hazard at expression:
#line 10 temp = s[((nt + id) - 1) % nt];