© David Kirk/NVIDIA and Wen-mei W. Hwu University of Illinois, 2007-2012
CS/EE 217
GPU Architecture and Parallel Programming
Lecture 15: Atomic Operations and Histogramming - Part 2
Objective
• To learn practical histogram programming techniques
– Basic histogram algorithm using atomic operations
– Privatization
Review: A Histogram Example
• In the phrase “Programming Massively Parallel Processors”, build a histogram of the frequency of each letter
• A(4), C(1), E(1), G(1), …
• How do you do this in parallel?
– Have each thread take a section of the input
– For each input letter, use atomic operations to build the histogram
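The scheme above can be sketched on the CPU, with std::thread standing in for CUDA threads and std::atomic::fetch_add standing in for atomicAdd. The function name parallel_histogram and the 26-bin layout are illustrative choices, not part of the lecture's code:

```cpp
// CPU analogue of the slide's scheme: each thread takes one contiguous
// section of the input and updates a shared histogram with atomic adds.
#include <algorithm>
#include <array>
#include <atomic>
#include <cassert>
#include <string>
#include <thread>
#include <vector>

// Builds a 26-bin letter histogram of `text` using `num_threads` workers.
std::array<int, 26> parallel_histogram(const std::string& text, int num_threads) {
    std::array<std::atomic<int>, 26> bins;
    for (auto& b : bins) b.store(0);
    std::vector<std::thread> workers;
    const size_t section = (text.size() + num_threads - 1) / num_threads;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const size_t begin = t * section;
            const size_t end = std::min(text.size(), begin + section);
            for (size_t i = begin; i < end; ++i)
                if (text[i] >= 'A' && text[i] <= 'Z')
                    bins[text[i] - 'A'].fetch_add(1);  // plays the role of atomicAdd
        });
    }
    for (auto& w : workers) w.join();
    std::array<int, 26> result{};
    for (int b = 0; b < 26; ++b) result[b] = bins[b].load();
    return result;
}
```

Each thread scans its own contiguous section, which mirrors the per-thread sections in the iterations below; the later slides improve on this access pattern.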
Iteration #1 – 1st letter in each section
Input: P R O G R A M M I N G   M A S S I V E L Y   P
(Threads 0–3 each scan their own contiguous section)
Bins so far: E=1, M=2, P=1 (all other bins 0)
Iteration #2 – 2nd letter in each section
Input: P R O G R A M M I N G   M A S S I V E L Y   P
(Threads 0–3 each scan their own contiguous section)
Bins so far: A=1, E=1, L=1, M=3, P=1, R=1
Iteration #3
Input: P R O G R A M M I N G   M A S S I V E L Y   P
(Threads 0–3 each scan their own contiguous section)
Bins so far: A=1, E=1, I=1, L=1, M=3, O=1, P=1, R=1, S=1
Iteration #4
Input: P R O G R A M M I N G   M A S S I V E L Y   P
(Threads 0–3 each scan their own contiguous section)
Bins so far: A=1, E=1, G=1, I=1, L=1, M=3, N=1, O=1, P=1, R=1, S=2
Iteration #5
Input: P R O G R A M M I N G   M A S S I V E L Y   P
(Threads 0–3 each scan their own contiguous section)
Bins so far: A=1, E=1, G=1, I=1, L=1, M=3, N=1, O=1, P=2, R=2, S=2
What is wrong with the algorithm?
• Reads from the input array are not coalesced
– Assign inputs to each thread in a strided pattern instead
– Adjacent threads then process adjacent input letters
Input: P R O G R A M M I N G   M A S S I V E L Y   P
(Threads 0–3 read adjacent letters; thread k reads element k of the current group of four)
Bins so far: G=1, O=1, P=1, R=1
Iteration 2
Input: P R O G R A M M I N G   M A S S I V E L Y   P
(Threads 0–3 read the next four adjacent letters)
Bins so far: A=1, G=1, M=2, O=1, P=1, R=2
• All threads move to the next section of input
A Histogram Kernel
• The kernel receives a pointer to the input buffer
• Each thread processes the input in a strided pattern
__global__ void histo_kernel(unsigned char *buffer,
long size, unsigned int *histo)
{
int i = threadIdx.x + blockIdx.x * blockDim.x;
// stride is total number of threads
int stride = blockDim.x * gridDim.x;
More on the Histogram Kernel
// All threads handle blockDim.x * gridDim.x
// consecutive elements
while (i < size) {
atomicAdd( &(histo[buffer[i]]), 1);
i += stride;
}
}
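As a sanity check of the strided indexing used by the kernel, here is a small CPU sketch (the helper visit_counts is hypothetical) showing that i = tid; i += stride visits every element exactly once, with adjacent threads touching adjacent elements in each round:

```cpp
// The kernel's access pattern on the CPU: thread `tid` out of `nthreads`
// visits elements tid, tid + nthreads, tid + 2*nthreads, ... so that in
// every round adjacent threads read adjacent elements (coalesced on a GPU).
#include <cassert>
#include <vector>

// Counts how often each of `size` elements is visited under the strided scheme.
std::vector<int> visit_counts(int size, int nthreads) {
    std::vector<int> visits(size, 0);
    for (int tid = 0; tid < nthreads; ++tid)
        for (int i = tid; i < size; i += nthreads)   // i += stride
            ++visits[i];
    return visits;
}
```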
Atomic Operations on DRAM
• An atomic operation starts with a read, with a latency of a few hundred cycles
• The atomic operation ends with a write, with a latency of a few hundred cycles
• During this whole time, no one else can access the location
Atomic Operations on DRAM
• Each load-modify-store has two full memory access delays
– All atomic operations on the same variable (RAM location) are serialized
[Timing diagram: each atomic operation spans internal routing, a DRAM delay, and a transfer delay; atomic operation N+1 cannot begin until atomic operation N completes.]
Latency determines throughput of atomic operations
• Throughput of an atomic operation is the rate at which the application can execute an atomic operation on a particular location.
• The rate is limited by the total latency of the read-modify-write sequence, typically more than 1000 cycles for global memory (DRAM) locations.
• This means that if many threads attempt atomic operations on the same location (contention), the effective memory bandwidth is reduced to less than 1/1000 of peak!
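A back-of-envelope sketch of that claim. The 1 GHz clock and the 1000-cycle read-modify-write latency below are assumptions for illustration, not measured values:

```cpp
// Back-of-envelope model: atomics serialized on one location complete at
// clock / latency operations per second.
#include <cassert>

// Peak rate of serialized atomics on one location, in operations per second.
double atomic_throughput(double clock_hz, double rmw_latency_cycles) {
    return clock_hz / rmw_latency_cycles;
}
```

At 1 GHz and 1000 cycles per read-modify-write, a single contended location sustains only about one million atomic operations per second, which is where the "< 1/1000 of bandwidth" figure comes from.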
You may have a similar experience in supermarket checkout
• Some customers realize that they missed an item after they have started to check out
• They run to the aisle and get the item while the line waits
– The rate of checkout is reduced due to the long latency of running to the aisle and back
• Imagine a store where every customer starts to check out before fetching any of the items
– The rate of checkout would be 1 / (entire shopping time of each customer)
Hardware Improvements
• Atomic operations on Fermi L2 cache
– Medium latency, but still serialized
– Global to all blocks
– “Free improvement” over global memory atomics
[Timing diagram: with L2-cache atomics, each operation needs only internal routing and data transfers; atomic operation N+1 still waits for operation N, but the serialized latency is shorter.]
Hardware Improvements (cont.)
• Atomic operations on shared memory
– Very short latency, but still serialized
– Private to each thread block
– Requires algorithmic work by the programmer (more later)
[Timing diagram: with shared-memory atomics, each operation needs only internal routing and a single data transfer, so the serialized latency is shorter still.]
Atomics in Shared Memory Requires Privatization
• Create private copies of the histo[] array for each thread block
__global__ void histo_kernel(unsigned char *buffer,
long size, unsigned int *histo)
{
__shared__ unsigned int histo_private[256];
if (threadIdx.x < 256) histo_private[threadIdx.x] = 0;
__syncthreads();
Build Private Histogram
int i = threadIdx.x + blockIdx.x * blockDim.x;
// stride is total number of threads
int stride = blockDim.x * gridDim.x;
while (i < size) {
atomicAdd( &(histo_private[buffer[i]]), 1);
i += stride;
}
Build Final Histogram
// wait for all other threads in the block to finish
__syncthreads();
if (threadIdx.x < 256)
atomicAdd( &(histo[threadIdx.x]), histo_private[threadIdx.x] );
}
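The same privatization pattern can be sketched on the CPU, with a per-thread private array playing the role of the __shared__ histo_private[] and std::atomic standing in for the final atomicAdd merge (privatized_histogram is an illustrative name, not from the lecture):

```cpp
// CPU analogue of the privatized kernel above: each thread fills its own
// private copy of the histogram with plain adds, then merges it into the
// shared result with one atomic add per bin.
#include <array>
#include <atomic>
#include <cassert>
#include <string>
#include <thread>
#include <vector>

std::array<int, 26> privatized_histogram(const std::string& text, int num_threads) {
    std::array<std::atomic<int>, 26> global;
    for (auto& b : global) b.store(0);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            std::array<int, 26> priv{};               // private copy: no contention
            for (size_t i = t; i < text.size(); i += num_threads)
                if (text[i] >= 'A' && text[i] <= 'Z')
                    ++priv[text[i] - 'A'];
            for (int b = 0; b < 26; ++b)              // merge: one atomic per bin
                if (priv[b] != 0) global[b].fetch_add(priv[b]);
        });
    }
    for (auto& w : workers) w.join();
    std::array<int, 26> result{};
    for (int b = 0; b < 26; ++b) result[b] = global[b].load();
    return result;
}
```

The contended atomics all land on the fast private copy; the shared result sees at most one atomic per bin per thread, exactly as in the kernel's final merge.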
More on Privatization
• Privatization is a powerful and frequently used technique for parallelizing applications
• The operation needs to be associative and commutative
– The histogram add operation is associative and commutative
• The histogram size needs to be small
– It must fit into shared memory
• What if the histogram is too large to privatize?
Other Atomic Operations
• atomicCAS(int *p, int cmp, int val)
– CAS = compare and swap
– Atomically performs the following:
int old = *p;
if (cmp == old) *p = val;
return old;
• atomicExch(int *p, int val) – unconditional version of CAS:
int old = *p;
*p = val;
return old;
• What are these used for?
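The semantics of the two bullets can be sketched with C++ std::atomic, whose compare_exchange_strong and exchange play the roles of atomicCAS and atomicExch (the wrappers cas and exch are illustrative):

```cpp
// Sketch of the two primitives' semantics using C++ std::atomic.
#include <atomic>
#include <cassert>

// atomicCAS(p, cmp, val): if *p == cmp, store val; either way, return the old *p.
int cas(std::atomic<int>* p, int cmp, int val) {
    int old = cmp;
    // On failure, compare_exchange_strong writes the observed value into `old`.
    p->compare_exchange_strong(old, val);
    return old;
}

// atomicExch(p, val): unconditionally store val and return the old *p.
int exch(std::atomic<int>* p, int val) {
    return p->exchange(val);
}
```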
Locking causes control divergence in GPUs
• A divergence deadlock can occur if the thread holding the lock idles while the other threads in its warp spin waiting for it
Alternatives to locking?
• Lock-free algorithms/data structures
– Update a private copy
– Try to atomically update the global data structure using compare-and-swap or similar
– Retry if the update failed
– Need data structures that support this kind of operation
• Wait-free algorithms/data structures
– Similar to histogramming: don’t wait, just update atomically
– But applies only to some algorithms
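The lock-free recipe in the first bullet can be sketched on the CPU: compute the updated value privately, try to publish it with compare-and-swap, and retry on failure. Here it implements an illustrative lock-free "atomic maximum" (atomic_max is a hypothetical helper, not from the lecture):

```cpp
// Lock-free update via a CAS retry loop: read the current value, compute
// the update privately, and try to publish it; if another thread got there
// first, re-read and retry.
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Raises `target` to at least `value` without ever taking a lock.
void atomic_max(std::atomic<int>& target, int value) {
    int cur = target.load();
    while (cur < value && !target.compare_exchange_weak(cur, value)) {
        // CAS failed: `cur` now holds the freshly observed value; loop retries.
    }
}
```

No thread ever blocks another: a failed CAS just means some other thread made progress, which is the defining property of a lock-free update.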
Lock-free vs. locking
Example from Nvidia presentation at GTC 2013
Parallel Linked List Example
ANY MORE QUESTIONS?