Tyler Sorensen
Adviser: Jade Alglave
University College London
WPLI 2015, April 12, 2015
GPU Concurrency: Weak Behaviours and Programming Assumptions
Based on our ASPLOS '15 paper:
Jade Alglave (1,2), Mark Batty (3), Alastair F. Donaldson (4), Ganesh Gopalakrishnan (5), Jeroen Ketema (4), Daniel Poetzl (6), Tyler Sorensen (1,5), John Wickerson (4)
(1) University College London, (2) Microsoft Research, (3) University of Cambridge, (4) Imperial College London, (5) University of Utah, (6) University of Oxford
GPU Concurrency: Weak Behaviours and Programming Assumptions
Intel Core i7 4500 CPU
Nvidia Tesla C2075 GPU
Roadmap
• what happened to the pony (background)
• how we found the bug (methodology)
• how we are able to fix the pony (contribution)
What happened to the pony?
• the visualization bugs are due to weak memory behaviours on GPUs
Weak memory models
• consider the test known as message passing (mp)
• an instance of this test appears in the pony code
• the test is made up of:
• initial state: x and y are memory locations
• thread ids
• program: one instruction sequence for each thread id
• assertion: a question about the final state of registers
Message passing (mp) test
• tests how to implement a handshake idiom
• one thread writes the data and then sets a flag; the other thread reads the flag and then reads the data
• the behaviour of interest: the reading thread sees the flag set but reads stale data
• the assertion cannot be satisfied by any interleaving of the threads' instructions
• executing as an interleaving is known as Lamport's sequential consistency (or SC)
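The claim about interleavings can be checked by brute force. The sketch below is an illustration, not the tooling used in the talk: it assumes the mp program stores to x (data) then y (flag) on T0 and loads y then x into registers r1 and r2 on T1, with the assertion r1=1 and r2=0. The tuple encoding and the helper names `interleavings` and `run` are hypothetical.

```python
from itertools import combinations

# mp test (assumed encoding): T0 writes data x, then flag y;
# T1 reads flag y into r1, then data x into r2.
T0 = [("st", "x", 1), ("st", "y", 1)]
T1 = [("ld", "y", "r1"), ("ld", "x", "r2")]

def interleavings(a, b):
    """Yield every merge of a and b that keeps each thread's program order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]

def run(schedule):
    """Execute one interleaving against a single shared memory."""
    mem = {"x": 0, "y": 0}  # initial state
    regs = {}
    for op, loc, arg in schedule:
        if op == "st":
            mem[loc] = arg
        else:
            regs[arg] = mem[loc]
    return regs["r1"], regs["r2"]

# All (r1, r2) outcomes reachable under sequential consistency.
sc_outcomes = {run(s) for s in interleavings(T0, T1)}
print(sorted(sc_outcomes))
print("assertion r1=1 and r2=0 reachable under SC:", (1, 0) in sc_outcomes)
```

There are only six interleavings of two two-instruction threads, so the enumeration is exhaustive; none of them produces the flag-set, stale-data outcome.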
Weak memory models
• can we assume the assertion will never pass? No!
• Alglave and Maranget report that this assertion is satisfied 41 million times out of 5 billion test runs on a Tegra2 ARM processor [1]
[1] http://diy.inria.fr/cats/tables.html
Weak memory models
• what happened?
• architectures implement weak memory models where the hardware is allowed to re-order certain memory instructions.
• weak memory models can allow weak behaviours (executions that do not correspond to any interleaving)
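To see how re-ordering produces a weak behaviour, the following sketch uses a deliberately crude model: each thread's two accesses (which go to different locations) may be swapped before the threads are interleaved. This is an assumption made for illustration only; it is not the PTX model, and the mp encoding and helper names are hypothetical.

```python
from itertools import combinations, permutations

# mp test (assumed encoding): T0 writes data x then flag y; T1 reads y then x.
T0 = [("st", "x", 1), ("st", "y", 1)]
T1 = [("ld", "y", "r1"), ("ld", "x", "r2")]

def interleavings(a, b):
    """Yield every merge of a and b that keeps the given order within each."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]

def run(schedule):
    """Execute one schedule against a single shared memory."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for op, loc, arg in schedule:
        if op == "st":
            mem[loc] = arg
        else:
            regs[arg] = mem[loc]
    return regs["r1"], regs["r2"]

def outcomes(allow_reordering):
    """Collect (r1, r2) outcomes, optionally letting each thread reorder its accesses."""
    orders0 = list(permutations(T0)) if allow_reordering else [tuple(T0)]
    orders1 = list(permutations(T1)) if allow_reordering else [tuple(T1)]
    return {run(s)
            for p0 in orders0
            for p1 in orders1
            for s in interleavings(p0, p1)}

# Outcomes that exist only because of re-ordering, i.e. the weak behaviours.
weak_only = outcomes(True) - outcomes(False)
print(weak_only)
```

In this toy model the only outcome added by re-ordering is r1=1 and r2=0, which is exactly the stale-data outcome that the mp assertion asks about.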
GPU memory models
• what type of memory model do current GPUs implement?
• documentation is sparse: CUDA has 1 page + 1 example, PTX has 1 page + 0 examples, and both are given in English prose
• we need to know this if we are to write correct GPU programs!
GPU programming
[Diagram: threads are grouped into CTAs (CTA 0, CTA 1, ..., CTA n); each CTA has its own shared memory (Shared Memory For CTA 0, CTA 1, ..., CTA n); all CTAs sit above a single Global Memory]
• within CTAs, threads are grouped into warps (32 threads per warp in Nvidia GPUs)
Roadmap
• what happened to the pony (background)
• how we found the bug (methodology)
• how we are able to fix the pony (contribution)
Methodology
[Diagram: GPU litmus tests are given to both the GPU hardware and a formal model; the two sets of results are compared]
GPU tests
• GPU litmus test considerations:
• PTX instructions
• what memory region (shared or global) are x and y in?
• are T0 and T1 in the same CTA or different CTAs?
Scope Tree (device (cta T0) (cta T1))
x: global, y: global
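The question of whether T0 and T1 share a CTA can be answered mechanically from the scope tree. The sketch below is a minimal, hypothetical reader for scope trees written as parenthesised expressions like the one on the slide; `parse`, `cta_of_threads`, and `same_cta` are illustrative names, not part of the litmus tool.

```python
import re

def parse(text):
    """Parse a scope tree such as '(device (cta T0) (cta T1))' into nested lists."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        items = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                items.append(node())
            else:
                items.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return items

    return node()

def cta_of_threads(tree):
    """Map each thread name to the index of the CTA that contains it."""
    placement = {}
    counter = 0

    def walk(n, cta):
        nonlocal counter
        label, children = n[0], n[1:]
        if label == "cta":
            cta = counter
            counter += 1
        for child in children:
            if isinstance(child, list):
                walk(child, cta)
            else:
                placement[child] = cta

    walk(tree, None)
    return placement

def same_cta(text, t0="T0", t1="T1"):
    """Return True if both threads sit under the same cta node."""
    placement = cta_of_threads(parse(text))
    return placement[t0] == placement[t1]

print(same_cta("(device (cta T0) (cta T1))"))      # threads in different CTAs
print(same_cta("(grid(cta(warp T0) (warp T1)))"))  # threads in one CTA, different warps
```

The first call uses the inter-CTA tree from this slide; the second uses an intra-CTA tree in the same notation.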
Running tests
• we extend the litmus CPU testing tool of Alglave and Maranget to run GPU tests
• given a GPU litmus test, the tool generates executable CUDA or OpenCL code for the test
Heuristics
• memory stress: extra threads read and write to scratch memory
• T0: run T0 test program
• T1: run T1 test program
• extra thread 1 ... extra thread n: loop: read or write to scratchpad
• random threads: randomize the location of threads T0 and T1
| test | none | random threads | memory stress | memory stress + random threads |
|---|---|---|---|---|
| gpu-mp | 0 | 0 | 139 | 522 |

Number of weak behaviours in 100,000 runs for different heuristics on an Nvidia Tesla C2075
How we found the pony bug
• the gpu-mp idiom, combined with these heuristics, is what triggered the bug!
Roadmap
• what happened to the pony (background)
• how we found the bug (methodology)
• how we are able to fix the pony (contribution)
GPU fences
• PTX gives 2 fences to disallow reading stale data
• membar.cta – gives ordering intra-CTA
• membar.gl – gives ordering over device
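Read literally, the two descriptions say that a fence helps only if its scope covers both communicating threads. The sketch below encodes just that reading as a lookup; it is an assumption-laden simplification rather than the formal PTX model, and `FENCE_SCOPE`, `fence_orders`, and `weakest_sufficient_fence` are hypothetical names.

```python
# Scope each fence is described as covering (taken from the bullets above).
FENCE_SCOPE = {"membar.cta": "cta", "membar.gl": "device"}

def fence_orders(fence, same_cta):
    """Return True if the fence's scope covers both communicating threads."""
    scope = FENCE_SCOPE[fence]
    return scope == "device" or same_cta

def weakest_sufficient_fence(same_cta):
    """Pick the narrowest fence whose scope covers both threads."""
    for fence in ("membar.cta", "membar.gl"):
        if fence_orders(fence, same_cta):
            return fence

print(weakest_sufficient_fence(same_cta=True))
print(weakest_sufficient_fence(same_cta=False))
```

Under this reading, threads in the same CTA can use membar.cta, while threads in different CTAs need membar.gl.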
GPU fences
• Test amended with a parameterizable fence
Scope Tree (device (cta T0) (cta T1))
x: global, y: global
| test | none | membar.cta | membar.gl |
|---|---|---|---|
| gpu-mp | 3380 | 2 | 0 |

Number of weak behaviours in 100,000 runs for different fences on an Nvidia Tesla C2075
How do we fix the pony?
• by adding fences to the code
[Figures: Nvidia Tesla C2075 GPU; Nvidia Tesla C2075 GPU (with fences)]
GPU testing campaign
• we extend the diy CPU litmus test generation tool of Alglave and Maranget to generate GPU tests
• generates litmus tests based on cycles
• enumerates the tests over the GPU thread and memory hierarchy
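Enumerating one test over the hierarchy can be pictured as a small cross product. The sketch below is only an illustration of that step (cycle-based generation is not shown); it assumes two memory regions and two thread placements, and it drops combinations that put a shared-memory location in a test whose threads are in different CTAs, since shared memory is per-CTA. The names `REGIONS`, `PLACEMENTS`, and `variants` are hypothetical.

```python
from itertools import product

REGIONS = ("shared", "global")

# Scope trees in the notation used on the slides.
PLACEMENTS = {
    "intra-CTA": "(grid(cta(warp T0) (warp T1)))",
    "inter-CTA": "(device (cta T0) (cta T1))",
}

def variants():
    """Yield every (placement, region of x, region of y) variant of a two-location test."""
    for (name, tree), x, y in product(PLACEMENTS.items(), REGIONS, REGIONS):
        # Shared memory belongs to a single CTA, so threads in different
        # CTAs cannot communicate through a shared-memory location.
        if name == "inter-CTA" and "shared" in (x, y):
            continue
        yield {"placement": name, "scope_tree": tree, "x": x, "y": y}

for v in variants():
    print(v["placement"], v["scope_tree"], "x:", v["x"], "y:", v["y"])
```

For a two-thread, two-location test this gives four intra-CTA variants and one inter-CTA variant.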
• using our tools, we generated and ran 10930 tests over 5 Nvidia chips:

| chip | year | architecture |
|---|---|---|
| GTX 750 ti | 2014 | Maxwell |
| GTX Titan | 2013 | Kepler |
| GTX 660 | 2012 | Kepler |
| GTX 540m | 2011 | Fermi |
| Tesla C2075 | 2011 | Fermi |
• results are hosted at: http://virginia.cs.ucl.ac.uk/sunflowers/asplos15/flat.html
Modeling
• we extended herd, the CPU axiomatic memory modeling tool of Alglave and Maranget, to GPUs
• we developed an axiomatic memory model for PTX which is able to simulate all of our tests
• our model is sound with respect to all of our hardware observations
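Soundness here means that every outcome observed on hardware is also allowed by the model. As a set-inclusion check this can be sketched as follows; the outcome sets are illustrative assumptions for the mp test, written as (r1, r2) pairs, and the function names are hypothetical.

```python
def unexplained(observed_on_hardware, allowed_by_model):
    """Outcomes seen on hardware that the model forbids."""
    return set(observed_on_hardware) - set(allowed_by_model)

def is_sound(observed_on_hardware, allowed_by_model):
    """A model is sound w.r.t. the observations if nothing is left unexplained."""
    return not unexplained(observed_on_hardware, allowed_by_model)

# Illustrative data: hardware shows the weak mp outcome (1, 0).
observed = {(0, 0), (0, 1), (1, 1), (1, 0)}
sc_model = {(0, 0), (0, 1), (1, 1)}   # forbids the weak outcome
weak_model = sc_model | {(1, 0)}      # allows it

print(is_sound(observed, sc_model), unexplained(observed, sc_model))
print(is_sound(observed, weak_model), unexplained(observed, weak_model))
```

An SC model would be unsound for hardware that exhibits the weak mp outcome, while a model that allows that outcome passes the check.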
• Demo of web interface
More results
• surprising and buggy behaviours observed:
• GPU mutex implementations allow stale data to be read (found in the CUDA by Example book and in academic papers [1,2]); this led to an erratum issued by Nvidia
• hardware re-orders loads from the same address on Nvidia Fermi and Kepler
• some testing on AMD GPUs
[1] J. A. Stuart and J. D. Owens, "Efficient synchronization primitives for GPUs", CoRR, 2011, http://arxiv.org/pdf/1110.4623.pdf
[2] B. He and J. X. Yu, "High-throughput transaction executions on graphics processors", PVLDB, 2011
Related work (CPU memory models)
• Alglave et al. have done extensive work on testing and modeling CPUs (notably IBM Power and ARM) and created the tools diy, litmus, and herd, which we extended for this work
• Collier tested CPU memory models using the ARCHTEST tool
Related work (GPU memory models)
• Hower et al. have proposed several SC-for-race-free language-level memory models for GPUs
Questions?
[Figures: Nvidia Tesla C2075 GPU (with fences); Nvidia Tesla C2075 GPU; Intel Core i7 4500 CPU]
project page: http://virginia.cs.ucl.ac.uk/sunflowers/asplos15/
CUDA by Example
[Figures: Intel Core i7 4500 CPU; Nvidia Tesla C2075 GPU; Nvidia Tesla C2075 GPU (with fences)]
Read-after-Read Hazard
Ignore after this
Results
• Surprising and buggy behaviours observed:
• SC-per-location violations on Nvidia Fermi and Kepler architectures:
todo: add CORR test
Limitations
• warps: we do not test intra-warp behaviours as the lock step behaviour of warps is not compatible with some of our heuristics
• grids: we do not test inter-grid behaviours as we did not find any examples in the literature
GPU programming
• GPUs are SIMT (Single Instruction, Multiple Thread)
• Nvidia GPUs may be programmed using CUDA or OpenCL
Roadmap
• background and motivation
• approach
• GPU tests
• running tests
• modeling
Heuristics
• two additional heuristics:
• synchronization: testing threads synchronize immediately before running the test program
• general bank conflicts: generate memory accesses that conflict with the accesses in the memory stress heuristic
Challenges
• PTX optimizing assembler may reorder or remove instructions
• We developed a tool optcheck which compares the litmus test with the binary and checks for optimizations
Roadmap
• background and motivation
• approach
• GPU tests
• running tests
• modeling
GPU tests
• concrete GPU test
T0 | T1 ;
st.cg.s32 [x], 1 | ld.cg.s32 r1,[y] ;
st.cg.s32 [y], 1 | ld.cg.s32 r2,[x] ;
ScopeTree
(grid(cta(warp T0) (warp T1)))
x: shared, y: global
exists (1:r1=1 /\ 1:r2=0)
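To see why the `exists` clause describes a weak behaviour, the test can be run under sequential consistency by enumerating every interleaving of the two threads. This is a minimal sketch, not part of the authors' tooling: it assumes `x` and `y` start at 0, and it ignores the scope tree and the shared/global memory regions.

```python
from itertools import combinations

# Each instruction is (kind, location, stored value or destination register).
T0 = [("st", "x", 1), ("st", "y", 1)]
T1 = [("ld", "y", "r1"), ("ld", "x", "r2")]


def interleavings(a, b):
    """Yield every merge of a and b that preserves each thread's program order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]


def run(schedule):
    """Execute one interleaving against a single shared memory; return (r1, r2)."""
    memory = {"x": 0, "y": 0}  # assumed initial state
    regs = {}
    for kind, loc, arg in schedule:
        if kind == "st":
            memory[loc] = arg
        else:
            regs[arg] = memory[loc]
    return regs["r1"], regs["r2"]


sc_outcomes = {run(s) for s in interleavings(T0, T1)}
print(sorted(sc_outcomes))       # [(0, 0), (0, 1), (1, 1)]
print((1, 0) in sc_outcomes)     # False
```

No interleaving produces `r1=1` together with `r2=0`, so observing that outcome on hardware means the execution was not sequentially consistent.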
GPU programming
explicit hierarchical concurrency model
• thread hierarchy:
  • thread
  • warp
  • CTA (Cooperative Thread Array)
  • grid
• memory hierarchy:
  • shared memory
  • global memory
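The two hierarchies interact: which memory two threads can communicate through depends on where they sit in the thread hierarchy. As a small sketch, assuming each thread is described by hypothetical `cta` and `warp` indices within a single grid, the narrowest scope shared by two threads can be computed as follows:

```python
def narrowest_common_scope(a, b):
    """Return the innermost level of the thread hierarchy containing both threads."""
    if a["cta"] != b["cta"]:
        return "grid"
    if a["warp"] != b["warp"]:
        return "cta"
    return "warp"


def can_communicate_via(region, a, b):
    """Shared memory is per CTA; global memory is visible across the grid."""
    if region == "shared":
        return a["cta"] == b["cta"]
    return region == "global"


# Hypothetical thread placements: T0 and T1 in one CTA but different warps,
# T2 in a different CTA.
T0 = {"cta": 0, "warp": 0}
T1 = {"cta": 0, "warp": 1}
T2 = {"cta": 1, "warp": 0}

print(narrowest_common_scope(T0, T1))      # cta
print(narrowest_common_scope(T0, T2))      # grid
print(can_communicate_via("shared", T0, T2))  # False
```

The placement of `T0` and `T1` here corresponds to a scope tree of the form `(grid(cta(warp T0) (warp T1)))`.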
GPU background
Images from Wikipedia [15,16,17]
• GPU is a highly parallel co-processor
• currently found in devices from tablets to top supercomputers
• not just used for visualization anymore!
References
[1] L. Lamport, "How to make a multiprocessor computer that correctly executes multi-process programs," Trans. Comput., 1979.
[2] J. Alglave, L. Maranget, S. Sarkar, and P. Sewell, "Litmus: Running tests against hardware," TACAS, 2011.
[3] J. Alglave, L. Maranget, and M. Tautschnig, "Herding cats: modelling, simulation, testing, and data-mining for weak memory," TOPLAS, 2014.
[4] NVIDIA, "CUDA C programming guide, version 6 (July 2014)," http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf
[5] NVIDIA, "Parallel Thread Execution ISA: Version 4.0 (Feb. 2014)," http://docs.nvidia.com/cuda/parallel-thread-execution
[6] J. Alglave, L. Maranget, S. Sarkar, and P. Sewell, "Fences in weak memory models (extended version)," FMSD, 2012.
[7] J. Sanders and E. Kandrot, "CUDA by Example: An Introduction to General-Purpose GPU Programming," Addison-Wesley Professional, 2010.
[8] J. A. Stuart and J. D. Owens, "Efficient synchronization primitives for GPUs," CoRR, 2011, http://arxiv.org/pdf/1110.4623.pdf
[9] B. He and J. X. Yu, "High-throughput transaction executions on graphics processors," PVLDB, 2011.
[10] W. W. Collier, "Reasoning About Parallel Architectures," Prentice-Hall, Inc., 1992.
[11] D. R. Hower, B. M. Beckmann, B. R. Gaster, B. A. Hechtman, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "Sequential consistency for heterogeneous-race-free," MSPC, 2013.
[12] D. R. Hower, B. A. Hechtman, B. M. Beckmann, B. R. Gaster, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "Heterogeneous-race-free memory models," ASPLOS, 2014.
[13] T. Sorensen, G. Gopalakrishnan, and V. Grover, "Towards shared memory consistency models for GPUs," ICS, 2013.
[14] W.-m. W. Hwu, "GPU Computing Gems Jade Edition," Morgan Kaufmann Publishers Inc., 2011.
[15] http://en.wikipedia.org/wiki/Samsung_Galaxy_S5
[16] http://en.wikipedia.org/wiki/Titan_(supercomputer)
[17] http://en.wikipedia.org/wiki/Barnes_Hut_simulation
Roadmap
• what happened to the pony (background)
• how we found the bug (methodology)
• how we are able to fix the pony (contribution)
Message passing (mp) test
• Tests how to implement a handshake idiom
• Found in Octree code for the pony visualization
[Figure labels: Data, Flag]
Methodology
• empirically explore the hardware memory model implemented on deployed NVIDIA and AMD GPUs
• develop hardware memory model testing tools for GPUs
• analyze classic (i.e. CPU) memory model properties and communication idioms in CUDA applications
• run large families of tests on GPUs as a basis for modeling and bug hunting
Message passing (mp) test
• Tests how to implement a handshake idiom
[Figure label: Stale Data]
Running tests
• however, unlike CPUs, simply running the tests did not yield any weak memory behaviours for Nvidia chips!
• we developed heuristics to run tests under a variety of stress to expose weak behaviours