SSA-based Allocation for GPU Architectures
Transcript of SSA-based Allocation for GPU Architectures
![Page 1: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/1.jpg)
SSA-based Register Allocation for GPU ArchitecturesConnor Abbott, Daniel Schürmann (Valve)
![Page 2: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/2.jpg)
SSA Form
if (...) {v1 = ...
} else {v1 = ...
}
if (...) {v1_0 = ...
} else {v1_1 = ...
}v1_2 = φ(v1_0, v1_1)
// PHINode - The PHINode class is used to represent the magical mystical PHI// node, that can not exist in nature, but can be synthesized in a computer// scientist's overactive imagination.
![Page 3: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/3.jpg)
SSA Form: Deconstruction
if (...) {v1_0 = ...v1 = v1_0
} else {v1_1 = ...v1 = v1_1
}
if (...) {v1_0 = ...
} else {v1_1 = ...
}v1_2 = φ(v1_0, v1_1)
![Page 4: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/4.jpg)
Register Allocation
v0 = load ...v1 = load ...v2 = load ...v3 = add v0, v1v4 = add v3, v2
r0 = load ...r1 = load ...r2 = load ...r0 = add r0, r1r0 = add r0, r2
![Page 5: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/5.jpg)
Register Allocation: Optimality
● As few copies as possible?● Less and well-placed spill code?● Using as few registers as possible?● Avoid pipeline stalls (RAW, WAR, …)?
![Page 6: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/6.jpg)
Traditional Register Allocation
● first deconstruct SSA, then run Register Allocation● Existing approaches: graph-coloring, linear-scan
![Page 7: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/7.jpg)
Traditional Register Allocation
● Coalescing and Register Allocation are decoupled● Spilling and Register Allocation are done at once
![Page 8: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/8.jpg)
SSA-Based Register Allocation
● "Optimal Register Allocation for SSA-form Programs in polynomial Time" by Sebastian Hack and Gerhard Goos
○ Not actually optimal!
● First run register allocation, then deconstruct SSA● Phi nodes get registers assigned!
![Page 9: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/9.jpg)
Register Allocation and SSA
● Coalescing is implicit● Spilling can be decoupled
![Page 10: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/10.jpg)
What about GPUs?
![Page 11: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/11.jpg)
Dynamic register sharing
Register File
Wave 0
Wave 1
Wave 2ALU
Memory
![Page 12: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/12.jpg)
GPUs
● Might benefit from using less registers● Spilling is expensive on GPUs● -> SSA-based allocators are much better
![Page 13: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/13.jpg)
The Algorithm
![Page 14: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/14.jpg)
First Steps
● Our initial architecture:○ No branching (single basic block)○ N registers, all exactly the same
![Page 15: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/15.jpg)
Liveness and Kill Flagsv0 = load ...v1 = load ...v3 = add v0, v1(kill)v4 = add v0(kill), v2(kill)
![Page 16: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/16.jpg)
Baby's First Register Allocator
available = {r1, r2, ..., rN}for each instruction:
for each use of V:if use.kill:
available += V.regfor each definition V:
V.reg = pick_physreg(available)available -= V.reg
![Page 17: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/17.jpg)
Handling Control Flow
![Page 18: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/18.jpg)
Handling Control Flow
● Use classic dataflow algorithm to find liveness● Blocks have live-in and live-out sets● Still have kill flags as before
![Page 19: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/19.jpg)
Interlude: Dominance and Liveness
● A dominates B if every path from the start to B goes through A● SSA definitions always dominate their uses
![Page 20: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/20.jpg)
Interlude: Dominance and Liveness
v1 = ......if ...
... = v1
...if ...
v1 = ......if ...
... = v1 (kill) ... = v1...if ...
... = v1 (kill)
![Page 21: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/21.jpg)
Handling Control Flow
for each block, ordered by dominance:available = {r1, ..., rN}foreach live-in value V:
available -= V.reg// main part same as beforefor each instruction in block:
...
![Page 22: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/22.jpg)
Phi Nodes?
● We may assign phi sources and destination to different registers● Phi nodes happen in parallel
![Page 23: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/23.jpg)
Phi Nodes Example
a = ...d = ...if (...) {
e = ac = d
} else {e = ...c = a
}... = e... = c
a = ...d = ...if (...) {
if:} else {
else:e_0 = ...
}e_1 = φ(if: a, else: e_0)c = φ(if: d, else: a)
![Page 24: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/24.jpg)
Phi Nodes Example, Continued
a:r0 = ...d:r1 = ...if (...) {
if:} else {
else:e_0:r1 = ...
}e_1:r0 = φ(if: a:r0, else: e_0:r1)c:r1 = φ(if: d:r1, else: a:r0)
![Page 25: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/25.jpg)
Swap Instructions
● Many targets already have a suitable swap instruction● Xor trick:
x = x ^ yy = y ^ xx = x ^ y
![Page 26: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/26.jpg)
Resolving Phi Nodes
● May have to split critical edges● Create a transfer graph, resolve piece-by-piece● Similar to SSA deconstruction
○ See "Revisiting Out-of-SSA Translation for Correctness, Code Quality, and Efficiency" by Boissinot et. al.
● Need to consider affinities in pick_physreg()
![Page 27: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/27.jpg)
Live Range Splitting
![Page 28: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/28.jpg)
Vector Registers
● Load/store series of registers● For example:
// equivalent to:// r0 = load ...// r1 = load ... + 1r[0:1] = load.v2 ......store.v2 r[1:2] // store r1 and then r2
![Page 29: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/29.jpg)
Vector Registers and SSA
● Need to add split/collect instructions
● Similar in spirit to phi nodes
v2 = collect v0, v1store.v2 v2, ...v3 = load.v2 ...v4, v5 = split v3... = v4... = v5
![Page 30: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/30.jpg)
The Problem
v0 = load.v3 ...v1, v2, v3 = split v0... = v1 (kill)... = v3 (kill)v4 = load.v2 ...
![Page 31: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/31.jpg)
The Problemr0 r1 r2v0 v0 v0v1 v2 v3
v2 v3v2
v0:r[0-2] = load.v3 ...v1:r0, v2:r1, v3:r2 = split v0:r[0-2]... = v1:r0 (kill)... = v3:r2 (kill)v4:??? = load.v2 ...
![Page 32: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/32.jpg)
The Solution: Live Range Splitting
v0:r[0-2] = load.v3 ...v1:r0, v2:r1, v3:r2 = split v0... = v1:r0 (kill)... = v3:r2 (kill)r2 = r1v4:r[0-1] = load.v2 ...
![Page 33: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/33.jpg)
Live-Range Splitting: Worst-case Scenario
foo = load.v2 ...r0 r1 r2 r3 r4 ... rN-2 rN-1 rN
v0 v0 v1 v1 ... vM vM
![Page 34: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/34.jpg)
Live Range Splitting and Control Flow
v1:r1 = ...if (...) {
...r2 = r1 // live-range split
} else {...
}... = v1 (kill)
![Page 35: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/35.jpg)
Live Range Splitting and Control Flow
● Need to repair SSA:○ Either create new Phis○ Or copy the value back
![Page 36: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/36.jpg)
Conclusion
● Simple idea, complex implementation● Code Quality depends entirely on Coalescing● -> Workshop tomorrow at 16:35
Questions?
![Page 37: SSA-based Allocation for GPU Architectures](https://reader030.fdocuments.net/reader030/viewer/2022012716/61ae45c81556eb06eb4f800b/html5/thumbnails/37.jpg)
Live Range Splitting
● Three types of values:○ killed uses○ live-through○ definitions
● Insert parallel copies● First solution: sort/compact values
○ live-through, then killed uses