CS 758: Advanced Topics in Computer Architecture
Lecture #8: GPU Warp Scheduling Research + DRAM Basics
Professor Matthew D. Sinclair
Some of these slides were developed by Tim Rogers at Purdue University, Tor Aamodt at the University of British Columbia, Wen-mei Hwu and David Kirk at the University of Illinois at Urbana-Champaign, Sudhakar Yalamanchili at Georgia Tech, and Onur Mutlu at Carnegie Mellon University.
Slides enhanced by Matt Sinclair
Studying Warp Scheduling on GPUs
• Numerous works on manipulating different schedulers
• Most look at the SM-side issue-level warp scheduler
• Some look at TB (thread block) scheduling to cores
• The fetch scheduler and operand collector scheduler are less studied
❖ Fetch largely follows issue
• Not clear what the opportunity in the operand collector is
❖ Even an opportunity study here would be helpful
Use Memory System Feedback [MICRO 2012]
3
[Figure: performance and cache misses as a function of the number of threads actively scheduled; a GPU core in which the thread scheduler receives feedback from the processor and cache]
Programmability Case Study [MICRO 2013]: Sparse Vector-Matrix Multiply
4
• Simple Version: each thread has locality
• GPU-Optimized Version (SHOC Benchmark Suite, Oak Ridge National Labs): added complication; dependent on warp size, parallel reduction, explicit scratchpad use, divergence
• Using DAWS scheduling, the simple version is within 4% of the optimized version with no programmer input
Sources of Locality
5
• Intra-wavefront locality: Wave0 loads cache line X, and a later load of line X by Wave0 hits in the data cache
• Inter-wavefront locality: Wave0 loads cache line X, and a later load of line X by Wave1 hits in the data cache
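The two locality classes above can be sketched with a toy model (my own illustration, not from the slides): remember which wavefront first filled each cache line, and classify every later hit by whether the same wavefront or a different one touches the line.

```python
# Toy locality classifier (illustrative; infinite cache, no evictions).
def classify_hits(accesses):
    """accesses: list of (wavefront_id, cache_line). Returns (intra, inter)."""
    owner = {}            # cache line -> wavefront that first brought it in
    intra = inter = 0
    for wid, line in accesses:
        if line in owner:                 # hit
            if owner[line] == wid:
                intra += 1                # same wavefront re-references the line
            else:
                inter += 1                # a different wavefront shares the line
        else:
            owner[line] = wid             # cold miss fills the line
    return intra, inter
```

For example, Wave0 touching line X twice yields one intra-wavefront hit, while Wave0 then Wave1 touching X yields one inter-wavefront hit.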
[Figure: hits and misses per kilo-instruction (PKI) averaged over the highly cache-sensitive applications, broken into misses PKI, inter-wavefront hits PKI, and intra-wavefront hits PKI]
6
Scheduler affects access pattern
7
• Round Robin Scheduler: Wave0 issues ld A,B,C,D, then Wave1 issues ld Z,Y,X,W, and so on; the memory system sees the two wavefronts' lines interleaved
• Greedy then Oldest Scheduler: Wave0 issues ld A,B,C,D and then its next ld A,B,C,D before Wave1 issues ld Z,Y,X,W; the memory system sees Wave0's lines repeated back-to-back
Use scheduler to shape access pattern
8
Cache-Conscious Wavefront Scheduling [MICRO 2012 best paper runner-up, Top Picks 2013, CACM Research Highlight]
• Under the Greedy then Oldest Scheduler, Wave1's ld Z,Y,X,W can still evict Wave0's A,B,C,D between reuses
• CCWS keeps Wave0's repeated ld A,B,C,D together and defers Wave1's ld Z,Y,X,W, so the memory system sees the reuse
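The effect of issue order on hit rate can be reproduced with a tiny LRU cache model (an illustrative sketch; the two wavefront streams and the 4-line capacity are made up, not from the slides). Each wavefront loads its own four lines twice, so all locality is intra-wavefront:

```python
from collections import OrderedDict

def run(schedule, capacity=4):
    """Replay (wavefront, line) accesses in issue order through a small
    LRU cache; return the number of hits."""
    cache = OrderedDict()
    hits = 0
    for wave, line in schedule:
        if line in cache:
            hits += 1
            cache.move_to_end(line)           # refresh LRU position
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)     # evict least-recently-used line
            cache[line] = True
    return hits

# Two wavefronts, each loading its own 4 lines twice.
w0 = [(0, c) for c in "ABCD"] * 2
w1 = [(1, c) for c in "WXYZ"] * 2

# Round-robin interleaves the wavefronts: 8 distinct lines thrash a 4-line cache.
rr = [x for pair in zip(w0, w1) for x in pair]
# Greedy-then-oldest runs one wavefront's loads back-to-back first.
gto = w0 + w1
```

With these streams, the interleaved round-robin order gets zero hits while the greedy-then-oldest order captures every reuse.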
9
[Figure: CCWS implementation. The wave scheduler issues loads from W0, W1, and W2 to the memory unit, whose cache tags each line with the wavefront ID (WID) that filled it; evicted tags go into per-wavefront victim tag arrays, which feed a locality scoring system that tracks a score per wavefront over time. Example: W0's ld X misses and fills the cache tagged (W0, X); after eviction, a later W0: ld X probes W0's victim tags, finds X, and detects lost locality, so W0's score rises until W2's loads are no longer issued ("No W2 loads")]
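A minimal sketch of this feedback loop (structure sizes and names are illustrative, not the paper's exact hardware): a tiny FIFO cache spills evicted tags into a victim-tag table, and a probe that matches the same wavefront's ID raises that wavefront's locality score.

```python
# Illustrative CCWS-style detector: victim tags reveal lost intra-wavefront
# locality; the score would then be used to throttle younger wavefronts.
class LocalityScoring:
    def __init__(self, num_waves, capacity=2):
        self.scores = [0] * num_waves
        self.cache = {}        # line -> wavefront id that filled it
        self.order = []        # FIFO replacement order (simplified)
        self.victims = {}      # evicted line -> wavefront that owned it
        self.capacity = capacity

    def access(self, wid, line):
        if line in self.cache:
            return "hit"
        if self.victims.get(line) == wid:
            self.scores[wid] += 1          # lost locality detected for wid
        if len(self.order) >= self.capacity:
            old = self.order.pop(0)
            self.victims[old] = self.cache.pop(old)   # spill tag on eviction
        self.cache[line] = wid
        self.order.append(line)
        return "miss"
```

Here W0's re-reference of X after eviction is the event that bumps W0's score, mirroring the "detected lost locality" arrow in the figure.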
[Figure: harmonic-mean speedup over the highly cache-sensitive applications for the LRR, GTO, and CCWS schedulers]
10
Static Wavefront Limiting [Rogers et al., MICRO 2012]
• By profiling an application, we can find an optimal number of wavefronts to execute
• Does a little better than CCWS
• Limitations: requires profiling, is input dependent, and does not exploit phase behavior
11
Improve upon CCWS?
• CCWS detects bad scheduling decisions and avoids them in the future
• It would be better if we could “think ahead” / “be proactive” instead of “being reactive”
12
(13)
Divergence-Aware Warp Scheduling, T. Rogers, M. O'Connor, and T. Aamodt, MICRO 2013
Goal
• Design a scheduler to match the number of scheduled wavefronts to the L1 cache size
❖ The working set of the wavefronts fits in the cache
❖ Emphasis on intra-wavefront locality
• Differs from CCWS in being proactive
❖ Deeper look at what happens inside loops
❖ Explicitly handles divergence (both memory and control flow)
(14)
Key Idea
• Manage the relationship between control divergence, memory divergence and scheduling
Figure from T. Rogers, M. O'Connor, T. Aamodt, “Divergence-Aware Warp Scheduling,” MICRO 2013
(15)
Key Idea (2)
while (i < C[tid+1])
• Divergent branch; warp 0 and warp 1 each have intra-thread locality
• Warp 0 fills the cache with 4 references; delay warp 1
• When there is room available in the cache, schedule warp 1
• Use warp 0's behavior to predict cache interference due to warp 1
Figure from T. Rogers, M. O'Connor, T. Aamodt, “Divergence-Aware Warp Scheduling,” MICRO 2013
(16)
Goal
Make the performance of the simpler portable version equivalent to the GPU-optimized version
Figure from T. Rogers, M. O'Connor, T. Aamodt, “Divergence-Aware Warp Scheduling,” MICRO 2013
(17)
Observation
• The bulk of the accesses in a loop come from a few static load instructions
• The bulk of the locality in (these) applications is intra-loop
(18)
Distribution of Locality
The bulk of the locality comes from a few static loads in loops
Hint: can we keep data from the last iteration?
Figure from T. Rogers, M. O'Connor, T. Aamodt, “Divergence-Aware Warp Scheduling,” MICRO 2013
(19)
A Solution
• Prediction mechanisms for locality across iterations of a loop
• Schedule such that data fetched in one iteration is still present at the next iteration
• Combine with control flow divergence (how much of the footprint needs to be in the cache?)
Figure from T. Rogers, M. O'Connor, T. Aamodt, “Divergence-Aware Warp Scheduling,” MICRO 2013
(20)
Classification of Dynamic Loads
• Group static loads into equivalence classes → loads that reference the same cache line
• Identify these groups by repetition ID
• Prediction for each load by compiler or hardware
• Control flow divergence classes: converged, somewhat diverged, diverged
Figure from T. Rogers, M. O'Connor, T. Aamodt, “Divergence-Aware Warp Scheduling,” MICRO 2013
(21)
Predicting a Warp's Cache Footprint
• Entering the loop body (branch taken): create the footprint prediction
• Some threads exit the loop (not taken): the predicted footprint drops
• Exiting the loop: reinitialize the prediction
• Predict the locality usage of static loads
❖ Not all loads increase the footprint
• Combine with control divergence to predict the footprint
• Use the footprint to throttle (or not throttle) warp issue
(22)
Principles of Operation
• A prefix sum of each warp's cache footprint is used to select the warps that can be issued
EffCacheSize = kAssocFactor × TotalNumLines
• kAssocFactor scales back from a fully associative cache; empirically determined
Figure from T. Rogers, M. O'Connor, T. Aamodt, “Divergence-Aware Warp Scheduling,” MICRO 2013
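The prefix-sum issue test can be sketched as follows (a simplified model; kAssocFactor = 0.5 is an illustrative value, not the paper's empirically determined constant). Warps are considered oldest-first, and a warp may issue only while the running footprint sum still fits in the effective cache size:

```python
# Illustrative warp-selection test using a prefix sum of predicted footprints.
def schedulable_warps(footprints, total_lines, k_assoc_factor=0.5):
    """footprints: predicted cache lines per warp, oldest warp first.
    Returns the ids of warps allowed to issue."""
    eff_cache_size = k_assoc_factor * total_lines   # EffCacheSize
    allowed, prefix = [], 0
    for wid, fp in enumerate(footprints):
        prefix += fp
        if prefix <= eff_cache_size:
            allowed.append(wid)
        else:
            break            # throttle this warp and all younger warps
    return allowed
```

With a 64-line cache and four warps of 10 lines each, the effective size of 32 lines admits only the three oldest warps.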
(23)
Principles of Operation (2)
• Profile static load instructions
❖ Are they divergent?
❖ Loop repetition ID
o Assume all loads with the same base address and an offset within the same cache line access are repeated each iteration
Figure from T. Rogers, M. O'Connor, T. Aamodt, “Divergence-Aware Warp Scheduling,” MICRO 2013
(24)
Prediction Mechanisms
• Profiled Divergence-Aware Scheduling (DAWS)
❖ Uses offline profile results to dynamically determine de-scheduling decisions
• Detected Divergence-Aware Scheduling (DAWS)
❖ Behaviors derived at run time drive de-scheduling decisions
o Loops that exhibit intra-warp locality
o Static loads are characterized as divergent or convergent
(25)
Extensions for DAWS
[Figure: SIMT core pipeline (I-Fetch, Decode, I-Buffer, Issue, operand collector (OC), PRF, scalar pipelines, D-Cache with an all-hit check, Writeback), with the DAWS additions marked]
(26)
Operation: Tracking
• Basis for throttling: profile-based information and the number of active lanes
• One entry per warp issue slot
• Entries are created/removed at loop begin/end
Figure from T. Rogers, M. O'Connor, T. Aamodt, “Divergence-Aware Warp Scheduling,” MICRO 2013
(27)
Operation: Prediction
• Sum the contributions of the static loads in this loop
❖ Add #active-lanes cache lines for divergent loads
❖ Add 2 for converged loads
❖ Count loads in the same equivalence class only once (unless divergent)
• Generally only consider de-scheduling warps in loops
❖ Since most of the activity is here
• Can be extended to non-loop regions by associating non-loop code with the next loop
Figure from T. Rogers, M. O'Connor, T. Aamodt, “Divergence-Aware Warp Scheduling,” MICRO 2013
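The summation rule above can be sketched directly (an illustrative model; the load representation as (equivalence class, divergent?) pairs is my own):

```python
# Illustrative footprint prediction for one warp in one loop.
def predict_footprint(loads, active_lanes):
    """loads: list of (equivalence_class_id, is_divergent) static loads.
    Returns the predicted footprint in cache lines."""
    seen = set()
    lines = 0
    for eq_class, divergent in loads:
        if divergent:
            lines += active_lanes        # each active lane may touch its own line
        elif eq_class not in seen:
            seen.add(eq_class)
            lines += 2                   # converged loads add a fixed 2 lines
    return lines
```

For example, two converged loads in the same equivalence class plus one divergent load with 32 active lanes predict 2 + 32 = 34 lines; control divergence lowers the prediction through the active-lane count.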
(28)
Operation: Nested Loops
for (i = 0; i < limitx; i++) {
  ....
  for (j = 0; j < limity; j++) { .... }
  ..
}
• On entry, update the prediction to that of the inner loop
• On re-entry, predict based on the inner loop's predictions
• On exit, do not clear the prediction
• De-scheduling of warps is determined by inner-loop behaviors!
• Predictions are re-used based on the inner-most loops, which is where most of the data re-use is found
(29)
Detected DAWS: Prediction
• Detect both memory divergence and intra-loop repetition at run time
• A sampling warp (>2 active threads) profiles each loop
• Fill per-PC load entries based on run-time information; a load is entered if it exhibits locality
Figure from T. Rogers, M. O'Connor, T. Aamodt, “Divergence-Aware Warp Scheduling,” MICRO 2013
(30)
Detected DAWS: Classification
• Increment or decrement a counter depending on the number of memory accesses a load generates
• Create equivalence classes of loads (by checking their PCs)
Figure from T. Rogers, M. O'Connor, T. Aamodt, “Divergence-Aware Warp Scheduling,” MICRO 2013
(31)
Performance
Little to no degradation
Figure from T. Rogers, M. O'Connor, T. Aamodt, “Divergence-Aware Warp Scheduling,” MICRO 2013
(32)
Performance
Significant intra-warp locality
SPMV-scalar normalized to the best SPMV-vector
Figure from T. Rogers, M. O'Connor, T. Aamodt, “Divergence-Aware Warp Scheduling,” MICRO 2013
(33)
Summary
• If we can characterize warp-level memory reference locality, we can use this information to minimize cache interference through scheduling constraints
• A proactive scheme outperforms reactive management
• Understand the interactions between memory divergence and control divergence
(34)
OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance
A. Jog et al., ASPLOS 2013
Goal
• Understand the memory effects of scheduling from deeper within the memory hierarchy
• Minimize idle cycles induced by stalling warps waiting on memory references
Off-chip Bandwidth is Critical!
35
[Figure: percentage of total execution cycles wasted waiting for data to come back from DRAM, for a large set of GPGPU applications split into Type-1 and Type-2 applications; Type-1 applications average 55%, and the overall average is 32%]
Courtesy A. Jog, “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance,” ASPLOS 2013
(36)
Sources of Idle Cycles
• Warps stalled waiting for a memory reference
❖ Cache miss
❖ Service at the memory controller
❖ Row buffer miss in DRAM
❖ Latency in the network (not addressed in this paper)
• The last-warp effect
• The last-CTA effect
• Lack of multiprogrammed execution
❖ One (small) kernel at a time
(37)
Impact of Idle Cycles
Figure from A. Jog et al., “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance,” ASPLOS 2013
High-Level View of a GPU
[Figure: threads grouped into warps (W), and warps grouped into Cooperative Thread Arrays (CTAs), run on SIMT cores, each with a scheduler, ALUs, and L1 caches; the cores connect through an interconnect to the L2 cache and DRAM]
Courtesy A. Jog, “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance,” ASPLOS 2013
CTA-Assignment Policy (Example)
39
[Figure: a multi-threaded CUDA kernel launches CTA-1 through CTA-4; CTA-1 and CTA-3 are assigned to SIMT Core-1 and CTA-2 and CTA-4 to SIMT Core-2, each core with its own warp scheduler, ALUs, and L1 caches]
Courtesy A. Jog, “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance,” ASPLOS 2013
(40)
Organizing CTAs Into Groups
• Set the minimum number of warps equal to the number of pipeline stages
❖ Same philosophy as the two-level warp scheduler
• Use the same CTA grouping/numbering across SMs?
[Figure: SIMT Core-1 holds CTA-1 and CTA-3, SIMT Core-2 holds CTA-2 and CTA-4]
Figure from A. Jog et al., “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance,” ASPLOS 2013
Warp Scheduling Policy
41
◼ All launched warps on a SIMT core have equal priority
❑ Round-robin execution
◼ Problem: many warps stall at long-latency operations at roughly the same time
[Figure: all CTAs' warps compute with equal priority, then all send memory requests at once, so the SIMT core stalls until the data returns]
Courtesy A. Jog, “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance,” ASPLOS 2013
Solution
42
• Form warp groups (Narasiman, MICRO '11)
• CTA-aware grouping
• Group switch is round-robin
[Figure: one CTA group computes and sends its memory requests while the other group computes, overlapping memory latency with computation; cycles are saved relative to all warps stalling together]
Courtesy A. Jog, “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance,” ASPLOS 2013
(43)
Two-Level Round-Robin Scheduler
[Figure: CTAs organized into groups (Group 0 through Group 3), with round-robin (RR) selection among groups and among the warps within a group]
• Agnostic to when pending misses are satisfied
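A minimal sketch of such a two-level pick (illustrative, not the exact hardware arbiter): stay within the current group, rotating among its warps, and switch round-robin to the next group only when every warp in the current group is stalled.

```python
# Illustrative two-level round-robin warp picker.
class TwoLevelRR:
    def __init__(self, groups):
        self.groups = groups          # list of lists of warp ids
        self.g = 0                    # current group pointer
        self.w = [0] * len(groups)    # per-group warp pointer

    def pick(self, ready):
        """ready: set of warp ids able to issue. Returns a warp id or None."""
        for _ in range(len(self.groups)):
            grp = self.groups[self.g]
            for _ in range(len(grp)):                     # RR inside the group
                wid = grp[self.w[self.g]]
                self.w[self.g] = (self.w[self.g] + 1) % len(grp)
                if wid in ready:
                    return wid
            self.g = (self.g + 1) % len(self.groups)      # whole group stalled
        return None
```

With groups [[0, 1], [2, 3]], the picker cycles through warps 0 and 1 while they are ready and only falls through to group 1 once group 0 stalls on memory.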
Objective 1: Improve Cache Hit Rates
44
[Figure: timeline comparison. No switching: CTA 1, 3, 5, and 7 run in turn, and CTA 1's scheduling is unchanged when its data arrives; 4 CTAs share the cache in time T. Switching: the core switches to CTA 1 when its data arrives; only 3 CTAs run in time T, but fewer CTAs access the cache concurrently → less cache contention]
Courtesy A. Jog, “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance,” ASPLOS 2013
Reduction in L1 Miss Rates
45
◼ Limited benefits for cache-insensitive applications
◼ What is happening deeper in the memory system?
[Figure: normalized L1 miss rates for SAD, SSC, BFS, KMN, IIX, SPMV, BFSR, and the average, under Round-Robin, CTA-Grouping, and CTA-Grouping-Prioritization; the annotated average reductions are 8% and 18%]
Courtesy A. Jog, “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance,” ASPLOS 2013
(46)
The Off-Chip Memory Path
[Figure: CU 0 to CU 15 and CU 16 to CU 31 connect through six memory controllers (MC) to off-chip GDDR5 channels]
• Access patterns?
• Ordering and buffering?
(47)
Inter-CTA Locality
[Figure: CTA-1 and CTA-3 on one core, CTA-2 and CTA-4 on another, each core with its own warp scheduler, ALUs, and L1 caches, all issuing requests to the DRAM channels]
How do CTAs interact at the MC and in DRAM?
(48)
Impact of the Memory Controller
• Memory scheduling policies
❖ Optimize BW vs. memory latency
• Impact of row buffer access locality
• Cache lines?
(49)
Row Buffer Locality
The DRAM Subsystem
(51)
DRAM Subsystem Organization
• Channel
• DIMM
• Rank
• Chip
• Bank
• Row/Column
51
(52)
Page Mode DRAM
• A DRAM bank is a 2D array of cells: rows × columns
• A “DRAM row” is also called a “DRAM page”
• “Sense amplifiers” are also called the “row buffer”
• Each address is a <row, column> pair
• Access to a “closed row”
❖ Activate command opens the row (places it into the row buffer)
❖ Read/write command reads/writes a column in the row buffer
❖ Precharge command closes the row and prepares the bank for the next access
• Access to an “open row”
❖ No need for an activate command
52
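The closed-row and open-row cases above give three access outcomes, which a small sketch of the command sequences makes concrete (simplified; real controllers also enforce timing constraints like tRCD and tRP):

```python
# Illustrative open-row policy for one DRAM bank.
def access_bank(open_row, row):
    """Returns (outcome, command sequence, new open row)."""
    if open_row is None:
        # Bank is idle: the row must be activated before the column access.
        return "miss", ["ACTIVATE", "READ"], row
    if open_row == row:
        # Row-buffer hit: the column access proceeds directly.
        return "hit", ["READ"], row
    # Row-buffer conflict: close the open row, then activate the new one.
    return "conflict", ["PRECHARGE", "ACTIVATE", "READ"], row
```

Replaying the access sequence from the next slide, (Row 0, Col 0) is a miss, (Row 0, Col 1) and (Row 0, Col 85) are hits, and (Row 1, Col 0) is a conflict.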
(53)
DRAM Bank Operation
53
[Figure: a bank is a 2D array of rows and columns with a row decoder, row buffer, and column mux. Access (Row 0, Column 0): the row buffer starts empty, so row address 0 activates Row 0 into the row buffer and column address 0 selects the data. Accesses (Row 0, Column 1) and (Row 0, Column 85): HITs, since Row 0 is already open only the column address changes. Access (Row 1, Column 0): CONFLICT! Row 1 must be activated, replacing Row 0 in the row buffer]
(54)
The DRAM Chip
• Consists of multiple banks (2-16 in Synchronous DRAM)
• Banks share command/address/data buses
• The chip itself has a narrow interface (4-16 bits per read)
54
![Page 55: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/55.jpg)
(55)
128M x 8-bit DRAM Chip
55
![Page 56: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/56.jpg)
(56)
DRAM Rank and Module
• Rank: Multiple chips operated together to form a wide interface
• All chips comprising a rank are controlled at the same time
❖ Respond to a single command
❖ Share address and command buses, but provide different data
❖ Like DRAM “SIMD”
• A DRAM module consists of one or more ranks
❖ E.g., DIMM (dual inline memory module)
❖ This is what you plug into your motherboard
• If we have chips with 8-bit interface, to read 8 bytes in a single access, use 8 chips in a DIMM
56
![Page 57: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/57.jpg)
(57)
A 64-bit Wide DIMM (One Rank)
57
[Figure: one rank of eight DRAM chips sharing the command bus; each chip supplies 8 bits of the 64-bit data bus.]
![Page 58: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/58.jpg)
(58)
Multiple DIMMs
58
• Advantages:
❖ Enables even higher capacity
• Disadvantages:
❖ Interconnect complexity and energy consumption can be high
![Page 59: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/59.jpg)
(59)
DRAM Channels
• 2 independent channels: 2 memory controllers (above)
• 2 dependent/lockstep channels: 1 memory controller with a wide interface (not shown above)
59
![Page 60: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/60.jpg)
(60)
Generalized Memory Structure
60
![Page 61: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/61.jpg)
(61)
Generalized Memory Structure
61
![Page 62: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/62.jpg)
The DRAM Subsystem
The Top Down View
![Page 63: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/63.jpg)
(63)
DRAM Subsystem Organization
• Channel
• DIMM
• Rank
• Chip
• Bank
• Row/Column
63
![Page 64: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/64.jpg)
(64)
The DRAM subsystem
Memory channel Memory channel
DIMM (Dual in-line memory module)
Processor
“Channel”
![Page 65: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/65.jpg)
(65)
Breaking down a DIMM
DIMM (Dual in-line memory module)
Side view
Front of DIMM Back of DIMM
![Page 66: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/66.jpg)
(66)
Breaking down a DIMM
DIMM (Dual in-line memory module)
Side view
Front of DIMM Back of DIMM
Rank 0: collection of 8 chips Rank 1
![Page 67: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/67.jpg)
(67)
Rank
Rank 0 (Front), Rank 1 (Back)
[Figure: both ranks share the 64-bit Data<0:63> and Addr/Cmd buses on the memory channel; chip-select CS<0:1> picks which rank responds.]
![Page 68: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/68.jpg)
(68)
Breaking down a Rank
Rank 0 = Chip 0 … Chip 7
[Figure: each chip drives an 8-bit slice of Data<0:63> — Chip 0 drives <0:7>, Chip 1 drives <8:15>, …, Chip 7 drives <56:63>.]
![Page 69: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/69.jpg)
(69)
Breaking down a Chip
[Figure: Chip 0 contains multiple banks (Bank 0, …), all sharing the chip's 8-bit <0:7> data interface.]
![Page 70: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/70.jpg)
(70)
Breaking down a Bank
Bank 0
[Figure: rows 0 … 16k-1, each 2kB wide; a column is 1B. An activated row sits in the 2kB row buffer, and 1B columns are read from it over the chip's <0:7> interface.]
![Page 71: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/71.jpg)
(71)
DRAM Subsystem Organization
• Channel
• DIMM
• Rank
• Chip
• Bank
• Row/Column
71
![Page 72: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/72.jpg)
(72)
Example: Transferring a cache block
[Figure: a 64B cache block at physical address 0x40 is mapped to Channel 0, DIMM 0, Rank 0.]
![Page 73: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/73.jpg)
(73)
Example: Transferring a cache block
[Figure: the same 64B block is spread across Rank 0's Chips 0–7, each supplying one 8-bit slice of Data<0:63>.]
![Page 74: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/74.jpg)
(74)
Example: Transferring a cache block
[Figure: Row 0 is activated and Column 0 is read in every chip.]
![Page 75: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/75.jpg)
(75)
Example: Transferring a cache block
[Figure: the first 8B of the block (Row 0, Column 0) arrive, one byte per chip, over Data<0:63>.]
(76)
Example: Transferring a cache block
[Figure: Column 1 of the open Row 0 is read next.]
![Page 77: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/77.jpg)
(77)
Example: Transferring a cache block
[Figure: the next 8B (Row 0, Column 1) arrive; 16B of the 64B block have now been transferred.]
![Page 78: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/78.jpg)
(78)
Example: Transferring a cache block
[Figure: the remaining columns of Row 0 are read the same way.]
A 64B cache block takes 8 I/O cycles to transfer.
During the process, 8 columns are read sequentially.
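The arithmetic above can be checked directly (chip and bus widths taken from the slides; the variable names are mine):

```python
# Sketch: how a 64B cache block maps onto a rank of eight x8 DRAM chips.
BLOCK_BYTES = 64
CHIPS_PER_RANK = 8
CHIP_IFACE_BYTES = 1              # each chip has an 8-bit (1-byte) interface
BUS_BYTES = CHIPS_PER_RANK * CHIP_IFACE_BYTES  # 8B move per I/O cycle

beats = BLOCK_BYTES // BUS_BYTES               # I/O cycles on the data bus
cols_per_chip = BLOCK_BYTES // CHIPS_PER_RANK  # columns read in each chip
print(beats, cols_per_chip)  # 8 8
```

This matches the slide: 8 I/O cycles, with 8 sequential column reads per chip.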
![Page 79: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/79.jpg)
(79)
Latency Components: Basic DRAM Operation
• CPU → controller transfer time
• Controller latency
❖ Queuing & scheduling delay at the controller
❖ Access converted to basic commands
• Controller → DRAM transfer time
• DRAM bank latency
❖ Simple CAS if row is “open” OR
❖ RAS + CAS if array precharged OR
❖ PRE + RAS + CAS (worst case)
• DRAM → CPU transfer time (through controller)
79
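The three bank-latency cases can be sketched numerically, with illustrative (assumed) timing values — real tRP/tRCD/CL numbers are device-specific:

```python
# Sketch: DRAM bank latency for the three cases above, in ns (assumed values).
tRP  = 15   # precharge
tRCD = 15   # activate (RAS-to-CAS delay)
tCL  = 15   # CAS (column access) latency

open_row   = tCL                 # row already in the row buffer: CAS only
closed_row = tRCD + tCL          # array precharged: RAS + CAS
conflict   = tRP + tRCD + tCL    # wrong row open: PRE + RAS + CAS (worst case)
print(open_row, closed_row, conflict)  # 15 30 45
```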
![Page 80: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/80.jpg)
(80)
Multiple Banks (Interleaving) and Channels
• Multiple banks
❖ Enable concurrent DRAM accesses
❖ Bits in address determine which bank an address resides in
• Multiple independent channels serve the same purpose
❖ But they are even better because they have separate data buses
❖ Increased bus bandwidth
• Enabling more concurrency requires reducing
❖ Bank conflicts
❖ Channel conflicts
• How to select/randomize bank/channel indices in address?
❖ Lower order bits have more entropy
❖ Randomizing hash functions (XOR of different address bits)
80
![Page 81: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/81.jpg)
(81)
How Multiple Banks/Channels Help
81
![Page 82: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/82.jpg)
(82)
Multiple Channels
• Advantages
❖ Increased bandwidth
❖ Multiple concurrent accesses (if independent channels)
• Disadvantages
❖ Higher cost than a single channel
o More board wires
o More pins (if on-chip memory controller)
82
![Page 83: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/83.jpg)
(83)
Address Mapping (Single Channel)
• Single-channel system with 8-byte memory bus
❖ 2GB memory, 8 banks, 16K rows & 2K columns per bank
• Row interleaving
❖ Consecutive rows of memory in consecutive banks
• Cache block interleaving
o Consecutive cache block addresses in consecutive banks
o 64 byte cache blocks
o Accesses to consecutive cache blocks can be serviced in parallel
o How about random accesses? Strided accesses?
83
Row interleaving: | Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits) |
Cache block interleaving: | Row (14 bits) | High Column (8 bits) | Bank (3 bits) | Low Col. (3 bits) | Byte in bus (3 bits) |
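A minimal sketch of decoding a physical address under the row-interleaved layout above (2GB system, 8 banks, 16K rows × 2K columns; the helper name is illustrative):

```python
# Sketch: slice a 31-bit physical address under row interleaving:
# | Row (14) | Bank (3) | Column (11) | Byte in bus (3) |
def decode(addr):
    byte = addr & 0x7             # 3 bits: byte within the 8B bus transfer
    col  = (addr >> 3) & 0x7FF    # 11 bits: column
    bank = (addr >> 14) & 0x7     # 3 bits: bank
    row  = (addr >> 17) & 0x3FFF  # 14 bits: row
    return row, bank, col, byte

# Consecutive 16KB regions (one full row: 2K columns x 8B) land in
# consecutive banks:
print(decode(0x0))      # (0, 0, 0, 0)
print(decode(1 << 14))  # (0, 1, 0, 0)
```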
![Page 84: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/84.jpg)
(84)
Bank Mapping Randomization
• DRAM controller can randomize the address mapping to banks so that bank conflicts are less likely
84
[Figure: the 3-bit bank index is formed by XORing the plain 3 bank-address bits with other bits of the address, leaving the Column (11 bits) and Byte-in-bus (3 bits) fields unchanged.]
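A sketch of one such randomizing hash, assuming (illustratively) that the low row bits are XORed into the 3 bank-index bits:

```python
# Sketch: XOR low row bits into the bank index so accesses with a
# row-sized stride (which share the plain bank bits) spread across banks.
def bank_xor(addr):
    plain_bank = (addr >> 14) & 0x7  # the 3 physical bank bits
    row_low = (addr >> 17) & 0x7     # low 3 bits of the row number (assumed)
    return plain_bank ^ row_low

a, b = 0x0, 1 << 17                      # same plain bank bits, different rows
print((a >> 14) & 0x7, (b >> 14) & 0x7)  # 0 0  -> would conflict in one bank
print(bank_xor(a), bank_xor(b))          # 0 1  -> spread across two banks
```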
![Page 85: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/85.jpg)
(85)
Address Mapping (Multiple Channels)
• Where are consecutive cache blocks?
85
Row interleaving with a channel bit: | C | Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits) | — the channel bit C can sit at various positions in the address.
Cache block interleaving with a channel bit: | Row (14 bits) | High Column (8 bits) | Bank (3 bits) | Low Col. (3 bits) | C | Byte in bus (3 bits) | — again, C can sit at various positions.
![Page 86: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/86.jpg)
(86)
Interaction with Virtual→Physical Mapping
• Operating System influences where an address maps to in DRAM
• Operating system can control which bank/channel/rank a virtual page is mapped to.
• It can perform page coloring to minimize bank conflicts
• Or to minimize inter-application interference
86
PA (DRAM view): | Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits) |
VA: | Virtual Page number (52 bits) | Page offset (12 bits) |
PA: | Physical Frame number (19 bits) | Page offset (12 bits) |
![Page 87: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/87.jpg)
(87)
DRAM Refresh (I)
• DRAM capacitor charge leaks over time
• The memory controller needs to read each row periodically to restore the charge
❖ Activate + precharge each row every N ms
❖ Typical N = 64 ms
• Implications on performance?
-- DRAM bank unavailable while refreshed
-- Long pause times: If we refresh all rows in burst, every 64ms the DRAM will be unavailable until refresh ends
• Burst refresh: All rows refreshed immediately after one another
• Distributed refresh: Each row refreshed at a different time, at regular intervals
87
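The distributed-refresh interval implied by these numbers (16K rows per bank, taken from the earlier bank example) can be computed directly:

```python
# Sketch: distributed refresh spreads the 64 ms refresh window over all rows.
REFRESH_WINDOW_MS = 64
ROWS_PER_BANK = 16 * 1024   # from the earlier "Breaking down a Bank" slide

interval_us = REFRESH_WINDOW_MS * 1000 / ROWS_PER_BANK
print(round(interval_us, 2))  # refresh one row every ~3.91 us
```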
![Page 88: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/88.jpg)
(88)
DRAM Refresh (II)
• Distributed refresh eliminates long pause times
88
![Page 89: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/89.jpg)
(89)
• Downsides of refresh
-- Energy consumption: Each refresh consumes energy
-- Performance degradation: DRAM rank/bank unavailable while refreshed
-- QoS/predictability impact: (Long) pause times during refresh
Downsides of DRAM Refresh
89
![Page 90: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/90.jpg)
(90)
Back to the paper…
![Page 91: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/91.jpg)
CTA Data Layout (A Simple Example)
91
[Figure: a 4×4 data matrix A(0,0)…A(3,3) split into four 2×2 CTAs (CTA 1–4). In the row-major DRAM layout, matrix row A(0,*) maps to Bank 1, A(1,*) to Bank 2, A(2,*) to Bank 3, and A(3,*) to Bank 4, so consecutive CTAs touch the same DRAM rows.]
Average percentage of consecutive CTAs (out of total CTAs) accessing the same row = 64%
Courtesy A. Jog, “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance, “ ASPLOS 2013
![Page 92: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/92.jpg)
L2 Cache
Implications of high CTA-row sharing
92
[Figure: SIMT Core-1 runs CTA-1 and CTA-3, SIMT Core-2 runs CTA-2 and CTA-4. Both cores' CTA prioritization orders favor consecutive CTAs, which share DRAM rows, so their warps' requests concentrate on a few banks (Bank-1/Row-1, Bank-2/Row-2) through the L2 cache while the other banks sit idle.]
Courtesy A. Jog, “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance, “ ASPLOS 2013
![Page 93: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/93.jpg)
[Figure: two ways of scheduling the same requests to Bank-1 (Row-1) and Bank-2 (Row-2). Left: all requests target one row per bank — high row locality, but low bank-level parallelism. Right: requests are spread across the banks — lower row locality, but higher bank-level parallelism.]
Courtesy A. Jog, “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance, “ ASPLOS 2013
![Page 94: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/94.jpg)
(94)
Some Additional Details
• Spread references from multiple CTAs (on multiple SMs) across row buffers in distinct banks
• Do not use same CTA group prioritization across SMs
❖ Play the odds
• What happens with applications with unstructured, irregular memory access patterns?
![Page 95: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/95.jpg)
L2 Cache
Objective 2: Improving Bank Level Parallelism
[Figure: SIMT Core-1 and SIMT Core-2 now use different CTA prioritization orders, so their warps' requests reach all four banks (Rows 1–4) concurrently.]
11% increase in bank-level parallelism
14% decrease in row buffer locality
Courtesy A. Jog, “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance, “ ASPLOS 2013
![Page 96: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/96.jpg)
L2 Cache
Objective 3: Recovering Row Locality
[Figure: memory-side prefetching pulls the rest of each open row into the L2 cache, so later warps on the other SIMT core see L2 hits instead of reopening the same DRAM rows.]
Courtesy A. Jog, “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance, “ ASPLOS 2013
![Page 97: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/97.jpg)
Memory Side Prefetching
◼ Prefetch the so-far-unfetched cache lines in an already open row into the L2 cache, just before the row is closed
◼ What to prefetch?
❑ Sequentially prefetch the cache lines that were not accessed by demand requests
❑ Sophisticated schemes are left as future work
◼ When to prefetch?
❑ Opportunistic in nature
❑ Option 1: Prefetching stops as soon as a demand request comes for another row (demands are always critical)
❑ Option 2: Give prefetching more time; make demands wait if there are not many (demands are NOT always critical)
Courtesy A. Jog, “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance, “ ASPLOS 2013
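Option 1 can be sketched as follows (the function and callback names are illustrative, not the paper's implementation):

```python
# Sketch (Option 1): before a row closes, sequentially prefetch into L2 the
# lines that demand requests never touched, stopping immediately if a
# demand request arrives for a different row.
def drain_row(lines_in_row, demanded, incoming_demand_row, this_row):
    prefetched = []
    for line in lines_in_row:
        if incoming_demand_row() != this_row:  # demand for another row: stop
            break
        if line not in demanded:               # skip already-fetched lines
            prefetched.append(line)            # issue prefetch into the L2
    return prefetched

# e.g. row 7 holds lines 0..7, lines {0, 3} were demand-fetched, and the
# pending demand still targets row 7:
print(drain_row(range(8), {0, 3}, lambda: 7, 7))  # [1, 2, 4, 5, 6, 7]
```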
![Page 98: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/98.jpg)
IPC results (Normalized to Round-Robin)
[Figure: normalized IPC (y-axis 0.6–3.0) across SAD, PVC, SSC, BFS, MUM, CFD, KMN, SCP, FWT, IIX, SPMV, JPEG, BFSR, SC, FFT, SD2, WP, PVR, BP, and AVG-T1, comparing Objective 1, Objective (1+2), Objective (1+2+3), and Perfect-L2. Average gains: 25%, 31%, and 33% respectively, vs. 44% for Perfect-L2.]
◼ Objective (1+2+3) comes within 11% of Perfect-L2
Courtesy A. Jog, “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance, “ ASPLOS 2013
![Page 99: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/99.jpg)
(99)
Summary
• Coordinated scheduling across SMs, CTAs, and warps
• Consideration of effects deeper in the memory system
• Coordinating warp residence in the core with the presence of corresponding lines in the cache
![Page 100: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/100.jpg)
(100)
CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads
S.-Y. Lee, A. A. Kumar, and C. J. Wu
ISCA 2015
Goal
• Reduce warp divergence and hence increase throughput
• The key is the identification of critical (lagging) warps
• Manage resources and scheduling decisions to speed up the execution of critical warps, thereby reducing divergence
![Page 101: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/101.jpg)
(101)
Review: Resource Limits on Occupancy
[Figure: the kernel distributor feeds the SM scheduler, which limits the #thread blocks dispatched to each SM over DRAM-backed memory. Within an SM (Stream Multiprocessor): warp schedulers and warp contexts, the register file (limits the #threads), the SPs (Stream Processors), thread block control (TB 0), and L1/shared memory (limits the #thread blocks, with locality effects) all bound occupancy.]
![Page 102: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/102.jpg)
(102)
Evolution of Warps in TB
• Coupled lifetimes of warps in a TB
❖ Start at the same time
❖ Synchronization barriers
❖ Kernel exit (implicit synchronization barrier)
Completed warps
Figure from P. Xiang et al., “Warp Level Divergence: Characterization, Impact, and Mitigation”
Region where latency hiding is less effective
![Page 103: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/103.jpg)
(103)
Warp Criticality Problem
[Figure: GPU organization — host CPU and interconnection bus; kernel management unit with HW work queues and pending kernels; kernel distributor entries (PC, Dim, Param, ExeBL); SMX scheduler and control registers; SMXs with cores, registers, L1 cache/shared memory, warp schedulers, and warp contexts; memory controller, L2 cache, and DRAM. Within a TB, completed warps leave idle warp contexts until the last warp finishes: never-allocated (available) registers are spatial underutilization, while registers still held by completed warps are temporal underutilization.]
Manage resources and schedules around critical warps
![Page 104: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/104.jpg)
(104)
The Warp Criticality Problem
• Significant warp execution disparity for warps in the same thread block
[Figure, left: execution time of warps w0–w15 in a breadth-first-search thread block, sorted by criticality and normalized to the fastest warp (y-axis 0.8–1.5). Right: average execution-time disparity, split into computation and idle time, for bfs, b+tree, heartwall, kmeans, needle, srad_1, strcltr_small, and their average (y-axis 0–100%; the average disparity is about 40%).]
*Lee and Wu, “CAWS: Criticality-Aware Warp Scheduling for GPGPU Workloads,” PACT-2014
![Page 105: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/105.jpg)
(105)
Research Questions
• What is the source of warp criticality?
• How can we effectively accelerate critical warp execution?
5/28
![Page 106: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/106.jpg)
(106)
Source of Warp Criticality
• Workload Imbalance
• Diverging Branch Behavior
• Memory Contention and Memory Access Latency
• Execution Order of Warp Scheduling
7/28
![Page 107: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/107.jpg)
(107)
Workload Imbalance & Diverging Branch
8/28
[Figure: for warps w0–w15 (sorted by criticality), dynamic instruction count and execution time, both normalized to the fastest warp, rise together.]
• Workload imbalance or diverging branch behavior gives warps different dynamic instruction counts.
![Page 108: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/108.jpg)
(108)
Memory Contention
9/28
[Figure: execution time of warps w0–w15 (sorted by criticality), normalized to the fastest warp (y-axis 0–2.5), split into total memory-system latency and other latency; memory latency accounts for most of the disparity.]
• While warps experience different latencies to access memory, memory contention can induce warp criticality.
![Page 109: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/109.jpg)
(109)
Warp Scheduling Order
10/28
[Figure: execution time of warps w0–w15 (sorted by criticality), normalized to the fastest warp (y-axis 0–2.5), split into scheduler latency and other latency.]
• The warp scheduler may introduce additional stall cycles for a ready warp, resulting in warp criticality
![Page 110: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/110.jpg)
(110)
CAWA: Criticality-Aware Warp Acceleration
Coordinated warp scheduling and cache prioritization design
• Criticality Prediction Logic (CPL)
  – Predicting and identifying the critical warp at runtime
• greedy Criticality-Aware Warp Scheduler (gCAWS)
  – Prioritizing and accelerating the critical warp execution
• Criticality-Aware Cache Prioritization (CACP)
  – Prioritizing and allocating cache lines for critical warp reuse
![Page 112: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/112.jpg)
(112)
CAWA CPL: Criticality Prediction Logic
14/28
• Estimates the number of additional cycles a warp may experience
• nInst is decremented whenever an instruction is executed

Criticality = nInst × w.CPIavg + nStall

The nInst term captures instruction count disparity from workload imbalance and diverging branches; the nStall term captures memory latency and scheduling latency.
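As an illustrative sketch (not the paper's hardware design), the per-warp bookkeeping behind this formula could look as follows. The counter names and the EWMA update for CPIavg are assumptions; CAWA implements this with per-warp hardware counters.

```python
# Sketch of CPL's per-warp criticality bookkeeping (illustrative only).
class WarpCPL:
    def __init__(self, n_inst_estimate):
        self.n_inst = n_inst_estimate  # estimated remaining instructions
        self.cpi_avg = 1.0             # running average cycles per instruction
        self.n_stall = 0               # accumulated stall cycles

    def retire(self, cycles):
        """On instruction completion: update average CPI, decrement nInst."""
        self.cpi_avg = 0.9 * self.cpi_avg + 0.1 * cycles  # assumed EWMA
        self.n_inst -= 1

    def stall(self, cycles):
        """On a stall (memory or scheduler): accumulate stall cycles."""
        self.n_stall += cycles

    @property
    def criticality(self):
        # Criticality = nInst * w.CPIavg + nStall
        return self.n_inst * self.cpi_avg + self.n_stall
```

A warp that lags (more remaining instructions, higher CPI, or more stall cycles) accumulates a higher criticality value and is favored by the scheduler.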
![Page 113: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/113.jpg)
(113)
CAWA: Criticality-Aware Warp Acceleration
• Criticality Prediction Logic (CPL)
  – Predicting and identifying the critical warp at runtime
• greedy Criticality-Aware Warp Scheduler (gCAWS)
  – Prioritizing and accelerating the critical warp execution
• Criticality-Aware Cache Prioritization (CACP)
  – Prioritizing and allocating cache lines for critical warp reuse
gCAWS: reduce delay from the scheduler
![Page 114: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/114.jpg)
(114)
CAWA gCAWS: greedy Criticality-Aware Warp Scheduler
16/28
• Prioritizes warps based on the criticality computed by CPL
• Executes warps in a greedy* manner
  – Select the most critical ready warp
  – Keep executing the selected warp until it stalls

Warp Pool   Criticality
Warp 0      5
Warp 1      10
Warp 2      3
Warp 3      7

Warp Scheduler Selection Sequence
• Traditional approach (e.g., RR, 2L, GTO): W0→W1→W2→W3
• gCAWS: W1→W3→W0→W2

*Rogers et al., “Cache-Conscious Wavefront Scheduling,” MICRO ’12
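The selection policy above can be sketched as follows. This is an illustrative model, assuming warps expose `ready` and `criticality` attributes; the real scheduler is hardware selection logic, not software.

```python
# Sketch of gCAWS warp selection (illustrative only): stay with the
# current warp greedily until it stalls, then pick the most critical
# ready warp from the pool.
def gcaws_select(warps, current):
    """warps: objects with .ready (bool) and .criticality (number)."""
    if current is not None and current.ready:
        return current  # greedy: keep issuing from the same warp
    ready = [w for w in warps if w.ready]
    return max(ready, key=lambda w: w.criticality) if ready else None
```

With the warp pool above (criticalities 5, 10, 3, 7) and all warps ready, the first pick is Warp 1, matching the W1-first gCAWS sequence on the slide.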
![Page 115: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/115.jpg)
(115)
CAWA: Criticality-Aware Warp Acceleration
• Criticality Prediction Logic (CPL)
  – Predicting and identifying the critical warp at runtime
• greedy Criticality-Aware Warp Scheduler (gCAWS)
  – Prioritizing and accelerating the critical warp execution
• Criticality-Aware Cache Prioritization (CACP)
  – Prioritizing and allocating cache lines for critical warp reuse
CACP: reduce delay from the memory
![Page 116: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/116.jpg)
(116)
CAWA CACP: Criticality-Aware Cache Prioritization
19/28
*Wu et al., “SHiP: Signature-based Hit Predictor for High Performance Caching,” MICRO ’11
![Page 117: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/117.jpg)
(117)
CAWA CACP: Criticality-Aware Cache Prioritization (cont.)
20/28
[Figure: handling of a cache request in CACP]
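CACP's idea of retaining cache lines for critical-warp reuse can be sketched as a replacement bias. This is an assumption-laden simplification: the actual design uses SHiP-style signature-based reuse prediction in hardware, not a per-line flag.

```python
# Sketch of criticality-biased victim selection (illustrative only):
# lines predicted to be reused by a critical warp are protected, so the
# victim is chosen among non-critical lines first, oldest (LRU) first.
def choose_victim(cache_set):
    """cache_set: list of lines with .critical (bool) and .lru_age (int)."""
    non_critical = [l for l in cache_set if not l.critical]
    candidates = non_critical if non_critical else cache_set
    return max(candidates, key=lambda l: l.lru_age)  # evict the oldest
```

The effect is that data fetched by critical warps survives longer in the L1, reducing the memory latency those warps see on reuse.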
![Page 118: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/118.jpg)
(118)
Overall Performance Improvement
23/28
[Figure: IPC (top) and MPKI (bottom) of 2L, GTO, and CAWA, each normalized to baseline; outlier bars labeled 3.13x and 2.54x]
• CAWA aims to retain data for critical warps

Narasiman et al., “Improving GPU Performance via Large Warps and Two-Level Warp Scheduling,” MICRO ’11. Rogers, O’Connor, and Aamodt, “Cache-Conscious Wavefront Scheduling,” MICRO ’12.
![Page 119: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/119.jpg)
(119)
gCAWS Performance Improvement
24/28
[Figure: IPC normalized to baseline RR for gCAWS and CAWA; outlier bars labeled 2.63x (gCAWS) and 3.13x (CAWA); annotation: large memory footprint]
![Page 120: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/120.jpg)
(120)
Performance Improvement with CAWA CACP
25/28
[Figure: MPKI normalized to baseline RR for RR+CACP, 2L+CACP, GTO+CACP, and CAWA]
• CACP can effectively reduce cache interference with any scheduler
• Schedulers need modification for robust and higher performance improvement with CACP
![Page 121: CS 758: Advanced Topics in Computer Architecturesinclair/courses/cs758/...OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance A. Jog et. al ASPLOS](https://reader036.fdocuments.net/reader036/viewer/2022071405/60fb0a1393d3f205b86a569a/html5/thumbnails/121.jpg)
(121)
Summary
• Warp divergence leads to some lagging warps → critical warps
• Expose the performance impact of critical warps → throughput reduction
• Coordinate scheduler and cache management to mitigate warp criticality