MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency
Rachata Ausavarungnirun Vance Miller Joshua Landgraf Saugata Ghose
Jayneel Gandhi Adwait Jog Christopher J. Rossbach Onur Mutlu
Executive Summary
2
Problem: Address translation overheads limit the latency hiding capability of a GPU
Key Idea: Prioritize address translation requests over data requests
MASK: a GPU memory hierarchy that
A. Reduces shared TLB contention
B. Improves L2 cache utilization
C. Reduces page walk latency
MASK improves system throughput by 57.8% on average
over state-of-the-art address translation mechanisms
Large performance loss vs. no translation
High contention at the shared TLB
Low L2 cache utilization
Outline
3
• Executive Summary
• Background, Key Challenges and Our Goal
• MASK: A Translation-aware Memory Hierarchy
• Evaluation
• Conclusion
Why Share Discrete GPUs?
4
• Enables multiple GPGPU applications to run concurrently
• Better resource utilization
  - An application often cannot utilize an entire GPU
  - Different compute and bandwidth demands
• Enables GPU sharing in the cloud
  - Multiple users spatially share each GPU
Key requirement: fine-grained memory protection
State-of-the-art Address Translation in GPUs
5
[Diagram: each GPU core has a private TLB; all cores share a shared TLB, page table walkers, and the page table in main memory; App 1 and App 2 share these structures]
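To make this baseline concrete, here is a minimal sketch of the translation path on a miss: per-core private TLB, then the shared TLB, then a page walk. The class and function names (Tlb, PageTableWalker, translate) and the map-based TLB model are illustrative assumptions for this sketch, not structures from the paper.

#include <cstdint>
#include <optional>
#include <unordered_map>

using VPN = uint64_t;  // virtual page number
using PPN = uint64_t;  // physical page number

// Minimal TLB model, keyed by (application id, virtual page number).
struct Tlb {
    std::unordered_map<uint64_t, PPN> entries;
    static uint64_t key(unsigned app_id, VPN vpn) {
        return (static_cast<uint64_t>(app_id) << 48) | vpn;
    }
    std::optional<PPN> lookup(unsigned app_id, VPN vpn) const {
        auto it = entries.find(key(app_id, vpn));
        if (it == entries.end()) return std::nullopt;
        return it->second;
    }
    void fill(unsigned app_id, VPN vpn, PPN ppn) { entries[key(app_id, vpn)] = ppn; }
};

// Stand-in for the page table walkers that walk the page table in main memory.
struct PageTableWalker {
    PPN walk(unsigned /*app_id*/, VPN vpn) { return vpn; }  // placeholder mapping
};

// Translation for one memory access: private TLB -> shared TLB -> page walk.
PPN translate(Tlb& private_tlb, Tlb& shared_tlb, PageTableWalker& walker,
              unsigned app_id, VPN vpn) {
    if (auto ppn = private_tlb.lookup(app_id, vpn))   // per-core private TLB
        return *ppn;
    if (auto ppn = shared_tlb.lookup(app_id, vpn)) {  // TLB shared by all cores and apps
        private_tlb.fill(app_id, vpn, *ppn);
        return *ppn;
    }
    PPN ppn = walker.walk(app_id, vpn);               // long-latency page table walk
    shared_tlb.fill(app_id, vpn, ppn);
    private_tlb.fill(app_id, vpn, ppn);
    return ppn;
}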
A TLB Miss Stalls Multiple Warps
6
[Diagram: the same translation hierarchy; a TLB miss stalls warps on multiple GPU cores]
Data in a page is shared by many threads
All threads access the same page
Multiple Page Walks Happen Together
7
[Diagram: the same translation hierarchy; warps on all GPU cores are stalled while multiple page walks are in flight]
GPU's parallelism creates parallel page walks
Data in a page is shared by many threads
All threads access the same page
Effect of Translation on Performance
8
[Chart: normalized performance of Ideal, SharedTLB, and PWCache]
Effect of Translation on Performance
9
[Chart: normalized performance of Ideal, SharedTLB, and PWCache]
What causes the large performance loss?
Effect of Translation on Performance
10
[Chart: normalized performance of Ideal, SharedTLB, and PWCache, annotated with 37.4%]
Problem 1: Contention at the Shared TLB
11
• Multiple GPU applications contend for the TLB
[Chart: L2 TLB miss rate (lower is better) of App 1 and App 2, Alone vs. Shared, for 3DS_HISTO, CONS_LPS, MUM_HISTO, and RED_RAY]
Problem 1: Contention at the Shared TLB
12
• Multiple GPU applications contend for the TLB
[Chart: same L2 TLB miss rate data]
Contention at the shared TLB leads to lower performance
Problem 2: Thrashing at the L2 Cache
13
• L2 cache can be used to reduce page walk latency
  - Partial translation data can be cached
• Thrashing Source 1: Parallel page walks
  - Address translation data from different walks evict each other
• Thrashing Source 2: GPU memory intensity
  - Demand-fetched data evicts address translation data
L2 cache is ineffective at reducing page walk latency
Observation: Address Translation Is Latency Sensitive
14
• Multiple warps share data from a single page
[Chart: warps stalled per TLB miss for each application and the average]
A single TLB miss causes 8 warps to stall on average
Observation: Address Translation Is Latency Sensitive
15
• Multiple warps share data from a single page
• GPU's parallelism causes multiple concurrent page walks
[Chart: average number of concurrent page walks for each application and the average]
High address translation latency → More stalled warps
Our Goals
16
• Reduce shared TLB contention
• Improve L2 cache utilization
• Lower page walk latency
Outline
17
• Executive Summary
• Background, Key Challenges and Our Goal
• MASK: A Translation-aware Memory Hierarchy
• Evaluation
• Conclusion
MASK: A Translation-aware Memory Hierarchy
18
• Reduce shared TLB contention → A. TLB-fill Tokens
• Improve L2 cache utilization → B. Translation-aware L2 Bypass
• Lower page walk latency → C. Address-space-aware Memory Scheduler
A: TLB-fill Tokens
19
• Goal: Limit the number of warps that can fill the TLB
  - A warp with a token fills the shared TLB
  - A warp with no token fills a very small bypass cache
• Number of tokens changes based on TLB miss rate
  - Updated every epoch
• Tokens are assigned based on warp ID
Benefit: Limits contention at the shared TLB
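A minimal sketch of the token mechanism described above, assuming a simple epoch-based adjustment. The class and method names (TlbFillTokens, adjust, has_token) and the exact increase/decrease policy are illustrative assumptions; the precise policy is specified in the paper.

#include <cstdint>

struct EpochStats {
    uint64_t shared_tlb_accesses = 0;
    uint64_t shared_tlb_misses   = 0;
};

class TlbFillTokens {
public:
    explicit TlbFillTokens(unsigned total_warps)
        : total_warps_(total_warps), num_tokens_(total_warps) {}

    // Called once per epoch: hand out fewer tokens when the shared TLB miss
    // rate rises (more contention), more tokens when it falls. (Assumed policy.)
    void adjust(const EpochStats& s) {
        double miss_rate = s.shared_tlb_accesses
            ? static_cast<double>(s.shared_tlb_misses) / s.shared_tlb_accesses
            : 0.0;
        if (miss_rate > prev_miss_rate_ && num_tokens_ > 1)
            --num_tokens_;
        else if (miss_rate < prev_miss_rate_ && num_tokens_ < total_warps_)
            ++num_tokens_;
        prev_miss_rate_ = miss_rate;
    }

    // Tokens are assigned based on warp ID: the first num_tokens_ warps hold one.
    bool has_token(unsigned warp_id) const { return warp_id < num_tokens_; }

    // On a translation fill: token holders insert into the shared TLB,
    // warps without a token fill a very small bypass cache instead.
    bool fills_shared_tlb(unsigned warp_id) const { return has_token(warp_id); }

private:
    unsigned total_warps_;
    unsigned num_tokens_;
    double   prev_miss_rate_ = 0.0;
};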
MASK: A Translation-aware Memory Hierarchy
20
• Reduce shared TLB contention → A. TLB-fill Tokens
• Improve L2 cache utilization → B. Translation-aware L2 Bypass
• Lower page walk latency → C. Address-space-aware Memory Scheduler
B: Translation-aware L2 Bypass
21
• L2 hit rate decreases for deep page walk levels
[Chart: L2 cache hit rate for Page Table Level 1]
B: Translation-aware L2 Bypass
22
• L2 hit rate decreases for deep page walk levels
[Chart: L2 cache hit rate for Page Table Levels 1 and 2]
B: Translation-aware L2 Bypass
23
• L2 hit rate decreases for deep page walk levels
[Chart: L2 cache hit rate for Page Table Levels 1–3]
B: Translation-aware L2 Bypass
24
• L2 hit rate decreases for deep page walk levels
Some address translation data does not benefit from caching
Only cache address translation data with high hit rate
[Chart: L2 cache hit rate for Page Table Levels 1–4; the high-hit-rate levels are marked "Cache"]
B: Translation-aware L2 Bypass
25
• Goal: Cache address translation data with high hit rate
[Chart: L2 cache hit rate per page table level compared against the average L2 cache hit rate; levels below the average are marked "Bypass"]
Benefit 1: Better L2 cache utilization for translation data
Benefit 2: Bypassed requests → No L2 queuing delay
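A minimal sketch of the bypass decision described above: compare each page-table level's L2 hit rate against the average L2 hit rate and bypass the levels that fall below it. The counter structures and names are illustrative assumptions, not the paper's exact hardware.

#include <array>
#include <cstdint>

struct HitCounter {
    uint64_t hits = 0;
    uint64_t accesses = 0;
    double rate() const { return accesses ? static_cast<double>(hits) / accesses : 0.0; }
};

class TranslationAwareL2Bypass {
public:
    // level: page table level of a translation request (0..3 for a four-level walk).
    // Bypass the L2 when this level's hit rate is below the average L2 hit rate.
    bool should_bypass(unsigned level) const {
        return per_level_[level].rate() < overall_.rate();
    }

    // Per-level counters are updated by translation requests; the overall
    // counter tracks the average hit rate across all L2 accesses.
    void record_translation_access(unsigned level, bool hit) {
        ++per_level_[level].accesses;
        if (hit) ++per_level_[level].hits;
        record_any_access(hit);
    }
    void record_any_access(bool hit) {
        ++overall_.accesses;
        if (hit) ++overall_.hits;
    }

private:
    std::array<HitCounter, 4> per_level_{};  // hit rate per page table level
    HitCounter overall_{};                   // average L2 cache hit rate
};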
MASK: A Translation-aware Memory Hierarchy
26
• Reduce shared TLB contention → A. TLB-fill Tokens
• Improve L2 cache utilization → B. Translation-aware L2 Bypass
• Lower page walk latency → C. Address-space-aware Memory Scheduler
C: Address-space-aware Memory Scheduler
27
• Cause: Address translation requests are treated similarly to data demand requests
Idea: Lower address translation request latency
[Chart: DRAM latency and DRAM bandwidth of address translation requests vs. data demand requests]
C: Address-space-aware Memory Scheduler
28
• Idea 1: Prioritize address translation requests over data demand requests
[Diagram: memory scheduler with a high-priority Golden Queue for address translation requests and a low-priority Normal Queue for data demand requests, both feeding DRAM]
C: Address-space-aware Memory Scheduler
29
• Idea 1: Prioritize address translation requests over data demand requests
• Idea 2: Improve quality-of-service using the Silver Queue
[Diagram: memory scheduler with a Golden Queue (address translation requests), a Silver Queue (data demand requests; applications take turns), and a Normal Queue (data demand requests), in decreasing priority, feeding DRAM]
Applications take turns injecting into the Silver Queue
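A minimal sketch of the three-queue policy described above (Golden > Silver > Normal). The MemRequest fields and the round-robin rotation of the Silver Queue's owner are illustrative assumptions; the paper defines the exact arbitration.

#include <deque>
#include <optional>

struct MemRequest {
    bool is_translation;  // true for address translation (page walk) requests
    unsigned app_id;      // address space / application that issued the request
    // ... address, size, etc.
};

class AddressSpaceAwareScheduler {
public:
    explicit AddressSpaceAwareScheduler(unsigned num_apps) : num_apps_(num_apps) {}

    void enqueue(const MemRequest& r) {
        if (r.is_translation)
            golden_.push_back(r);              // address translation: highest priority
        else if (r.app_id == silver_turn_)
            silver_.push_back(r);              // QoS: the app whose turn it is
        else
            normal_.push_back(r);              // remaining data demand requests
    }

    // Pick the next request to issue to DRAM: Golden > Silver > Normal.
    std::optional<MemRequest> next() {
        for (auto* q : {&golden_, &silver_, &normal_}) {
            if (!q->empty()) {
                MemRequest r = q->front();
                q->pop_front();
                return r;
            }
        }
        return std::nullopt;
    }

    // Rotate which application may inject into the Silver Queue
    // (e.g., after a fixed quota of requests; the quota is an assumption).
    void rotate_silver_turn() { silver_turn_ = (silver_turn_ + 1) % num_apps_; }

private:
    unsigned num_apps_;
    unsigned silver_turn_ = 0;
    std::deque<MemRequest> golden_, silver_, normal_;
};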
Outline
30
• Executive Summary
• Background, Key Challenges and Our Goal
• MASK: A Translation-aware Memory Hierarchy
• Evaluation
• Conclusion
Methodology
31
• Mosaic simulation platform [MICRO '17]
  - Based on GPGPU-Sim and MAFIA [Jog et al., MEMSYS '15]
  - Models page walks and virtual-to-physical mapping
  - Available at https://github.com/CMU-SAFARI/Mosaic
• NVIDIA GTX 750 Ti
• Two GPGPU applications execute concurrently
• CUDA-SDK, Rodinia, Parboil, LULESH, and SHOC suites
  - 3 workload categories based on TLB miss rate
Comparison Points
32
• State-of-the-art CPU–GPU memory management [Power et al., HPCA '14]
  - PWCache: Page Walk Cache GPU MMU design
  - SharedTLB: Shared TLB GPU MMU design
• Ideal: Every TLB access is an L1 TLB hit
Performance
33
[Chart: normalized performance of PWCache, SharedTLB, MASK, and Ideal for the 0-HMR, 1-HMR, and 2-HMR workload categories and the average; annotations: 57.8%, 52.0%, 61.2%, 58.7%]
MASK outperforms the state-of-the-art designs for every workload
Other Results in the Paper
34
• MASK reduces unfairness
• Effectiveness of each individual component
  - All three MASK components are effective
• Sensitivity analysis over multiple GPU architectures
  - MASK improves performance on all evaluated architectures, including CPU–GPU heterogeneous systems
• Sensitivity analysis to different TLB sizes
  - MASK improves performance for all evaluated sizes
• Performance improvement over different memory scheduling policies
  - MASK improves performance over other state-of-the-art memory schedulers
Conclusion
35
Problem: Address translation overheads limit the latency hiding capability of a GPU
Key Idea: Prioritize address translation requests over data requests
MASK: a translation-aware GPU memory hierarchy
A. TLB-fill Tokens reduces shared TLB contention
B. Translation-aware L2 Bypass improves L2 cache utilization
C. Address-space-aware Memory Scheduler reduces page walk latency
MASK improves system throughput by 57.8% on average
over state-of-the-art address translation mechanisms
Large performance loss vs. no translation
High contention at the shared TLB
Low L2 cache utilization
MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency
Rachata Ausavarungnirun Vance Miller Joshua Landgraf Saugata Ghose
Jayneel Gandhi Adwait Jog Christopher J. Rossbach Onur Mutlu
Backup Slides
37
Other Ways to Manage TLB Contention
38
• Prefetching
  - Stream prefetcher is ineffective for multiple workloads
  - GPU's parallelism makes it hard to predict which translation data to prefetch
  - GPU's parallelism causes thrashing on the prefetched data
• Reuse-based technique
  - Lowers TLB hit rate
  - Most pages have similar TLB hit rates
Other Ways to Manage L2 Thrashing
39
• Cache Partitioning
  - Performs ~3% worse on average compared to Translation-aware L2 Bypass
  - Multiple address translation requests still thrash each other
  - Can lead to underutilization
  - Lowers hit rate of data requests
• Cache Insertion Policy
  - Does not yield better hit rate for lower page table levels
  - Does not benefit from lower queuing latency
Utilizing Large Pages?
• A single large page size
  - High demand paging latency
  - > 90% performance overhead with demand paging
  - All threads stall during large page PCIe transfer
• Mosaic [Ausavarungnirun et al., MICRO '17]
  - Supports multiple page sizes
  - Demand paging happens at small page granularity
  - Allocates data from the same application at large page granularity
  - Opportunistically coalesces small pages to reduce TLB contention
MASK + Mosaic performs within 5% of the Ideal TLB
Area and Storage Overhead
41
• Area overhead
  - <1% of the area of its original components (Shared TLB, L2$, Memory Scheduler)
• Storage overhead
  - TLB-fill Tokens: 3.8% extra storage on the Shared TLB
  - Translation-aware L2 Bypass: 0.1% extra storage on the L2$
  - Address-space-aware Memory Scheduler: 6% extra memory request buffer
Unfairness
[Chart: unfairness of PWCache, SharedTLB, and MASK for 0-HMR, 1-HMR, 2-HMR, and the average; annotations: 20.1%, 25.0%, 21.8%, 22.4%]
MASK is effective at improving fairness
Performance vs. Other Baselines
[Chart: weighted speedup of Static, PWCache, SharedTLB, MASK-TLB, MASK-Cache, MASK-DRAM, MASK, and Ideal for 0-HMR, 1-HMR, 2-HMR, and the average]
Unfairness
[Chart: unfairness of Static, PWCache, SharedTLB, and MASK for 0-HMR, 1-HMR, 2-HMR, and the average; annotations: 22.4%, 21.8%, 25.0%, 20.1%]
DRAM Utilization Breakdowns
[Chart: normalized DRAM bandwidth utilization of address translation requests vs. data demand requests for each two-application workload and the average]
DRAM Latency Breakdowns
[Chart: DRAM latency in cycles of address translation requests vs. data demand requests for each two-application workload and the average]
0-HMR Performance
[Chart: weighted speedup of Static, PWCache, SharedTLB, MASK-TLB, MASK-Cache, MASK-DRAM, and MASK for HISTO_GUP, HISTO_LPS, NW_HS, NW_LPS, RAY_GUP, RAY_HS, SCP_GUP, and SCP_HS]
1-HMR Performance
[Chart: weighted speedup of Static, PWCache, SharedTLB, MASK-TLB, MASK-Cache, MASK-DRAM, and MASK for the 1-HMR workloads]
2-HMR Performance
[Chart: weighted speedup of Static, PWCache, SharedTLB, MASK-TLB, MASK-Cache, MASK-DRAM, and MASK for the 2-HMR workloads]
Additional Baseline Performance
50
[Chart: normalized performance of PWCache, SharedTLB, and Ideal; annotation: 45.6%]