An Instruction Roofline Model for GPUs
Transcript of An Instruction Roofline Model for GPUs
Nan Ding, Samuel Williams
Computational Research Division, Lawrence Berkeley National Lab
{nanding, swwilliams}@lbl.gov
Nov. 18, 2019
An Instruction Roofline Model for GPUs
History of Roofline Models

Roofline Model
• Sustainable performance is bound by
  GFLOP/s = min(Peak GFLOP/s, AI × GB/s)
• Arithmetic Intensity (AI): FLOPs/Byte
[Figure: Attainable GFLOP/s vs. Arithmetic Intensity (FLOPs:Bytes), with a Peak GFLOP/s ceiling and a DRAM GB/s bandwidth slope]

Hierarchical Roofline Model
• Arithmetic Intensity (AI): FLOPs/Byte at each memory level (L1/L2/DRAM)
• Additional compute ceilings: No-FMA peak
[Figure: Attainable GFLOP/s vs. Arithmetic Intensity (FLOPs:Bytes), with Peak and No-FMA GFLOP/s ceilings and L1/L2/DRAM GB/s bandwidth slopes; higher is better]
Roofline is Useful
• Memory-bound or compute-bound?
• Cache effects
• Driving performance optimization
However…
• Even with sufficient data locality, one cannot guarantee high performance
  • Pathological memory access patterns?
  • Re-design the data layout?
  • Limited by instruction throughput?
• Many applications perform more integer operations than floating-point operations, or perform no flops at all
[Figure: a flop-based Roofline with a Peak GFLOP/s ceiling and DRAM/L2/L1 GB/s slopes, marked "???": where do such applications land?]
Motivation for the Instruction Roofline Model
Emerging Domains (More than Flops)
• Mixed precision, integer-heavy
• No floating-point operations

Architectural Evolution (A New Set of Metrics)
• Instruction throughput, pipeline utilization
• Warp efficiency, thread predication
• Memory access patterns: reduce wasted transactions, reduce redundant accesses

Practical Use (Drive Code Optimization in a Visual Manner)
• What is holding you back?
• What optimizations should be performed?
• When to stop optimizing?
The First Step to the Instruction Roofline Model
• Sustainable performance is bound by
  GIPS = min(Peak GIPS, II × GB/s)
  (replacing GFLOP/s = min(Peak GFLOP/s, AI × GB/s))
[Figure: Attainable GIPS vs. Arithmetic Intensity (FLOPs:Byte), with a Peak GIPS ceiling and DRAM/L2/L1 GB/s bandwidth slopes]
• Expands the applicability of Roofline to several emerging computational domains
• Forms the basis for several subsequent Instruction Roofline-oriented performance analysis techniques:
  • Identify fetch-decode-issue bottlenecks
  • Functional unit utilization (FPU, tensor, integer, etc.)
The Second Step to the Instruction Roofline Model
• Sustainable performance is bound by
  GIPS = min(Peak GIPS, II × GB/s)
• Instruction Intensity (II): instructions per Byte, replacing Arithmetic Intensity
[Figure: Attainable GIPS vs. Instruction Intensity (Instructions:Byte), with a Peak GIPS ceiling and DRAM/L2/L1 GB/s bandwidth slopes]
• Expands the applicability of Roofline to several emerging computational domains
• Limitation: hard to motivate further performance analysis techniques, such as memory access patterns
The Final Step to the Instruction Roofline Model on GPUs
• Sustainable performance is bound by
  GIPS = min(Peak GIPS, (Instructions/Transaction) × GTXN/s)
• Instruction Intensity: instructions per transaction
• Memory transactions are
  • the natural unit for accessing data on NVIDIA GPUs
  • the natural unit for analyzing memory access
  • generated at a rate of 1 to 32 per warp-level load/store
[Figure: Attainable GIPS vs. Instruction Intensity (Instructions:Transaction), with a Peak GIPS ceiling and DRAM/L2/L1 GTXN/s slopes]
• Expands the applicability of Roofline to more performance analysis techniques on GPUs
• Forms the basis for several subsequent Instruction Roofline-oriented performance analysis techniques on GPUs:
  • Memory access patterns
Instruction Roofline Performance Model
• Sustainable performance is bound by
  GIPS = min(Peak GIPS, Instruction Intensity × GTXN/s)
• Theoretical peak on V100:
  80 SMs × 4 warp schedulers × 1 inst/cycle × 1.53 GHz = 489.6 warp GIPS
• Memory ceilings on V100:
  • based on the GB/s measured by the Empirical Roofline Toolkit [1]
  • converted into the number of equivalent 32-byte transactions per second
[Figure: Instruction Roofline for V100: Performance (warp GIPS) vs. Instruction Intensity (Warp Instructions per Transaction), with the Theoretical Peak (489.6 warp GIPS) and the L1 (437.5), L2 (93.6), and HBM (25.9) GTXN/s ceilings]

[1] https://bitbucket.org/berkeleylab/cs-roofline-toolkit
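To make the ceilings concrete, here is a minimal sketch (host-only code, compilable with nvcc) of the arithmetic above; the GB/s inputs are back-derived from the GTXN/s values on this slide rather than independently measured with the toolkit:

```cuda
#include <cstdio>

int main() {
    // Theoretical instruction peak on V100 (numbers from the slide):
    // 80 SMs x 4 warp schedulers x 1 warp instruction/cycle x 1.53 GHz.
    const double peak_warp_gips = 80 * 4 * 1 * 1.53;   // = 489.6

    // Memory ceilings: convert a bandwidth in GB/s into billions of
    // equivalent 32-byte transactions per second (GTXN/s).
    const double bytes_per_txn = 32.0;
    const double l1_gbs  = 14000.0;  // assumed ERT L1 bandwidth
    const double l2_gbs  = 2995.2;   // assumed ERT L2 bandwidth
    const double hbm_gbs = 828.8;    // assumed ERT HBM bandwidth

    printf("Peak: %.1f warp GIPS\n", peak_warp_gips);
    printf("L1 %.1f, L2 %.1f, HBM %.1f GTXN/s\n",
           l1_gbs / bytes_per_txn, l2_gbs / bytes_per_txn,
           hbm_gbs / bytes_per_txn);
    return 0;
}
```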
Instruction Throughput

Capabilities of the Instruction Roofline Performance Model (1/2): Instruction Throughput
• Measured quantities: all instructions, transactions at each memory level (L1/L2/HBM), and runtime
[Figure: Instruction Roofline with L1 (tot_inst), L2 (tot_inst), and HBM (tot_inst) dots plotted against the Theoretical Peak (489.6 warp GIPS) and the L1 437.5 / L2 93.6 / HBM 25.9 GTXN/s ceilings]
• Insights:
  1. The distance between the ceilings and the dots tells whether a kernel is memory-bound or instruction-bound.
  2. The distance between the dots at different memory levels tells the degree of data reuse.
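As a sketch of how each dot is placed (an assumption on my part: counter names follow nvprof conventions such as inst_executed and dram_read_transactions, and whether a given tool reports warp-level or thread-level instruction counts should be verified):

```cuda
#include <cstdio>

// How each (x, y) dot on the Instruction Roofline is formed from
// profiler counters.
struct KernelCounters {
    double warp_instructions;   // e.g. nvprof inst_executed (warp-level)
    double transactions;        // transactions at one level: L1, L2, or HBM
    double runtime_seconds;     // kernel runtime
};

// X: warp instructions per transaction; Y: warp GIPS.
void dot(const KernelCounters& c, double* x, double* y) {
    *x = c.warp_instructions / c.transactions;
    *y = c.warp_instructions / c.runtime_seconds / 1e9;
}

int main() {
    // Hypothetical counters for one kernel at the HBM level.
    KernelCounters hbm{1.2e9, 4.0e8, 1.0e-2};
    double x, y;
    dot(hbm, &x, &y);
    printf("HBM dot: intensity %.2f inst/txn, %.2f warp GIPS\n", x, y);
    return 0;
}
```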
(The preceding Instruction Throughput plot is then repeated with a "Better!" annotation: dots closer to the ceilings indicate better instruction throughput.)
Memory Access Patterns

Capabilities of the Instruction Roofline Performance Model (2/2): Memory Access Patterns
Memory Access Pattern is Critical to Application Execution Time
• It is easy to code an inefficient memory pattern
• Low performance results
• The pattern is hidden deep in the code
• Reasoning about the performance is time-consuming
Capabilities of the Instruction Roofline Model: Global Memory Patterns
• 1 warp-level load/store generates 1 to 32 transactions, depending on the memory pattern (1 global transaction = 32 Bytes; one cache line = 128 Bytes)
• "Stride-0": all 32 threads of a warp access the same 4-byte word of int array[]
  Intensity wall: N warp global LDST / N global transactions = 1
• "Stride-1" (unit stride): a warp accesses 32 consecutive 4-byte words (one 128-byte cache line, four 32-byte transactions)
  Intensity wall: N warp global LDST / 4N global transactions = 1/4
• "Stride-8": each thread strides by 8 words, so every thread touches a different 32-byte transaction, and most of each transaction is wasted rather than useful data
  Intensity wall: N warp global LDST / 32N global transactions = 1/32
[Figure: warp0 accessing int array[] under each pattern, showing the 32-byte global transactions generated per warp-level load]
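A minimal CUDA sketch of the three patterns (hypothetical kernels, not taken from the talk; assume 4-byte ints and that `in` is sized for the largest stride):

```cuda
// Each kernel issues one warp-level load of `in`; with 4-byte ints and
// 32-byte global transactions, that load costs 1, 4, or 32 transactions.
__global__ void stride0(const int* in, int* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[0];          // all 32 threads read the same word: 1 txn/warp
}

__global__ void stride1(const int* in, int* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];          // 32 consecutive words (128 B): 4 txns/warp
}

__global__ void stride8(const int* in, int* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[8 * i];      // each thread in its own 32-B txn: 32 txns/warp
}
```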
Capabilities of the Instruction Roofline Performance Model: Three Intensity "Walls" for Strided Global Memory Access Patterns
• Stride-8 wall: N warp global LDST / 32N global transactions = 1/32
• Stride-1 wall: N warp global LDST / 4N global transactions = 1/4
• Stride-0 wall: N warp global LDST / N global transactions = 1
[Figure: Instruction Roofline with the Stride-0, Stride-1, and Stride-8 walls, the Global 437.5 GTXN/s ceiling, the Theoretical Peak (489.6 warp GIPS), and the Theoretical LDST Peak (122.4 warp GIPS); moving right, from Stride-8 toward Stride-0, means better efficiency]
Capabilities of the Instruction Roofline Performance Model: Characterize Global Memory Access Patterns
• Break the L1 dot down into global-memory-only metrics, then read the strided global memory pattern off the global memory walls
• Total (L1) dot:
  X-axis: Total instructions / Total L1 (global + shared + local) transactions
  Y-axis: Total instructions / Runtime
• Global-only dot:
  X-axis: Global LDST instructions / Global transactions
  Y-axis: Global LDST instructions / Runtime
[Figure: Instruction Roofline with the Stride-0/1/8 walls, the Global 437.5 GTXN/s ceiling, the Theoretical Peak (489.6 warp GIPS), and the Theoretical LDST Peak (122.4 warp GIPS); dots closer to the Stride-0 wall indicate better efficiency]
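As a sketch of how the global-only dot is formed (metric names follow nvprof conventions such as inst_executed_global_loads and gld_transactions; verify the exact names against your profiler version):

```cuda
#include <cstdio>

int main() {
    // Hypothetical counter values for one kernel.
    double global_ldst = 2.0e8;   // inst_executed_global_loads + ..._stores
    double global_txns = 3.2e9;   // gld_transactions + gst_transactions
    double runtime_s   = 5.0e-3;

    double x = global_ldst / global_txns;        // warp instructions per txn
    double y = global_ldst / runtime_s / 1e9;    // warp GIPS
    // Compare x against the walls: 1 (stride-0), 1/4 (stride-1), 1/32 (stride-8).
    printf("global dot: (%.4f, %.2f)\n", x, y);
    return 0;
}
```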
Capabilities of the Instruction Roofline Performance Model: Shared Memory Access Patterns
• Notional vs. physical layout: shared memory has 32 banks per row, and each bank is 4 Bytes wide
• __shared__ int array[32][32]: a warp can hit the same bank 32 times (a 32-way bank conflict) or spread across all 32 banks (no bank conflict)
• Padding reduces bank conflicts: __shared__ int array[32][32+1]
[Figure: notional vs. physical views of the array across the 32 banks, contrasting the no-conflict, 32-way-conflict, and padded conflict-free cases]
• "No bank conflict" wall: each thread accesses a different 4-byte word in a different bank (or the same 4-byte word in the same bank)
  N warp shared LDST / N shared transactions = 1
• "32-way bank conflict" wall: threads access different 4-byte words in the same bank
  N warp shared LDST / 32N shared transactions = 1/32
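A small CUDA sketch of the two cases (hypothetical kernel, assuming a 32×32 thread block): with int tile[32][32], word (r, c) lives in bank (r*32 + c) mod 32 = c, so a column read hits one bank 32 times, while padding to [32][33] shifts each row by one bank.

```cuda
__global__ void bank_demo(const int* in, int* out, int n) {
    __shared__ int tile[32][32];   // change to tile[32][33] for no conflicts
    int r = threadIdx.y, c = threadIdx.x;

    tile[r][c] = in[r * n + c];    // row access: 32 different banks, no conflict
    __syncthreads();

    // Column access: within a warp (fixed r, c = 0..31), tile[c][r] touches
    // words c*32 + r, which all map to bank r: a 32-way bank conflict.
    // With [32][33] padding the words are c*33 + r, banks (c + r) % 32,
    // all distinct, so the conflict disappears.
    out[c * n + r] = tile[c][r];
}
```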
Capabilities of the Instruction Roofline Performance Model: Two Intensity "Walls" for Banked Shared Memory Access Patterns
[Figure: Instruction Roofline with the no-bank-conflict and 32-way-bank-conflict walls, the Shared 109.3 GTXN/s ceiling, the Theoretical Peak (489.6 warp GIPS), and the Theoretical LDST Peak (122.4 warp GIPS); moving right toward the no-conflict wall means better efficiency]
Capabilities of the Instruction Roofline Performance Model: Characterize Shared Memory Access Patterns
• Break the L1 dot down into shared-memory-only metrics, then read the banked shared memory pattern off the shared memory walls
• Total (L1) dot:
  X-axis: Total instructions / Total L1 (global + shared + local) transactions
  Y-axis: Total instructions / Runtime
• Shared-only dot:
  X-axis: Shared LDST instructions / Shared transactions
  Y-axis: Shared LDST instructions / Runtime
[Figure: Instruction Roofline with the bank-conflict walls and the Shared 109.3 GTXN/s ceiling; dots closer to the no-conflict wall indicate better efficiency]
An Example to Understand the Outputs of the Instruction Roofline Model

Example: Matrix Transpose
• Description: A -> AT, stored in column major
• Matrix size: 1024 × 1024
• Machine: NVIDIA V100 GPU
• Implementations (all using 32×8 thread blocks operating on 32×32 matrix tiles):
  • Naive: simple copy
  • Coalesced: coalesced global memory access
  • Coalesced_NoBankConflict: based on Coalesced, reducing shared memory bank conflicts
Naive Implementation: Instruction Throughput
• Global memory stride access: Input Matrix A (column major) and Output Matrix AT (column major), each 4096 Bytes × 4096 Bytes (1024 × 1024 floats)
• Reads of A are coalesced; writes of AT stride over 4096 Bytes
[Figure (left): Instruction Roofline with the L1 (tot_inst), L2 (tot_inst), and HBM (tot_inst) dots against the Theoretical Peak (489.6 warp GIPS) and the L1 437.5 / L2 93.6 / HBM 25.9 GTXN/s ceilings]
[Figure (right): global memory pattern plot with the Global (ldst_inst) dot against the Stride-0/1/8 walls, the Theoretical LDST Peak (122.4 warp GIPS), and the Global 437.5 GTXN/s ceiling]
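For reference, a sketch of a naive transpose kernel of the kind this slide describes; this is an assumed formulation (32×8 thread blocks on 32×32 tiles), since the talk does not show source code. With column-major storage, consecutive threads read consecutive elements of A but write AT with a 1024-float (4096-byte) stride.

```cuda
// A and AT are n x n floats (n = 1024), column major: element (i, j) of A
// is stored at a[j*n + i]. Launch with dim3 block(32, 8); each thread
// handles four elements of a 32x32 tile.
__global__ void transposeNaive(float* at, const float* a, int n) {
    int x = blockIdx.x * 32 + threadIdx.x;   // row index in A
    int y = blockIdx.y * 32 + threadIdx.y;   // column index in A
    for (int j = 0; j < 32; j += 8) {
        // The read is coalesced (consecutive threads touch consecutive
        // addresses); the write strides by n floats = 4096 bytes.
        at[x * n + (y + j)] = a[(y + j) * n + x];
    }
}
```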
Coalesced Implementation: Instruction Throughput
• Input Matrix A (column major) and Output Matrix AT (column major), each 4096 Bytes × 4096 Bytes (1024 × 1024 floats)
• Coalesced read of a 32 × 32 (floats) tile into shared memory; coalesced write of AT by recalculating the array index through the tile
[Figure (left): Instruction Roofline with the L1/L2/HBM (tot_inst) dots for Naive and Coalesced]
[Figure (right): global memory pattern plot with the Global (ldst_inst) dots for Naive and Coalesced against the Stride-0/1/8 walls, the Global 437.5 GTXN/s ceiling, and the Theoretical LDST Peak (122.4 warp GIPS); the Coalesced dot lands at higher intensity: better efficiency]
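A sketch of the Coalesced kernel in the same assumed formulation (again, the talk shows no source): both the global read and the global write are coalesced because the transposition happens through the shared tile, at the cost of a bank-conflicting transposed read of that tile (quantified on the next slide).

```cuda
__global__ void transposeCoalesced(float* at, const float* a, int n) {
    __shared__ float tile[32][32];   // transposed reads of this tile conflict
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    for (int j = 0; j < 32; j += 8)
        tile[threadIdx.y + j][threadIdx.x] = a[(y + j) * n + x];  // coalesced read
    __syncthreads();
    x = blockIdx.y * 32 + threadIdx.x;   // swap block roles for the write
    y = blockIdx.x * 32 + threadIdx.y;
    for (int j = 0; j < 32; j += 8)
        at[(y + j) * n + x] = tile[threadIdx.x][threadIdx.y + j]; // coalesced write
}
```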
Coalesced Implementation: Shared Memory Bank Access
• The 32 × 32 shared tile is filled by coalesced reads of A and read back transposed (recalculating the array index) to produce coalesced writes of AT
• Reading the tile transposed makes threads T0..T31 of a warp access words k, k+16, ... in a pattern that causes a 16-way bank conflict
[Figure (left): Instruction Roofline with the L1/L2/HBM (tot_inst) dots for Naive and Coalesced]
[Figure (right): shared memory pattern plot with the Shared (ldst_inst) dot between the no-bank-conflict and 32-way-bank-conflict walls, against the Theoretical LDST Peak (122.4 warp GIPS); closer to the no-conflict wall is better]
Coalesced_NoBankConflict Implementation: Shared Memory Bank Access
• Same coalesced tile scheme (coalesced read of a 32 × 32 float tile, coalesced write of AT, recalculating the array index), but the tile is padded to [32][32+1]
• With the padding, threads T0..T31 fall into distinct banks: no bank conflict
[Figure (left): Instruction Roofline with the L1/L2/HBM (tot_inst) dots]
[Figure (right): shared memory pattern plot with the Shared (ldst_inst) dots for Naive and Coalesced_NoBankConflict against the bank-conflict walls and the Theoretical LDST Peak (122.4 warp GIPS); the padded version sits on the no-bank-conflict wall: better efficiency]
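The padded variant differs from the Coalesced sketch above by a single line; the extra column shifts each tile row by one bank, so the transposed read touches 32 distinct banks:

```cuda
__global__ void transposeNoBankConflict(float* at, const float* a, int n) {
    __shared__ float tile[32][32 + 1];   // the only change: +1 padding
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    for (int j = 0; j < 32; j += 8)
        tile[threadIdx.y + j][threadIdx.x] = a[(y + j) * n + x];
    __syncthreads();
    x = blockIdx.y * 32 + threadIdx.x;
    y = blockIdx.x * 32 + threadIdx.y;
    for (int j = 0; j < 32; j += 8)
        at[(y + j) * n + x] = tile[threadIdx.x][threadIdx.y + j];
}
```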
Summary
• Instruction Throughput: expands the applicability of Roofline to several emerging computational domains
• Global Memory Access Patterns: quantify the memory access pattern, e.g. unit-stride vs. gather/scatter
• Shared Memory Access Patterns: denote the efficiency of shared memory access

There's more in the paper!
• Thread predication
• More examples:
  • HPGMG (mixed precision): three implementations
  • BatchSW (integer-only): two implementations
  • Tensor Core: WMMA, cuBLAS
Closing Thoughts: What the Instruction Roofline Model Tells Us

Emerging Domains
• Mixed precision, integer-heavy
• No floating-point operations
• Applicability to several emerging computational domains

Architectural Evolution
• Instruction throughput, pipeline utilization
• Quantify memory patterns: unit-stride, scatter/gather
• Efficiency of memory access, warp efficiency, thread predication

Practical Use
• Applicability of Roofline to GPUs with greater insight
• Unified visualization of bandwidth and access efficiency
• Rapidly tells how different aspects of modern GPU architectures constrain performance
Future Work
• Apply our methodology to other accelerated architectures
• Extend the access efficiency concept to networking, I/O, and Lustre file systems
Acknowledgement
• This material is based upon work supported by the Advanced Scientific Computing Research Program in the U.S. Department of Energy, Office of Science, under Award Number DE-AC02-05CH11231.
• This research used resources of the National Energy Research Scientific Computing Center (NERSC), which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231, and the Oak Ridge Leadership Facility, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
• We thank NVIDIA Corporation for their willingness to answer our myriad questions on nvprof metrics.
Questions?
Backup