Nan Ding, Samuel WilliamsComputational Research DivisionLawrence Berkeley National Lab{nanding, swwilliams}@lbl.gov
Nov.18th ,2019
An Instruction Roofline Model for GPUs
GFLOP/s = min Peak GFLOP/sAI * GB/s
History of Roofline Models
2
• Sustainable performance is bound by
Roofline Model
• Arithmetic Intensity (AI) : FLOPs/Byte
Peak GFLOP/s
Atta
inab
le G
FLO
P/s
Arithmetic Intensity (Flops: Bytes)
DRAM G
B/s• Arithmetic Intensity (AI) : FLOPs/Byte(L1/L2/DRAM)• Additional compute ceilings: No-FMA peak
Peak GFLOP/s
Atta
inab
le G
FLO
P/s
DRAM G
B/s
L2 G
B/sL1 G
B/s
Arithmetic Intensity (Flops: Bytes)
Hierarchical Roofline Model
No-FMA GFLOP/sbetter
Roofline is Useful
3
Memory-bound or Compute-bound Cache effects
Driving performance optimization
4
However…
• Even with sufficient data locality, one cannot guarantee high performance• Pathological memory access patterns?• Re-design the data layout?• Limited by instruction throughput?
• Many applications perform more integer operations than floating-point/no-flopsPeak GFLOP/s
Atta
inab
le G
FLO
P/s
DRAM
GB/
s
L2 G
B/s
L1 G
B/s
Arithmetic Intensity (Flops: Bytes)
???
Motivation for the Instruction Roofline Model
Emerging Domains Architectural Evolution Practical Use
• Mixed precision, Integer-heavy
• No floating points operations
• Instruction throughput• pipeline utilization
• Warp efficiency• Thread predication
• Memory access patterns• reduce wasted transactions• reduce redundant access
More than Flops A New Set of Metrics
• What is holding you back?
• What optimizations should be performed?
• When to stop optimization?
Drive Code Optimization in a Good Visual Manner
GFLOP/s = min Peak GFLOP/ssAI * GB/s
6
The First Step to Instruction Roofline Model
Atta
inab
le G
IPS
DRAM G
B/s
L2 G
B/sL1 G
B/s
Arithmetic Intensity (Flops: Byte)
Peak GIPS
Expanding the applicability of roofline to several emerging computational domains
Form the basis for several subsequent Instruction Roofline-oriented performance analysis technologies
• Identify fetch-decode-issue bottlenecks• Function unit utilization (FPU, tensor,
integer, etc...)
• Sustainable performance is bound by
GIPS = min Peak GIPSII * GB/s
7
The Second Step to Instruction Roofline Model
Atta
inab
le G
IPS
DRAM G
B/s
L2 G
B/sL1 G
B/s
Instruction Intensity (Instructions: Byte)
Peak GIPS
• Instruction Intensity: Instructions per Byte
Limitation:Hard to motivate more performance analysis techniques, such as memory pattern access
GFLOP/s = min Peak GFLOP/sAI * GB/s
Expanding the applicability of roofline to several emerging computational domains
• Sustainable performance is bound by
GIPS = min Peak GIPSII * GB/s
8
A Final Step to Instruction Roofline Model on GPUs
Atta
inab
le G
IPS
DRAM G
TXN/s
L2 G
TXN/s
L1 G
TXN/s
Instruction Intensity (Instructions: Transaction)
Peak GIPS
• Memory Transaction• the natural unit to access data on NVIDIA GPUs• the natural unit to analyze memory access• a warp-level load/store -> 1 - 32 transactions
Expanding the applicability of roofline to more performance analysis technologies GPUs
• Instruction Intensity • Instructions per Transaction
Form the basis for several subsequent Instruction Roofline-oriented performance analysis technologies on GPUs:
• Memory access patterns
GIPS = min Peak GIPSInstructions/Transaction * GTXN/s
Instruction Roofline Performance Model
• Sustainable performance of is bound by
• Theoretical Peak on V100:80 SMs x 4 warp scheduler x 1 inst/cyc x 1.53GHz = 489.6 GIPS
[1] https://bitbucket.org/berkeleylab/cs-roofline-toolkit 9
GIPS = min Peak GIPSInstruction Intensity * GTransaction/s
10-2 10-1 100 101 102
Instruction Intensity (Warp Instructions per Transaction)
100
101
102
103
Perf
orm
ance
(war
p G
IPS)
HBM 25.9 GTXN/s
L2 93.6 GTXN/s
L1 437.5 GTXN/s
Theoretical Peak: 489.6 warp GIPS
• Memory ceilings on V100:• Based on the GB/s from Empirical Roofline
Toolkit[1]
• Calculate the number of equivalent 32-byte transactions
Instruction Throughput
Capabilities of Instruction Roofline Performance Model (1/2)
Capabilities of Instruction Roofline Model -- Instruction Throughput
• Instruction throughputAll instruction, Transactions of each memory level(L1/L2/HBM), runtime
10-2 10-1 100 101
Instuction Intensity (Warp Instructions per Transaction)
10-1
100
101
102
103
Perf
orm
ance
(war
p G
IPS) Thoeretical Peak: 489.6 warp GIPS
HBM 25.9 GTXN/sL2 93.6 GTXN/sL1 437.5 GTXN/s
L1 (tot_inst)L2 (tot_inst)HBM (tot_inst)
• Insights:1. Distance between the
ceilings and dots can tell memory-bound or instruction-bound
2. Distance between the two plots (different memory level) can tell the data reuse.
11
Theoretical
• Instruction throughputAll instruction, Transactions of each memory level(L1/L2/HBM), runtime
10-2 10-1 100 101
Instuction Intensity (Warp Instructions per Transaction)
10-1
100
101
102
103
Perf
orm
ance
(war
p G
IPS) Thoeretical Peak: 489.6 warp GIPS
HBM 25.9 GTXN/sL2 93.6 GTXN/sL1 437.5 GTXN/s
L1 (tot_inst)L2 (tot_inst)HBM (tot_inst)
Better!
12
• Insights:1. Distance between the
ceilings and dots can tell memory-bound or instruction-bound
2. Distance between the two plots (different memory level) can tell the data reuse.
Capabilities of Instruction Roofline Model -- Instruction Throughput
Theoretical
Memory Access Patterns
Capabilities of Instruction Roofline Performance Model (2/2)
Memory Access Pattern is Critical to Application Execution Time
14
Easy to code in an inefficient memory pattern
Low performance
Hidden deep in the code
Time consuming to reason the performance
Capabilities of Instruction Roofline Model -- Global Memory Patterns
1 warp-level load/store -> 1 to 32 transactions depending on memory patterns
𝟏 "#$% &'()#' *+,-𝟏 &'()#' -$#./#012(.
“Stride-0” 𝟏 "#$% &'()#' *+,-
𝟑𝟐 &'()#' -$#./#012(./
“Stride-8”
Useful data Waste data
15
war
p0
31
int array[]1 2 7 310 … …
1 global transaction = 32 Bytesw
arp
031
…1 2 70 …
…
…
1 global transaction
9 10 158 … 16
… 1 global transaction
one cache line: 128 bytes
24 … 31
1 2 70 … …
1 global transaction …
9 10 158 … 16
1 global transaction
24 … 31
𝟏 "#$% &'()#' *+,-𝟒 &'()#' -$#./#012(./
“Stride-1” (Unit Stride)
warp
1 2 3 30 310 …
int array[]
…
10-2 10-1 100
Instuction Intensity (Warp Instructions per Transaction)
10-2
10-1
100
101
102
103Pe
rfor
man
ce (w
arp
GIP
S)Theoretical Peak: 489.6 warp GIPS
Theoretical LDST Peak: 122.4 warp GIPS
Capabilities of Instruction Roofline Performance Model ---- Three Intensity ``Walls’’ for Stride Global Memory Access Patterns
16
Better Efficiency!
𝟏 "#$% &'()#' *+,-𝟏 &'()#' -$#./#012(.
𝟏 "#$% &'()#' *+,-𝟑𝟐 &'()#' -$#./#012(./
𝟏 "#$% &'()#' *+,-𝟒 &'()#' -$#./#012(./
Stride-8 Stride-1 Stride-0
10-4 10-2 100 102 104
Instuction Intensity (Warp Instructions per Transaction)
10-2
10-1
100
101
102
103
Perfo
rman
ce (w
arp
GIP
S)
Strid
e-0
Stri
de-1
Strid
e-8Global 4
37.5 GTXN/s
Thoeretical Peak: 489.6 warp GIPS
Thoeretical LDST Peak: 122.4 warp GIPS
Capabilities of Instruction Roofline Performance Model ---- Characterize Global Memory Access Patterns
Breakdown the L1 dot into Global Memory Only metrics -> Stride Global Memory Patterns according to Global Memory Walls
X-axis: 𝑮𝒍𝒐𝒃𝒂𝒍 𝑳𝑫𝑺𝑻 𝒊𝒏𝒔𝒕𝒓𝒖𝒄𝒕𝒊𝒐𝒏𝒔𝑮𝒍𝒐𝒃𝒂𝒍 𝒕𝒓𝒂𝒏𝒔𝒂𝒄𝒕𝒊𝒐𝒏𝒔
Y-axis : 𝑮𝒍𝒐𝒃𝒂𝒍 𝑳𝑫𝑺𝑻 𝒊𝒏𝒔𝒕𝒓𝒖𝒄𝒕𝒊𝒐𝒏𝒔𝑹𝒖𝒏𝒕𝒊𝒎𝒆
17
Better Efficiency!
TheoreticalX-axis: 𝑻𝒐𝒕𝒂𝒍 𝒊𝒏𝒔𝒕𝒓𝒖𝒄𝒕𝒊𝒐𝒏𝒔𝑻𝒐𝒕𝒂𝒍 𝑳𝟏 (𝒈𝒍𝒐𝒃𝒂𝒍K𝒔𝒉𝒂𝒓𝒆𝒅K 𝒍𝒐𝒄𝒂𝒍) 𝒕𝒓𝒂𝒏𝒔𝒂𝒄𝒕𝒊𝒐𝒏𝒔
Y-axis : 𝑻𝒐𝒕𝒂𝒍 𝒊𝒏𝒔𝒕𝒓𝒖𝒄𝒕𝒊𝒐𝒏𝒔𝑹𝒖𝒏𝒕𝒊𝒎𝒆
Capabilities of Instruction Roofline Performance Model ---- Shared Memory Access Patterns
Notional Physical
__shared__ int array[32][32]32 banks per shared memory rowEach bank is 4 Byte
Reduce bank conflicts__shared__ int array[32][32+1]
310 1
No bank conflict
__shared__ int array[32][32]
32-way bank conflict
No bank conflict
…
…
…
…warp
310 1…
…
…
310 1…
……
0
0
…war
p1
31
…warp
18
• “No bank conflict” = O"#$% ,P#$QR *+,-O ,P#$QR -$#./#012(.
• different 4-byte word, different bank• same 4-byte word , same bank
• “32-way bank conflict” = O"#$% ,P#$QR *+,-ST ,P#$QR -$#./#012(./
• different 4-byte words, same bank
Capabilities of Instruction Roofline Performance Model ---- Two Intensity ``Walls’’ for Bank Shared Memory Access Patterns
19
10-4 10-2 100 102 104
Instuction Intensity (Warp Instructions per Transaction)
10-2
10-1
100
101
102
103Pe
rfor
man
ce (w
arp
GIP
S)
No
bank
con
flict
32-w
ay b
ank
conf
lictShare
d 109.3 GTXN/s
Thoeretical Peak: 489.6 warp GIPS
Thoeretical LDST Peak: 122.4 warp GIPS
Better Efficiency!
Theoretical
10-4 10-2 100 102 104
Instuction Intensity (Warp Instructions per Transaction)
10-2
10-1
100
101
102
103
Perf
orm
ance
(war
p G
IPS)
No
bank
con
flict
32-w
ay b
ank
conf
lict
Shared 109.3 GTXN/s
Capabilities of Instruction Roofline Performance Model ---- Characterize Shared Memory Access Patterns
Breakdown the L1 dot into Shared Memory Only metrics -> banked Shared Memory Patterns according to Shared Memory Walls
X-axis: 𝑻𝒐𝒕𝒂𝒍 𝒊𝒏𝒔𝒕𝒓𝒖𝒄𝒕𝒊𝒐𝒏𝒔𝑻𝒐𝒕𝒂𝒍 𝑳𝟏 (𝒈𝒍𝒐𝒃𝒂𝒍 K 𝒔𝒉𝒂𝒓𝒆𝒅K 𝒍𝒐𝒄𝒂𝒍) 𝒕𝒓𝒂𝒏𝒔𝒂𝒄𝒕𝒊𝒐𝒏𝒔
Y-axis : 𝑻𝒐𝒕𝒂𝒍 𝒊𝒏𝒔𝒕𝒓𝒖𝒄𝒕𝒊𝒐𝒏𝒔𝑹𝒖𝒏𝒕𝒊𝒎𝒆
X-axis: 𝑺𝒉𝒂𝒓𝒆𝒅 𝑳𝑫𝑺𝑻 𝒊𝒏𝒔𝒕𝒓𝒖𝒄𝒕𝒊𝒐𝒏𝒔𝑺𝒉𝒂𝒓𝒅 𝒕𝒓𝒂𝒏𝒔𝒂𝒄𝒕𝒊𝒐𝒏𝒔
Y-axis : 𝑺𝒉𝒂𝒓𝒆𝒅 𝑳𝑫𝑺𝑻 𝒊𝒏𝒔𝒕𝒓𝒖𝒄𝒕𝒊𝒐𝒏𝒔𝑹𝒖𝒏𝒕𝒊𝒎𝒆
20
Better Efficiency!
An example to understand the outputs from Instruction Roofline Model
Example: Matrix Transpose
Description A-> AT, stored in column majorMatrix size 1024 x 1024Machine NVIDIA’s latest V100 GPU
Implementations Naive Coalesced Coalesced_NoBankConflict
Simple copy
Coaleced global memory access
Based on “Coalesced”Reduce shared memory bank conflicts
Using 32×8 thread blocks operating on 32×32 matrix tiles
21
Global memory stride access
Input Matrix: A (column major)
Output Matrix: AT (column major)
4096 Bytes (1024 floats)40
96 B
ytes
(102
4 fl
oats
)
Coalesced read
Stride over 4096 Bytes
4096 Bytes (1024 floats)
4096
Byt
es (1
024
floa
ts)
Naive Implementation
Instruction Throughput
10-2 10-1 100 101 102
Instuction Intensity (Warp Instructions per Transaction)
10-1
100
101
102
103
Perf
orm
ance
(war
p G
IPS)
Strid
e-0
Strid
e-1
Strid
e-8
Thoeretical LDST Peak: 122.4 warp GIPS
Global 437.5 GTXN/s
Global (ldst_inst)
Better Efficiency!
10-2 10-1 100 101 102
Instuction Intensity (Warp Instructions per Transaction)
100
101
102
103
Perf
orm
ance
(war
p G
IPS) Thoeretical Peak: 489.6 warp GIPS
L1 437.5 GTXN/s
HBM 25.9 GTXN/s
L2 93.6 GTXN/s
L1 (tot_inst)L2 (tot_inst)HBM (tot_inst)
Theoretical
Theoretical
10-2 10-1 100 101 102
Instuction Intensity (Warp Instructions per Transaction)
100
101
102
103
Perf
orm
ance
(war
p G
IPS) Thoeretical Peak: 489.6 warp GIPS
L1 437.5 GTXN/s
HBM 25.9 GTXN/s
L2 93.6 GTXN/s
L1 (tot_inst)L2 (tot_inst)HBM (tot_inst)
10-2 10-1 100 101 102
Instuction Intensity (Warp Instructions per Transaction)
10-1
100
101
102
103
Perf
orm
ance
(war
p G
IPS)
Strid
e-0
Strid
e-1
Strid
e-8
Thoeretical LDST Peak: 122.4 warp GIPS
Global 437.5 GTXN/s
Global (ldst_inst)
Global memory stride access
Input Matrix: A (column major)
Output Matrix: AT (column major)
4096 Bytes (1024 floats)40
96 B
ytes
(102
4 fl
oats
)
4096 Bytes (1024 floats)
4096
Byt
es (1
024
floa
ts)
Coalesced Implementation
Instruction Throughput
Coalesced read32 x 32 (floats)
Coalesced write
Shared memory
Recalculating the array index
Better Efficiency!
Naive
Coalesced
Theoretical
Theoretical
Shared memory bank access
Input Matrix: A (column major)
Output Matrix: AT (column major)
4096 Bytes (1024 floats)40
96 B
ytes
(102
4 fl
oats
)
4096 Bytes (1024 floats)
4096
Byt
es (1
024
floa
ts)
Coalesced Implementation
Coalesced read32 x 32 (floats)
Coalesced write
Shared memory
Recalculating the array index
k k+16T0 T1T2 T3T4 T5… …T30 T31
16-way bank conflict
10-2 10-1 100 101 102
Instuction Intensity (Warp Instructions per Transaction)
100
101
102
103
Perf
orm
ance
(war
p G
IPS)
No
bank
con
flict
32-w
ay b
ank
conf
lict Thoeretical LDST Peak: 122.4 warp GIPS
Shared (ldst_inst)Better Efficiency!
10-2 10-1 100 101 102
Instuction Intensity (Warp Instructions per Transaction)
100
101
102
103
Perf
orm
ance
(war
p G
IPS) Thoeretical Peak: 489.6 warp GIPS
L1 437.5 GTXN/s
HBM 25.9 GTXN/s
L2 93.6 GTXN/s
L1 (tot_inst)L2 (tot_inst)HBM (tot_inst)
Instruction Throughput
Naive
Coalesced
Theoretical
Theoretical
10-2 10-1 100 101 102
Instuction Intensity (Warp Instructions per Transaction)
100
101
102
103
Perf
orm
ance
(war
p G
IPS) Thoeretical Peak: 489.6 warp GIPS
L1 437.5 GTXN/s
HBM 25.9 GTXN/s
L2 93.6 GTXN/s
L1 (tot_inst)L2 (tot_inst)HBM (tot_inst)
Shared memory bank access
Input Matrix: A (column major)
Output Matrix: AT (column major)
4096 Bytes (1024 floats)40
96 B
ytes
(102
4 fl
oats
)
4096 Bytes (1024 floats)
4096
Byt
es (1
024
floa
ts)
Coalesced_NoBankConflict Implementation
Instruction Throughput
Coalesced read32 x 32 (floats)
Coalesced write
Shared memory
Recalculating the array index
k k+16T0 T1
… …
T3T4 T5T2
T30 T31
[32][32+1]
10-2 10-1 100 101 102
Instuction Intensity (Warp Instructions per Transaction)
100
101
102
103
Perf
orm
ance
(war
p G
IPS)
No
bank
con
flict
32-w
ay b
ank
conf
lict Thoeretical LDST Peak: 122.4 warp GIPS
Shared (ldst_inst)Better Efficiency!
Naive
CoalescedNoBankConflict
Theoretical
Theoretical
Summary
• Instruction Throughput • Expanding the applicability of roofline to several emerging computational domains
• Global Memory Access Patterns• Quantify the memory access pattern, e.g. unit-stride vs. gather/scatter
• Shared Memory Access Patterns• Denote the efficiency of shared memory access.
26
There’s more in the paper !!• Thread Predication• More Examples
• HPGMG (mixed precision): three implementations. • BatchSW (integer-only): two implementations
• Tensor core• WMMA• cuBLAS
Closing Thoughts
Emerging Domains Architectural Evolution Practical Use
• Mixed precision • Integer-heavy• No floating points operations
• Instruction throughput• pipeline utilization
• Quantify memory pattern • Unit-stride, scatter/gather
• Efficiency of memory access• Warp efficiency
• Thread predication
Applicability of roofline to GPUs with greater insights
• Unified visualization of bandwidth and access efficiency
Applicability to several emerging computational domains
Rapidly tell how different aspects of modern GPU architectures constrain performance.
What the Instruction Roofline Models Tell us…
Future Work
• Apply our methodology to other accelerated architectures
• Extend the access efficiency concept to networking, I/O, and Lustre file systems.
29
Acknowledgement
• This material is based upon work supported by the Advanced Scientific Computing Research Program in the U.S. Department of Energy, Office of Science, under Award Number DE-AC02-05CH11231.
• This research used resources of the National Energy Research Scientific Computing Center (NERSC) which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231 and the Oak Ridge Leadership Facility which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
• We thank NVIDIA Corporation for their willingness to answer our myriad of questions on nvprof metrics.
30
Questions?
Backup
Top Related