Download - An Instruction Roofline Model for GPUs

Nan Ding, Samuel WilliamsComputational Research DivisionLawrence Berkeley National Lab{nanding, swwilliams}@lbl.gov

Nov.18th ,2019

An Instruction Roofline Model for GPUs

http://lbl.gov

GFLOP/s = min Peak GFLOP/sAI * GB/s

History of Roofline Models

2

• Sustainable performance is bound by

Roofline Model

• Arithmetic Intensity (AI) : FLOPs/Byte

Peak GFLOP/s

Atta

inab

le G

FLO

P/s

Arithmetic Intensity (Flops: Bytes)

DRAM G

B/s• Arithmetic Intensity (AI) : FLOPs/Byte(L1/L2/DRAM)• Additional compute ceilings: No-FMA peak

Peak GFLOP/s

Atta

inab

le G

FLO

P/s

DRAM G

B/s

L2 G

B/sL1 G

B/s


Hierarchical Roofline Model

No-FMA GFLOP/sbetter

Roofline is Useful

3

Memory-bound or Compute-bound Cache effects

Driving performance optimization

4

However…

• Even with sufficient data locality, one cannot guarantee high performance• Pathological memory access patterns?• Re-design the data layout?• Limited by instruction throughput?

• Many applications perform more integer operations than floating-point/no-flopsPeak GFLOP/s

Atta

inab

le G

FLO

P/s

DRAM

GB/

s

L2 G

B/s

L1 G

B/s


???

Motivation for the Instruction Roofline Model

Emerging Domains Architectural Evolution Practical Use

• Mixed precision, Integer-heavy

• No floating points operations

• Instruction throughput• pipeline utilization

• Warp efficiency• Thread predication

• Memory access patterns• reduce wasted transactions• reduce redundant access

More than Flops A New Set of Metrics

• What is holding you back?

• What optimizations should be performed?

• When to stop optimization?

Drive Code Optimization in a Good Visual Manner

GFLOP/s = min Peak GFLOP/ssAI * GB/s

6

The First Step to Instruction Roofline Model

Atta

inab

le G

IPS

DRAM G

B/s

L2 G

B/sL1 G

B/s

Arithmetic Intensity (Flops: Byte)

Peak GIPS

Expanding the applicability of roofline to several emerging computational domains

Form the basis for several subsequent Instruction Roofline-oriented performance analysis technologies

• Identify fetch-decode-issue bottlenecks• Function unit utilization (FPU, tensor,

integer, etc...)


GIPS = min Peak GIPSII * GB/s

7

The Second Step to Instruction Roofline Model

Atta

inab

le G

IPS

DRAM G

B/s

L2 G

B/sL1 G

B/s

Instruction Intensity (Instructions: Byte)

Peak GIPS

• Instruction Intensity: Instructions per Byte

Limitation:Hard to motivate more performance analysis techniques, such as memory pattern access

GFLOP/s = min Peak GFLOP/sAI * GB/s

Expanding the applicability of roofline to several emerging computational domains


GIPS = min Peak GIPSII * GB/s

8

A Final Step to Instruction Roofline Model on GPUs

Atta

inab

le G

IPS

DRAM G

TXN/s

L2 G

TXN/s

L1 G

TXN/s

Instruction Intensity (Instructions: Transaction)

Peak GIPS

• Memory Transaction• the natural unit to access data on NVIDIA GPUs• the natural unit to analyze memory access• a warp-level load/store -> 1 - 32 transactions

Expanding the applicability of roofline to more performance analysis technologies GPUs

• Instruction Intensity • Instructions per Transaction

Form the basis for several subsequent Instruction Roofline-oriented performance analysis technologies on GPUs:

• Memory access patterns

GIPS = min Peak GIPSInstructions/Transaction * GTXN/s

Instruction Roofline Performance Model

• Sustainable performance of is bound by

• Theoretical Peak on V100:80 SMs x 4 warp scheduler x 1 inst/cyc x 1.53GHz = 489.6 GIPS

[1] https://bitbucket.org/berkeleylab/cs-roofline-toolkit 9

GIPS = min Peak GIPSInstruction Intensity * GTransaction/s

10-2 10-1 100 101 102

Instruction Intensity (Warp Instructions per Transaction)

100

101

102

103

Perf

orm

ance

(war

p G

IPS)

HBM 25.9 GTXN/s

L2 93.6 GTXN/s

L1 437.5 GTXN/s

Theoretical Peak: 489.6 warp GIPS

• Memory ceilings on V100:• Based on the GB/s from Empirical Roofline

Toolkit[1]

• Calculate the number of equivalent 32-byte transactions

Instruction Throughput

Capabilities of Instruction Roofline Performance Model （1/2）

Capabilities of Instruction Roofline Model -- Instruction Throughput

• Instruction throughputAll instruction, Transactions of each memory level(L1/L2/HBM), runtime

10-2 10-1 100 101

Instuction Intensity (Warp Instructions per Transaction)

10-1

100

101

102

103

Perf

orm

ance

(war

p G

IPS) Thoeretical Peak: 489.6 warp GIPS

HBM 25.9 GTXN/sL2 93.6 GTXN/sL1 437.5 GTXN/s

L1 (tot_inst)L2 (tot_inst)HBM (tot_inst)

• Insights:1. Distance between the

ceilings and dots can tell memory-bound or instruction-bound

2. Distance between the two plots (different memory level) can tell the data reuse.

11

Theoretical

• Instruction throughputAll instruction, Transactions of each memory level(L1/L2/HBM), runtime

10-2 10-1 100 101


10-1

100

101

102

103

Perf

orm

ance

(war

p G


HBM 25.9 GTXN/sL2 93.6 GTXN/sL1 437.5 GTXN/s


Better!

12

• Insights:1. Distance between the

ceilings and dots can tell memory-bound or instruction-bound

2. Distance between the two plots (different memory level) can tell the data reuse.

Capabilities of Instruction Roofline Model -- Instruction Throughput

Theoretical

Memory Access Patterns

Capabilities of Instruction Roofline Performance Model （2/2）

Memory Access Pattern is Critical to Application Execution Time

14

Easy to code in an inefficient memory pattern

Low performance

Hidden deep in the code

Time consuming to reason the performance

Capabilities of Instruction Roofline Model -- Global Memory Patterns

1 warp-level load/store -> 1 to 32 transactions depending on memory patterns

𝟏 "#$% &'()#' *+,-𝟏 &'()#' -$#./#012(.

“Stride-0” 𝟏 "#$% &'()#' *+,-

𝟑𝟐 &'()#' -$#./#012(./

“Stride-8”

Useful data Waste data

15

war

p0

31

int array[]1 2 7 310 … …

1 global transaction = 32 Bytesw

arp

031

…1 2 70 …

…

…

1 global transaction

9 10 158 … 16

… 1 global transaction

one cache line: 128 bytes

24 … 31

1 2 70 … …

1 global transaction …

9 10 158 … 16

1 global transaction

24 … 31

𝟏 "#$% &'()#' *+,-𝟒 &'()#' -$#./#012(./

“Stride-1” (Unit Stride)

warp

1 2 3 30 310 …

int array[]

…

10-2 10-1 100


10-2

10-1

100

101

102

103Pe

rfor

man

ce (w

arp

GIP

S)Theoretical Peak: 489.6 warp GIPS

Theoretical LDST Peak: 122.4 warp GIPS

Capabilities of Instruction Roofline Performance Model ---- Three Intensity ``Walls’’ for Stride Global Memory Access Patterns

16

Better Efficiency!

𝟏 "#$% &'()#' *+,-𝟏 &'()#' -$#./#012(.

𝟏 "#$% &'()#' *+,-𝟑𝟐 &'()#' -$#./#012(./

𝟏 "#$% &'()#' *+,-𝟒 &'()#' -$#./#012(./

Stride-8 Stride-1 Stride-0

10-4 10-2 100 102 104


10-2

10-1

100

101

102

103

Perfo

rman

ce (w

arp

GIP

S)

Strid

e-0

Stri

de-1

Strid

e-8Global 4

37.5 GTXN/s

Thoeretical Peak: 489.6 warp GIPS

Thoeretical LDST Peak: 122.4 warp GIPS

Capabilities of Instruction Roofline Performance Model ---- Characterize Global Memory Access Patterns

Breakdown the L1 dot into Global Memory Only metrics -> Stride Global Memory Patterns according to Global Memory Walls

X-axis: 𝑮𝒍𝒐𝒃𝒂𝒍 𝑳𝑫𝑺𝑻 𝒊𝒏𝒔𝒕𝒓𝒖𝒄𝒕𝒊𝒐𝒏𝒔𝑮𝒍𝒐𝒃𝒂𝒍 𝒕𝒓𝒂𝒏𝒔𝒂𝒄𝒕𝒊𝒐𝒏𝒔

Y-axis : 𝑮𝒍𝒐𝒃𝒂𝒍 𝑳𝑫𝑺𝑻 𝒊𝒏𝒔𝒕𝒓𝒖𝒄𝒕𝒊𝒐𝒏𝒔𝑹𝒖𝒏𝒕𝒊𝒎𝒆

17

Better Efficiency!

TheoreticalX-axis: 𝑻𝒐𝒕𝒂𝒍 𝒊𝒏𝒔𝒕𝒓𝒖𝒄𝒕𝒊𝒐𝒏𝒔𝑻𝒐𝒕𝒂𝒍 𝑳𝟏 (𝒈𝒍𝒐𝒃𝒂𝒍K𝒔𝒉𝒂𝒓𝒆𝒅K 𝒍𝒐𝒄𝒂𝒍) 𝒕𝒓𝒂𝒏𝒔𝒂𝒄𝒕𝒊𝒐𝒏𝒔

Y-axis : 𝑻𝒐𝒕𝒂𝒍 𝒊𝒏𝒔𝒕𝒓𝒖𝒄𝒕𝒊𝒐𝒏𝒔𝑹𝒖𝒏𝒕𝒊𝒎𝒆

Capabilities of Instruction Roofline Performance Model ---- Shared Memory Access Patterns

Notional Physical

__shared__ int array[32][32]32 banks per shared memory rowEach bank is 4 Byte

Reduce bank conflicts__shared__ int array[32][32+1]

310 1

No bank conflict

__shared__ int array[32][32]

32-way bank conflict

No bank conflict

…

…

…

…warp

310 1…

…

…

310 1…

……

0

0

…war

p1

31

…warp

18

• “No bank conflict” = O"#$% ,P#$QR *+,-O ,P#$QR -$#./#012(.

• different 4-byte word, different bank• same 4-byte word , same bank

• “32-way bank conflict” = O"#$% ,P#$QR *+,-ST ,P#$QR -$#./#012(./

• different 4-byte words, same bank

Capabilities of Instruction Roofline Performance Model ---- Two Intensity ``Walls’’ for Bank Shared Memory Access Patterns

19

10-4 10-2 100 102 104


10-2

10-1

100

101

102

103Pe

rfor

man

ce (w

arp

GIP

S)

No

bank

con

flict

32-w

ay b

ank

conf

lictShare

d 109.3 GTXN/s

Thoeretical Peak: 489.6 warp GIPS


Better Efficiency!

Theoretical

10-4 10-2 100 102 104


10-2

10-1

100

101

102

103

Perf

orm

ance

(war

p G

IPS)

No

bank

con

flict

32-w

ay b

ank

conf

lict

Shared 109.3 GTXN/s

Capabilities of Instruction Roofline Performance Model ---- Characterize Shared Memory Access Patterns

Breakdown the L1 dot into Shared Memory Only metrics -> banked Shared Memory Patterns according to Shared Memory Walls

X-axis: 𝑻𝒐𝒕𝒂𝒍 𝒊𝒏𝒔𝒕𝒓𝒖𝒄𝒕𝒊𝒐𝒏𝒔𝑻𝒐𝒕𝒂𝒍 𝑳𝟏 (𝒈𝒍𝒐𝒃𝒂𝒍 K 𝒔𝒉𝒂𝒓𝒆𝒅K 𝒍𝒐𝒄𝒂𝒍) 𝒕𝒓𝒂𝒏𝒔𝒂𝒄𝒕𝒊𝒐𝒏𝒔

Y-axis : 𝑻𝒐𝒕𝒂𝒍 𝒊𝒏𝒔𝒕𝒓𝒖𝒄𝒕𝒊𝒐𝒏𝒔𝑹𝒖𝒏𝒕𝒊𝒎𝒆

X-axis: 𝑺𝒉𝒂𝒓𝒆𝒅 𝑳𝑫𝑺𝑻 𝒊𝒏𝒔𝒕𝒓𝒖𝒄𝒕𝒊𝒐𝒏𝒔𝑺𝒉𝒂𝒓𝒅 𝒕𝒓𝒂𝒏𝒔𝒂𝒄𝒕𝒊𝒐𝒏𝒔

Y-axis : 𝑺𝒉𝒂𝒓𝒆𝒅 𝑳𝑫𝑺𝑻 𝒊𝒏𝒔𝒕𝒓𝒖𝒄𝒕𝒊𝒐𝒏𝒔𝑹𝒖𝒏𝒕𝒊𝒎𝒆

20

Better Efficiency!

An example to understand the outputs from Instruction Roofline Model

Example: Matrix Transpose

Description A-> AT, stored in column majorMatrix size 1024 x 1024Machine NVIDIA’s latest V100 GPU

Implementations Naive Coalesced Coalesced_NoBankConflict

Simple copy

Coaleced global memory access

Based on “Coalesced”Reduce shared memory bank conflicts

Using 32×8 thread blocks operating on 32×32 matrix tiles

21

Global memory stride access

Input Matrix: A (column major)

Output Matrix: AT (column major)

4096 Bytes (1024 floats)40

96 B

ytes

(102

4 fl

oats

)

Coalesced read

Stride over 4096 Bytes

4096 Bytes (1024 floats)

4096

Byt

es (1

024

floa

ts)

Naive Implementation


10-2 10-1 100 101 102


10-1

100

101

102

103

Perf

orm

ance

(war

p G

IPS)

Strid

e-0

Strid

e-1

Strid

e-8


Global 437.5 GTXN/s

Global (ldst_inst)

Better Efficiency!

10-2 10-1 100 101 102


100

101

102

103

Perf

orm

ance

(war

p G


L1 437.5 GTXN/s

HBM 25.9 GTXN/s

L2 93.6 GTXN/s


Theoretical

Theoretical

10-2 10-1 100 101 102


100

101

102

103

Perf

orm

ance

(war

p G


L1 437.5 GTXN/s

HBM 25.9 GTXN/s

L2 93.6 GTXN/s


10-2 10-1 100 101 102


10-1

100

101

102

103

Perf

orm

ance

(war

p G

IPS)

Strid

e-0

Strid

e-1

Strid

e-8


Global 437.5 GTXN/s

Global (ldst_inst)

Global memory stride access




96 B

ytes

(102

4 fl

oats

)


4096

Byt

es (1

024

floa

ts)

Coalesced Implementation


Coalesced read32 x 32 (floats)

Coalesced write

Shared memory

Recalculating the array index

Better Efficiency!

Naive

Coalesced

Theoretical

Theoretical

Shared memory bank access




96 B

ytes

(102

4 fl

oats

)


4096

Byt

es (1

024

floa

ts)

Coalesced Implementation


Coalesced write

Shared memory


k k+16T0 T1T2 T3T4 T5… …T30 T31

16-way bank conflict

10-2 10-1 100 101 102


100

101

102

103

Perf

orm

ance

(war

p G

IPS)

No

bank

con

flict

32-w

ay b

ank

conf

lict Thoeretical LDST Peak: 122.4 warp GIPS

Shared (ldst_inst)Better Efficiency!

10-2 10-1 100 101 102


100

101

102

103

Perf

orm

ance

(war

p G


L1 437.5 GTXN/s

HBM 25.9 GTXN/s

L2 93.6 GTXN/s



Naive

Coalesced

Theoretical

Theoretical

10-2 10-1 100 101 102


100

101

102

103

Perf

orm

ance

(war

p G


L1 437.5 GTXN/s

HBM 25.9 GTXN/s

L2 93.6 GTXN/s


Shared memory bank access




96 B

ytes

(102

4 fl

oats

)


4096

Byt

es (1

024

floa

ts)

Coalesced_NoBankConflict Implementation



Coalesced write

Shared memory


k k+16T0 T1

… …

T3T4 T5T2

T30 T31

[32][32+1]

10-2 10-1 100 101 102


100

101

102

103

Perf

orm

ance

(war

p G

IPS)

No

bank

con

flict

32-w

ay b

ank

conf

lict Thoeretical LDST Peak: 122.4 warp GIPS

Shared (ldst_inst)Better Efficiency!

Naive

CoalescedNoBankConflict

Theoretical

Theoretical

Summary

• Instruction Throughput • Expanding the applicability of roofline to several emerging computational domains

• Global Memory Access Patterns• Quantify the memory access pattern, e.g. unit-stride vs. gather/scatter

• Shared Memory Access Patterns• Denote the efficiency of shared memory access.

26

There’s more in the paper !!• Thread Predication• More Examples

• HPGMG (mixed precision): three implementations. • BatchSW (integer-only): two implementations

• Tensor core• WMMA• cuBLAS

Closing Thoughts

Emerging Domains Architectural Evolution Practical Use

• Mixed precision • Integer-heavy• No floating points operations

• Instruction throughput• pipeline utilization

• Quantify memory pattern • Unit-stride, scatter/gather

• Efficiency of memory access• Warp efficiency

• Thread predication

Applicability of roofline to GPUs with greater insights

• Unified visualization of bandwidth and access efficiency

Applicability to several emerging computational domains

Rapidly tell how different aspects of modern GPU architectures constrain performance.

What the Instruction Roofline Models Tell us…

Future Work

• Apply our methodology to other accelerated architectures

• Extend the access efficiency concept to networking, I/O, and Lustre file systems.

29

Acknowledgement

• This material is based upon work supported by the Advanced Scientific Computing Research Program in the U.S. Department of Energy, Office of Science, under Award Number DE-AC02-05CH11231.

• This research used resources of the National Energy Research Scientific Computing Center (NERSC) which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231 and the Oak Ridge Leadership Facility which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

• We thank NVIDIA Corporation for their willingness to answer our myriad of questions on nvprof metrics.

30

Questions?

Backup