An Instruction Roofline Model for GPUs


Transcript of An Instruction Roofline Model for GPUs

Page 1: An Instruction Roofline Model for GPUs

Nan Ding, Samuel Williams
Computational Research Division
Lawrence Berkeley National Lab
{nanding, swwilliams}@lbl.gov

Nov. 18th, 2019

Page 2: An Instruction Roofline Model for GPUs

History of Roofline Models

Roofline Model
• Sustainable performance is bound by

    GFLOP/s = min(Peak GFLOP/s, AI × GB/s)

• Arithmetic Intensity (AI): FLOPs / Byte

[Figure: attainable GFLOP/s vs. arithmetic intensity (FLOPs:Bytes), bounded by the Peak GFLOP/s and DRAM GB/s ceilings]

Hierarchical Roofline Model
• Arithmetic Intensity (AI): FLOPs / Byte at each memory level (L1/L2/DRAM)
• Additional compute ceilings: No-FMA peak

[Figure: attainable GFLOP/s vs. arithmetic intensity with L1, L2, and DRAM bandwidth ceilings plus the Peak and No-FMA GFLOP/s ceilings; up and to the right is better]
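The min formulation above is simple enough to compute directly; a minimal sketch in Python (function and parameter names are my own, and the example numbers are illustrative, not from the slides):

```python
def roofline_gflops(ai_flops_per_byte, peak_gflops, bw_gbs):
    """Attainable GFLOP/s at a given arithmetic intensity (AI).

    A kernel is bound either by the compute ceiling (peak_gflops)
    or by the memory ceiling (AI * bandwidth), whichever is lower.
    """
    return min(peak_gflops, ai_flops_per_byte * bw_gbs)

# Illustrative numbers: a low-AI kernel is memory-bound,
# a high-AI kernel hits the compute ceiling.
print(roofline_gflops(0.5, 7000.0, 900.0))    # memory-bound: 450.0
print(roofline_gflops(100.0, 7000.0, 900.0))  # compute-bound: 7000.0
```

The hierarchical model applies the same formula once per memory level, each with its own bandwidth ceiling.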

Page 3: An Instruction Roofline Model for GPUs

Roofline is Useful

• Memory-bound or compute-bound
• Cache effects
• Driving performance optimization

Page 4: An Instruction Roofline Model for GPUs

However…

• Even with sufficient data locality, one cannot guarantee high performance
    • Pathological memory access patterns?
    • Re-design the data layout?
    • Limited by instruction throughput?
• Many applications perform more integer operations than floating-point operations, or no flops at all

[Figure: hierarchical Roofline (Peak GFLOP/s, L1/L2/DRAM ceilings) annotated with "???" — where do such applications land?]

Page 5: An Instruction Roofline Model for GPUs

Motivation for the Instruction Roofline Model

More than Flops (Emerging Domains)
• Mixed precision, integer-heavy
• No floating-point operations

A New Set of Metrics (Architectural Evolution)
• Instruction throughput, pipeline utilization
• Warp efficiency, thread predication
• Memory access patterns: reduce wasted transactions, reduce redundant accesses

Drive Code Optimization in a Visual Manner (Practical Use)
• What is holding you back?
• What optimizations should be performed?
• When to stop optimizing?

Page 6: An Instruction Roofline Model for GPUs

The First Step to the Instruction Roofline Model

Classic Roofline: GFLOP/s = min(Peak GFLOP/s, AI × GB/s)

• Sustainable performance is bound by

    GIPS = min(Peak GIPS, II × GB/s)

• Expands the applicability of Roofline to several emerging computational domains
• Forms the basis for subsequent Instruction Roofline-oriented performance analysis techniques
    • Identify fetch-decode-issue bottlenecks
    • Function unit utilization (FPU, tensor, integer, etc.)

[Figure: attainable GIPS vs. arithmetic intensity (FLOPs:Byte), with the Peak GIPS and L1/L2/DRAM GB/s ceilings]

Page 7: An Instruction Roofline Model for GPUs

The Second Step to the Instruction Roofline Model

Classic Roofline: GFLOP/s = min(Peak GFLOP/s, AI × GB/s)

• Sustainable performance is bound by

    GIPS = min(Peak GIPS, II × GB/s)

• Instruction Intensity (II): Instructions per Byte
• Expands the applicability of Roofline to several emerging computational domains
• Limitation: hard to motivate further performance analysis techniques, such as memory access patterns

[Figure: attainable GIPS vs. instruction intensity (Instructions:Byte), with the Peak GIPS and L1/L2/DRAM GB/s ceilings]

Page 8: An Instruction Roofline Model for GPUs

A Final Step to the Instruction Roofline Model on GPUs

• Sustainable performance is bound by

    GIPS = min(Peak GIPS, (Instructions/Transaction) × GTXN/s)

• Instruction Intensity: Instructions per Transaction
• Memory transaction:
    • the natural unit for accessing data on NVIDIA GPUs
    • the natural unit for analyzing memory access
    • one warp-level load/store → 1 to 32 transactions
• Expands the applicability of Roofline to more performance analysis techniques on GPUs
• Forms the basis for subsequent Instruction Roofline-oriented performance analysis techniques on GPUs:
    • Memory access patterns

[Figure: attainable GIPS vs. instruction intensity (Instructions:Transaction), with the Peak GIPS and L1/L2/DRAM GTXN/s ceilings]

Page 9: An Instruction Roofline Model for GPUs

Instruction Roofline Performance Model

• Sustainable performance is bound by

    GIPS = min(Peak GIPS, Instruction Intensity × GTXN/s)

• Theoretical peak on V100:
    80 SMs × 4 warp schedulers × 1 inst/cycle × 1.53 GHz = 489.6 warp GIPS
• Memory ceilings on V100:
    • Based on the GB/s measured by the Empirical Roofline Toolkit [1]
    • Converted into the equivalent number of 32-byte transactions
    • HBM 25.9 GTXN/s, L2 93.6 GTXN/s, L1 437.5 GTXN/s

[Figure: Instruction Roofline for V100 — performance (warp GIPS) vs. instruction intensity (warp instructions per transaction), with the theoretical peak of 489.6 warp GIPS and the HBM/L2/L1 transaction ceilings]

[1] https://bitbucket.org/berkeleylab/cs-roofline-toolkit
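The ceiling arithmetic on this slide can be reproduced in a few lines; a Python sketch (the HBM GB/s figure below is back-derived from the 25.9 GTXN/s ceiling, so treat it as an assumption rather than an actual ERT measurement):

```python
# Theoretical warp-instruction peak on V100:
# 80 SMs x 4 warp schedulers x 1 inst/cycle x 1.53 GHz.
sms, schedulers, inst_per_cycle, ghz = 80, 4, 1, 1.53
peak_warp_gips = sms * schedulers * inst_per_cycle * ghz
print(round(peak_warp_gips, 1))  # 489.6

# Memory ceilings: convert a measured bandwidth (GB/s) into
# equivalent 32-byte transactions per second (GTXN/s).
def gtxn_per_s(gb_per_s, txn_bytes=32):
    return gb_per_s / txn_bytes

hbm_gbs = 828.8  # assumed HBM bandwidth, consistent with the slide's ceiling
print(round(gtxn_per_s(hbm_gbs), 1))  # 25.9
```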

Page 10: An Instruction Roofline Model for GPUs

Instruction Throughput

Capabilities of the Instruction Roofline Performance Model (1/2)

Page 11: An Instruction Roofline Model for GPUs

Capabilities of Instruction Roofline Model -- Instruction Throughput

• Instruction throughput: all instructions, transactions at each memory level (L1/L2/HBM), runtime

[Figure: Instruction Roofline with L1 (tot_inst), L2 (tot_inst), and HBM (tot_inst) dots plotted against the theoretical peak (489.6 warp GIPS) and the HBM 25.9 / L2 93.6 / L1 437.5 GTXN/s ceilings]

• Insights:
    1. The distance between the ceilings and the dots tells whether a kernel is memory-bound or instruction-bound.
    2. The distance between dots at different memory levels tells the degree of data reuse.
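Placing a kernel's dots on these axes needs only three profiler-style counts; a sketch with hypothetical numbers (all counter values below are invented for illustration, not real measurements or nvprof metric names):

```python
# Hypothetical profile of one kernel (values invented for illustration).
warp_instructions = 4.0e9  # total warp instructions executed
l1_transactions   = 8.0e9  # total L1 transactions (global + shared + local)
runtime_s         = 0.5    # kernel runtime in seconds

# Coordinates of the L1 dot on the Instruction Roofline:
intensity = warp_instructions / l1_transactions  # warp instructions per txn
gips      = warp_instructions / runtime_s / 1e9  # performance in warp GIPS

print(intensity, gips)  # 0.5 8.0
```

The L2 and HBM dots reuse the same y-value (same instruction count and runtime) with their own transaction counts in the denominator of the x-value, which is why the horizontal gaps between dots expose data reuse.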

Page 12: An Instruction Roofline Model for GPUs

Capabilities of the Instruction Roofline Model -- Instruction Throughput

• Instruction throughput: all instructions, transactions at each memory level (L1/L2/HBM), runtime

[Figure: the same Instruction Roofline as the previous slide, annotated "Better!" — moving dots up and to the right improves performance]

• Insights:
    1. The distance between the ceilings and the dots tells whether a kernel is memory-bound or instruction-bound.
    2. The distance between dots at different memory levels tells the degree of data reuse.

Page 13: An Instruction Roofline Model for GPUs

Memory Access Patterns

Capabilities of the Instruction Roofline Performance Model (2/2)

Page 14: An Instruction Roofline Model for GPUs

Memory Access Pattern is Critical to Application Execution Time

• Easy to code an inefficient memory pattern
• Low performance
• Hidden deep in the code
• Time-consuming to reason about the performance

Page 15: An Instruction Roofline Model for GPUs

Capabilities of Instruction Roofline Model -- Global Memory Patterns

• One warp-level load/store → 1 to 32 transactions, depending on the memory pattern
    • one global transaction = 32 bytes; one cache line = 128 bytes

• "Stride-0" (broadcast): 1 warp global LDST → 1 global transaction
• "Stride-1" (unit stride): 1 warp global LDST → 4 global transactions
• "Stride-8": 1 warp global LDST → 32 global transactions, mostly wasted data

[Figure: a warp of 32 threads reading int array[] under each stride pattern, showing the 32-byte global transactions generated and the useful vs. wasted data in each]
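The 1-to-32 transaction counts above follow from how many distinct 32-byte segments a warp touches; a small sketch that counts them for 4-byte loads at a given element stride (the helper is my own, not from the paper):

```python
def global_transactions(stride, warp_size=32, elem_bytes=4, txn_bytes=32):
    """Number of 32-byte global transactions generated by one warp-level
    load of 4-byte elements at the given element stride."""
    addrs = [t * stride * elem_bytes for t in range(warp_size)]
    # Each distinct 32-byte segment touched costs one transaction.
    return len({a // txn_bytes for a in addrs})

print(global_transactions(0))  # "Stride-0": broadcast -> 1 transaction
print(global_transactions(1))  # "Stride-1": unit stride -> 4 transactions
print(global_transactions(8))  # "Stride-8": -> 32 transactions
```

The corresponding intensity "walls" on the next slide are simply 1 / (transactions per warp LDST).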

Page 16: An Instruction Roofline Model for GPUs

Capabilities of the Instruction Roofline Performance Model ---- Three Intensity "Walls" for Strided Global Memory Access Patterns

• Stride-0 wall: 1 warp global LDST / 1 global transaction
• Stride-1 wall: 1 warp global LDST / 4 global transactions
• Stride-8 wall: 1 warp global LDST / 32 global transactions

[Figure: the Stride-8, Stride-1, and Stride-0 walls drawn as vertical lines on the Instruction Roofline, together with the theoretical peak (489.6 warp GIPS) and the theoretical LDST peak (122.4 warp GIPS); moving right toward Stride-0 means better efficiency]

Page 17: An Instruction Roofline Model for GPUs

Capabilities of the Instruction Roofline Performance Model ---- Characterize Global Memory Access Patterns

Break the L1 dot down into global-memory-only metrics, and read off the strided global memory pattern against the global memory walls:

• L1 dot — X-axis: Total instructions / Total L1 (global + shared + local) transactions; Y-axis: Total instructions / Runtime
• Global dot — X-axis: Global LDST instructions / Global transactions; Y-axis: Global LDST instructions / Runtime

[Figure: the Global dot plotted against the Stride-0/1/8 walls, the Global 437.5 GTXN/s ceiling, the theoretical peak (489.6 warp GIPS), and the theoretical LDST peak (122.4 warp GIPS); moving right means better efficiency]

Page 18: An Instruction Roofline Model for GPUs

Capabilities of the Instruction Roofline Performance Model ---- Shared Memory Access Patterns

• __shared__ int array[32][32]: 32 banks per shared memory row; each bank is 4 bytes
• A warp accessing a column of array[32][32] hits the same bank 32 times: a 32-way bank conflict
• Padding reduces bank conflicts: with __shared__ int array[32][32+1], each row is shifted by one bank, so a column access touches 32 different banks — no bank conflict

[Figure: notional vs. physical layout of the shared memory array, showing the 32-way bank conflict for column access into array[32][32] and the conflict-free access into array[32][32+1]]
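The effect of the +1 padding can be checked with a few lines of Python that model 32 banks of 4-byte words (a simplified model of conflict degree; real hardware serializes conflicting accesses into replays):

```python
from collections import Counter

def max_bank_conflict(row_width, column=0, warp_size=32, num_banks=32):
    """Worst-case bank-conflict degree when a warp reads one element per
    row down a column of __shared__ int array[32][row_width]."""
    banks = Counter((t * row_width + column) % num_banks
                    for t in range(warp_size))
    return max(banks.values())

print(max_bank_conflict(32))  # array[32][32]:   32-way bank conflict
print(max_bank_conflict(33))  # array[32][32+1]: conflict-free -> 1
```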

Page 19: An Instruction Roofline Model for GPUs

Capabilities of the Instruction Roofline Performance Model ---- Two Intensity "Walls" for Banked Shared Memory Access Patterns

• "No bank conflict" wall: 1 warp shared LDST / 1 shared transaction
    • different 4-byte words map to different banks; the same 4-byte word maps to the same bank
• "32-way bank conflict" wall: 1 warp shared LDST / 32 shared transactions
    • different 4-byte words map to the same bank

[Figure: the two bank-conflict walls on the Instruction Roofline, together with the Shared 109.3 GTXN/s ceiling, the theoretical peak (489.6 warp GIPS), and the theoretical LDST peak (122.4 warp GIPS); moving right toward "no bank conflict" means better efficiency]

Page 20: An Instruction Roofline Model for GPUs

Capabilities of the Instruction Roofline Performance Model ---- Characterize Shared Memory Access Patterns

Break the L1 dot down into shared-memory-only metrics, and read off the banked shared memory pattern against the shared memory walls:

• L1 dot — X-axis: Total instructions / Total L1 (global + shared + local) transactions; Y-axis: Total instructions / Runtime
• Shared dot — X-axis: Shared LDST instructions / Shared transactions; Y-axis: Shared LDST instructions / Runtime

[Figure: the Shared dot plotted against the no-conflict and 32-way-conflict walls and the Shared 109.3 GTXN/s ceiling; moving right means better efficiency]

Page 21: An Instruction Roofline Model for GPUs

An example to understand the outputs of the Instruction Roofline Model

Example: Matrix Transpose

• Description: A → Aᵀ, stored in column major
• Matrix size: 1024 × 1024
• Machine: NVIDIA V100 GPU
• Using 32×8 thread blocks operating on 32×32 matrix tiles

Implementations:
• Naive — simple copy
• Coalesced — coalesced global memory access
• Coalesced_NoBankConflict — based on "Coalesced"; reduces shared memory bank conflicts

Page 22: An Instruction Roofline Model for GPUs

Naive Implementation — Instruction Throughput

• Global memory stride access: reading the input matrix A (column major) is coalesced, but writing the output Aᵀ (column major) strides over 4096 bytes (1024 floats) per element

[Figure, left: the access pattern on the 1024×1024 matrices (4096 bytes per column) — coalesced read of A, strided write of Aᵀ]
[Figure, right: Instruction Roofline for the naive kernel — the Global (ldst_inst) dot plotted against the Stride-0/1/8 walls, the Global 437.5 GTXN/s ceiling, and the theoretical LDST peak (122.4 warp GIPS); the L1/L2/HBM (tot_inst) dots plotted against the theoretical peak (489.6 warp GIPS) and the HBM 25.9 / L2 93.6 / L1 437.5 GTXN/s ceilings]

Page 23: An Instruction Roofline Model for GPUs

Coalesced Implementation — Instruction Throughput

• Coalesced read of a 32×32 (float) tile into shared memory, then coalesced write of the transposed tile; the array index is recalculated through shared memory

[Figure, left: the access pattern — both the read of A and the write of Aᵀ are now coalesced via the shared memory tile]
[Figure, right: Instruction Roofline — the Global (ldst_inst) dot moves from the Stride-8 wall toward the Stride-1 wall (better efficiency), and the Naive and Coalesced L1/L2/HBM (tot_inst) dots are compared against the ceilings]

Page 24: An Instruction Roofline Model for GPUs

Coalesced Implementation — Shared Memory Bank Access

• Same kernel as before: coalesced 32×32 (float) tile read and write through shared memory, recalculating the array index
• Threads T0, T1, …, T30, T31 of a warp map pairwise onto banks k and k+16: a 16-way bank conflict

[Figure, left: the shared memory tile and the thread-to-bank mapping producing the 16-way conflict]
[Figure, right: Instruction Roofline — the Shared (ldst_inst) dot sits between the no-bank-conflict and 32-way-conflict walls, below the theoretical LDST peak (122.4 warp GIPS), alongside the Naive and Coalesced (tot_inst) dots]

Page 25: An Instruction Roofline Model for GPUs

Coalesced_NoBankConflict Implementation — Shared Memory Bank Access

• Padding the shared memory tile to [32][32+1] spreads threads T0…T31 across distinct banks, removing the bank conflict

[Figure, left: the padded tile and the now conflict-free thread-to-bank mapping]
[Figure, right: Instruction Roofline — the Shared (ldst_inst) dot moves to the no-bank-conflict wall (better efficiency); the Naive, Coalesced, and CoalescedNoBankConflict (tot_inst) dots are compared against the L1/L2/HBM ceilings]

Page 26: An Instruction Roofline Model for GPUs

Summary

• Instruction Throughput
    • Expands the applicability of Roofline to several emerging computational domains
• Global Memory Access Patterns
    • Quantify the memory access pattern, e.g. unit-stride vs. gather/scatter
• Shared Memory Access Patterns
    • Denote the efficiency of shared memory access

There's more in the paper!
• Thread predication
• More examples:
    • HPGMG (mixed precision): three implementations
    • BatchSW (integer-only): two implementations
• Tensor cores:
    • WMMA
    • cuBLAS

Page 27: An Instruction Roofline Model for GPUs

Closing Thoughts

Page 28: An Instruction Roofline Model for GPUs

What the Instruction Roofline Model Tells Us…

Emerging Domains:
• Mixed precision, integer-heavy
• No floating-point operations

Architectural Evolution:
• Instruction throughput, pipeline utilization
• Quantify memory patterns: unit-stride, scatter/gather
• Efficiency of memory access: warp efficiency, thread predication

Practical Use:
• Extends the applicability of Roofline to GPUs with greater insight
• Unified visualization of bandwidth and access efficiency
• Applicability to several emerging computational domains
• Rapidly tells how different aspects of modern GPU architectures constrain performance

Page 29: An Instruction Roofline Model for GPUs

Future Work

โ€ข Apply our methodology to other accelerated architectures

โ€ข Extend the access efficiency concept to networking, I/O, and Lustre file systems.


Page 30: An Instruction Roofline Model for GPUs

Acknowledgement

โ€ข This material is based upon work supported by the Advanced Scientific Computing Research Program in the U.S. Department of Energy, Office of Science, under Award Number DE-AC02-05CH11231.

โ€ข This research used resources of the National Energy Research Scientific Computing Center (NERSC) which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231 and the Oak Ridge Leadership Facility which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

โ€ข We thank NVIDIA Corporation for their willingness to answer our myriad of questions on nvprof metrics.


Page 31: An Instruction Roofline Model for GPUs

Questions?

Page 32: An Instruction Roofline Model for GPUs

Backup