Introduction to the Roofline Model

Introduction to theRoofline Model

Samuel WilliamsComputational Research DivisionLawrence Berkeley National Lab

SWWilliams@lbl.gov

You just spent 6 months porting your application to GPUs

Are you done?

What is “Good” Performance?

§ Imagine profiling the mix of loop nests in an application when running on the GPUo GFLOP/s alone may not be particularly

insightful

Loop nest (kernel)

o speedup relative to a Xeon may seem random

2. making good use of the GPU’s compute and/or bandwidth capabilities

1. Operating in the throughput-limited regimenot sensitive to Amdahl effects, D2H/H2D transfers, launch overheads, etc…

Ø Ultimately, we need a quantitative model rather than qualitative statements like “good”

§ Two fundamental aspects to “Good” performance…

Roofline Model

§ Roofline Model is a throughput-oriented performance model

§ Tracks rates not times§ Independent of ISA and architecture§ applies to CPUs, GPUs, Google

TPUs1, FPGAs, etc…§ Helps quantify Good Performance

51Jouppi et al, “In-Datacenter Performance Analysis of a Tensor Processing Unit”, ISCA, 2017.

https://crd.lbl.gov/departments/computer-science/PAR/research/roofline

Reduced Model

§ Superscalar architectures can be complex§ Don’t model / simulate full architecture§ Created simplified processor architecture

6https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)

Reduced Model

§ Superscalar architectures can be complex§ Don’t model / simulate full architecture§ Created simplified processor architecture§ Make assumptions on performance and

usage…o Cores can attain peak GFLOP/s on local datao Cores execute load-balanced SPMD codeo NoC bisection bandwidth is sufficiento There is sufficient cache bandwidth and capacity

such that they do not affect performanceØ Basis for DRAM Roofline Model

DRAMDRAM GB/s

Perfect Caches

Compute GFLOP/s

Data Movement or Compute?

§ Which takes longer?o Data Movemento Compute?

DRAMDRAM GB/s

Perfect Caches

Compute GFLOP/s

#FP ops / Peak GFLOP/sTime = max

#Bytes / Peak GB/s

DRAMDRAM GB/s

Perfect Caches

Compute GFLOP/s

1 / Peak GFLOP/sTime#FP ops #Bytes / #FP ops / Peak GB/s

§ Is performance limited by compute or data movement?

DRAMDRAM GB/s

Perfect Caches

Compute GFLOP/s

Peak GFLOP/s#FP opsTime (#FP ops / #Bytes) * Peak GB/s

DRAMDRAM GB/s

Perfect Caches

Compute GFLOP/s

Peak GFLOP/sGFLOP/s = min

AI * Peak GB/sAI (Arithmetic Intensity) = FLOPs / Bytes (as presented to DRAM )

Arithmetic Intensity

§ Measure of data locality (data reuse)§ Ratio of Total Flops performed to Total Bytes moved§ For the DRAM Roofline…

o Total Bytes to/from DRAM o Includes all cache and prefetcher effectso Can be very different from total loads/stores (bytes requested)o Equal to ratio of sustained GFLOP/s to sustained GB/s (time cancels)

(DRAM) Roofline Model

§ Plot Roofline bound using Arithmetic Intensity as the x-axis

§ Log-log scale makes it easy to doodle, extrapolate performance along Moore’s Law, etc…

Peak GFLOP/s

HBM GB/s

Arithmetic Intensity (FLOP:Byte)

AI * Peak GB/sAI (Arithmetic Intensity) = FLOPs / Bytes (moved to/from DRAM )

Attain

Transition @ AI ==Peak GFLOP/s / Peak GB/s ==

‘Machine Balance’

Peak GFLOP/s

§ Plot Roofline bound using Arithmetic

Intensity as the x-axis

§ Log-log scale makes it easy to

doodle, extrapolate performance

along Moore’s Law, etc…

unattainable performance(greater than peak GFLOP/s)

unattain

perform

fficien

dwidth)

performanceHBM

Peak GFLOP/s

§ Roofline tessellates this 2D view of performance into 5 regions…

Roofline Example #1

§ Typical machine balance is 5-10 FLOPs per byte…o 40-80 FLOPs per double to exploit compute capabilityo Artifact of technology and moneyo Unlikely to improve

#pragma omp parallel forfor(i=0;i<N;i++){

Z[i] = X[i] + alpha*Y[i];}

§ Consider STREAM Triad…

o 2 FLOPs per iterationo Transfer 24 bytes per iteration (read X[i], Y[i], write Z[i])o AI = 0.083 FLOPs per byte == Memory bound

HBM GB/s

Peak GFLOP/s

GFLOP/s ≤ AI * HBM GB/s

Roofline Example #2

§ Conversely, 7-point constant coefficient stencil…

#pragma omp parallel forfor(k=1;k<dim+1;k++){for(j=1;j<dim+1;j++){for(i=1;i<dim+1;i++){new[k][j][i] = -6.0*old[k ][j ][i ]

+ old[k ][j ][i-1]+ old[k ][j ][i+1]+ old[k ][j-1][i ]+ old[k ][j+1][i ]+ old[k-1][j ][i ]+ old[k+1][j ][i ];

HBMHBM GB/s

Perfect Caches

Compute GFLOP/s

Roofline Example #2

§ Conversely, 7-point constant coefficient stencil…o 7 FLOPso 8 memory references (7 reads, 1 store) per point

HBMHBM GB/s

Perfect Caches

Compute GFLOP/s

o AI = 7 / (8*8) = 0.11 FLOPs per byte(measured at the L1)

Roofline Example #2

§ Conversely, 7-point constant coefficient stencil…o 7 FLOPso 8 memory references (7 reads, 1 store) per pointo Ideally, cache will filter all but 1 read and 1 write per point

HBMHBM GB/s

Perfect Caches

Compute GFLOP/s

Roofline Example #2

§ Conversely, 7-point constant coefficient stencil…o 7 FLOPso 8 memory references (7 reads, 1 store) per pointo Ideally, cache will filter all but 1 read and 1 write per pointØ 7 / (8+8) = 0.44 FLOPs per byte (DRAM)

HBMHBM GB/s

Perfect Caches

Compute GFLOP/s

Roofline Example #2

§ Conversely, 7-point constant coefficient stencil…o 7 FLOPso 8 memory references (7 reads, 1 store) per pointo Ideally, cache will filter all but 1 read and 1 write per pointØ 7 / (8+8) = 0.44 FLOPs per byte (DRAM)

HBM GB/s

Arithmetic Intensity (FLOP:Byte)0.083

7-pointStencil

GFLOP/s ≤ AI * HBM GB/s

Peak GFLOP/s

== memory bound, but 5x the FLOP rate as TRIAD

§ Think back to our mix of loop nests…

Loop nest (kernel)

§ We can sort kernels by arithmetic intensity…

§ … and compare performance relative to machine capabilities

Peak GFLOP/s

HBM GB/s

50% of Peak

§ Kernels near the roofline are making good use of computational resources

Peak GFLOP/s

50% of Peak

50% of

STREAM

Peak GFLOP/s

HBM GB/s

Ø kernels can have low performance (GFLOP/s), but make good use (%STREAM) of a machine

50% of Peak

50% of

STREAM

Peak GFLOP/s

HBM GB/s

Ø kernels can have low performance (GFLOP/s), but make good use (%STREAM) of a machine

Ø kernels can have high performance (GFLOP/s), but still make poor use of a machine (%peak)

Roofline is made of two components

§ Machine Modelo Lines defined by peak GB/s and GF/s

(Benchmarking)o Unique to each architectureo Common to all apps on that architecture

Peak GFLOP/s

HBM GB/s

50% of

STREAM

50% of Peak

Roofline is made of two components

§ Machine Modelo Lines defined by peak GB/s and GF/s

(Benchmarking)

o Unique to each architecture

o Common to all apps on that architecture

§ Application Characteristicso Dots defined by application GFLOP’s and

GB’s (Application Instrumentation)

o Unique to each application

o Unique to each architecture

General Performance Optimization Strategy

§ Get to the Roofline

Peak GFLOP/s

50% of Peak

HBM GB/s

50% of

STREAM

General Performance Optimization Strategy

§ Get to the Roofline

Peak GFLOP/s

50% of Peak

HBM GB/s

50% of

STREAM

§ Increase Arithmetic Intensity when bandwidth-limitedo Reducing data movement increases AI

How can performance ever be below the Roofline?

Performance Below the Roofline?

§ Insufficient cache bandwidth and data locality

§ “Lack of Parallelism”o Thread Divergence (idle threads)o Insufficient Occupancy (idle warp sched)o Insufficient #Thread Blocks (idle SMs)

§ Integer-heavy Codeso Non-FP instructions impair FP

performanceo No FP instructions… AI=0

§ Instruction Mixo Lack of FMAo Mixed Precision effectso Lack of Tensor Core operations

Additional FP CeilingsCharlene Yang, Thorsten Kurth,Samuel Williams, "HierarchicalRoofline analysis for GPUs:Accelerating performanceoptimization for the NERSC-9Perlmutter system", Concurrencyand Computation: Practice andExperience (CCPE), August 2019.

Instruction Roofline Model

Nan Ding, Samuel Williams, "AnInstruction Roofline Model for GPUs",Performance Modeling, Benchmarking,and Simulation (PMBS), BEST PAPERAWARD, November 2019.

10-2 10-1 100 101 102

Instruction Intensity (Warp Instructions per Transaction)

HBM 25.9 GTXN/s

L2 93.6 GTXN/s

L1 437.5 GTXN/sThoeretical Peak: 489.6 warp GIPSRoofline Scaling

TrajectoriesKhaled Ibrahim, Samuel Williams, LeonidOliker, "Performance Analysis of GPUProgramming Models using the RooflineScaling Trajectories", InternationalSymposium on Benchmarking,Measuring and Optimizing (Bench),BEST PAPER AWARD, November 2019.

0.01 0.05 0.50 5.00 50.00

Arithmetic Intensity (Flops/Byte)

VFMA (1229)

ADD (c32) (77)

ADD (c1) (9.2)DRAM (c3

2) (128)

DRAM (c1) (1

●●●

roofline_summary_sp_lbl

● Class AClass BClass C

c16c32c64

HBM GB/s

Peak GFLOP/s

Hierarchical Roofline Model

Charlene Yang, Thorsten Kurth, SamuelWilliams, "Hierarchical Roofline analysisfor GPUs: Accelerating performanceoptimization for the NERSC-9 Perlmuttersystem", Concurrency and Computation:Practice and Experience (CCPE), August2019.

Summary

Why We Use Roofline…

1. Determine when we’re done optimizing codeo Assess performance relative to machine capabilitieso Track progress towards optimalityo Motivate need for algorithmic changes

2. Identify performance bottlenecks & motivate software optimizations

3. Understand performance differences between Architectures, Programming Models, implementations, etc…o Why do some Architectures/Implementations move more data than others?o Why do some compilers outperform others?

4. Predict performance on future machines / architectureso Set realistic performance expectationso Drive for Architecture-Computer Science-Applied Math Co-Design

Take away

§ Roofline helps understand application performance relative to machine capabilities§ just the beginning of the optimization process§ Other bottleneck- or architecture-specific tools can be used to refine the process

§ Roofline helps frame the conversation between…§ Application Developers§ Computer Scientists§ Applied Mathematicians§ Processor Vendors

…providing a common mental model and optimization language

Questions

Introduction to the Roofline Model

Documents

Transcript of Introduction to the Roofline Model

A roofline model of energy

Basics of performance modeling for numerical applications ... · Basics of performance modeling for numerical applications: Roofline model and beyond Georg Hager, Jan Treibig, Gerhard

Roofline Model - CSRC

· model — the Cerato 2013 today. The all-new Cerato has cab-forward styling, a coupe-like roofline, distinctive concave door contours and a rising beltline with chrome molding

On Using the Rooﬂine Model with Lower Bounds on Data …cacs.usc.edu/education/cs653/Elango-Roofline-ACMTACO15.pdf · VENMUGIL ELANGO and NASER SEDAGHATI, The Ohio State University

Gables: A Roofline Model for Mobile SoCs · Mark D. Hill ∗ Computer Sciences Department ... from video processing to deep neural network processing. Nowhere is the trend toward

The Roofline Model

Roofline Model Toolkit : A Practical Tool for Architectural and Program Analysis Yu Jung Lo*, Samuel Williams†, Brian Van Straalen†, Terry Ligocki†, Matthew.

Business Model Innovation: Introduction to Implementation · Business Model Innovation: Introduction to Implementation ... McDonalds Vietnam VS ... Business Model Innovation: Introduction

1918 Government Housing Tour - rockislandpreservation.org · With the exception of the roofline this home is a mirror image of the house at 1545 40th Street. Here a front gabled roofline

A Vision for Integrating Performance Counters into the Roofline model

BUILDING PRODUCTS Range Guide · BUILDING PRODUCTS Range Guide Design Options for PVC Roofline Cladding Rainwater The Specifier’s Choice for Cellular PVC Roofline & Cladding Uniclass

the hassle-free way · products you need exactly when you need them – and at great-value prices. Introducing Roofline Long-life, low-maintenance PVC-U roofline systems Fascias,

An Instruction Roofline Model for GPUs

Parallel Architecture - Colorado State University · Roofline Model (Will be using this in HW2) Roofline: An Insightful Visual Performance Model for Multicore Architecture – By

Introduction Model selection Model validationjitkomut.eng.chula.ac.th/ee531/modelsel.pdf · Model Selection and Model Validation • Introduction • Model selection • Model validation

Commodore - allcarcentral.com€¦ · the Commodore even looks more compact. But here’s the thing. The new model actually has a longer wheelbase and wider track, a higher roofline

4 Guidance for Future Development4.2.2 Roof levels, styles and extensions (a) Roofline developments All roofline developments or alterations require planning permission in conservation

The complete roofline solution - Westburytimberandallied.co.uk/pdfs/fascia-soffits/freefoam... · 2014. 3. 18. · “My new roofline has given me real peace of mind” Mrs June Sanger,

Roofline, Window and Cladding Systems · 2020. 9. 7. · ROOFLINE, WINDOW & CLADDING SSTEMS 6 For more information visit Underground Drainage Systems White Roofline Joints and Trims