Performance Analysis using the Roofline Model

Samuel Williams (SWWilliams@lbl.gov), Charlene Yang, Khaled Ibrahim, Thorsten Kurth, Nan Ding, Jack Deslippe, Leonid Oliker
CRD/NERSC, Lawrence Berkeley National Laboratory

Community Engagement

Introduction

Scaling Trajectories

Roofline on GPUs

Roofline for TensorFlow

Publications

RAPIDS: Resource and Application Productivity through Computation, Information, and Data Science (SciDAC4 Institute)

Integration in Intel Advisor

§ https://crd.lbl.gov/roofline/publications
§ C. Yang, T. Kurth, S. Williams, "Hierarchical Roofline Analysis for GPUs: Accelerating Performance Optimization for the NERSC-9 Perlmutter System", CUG, 2019.
§ C. Yang, S. Williams, "Performance Analysis of GPU-Accelerated Applications using the Roofline Model", GTC, 2019.
§ C. Yang, et al., "An Empirical Roofline Methodology for Quantitatively Assessing Performance Portability", P3HPC, 2018.
§ K. Ibrahim, S. Williams, L. Oliker, "Roofline Scaling Trajectories: A Method for Parallel Application and Architectural Performance Analysis", HPBench, 2018.
§ T. Koskela, et al., "A Novel Multi-Level Integrated Roofline Model Approach for Performance Characterization", ISC, 2018.

#Filters
• Intensity ∝ #Filters
• Low L2 data locality
• Some use of TCs (>FP16 FMA)… partial TC ceiling

Kernel Size
• Intensity ∝ kernel size
• Low L2 data locality
• Autotuner switched FP32 algorithm to FFT at 9x9
• Some use of TCs (>FP16 FMA)… partial TC ceiling

Batch Size
• Constant performance(?)
• FP16 performance anti-correlated with batch size
• Performance << TC peak
• Transformation kernels
• Low L2 locality

#Filters
• Close to FP16 TC peak
• Close to FP32 FMA peak

Kernel Size
• Good FP32 performance trend (almost peak)
• Autotuner chose to run 9x9 FP16 in FP32!!

Batch Size
• Autotuner chose a different (better) algorithm for FP32 with batch size = 64 (boost)
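The observed trends (intensity growing with both #filters and kernel size) follow from a first-order FLOP/byte estimate for dense 2D convolution. The sketch below assumes stride 1, 'same' padding, and that the input, weights, and output each move through memory exactly once; the sizes are hypothetical, not the poster's configurations:

```python
def conv2d_intensity(n, h, w, c_in, c_out, k, bytes_per_elem=2):
    """First-order arithmetic intensity (FLOP/Byte) for a dense 2D conv,
    assuming input, weights, and output each move to/from memory once."""
    flops = 2.0 * n * h * w * c_in * c_out * k * k          # multiply + add
    bytes_moved = bytes_per_elem * (n * h * w * c_in        # input
                                    + k * k * c_in * c_out  # weights
                                    + n * h * w * c_out)    # output
    return flops / bytes_moved

# Intensity rises roughly linearly with #filters and with kernel size,
# consistent with the trends noted above (hypothetical shapes).
base          = conv2d_intensity(16, 64, 64, 64, 64, 3)
more_filters  = conv2d_intensity(16, 64, 64, 64, 256, 3)
bigger_kernel = conv2d_intensity(16, 64, 64, 64, 64, 9)
assert more_filters > base and bigger_kernel > base
```

This is only an upper-bound estimate: real cuDNN kernels add padding/transform sub-kernels and may move data more than once, which is exactly why the measured L2 locality above matters.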

conv2d Forward Pass

conv2d Backward Pass

§ Demonstrate methodology using conv2d from TensorFlow+cuDNN on a V100 GPU
§ Setup…
  input_image = tf.random_uniform(shape=input_size, minval=0., maxval=1., dtype=dtype)
  output_result = conv2d(input_image, 'NHWC', kernel_size, stride_size, dtype)
§ Forward Pass (2D conv)
  exec_op = output_result
§ Backward Pass (2D conv + derivative)
  opt = tf.train.GradientDescentOptimizer(0.5)
  exec_op = opt.compute_gradients(output_result)
§ Each kernel includes multiple sub-kernels
  o Padding, permutations, conversions, compute, etc…
  o Should include all of them when analyzing performance
§ TensorFlow also includes an autotuning step
  o Ignore autotuning when profiling/modeling
  o nvprof --profile-from-start off
  o Run 5 warmup iterations (autotuning / not profiled)
  o Start profiler (pycuda.driver.start_profiler), run 20 iterations, stop profiler
§ Vary parameters to understand performance

§ Developed a Roofline methodology POC for analyzing applications running on NVIDIA GPUs
§ Use NVProf to collect Roofline-related metrics (FLOPs, cache/DRAM data movement, etc…)
§ BerkeleyGW (Materials) https://github.com/cyanguwa/BerkeleyGW-GPP
§ nw increases data reuse in the inner loop
  o More flops for fixed data movement
  o Understand cache effects
  o Quantify effects of FMA:MUL ratio (disable FMA in compiler)
§ Observations…
  o High correlation with HBM BW
  o FMA doesn't hit the FMA ceiling
  o High RF and L2 locality
  o Minimal increases in L1 locality

§ HPGMG (Multigrid) https://bitbucket.org/hpgmg/hpgmg

§ Multiple variants of GSRBsmoother…o GSRB_FP does 2x the work but is

trivial to implemento STRIDE2 requires more complex

memory access and predication

§ Observations…o High correlation with HBM BW for

large problem sizes (level>5)o Moderate L1 cache localityo Low reuse in the L2 cache for

GSRB_FP varianto STRIDE2 performance crashes due

to decline in intensity
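The GSRB_FP vs. STRIDE2 trade-off can be illustrated with a toy red-black Gauss-Seidel sweep. This is a sketch of the idea only, not HPGMG's implementation: STRIDE2 touches only the current color's cells via strided loops, while the FP variant computes the stencil everywhere (2x the flops) and discards half the updates with a 0/1 mask:

```python
import copy
import random

def gsrb_stride2(u, f, h2):
    """Red-black Gauss-Seidel sweep with strided loops: each color pass
    touches only that color's cells (more complex indexing)."""
    n = len(u)
    for color in (0, 1):
        for i in range(1, n - 1):
            for j in range(1 + (i + color) % 2, n - 1, 2):
                u[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                  u[i][j-1] + u[i][j+1] - h2 * f[i][j])
    return u

def gsrb_fp(u, f, h2):
    """GSRB_FP-style sweep: compute the stencil at EVERY interior cell
    (2x the work) and keep only the current color's updates via a 0/1
    mask -- trivial, branch-free inner loop."""
    n = len(u)
    for color in (0, 1):
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                mask = 1.0 if (i + j + color) % 2 == 1 else 0.0
                new = 0.25 * (u[i-1][j] + u[i+1][j] +
                              u[i][j-1] + u[i][j+1] - h2 * f[i][j])
                u[i][j] += mask * (new - u[i][j])
    return u

# Both variants produce identical results; within one color pass no two
# updated cells are neighbors, so in-place updates are safe.
random.seed(0)
n = 8
u = [[random.random() for _ in range(n)] for _ in range(n)]
f = [[random.random() for _ in range(n)] for _ in range(n)]
a = gsrb_stride2(copy.deepcopy(u), f, 1e-2)
b = gsrb_fp(copy.deepcopy(u), f, 1e-2)
assert all(abs(a[i][j] - b[i][j]) < 1e-12 for i in range(n) for j in range(n))
```

The mask variant's extra flops lower its arithmetic intensity, which matches the observation that GSRB_FP shows low L2 reuse while STRIDE2's intensity decline hurts it in a different way.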

[Figure: Roofline scaling-trajectory plots, "roofline_summary_sp_lbl" (single precision). Axes: Arithmetic Intensity (Flops/Byte), 0.01–50, vs. GFlop/s, 0.1–1000. Ceilings: VFMA (1229), ADD at c32 (77) and c1 (9.2), DRAM at c32 (128) and c1 (14.3). Points show NPB Classes A, B, and C at concurrencies c1–c64; a companion plot shows performance vs. #Threads (1–64).]

§ Performance as a function of thread concurrency provides little insight
§ Need a better approach to understand turnovers in performance
§ NAS Parallel Benchmarks
§ Intensity (data movement) varies with concurrency and problem size
§ Large problems (green and red) move more data per thread, and exhaust cache capacity
§ Use Roofline to analyze thread scalability
§ "Roofline Scaling Trajectories"
  o 2D scatter plot of performance as a function of intensity and concurrency
  o Identify loss in performance due to increased cache pressure (data movement)
§ Falling intensity → hit the bandwidth ceiling quickly and degrade
Ø Useful for understanding scaling bottlenecks induced by locality/BW contention
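Each trajectory point is simply (arithmetic intensity, attained GFLOP/s) at one concurrency. A minimal sketch with hypothetical measurements (not the NPB numbers) showing intensity falling as threads exhaust cache, while performance stays pinned under the bandwidth ceiling AI × peak GB/s:

```python
def trajectory_point(total_flops, total_bytes, seconds):
    """One Roofline scaling-trajectory point: (arithmetic intensity in
    FLOP/Byte, attained GFLOP/s) at a given thread concurrency."""
    ai = total_flops / total_bytes
    gflops = total_flops / seconds / 1e9
    return ai, gflops

peak_gbs = 128.0                       # hypothetical DRAM bandwidth ceiling
measurements = [                        # (threads, GFLOPs done, GB moved, seconds)
    (1,  40.0,  8.0, 4.00),
    (8,  40.0, 10.0, 0.70),
    (32, 40.0, 20.0, 0.40),             # more threads -> more data movement
]
points = [trajectory_point(fl * 1e9, by * 1e9, s)
          for _, fl, by, s in measurements]

ais = [ai for ai, _ in points]
assert ais[0] > ais[1] > ais[2]                               # falling intensity
assert all(gf <= ai * peak_gbs + 1e-9 for ai, gf in points)   # under BW ceiling
```

Plotting these points on the Roofline chart, one dot per concurrency, reproduces the trajectory view: a leftward drift in AI explains why performance turns over before the compute ceiling is reached.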

§ Roofline has been integrated into Intel's Advisor Performance Tool…
  ü Automatically instruments applications (one dot per loop nest/function)
  ü Computes FLOPS and AI for each function / loop nest
  ü Integrated cache simulator (hierarchical Roofline)
  ü Automatically benchmarks the target system (calculates ceilings)
  ü AVX-512 support including vector masks
  ü Full integration with existing Advisor capabilities
§ Fully supported on NERSC's Edison and Cori (Haswell and Knights Landing) systems
§ http://www.nersc.gov/users/software/performance-and-debugging-tools/advisor/

% module load advisor/2018.integrated_roofline
% cc -g -dynamic -openmp -O2 -o mycode.exe mycode.c
% source advixe-vars.sh
% advixe-cl -collect survey --project-dir ./your_project -- <your-executable-with-parameters>
% advixe-cl -collect tripcounts -enable-cache-simulation -flop --project-dir ./your_project -- <your-executable-with-parameters>

§ Strong collaboration with NERSC, Intel, and NVIDIA
§ We've run Roofline tutorials at SC'17, SC'18, SC'19, ECP'18, ECP'19, ISC'18, ISC'19, NERSC, etc…

Each DOT represents a loop nest or function in the TARGET APPLICATION (profiled).
Each CEILING represents the peak CPU/Memory throughput of the TARGET PLATFORM (benchmarked).

§ Increasingly, many applications have large, non-floating-point components (e.g. Genomics, Graphs, etc…)
§ Traditional FLOP Roofline is irrelevant (no FLOPs)
§ Advisor Roofline support expanded to include Integer and Integer+FLOP Rooflines

§ Roofline is a throughput-oriented performance model
§ Tracks rates, not times
§ Independent of ISA and architecture
§ Applies to CPUs, GPUs, Google TPUs, FPGAs, etc…
§ Defines "good" performance

§ Arithmetic Intensity is a measure of data locality
  o Ratio of total FLOPs to total bytes
  o Includes cache and prefetcher effects
  o Can be very different from total loads/stores (bytes requested)
  o Equal to the ratio of sustained GFLOP/s to sustained GB/s (time cancels)
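The "time cancels" equivalence is worth making explicit: dividing FLOPs by bytes gives the same number as dividing the two sustained rates, since both are normalized by the same runtime. A quick check with hypothetical totals:

```python
def arithmetic_intensity(total_flops, total_bytes):
    """AI = total FLOPs / total bytes MOVED (below some cache level),
    which can differ greatly from bytes REQUESTED by loads/stores."""
    return total_flops / total_bytes

# Hypothetical measurement: 1e11 FLOPs and 2.5e10 bytes over 2 seconds.
seconds = 2.0
gflop_per_s = (1.0e11 / seconds) / 1e9   # sustained GFLOP/s = 50
gb_per_s    = (2.5e10 / seconds) / 1e9   # sustained GB/s    = 12.5
ai = arithmetic_intensity(1.0e11, 2.5e10)
assert abs(ai - gflop_per_s / gb_per_s) < 1e-12   # both equal 4.0 FLOP/Byte
```

Because of this identity, AI can be read directly off profiler counters (FLOPs and data-movement totals) without any timing information.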

§ Hierarchical Roofline
  o Applies to all levels of the memory hierarchy on both CPUs and GPUs
  o Different data movements for L2/HBM/PCIe imply different arithmetic intensities
  o Differences in L2/HBM/PCIe intensity highlight differences in locality (similar AIs imply streaming)

§ Focus on important loops, kernels, applications, …
  o Loops/kernels/apps attaining better than 50% of Roofline will see limited benefit from optimization
  o Users can use Roofline to identify underperforming loops/kernels/apps

[Figure: three Roofline diagrams, each plotting Attainable FLOP/s vs. Arithmetic Intensity (FLOP:Byte). (1) Classic Roofline: Peak FLOP/s ceiling and DRAM GB/s slope; DRAM-bound to the left, compute-bound to the right; transition at AI == Peak GFLOP/s / Peak GB/s == "machine balance". (2) Sub-ceilings: Peak FLOP/s with a 50%-of-peak line, and HBM GB/s with a 50%-of-STREAM line. (3) Hierarchical Roofline: Peak FLOP/s with L2, HBM, and PCIe GB/s slopes; a kernel is PCIe-bound when PCIe AI×BW < HBM AI×BW.]
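All of the diagrams above reduce to one formula: attainable performance is the minimum of peak compute and AI × bandwidth at every level of the hierarchy. A sketch with illustrative (not measured) ceilings; the numbers below are assumptions, not V100 specifications:

```python
def attainable_gflops(peak_gflops, ai_bw_pairs):
    """Hierarchical Roofline: attainable GFLOP/s is capped by peak compute
    AND by AI * bandwidth at every memory level."""
    return min([peak_gflops] + [ai * bw for ai, bw in ai_bw_pairs])

peak = 7000.0                 # hypothetical FP32 peak, GFLOP/s
levels = [(10.0, 2000.0),     # (AI w.r.t. L2,  L2 GB/s)  -> cap 20000
          (5.0,   800.0),     # (AI w.r.t. HBM, HBM GB/s) -> cap  4000
          (50.0,   16.0)]     # (AI w.r.t. PCIe, PCIe GB/s)-> cap   800
perf = attainable_gflops(peak, levels)
assert perf == 800.0          # PCIe AI*BW < HBM AI*BW -> PCIe bound

# Machine balance: the AI at which the bandwidth slope meets peak compute.
hbm_bw = 800.0
machine_balance = peak / hbm_bw
assert machine_balance == 8.75

# The 50% guideline: a kernel already above half its roofline has
# limited headroom from further optimization.
measured = 500.0
assert measured / perf > 0.5
```

The per-level caps are exactly the three slopes in the diagrams; whichever cap is lowest names the bottleneck (compute-, HBM-, or PCIe-bound).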