“Simple” performance modeling: The Roofline Model


Transcript of “Simple” performance modeling: The Roofline Model

Page 1: “Simple” performance modeling: The Roofline Model

“Simple” performance modeling:

The Roofline Model

Loop-based performance modeling: Execution vs. data transfer

Page 2: “Simple” performance modeling: The Roofline Model


Prelude: Modeling customer dispatch in a bank

[Figure: customers entering a bank through a revolving door and being served at desks]
Revolving door throughput: bS [customers/sec]
Intensity: I [tasks/customer]
Processing capability: Pmax [tasks/sec]

Page 3: “Simple” performance modeling: The Roofline Model


Prelude: Modeling customer dispatch in a bank

How fast can tasks be processed? P [tasks/sec]

The bottleneck is either
the service desks (max. tasks/sec): Pmax
or the revolving door (max. customers/sec): I · bS

This is the “Roofline Model”
High intensity: P limited by “execution”
Low intensity: P limited by the “bottleneck” (the revolving door)
“Knee” at Pmax = I · bS: best use of resources
Roofline is an “optimistic” model (“light speed”)

P = min(Pmax, I · bS)

[Figure: roofline diagram – Performance vs. Intensity, with a bandwidth slope I · bS and a flat ceiling at Pmax]

Page 4: “Simple” performance modeling: The Roofline Model


The Roofline Model – Basics

Apply the bank example to the performance of compute devices:
Maximum processing capability → peak performance Ppeak [F/s]
Rate of the revolving door → memory bandwidth bS [B/s]
Workload per customer → computational intensity I [F/B]

Machine model: Ppeak = 4 GF/s, bS = 10 GB/s

Application model (sum of squares, double precision):

double s=0, a[N];
for(int i=0; i<N; ++i) {
  s = s + a[i] * a[i];
}

I = 2 F / 8 B = 0.25 F/B
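To make the arithmetic concrete, here is a minimal C sketch (mine, not part of the slides) that evaluates the roofline prediction for exactly this machine and application model; the variable names are my own:

#include <stdio.h>

int main(void) {
    double Ppeak = 4.0e9;   /* peak performance        [Flop/s]    */
    double bS    = 10.0e9;  /* memory bandwidth        [Byte/s]    */
    double I     = 0.25;    /* computational intensity [Flop/Byte] */

    double mem = I * bS;                      /* bandwidth ceiling      */
    double P   = mem < Ppeak ? mem : Ppeak;   /* P = min(Ppeak, I * bS) */

    printf("Predicted performance: %.2f GFlop/s\n", P / 1e9);  /* 2.50 */
    return 0;
}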

Page 5: “Simple” performance modeling: The Roofline Model


The Roofline Model – Basics

Compare capabilities of different machines

The Roofline Model always provides an upper bound – but is it realistic?

If the code cannot reach this limit (e.g., it contains only add operations), the machine parameters need to be redefined (e.g., Ppeak → Ppeak/2).

Assuming double precision – for single precision: Ppeak → 2 × Ppeak

Page 6: “Simple” performance modeling: The Roofline Model


The Roofline Model – refined

D. Callahan et al.: Estimating interlock and improving balance for pipelined architectures. Journal of Parallel and Distributed Computing 5(4), 334 (1988). DOI: 10.1016/0743-7315(88)90002-0
W. Schönauer: Scientific Supercomputing: Architecture and Use of Shared and Distributed Memory Parallel Computers. Self-edition (2000)
S. Williams: Auto-tuning Performance on Multicore Computers. UCB Technical Report No. UCB/EECS-2008-164. PhD thesis (2008)

1. Pmax = applicable peak performance of a loop, assuming that data comes from the level-1 cache (this is not necessarily Ppeak)
   e.g., Pmax = 176 GFlop/s
2. I = computational intensity (“work” per byte transferred) over the slowest data path utilized (code balance BC = 1/I)
   e.g., I = 0.167 Flop/Byte ⇒ BC = 6 Byte/Flop
3. bS = applicable (saturated) peak bandwidth of the slowest data path utilized [Byte/s]
   e.g., bS = 56 GByte/s

Expected performance:
P = min(Pmax, I · bS) = min(Pmax, bS/BC)   with BC in Byte/Flop

Page 7: “Simple” performance modeling: The Roofline Model


Preliminary: Estimating Pmax

Remember the Haswell port scheduler model?

[Figure: Haswell port scheduler model – ports 0/1: ALU + FMA (port 0 also FMUL and JUMP), port 5: ALU + FSHUF, port 6: ALU + JUMP, ports 2/3: LOAD + AGU (32 byte each), port 4: STORE (32 byte), port 7: AGU; retire 4 uops per cycle]

Page 8: “Simple” performance modeling: The Roofline Model


Example: Estimate Pmax of vector triad on Haswell

double *A, *B, *C, *D;

for (int i=0; i<N; i++) {

A[i] = B[i] + C[i] * D[i];

}

Minimum number of cycles to process one AVX-vectorized iteration (one core)?
One AVX iteration is equivalent to 4 scalar iterations.

Cycle 1: LOAD + LOAD + STORE
Cycle 2: LOAD + LOAD + FMA + FMA
Cycle 3: LOAD + LOAD + STORE

Three cycles accommodate the loads, stores, and FMAs of two AVX iterations, so the answer is 1.5 cycles per AVX iteration.

Page 9: “Simple” performance modeling: The Roofline Model


Example: Estimate Pmax of vector triad on Haswell (2.3 GHz)

double *A, *B, *C, *D;

for (int i=0; i<N; i++) {

A[i] = B[i] + C[i] * D[i];

}

What is the performance in GFlop/s per core and the bandwidth in GByte/s?

One AVX iteration (1.5 cycles) does 4 × 2 = 8 flops:

(2.3 x 10^9 cy/s) / (1.5 cy) × 4 updates × 2 flops/update = 12.27 Gflop/s

6.13 x 10^9 updates/s × 32 bytes/update = 196 Gbyte/s
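For reference, here is a minimal C sketch (mine, not part of the slides) that turns the 1.5-cycle estimate into these GFlop/s and GByte/s numbers; the 32 bytes/update (3 loads + 1 store of 8 B each) is taken from the slide:

#include <stdio.h>

int main(void) {
    double clock = 2.3e9;                 /* core clock [cy/s]            */
    double cy_per_avx_iter = 1.5;         /* from the port analysis above */
    double updates_per_avx_iter = 4.0;    /* one AVX iteration = 4 DP updates */
    double flops_per_update = 2.0;        /* one FMA: 1 ADD + 1 MULT      */
    double bytes_per_update = 32.0;       /* 3 loads + 1 store, 8 B each  */

    double updates_per_s = clock / cy_per_avx_iter * updates_per_avx_iter; /* 6.13e9 */
    printf("Performance: %.2f Gflop/s\n", updates_per_s * flops_per_update / 1e9); /* 12.27 */
    printf("Bandwidth:   %.0f Gbyte/s\n", updates_per_s * bytes_per_update / 1e9); /* 196 */
    return 0;
}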

Page 10: “Simple” performance modeling: The Roofline Model


Pmax + bandwidth limitations: The vector triad

Example: Vector triad A(:)=B(:)+C(:)*D(:)

on a 2.3 GHz 14-core Haswell chip (AVX vectorized)

Peak memory bandwidth: bS = 50 GB/s

Memory code balance:
BC = (4+1) words / 2 flops = 20 B/F (3 loads + 1 store, plus 1 write-allocate for A)
I = 0.05 F/B
I · bS = 2.5 GF/s (0.5% of peak performance)

Ppeak = 515.2 Gflop/s (14 cores × (8+8) flops/cy × 2.3 GHz)
Pmax = 14 × 12.27 Gflop/s = 172 Gflop/s (33% of peak)

P = min(Pmax, I · bS) = min(172, 2.5) GFlop/s = 2.5 GFlop/s

Page 11: “Simple” performance modeling: The Roofline Model


Code balance: more examples

double a[N], b[N];
for(int i=0; i<N; ++i) {
  a[i] = a[i] + b[i];
}
BC = 24 B / 1 F = 24 B/F, I = 0.042 F/B

double a[N], b[N], s;
for(int i=0; i<N; ++i) {
  a[i] = a[i] + s * b[i];
}
BC = 24 B / 2 F = 12 B/F, I = 0.083 F/B
(scalar s can be kept in a register)

float s=0, a[N];
for(int i=0; i<N; ++i) {
  s = s + a[i] * a[i];
}
BC = 4 B / 2 F = 2 B/F, I = 0.5 F/B
(scalar s can be kept in a register)

float s=0, a[N], b[N];
for(int i=0; i<N; ++i) {
  s = s + a[i] * b[i];
}
BC = 8 B / 2 F = 4 B/F, I = 0.25 F/B
(scalar s can be kept in a register)
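As a cross-check of these numbers, here is a small C sketch (my own helper, not from the slides) that computes code balance and intensity from per-iteration load/store/flop counts and reproduces the four cases above:

#include <stdio.h>

/* Code balance BC = bytes moved per flop; intensity I = 1/BC. */
static void balance(const char *name, int loads, int stores,
                    int wordsize, int flops) {
    double bytes = (double)(loads + stores) * wordsize;
    printf("%-20s BC = %4.1f B/F, I = %.3f F/B\n",
           name, bytes / flops, flops / bytes);
}

int main(void) {
    /* a[i] = a[i] + b[i]   : 2 loads + 1 store, double, 1 flop  */
    balance("vector add (DP)", 2, 1, 8, 1);
    /* a[i] = a[i] + s*b[i] : 2 loads + 1 store, double, 2 flops */
    balance("daxpy-like (DP)", 2, 1, 8, 2);
    /* s += a[i]*a[i]       : 1 load, float, 2 flops             */
    balance("sum of squares (SP)", 1, 0, 4, 2);
    /* s += a[i]*b[i]       : 2 loads, float, 2 flops            */
    balance("dot product (SP)", 2, 0, 4, 2);
    return 0;
}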

Page 12: “Simple” performance modeling: The Roofline Model


Machine balance for hardware characterization

For quick comparisons the concept of machine balance is useful

Machine balance = how much input data can be delivered for each FP operation? (“Memory Gap characterization”)
Assuming balanced MULT/ADD
Rough estimate: Bm ≪ BC ⇒ strongly memory-bound code

Typical values (main memory):

Intel Haswell 14-core 2.3 GHz: Bm = 60 GB/s / (14 × 2.3 × 16) GF/s ≈ 0.12 B/F

Intel Sandy Bridge 8-core 2.7 GHz ≈ 0.23 B/F

Nvidia K20x ≈ 0.14 B/F

Intel Xeon Phi Knights Corner ≈ 0.16 B/F

Bm = bS / Ppeak
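A short C sketch of this comparison (mine, not from the slides): it computes Bm for the Haswell chip quoted above and compares it with the code balance of the vector triad (BC = 20 B/F from the earlier slide) to classify the kernel:

#include <stdio.h>

int main(void) {
    /* Intel Haswell 14-core 2.3 GHz, numbers from the slide */
    double bS    = 60e9;                 /* memory bandwidth [B/s]        */
    double Ppeak = 14 * 2.3e9 * 16;      /* 14 cores x 16 flops/cy x f    */

    double Bm = bS / Ppeak;              /* machine balance [B/F] ~ 0.12  */
    double Bc = 20.0;                    /* vector triad code balance [B/F] */

    printf("Bm = %.2f B/F, BC = %.1f B/F\n", Bm, Bc);
    if (Bm < Bc)                         /* Bm << BC -> memory bound      */
        printf("Bm << BC: strongly memory-bound kernel\n");
    return 0;
}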

Page 13: “Simple” performance modeling: The Roofline Model


A not so simple Roofline example

Example: do i=1,N; s=s+a(i); enddo

in single precision on a 2.2 GHz Sandy Bridge socket @ “large” N

[Figure: roofline diagram for this loop, P = min(Pmax, I · bS) with I = 1 flop / 4 byte (SP!). In-core ceilings: machine peak (ADD+MULT) 282 GF/s – out of reach for this code; ADD peak (best possible code) 141 GF/s; no SIMD 17.6 GF/s; 5.9 GF/s due to the 3-cycle latency per ADD if not unrolled. P (worst loop code) and P (better loop code) mark where these ceilings meet the bandwidth roof. See the first part of the lecture on how to get these numbers.]

Page 14: “Simple” performance modeling: The Roofline Model


Input to the roofline model

… on the example of do i=1,N; s=s+a(i); enddo

in single precision

Code analysis: 1 ADD + 1 LOAD per iteration
Architecture: throughput 1 ADD + 1 LD/cy; pipeline depth 3 cy (ADD); 8-way SIMD, 8 cores ⇒ Pmax = 5.9 … 141 GF/s
Measurement: maximum memory bandwidth 40 GB/s ⇒ I · bS = 10 GF/s

Worst code: P = 5.9 GF/s (core bound)
Better code: P = 10 GF/s (memory bound)
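The in-core ceilings quoted on the previous page follow from these architectural parameters; a small C sketch (mine, not from the slides) of that arithmetic, under the stated assumptions (2.2 GHz, 8 cores, 8-way SP SIMD, 1 ADD/cy, 3-cycle ADD latency, bS = 40 GB/s, I = 0.25 F/B):

#include <stdio.h>

int main(void) {
    double clock = 2.2e9;                /* Hz */
    int    cores = 8, simd = 8;          /* 8-way SIMD in single precision */

    double add_peak     = cores * simd * clock;  /* 1 ADD per cycle  -> 140.8 GF/s */
    double no_simd      = cores * clock;         /* scalar ADDs      ->  17.6 GF/s */
    double not_unrolled = no_simd / 3.0;         /* 3-cy ADD latency ->   5.9 GF/s */
    double mem_limit    = 0.25 * 40e9;           /* I * bS           ->  10.0 GF/s */

    printf("ADD peak    : %5.1f GF/s\n", add_peak / 1e9);
    printf("no SIMD     : %5.1f GF/s\n", no_simd / 1e9);
    printf("worst code  : %5.1f GF/s (core bound, below I*bS)\n", not_unrolled / 1e9);
    printf("memory limit: %5.1f GF/s (better code ends up here)\n", mem_limit / 1e9);
    return 0;
}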

Page 15: “Simple” performance modeling: The Roofline Model


Prerequisites for the Roofline Model

The roofline formalism is based on some (crucial) assumptions:
There is a clear concept of “work” vs. “traffic”:
“work” = flops, updates, iterations…
“traffic” = data required to do the “work”
The attainable bandwidth of the code is an input parameter! Determine the effective bandwidth via simple streaming benchmarks to model more complex kernels and applications.
Data transfer and core execution overlap perfectly! Either the limit is core execution or it is data transfer.
The slowest limiting factor “wins”; all others are assumed to have no impact.
Latency effects are ignored, i.e., perfect streaming mode.
One must be able to saturate the bandwidth: full socket/node scenarios work best.

Page 16: “Simple” performance modeling: The Roofline Model


Factors to consider in the roofline model

Bandwidth-bound (simple case):
Accurate traffic calculation (write-allocate, strided access, …)
Practical ≠ theoretical BW limits
Erratic access patterns

Core-bound (may be complex):
Multiple bottlenecks: LD/ST, arithmetic, pipelines, SIMD, execution ports
Limit is linear in the number of cores

Page 17: “Simple” performance modeling: The Roofline Model


Complexities of in-core execution

Multiple bottlenecks:
Decode/retirement throughput
Port contention (direct or indirect)
Arithmetic pipeline stalls (dependencies)
Overall pipeline stalls (branching)
L1 Dcache bandwidth (LD/ST throughput)
Scalar vs. SIMD execution
L1 Icache bandwidth
Alignment issues

Page 18: “Simple” performance modeling: The Roofline Model


Typical code optimizations in the Roofline Model

1. Hit the BW bottleneck by good serial code (e.g., “ninja” C++ → Fortran)
2. Increase intensity to make better use of the BW bottleneck (e.g., loop blocking [see later])
3. Increase intensity and go from memory-bound to core-bound (e.g., temporal blocking)
4. Hit the core bottleneck by good serial code (e.g., -fno-alias [see later])
5. Shift Pmax by accessing additional hardware features or using a different algorithm/implementation (e.g., scalar → SIMD)

Page 19: “Simple” performance modeling: The Roofline Model


Shortcomings of the roofline model

Saturation effects in multicore chips are not explained
Reason: the “saturation assumption”
Cache-line transfers and core execution sometimes do not overlap perfectly
It is not sufficient to measure single-core STREAM bandwidth to make the model work
Only increased “pressure” on the memory interface can saturate the bus ⇒ need more cores!
In-cache performance is not correctly predicted
The ECM performance model gives more insight (example: A(:)=B(:)+C(:)*D(:)):

G. Hager, J. Treibig, J. Habich, and G. Wellein: Exploring performance and power properties of modern multicore chips via simple machine models. Concurrency and Computation: Practice and Experience (2013). DOI: 10.1002/cpe.3180, Preprint: arXiv:1208.2908

Page 20: “Simple” performance modeling: The Roofline Model


Interlude: Hardware Performance Counters


Page 21: “Simple” performance modeling: The Roofline Model


Monitoring “live” jobs on a cluster in the Roofline diagram

[Figure: measured cluster jobs plotted in a roofline diagram]
Where are the “good” and the “bad” jobs in this diagram?

Page 22: “Simple” performance modeling: The Roofline Model

Roofline Case Studies

Dense matrix-vector multiplication (dMVM)

Sparse matrix-vector multiplication (spMVM)

Jacobi smoother

Page 23: “Simple” performance modeling: The Roofline Model

Case Study: dMVM

Page 24: “Simple” performance modeling: The Roofline Model


Dense matrix-vector multiplication in DP (AVX)

Assume C = R ≈ 10,000

Applicable peak performance?

Relevant data path?

Computational Intensity?

do c = 1 , C

do r = 1 , R

y(r)=y(r) + A(r,c)* x(c)

enddo

enddo


do c = 1 , C

tmp=x(c)

do r = 1 , R

y(r)=y(r) + A(r,c)* tmp

enddo

enddo

Page 25: “Simple” performance modeling: The Roofline Model


DMVM (DP) – AVX vectorization
Vectorization strategy: 4-way inner loop unrolling
One register holds tmp in each of its 4 entries (“broadcast”)
The loop kernel requires/consumes 3 AVX registers

do c = 1,C

tmp=x(c)

do r = 1,R,4 ! R is multiple of 4

y(r) = y(r) + A(r,c) * tmp

y(r+1) = y(r+1) + A(r+1,c)* tmp

y(r+2) = y(r+2) + A(r+2,c)* tmp

y(r+3) = y(r+3) + A(r+3,c)* tmp

enddo

enddo

Page 26: “Simple” performance modeling: The Roofline Model


DMVM (DP) – Single core performance vs. column height

[Figure: single-core DMVM performance vs. number of rows R]
Intel Xeon E5 2695 v3 (Haswell-EP), 2.3 GHz, CoD mode, core Ppeak = 18.4 GF/s; caches: 32 KB / 256 KB / 35 MB; page size: 2 MB; ifort V15.0.1.133; bS = 32 Gbyte/s

Performance drops as the number of rows (inner loop length) increases.
Does the computational intensity change?

Page 27: “Simple” performance modeling: The Roofline Model


DMVM data traffic analysis

do c = 1 , C
  tmp=x(c)
  do r = 1 , R
    y(r)=y(r) + A(r,c)* tmp
  enddo
enddo

[Figure: y(1:R) = y(1:R) + A(1:R,1:C) * x(1:C); column-wise traversal of A(r,c)]

y(:) is loaded and stored in each outer iteration; for c>1, y(:) is updated in cache
A(:,:) is loaded from memory – no data reuse
y(:) may not fit in the innermost cache ⇒ more traffic from lower-level caches for larger R
tmp stays in a register during the inner loop

Roofline analysis: distinguish the code balance in memory (BC(mem)) from the code balance in the relevant cache level(s) (BC(L3), BC(L2), …)!

Page 28: “Simple” performance modeling: The Roofline Model


DMVM (DP) – Single core data traffic analysis

[Figure: likwid-perfctr data-traffic measurements per cache level vs. R, with markers at size(y(1:R)) = 16 kB and 160 kB. Where y exceeds an inner cache, the next level sees an extra (8+8) Byte per iteration for LD + ST on y; values shown in the figure: BC(L2) = 24B/2F, BC(L3) = 8B/2F and 24B/2F, BC(mem) = 24B/2F.]

Page 29: “Simple” performance modeling: The Roofline Model


Reducing traffic by blocking

Original loop:
do c = 1 , C
  tmp=x(c)
  do r = 1 , R
    y(r)=y(r) + A(r,c)* tmp
  enddo
enddo

Blocked loop:
do rb = 1 , R , Rb
  rbS = rb
  rbE = min((rb+Rb-1), R)
  do c = 1 , C
    do r = rbS , rbE
      y(r)=y(r) + A(r,c)*x(c)
    enddo
  enddo
enddo

Without blocking, y(:) may not fit into some cache ⇒ more traffic in the lower level.
With blocking, y(rbS:rbE) may fit into some cache if Rb is small enough ⇒ traffic reduction.

Page 30: “Simple” performance modeling: The Roofline Model


Reducing traffic by blocking

The LHS y(:) is updated only once in some cache level if blocking is applied.
Price: the RHS x(:) is loaded multiple times instead of once! How often? R / Rb times.
Consequence: traffic reduction of LHS & RHS by a factor of R / (Rb × 2C).
Still a large reduction if the block size can be made larger than about 10.

[Figure: blocked update y(rbS:rbE) = y(rbS:rbE) + A(rbS:rbE,:) * x(:), block height Rb]

Page 31: “Simple” performance modeling: The Roofline Model


DMVM (DP) – Reducing traffic by inner loop blocking

“1D blocking” for the inner loop
The blocking factor Rb is chosen to match a cache level
Fully reuse the subset y(rbS:rbE) from the L1/L2 cache


do rb = 1 , R , Rb

rbS = rb

rbE = min((rb+Rb-1), R)

do c = 1 , C

do r = rbS , rbE

y(r)=y(r) + A(r,c)*x(c)

enddo

enddo

enddo

[Annotations: Rb chosen for L2 cache blocking or L1 cache blocking]

Page 32: “Simple” performance modeling: The Roofline Model


DMVM (DP) – Validation of blocking optimization

[Figure: measured single-core performance with inner-loop blocking, Rb = 2000]

Page 33: “Simple” performance modeling: The Roofline Model


DMVM (DP) – OpenMP parallelization

Plain code:
!$omp parallel do reduction(+:y)
do c = 1 , C
  do r = 1 , R
    y(r) = y(r) + A(r,c) * x(c)
  enddo ; enddo
!$omp end parallel do

Blocked code:
!$omp parallel do private(rbS,rbE)
do rb = 1 , R , Rb
  rbS = rb
  rbE = min((rb+Rb-1), R)
  do c = 1 , C
    do r = rbS , rbE
      y(r) = y(r) + A(r,c) * x(c)
    enddo ; enddo ; enddo
!$omp end parallel do

Page 34: “Simple” performance modeling: The Roofline Model


DMVM (DP) – OpenMP parallelization & saturation

[Figure: performance vs. number of cores for plain and blocked code; Intel Xeon E5 2695 v3 (Haswell-EP), 2.3 GHz base clock speed, bS = 32 GB/s]

Blocking is good for single-thread performance (reduced in-cache traffic)
Memory traffic is unchanged ⇒ saturation level unchanged!
The saturation point is influenced by serial performance
Roofline limit: BC = 4 Byte/Flop, bS = 32 GB/s ⇒ bS/BC = 8 GF/s

So, is blocking useless? NO (see later)
Can we do anything to improve BC(mem)? NO, not here

Page 35: “Simple” performance modeling: The Roofline Model


DMVM (DP) – saturation & clock speed

[Figure: performance vs. number of cores at different clock speeds; Intel Xeon E5 2695 v3 (Haswell-EP), 2.3 GHz base clock speed, bS = 32 GB/s]

Clock speed has a high impact on single-thread performance
Even the worst code almost saturates the bandwidth at 1.3 GHz
The saturation point is influenced by clock speed and serial performance
Clock speed or parallelism can “heal” slow code!

Page 36: “Simple” performance modeling: The Roofline Model


Conclusions from the dMVM example

We have found the reasons for the breakdown of single-core performance with a growing number of matrix rows:
the LHS vector fits into different levels of the cache hierarchy
validated by performance counter measurements
Inner loop blocking was employed to improve the code balance in L3 and/or L2
validated by performance counter measurements
Blocking led to better single-threaded performance
Saturated performance is unchanged (as predicted by Roofline), because the problem is still small enough to fit the LHS at least into the L3 cache

Page 37: “Simple” performance modeling: The Roofline Model

Case study: A Jacobi smoother

The basics in two dimensions

Layer conditions

Optimization by spatial blocking

Page 38: “Simple” performance modeling: The Roofline Model


Stencil schemes

Stencil schemes frequently occur in PDE solvers on regular lattice structures.
Basically, a stencil sweep is a sparse matrix-vector multiply (spMVM) embedded in an iterative scheme (outer loop), but the regular access structure allows for matrix-free coding.
Complexity of implementation and performance depend on
the stencil operator, e.g., Jacobi-type, Gauss-Seidel-type, …
the spatial extent, e.g., 7-pt or 25-pt in 3D, …

do iter = 1, max_iterations
  Perform sweep over the regular grid: y(:) ← x(:)
  Swap y ↔ x
enddo

Page 39: “Simple” performance modeling: The Roofline Model


Jacobi-type 5-pt stencil in 2D


do k=1,kmax

do j=1,jmax

y(j,k) = const * ( x(j-1,k) + x(j+1,k) &

+ x(j,k-1) + x(j,k+1) )

enddo

enddo

[Figure: sweep over the j-k grid; y(0:jmax+1,0:kmax+1) ← x(0:jmax+1,0:kmax+1); one Lattice Update (LUP) per grid point]

Appropriate performance metric: “Lattice Updates per second” [LUP/s] (here: multiply by 4 FLOP/LUP to get the FLOP/s rate)

Page 40: “Simple” performance modeling: The Roofline Model


Jacobi 5-pt stencil in 2D: data transfer analysis


do k=1,kmax

do j=1,jmax

y(j,k) = const * ( x(j-1,k) + x(j+1,k) &

+ x(j,k-1) + x(j,k+1) )

enddo

enddo

Per lattice update (naive analysis, incl. write-allocate):
y(j,k): 1 LD + 1 ST (the LD is the write-allocate)
x(j+1,k), x(j,k-1), x(j,k+1): 3 LD
x(j-1,k): available in cache (used 2 updates before)

Naive balance: x(:,:): 3 LD + y(:,:): 1 ST + 1 LD
⇒ BC = 5 words / LUP = 40 B / LUP (assuming double precision)

Page 41: “Simple” performance modeling: The Roofline Model


Jacobi 5-pt stencil in 2D: Single core performance

[Figure: single-core performance vs. jmax for jmax=kmax and jmax*kmax = const; L3 cache size marked; code balance (BC) measured with likwid-perfctr: ~24 B/LUP for small jmax, ~40 B/LUP for large jmax. Intel Xeon E5-2690 v2 (“IvyBridge” @ 3 GHz), Intel Compiler ifort V13.1]

Questions:
1. How to achieve 24 B/LUP also for large jmax?
2. How to sustain >600 MLUP/s for jmax > 10^4?

Page 42: “Simple” performance modeling: The Roofline Model

Case study: A Jacobi smoother

The basics in two dimensions

Layer conditions

Optimization by spatial blocking

Page 43: “Simple” performance modeling: The Roofline Model


Analyzing the data flow

Worst case: the cache is not large enough to hold 3 layers (rows) of the grid (assume a “Least Recently Used” replacement strategy).

[Figure: j-k grid of x(0:jmax+1,0:kmax+1) with halo cells; only the most recently touched elements are cached]

Page 44: “Simple” performance modeling: The Roofline Model


Analyzing the data flow

Worst case (continued): cache not large enough to hold 3 layers (rows) of the grid.

[Figure: x(0:jmax+1,0:kmax+1), next step of the sweep]

Page 45: “Simple” performance modeling: The Roofline Model


Analyzing the data flow

Reduce the inner (j-) loop dimension successively.
Best case: 3 “layers” of the grid fit into the cache!

[Figure: grids with successively smaller inner dimension, x(0:jmax1+1,0:kmax+1) and x(0:jmax2+1,0:kmax+1)]

Page 46: “Simple” performance modeling: The Roofline Model


Analyzing the data flow: Layer condition

2D 5-pt Jacobi-type stencil

do k=1,kmax
  do j=1,jmax
    y(j,k) = const * (x(j-1,k) + x(j+1,k) &
                    + x(j,k-1) + x(j,k+1) )
  enddo
enddo

“Layer condition”: 3 * jmax * 8B < CacheSize/2
(3 rows of length jmax, 8 B per double-precision element; the factor 1/2 is a safety margin, rule of thumb)

The layer condition:
• does not depend on the outer loop length (kmax)
• is no strict guideline (cache associativity and the data traffic for y are not included)
• needs to be adapted for other stencils (e.g., 3D 7-pt stencil)

Page 47: “Simple” performance modeling: The Roofline Model


Analyzing the data flow: Layer condition (2D 5-pt Jacobi)

Layer condition 3 * jmax * 8B < CacheSize/2 — fulfilled?

do k=1,kmax
  do j=1,jmax
    y(j,k) = const * (x(j-1,k) + x(j+1,k) &
                    + x(j,k-1) + x(j,k+1) )
  enddo
enddo

YES: x: 1 LD / LUP, y: (1 LD + 1 ST) / LUP ⇒ BC = 24 B / LUP
NO:  x: 3 LD / LUP, y: (1 LD + 1 ST) / LUP ⇒ BC = 40 B / LUP

Page 48: “Simple” performance modeling: The Roofline Model


Analyzing the data flow: Layer condition (2D 5-pt Jacobi)

Goal: establish the layer condition for all domain sizes.
Idea: spatial blocking
Reuse elements of x() as long as they stay in the cache.
The sweep can be executed in any order, e.g., compute blocks in j-direction.
“Spatial blocking” of the j-loop: determine an appropriate jblock value for a given CacheSize:

do jb=1,jmax,jblock ! Assume jmax is a multiple of jblock
  do k=1,kmax
    do j= jb, (jb+jblock-1) ! Length of inner loop: jblock
      y(j,k) = const * (x(j-1,k) + x(j+1,k) &
                      + x(j,k-1) + x(j,k+1) )
    enddo
  enddo
enddo

New layer condition (blocking): 3 * jblock * 8B < CacheSize/2
⇒ jblock < CacheSize / 48 B
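A minimal C sketch (mine, not from the slides) of this blocking-factor choice; the cache sizes are those of the Ivy Bridge test system used later, and the slides quote 5333 and 533333, i.e. a slightly different KB/MB convention:

#include <stdio.h>

/* Largest jblock satisfying the blocked layer condition
   3 * jblock * 8B < CacheSize/2, i.e. jblock < CacheSize / 48B. */
static long jblock_max(long cache_bytes) {
    return cache_bytes / 48;
}

int main(void) {
    printf("L2 (256 KB): jblock < %ld\n", jblock_max(256L * 1024));       /* ~5461   */
    printf("L3 ( 25 MB): jblock < %ld\n", jblock_max(25L * 1024 * 1024)); /* ~546133 */
    return 0;
}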

Page 49: “Simple” performance modeling: The Roofline Model


Establish the layer condition by blocking

[Figure: domain split into subblocks along the j-direction, e.g., block size = 5]

Page 50: “Simple” performance modeling: The Roofline Model


Establish the layer condition by blocking

Additional data transfers (overhead) at block boundaries!

Page 51: “Simple” performance modeling: The Roofline Model


Establish layer condition by spatial blocking

[Figure: performance vs. jmax (jmax=kmax and jmax*kmax = const) with L2- and L3-blocked variants; Intel Xeon E5-2690 v2 (“IvyBridge” @ 3 GHz), Intel Compiler ifort V13.1; cache sizes L1: 32 KB, L2: 256 KB, L3: 25 MB]

Which cache to block for? jblock < CacheSize / 48 B:
L2 (CS = 256 KB): jblock = min(jmax, 5333)
L3 (CS = 25 MB): jblock = min(jmax, 533333)

Page 52: “Simple” performance modeling: The Roofline Model


Validating the model: Memory code balance

[Figure: measured main-memory code balance (BC) vs. jmax – 24 B/LUP for the blocked variants, 40 B/LUP for the unblocked code at large jmax; Intel Xeon E5-2690 v2 (“IvyBridge” @ 3 GHz), Intel Compiler ifort V13.1]

The blocking factor for CS = 25 MB is still a little too large.
Main memory access is not the reason for the different performance (but L3 access is!)

Page 53: “Simple” performance modeling: The Roofline Model


Validating the model: L3 cache balance

[Figure: measured L3-cache code balance (BC) vs. jmax – 24 B/LUP vs. 40 B/LUP, depending on whether y is served from L1/L2 or from L3/main memory; Intel Xeon E5-2690 v2 (“IvyBridge” @ 3 GHz), Intel Compiler ifort V13.1]

Data accesses to the L3 cache explain the performance difference (blocking): impact of total L3 traffic, 24 B/LUP vs. 40 B/LUP.

Page 54: “Simple” performance modeling: The Roofline Model


From 2D to 3D

Towards a 3D understanding: the 2D picture can be considered as a 2D cut of a 3D domain at a (new) fixed i-coordinate:
x(0:jmax+1,0:kmax+1) → x(i, 0:jmax+1,0:kmax+1)

[Figure: j-k plane of x(0:jmax+1,0:kmax+1)]

Page 55: “Simple” performance modeling: The Roofline Model


From 2D to 3D

x(0:imax+1, 0:jmax+1, 0:kmax+1) – assume the i-direction is contiguous in main memory (Fortran notation).
Stay with the 2D picture and consider one cell of the j-k plane as a contiguous slab of elements in i-direction: x(0:imax+1,j,k).

[Figure: j-k plane; each cell represents a contiguous line of elements along i]

Page 56: “Simple” performance modeling: The Roofline Model


Layer condition: From 2D 5-pt to 3D 7-pt Jacobi-type stencil

2D:
Layer condition: 3 * jmax * 8B < CacheSize/2 ⇒ BC = 24 B / LUP
do k=1,kmax
  do j=1,jmax
    y(j,k) = const * (x(j-1,k) + x(j+1,k) &
                    + x(j,k-1) + x(j,k+1) )
  enddo
enddo

3D:
Layer condition: 3 * jmax * imax * 8B < CacheSize/2 ⇒ BC = 24 B / LUP
do k=1,kmax
  do j=1,jmax
    do i=1,imax
      y(i,j,k) = const * (x(i-1,j,k) + x(i+1,j,k) &
                        + x(i,j-1,k) + x(i,j+1,k) &
                        + x(i,j,k-1) + x(i,j,k+1) )
    enddo
  enddo
enddo

Page 57: “Simple” performance modeling: The Roofline Model


3D 7-pt Jacobi-type Stencil (sequential)

do k=1,kmax
  do j=1,jmax
    do i=1,imax
      y(i,j,k) = const * (x(i-1,j,k) + x(i+1,j,k) &
                        + x(i,j-1,k) + x(i,j+1,k) &
                        + x(i,j,k-1) + x(i,j,k+1) )
    enddo
  enddo
enddo

“Layer condition”: 3*jmax*imax*8B < CS/2
If the layer condition holds, 5 of the 6 accesses to x() are served by the cache.

Question: Does parallelization/multi-threading change the layer condition?

Page 58: “Simple” performance modeling: The Roofline Model


Jacobi Stencil – OpenMP parallelization (3D)

!$OMP PARALLEL DO SCHEDULE(STATIC)
do k=1,kmax
  do j=1,jmax
    do i=1,imax
      y(i,j,k) = 1/6. * (x(i-1,j,k) + x(i+1,j,k) &
                       + x(i,j-1,k) + x(i,j+1,k) &
                       + x(i,j,k-1) + x(i,j,k+1) )
    enddo
  enddo
enddo

Layer condition (cubic domain; shared CacheSize = 25 MB):
1 thread: imax = jmax < 720; 10 threads: imax = jmax < 230

Basic guideline: parallelize the outermost loop with equally large chunks in the k-direction; the “layer condition” must hold for each thread.

Layer condition for a shared cache: nthreads * 3 * imax*jmax * 8B < CS/2
Layer condition for private caches: 3 * imax*jmax * 8B < CS/2
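A small C sketch (mine, not from the slides) that reproduces the 720 / 230 limits from the shared-cache layer condition for a cubic domain; the rounding to 720/230 on the slide is slightly coarser than the exact result:

#include <math.h>
#include <stdio.h>

/* Largest cubic edge length imax = jmax satisfying the shared-cache
   layer condition  nthreads * 3 * imax*jmax * 8B < CS/2. */
static long max_edge(double cache_bytes, int nthreads) {
    return (long)floor(sqrt(cache_bytes / (2.0 * 3.0 * 8.0 * nthreads)));
}

int main(void) {
    double CS = 25e6;   /* 25 MB shared L3, as quoted on the slide */
    printf(" 1 thread : imax = jmax < %ld\n", max_edge(CS, 1));   /* ~721 (slide: 720) */
    printf("10 threads: imax = jmax < %ld\n", max_edge(CS, 10));  /* ~228 (slide: 230) */
    return 0;
}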

Page 59: “Simple” performance modeling: The Roofline Model


Jacobi Stencil – OpenMP parallelization (II)

!$OMP PARALLEL DO SCHEDULE(STATIC)

do k=1,kmax

do j=1,jmax

do i=1,imax

y(i,j,k) = 1/6. *(x(i-1,j,k) +x(i+1,j,k) &

+ x(i,j-1,k) +x(i,j+1,k) &

+ x(i,j,k-1) +x(i,j,k+1) )

enddo

enddo

enddo

Intel® Xeon® Processor E5-2690 v2, 10 cores @ 3 GHz, CS = 25 MB (L3), bS = 48 GB/s

“Layer condition (shared cache)”: nthreads*3*jmax*imax*8B < CS/2 ⇒ BC = 24 B/LUP
Roofline model: P = min(Pmax, bS / (24 B/LUP))
“Light speed” performance: P = 2000 MLUP/s
What is Pmax here? Homework!
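A one-liner in C (mine, not from the slides) for the bandwidth “light speed” in LUP/s; the 40 B/LUP case anticipates the violated layer condition discussed two pages below:

#include <stdio.h>

int main(void) {
    double bS = 48e9;   /* saturated memory bandwidth [Byte/s] */

    /* bandwidth "light speed" in MLUP/s for the two code-balance cases */
    printf("layer condition holds    (24 B/LUP): %.0f MLUP/s\n", bS / 24 / 1e6);  /* 2000 */
    printf("layer condition violated (40 B/LUP): %.0f MLUP/s\n", bS / 40 / 1e6);  /* 1200 */
    return 0;
}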

Page 60: “Simple” performance modeling: The Roofline Model


Socket scaling – validate Roofline model

[Figure: performance vs. number of cores, measured against the roofline limits for BC = 24 B/LUP and BC = 40 B/LUP; Intel Xeon E5-2690 v2 (“IvyBridge” @ 3 GHz), Intel Compiler ifort V13.1, OpenMP parallel, bS = 48 GB/s]

P = min(Pmax, bS/BC) — what is Pmax here? Homework!

Page 61: “Simple” performance modeling: The Roofline Model

Case study: A Jacobi smoother

The basics in two dimensions

Layer conditions

Validating the model in 3D

Optimization by spatial blocking in 3D

Page 62: “Simple” performance modeling: The Roofline Model


Jacobi Stencil – OpenMP parallelization (I)

Validation: measured data traffic from main memory [Bytes/LUP].
1 thread: layer condition OK – but a single thread cannot saturate the bandwidth.
10 threads: performance drops around imax = 230, where the shared-cache layer condition is violated ⇒ 40 B/LUP.

Page 63: “Simple” performance modeling: The Roofline Model


Jacobi Stencil – OpenMP parallelization (I)

If the original “layer condition” does not hold: nthreads*3*jmax*imax*8B > CS/2
⇒ maxMLUPs = MemBW / (40 B/LUP)

!$OMP PARALLEL DO SCHEDULE(STATIC)
do k=1,kmax
  do j=1,jmax
    do i=1,imax
      y(i,j,k) = 1/6. * (x(i-1,j,k) + x(i+1,j,k) &
                       + x(i,j-1,k) + x(i,j+1,k) &
                       + x(i,j,k-1) + x(i,j,k+1) )
    enddo
  enddo
enddo

But assume: nthreads*jmax*imax*8B < CS/2 (i.e., one layer per thread still fits into half the cache). Then, per LUP:
(8+8) B/LUP for y() (ST + write-allocate)
+ 8 B/LUP for x(i,j,k+1)
+ 8 B/LUP for x(i,j+1,k)
+ 8 B/LUP for x(i,j,k-1)
= 40 B/LUP

“Light speed” performance: P = 1200 MLUP/s

Page 64: “Simple” performance modeling: The Roofline Model

Case study: A Jacobi smoother

The basics in two dimensions

Layer conditions

Validating the model in 3D

Spatial blocking in 3D

Page 65: “Simple” performance modeling: The Roofline Model


Jacobi Stencil – simple spatial blocking

do jb=1,jmax,jblock ! Assume jmax is a multiple of jblock
  !$OMP PARALLEL DO SCHEDULE(STATIC)
  do k=1,kmax
    do j=jb, (jb+jblock-1) ! Loop length jblock
      do i=1,imax
        y(i,j,k) = 1/6. * (x(i-1,j,k) + x(i+1,j,k) &
                         + x(i,j-1,k) + x(i,j+1,k) &
                         + x(i,j,k-1) + x(i,j,k+1) )
      enddo
    enddo
  enddo
enddo

“Layer condition” (j-blocking): nthreads*3*jblock*imax*8B < CS/2
Test system: Intel® Xeon® Processor E5-2690 v2 (10 cores / 3 GHz), as before.
Ensure the layer condition by choosing jblock appropriately (cubic domain):
jblock < CS/(imax * nthreads * 48B)

Page 66: “Simple” performance modeling: The Roofline Model


Jacobi Stencil – simple spatial blocking

Determine: jblock < CS/(2*nthreads*3*imax*8B), with imax = jmax = kmax.

Validation: measured data traffic from main memory [Bytes/LUP].
[Figure: performance and measured memory traffic vs. problem size; steps where the number of blocks changes]
CS = 10 MB: ~90+ % of the roofline limit.

Page 67: “Simple” performance modeling: The Roofline Model


Why not block the inner loop level?

[Figure: performance of inner (i) loop blocking vs. middle (j) loop blocking, compared with the 24 B/update performance model; the optimum j block size is marked]

Page 68: “Simple” performance modeling: The Roofline Model


Conclusions from the Jacobi example

We have made sense of the memory-bound performance vs. problem size:
“layer conditions” lead to predictions of the code balance
“what part of the data comes from where” is a crucial question
The model works only if the bandwidth is “saturated”; in-cache modeling is more involved
Avoiding slow data paths == re-establishing the most favorable layer condition
The improved code showed the speedup predicted by the model
The optimal blocking factor can be estimated:
be guided by the cache size and the layer condition
no need for an exhaustive scan of the “optimization space”

Food for thought:
Multi-dimensional loop blocking – would it make sense?
Can we choose a “better” OpenMP loop schedule?
What would change if we parallelized inner loops?

Page 69: “Simple” performance modeling: The Roofline Model


Roofline diagram – exercise

Draw the roofline diagram, assuming double precision, for one node of the emmy cluster – the single-processor chip data can be found on slide 79 of the first presentation (“Intel Xeon E5 2690 v2”).

Add these roofs:
code that performs only single-precision computations
code that performs only scalar instructions – no vectorization or SIMD instructions
OpenMP code that ignores the golden rule of ccNUMA