The Roofline Model


Transcript of The Roofline Model

Page 1: The Roofline Model


The Roofline Model

Samuel Williams, Lawrence Berkeley National Laboratory

[email protected]

Page 2: The Roofline Model


Outline

Challenges / Goals
Fundamentals
Roofline Performance Model
Example: Heat Equation
Example: SpMV
Alternate Roofline Formulations
Summary

Page 3: The Roofline Model


Challenges / Goals

Page 4: The Roofline Model


Four Architectures

[Figure: block diagrams of the four architectures. AMD Barcelona: two quad-core Opteron sockets, 512 KB victim caches per core, a 2 MB shared quasi-victim L3 per socket, HyperTransport (4 GB/s each direction), and 10.66 GB/s of 667 MHz DDR2 per socket. Sun Victoria Falls: two sockets of eight multithreaded SPARC cores, a 4 MB shared L2 per socket, and 21.33 GB/s read / 10.66 GB/s write of 667 MHz FBDIMM per socket. NVIDIA G80: eight thread clusters, a 192 KB texture-only L2, and 86.4 GB/s to 768 MB of 900 MHz GDDR3 device DRAM. IBM Cell Blade: two sockets, each with a VMT PPE (512 KB L2) and eight SPEs with 256 KB local stores and MFCs on the EIB ring, with 25.6 GB/s of XDR DRAM per socket.]

Page 5: The Roofline Model


Challenges / Goals

We have extremely varied architectures. Moreover, the characteristics of numerical methods can vary dramatically.

The result is that performance and the benefit of optimization can vary significantly from one architecture x kernel combination to the next.

We wish to understand whether or not we have attained good performance (a high fraction of theoretical peak).

We wish to identify performance bottlenecks and enumerate potential remediation strategies.

Page 6: The Roofline Model


Fundamentals

Page 7: The Roofline Model


Little’s Law

Little's Law: Concurrency = Latency * Bandwidth, or, equivalently, Effective Throughput = Expressed Concurrency / Latency.

Bandwidth: conventional memory bandwidth, or the number of floating-point units.
Latency: memory latency, or functional-unit latency.
Concurrency: bytes in flight to the memory subsystem, or concurrent (parallel) memory/FP operations.

For example, consider a CPU with 2 FPUs, each with a 4-cycle latency. Little's Law states that we must express 8-way ILP to fully utilize the machine.

Page 8: The Roofline Model


Little’s Law Examples

Applied to FPUs: consider a CPU with 2 FPUs, each with a 4-cycle latency. Little's Law states that we must express 8-way ILP to fully utilize the machine.

Solution: unroll/jam the code to express 8 independent FP operations, as in the sketch below. Note that simply unrolling dependent operations (e.g. a reduction) does not increase ILP; it only amortizes loop overhead.
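A minimal C sketch of what such an unroll-and-jam might look like for a dot product (illustrative only; the function name and the factor of 8 are assumptions tied to the 2-FPU, 4-cycle-latency example, not code from the slides):

/* Hypothetical example: 8 independent accumulators expose 8-way ILP,
 * whereas a single accumulator serializes on the 4-cycle FP add latency. */
double dot_unrolled(const double *x, const double *y, int n)
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0, s4 = 0, s5 = 0, s6 = 0, s7 = 0;
    int i;
    for (i = 0; i + 8 <= n; i += 8) {
        s0 += x[i+0] * y[i+0];   /* these 8 multiply-adds are independent, */
        s1 += x[i+1] * y[i+1];   /* so they can fill both FPU pipelines    */
        s2 += x[i+2] * y[i+2];
        s3 += x[i+3] * y[i+3];
        s4 += x[i+4] * y[i+4];
        s5 += x[i+5] * y[i+5];
        s6 += x[i+6] * y[i+6];
        s7 += x[i+7] * y[i+7];
    }
    double s = ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));
    for (; i < n; i++)           /* remainder loop */
        s += x[i] * y[i];
    return s;
}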

Applied to Memory: consider a CPU with 20 GB/s of bandwidth and 100 ns memory latency. Little's Law states that we must express 2 KB of concurrency (independent memory operations) to the memory subsystem to attain peak performance.

On today's superscalar processors, hardware stream prefetchers speculatively load consecutive elements. Solution: express the memory access pattern in a streaming fashion in order to engage the prefetchers.

Page 9: The Roofline Model


Three Classes of Locality

Temporal locality: reusing data (in registers or cache lines) multiple times amortizes the impact of limited bandwidth. Transform loops or algorithms to maximize reuse.

Spatial locality: data is transferred from cache to registers in words, but it is transferred into the cache in 64-128 byte lines. Using every word in a line maximizes spatial locality. Transform data structures into a structure-of-arrays (SoA) layout (a sketch follows below).

Sequential locality: many memory access patterns touch cache lines sequentially. CPU hardware stream prefetchers exploit this by speculatively loading data to hide memory latency. Transform loops to generate (a few) long, unit-stride accesses.
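A minimal C sketch of the AoS-to-SoA idea (hypothetical types and names, not from the slides): if a loop reads only one field, the SoA layout uses every byte of each cache line it fetches, while the AoS layout wastes most of the traffic.

#include <stddef.h>

struct point_aos { double x, y, z; };        /* AoS: x values are 24 bytes apart  */
struct points_soa { double *x, *y, *z; };    /* SoA: each field is contiguous     */

double sum_x_aos(const struct point_aos *p, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += p[i].x;   /* uses 8 of every 24 bytes moved */
    return s;
}

double sum_x_soa(const struct points_soa *p, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += p->x[i];  /* unit stride; full line reuse   */
    return s;
}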

Page 10: The Roofline Model


Arithmetic Intensity

True Arithmetic Intensity (AI) ~ Total Flops / Total DRAM Bytes

Some HPC kernels have an arithmetic intensity that scales with problem size (increased temporal locality)

Others have constant intensity

Arithmetic intensity is ultimately limited by compulsory traffic; it is diminished by conflict or capacity misses.

[Figure: arithmetic intensity spectrum. O(1): SpMV, BLAS1/2, stencils (PDEs), lattice methods. O(log N): FFTs. O(N): dense linear algebra (BLAS3), particle methods.]

Page 11: The Roofline Model


NUMA


Recent multicore SMPs have integrated the memory controllers on chip. As a result, memory access is non-uniform (NUMA): the bandwidth to read a given address varies dramatically between cores. Exploit NUMA (affinity + first touch) when you malloc/initialize data. The concept is similar to data decomposition for distributed memory.
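A minimal sketch of NUMA-aware first-touch initialization, assuming OpenMP and an OS with a first-touch page-placement policy (illustrative; not code from the slides):

#include <stdlib.h>

/* Initialize with the same static thread decomposition that will later
 * compute on the data, so each page lands in the touching thread's
 * local memory. */
double *alloc_first_touch(size_t n)
{
    double *a = malloc(n * sizeof *a);
    if (!a) return NULL;
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        a[i] = 0.0;                 /* first touch places the page locally */
    return a;
}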


Page 13: The Roofline Model


Roofline Model

Page 14: The Roofline Model


Overlap of Communication

Consider a simple example in which an FP kernel maintains a working set in DRAM.

We assume we can perfectly overlap computation with communication (or vice versa), either through prefetching/DMA and/or pipelining (decoupling of communication and computation).

Thus, time is the maximum of the time required to transfer the data and the time required to perform the floating-point operations:

time = max( Bytes / STREAM bandwidth , Flops / peak Flop/s )

Page 15: The Roofline Model


Roofline Model: Basic Concept

Synthesize communication, computation, and locality into a single, visually intuitive performance figure using bound-and-bottleneck analysis, where optimization i can be SIMDize, unroll, SW prefetch, etc.

Given a kernel's arithmetic intensity (based on DRAM traffic after being filtered by the cache), programmers can inspect the figure and bound performance. Moreover, it provides insight into which optimizations will potentially be beneficial.

Attainable Performance_i,j = min( Flop/s with optimizations 1..i , AI * Bandwidth with optimizations 1..j )
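A minimal numeric sketch of this bound (my own illustration; the peak and bandwidth values are assumptions stated in the comment, not measurements from the slides):

#include <stdio.h>

/* Roofline bound: attainable GFlop/s = min(peak GFlop/s, AI * GB/s). */
static double roofline(double peak_gflops, double bw_gbs, double ai)
{
    double mem_bound = ai * bw_gbs;
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}

int main(void)
{
    /* Assumed machine: ~73.6 GFlop/s peak DP, ~16 GB/s Stream bandwidth. */
    double ai[] = { 0.25, 0.5, 1.0, 2.0, 4.0, 8.0 };
    for (int i = 0; i < 6; i++)
        printf("AI = %-4g  ->  %6.1f GFlop/s attainable\n",
               ai[i], roofline(73.6, 16.0, ai[i]));
    return 0;
}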

Page 16: The Roofline Model


Example

Consider the Opteron 2356 (Barcelona):
dual socket (NUMA)
limited HW stream prefetchers
quad-core (8 cores total), 2.3 GHz
2-way SIMD (DP)
separate FPMUL and FPADD datapaths
4-cycle FP latency

Assuming expression of parallelism is the challenge on this architecture, what would the Roofline model look like?
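As a worked check (my arithmetic, assuming each core can retire one 2-wide SIMD add and one 2-wide SIMD multiply per cycle): peak DP is roughly 2 sockets x 4 cores x 2.3 GHz x 4 flops/cycle, or about 73.6 GFlop/s, which is the "peak DP" ceiling drawn in the following figures.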

Page 17: The Roofline Model

Roofline Model: Basic Concept

Plot on a log-log scale. Given the AI, we can easily bound performance. But architectures are much more complicated.

We will bound performance as we eliminate specific forms of in-core parallelism.

[Figure: Opteron 2356 (Barcelona) roofline: attainable GFLOP/s (0.5 to 256) versus actual FLOP:byte ratio (1/8 to 16) on a log-log plot, bounded by the Stream bandwidth diagonal and the peak DP ceiling.]

Page 18: The Roofline Model

Roofline Model: computational ceilings

Opterons have dedicated multipliers and adders. If the code is dominated by adds, then attainable performance is half of peak.

We call these ceilings; they act like constraints on performance.

[Figure: Opteron 2356 (Barcelona) roofline with a mul/add imbalance ceiling below peak DP, bounded on the left by Stream bandwidth.]

Page 19: The Roofline Model

Roofline Model: computational ceilings

Opterons have 128-bit datapaths. If instructions aren't SIMDized, attainable performance will be halved.

[Figure: Opteron 2356 (Barcelona) roofline adding a "w/out SIMD" ceiling below peak DP and mul/add imbalance.]

Page 20: The Roofline Model

Roofline Model: computational ceilings

On Opterons, floating-point instructions have a 4-cycle latency. If we don't express 4-way ILP, performance will drop by as much as 4x.

[Figure: Opteron 2356 (Barcelona) roofline adding a "w/out ILP" ceiling below peak DP, mul/add imbalance, and "w/out SIMD".]

Page 21: The Roofline Model

Roofline Model: communication ceilings

We can perform a similar exercise, taking away parallelism from the memory subsystem.

[Figure: Opteron 2356 (Barcelona) roofline showing the peak DP ceiling and the Stream bandwidth diagonal.]

Page 22: The Roofline Model

Roofline Model: communication ceilings

Explicit software prefetch instructions are required to achieve peak bandwidth.

[Figure: Opteron 2356 (Barcelona) roofline adding a "w/out SW prefetch" bandwidth ceiling below Stream bandwidth.]

Page 23: The Roofline Model

Roofline Model: communication ceilings

Opterons are NUMA. As such, memory traffic must be correctly balanced between the two sockets to achieve good Stream bandwidth.

We could continue this by examining strided or random memory access patterns.

[Figure: Opteron 2356 (Barcelona) roofline adding a "w/out NUMA" bandwidth ceiling below "w/out SW prefetch".]

Page 24: The Roofline Model

Roofline Model: computation + communication ceilings

We may bound performance based on the combination of expressed in-core parallelism and attained bandwidth.

[Figure: Opteron 2356 (Barcelona) roofline combining the computational ceilings (peak DP, mul/add imbalance, w/out SIMD, w/out ILP) and the bandwidth ceilings (Stream bandwidth, w/out SW prefetch, w/out NUMA).]

Page 25: The Roofline Model

Roofline Model: locality walls

Remember, memory traffic includes more than just compulsory misses. As such, actual arithmetic intensity may be substantially lower.

Walls are unique to the architecture-kernel combination.

AI = Flops / Compulsory Misses

[Figure: the Opteron 2356 (Barcelona) roofline with all ceilings plus a vertical "only compulsory miss traffic" locality wall.]

Page 26: The Roofline Model


Cache Behavior

Knowledge of the underlying cache operation can be critical. For example, caches are organized into lines, and lines are organized into sets and ways (associativity). Thus, we must mimic the effect of Mark Hill's 3 C's of caches. The impacts of conflict, compulsory, and capacity misses are both architecture- and application-dependent; ultimately they reduce the actual flop:byte ratio.

Moreover, many caches are write-allocate: a write-allocate cache reads in an entire cache line upon a write miss. If the application ultimately overwrites that line, the read was superfluous (further reducing the flop:byte ratio).

Because programs access data in words, but hardware transfers it in 64- or 128-byte cache lines, spatial locality is key. Array-of-structures data layouts can lead to dramatically lower flop:byte ratios; e.g. if a program only operates on the "red" field of a pixel, bandwidth is wasted.

Page 27: The Roofline Model

Roofline Model: locality walls

Remember, memory traffic includes more than just compulsory misses. As such, actual arithmetic intensity may be substantially lower.

Walls are unique to the architecture-kernel combination.

AI = Flops / (Allocations + Compulsory Misses)

[Figure: the Barcelona roofline with an additional "+write allocation traffic" wall to the left of "only compulsory miss traffic".]

Page 28: The Roofline Model

Roofline Model: locality walls

Remember, memory traffic includes more than just compulsory misses. As such, actual arithmetic intensity may be substantially lower.

Walls are unique to the architecture-kernel combination.

AI = Flops / (Capacity + Allocations + Compulsory Misses)

[Figure: the Barcelona roofline with a further "+capacity miss traffic" wall.]

Page 29: The Roofline Model

Roofline Model: locality walls

Remember, memory traffic includes more than just compulsory misses. As such, actual arithmetic intensity may be substantially lower.

Walls are unique to the architecture-kernel combination.

AI = Flops / (Conflict + Capacity + Allocations + Compulsory Misses)

[Figure: the Barcelona roofline with a further "+conflict miss traffic" wall.]

Page 30: The Roofline Model

Roofline Model: locality walls

Optimizations remove these walls and ceilings, which act to constrain performance.

[Figure: the Barcelona roofline with all computational ceilings, bandwidth ceilings, and locality walls shown together.]

Page 31: The Roofline Model


Instruction Issue Bandwidth

On a superscalar processor, there is likely ample instruction-issue bandwidth. This allows loads, integer, and FP instructions to be issued simultaneously. As such, we assumed that expression of parallelism was the underlying in-core challenge.

However, on some architectures, finite instruction-issue bandwidth can become a major impediment to performance.

Page 32: The Roofline Model

Roofline Model: Instruction Mix

As the instruction mix shifts away from floating point, finite issue bandwidth begins to limit in-core performance.

On Niagara2, with dual issue units but only one FPU, FP instructions must constitute ≥50% of the mix to attain peak performance.

A similar approach should be used on GPUs, where proper use of CUDA solves the parallelism challenges.

[Figure: UltraSparc T2+ T5140 (Niagara2) roofline with ceilings at peak DP (≥50% FP), 25% FP, and 12% FP, plus "w/out SW prefetch" and "w/out NUMA" bandwidth ceilings.]

Page 33: The Roofline Model


Optimization Categorization

Maximizing (attained) In-core Performance
Minimizing (total) Memory Traffic
Maximizing (attained) Memory Bandwidth

Page 34: The Roofline Model


Optimization Categorization

Minimizing Memory Traffic

Maximizing Memory Bandwidth

Maximizing In-core Performance:
• Exploit in-core parallelism (ILP, DLP, etc…)
• Good (enough) floating-point balance

Page 35: The Roofline Model


Optimization Categorization

Minimizing Memory Traffic

Maximizing Memory Bandwidth

Maximizing In-core Performance:
• Exploit in-core parallelism (ILP, DLP, etc…)
• Good (enough) floating-point balance
(techniques: unroll & jam, explicit SIMD, reorder, eliminate branches)

Page 36: The Roofline Model


Optimization Categorization

Maximizing In-core Performance:
• Exploit in-core parallelism (ILP, DLP, etc…)
• Good (enough) floating-point balance
(techniques: unroll & jam, explicit SIMD, reorder, eliminate branches)

Minimizing Memory Traffic

Maximizing Memory Bandwidth:
• Exploit NUMA
• Hide memory latency
• Satisfy Little's Law
(techniques: memory affinity, SW prefetch, DMA lists, unit-stride streams, TLB blocking)

Page 37: The Roofline Model


Optimization Categorization

Maximizing In-core Performance:
• Exploit in-core parallelism (ILP, DLP, etc…)
• Good (enough) floating-point balance
(techniques: unroll & jam, explicit SIMD, reorder, eliminate branches)

Maximizing Memory Bandwidth:
• Exploit NUMA
• Hide memory latency
• Satisfy Little's Law
(techniques: memory affinity, SW prefetch, DMA lists, unit-stride streams, TLB blocking)

Minimizing Memory Traffic:
• Eliminate capacity misses
• Eliminate conflict misses
• Eliminate compulsory misses
• Eliminate write-allocate behavior
(techniques: cache blocking, array padding, compress data, streaming stores)

Page 38: The Roofline Model


Optimization Categorization

Maximizing In-core Performance:
• Exploit in-core parallelism (ILP, DLP, etc…)
• Good (enough) floating-point balance
(techniques: unroll & jam, explicit SIMD, reorder, eliminate branches)

Minimizing Memory Traffic:
• Eliminate capacity misses
• Eliminate conflict misses
• Eliminate compulsory misses
• Eliminate write-allocate behavior
(techniques: cache blocking, array padding, compress data, streaming stores)

Maximizing Memory Bandwidth:
• Exploit NUMA
• Hide memory latency
• Satisfy Little's Law
(techniques: memory affinity, SW prefetch, DMA lists, unit-stride streams, TLB blocking)

Page 39: The Roofline Model


Examples

Page 40: The Roofline Model


Multicore SMPs Used

[Figure: block diagrams of the four multicore SMPs used: AMD Opteron 2356 (Barcelona), Intel Xeon E5345 (Clovertown), Sun T2+ T5140 (Victoria Falls), and IBM QS20 Cell Blade, showing their cores, cache hierarchies, interconnects, and DRAM bandwidths.]

Page 41: The Roofline Model


Heat Equation

Page 42: The Roofline Model


7-point Stencil


[Figure: a 3D PDE grid (+X, +Y, +Z axes) and the 7-point heat-equation stencil centered at (x,y,z) with its six neighbors at x±1, y±1, and z±1.]

The simplest derivation of the Laplacian operator results in a constant-coefficient 7-point stencil:

for all x,y,z:
  u(x,y,z,t+1) = alpha*u(x,y,z,t) + beta*( u(x,y,z-1,t) + u(x,y-1,z,t) + u(x-1,y,z,t)
               + u(x+1,y,z,t) + u(x,y+1,z,t) + u(x,y,z+1,t) )

Clearly each stencil performs 8 floating-point operations and 8 memory references, all but 2 of which should be filtered by an ideal cache. It touches 6 memory streams, all but 2 of which should be filtered (fewer than the number of HW prefetchers).

Ideally, AI = 0.5. However, write allocate bounds it to 0.33.
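A worked check of those numbers (my arithmetic, following the slide's assumptions): with an ideal cache, each point moves 8 bytes of u(t) in and 8 bytes of u(t+1) out, so AI = 8 flops / 16 bytes = 0.5. With a write-allocate cache, the store miss first reads the destination line, adding another 8 bytes per point, so AI = 8 / 24, or about 0.33. A cache-bypass (streaming) store removes that extra read and restores AI toward 0.5.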

Page 43: The Roofline Model


Roofline model for Stencil (out-of-the-box code)

Large datasets
2 unit-stride streams
No NUMA
Little ILP
No DLP
Far more adds than multiplies (imbalance)
Ideal flop:byte ratio of 1/3
High locality requirements: capacity and conflict misses will severely impair the flop:byte ratio

[Figure: rooflines for the Intel Xeon E5345 (Clovertown), AMD Opteron 2356 (Barcelona), Sun T2+ T5140 (Victoria Falls), and IBM QS20 Cell Blade, with ceilings such as peak DP, mul/add imbalance, w/out SIMD, w/out ILP, w/out FMA, w/out SW prefetch, w/out NUMA, unaligned DMA, 25%/12% FP, and "fits within snoop filter". No naïve Cell implementation.]


Page 45: The Roofline Model


Roofline model for Stencil (NUMA, cache blocking, unrolling, prefetch, …)

Cache blocking helps ensure the flop:byte ratio stays as close as possible to 1/3 (a blocked-loop sketch follows below).
Clovertown has huge caches but is pinned to the lower bandwidth ceiling.
Cache management is essential when capacity per thread is low.

[Figure: the same four rooflines (Clovertown, Barcelona, Victoria Falls, Cell Blade) after NUMA-aware allocation, cache blocking, unrolling, and prefetching. No naïve Cell implementation.]
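A minimal C sketch of cache blocking for this stencil (illustrative only; the IDX macro, array names, and tile sizes BX/BY are my assumptions, not the authors' code):

#include <stddef.h>

#define IDX(i,j,k,N) ((size_t)(k)*(N)*(N) + (size_t)(j)*(N) + (size_t)(i))

/* Tile the x/y plane so a block of planes stays resident in cache while
 * we stream through z; the innermost i loop remains unit stride. */
void stencil_blocked(double *restrict unew, const double *restrict u,
                     int N, double alpha, double beta, int BX, int BY)
{
    for (int jj = 1; jj < N-1; jj += BY)
    for (int ii = 1; ii < N-1; ii += BX)
        for (int k = 1; k < N-1; k++)
        for (int j = jj; j < jj+BY && j < N-1; j++)
        for (int i = ii; i < ii+BX && i < N-1; i++)
            unew[IDX(i,j,k,N)] = alpha*u[IDX(i,j,k,N)]
                + beta*( u[IDX(i,j,k-1,N)] + u[IDX(i,j-1,k,N)] + u[IDX(i-1,j,k,N)]
                       + u[IDX(i+1,j,k,N)] + u[IDX(i,j+1,k,N)] + u[IDX(i,j,k+1,N)] );
}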

Page 46: The Roofline Model


Roofline model for Stencil (SIMDization + cache bypass)

Make SIMDization explicit. Technically, this swaps the ILP and SIMD ceilings.
Use the cache-bypass instruction movntpd (a sketch follows below): this increases the flop:byte ratio to ~0.5 on x86/Cell.

[Figure: the same four rooflines (Clovertown, Barcelona, Victoria Falls, Cell Blade) after explicit SIMDization and cache-bypass stores.]
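A minimal SSE2 sketch of the cache-bypass idea using _mm_stream_pd (which compiles to movntpd); this is an illustrative scale kernel, not the authors' stencil code, and it assumes 16-byte-aligned arrays:

#include <stddef.h>
#include <emmintrin.h>

void scale_stream(double *restrict dst, const double *restrict src,
                  size_t n, double alpha)
{
    __m128d va = _mm_set1_pd(alpha);
    size_t i;
    for (i = 0; i + 2 <= n; i += 2) {
        __m128d v = _mm_mul_pd(va, _mm_load_pd(&src[i]));
        _mm_stream_pd(&dst[i], v);     /* movntpd: write without allocating the line */
    }
    for (; i < n; i++)                 /* scalar remainder */
        dst[i] = alpha * src[i];
    _mm_sfence();                      /* order the non-temporal stores */
}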

Page 47: The Roofline Model


SpMV

Page 48: The Roofline Model


Sparse Matrix-Vector Multiplication

What's a sparse matrix? Most entries are 0.0, so there is a performance advantage in only storing and operating on the nonzeros, but significant metadata is required to reconstruct the matrix structure.

What's SpMV? Evaluate y = Ax, where A is a sparse matrix and x and y are dense vectors.

Challenges: very low arithmetic intensity (often < 0.166 flops/byte); difficult to exploit ILP (bad for pipelined or superscalar cores); difficult to exploit DLP (bad for SIMD).

(a) Algebra conceptualization: y = Ax.

(b) CSR data structure: A.val[ ], A.col[ ], A.rowStart[ ].

(c) CSR reference code:

for (r = 0; r < A.rows; r++) {
  double y0 = 0.0;
  for (i = A.rowStart[r]; i < A.rowStart[r+1]; i++) {
    y0 += A.val[i] * x[A.col[i]];
  }
  y[r] = y0;
}

Page 49: The Roofline Model


Roofline model for SpMV

Double-precision roofline models: in-core optimizations 1..i, DRAM optimizations 1..j.

FMA is inherent in SpMV (so the FMA ceiling is placed at the bottom).

GFlops_i,j(AI) = min( InCoreGFlops_i , StreamBW_j * AI )

[Figure: rooflines for the Intel Xeon E5345 (Clovertown), Opteron 2356 (Barcelona), Sun T2+ T5140 (Victoria Falls), and IBM QS20 Cell Blade, with ceilings such as peak DP, mul/add imbalance, w/out SIMD, w/out ILP, w/out FMA, bank conflicts, 25%/12% FP, w/out SW prefetch, w/out NUMA, and "dataset fits in snoop filter".]

Page 50: The Roofline Model


Roofline model for SpMV (overlay arithmetic intensity)

Two unit-stride streams
Inherent FMA
No ILP
No DLP
FP is 12-25% of instructions
Naïve compulsory flop:byte < 0.166

[Figure: the four SpMV rooflines (Clovertown, Barcelona, Victoria Falls, Cell Blade) with the naïve compulsory arithmetic intensity (< 0.166) overlaid. No naïve SPE implementation.]
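A worked check of that bound (my arithmetic, assuming 8-byte values and 4-byte column indices): each nonzero costs one multiply and one add (2 flops) but must move at least 8 bytes of A.val plus 4 bytes of A.col, so AI <= 2/12, or about 0.166, even before counting the row pointers, the source vector x, and the destination vector y; those extra streams push the compulsory flop:byte ratio below 1/6.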

Page 51: The Roofline Model


Roofline model for SpMV (out-of-the-box parallel)

Two unit-stride streams
Inherent FMA
No ILP
No DLP
FP is 12-25% of instructions
Naïve compulsory flop:byte < 0.166
For simplicity: a dense matrix stored in sparse format

[Figure: the four SpMV rooflines (Clovertown, Barcelona, Victoria Falls, Cell Blade) with out-of-the-box parallel performance overlaid. No naïve SPE implementation.]

Page 52: The Roofline Model


Roofline model for SpMV (NUMA & SW prefetch)

Compulsory flop:byte ~ 0.166
Utilize all memory channels

[Figure: the four SpMV rooflines (Clovertown, Barcelona, Victoria Falls, Cell Blade) after NUMA-aware allocation and software prefetch. No naïve SPE implementation.]

Page 53: The Roofline Model


Roofline model for SpMV (matrix compression)

Inherent FMA
Register blocking improves ILP, DLP, the flop:byte ratio, and the FP fraction of instructions

[Figure: the four SpMV rooflines (Clovertown, Barcelona, Victoria Falls, Cell Blade) after register blocking and matrix compression.]

Page 54: The Roofline Model


Various Kernels

We have examined and heavily optimized a number of kernels and applications for both CPUs and GPUs.

We observe that for most, performance is highly correlated with DRAM bandwidth – particularly on the GPU.

Note, GTC has a strong scatter/gather component that skews STREAM-based rooflines.

[Figure: rooflines for the Intel Xeon X5550 (Nehalem) and NVIDIA C2050 (Fermi) with kernels overlaid (DGEMM, RTM/wave eqn., 27pt stencil, 7pt stencil, SpMV, GTC/pushi, GTC/chargei); ceilings include single-precision peak, double-precision peak, DP add-only, and STREAM/device bandwidth.]

Page 55: The Roofline Model


Alternate Rooflines

Page 56: The Roofline Model


No overlap of communication and computation

Previously, we assumed perfect overlap of communication and computation.

What happens if there is a dependency (either inherent or from a lack of optimization) that serializes communication and computation?

With overlap: time = max( Bytes / STREAM bandwidth , Flops / peak Flop/s )
Without overlap: time = Bytes / STREAM bandwidth + Flops / peak Flop/s

Time is now the sum of communication time and computation time. The result is that flop/s only approaches its bound asymptotically.
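Dividing the flop count by that serialized time gives (my algebra, consistent with the factor-of-2 observation on the next slide): attainable Flop/s = 1 / ( 1/peak + 1/(AI * bandwidth) ), a smooth curve that approaches min(peak, AI * bandwidth) asymptotically and is never more than 2x below it.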

Page 57: The Roofline Model


No overlap of communication and computation

Consider a generic machine. If we can perfectly decouple and overlap communication with computation, the roofline is sharp/angular.

However, without overlap, the roofline is smoothed, and attainable performance is degraded by up to a factor of 2x.

Page 58: The Roofline Model


Alternate Bandwidths

Thus far, we assumed a synergy between streaming applications and bandwidth (proxied by the STREAM benchmark)

STREAM is NOT a good proxy for short stanza/random cacheline access patterns as memory latency (instead of just bandwidth) is being exposed.

Thus one might conceive of alternate memory benchmarks to provide a bandwidth upper bound (ceiling)

Similarly, if data primarily resides in the last-level cache (LLC), one should construct rooflines based on LLC bandwidth and flop:LLC-byte ratios.

For GPUs/accelerators, PCIe bandwidth can be an impediment. Thus one can construct a roofline model based on PCIe bandwidth and the flop:PCIe byte ratio.

Page 59: The Roofline Model


Alternate Computations

Having arisen from HPC kernels, it is no surprise that rooflines typically use DP flop/s. Of course, they could instead use SP flop/s, integer ops, bit operations, pairwise comparisons (sorting), graphics operations, etc.

Page 60: The Roofline Model


Time-based roofline

In some cases, it is easier to visualize performance in terms of seconds (i.e. time-to-solution).

We can invert the roofline (seconds per flop) and simply multiply by the number of requisite flops.
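In that form the bound reads (my restatement of the same model): time >= Flops * max( 1/peak Flop/s , 1/(AI * bandwidth) ) = max( Flops / peak , Bytes / bandwidth ).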

Additionally, we could change the horizontal axis from locality to some more appealing metric.

Page 61: The Roofline Model


Empirical Roofline

Thus far, all in-core estimates have been based on first principles analysis of the underlying computer architecture (frequency, SIMD-width, latency, etc…)

Conceivably, one could design a series of compiled benchmarks that would extract the relevant roofline parameters.

Similarly, one could use performance counters to extract application characteristics so one could accurately determine application coordinates.

Page 62: The Roofline Model


Questions?

Acknowledgments: research supported by the DOE Office of Science under contract number DE-AC02-05CH11231.

Page 63: The Roofline Model


BACKUP SLIDES