Transcript of PARALLEL PROGRAMMING MANY-CORE COMPUTING: INTRO (1/5)
Rob van Nieuwpoort, [email protected]

Page 1

PARALLEL PROGRAMMING

MANY-CORE COMPUTING:

INTRO (1/5)

Rob van Nieuwpoort

[email protected]

Page 2: Schedule

1. Introduction, performance metrics & analysis

2. Many-core hardware

3. CUDA class 1: basics

4. CUDA class 2: advanced

5. Case study: LOFAR telescope with many-cores

Page 3: What are many-cores?

Page 4: What are many-cores?

From Wikipedia: “A many-core processor is a multi-core processor in which the number of cores is large enough that traditional multi-processor techniques are no longer efficient — largely because of issues with congestion in supplying instructions and data to the many processors.”

Page 5: What are many-cores?

How many is many?

Several tens of cores

How are they different from multi-core CPUs?

Non-uniform memory access (NUMA)

Private memories

Network-on-chip

Examples

Multi-core CPUs (48-core AMD Magny-Cours)

Graphics Processing Units (GPUs)

GPGPU = general purpose programming on GPUs

Cell processor (PlayStation 3)

Server processors (Sun Niagara)

Page 6: Many-core questions

The search for performance:

Build hardware

What architectures?

Evaluate hardware

What metrics?

How do we measure?

Use it

What workloads?

Expected performance?

Program it

How to program?

How to optimize?

Benchmark

How to analyze performance?


Page 7: Today’s Topics

Introduction

Why many-core programming?

History

Hardware introduction

Performance model: Arithmetic Intensity and Roofline

Page 8: Why do we need many-cores?

Page 9: Why do we need many-cores?

[Figure: peak performance over time for NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) versus Intel CPUs (3 GHz dual-core P4, 3 GHz Core2 Duo, 3 GHz quad-core Xeon, Westmere)]

Page 10: Why do we need many-cores?

Page 11: Why do we need many-cores?

Page 12: China's Tianhe-1A

#5 in top500 list

4.701 PFLOPS theoretical peak

2.566 PFLOPS Linpack (Rmax)

14,336 Xeon X5670 processors

7168 Nvidia Tesla M2050 GPUs x 448 cores = 3,211,264 cores


Page 13: Power efficiency

Page 14: Graphics in 1980

Page 15: Graphics in 2000

Page 16: Graphics now: GPU movie

Page 17: Realism of modern GPUs

http://www.youtube.com/watch?v=bJDeipvpjGQ&feature=player_embedded#t=49s

Courtesy techradar.com

Page 18: Why do we need many-cores?

Performance

Large scale parallelism

Power Efficiency

Use transistors more efficiently

Price (GPUs)

Huge market, bigger than Hollywood

Mass production, economy of scale

“spotty teenagers” pay for our HPC needs!


Page 19: GPGPU history

[Figure: NVIDIA GPU transistor counts, 1995-2010: RIVA 128 (3M xtors), GeForce 256 (23M), GeForce 3 (60M), GeForce FX (125M), GeForce 8800 (681M), “Fermi” (3B)]

Page 20: GPGPU History

Use Graphics primitives for HPC

Ikonas [England 1978]

Pixel Machine [Potmesil & Hoffert 1989]

Pixel-Planes 5 [Rhoades, et al. 1992]

Programmable shaders, around 1998

DirectX / OpenGL

Map application onto graphics domain!

GPGPU

Brook (2004), CUDA (2007), OpenCL (Dec 2008), ...


Page 21: CUDA C/C++ Continuous Innovation

Timeline: July 07, Nov 07, April 08, Aug 08, July 09, Nov 09, Mar 10

CUDA Toolkit 1.0
• C compiler
• C extensions
• Single precision
• BLAS, FFT
• SDK with 40 examples

CUDA Toolkit 1.1
• Win XP 64
• Atomics support
• Multi-GPU support

CUDA Toolkit 2.0
• Double precision
• Compiler optimizations
• Vista 32/64
• Mac OSX
• 3D textures
• HW interpolation

CUDA Visual Profiler 2.2

cuda-gdb HW debugger

CUDA Toolkit 2.3
• DP FFT
• 16-32 conversion intrinsics
• Performance enhancements

Parallel Nsight Beta

CUDA Toolkit 3.0
• C++ inheritance
• Fermi support
• Tools updates
• Driver / RT interop

Page 22: CUDA Tools

Parallel Nsight: Visual Studio

Visual Profiler: for Linux

cuda-gdb: for Linux

Page 23: Many-core hardware introduction

Page 24: The search for performance

Page 25: The search for performance

We have M(o)ore transistors …

Bigger cores?

We are hitting the walls!

power, memory, instruction-level parallelism (ILP)

How do we use them?

Large-scale parallelism

Many-cores!

Page 26: Choices …

Core type(s):

Fat or slim?

Vectorized (SIMD)?

Homogeneous or heterogeneous?

Number of cores:

Few or many?

Memory:

Shared-memory or distributed-memory?

Parallelism:

Instruction-level parallelism, threads, vectors, …

Page 27: A taxonomy

Based on “field-of-origin”:

General-purpose

Intel, AMD

Graphics Processing Units (GPUs)

NVIDIA, ATI

Gaming/Entertainment

Sony/Toshiba/IBM

Embedded systems

Philips/NXP, ARM

Servers

Oracle, IBM, Intel

High Performance Computing

Intel, IBM, …

Page 28: General Purpose Processors

Architecture

Few fat cores

Vectorization (SSE, AVX)

Homogeneous

Stand-alone

Memory

Shared, multi-layered

Per-core cache and shared cache

Programming

Processes (OS Scheduler)

Message passing

Multi-threading

Coarse-grained parallelism

Page 29: Server-side

General-purpose-like, but with more hardware threads

Lower performance per thread, but high throughput

Examples

Sun Niagara II: 8 cores x 8 threads

IBM POWER7: 8 cores x 4 threads

Intel SCC: 48 cores, each able to run its own OS

Page 30: Graphics Processing Units

Architecture

Hundreds/thousands of slim cores

Homogeneous

Accelerator

Memory

Very complex hierarchy

Both shared and per-core

Programming

Off-load model

Many fine-grained symmetrical threads

Hardware scheduler

Page 31: Cell/B.E.

Architecture

Heterogeneous

8 vector-processors (SPEs) + 1 trimmed PowerPC (PPE)

Memory

Per-core memory, network-on-chip

Programming

User-controlled scheduling

6 levels of parallelism, all under user control

Fine- and coarse-grain parallelism

Page 32: Take home message

Variety of platforms

Core types & counts

Memory architecture & sizes

Parallelism layers & types

Scheduling

Open questions:

Why so many?

How many platforms do we need?

Can any application run on any platform?

Page 33: Hardware performance metrics

Page 34: Hardware performance metrics

Clock frequency [GHz] = absolute hardware speed

Memories, CPUs, interconnects

Operational speed [GFLOPs]

Operations per cycle

Memory bandwidth [GB/s]

Differs a lot between the different memories on a chip

Power [Watt]

Derived metrics

FLOP/Byte, FLOP/Watt

Page 35: Theoretical peak performance

Peak = chips * cores * vectorWidth * FLOPs/cycle * clockFrequency

Examples from DAS-4:

Intel Core i7 CPU:
2 chips * 4 cores * 4-way vectors * 2 FLOPs/cycle * 2.4 GHz = 154 GFLOPs

NVIDIA GTX 580 GPU:
1 chip * 16 SMs * 32 cores * 2 FLOPs/cycle * 1.544 GHz = 1581 GFLOPs

ATI HD 6970:
1 chip * 24 SIMD engines * 16 cores * 4-way vectors * 2 FLOPs/cycle * 0.880 GHz = 2703 GFLOPs

Page 36: DRAM Memory bandwidth

Throughput = memory bus frequency * bits per cycle * bus width

Memory clock != CPU clock!

The result is in bits/s; divide by 8 for GB/s

Examples:

Intel Core i7 DDR3: 1.333 * 2 * 64 = 21 GB/s

NVIDIA GTX 580 GDDR5: 1.002 * 4 * 384 = 192 GB/s

ATI HD 6970 GDDR5: 1.375 * 4 * 256 = 176 GB/s

Page 37: Memory bandwidths

On-chip memory can be orders of magnitude faster

Registers, shared memory, caches, …

E.g., AMD HD 7970 L1 cache achieves 2 TB/s

Other memories: depends on the interconnect

Intel’s technology: QPI (Quick Path Interconnect)

25.6 GB/s

AMD’s technology: HT3 (Hyper Transport 3)

19.2 GB/s

Accelerators: PCI-e 2.0

8 GB/s

Page 38: Power

Chip manufacturers specify Thermal Design Power (TDP)

We can measure dissipated power

Whole system

Typically (much) lower than TDP

Power efficiency

FLOPS / Watt

Examples (with theoretical peak and TDP)

Intel Core i7: 154 / 160 = 1.0 GFLOPs/W

NVIDIA GTX 580: 1581 / 244 = 6.3 GFLOPs/W

ATI HD 6970: 2703 / 250 = 10.8 GFLOPs/W

Page 39: Summary

                 Cores  Threads/ALUs  GFLOPS  Bandwidth (GB/s)  FLOPs/Byte
Sun Niagara 2        8            64    11.2                76         0.1
IBM BG/P             4             8    13.6              13.6         1.0
IBM POWER7           8            32     265                68         3.9
Intel Core i7        4            16      85              25.6         3.3
AMD Barcelona        4             8      37              21.4         1.7
AMD Istanbul         6             6    62.4              25.6         2.4
AMD Magny-Cours     12            12     125              25.6         4.9
Cell/B.E.            8             8     205              25.6         8.0
NVIDIA GTX 580      16           512    1581               192         8.2
NVIDIA GTX 680       8          1536    3090               192        16.1
AMD HD 6970        384          1536    2703               176        15.4
AMD HD 7970         32          2048    3789               264        14.4

Page 40: Absolute hardware performance

Peak performance is only achieved under optimal conditions:

Processing units 100% used

All parallelism 100% exploited

All data transfers at maximum bandwidth

No real application behaves like this

It is difficult to reach peak even with micro-benchmarks

Page 41: Performance analysis: Operational Intensity and the Roofline model

Page 42: Software performance metrics (3 P’s)

Performance

Execution time

Speed-up vs. best available sequential application

Achieved GFLOPs

Computational efficiency

Achieved GB/s

Memory efficiency

Productivity and Portability

Programmability

Production costs

Maintenance costs

Page 43: Arithmetic intensity

The number of arithmetic (floating-point) operations per byte of memory that is accessed

Is the program compute-intensive or data-intensive on a particular architecture?

Ignore “overheads”

Loop counters

Array index calculations

Etc.

Page 44: RGB to gray

for (int y = 0; y < height; y++) {
  for (int x = 0; x < width; x++) {
    Pixel pixel = RGB[y][x];
    gray[y][x] = 0.30 * pixel.R
               + 0.59 * pixel.G
               + 0.11 * pixel.B;
  }
}

Page 45: RGB to gray

for (int y = 0; y < height; y++) {
  for (int x = 0; x < width; x++) {
    Pixel pixel = RGB[y][x];
    gray[y][x] = 0.30 * pixel.R
               + 0.59 * pixel.G
               + 0.11 * pixel.B;
  }
}

2 additions, 3 multiplies = 5 operations

3 reads, 1 write = 4 memory accesses

AI = 5/4 = 1.25

Page 46: Compute or memory intensive?

[Figure: bar chart of the FLOPs/Byte balance of each platform (Sun Niagara 2, IBM BG/P, IBM POWER7, Intel Core i7, AMD Barcelona, AMD Istanbul, AMD Magny-Cours, Cell/B.E., NVIDIA GTX 580, NVIDIA GTX 680, AMD HD 6970) compared against the arithmetic intensity of RGB to Gray]

Page 47: Applications AI

[Figure: arithmetic intensity spectrum of applications. O(1): SpMV, BLAS 1/2, stencils (PDEs), lattice methods. O(log(N)): FFTs. O(N): dense linear algebra (BLAS3), particle methods]

Page 48: Operational intensity

The number of operations per byte of DRAM traffic

Difference with Arithmetic Intensity

Operations, not just arithmetic

Caches

“After they have been filtered by the cache hierarchy”

Not between processor and cache

But between cache and DRAM memory

Page 49: Attainable performance

Attainable GFLOPS/sec = min(peak floating-point performance,
                            peak memory bandwidth * operational intensity)

Page 50: The Roofline model

AMD Opteron X2 (two cores): 17.6 GFLOPS, 15 GB/s, ops/byte = 1.17

Page 51: Roofline: comparing architectures

AMD Opteron X2: 17.6 GFLOPS, 15 GB/s, ops/byte = 1.17
AMD Opteron X4: 73.6 GFLOPS, 15 GB/s, ops/byte = 4.9

Page 52: Roofline: computational ceilings

AMD Opteron X2 (two cores): 17.6 GFLOPS, 15 GB/s, ops/byte = 1.17

Page 53: Roofline: bandwidth ceilings

AMD Opteron X2 (two cores): 17.6 GFLOPS, 15 GB/s, ops/byte = 1.17

Page 54: Roofline: optimization regions

Page 55: Use the Roofline model

Determine what to do first to gain performance

Increase memory streaming rate

Apply in-core optimizations

Increase arithmetic intensity

Reader

Samuel Williams, Andrew Waterman, David Patterson, “Roofline: an insightful visual performance model for multicore architectures”, Communications of the ACM 52(4), 2009