Heterogeneous Parallelism at...

Herb Sutter

Transcript of Heterogeneous Parallelism at...

Page 1: Heterogeneous Parallelism at Microsoft (developer.amd.com/wordpress/media/2012/10/4-Sutter..., 2012/10/04)

Herb Sutter

Page 2

1975-2005

Put a computer on every desk, in every home, in every pocket.

2005-2011

Put a parallel supercomputer on every desk, in every home, in every pocket.

2011-201x

Put a heterogeneous supercomputer on every desk, in every home, in every pocket.

Welcome to the jungle

The free lunch is so over

Page 5

[Chart axes: Processors, Memory]

Page 6

[Chart axes: Processors, Memory]

Xbox 360 & mainstream computer

AMD 80x86

Phenom II Athlon

Fusion APU

AMD GPU

Other GPU

Page 7

[Chart axes: Processors, Memory]

AMD 80x86

Phenom II Athlon

Fusion APU

AMD GPU

Cloud + GPU

Microsoft Azure Cloud Computing

Other GPU

Xbox 360 & mainstream computer

Page 8

[Chart axes: Processors, Memory]

Page 9

[Chart axes: Processors, Memory]

(GP)GPU

Multicore CPU

Cloud IaaS/HaaS

ISO C++0x

ISO C++

Page 11

[Chart axes: Processors, Memory]

(GP)GPU

Multicore CPU

Cloud IaaS/HaaS

C++ PPL

ISO C++0x

Page 12

[Chart axes: Processors, Memory]

(GP)GPU

Multicore CPU

Cloud IaaS/HaaS

C++ PPL

ISO C++0x

DirectCompute

?

Page 13

[Chart axes: Processors, Memory]

(GP)GPU

Multicore CPU

Cloud IaaS/HaaS

C++ PPL

ISO C++0x

DirectCompute

C++ AMP (Accelerated Massive Parallelism)

Page 14

void MatrixMult( float* C, const vector<float>& A, const vector<float>& B,
                 int M, int N, int W ) {
    for (int y = 0; y < M; y++)
        for (int x = 0; x < N; x++) {
            float sum = 0;
            for (int i = 0; i < W; i++)
                sum += A[y*W + i] * B[i*N + x];
            C[y*N + x] = sum;
        }
}

Convert this (serial loop nest)

Page 15

… to this (parallel loop, CPU or GPU)

void MatrixMult( float* C, const vector<float>& A, const vector<float>& B,
                 int M, int N, int W ) {
    array_view<const float,2> a(M,W,A), b(W,N,B);
    array_view<writeonly<float>,2> c(M,N,C);

    parallel_for_each( c.grid, [=](index<2> idx) restrict(direct3d) {
        float sum = 0;
        for (int i = 0; i < a.x; i++)
            sum += a(idx.y, i) * b(i, idx.x);
        c[idx] = sum;
    } );
}
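A minimal sketch in standard C++ of what the conversion above does structurally: the two outer loops become an index space, and the inner dot product becomes a body invoked once per index. Index2, for_each_index, and MatrixMultByIndex are hypothetical names for illustration, not part of C++ AMP, and the driver here runs serially. It only shows that each invocation writes its own output cell, which is what lets a runtime schedule the invocations on any core.

```cpp
#include <vector>

// Hypothetical stand-in for a 2D index; not the C++ AMP index<2> type.
struct Index2 {
    int y;
    int x;
};

// Hypothetical serial driver: calls `body` once per (y, x) in an M x N space.
// A parallel runtime would be free to run these calls in any order.
template <typename Body>
void for_each_index(int M, int N, Body body) {
    for (int y = 0; y < M; ++y)
        for (int x = 0; x < N; ++x)
            body(Index2{y, x});
}

// Same arithmetic as the serial MatrixMult, restructured so the per-element
// work is a self-contained body that writes only its own output cell.
void MatrixMultByIndex(float* C, const std::vector<float>& A,
                       const std::vector<float>& B, int M, int N, int W) {
    for_each_index(M, N, [=, &A, &B](Index2 idx) {
        float sum = 0;
        for (int i = 0; i < W; ++i)
            sum += A[idx.y * W + i] * B[i * N + idx.x];
        C[idx.y * N + idx.x] = sum;
    });
}
```

Calling MatrixMultByIndex(C, A, B, M, N, W) takes the same arguments as the serial version on the slide.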

Page 17

17 | The Programmer’s Guide to the APU Galaxy | June 2011

EVOLUTION OF HETEROGENEOUS COMPUTING

[Chart: vertical axis “Architecture Maturity & Programmer Accessibility,” from Poor to Excellent; three eras along the time axis]

Proprietary Drivers Era (2002 - 2008)

Graphics & Proprietary Driver-based APIs

“Adventurous” programmers

Exploit early programmable “shader cores” in the GPU

Make your program look like “graphics” to the GPU

CUDA™, Brook+, etc

Standards Drivers Era (2009 - 2011)

OpenCL™, DirectCompute Driver-based APIs

Expert programmers

C and C++ subsets

Compute centric APIs, data types

Multiple address spaces with explicit data movement

Specialized work queue based structures

Kernel mode dispatch

Architected Era (2012 - 2020)

Fusion™ System Architecture, GPU Peer Processor

Mainstream programmers

Full C++

GPU as a co-processor

Unified coherent address space

Task parallel runtimes

Nested Data Parallel programs

User mode dispatch

Pre-emption and context switching

Page 18

[Chart axes: Processors, Memory]

Page 19

PPL: Parallel Patterns Library (VS2010)

ISO C++0x

Single-core to multi-core

?

Page 20

PPL: Parallel Patterns Library (VS2010)

ISO C++0x

Single-core to multi-core

forall( x, y )

forall( z; w; v )

forall( k, l, m, n )

. . . ?

Page 21

λ

Single-core to multi-core

PPL: Parallel Patterns Library (VS2010)

ISO C++0x

parallel_for_each( items.begin(), items.end(), [=]( Item e ) {
    … your code here …
} );
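A rough sketch of the idea behind an iterator-range parallel loop driven by a lambda, written in portable C++ with std::thread. parallel_for_each_sketch is a hypothetical helper, not the PPL implementation: it assumes random-access iterators, uses a fixed thread count, and does no load balancing or exception handling. It also assumes the body touches only the element it is given.

```cpp
#include <algorithm>
#include <iterator>
#include <thread>
#include <vector>

// Hypothetical helper (not PPL): splits [first, last) into contiguous chunks
// and applies `body` to every element, using one thread per chunk.
template <typename RandomIt, typename Body>
void parallel_for_each_sketch(RandomIt first, RandomIt last, Body body,
                              unsigned num_threads = 4) {
    using diff_t = typename std::iterator_traits<RandomIt>::difference_type;
    const diff_t n = last - first;
    if (n <= 0) return;

    // Never start more threads than there are elements.
    const diff_t threads =
        std::max<diff_t>(1, std::min<diff_t>(static_cast<diff_t>(num_threads), n));
    const diff_t chunk = (n + threads - 1) / threads;  // ceiling division

    std::vector<std::thread> workers;
    for (diff_t begin = 0; begin < n; begin += chunk) {
        const diff_t end = std::min<diff_t>(begin + chunk, n);
        workers.emplace_back([=] {
            std::for_each(first + begin, first + end, body);
        });
    }
    // Wait for every chunk before returning, like a blocking parallel loop.
    for (std::thread& t : workers) t.join();
}
```

Usage mirrors the call on the slide: parallel_for_each_sketch(items.begin(), items.end(), [](int& e) { e *= 2; });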

Page 22

1 language feature for multicore

and STL, functors, callbacks, events, ...

Page 23

Multi-core to hetero-core

C++ AMP: Accelerated Massive Parallelism

ISO C++0x

?

Page 24

restrict

Multi-core to hetero-core

C++ AMP: Accelerated Massive Parallelism

ISO C++0x

parallel_for_each( items.grid, [=](index<2> i) restrict(direct3d) {
    … your code here …
} );

Page 25

1 language feature for heterogeneous cores

Page 26

[Chart axes: Processors, Memory]

Page 27

Problem: Some cores don’t support the entire C++ language.

Solution: General restriction qualifiers enable expressing language subsets within the language. Direct3D math functions in the box.

Example

double sin( double );                     // 1a: general code
double sin( double ) restrict(direct3d);  // 1b: specific code

double cos( double ) restrict(direct3d);  // 2: same code for either

parallel_for_each( c.grid, [=](index<2> idx) restrict(direct3d) {
    …
    sin( data.angle );  // ok, chooses overload based on context
    cos( data.angle );  // ok
    …
});
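Standard C++ has no restrict(...) qualifier on functions, so the following is only an analogy for "chooses overload based on context," using ordinary tag dispatch. The tags cpu_context and accel_context and the functions which_sin, which_cos, and kernel are all hypothetical names. The sketch does not enforce any language subset, which is the part a real restriction qualifier adds.

```cpp
#include <string>

// Hypothetical tags standing in for the two restriction contexts.
struct cpu_context {};
struct accel_context {};

// 1a / 1b: two overloads of the same function, one per context.
inline std::string which_sin(cpu_context)   { return "general sin"; }
inline std::string which_sin(accel_context) { return "restricted sin"; }

// 2: a single definition that either context can call.
template <typename Context>
std::string which_cos(Context) { return "shared cos"; }

// A body written once; the context it is instantiated with selects overloads.
template <typename Context>
std::string kernel(Context ctx) {
    return which_sin(ctx) + ", " + which_cos(ctx);
}
```

Here the context is passed explicitly as an argument; in the slide's design the compiler infers it from the restrict(...) on the enclosing lambda.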

Page 28

Initially supported restriction qualifiers:

restrict(cpu): The implicit default.

restrict(direct3d): Can execute on any DX11 device via DirectCompute. Restrictions follow the limitations of the DX11 device model (e.g., no function pointers, virtual calls, goto).

Potential future directions:

restrict(pure): Declare and enforce that a function has no side effects. Great to be able to state declaratively for parallelism.

General facility for language subsets, not just about compute targets.

Page 29

Problem: Memory may be flat, nonuniform, incoherent, and/or disjoint.

Solution: Portable view that works like an N-dimensional “iterator range.” Future-proof: no explicit .copy()/.sync(); data is copied or synchronized only as needed by each actual device.

Example

void MatrixMult( float* C, const vector<float>& A, const vector<float>& B,
                 int M, int N, int W ) {
    array_view<const float,2> a(M,W,A), b(W,N,B);   // 2D view over C++ std::vector
    array_view<writeonly<float>,2> c(M,N,C);        // 2D view over C array

    parallel_for_each( c.grid, [=](index<2> idx) restrict(direct3d) {
        …
    } );
}
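To make the "N-dimensional iterator range" idea concrete, here is a minimal sketch of a non-owning 2D view over flat row-major storage in standard C++. view2d and MatrixMultViews are hypothetical names, not the C++ AMP array_view. Unlike array_view, this sketch never copies or synchronizes data between devices; it only shows how a view gives a std::vector and a raw C array the same 2D indexing interface.

```cpp
#include <vector>

// Hypothetical non-owning 2D view over contiguous row-major storage.
// The caller must keep the underlying storage alive and large enough.
template <typename T>
class view2d {
public:
    view2d(int rows, int cols, T* data)
        : rows_(rows), cols_(cols), data_(data) {}

    int rows() const { return rows_; }
    int cols() const { return cols_; }

    // Element access by (row, column); no bounds checking in this sketch.
    T& operator()(int y, int x) const { return data_[y * cols_ + x]; }

private:
    int rows_;
    int cols_;
    T* data_;
};

// Serial matrix multiply written against views instead of raw index math.
void MatrixMultViews(float* C, const std::vector<float>& A,
                     const std::vector<float>& B, int M, int N, int W) {
    view2d<const float> a(M, W, A.data());  // view over a std::vector
    view2d<const float> b(W, N, B.data());  // view over a std::vector
    view2d<float> c(M, N, C);               // view over a C array

    for (int y = 0; y < c.rows(); ++y)
        for (int x = 0; x < c.cols(); ++x) {
            float sum = 0;
            for (int i = 0; i < a.cols(); ++i)
                sum += a(y, i) * b(i, x);
            c(y, x) = sum;
        }
}
```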


Page 31

Bring CPU debugging experience to the GPU


Page 36

# cores, not counting SIMD

OoO CPU

InO CPU

GPU

Cloud OoO

Cloud GPU

Page 37

# cores, not counting SIMD

OoO CPU

InO CPU

GPU

Cloud OoO

Cloud GPU

Welcome to the jungle

The free lunch is so over

Page 38

[Chart axes: Processors, Memory]

Page 39

Herb Sutter

C++ PPL: 9:45am C++ AMP: 2:00pm, Room 406