Write clean (parallel) code!

Write clean (parallel) code!

Bjarne StroustrupTexas A&M Universityhttp://www.research.att.com/~bs

2

Assumptions• You know more about architecture and optimizing parallel code

than I do• A keynote articulates ideals

– Supported by reason– Questions are usually more important than answers– (Must not show code)– (Must have fancy graphics)

• We want to see parallel and distributed programs in mainstream use– Very soon (1 year, 2 years, 5 years)– We have had about 50 years of research, let’s use it

3

Take pity on us

• You are here

• We’ll have 10,000,000 programmers• Just 50% of all programmers are above average

4

Clean code• Is easy to read and write

– Provided you know the application domain– Maybe the application does require a Ph.D.

• But it has better not be a Ph.D. in Computer Science

• Is easy to debug– For some definition of “easy”

• Is easy to maintain• Is easy to port• Runs efficiently• Does not contain performance viruses

5

A performance virus• Is a piece of code that runs well, but after a small

change or a port runs abysmally, e.g.– An O(n*n) algorithm for a small n in a test run– A program with a hardwired constant reflecting a critical

cache size– A program written assuming a shared memory naively

ported to a cluster (or vise versa)– A program assuming exactly 6 processors– …

• A non-portable program is a performance virus as long as hardware performance keep improving

6

New languages• Maybe clean parallel code requires a new language

– DARPA thinks so (?): Fortress, X10, ???– Academics always think so– But maybe not, a new language

• typically dies without helping the user community solve problems– When it’s designer graduates, gets tenure, or is promoted– It’s company gets a new CEO

• consumes resources which could have spent elsewhere– Design, implementation, learning to use

• won’t be useful until about 5 years (or more) after the project starts• aimed at parallelism is unlikely to be competitive with existing

languages for non-parallel code– For years– And non-parallel code will be most code for a (long) while

7

What is simple enough?• Perfect automatic parallelization of arbitrary applications

– Sure, but impossible• “I want this to be parallel, but spare me the details”

– What, not how– Leave “ordinary code” untouched

• Provide “ordinary users” with parallel components– Into which they can plug their own code– That they can drop into their existing applications– Try to verify that the user code doesn’t mess with shared data

• “Threads and locks” programming is evil– Breeds complexity – Too complex for “ordinary programmers”– Required in real-world code– We need all the (language and tool) help we can get

8

Is there enough parallelism?• Traditional data parallel code has been scientific and numeric• Most code is neither• We must expand the application areas for parallel techniques

– Data mining/analysis• Finance• Advertising

– Graphics– Everyday tasks in parallel

• Compilation• Typesetting• Regression testing

• Each application area for parallel techniques cannot have its own language, tools, and techniques– Currently, it seems that every field is (re?) inventing the wheel(s)

• We must integrate data parallel techniques in languages and general-purpose tool chains

9

Is one kind of parallelism sufficient?

• Of course not• From what level should we build support for new

kinds/models of parallelism?

Existing high-level specialized model (can’t)

General high-level platform with facilities for specializationC

hardware

Brand new model/language

Embedd in HLL for generality

10

What would I like to see?• Nah, sure, if you must, someone has to do it

– 10% improved cache performance?– 1% extra performance on each of 256 processors– Using 1,000,000 processors for heat transfer computations– Real-time rendering of blood, gore, and porn

• Yes!– “two simultaneous troffs”– Compiling 30 source files simultaneously on “an ordinary PC”– Running 20,000 regression tests 1,000-way parallel– Indexing the web in a couple of hours– 10* speed up in Photoshop real-time rendering

• That is:– By all means research the most advanced cases,– But don’t forget what’s soon to be “the low end”

11

Keep Simple Things Simple

• Not KISS– It’s not stupid – it’s really hard– Only simple things can be simple

• That’s no excuse for making everything complicated– Defaults!– Tools– Design for usability

– Apply recursively• Everything should be no more complicated than necessary

12

Two examples• STAPL

– TAMU Standard Template Adaptive Parallel Library – Main designer: Lawrence Rauchwerger– http://parasol.tamu.edu/groups/rwergergroup/research/stapl/

• TBB– Intel Threading Building Blocks– Main designer: Arch Robinson– http://osstbb.intel.com/

• Both C++ and STL inspired– Neither perfect (so far)

13

TBB• Main designer

– Arch Robison• Builds on the work of

many friends– Alex Stepanov– Lawrence Rauchwerger– …

• Builds on principles I strongly support– Library building– Composition of code– Lightweight concurrency

14

TBB quotes• Stepanov

– Building libraries is an important task– Non-intrusive, coexist, orthogonal, not hide useful

information, efficient• Robison

– Parallel programming is no longer optional– There is no one true way to do parallel programming– Separate logical tasks from physical threads– Strictly a library– Recursive parallelism (based on ranges)– Performance matters

15

TBB• STL inspired (among others)

– Templates – Function objects– Iterators (serial inspired) and ranges (necessary for parallelism)– RAII

• Coexistence with system threads• Task based, not directly thread based

– (relatively) cheap startup and join• (18 to 200 times faster than threads)

• Parallel algorithms• Parallel containers• …

16

Parallel algorithms

• Currenctly supplied:– parallel_for()– parallel_reduce()– parallel_scan()– parallel_while()– parallel_sort()– pipeline()

• Each with ways of controlling grain/split– Default/key: recursive divide-and-concur

17

Parallel algorithms• Rely on function objects

class SumFoo {float* my_a;

public:float sum;

void operator()(const blocked_range<size_t>& r) {float *a = my_a;for (size_t i=r.begin(); i!=r.end(); +=i)

sum += Foo(a[i]);}

SumFoo(SumFoo& x, split) : my_a(x.my.a), sum(0) { }

void join(const SumFoo& y) { sum += y.sum; }};

18

Parallel containers

• “Clones” of the most useful STL containers– concurrent_vector<T>– concurrent_hash_map<T>– concurrent_queue<T>

• (too few – see STAPL)– A consistent set of algorithms or a consistent set of

containers do not increase complexity• Irregularity does

19

Simple enough?• Not yet

– That doesn’t mean that I know how to do it better• To get mainstream use

– Ordinary concepts and algorithms has to be “as simple as the textbook”

– Parallel mechanism has to be “simpler than the textbook”• Elaboration/optimization is then possible• Parameterization with defaults plus tools

• Overuse of (C/Fortran built-in) arrays in interfaces– I’d like an Array<T,N> type

• With all the usual arithmetic operations

20

Simple enough?• Some of it is C++’s fault (my fault)• C++0x will provide

– Tasks– Atomic types– Concepts

• A type system combinations of types and integers– auto (etc.)

• deduce type from initializer (and other notational improvements)– Uniform and more flexible initialization– Move semantics

• Decrease the cost of copying– Lambdas

• Maybe– Attribute syntax

• Dangerous (MPI)

21

Simple enough?

• Some of it is the expert’s fault (your fault)• Experts

– Focus on the hardest problems– Focus on details– Write academic papers (only)– The cult of completeness

22

Complete enough?

• Completeness will come with time– Will it destroy simplicity?– It must not!

• Let’s look at STAPL– A more mature/complete system– Distinction

• Application developer (e.g. physicist)– Parallel/global-address-space/algorithmic/dependency world

• Library developer (computer scientist)– Distributed/threaded world

STAPL: Standard Template Adaptive Parallel Library

STAPL: A library of parallel, generic constructs based on the C++ Standard Template Library (STL)

– Components for Program Development

pAlgorithms, pContainers, Views, pRange

– Portability and OptimizationSTAPL RTS and Adaptive Remote Method Invocation (ARMI) Communication LibraryFramework for Algorithm Selection and Tuning (FAST)

24

STAPL

pvector<double,default_distribution> x(N);// …double d = accumulate(default_view(x),0);

workfunction

view

tasks

dependencies

(fct,view)+

(f1,v1)->(f2,v2)

fct

…

data datadatadata

distribution

partitions

[0:n2)

pRange:View (with sub-views):

pContainer:

container

[0:n1)

container

[0:n3)

container

Scalability of pAlgorithms

• Results obtained on an IBM P3 machine at NERSC• Scalability is relative to 64 processors

Discrete OrdinatesParticle Transport Computation

• Important application for DOE– E.g., Sweep3D and UMT2K

• Large, on-going DOE project at TAMU to develop application in STAPL (TAXI)

One sweep Eight simultaneous sweeps

Related work

NYNYNNYViews

V,H,QAAMV,LVA,V,L,G,M

Data Structures(Array, Vector,List, Graph, Matrix)

YYNNYNYReuse Seq ContainersYNNNNNYFramework for

pContainers

N

N

Y/Y

SPMD/MPMD

Shared/Part

Lang

Titanium

N

N

Y/Y

MPMD

Shared

Lib

TBB

N

Y

Y/N

SPMD

Shared

Lib

POOMA

NNYYAdaptive

YNYY/YData Partition/Mapping

Y/NY/YY/NY/YGeneric Data Type/ Generic Algorithms

SPMDSPMDMPMDSPMD/MPMD

Programming Model

SharedSharedShared/Part

SharedMemory Address Space

Lib

HTA

LibLangLibLanguage/Library

PSTLCharm++STAPLFeatures\Project

28

So, what is “clean”?

• == “not dirty”– Absence of extraneous matter

• Remember Fortran!• Threads, locks, MPI, “annotations” are today’s assembler

– Portability• Of correctness• Of performance

29

So, what’s “clean”

• This isn’tvoid f(double* p, int n){

if (n>GRAIN) {int m = n%GRAIN;//%% parallelfor (int i = 0; i<min(m,N_PROCESSORS); ++i) {

double* pp = p+i*GRAIN;acquire(lock[i]);// …release(lock[i]);

}join();

}

30

So, what’s “clean”?• Explicit locking

– Instead: parallel algorithms, tasks, futures, message queues, parallel algorithms

• Explicit release of resource (e.g. lock)– RTTI

• Explicit calculation based of memory size or number of processors– Instead: supportive runtime

• Explicit tread management– Some notion of task

31

So, what’s “clean”?• But everyone knows that!

– No: “nobody” knows that• 99.9% of programmers have an extraordinarily naïve view of concurrency

(threads and locks, if that)• And they will try to write concurrent code

• I’m sad that I can’t teach the basics of concurrent programming to my freshmen– Worried really

• Example:Array<double,3> m(xm,ym,zm);// …Array<double,2> v(zm) = 0;parellel::for_each(range(0,zm),

<&>(i;v,m) { v[i] = parallel::accumulate(m[i]); }

Almost clean

clean

32

Who cares about “clean code”?

• Anybody– Who maintains code– Who is a beginner– Who is not an academic– Who uses “ordinary commercial software”

33

So

• Devote major efforts to– usability by non-expert programmers

• Including computer scientists– usability by domain experts

• Much of the needed work is design more than academic research– Some is very serious research– Surprise: you can’t completely separate the two

• Please

Write clean (parallel) code!

Documents

Transcript of Write clean (parallel) code!