Some submissions, most doing very well about gaining...


Page 1: Some submissions, most doing very well about gaining ...cas.ee.ic.ac.uk/people/dt10/.../2014/hpce/hpce-lec4... · Coursework • Coursework 1 almost due –Some submissions, most

Coursework

• Coursework 1 almost due

– Some submissions, most doing very well

– Don’t try too hard – about gaining skills not competing

– Remember that the specification is live

• Coursework 2 is out

– Wasn’t visible for two days..

Page 2

[Figure: im passes through a chain of four erode(., 1) stages]

Consider inter-function dependencies

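function [im] = erode(im, n)
    % One erosion step: each interior pixel becomes the minimum of
    % itself and its four neighbours.
    tmp = im(2:end-1,2:end-1);
    tmp = min(tmp,im(2:end-1,1:end-2));
    tmp = min(tmp,im(1:end-2,2:end-1));
    tmp = min(tmp,im(3:end, 2:end-1));
    tmp = min(tmp,im(2:end-1,3:end));
    im(2:end-1,2:end-1)=tmp;
    if n>1
        % Arguments are passed by value, so the result must be assigned
        im = erode(im, n-1);
    end
end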

[Figure: the chain of four erode(., 1) stages, grouped either as im=erode(im,4) or as im=erode(im,2) followed by im=erode(im,2)]

Page 3

Adjust solution space for the problem

• Video often has interesting performance requirements

– Time to process any one frame is usually irrelevant

– Main performance metric is usually frames/second

– Latency is not important in many situations – buffer freely

• Need to determine application performance metrics

– Latency: time from start to end of processing

– Throughput: average frames per second

– Jitter: difference between the desired and actual time a frame is shown

– Dropped frames: tolerance for frames which don’t make it

– Distortion: acceptable pixel-level errors within each frame

• If we are allowed some latency, pipelining is possible
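
To make the latency/throughput trade-off concrete, here is a rough back-of-envelope sketch. The stage count and per-stage time are made-up numbers, and it assumes an idealised pipeline: equal-cost stages, one task per stage, and no synchronisation overhead.

```matlab
% Idealised model: all values below are assumptions for illustration.
n_stages = 4;        % e.g. four erode(., 1) stages
t_stage  = 0.010;    % assumed seconds per stage per frame

% Sequential: one frame passes through every stage before the next starts.
latency_seq    = n_stages * t_stage;   % seconds per frame
throughput_seq = 1 / latency_seq;      % frames per second

% Pipelined: every stage works on a different frame in each round.
latency_pipe    = n_stages * t_stage;  % a frame still needs n_stages rounds
throughput_pipe = 1 / t_stage;         % one frame completes per round

fprintf('sequential: %.0f fps, pipelined: %.0f fps, latency %.3f s\n', ...
        throughput_seq, throughput_pipe, latency_pipe);
```

Under these assumptions the latency of a single frame does not improve at all; only the frame rate does, which is why the approach only suits latency-tolerant applications.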

Page 4

Building a pipeline in matlab

• Disclaimer: this is not how you should use matlab; it is to make you think about morphisms between parallel constructs. Later in TBB we’ll do pipelining properly.

• If we want to build a pipeline, we need:

– Combinatorial logic: data transformation

– Registers: data storage

– A clock: synchronisation point

• Some natural analogues in video/matlab

– Clock: frame display loop

– Combinatorial logic: lambda functions

– Registers: variables
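
The analogy can be sketched sequentially, before any parallelism is introduced. This is only an illustration: the names frames, register and shown are hypothetical, and a plain for loop stands in for the frame display loop.

```matlab
f1 = @(x) x + 1;      % combinatorial logic: second stage
f2 = @(y) y * 2;      % combinatorial logic: first stage

frames   = [5 100 7]; % one input per clock tick (made-up values)
register = [];        % register between the two stages, initially empty
shown    = cell(1, numel(frames));

for t = 1:numel(frames)          % the clock: one iteration per tick
    % Both stages read only state left over from the previous tick,
    % so they do not depend on each other within a tick.
    if isempty(register)
        out = [];                % pipeline not yet full
    else
        out = f1(register);
    end
    register = f2(frames(t));    % latch the new intermediate value
    shown{t} = out;
end
```

The first tick produces nothing because the register is still empty; after that, each tick emits f1(f2(.)) of the input from the previous tick.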

[Figure: four erode(., 1) stages with im values stored between them]

Some problems here...

Page 5

Reminder: Functions and closures

• When you use @(x)(...) you are creating a closure

– The function inherits the environment it was defined in

• Environment: set of variable names and values in existence

– Variables referenced (named) in function are captured by value

• The function closes over the environment / values it was born with

• Variables can never be given new values within the closure

• Quite different to C++ lambdas, which we’ll use in TBB

– C++ anonymous functions can capture by value or by reference

– Can modify/update variables by reference

– Modifications may be visible to others
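
A minimal sketch of capture by value in matlab; the variable names are arbitrary.

```matlab
a = 2;
f = @(x) a * x;   % the closure captures the current value of a (2)

a = 10;           % rebinding a afterwards does not affect f
y = f(3);         % still uses the captured a == 2

disp(y);
```

The closure keeps using the value of a from the moment it was created, so the later assignment is invisible to it.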

Page 6

Another brief matlab aside : cell arrays

• Matlab mainly works with matrices

– Grids of elements with one or more dimensions

– Each element must have the same type

• Can be doubles, uint8s, chars, booleans, ...

• Can also be arbitrary structures

• Cell arrays allow more flexibility with contents

– Grids of elements with one or more dimensions

– Each element can have a different type

– Get contents using curly {} brackets rather than smooth () brackets

>> x={}; x{1}=5; x{2}='wibble'; x{3}=x;
>> x
x =
    [5]    'wibble'    {1x2 cell}
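
A short sketch of the difference between the two bracket types, using a hypothetical two-element cell array:

```matlab
x = {5, 'wibble'};

c = x(1);    % smooth brackets: a 1x1 cell array still wrapping the 5
v = x{1};    % curly brackets: the contents themselves, a double

disp(class(c));   % cell
disp(class(v));   % double
```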

Page 7

A poor pipeline attempt

• How can we maintain state (in C, matlab, ...)?

function [ out ] = pipeline2( f1, f2, in )
%PIPELINE2 Apply f1(f2(in)) over two separate calls
%
% >> pipeline2( @(x)(x+1), @(y)(y*2), 5 )
% ans = []
%
% >> pipeline2( @(x)(x+1), @(y)(y*2), 100 )
% ans = 11
%
% >> pipeline2( @(x)(x+1), @(y)(y*2), 7 )
% ans = 201

Page 8

A poor pipeline attempt

function [ out ] = pipeline2_bad( f1, f2, in )
    global register;
    buffer={register, in};
    parfor i=1:2
        if ~isempty(buffer{i})
            if i==1
                buffer{i} = f1( buffer{i} );
            else
                buffer{i} = f2( buffer{i} );
            end
        end
    end
    out=buffer{1};
    register=buffer{2};
end

Page 9

Global state is bad

• Bad enough in sequential code

– Caller of function needs to understand semantics

– Reduces opportunity for optimisations by the compiler

• Terrible in parallel code

– Imagine two parts of the program want to use pipeline2

– Very difficult to understand who will access the global register

– In other languages, potential for “torn write”

– Would have to protect global register with a mutex...

• General modern approach: minimise global state

– Eliminate as much mutual exclusion as possible

– Make state local, and obvious

– Wherever possible use pure functions, with no state

Page 10

A second attempt

function [out,register] = pipeline2(f1,f2,register,in)
    functions={f1,f2};
    inputs={register, in};
    outputs=cell(2,1);
    parfor i=1:2
        f=functions{i};
        outputs{i}=f(inputs{i});
    end
    out=outputs{1};
    register=outputs{2};
end

Page 11

Extend to the n-ary case

function [out,registers] = pipeline_par(funcs,registers,in )
    if ~iscell(registers)
        registers=cell(1, length(funcs)-1);
    end
    registers{end+1} = in;          % Push input onto the back
    parfor i=1:length(registers)
        f=funcs{i};
        if ~isempty(registers{i})
            registers{i} = f( registers{i} );
        end
    end
    out=registers{1};               % Pop output off the front
    registers=registers(2:end);
end
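
To see the push/pop behaviour over several rounds, the same logic can be traced in a self-contained script. This is a sketch only: a plain for loop replaces parfor so the trace is deterministic, and the three stage functions and input values are made up.

```matlab
% funcs{1} is the last stage applied, funcs{end} the first, as above.
funcs     = {@(x) x + 1, @(x) x * 2, @(x) x - 3};
registers = cell(1, numel(funcs) - 1);   % empty pipeline
inputs    = [5 100 7 0];
results   = cell(1, numel(inputs));

for t = 1:numel(inputs)
    registers{end+1} = inputs(t);        % push input onto the back
    for i = 1:numel(registers)           % parfor in the slide version
        if ~isempty(registers{i})
            registers{i} = funcs{i}(registers{i});
        end
    end
    results{t} = registers{1};           % pop output off the front
    registers  = registers(2:end);
end
```

With three stages, the first two rounds return empty outputs while the pipeline fills; the third round returns ((5 - 3) * 2) + 1 = 5, the result for the first input.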

Page 12

Pipeline parallelism

• Problem: want to calculate yi=f1(f2(...(fn(xi))...)), i=1,2,...

• Goal: high throughput only, maximise outputs / sec

• Solution:

– Multiple tasks each handling one function in parallel

– Synchronise all tasks at the end of each round

• Requirements:

– f1..fn are side-effect free, so can safely call them in parallel

– Application is latency tolerant

– Intermediate memory usage is not a problem

Page 13

Data parallelism

• Problem: want to calculate vector yi=f(xi), 1≤i≤n

• Goal: low latency, minimise total execution time

• Solution:

– Multiple tasks each handling one piece of data in parallel

– Synchronise all tasks at the end of each round

• Requirements:

– f is side-effect free, so can safely call it in parallel
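
A minimal sketch of the pattern in matlab, assuming an arbitrary side-effect-free f. If no parallel pool is available, parfor still runs, just without parallel speed-up.

```matlab
f = @(x) x.^2 + 1;      % hypothetical pure function
x = 1:8;                % input vector
y = zeros(size(x));     % output vector, one slot per task

parfor i = 1:numel(x)
    y(i) = f(x(i));     % iterations are independent of each other
end
% Leaving the parfor is the synchronisation point: all y(i) are ready here.
```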

Page 14

Our first two design patterns

• Data-parallelism : apply one function to lots of data

– Simple but powerful form of parallelism, natural in HW and SW

– Widely applicable, many applications allow SIMD operation

– Can be used to build some other types of parallelism

• Pipeline parallelism : apply lots of functions to one datum

– Often well supported in hardware; less so in software

– Fewer applications, as must be able to tolerate latency

– Difficult to use as a primitive for other forms of parallelism

• Both are restricted in scope

– Must know amount of data / number of functions at start-up

– Very simple dependency model based on barriers

– Future design patterns: relax these restrictions

Page 15

Control dependency view : data par.

[Figure: each parfor launches y1=f(x1), y2=f(x2), ..., yn-1=f(xn-1), yn=f(xn) as parallel tasks; a synchronisation barrier closes each parfor before the next one starts]

Page 16

Control dependency view: simple pipeline

[Figure: each parfor launches x’=f1(x), x’=f2(x), ..., x’=fn-1(x), x’=fn(x) as parallel tasks; a synchronisation barrier closes each parfor before the next one starts]

Page 17

Practical pipelining

• Matlab is not really very good at pipeline parallelism

– To be fair, not what it is designed for

– It can be used to describe pipelines well, using Simulink

• Many libraries and approaches support it well

– Unix Pipes: one of the simplest general purpose tools

– Threading Building Blocks: allows complex pipelines

– OpenCL 2.0: built-in support for FIFOs between kernels

• Too new for us to look at

• Designed to allow hardware-level pipeline parallelism

– Lots of video and audio-processing streaming APIs

Page 18

Unix pipes as pipeline parallelism

Mike Gancarz: “The UNIX Philosophy”:

1. Small is beautiful.

2. Make each program do one thing well.

3. Build a prototype as soon as possible.

4. Choose portability over efficiency.

5. Store data in flat text files.

6. Use software leverage to your advantage.

7. Use shell scripts to increase leverage and portability.

8. Avoid captive user interfaces.

9. Make every program a filter.

CW2

There are multiple re-statements, but this one I prefer.

Page 19

Anatomy of a filter program

• Most OSes and languages have standard streams

– stdin : Input text or binary data being passed to the program

– stdout : Output text or binary data being produced by program

– stderr : Diagnostic information produced during execution

• Streams are initialised when program starts

– Arguments are passed to main by shell or OS

– Standard streams are automatically connected, e.g. to keyboard/display

– Program has to deal with the extra arguments, may open files, ...

[Figure: a Filter box with stdin, stdout and stderr streams, plus the arguments “flag” and file1.txt]

Page 20

Stream re-routing

[Figure: the shell connected to the keyboard and the display, with stdout and stderr going to the display]

Page 21

Stream re-routing

[Figure: the shell (keyboard, display) running cat with the argument index.html; cat has stdin, stdout and stderr streams]

cat index.html

Page 22

Stream re-routing

[Figure: cat “index.html” feeding the stdin of grep “html”; stdout and stderr reach the display through the shell]

cat index.html | grep html

Page 23

Stream re-routing

[Figure: grep “html” on its own, with stdin, stdout and stderr connected through the shell to the keyboard and display]

grep html

Page 24

Stream re-routing

[Figure: curl http://slashdot.org, grep “id” and sort chained stdout to stdin; the final stdout and the stderr streams reach the display through the shell]

curl http://slashdot.org | grep id | sort

Page 25

Advantages of streaming data

• Intermediate data never has to touch disk

– IO is expensive; we are often limited by disk bandwidth

– When processing terabytes of data there is not enough space

• For “Big-Data” can often only store compressed version

• High performance computing is increasingly data-limited

• Parallel processing comes for free

– Each stage in the pipeline is its own parallel process

– OS will block processes when they are waiting for data

• Synchronisation is local, rather than global for pipeline

– Block when there is not enough data on stdin

– Block when there is not enough buffer space on stdout

– Apart from that: process away!

Page 26

Disadvantages?