Coursework
• Coursework 1 almost due
– Some submissions, most doing very well
– Don’t try too hard – it’s about gaining skills, not competing
– Remember that the specification is live
• Coursework 2 is out
– Wasn’t visible for the first two days...
[Diagram: the image im flows through a chain of four erode(., 1) stages, with the intermediate im passed between them.]
Consider inter-function dependencies
function [im] = erode(im, n)
    tmp = im(2:end-1,2:end-1);
    tmp = min(tmp,im(2:end-1,1:end-2));
    tmp = min(tmp,im(1:end-2,2:end-1));
    tmp = min(tmp,im(3:end,  2:end-1));
    tmp = min(tmp,im(2:end-1,3:end));
    im(2:end-1,2:end-1)=tmp;
    if n>1
        im = erode(im, n-1);  % result must be assigned back: Matlab passes by value
    end
end
[Diagram: one call im=erode(im,4) is equivalent to a chain of four erode(., 1) stages, or to two successive calls of im=erode(im,2).]
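The equivalence in the diagram – that one big erosion equals a chain of smaller ones – is easy to check outside Matlab. A minimal Python sketch (plain lists and a hypothetical 1-D `erode1d` helper, not the course code):

```python
def erode1d(xs, n=1):
    """n passes of a 1-D erosion: each interior element becomes the min
    of itself and its two neighbours; the borders are left untouched."""
    for _ in range(n):
        xs = ([xs[0]] +
              [min(xs[i - 1], xs[i], xs[i + 1])
               for i in range(1, len(xs) - 1)] +
              [xs[-1]])
    return xs

im = [5, 3, 8, 1, 9, 2, 7, 6]
a = erode1d(im, 4)                 # one call, four passes
b = erode1d(erode1d(im, 2), 2)     # two calls of two passes
c = im
for _ in range(4):                 # four calls of one pass
    c = erode1d(c)
print(a == b == c)                 # True: erosion passes compose
```

The composition property is what makes the pipeline view legitimate: each erode(., 1) stage depends only on the output of the previous stage.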
Adjust solution space for the problem
• Video often has interesting performance requirements
– Time to process any one frame is usually irrelevant
– Main performance metric is usually frames/second
– Latency is not important in many situations – buffer freely
• Need to determine application performance metrics
– Latency: time from start to end of processing
– Throughput: average frames per second
– Jitter: difference between desired and actual time frame shown
– Dropped frames: tolerance for frames which don’t make it
– Distortion: acceptable pixel-level errors within each frame
• If we are allowed some latency, pipelining is possible
Building a pipeline in matlab
• Disclaimer: this is not how you should use Matlab; it is
to make you think about morphisms between parallel
constructs. Later, in TBB, we’ll do pipelining properly.
• If we want to build a pipeline, we need:
– Combinatorial logic: data transformation
– Registers: data storage
– A clock: synchronisation point
• Some natural analogues in video/matlab
– Clock: frame display loop
– Combinatorial logic: lambda functions
– Registers: variables
[Diagram: a pipeline of four erode(., 1) stages, each holding its own copy of im between clock ticks.]
Some problems here...
Reminder: Functions and closures
• When you use @(x)(...) you are creating a closure
– The function inherits the environment it was defined in
• Environment: set of variable names and values in existence
– Variables referenced (named) in function are captured by value
• The function closes over the environment / values it was born with
• Variables can never be given new values within the closure
• Quite different to C++ lambdas, which we’ll use in TBB
– C++ anonymous functions can capture by value or by reference
– Can modify/update variables by reference
– Modifications may be visible to others
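A concrete contrast may help. Python closures behave like the C++ case: the variable name is looked up at call time, so later rebindings are visible. This sketch shows the difference, and how to emulate Matlab's capture-by-value:

```python
# Python closures capture variables by reference: the name is looked up
# at call time, closer to a C++ [&] lambda than to Matlab's @(x)(...),
# which snapshots values when the closure is created.
k = 10
f = lambda x: x + k    # Matlab's @(x)(x+k) would freeze k at 10 here
k = 100                # rebinding k is visible inside f...
print(f(1))            # ...so this prints 101, not 11

# To emulate Matlab's capture-by-value, snapshot k explicitly:
k = 10
g = lambda x, k=k: x + k   # default argument captures k's current value
k = 100
print(g(1))                # prints 11, as Matlab would
```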
Another brief matlab aside: cell arrays
• Matlab mainly works with matrices
– Grids of elements with one or more dimensions
– Each element must have the same type
• Can be doubles, uint8s, chars, booleans, ...
• Can also be arbitrary structures
• Cell arrays allow more flexibility with contents
– Grids of elements with one or more dimensions
– Each element can have a different type
– Get contents using curly {} brackets rather than round () brackets
>> x={}; x{1}=5; x{2}='wibble'; x{3}=x;
>> x
x =
[5] 'wibble' {1x2 cell}
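For comparison only (not part of the Matlab material), the nearest Python analogue of a cell array is a plain list:

```python
# A Matlab cell array roughly corresponds to a Python list: an ordered
# container whose elements may have different types, including other lists.
x = []
x.append(5)           # like x{1}=5
x.append('wibble')    # like x{2}='wibble'
x.append(list(x))     # like x{3}=x -- Matlab copies by value, so we copy too
print(x)              # [5, 'wibble', [5, 'wibble']]
```

Note the explicit `list(x)` copy: Matlab's value semantics mean `x{3}=x` stores a snapshot, whereas a bare Python `append(x)` would alias the list.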
A poor pipeline attempt
• How can we maintain state (in C, matlab, ...)?
function [ out ] = pipeline2( f1, f2, in )
%PIPELINE2 Apply f1(f2(in)) over two separate calls
%
% >> pipeline2( @(x)(x+1), @(y)(y*2), 5 )
%
% ans = []
%
% >> pipeline2( @(x)(x+1), @(y)(y*2), 100 )
%
% ans = 11
%
% >> pipeline2( @(x)(x+1), @(y)(y*2), 7 )
%
% ans = 201
A poor pipeline attempt
function [ out ] = pipeline2_bad( f1, f2, in )
    global register;
    buffer = {register, in};
    parfor i=1:2
        if ~isempty(buffer{i})
            if i==1
                buffer{i} = f1( buffer{i} );
            else
                buffer{i} = f2( buffer{i} );
            end
        end
    end
    out = buffer{1};
    register = buffer{2};
end
Global state is bad
• Bad enough in sequential code
– Caller of function needs to understand semantics
– Reduces opportunity for optimisations by the compiler
• Terrible in parallel code
– Imagine two parts of the program want to use pipeline2
– Very difficult to understand who will access the global register
– In other languages, potential for a “torn write”
– Would have to protect global register with a mutex...
• General modern approach: minimise global state
– Eliminate as much mutual exclusion as possible
– Make state local, and obvious
– Wherever possible use pure functions, with no state
A second attempt
function [out,register] = pipeline2(f1,f2,register,in)
    functions = {f1,f2};
    inputs = {register, in};
    outputs = cell(2,1);
    parfor i=1:2
        f = functions{i};
        outputs{i} = f(inputs{i});
    end
    out = outputs{1};
    register = outputs{2};
end
Extend to the n-ary case
function [out,registers] = pipeline_par(funcs,registers,in)
    if ~iscell(registers)
        registers = cell(1, length(funcs)-1);
    end
    registers{end+1} = in;
    parfor i=1:length(registers)
        f = funcs{i};
        if ~isempty(registers{i})
            registers{i} = f( registers{i} );
        end
    end
    out = registers{1};
    registers = {registers{2:end}};
end
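The push/pop register scheme translates almost line-for-line into other languages. Below is a hedged Python sketch (a sequential stand-in for the parfor; `None` models an empty register slot); the three calls reproduce the pipeline2 docstring examples:

```python
def pipeline_par(funcs, registers, x):
    """One clock tick of an n-stage software pipeline.
    funcs[-1] is the first stage applied to fresh input, funcs[0] the last."""
    if registers is None:                # first tick: pipeline is empty
        registers = [None] * (len(funcs) - 1)
    registers = registers + [x]          # push input onto the back
    # In the Matlab version this loop is a parfor: the stages are
    # independent, so they could all run in parallel.
    registers = [f(r) if r is not None else None
                 for f, r in zip(funcs, registers)]
    out = registers[0]                   # pop output off the front
    return out, registers[1:]

funcs = [lambda x: x + 1, lambda y: y * 2]    # overall effect: f1(f2(in))
regs = None
o1, regs = pipeline_par(funcs, regs, 5)       # None: pipe still filling
o2, regs = pipeline_par(funcs, regs, 100)     # f1(f2(5))   = 11
o3, regs = pipeline_par(funcs, regs, 7)       # f1(f2(100)) = 201
print(o1, o2, o3)                             # None 11 201
```

Each output appears two ticks (one per stage) after its input went in: that delay is the latency the pipeline trades for throughput.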
Push input onto the back
Pop output off the front
Pipeline parallelism
• Problem: want to calculate yi=f1(f2(...fn(xi)...)), i=1,2,...
• Goal: high throughput only, maximise outputs / sec
• Solution:
– Multiple tasks each handling one function in parallel
– Synchronise all tasks at the end of each round
• Requirements:
– f1..fn are side-effect free, so can safely call them in parallel
– Application is latency tolerant
– Intermediate memory usage is not a problem
Data parallelism
• Problem: want to calculate vector yi=f(xi), 1≤i≤n
• Goal: low latency, minimise total execution time
• Solution:
– Multiple tasks each handling one piece of data in parallel
– Synchronise all tasks at the end of each round
• Requirements:
– f is side-effect free, so can safely call them in parallel
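In stdlib Python the whole round – fork, apply, barrier – collapses into one map call. A minimal sketch (the square function and pool size are illustrative choices):

```python
from concurrent.futures import ThreadPoolExecutor

def f(x):
    return x * x          # any side-effect-free function will do

xs = list(range(8))
with ThreadPoolExecutor(max_workers=4) as pool:
    # One round of data parallelism: f is applied to every xi concurrently.
    # Collecting all the results acts as the barrier before the next round.
    ys = list(pool.map(f, xs))
print(ys)                 # [0, 1, 4, 9, 16, 25, 36, 49]
```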
Our first two design patterns
• Data-parallelism : apply one function to lots of data
– Simple but powerful form of parallelism, natural in HW and SW
– Widely applicable, many applications allow SIMD operation
– Can be used to build some other types of parallelism
• Pipeline parallelism : apply lots of functions to one datum
– Often well supported in hardware; less so in software
– Fewer applications, as must be able to tolerate latency
– Difficult to use as a primitive for other forms of parallelism
• Both are restricted in scope
– Must know amount of data / number of functions at start-up
– Very simple dependency model based on barriers
– Future design patterns: relax these restrictions
Control dependency view : data par.
[Diagram: in each round, a parfor launches y1=f(x1), y2=f(x2), ..., yn=f(xn) in parallel, followed by a synchronisation barrier before the next round begins.]
Control dependency view: simple pipeline
[Diagram: in each round, a parfor launches the stages x’=f1(x), x’=f2(x), ..., x’=fn(x) in parallel, followed by a synchronisation barrier before the next round begins.]
Practical pipelining
• Matlab is not really very good at pipeline parallelism
– To be fair, not what it is designed for
– It can be used to describe pipelines well, using Simulink
• Many libraries and approaches support it well
– Unix Pipes: one of the simplest general purpose tools
– Threaded Building Blocks: allows complex pipelines
– OpenCL 2.0: builtin support for FIFOs between kernels
• Too new for us to look at
• Designed to allow hardware-level pipeline parallelism
– Lots of video and audio-processing streaming APIs
Unix pipes as pipeline parallelism
Mike Gancarz: “The UNIX Philosophy”:
1. Small is beautiful.
2. Make each program do one thing well.
3. Build a prototype as soon as possible.
4. Choose portability over efficiency.
5. Store data in flat text files.
6. Use software leverage to your advantage.
7. Use shell scripts to increase leverage and portability.
8. Avoid captive user interfaces.
9. Make every program a filter.
CW2
There are multiple re-statements of the Unix philosophy, but this is the one I prefer.
Anatomy of a filter program
• Most OS’s and languages have standard streams
– stdin : Input text or binary data being passed to the program
– stdout : Output text or binary data being produced by program
– stderr : Diagnostic information produced during execution
• Streams are initialised when program starts
– Arguments are passed to main by shell or OS
– Standard streams are automatically connected, e.g. to keyboard/display
– Program has to deal with the extra arguments, may open files, ...
[Diagram: a filter process with stdin flowing in, stdout and stderr flowing out, plus command-line arguments such as file1.txt and “flag”.]
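As a sketch of that anatomy, here is a minimal filter written in Python (the upper-casing transformation and the name upper.py are illustrative). Run under a shell, it would be wired to the real standard streams:

```python
import io
import sys

def filter_stream(inp, out, err):
    """Copy inp to out, upper-casing each line; diagnostics go to err."""
    count = 0
    for line in inp:
        out.write(line.upper())
        count += 1
    print(f"processed {count} lines", file=err)
    return count

# As a real filter the shell supplies the streams:
#     some_cmd | python3 upper.py
# would call filter_stream(sys.stdin, sys.stdout, sys.stderr).
# Here, in-memory streams stand in so the sketch is self-contained:
result = io.StringIO()
n = filter_stream(io.StringIO("cat\ndog\n"), result, io.StringIO())
print(result.getvalue(), end="")      # CAT and DOG, one per line
```

Keeping diagnostics on stderr matters: it means the line count never pollutes the data flowing to the next filter in a pipe.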
Stream re-routing
[Diagram: with no command running, the shell’s stdin comes from the keyboard and its stdout/stderr go to the display.]
Stream re-routing
[Diagram: cat reads the file index.html; its stdout and stderr flow through the shell to the display.]
cat index.html
Stream re-routing
[Diagram: cat’s stdout is routed into grep’s stdin; grep searches for “html”; both stderr streams still go to the display.]
cat index.html | grep html
Stream re-routing
[Diagram: with no cat in front, grep’s stdin is connected to the keyboard; stdout and stderr go to the display.]
grep html
Stream re-routing
[Diagram: curl fetches http://slashdot.org; its stdout feeds grep’s stdin (“id”), grep’s stdout feeds sort’s stdin, and sort’s stdout goes to the display; all stderr streams go straight to the display.]
curl http://slashdot.org | grep id | sort
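What the shell does with | can be reproduced by hand: connect one process’s stdout to the next’s stdin. A hedged Python sketch (printf and sort as harmless stand-ins for curl and grep, so no network is needed):

```python
import subprocess

# Build cmd1 | cmd2 manually: hand cmd1's stdout pipe to cmd2 as its
# stdin, exactly as the shell does for "curl ... | grep id | sort".
p1 = subprocess.Popen(["printf", "pear\napple\nbanana\n"],
                      stdout=subprocess.PIPE)
p2 = subprocess.Popen(["sort"], stdin=p1.stdout,
                      stdout=subprocess.PIPE, text=True)
p1.stdout.close()            # so p2 sees EOF once p1 exits
out, _ = p2.communicate()    # both processes run concurrently
print(out)                   # apple, banana, pear -- sorted
```

Both processes run at once; the OS pipe between them provides the buffering and the blocking described on the next slide.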
Advantages of streaming data
• Intermediate data never has to touch disk
– IO is expensive; we are often limited by disk bandwidth
– When processing terabytes of data there is not enough space
• For “Big Data” we can often only store a compressed version
• High performance computing is increasingly data-limited
• Parallel processing comes for free
– Each stage in the pipeline is its own parallel process
– OS will block processes when they are waiting for data
• Synchronisation is local to each stage, rather than global for the pipeline
– Block when there is not enough data on stdin
– Block when there is not enough buffer space on stdout
– Apart from that: process away!
Disadvantages?