Course admin stuff
• First coursework released this evening
– (Why this evening? Maximal lectures before submission)
– Due Jan 27th 23:59, two weeks’ time
• Where to find it
– Spec is on github: https://github.com/HPCE/hpce-2014-cw1
• Can just google “m8pple”
– Submission for this coursework is via blackboard
HPCE / dt10/ 2015 / 1.1
Expectations for coursework
• Coursework is not lab [1]
– You have to manage when, where, and how long you spend on it
– 100% coursework does not mean easy [1]
• Long hours are neither sufficient nor necessary for an A
– Like anything else: some people are just good at it
– But there is a good correlation between organisation and marks
• You are expected to be reasonably independent
– This is a masters level course
[1] – Though the earlier parts kind of are.
Working together
• The software community has a tradition of sharing
– Many open-source projects, some of which you will rely on
– Lots of forums for discussing problems: Stack Overflow, ...
• Approach this work in the same way
– You may encounter the same problems as other students
– Discuss solutions with each other, help each other out
– One-on-one discussions, github issues, whatever
• https://github.com/HPCE/hpce-2014-cw1/blob/master/background-bugs.md
– Give credit or thanks if appropriate: be excellent to each other
• But you have to balance co-operation and competition
– The later courseworks require good ideas and strategies
– Up to you to protect your IP.
Plagiarism
• All submitted material must be written by you
– Do not share any code with each other (except within pairs)
• No plagiarism checking software: I just read the code
– Some similarity of structure is expected, but there are limits
– Students are amusingly bad at obfuscation
– You need to be able to explain any code you submit in the oral
– Suspected plagiarism will be passed to the plagiarism committee
• If necessary you may use code from third-party sources
– e.g. open-source projects, samples, stack overflow, ...
– Origin and extent must be very clearly shown
– Need to be able to justify (orally) why it was used
– Should be aware of potential licensing implications
Matlab: why?
• Matlab is not really high performance
– Interpreted (though it is JIT compiled these days)
– Dynamically typed
– Poor at loops (even with the JIT)
• But it is high productivity
– The toolboxes can be a massive timesaver
– Development is interactive due to the REPL interface
• REPL = Read-Eval-Print Loop
• Encourages experimentation
• And it can be fast
– Just need to talk to it right
All about vectorisation
• Traditional wisdom: “you need to vectorise matlab”
– The best kind of wisdom: it’s actually correct
• But vectorisation is not only useful for matlab
– Memory system: Maximises cache performance
– Execution: Can utilise SIMD units in modern CPUs
• SIMD = Single-Instruction Multiple Data: SSE, AVX, MMX, ...
– Scheduling: Reduce dependency tracking overhead
– Parallelism: Vectorised code often implies parallel code
• Principles from matlab vectorisation apply in OpenCL
– Easier to learn them here
What can be vectorised?
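• The obvious: single for loops
x=randn(1,1e6);
n=numel(x); % Loop bound: one iteration per element
tic;
for t=1:n
x(t) = x(t) * x(t);
end
toc
• Life sometimes is easy
tic;
x = x .* x; % Element-wise square, no loop needed
toc
• Though watch out...
x = x' * x; % Not element-wise: for a row vector this is the outer product, numel(x)-by-numel(x)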
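A minimal sketch of the comparison above, written so the loop and the vectorised form can be checked against each other before timing them. The size n and the variable names (x0, xLoop, xVec, v) are illustrative choices, not part of the original slide; timings will vary by machine and Matlab version.

```matlab
% Sketch: confirm the loop and the vectorised form agree, then time both.
n = 1e6;                       % arbitrary size for illustration
x0 = randn(1, n);

% Loop version: one scalar multiply per iteration
xLoop = x0;
tic;
for t = 1:n
    xLoop(t) = xLoop(t) * xLoop(t);
end
tLoop = toc;

% Vectorised version: a single element-wise operation
tic;
xVec = x0 .* x0;
tVec = toc;

fprintf('loop: %.4fs, vectorised: %.4fs\n', tLoop, tVec);
sameResult = isequal(xLoop, xVec);   % both forms compute the same squares

% The "watch out" case on a tiny vector: * is matrix multiply, not element-wise
v = [1 2 3];
inner = v * v';                % 1-by-1 inner product
outer = v' * v;                % 3-by-3 outer product
```

On a 1-by-1e6 vector the outer product would need a 1e6-by-1e6 matrix, which is why the tiny vector v is used here.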
Less obvious things...
• Vectorised in the weak sense
– One statement is operating on lots of data (not just vectors)
• Making use of a vector mask
– Apply some condition to everything in a vector
– Get mask indicating where the condition is true
– Select just those elements that meet the criterion
x=2:R; % All integers in range 2..R
o=x'*x; % Outer product of vector with itself
% Check which numbers are in the product matrix
mask=ismember(x,o);
res=x(~mask); % select any that aren't
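The masking steps above can be sketched on a small range. This is only an illustration: the bound R = 30, the extra mask condition, and the cross-check against the built-in primes function are assumptions added here, not part of the slide.

```matlab
R = 30;                        % arbitrary bound for illustration
x = 2:R;

% Build a logical mask from a condition, then select with it
mask = (mod(x, 2) == 1) & (x > 20);
sel = x(mask);                 % odd values above 20

% The outer-product test from the slide: any value of x that appears
% in x'*x is a product of two integers >= 2, so it is composite
o = x' * x;
composite = ismember(x, o);
res = x(~composite);

% Cross-check against the built-in primes()
agrees = isequal(res, primes(R));
```

The outer product costs memory proportional to R squared, so this form only suits small R.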
mask=true(1,R); % All numbers initially prime
mask(1)=false; % Except 1, which is not prime
p=2; % Start from the smallest prime
while p < R
% Mark all multiples of the current prime
mask(2*p:p:end)=0;
% Find next number above p that is still marked
% (empty if there is none, which ends the loop)
p=find(mask(p+1:end),1,'first')+p;
end
primes=find(mask); % Gather indices that are marked
• Even weaker form of vectorisation
– “find” within the loop is order dependent: can’t parallelise
• But natively supported by matlab
– “find” is a primitive
– As basic as the “add” instruction in x86
– Most of the cost is in scheduling the primitive; execution is cheap
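As a quick sanity check, the sieve can be run for a few bounds and compared with Matlab's built-in primes function. This is a sketch under assumptions: the bounds in Rs are arbitrary, the result is stored in found so the built-in primes stays callable, and index 1 is cleared up front because 1 is not prime.

```matlab
Rs = [10 30 31 100];           % arbitrary bounds for illustration
allMatch = true;
for R = Rs
    mask = true(1, R);
    mask(1) = false;                 % 1 is not prime
    p = 2;
    while p < R
        mask(2*p:p:end) = false;     % strike out multiples of p
        % Next marked number above p; find returns empty when there
        % is none, and an empty loop condition counts as false
        p = find(mask(p+1:end), 1, 'first') + p;
    end
    found = find(mask);
    allMatch = allMatch && isequal(found, primes(R));
end
```

Each pass of the while loop issues two vector primitives (a strided assignment and a find), so the serial part is one iteration per prime rather than one per number.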
More practical: image processing
• Quantisation: reduce the colour depth of images
– e.g. Take 256 level grayscale, produce 1-bit image
im=imread('lena-std_512x512.png' );
im=double(rgb2gray(im))/256;
res=zeros(size(im));
for x=1:size(im,1)
for y=1:size(im,2)
res(x,y) = im(x,y) > 0.5;
end
end
imshow(res);
res = im>0.5;
imshow(res);
res = im>0.5;
imshow(res);
f = @(v)( v>0.5 );
res = f(im);
imshow(res)
res = arrayfun(f, im);
imshow(res);
• Create anonymous function with argument “v”
• Define expression as body for anonymous function
• Assign function to variable f
• Variable f can now be called as a function
• Can pass f to other functions
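A small sketch showing that the four forms of the threshold give the same result. A random matrix stands in for the image so the snippet runs without the image file or the Image Processing Toolbox; the sizes and the names resLoop, resVec, resFun, resArr are illustrative assumptions.

```matlab
im = rand(64, 48);             % synthetic stand-in for a grayscale image
f = @(v)( v > 0.5 );           % anonymous threshold function

% Explicit double loop
resLoop = zeros(size(im));
for x = 1:size(im,1)
    for y = 1:size(im,2)
        resLoop(x,y) = im(x,y) > 0.5;
    end
end

resVec = im > 0.5;             % direct vectorised comparison
resFun = f(im);                % anonymous function applied to the whole matrix
resArr = arrayfun(f, im);      % f called once per element

% isequal compares values, so the double and logical results can be compared
allAgree = isequal(resLoop, resVec) && isequal(resVec, resFun) ...
        && isequal(resFun, resArr);
```

arrayfun still calls f once per element, so it is typically closer to the loop in cost than to the direct comparison.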
More intelligent quantisation
• Dithering: cumulative error due to quantisation is tracked
• Can it be vectorised ?
acc=0;
for x=1:w
for y=1:h
acc = acc + src(x,y);
quantised = round(acc*(levels-1))/(levels-1);
res(x,y)=quantised;
acc = acc - quantised;
end
end
• Loop-carried dependency through acc: each iteration needs the value left by the previous one
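% Original: both loops serial, one scalar acc carried through every pixel
acc=0;
for x=1:w
for y=1:h
acc = acc + src(x,y);
quantised = round(acc*(levels-1))/(levels-1);
res(x,y)=quantised;
acc = acc - quantised;
end
end
% Vectorised along x: acc becomes a w-by-1 vector, one accumulator per x
% (error is no longer carried from the end of one x to the start of the next)
acc=0;
x=1:w;
for y=1:h
acc = acc + src(x,y);
quantised = round(acc*(levels-1))/(levels-1);
res(x,y)=quantised;
acc = acc - quantised;
end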
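The form that vectorises along x keeps one accumulator per x, so it should match a scalar loop in which acc is reset at the start of each x. A minimal sketch of that check, assuming a random w-by-h input and levels = 4 (all sizes and the names ref, q, match are illustrative):

```matlab
w = 8; h = 16; levels = 4;     % arbitrary sizes for illustration
src = rand(w, h);

% Reference: scalar loops, accumulator reset for each x
ref = zeros(w, h);
for x = 1:w
    acc = 0;
    for y = 1:h
        acc = acc + src(x,y);
        q = round(acc*(levels-1))/(levels-1);
        ref(x,y) = q;
        acc = acc - q;
    end
end

% Vectorised along x: acc is a column vector, loop only over y
res = zeros(w, h);
acc = zeros(w, 1);
for y = 1:h
    acc = acc + src(:,y);
    q = round(acc*(levels-1))/(levels-1);
    res(:,y) = q;
    acc = acc - q;
end

match = isequal(ref, res);
```

The dependency along y is still serial; only the independent direction has been turned into a vector operation.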
2D error diffusion
• Attempt to diffuse error both across and down image
• Reduce tendency towards banding effects
• More difficult loop carried dependency
• Have a write-before-read dependency
– Current loop iteration reads from (x,y)
– Writes (x,y+1), (x+1,y), and (x+1,y+1)
– Three constraints per node
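% Assumed definitions, matching the (levels-1) factor in the earlier code
levelsSub1 = levels-1;
invLevelsSub1 = 1/levelsSub1;
for x=1:w-1
for y=1:h-1
desired = src(x,y);
quantised = round(desired*levelsSub1)*invLevelsSub1;
src(x,y)=quantised;
err = desired - quantised; % Quantisation error to diffuse
src(x+1,y) = src(x+1,y) + err*0.4;
src(x,y+1) = src(x,y+1) + err*0.4;
src(x+1,y+1) = src(x+1,y+1) + err*0.2;
end
end
[Figure: node (x,y) and the three nodes it writes to: (x+1,y), (x,y+1), (x+1,y+1)]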
Generalise to the full grid
(1,1) (1,2) (1,3) (1,4)
(2,1) (2,2) (2,3) (2,4)
(3,1) (3,2) (3,3) (3,4)
(4,1) (4,2) (4,3) (4,4)
[Figures: the following slides redraw the same 4x4 grid; the arrows showing execution order are lost in this copy]
Serial execution: y then x
Serial execution: x then y
Vectorisation: can’t do it along x
– Node (x,y) writes to (x+1,y), so neighbouring nodes along x are not independent
Vectorisation: can’t do it along y
– Node (x,y) writes to (x,y+1), so neighbouring nodes along y are not independent
Skewing the loops
– Nodes on the same anti-diagonal (equal x+y) do not depend on each other
Or viewed another way
So how to do that in matlab?
• C and Fortran compilers may try to do this for you
– Can do a decent job for small loop kernels
– Have difficulty detecting when it is safe to apply
– Active research area: polyhedral compilation techniques
• A lot of the time you have to do it yourself
– Matlab isn’t clever enough: not enough info in matlab code
– Must be explicitly handled by programmer in OpenCL
• Compiler is not allowed to do it
– Even in multi-core it crops up
• Some basic techniques to make it easier
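One way to apply the skewing by hand, sketched under assumptions: nodes with the same x+y are processed together using linear indices, and the result is compared with the serial loops. The sizes, the variable names (ref, skew, s, idx, err) and the use of a random input are illustrative; this is not necessarily the technique the course goes on to present.

```matlab
w = 6; h = 5; levels = 2;      % small arbitrary sizes for illustration
levelsSub1 = levels - 1;
invLevelsSub1 = 1 / levelsSub1;
src0 = rand(w, h);

% Reference: serial loops from the 2D error diffusion slide
ref = src0;
for x = 1:w-1
    for y = 1:h-1
        desired = ref(x,y);
        quantised = round(desired*levelsSub1)*invLevelsSub1;
        ref(x,y) = quantised;
        err = desired - quantised;
        ref(x+1,y)   = ref(x+1,y)   + err*0.4;
        ref(x,y+1)   = ref(x,y+1)   + err*0.4;
        ref(x+1,y+1) = ref(x+1,y+1) + err*0.2;
    end
end

% Skewed: loop over anti-diagonals s = x+y, vectorise within each one
skew = src0;
for s = 2:(w+h-2)
    x = max(1, s-h+1):min(w-1, s-1);   % valid x on this anti-diagonal
    y = s - x;
    idx = x + (y-1)*w;                 % linear index of (x,y) in a w-by-h array
    desired = skew(idx);
    quantised = round(desired*levelsSub1)*invLevelsSub1;
    skew(idx) = quantised;
    err = desired - quantised;
    % Each statement touches distinct elements, so the updates accumulate safely
    skew(idx+1)   = skew(idx+1)   + err*0.4;   % (x+1,y)
    skew(idx+w)   = skew(idx+w)   + err*0.4;   % (x,y+1)
    skew(idx+w+1) = skew(idx+w+1) + err*0.2;   % (x+1,y+1)
end

maxDiff = max(abs(ref(:) - skew(:)));
```

The serial part shrinks from (w-1)*(h-1) iterations to w+h-3, at the price of short vectors near the corners of the grid.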