Algorithmic Techniques on a Ring of Processors
Logical Processor Topology
When writing a (distributed memory) parallel application, one typically organizes processors in a logical topology:
linear array, ring, bi-directional ring, 2-D grid, 2-D torus, one-level tree, fully connected graph, arbitrary graph
We're going to talk about a simple ring. It is a natural choice to partition regular data like matrices, and we will come up with algorithms and performance estimates. Some of these algorithms could be done better on other topologies, like bi-directional rings for instance, but the point is to see how to design and reason about parallel algorithms.
Communication on the Ring
Each processor is identified by a rank: RANK()
There is a way to find the total number of processors: NUMPROCS()
Each processor can send a message to its successor, SEND(addr, L), and receive one from its predecessor, RECV(addr, L)
We're looking only at SPMD programs
[Figure: a unidirectional ring of processors P0, P1, P2, P3, ..., Pp-1]
Cost of communication
It is actually difficult to precisely model the cost of communication, or the way in which communication loads the processor
We will be using a simple model: Time = β + Lτ
β: start-up cost; L: message size; τ: inverse of the bandwidth
We assume that if a message of length L is sent from P0 to Pq, then the communication cost is q(β + Lτ)
There are many assumptions in our model, some not very realistic, but we'll discuss them later
Broadcast
We want to write a program that has Pk send the same message of length L to all other processors: Broadcast(k, addr, L)
On the ring, we just send to the next processor, and so on, with no parallel communications whatsoever
This is of course not the way one should implement a broadcast in practice; MPI uses some type of tree topology
Broadcast
Broadcast(k, addr, L)
  q = RANK()
  p = NUMPROCS()
  if (q == k)
    SEND(addr, L)
  else if (q == (k-1) mod p)
    RECV(addr, L)
  else
    RECV(addr, L)
    SEND(addr, L)
  endif
Assumes a blocking receive
Sending may be non-blocking
The broadcast time is
(p-1)(β + Lτ)
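For concreteness, here is a minimal MPI sketch of this ring broadcast (the function name and message tag are my own; production code would just call MPI_Bcast):

    #include <mpi.h>

    /* Ring broadcast: root k sends to its successor; every other rank
       receives from its predecessor and forwards, except the last one. */
    void ring_broadcast(int k, char *addr, int L, MPI_Comm comm)
    {
        int q, p;
        MPI_Comm_rank(comm, &q);
        MPI_Comm_size(comm, &p);
        int succ = (q + 1) % p, pred = (q - 1 + p) % p;

        if (q == k) {
            MPI_Send(addr, L, MPI_CHAR, succ, 0, comm);
        } else if (q == (k - 1 + p) % p) {      /* last processor on the ring */
            MPI_Recv(addr, L, MPI_CHAR, pred, 0, comm, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(addr, L, MPI_CHAR, pred, 0, comm, MPI_STATUS_IGNORE);
            MPI_Send(addr, L, MPI_CHAR, succ, 0, comm);
        }
    }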
Scatter
Pk stores the message destined to Pq at address addr[q], including a message to itself at addr[k].
The principle is to pipeline the communications, starting with the message destined to Pk-1, the most distant processor.
Scatter
Scatter(k, addr, L)
  q = RANK()
  p = NUMPROCS()
  if (q == k)
    for i = 1 to p-1
      SEND(addr[(k + p - i) mod p], L)
    addr ← addr[k]
  else
    RECV(tempR, L)
    for i = 1 to (k - 1 - q) mod p
      tempS ↔ tempR
      SEND(tempS, L) || RECV(tempR, L)
    addr ← tempR

tempS ↔ tempR: swapping of the send buffer and the receive buffer (a pointer swap)
SEND || RECV: sending and receiving in parallel, with a non-blocking send
Same execution time as the broadcast:
(p-1)(β + Lτ)
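A possible MPI rendering of this pipelined scatter, as a sketch (ring_scatter and out are my names; MPI_Sendrecv plays the role of SEND || RECV):

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    /* Root k owns p messages of L bytes in addr[0..p-1]; afterwards
       every rank holds its own message in out. */
    void ring_scatter(int k, char **addr, char *out, int L, MPI_Comm comm)
    {
        int q, p;
        MPI_Comm_rank(comm, &q);
        MPI_Comm_size(comm, &p);
        int succ = (q + 1) % p, pred = (q - 1 + p) % p;

        if (q == k) {
            for (int i = 1; i <= p - 1; i++)   /* most distant message first */
                MPI_Send(addr[(k + p - i) % p], L, MPI_CHAR, succ, 0, comm);
            memcpy(out, addr[k], L);
        } else {
            char *tempS = malloc(L), *tempR = malloc(L);
            int steps = ((k - 1 - q) % p + p) % p;     /* (k-1-q) mod p */
            MPI_Recv(tempR, L, MPI_CHAR, pred, 0, comm, MPI_STATUS_IGNORE);
            for (int i = 1; i <= steps; i++) {
                char *t = tempS; tempS = tempR; tempR = t;  /* buffer swap */
                MPI_Sendrecv(tempS, L, MPI_CHAR, succ, 0,   /* send || receive */
                             tempR, L, MPI_CHAR, pred, 0, comm,
                             MPI_STATUS_IGNORE);
            }
            memcpy(out, tempR, L);
            free(tempS); free(tempR);
        }
    }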
All-to-all
q = RANK()
p = NUMPROCS()
addr[q] ← my_addr
for i = 1 to p-1
  SEND(addr[(q - i + 1) mod p], L) || RECV(addr[(q - i) mod p], L)
Same execution time as the scatter:
(p-1)(β + Lτ)
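The same loop in MPI, as a sketch (this "all-to-all" is really an all-gather of one block per rank; names are mine):

    #include <mpi.h>

    /* addr[q] initially holds rank q's block of L bytes; after p-1 steps
       of simultaneous send/receive, addr[0..p-1] is complete everywhere. */
    void ring_alltoall(char **addr, int L, MPI_Comm comm)
    {
        int q, p;
        MPI_Comm_rank(comm, &q);
        MPI_Comm_size(comm, &p);
        int succ = (q + 1) % p, pred = (q - 1 + p) % p;
        for (int i = 1; i <= p - 1; i++) {
            int s = ((q - i + 1) % p + p) % p;   /* block to pass along */
            int r = ((q - i) % p + p) % p;       /* block arriving from pred */
            MPI_Sendrecv(addr[s], L, MPI_CHAR, succ, 0,
                         addr[r], L, MPI_CHAR, pred, 0, comm,
                         MPI_STATUS_IGNORE);
        }
    }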
A faster broadcast? How can one accelerate the broadcast? So far we've seen (p-1)(β + Lτ). One can cut the message into r pieces, assuming L is divisible by r. The root processor just sends r messages, one after the other. The performance is as follows.
Consider the last processor to get the last piece of the message. There need to be p-1 steps for the first piece to arrive, which takes (p-1)(β + (L/r)τ). Then the remaining r-1 pieces arrive one after another, which takes (r-1)(β + (L/r)τ). For a total of: (p - 2 + r)(β + (L/r)τ)
A faster broadcast?
The question is: what value of r minimizes (p - 2 + r)(β + (L/r)τ)? One can view the expression as (c + ar)(d + b/r), with four constants a, b, c, d. The non-constant part of the expression is then adr + cb/r, which is known to be minimized for r = sqrt(cb / ad).
Here we have
r_opt = sqrt(L(p-2)τ / β), with the optimal time (sqrt((p-2)β) + sqrt(Lτ))²
which tends to Lτ when L is large, and is independent of p.
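A small C helper evaluating these formulas (a sketch under the model above; the parameter names beta and tau are mine):

    #include <math.h>

    /* Time of the pipelined ring broadcast with the optimal number of
       pieces r_opt = sqrt(L(p-2)tau/beta), i.e. the value of
       (p-2+r)(beta + L*tau/r) at r = r_opt, which equals
       (sqrt((p-2)*beta) + sqrt(L*tau))^2. */
    double pipelined_broadcast_time(int p, double L, double beta, double tau)
    {
        double r = sqrt(L * (p - 2) * tau / beta);   /* r_opt */
        return (p - 2 + r) * (beta + L * tau / r);
    }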
Matrix-Vector product
y = A x
for i = 0 to n-1      /* compute a dot-product */
  y[i] = 0
  for j = 0 to n-1
    y[i] = y[i] + a[i,j] * x[j]
Just distribute the dot-product computations among processors
Let n be the size of the matrix, p the number of processors
Let’s assume that n is divisible by p, and let r = n/p
Each processor needs r rows of the matrix
Matrix-vector Product
What about the distribution of vector x? It could be replicated across all processors, and then all computations would be independent. But since each processor computes only a piece of y, it is more elegant to have x distributed like A, with each processor owning r components of vector x.
This is typically what would be done in real code, so that data is distributed across processors. For a vector it may be more efficient to fully duplicate it, but in general you don't want to do that for matrices or other large data structures.
Each processor has in its memory:
r rows of matrix A, in an array a[r][n]
r components of vector x, in an array my_x[r]
Global vs. Local Indices
Having only a piece of the overall data structure is common:
it makes it possible to partition the workload
it makes it possible to run larger problems by aggregating distributed memory
Typically when writing code like this one has:
a global index (I,J) that references an element of the matrix
a local index (i,j) that references an element of the local array that stores a piece of the matrix
Translation between global and local indices: think of the algorithm in terms of global indices, implement it in terms of local indices.
[Figure: a matrix split into Mblock × Nblock blocks over processors P0..P5 arranged in a 2×3 grid. Global element A[5][7] is local element a[1][3] on P4, and in general a[i][j] on a processor corresponds to A[Mblock*floor(rank/3) + i][Nblock*(rank mod 3) + j].]
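As a sketch, the translation from a global index to (rank, local index) for the block distribution in this figure, with a 2×3 processor grid (the helper and its names are hypothetical, for illustration):

    /* Translate global (I, J) into the owning rank and local (i, j),
       for Mblock x Nblock blocks on a 2 x 3 processor grid. */
    typedef struct { int rank, i, j; } local_index;

    local_index global_to_local(int I, int J, int Mblock, int Nblock)
    {
        local_index li;
        li.rank = (I / Mblock) * 3 + (J / Nblock);  /* 3 = processor columns */
        li.i = I % Mblock;                          /* row within the block  */
        li.j = J % Nblock;                          /* col within the block  */
        return li;
    }

With Mblock = Nblock = 4, global A[5][7] indeed maps to rank 4 and local a[1][3], as in the figure.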
Principle of the Algorithm
[Figure: initial data distribution for n = 8, p = 4, r = 2. P0 holds rows 0-1 of A and (x0, x1); P1 holds rows 2-3 and (x2, x3); P2 holds rows 4-5 and (x4, x5); P3 holds rows 6-7 and (x6, x7).]
Principle of the Algorithm
[Figure: execution steps for n = 8, p = 4, r = 2. At step 1, each processor computes with its own piece of x against the matching columns of its rows (P0 uses columns 0-1 with (x0, x1), P1 uses columns 2-3 with (x2, x3), etc.). At each subsequent step the pieces of x shift by one position along the ring (at step 2, P0 holds (x6, x7), P1 holds (x0, x1), ...), and each processor accumulates the contribution of the columns matching its current piece. In the final state, after p steps, every component of y is complete and each piece of x is back at its original owner.]
The final exchange of vector x is not strictly necessary, but one may want to have it distributed at the end of the computation the way it was distributed at the beginning.
Algorithm
Mat_vec(in A, in x, out y)
q = RANK()
p = NUMPROCS()
tempS ← x      /* my piece of the vector (r elements) */
for step = 0 to p-1
  SEND(tempS, r)
  || RECV(tempR, r)
  || for i = 0 to r-1
       for j = 0 to r-1
         y[i] ← y[i] + a[i, ((q - step) mod p) * r + j] * tempS[j]
  tempS ↔ tempR
Uses two buffers (tempS for sending and tempR for receiving)
Computation and Communications occur in parallel
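A sketch of the same algorithm in MPI, using non-blocking calls for the overlap (function and variable names are mine; assumes p divides n):

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    /* a: local r x n block of rows (row-major); my_x: local r components
       of x; y: local r components of the result, r = n/p. */
    void mat_vec(int n, double *a, double *my_x, double *y, MPI_Comm comm)
    {
        int q, p;
        MPI_Comm_rank(comm, &q);
        MPI_Comm_size(comm, &p);
        int r = n / p, succ = (q + 1) % p, pred = (q - 1 + p) % p;

        double *tempS = malloc(r * sizeof *tempS);
        double *tempR = malloc(r * sizeof *tempR);
        memcpy(tempS, my_x, r * sizeof *tempS);
        memset(y, 0, r * sizeof *y);

        for (int step = 0; step < p; step++) {
            MPI_Request sreq, rreq;
            MPI_Isend(tempS, r, MPI_DOUBLE, succ, 0, comm, &sreq);
            MPI_Irecv(tempR, r, MPI_DOUBLE, pred, 0, comm, &rreq);
            /* columns matching the piece of x currently held in tempS */
            int base = (((q - step) % p + p) % p) * r;
            for (int i = 0; i < r; i++)
                for (int j = 0; j < r; j++)
                    y[i] += a[i * n + base + j] * tempS[j];
            MPI_Wait(&sreq, MPI_STATUS_IGNORE);
            MPI_Wait(&rreq, MPI_STATUS_IGNORE);
            double *t = tempS; tempS = tempR; tempR = t;  /* swap buffers */
        }
        free(tempS); free(tempR);
    }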
Performance
There are p identical steps. During each step, each processor performs three concurrent activities: computing, receiving, and sending.
Each step goes as fast as the slowest of the 3 concurrent activities:
Computation: r²·Tcomp
Communication: β + r·Tcomm
(Tcomp and Tcomm are the times for individual computations and transfers)
T(p) = p × max(r²·Tcomp, β + r·Tcomm)
For fixed p, when n gets large the computation time dominates and the efficiency tends to 1.
Performance (2)
Note that an algorithm that initially broadcasts the entire vector to all processors and then has every processor compute independently would run in time
(p-1)(β + n·Tcomm) + p·r²·Tcomp
which has the same asymptotic performance, is a simpler algorithm, wastes only a tiny bit of memory, and is arguably much less elegant.
It is important to think of simple solutions and see what works best given the expected matrix sizes, etc.
An Image Processing Application
We have seen a few parallel applications with different ways of partitioning the work: matrix-matrix and matrix-vector multiply, sharks and fishes, numerical methods, ...
We’re going to look in depth at another type of parallel application, and see what the performance trade-offs are
We're still working on the ring topology. The application model is representative of several image processing applications.
The sequential application
A generic algorithmic framework that can be used for: distance from contour, computation of an optimal trajectory, and others.
Let P be an n×n grid, where each point is a pixel. A point p not on the edge has 8 neighbors:
NW  N  NE
W   p  E
SW  S  SE
Principle of the algorithm
Sweep through the grid, back and forth: first from the top-left corner to the bottom-right corner (the FW pass), then back from the bottom-right corner to the top-left corner (the BW pass).
FW pass: p ← FW_update(p, W, NW, N, NE)
BW pass: p ← BW_update(p, E, SE, S, SW)
[Figure: the two stencils. The FW stencil uses the W, NW, N, NE neighbors of p; the BW stencil uses E, SE, S, SW.]
Why is this useful? Distance from contour
Let P be a binary image of an object F, with a pixel value of zero if the pixel belongs to F, and of ∞ otherwise
We want to replace each pixel value by the pixel’s distance to F’s complement, according to some metric
Can be done in two passes:
FW: p ← min(p, W+t1, NW+t2, N+t1, NE+t2)
BW: p ← min(p, E+t1, SE+t2, S+t1, SW+t2)
t1 = 1, t2 = ∞: Manhattan distance
t1 = 3, t2 = 4: good approximation of the Euclidean distance
Once one has this distance, it is easy to compute things such as surface, contour length, etc.
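A sequential sketch of the two passes in C (the grid size, names, and the INF encoding are mine; INT_MAX/2 keeps the additions from overflowing even when t2 plays the role of ∞):

    #include <limits.h>

    #define N 512                 /* grid size, for illustration */
    #define INF (INT_MAX / 2)     /* "infinity" that survives one addition */

    static int min2(int a, int b) { return a < b ? a : b; }

    /* img[i][j] is 0 on the object and INF elsewhere; after the two
       passes each pixel holds its weighted distance to the nearest 0. */
    void distance_transform(int img[N][N], int t1, int t2)
    {
        /* FW pass: top-left to bottom-right, stencil W, NW, N, NE */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                int v = img[i][j];
                if (j > 0)            v = min2(v, img[i][j-1]   + t1); /* W  */
                if (i > 0 && j > 0)   v = min2(v, img[i-1][j-1] + t2); /* NW */
                if (i > 0)            v = min2(v, img[i-1][j]   + t1); /* N  */
                if (i > 0 && j < N-1) v = min2(v, img[i-1][j+1] + t2); /* NE */
                img[i][j] = v;
            }
        /* BW pass: bottom-right to top-left, stencil E, SE, S, SW */
        for (int i = N-1; i >= 0; i--)
            for (int j = N-1; j >= 0; j--) {
                int v = img[i][j];
                if (j < N-1)            v = min2(v, img[i][j+1]   + t1); /* E  */
                if (i < N-1 && j < N-1) v = min2(v, img[i+1][j+1] + t2); /* SE */
                if (i < N-1)            v = min2(v, img[i+1][j]   + t1); /* S  */
                if (i < N-1 && j > 0)   v = min2(v, img[i+1][j-1] + t2); /* SW */
                img[i][j] = v;
            }
    }

Called with t1 = 1 and t2 = INF this gives the Manhattan distance; with t1 = 3 and t2 = 4 it gives the Euclidean approximation (scaled by 3).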
Computation of an optimal trajectory: each pixel has a "cost" value, and the goal is to compute minimal-cost trajectories from one pixel to all others. A bit more complicated, with non-trivial updates (Bitz and Kung, 1988). O(n²) passes in the worst case (< n in practice).
...
Parallelization?
Stencil applications are common, and many people have looked at parallelizing them.
Here the stencil is interesting because it is asymmetric and leads to a "wavefront" computation.
We want to do this on a ring of processors. The usual trade-offs apply:
load-balance the work among the processors
don't pay too much for communication
get all processors to start computing early
A Greedy Algorithm
Processors send pixels to their neighbors as soon as they are computed: very small start-up time, good load balancing.
Say we have p = n processors and each line i of the image is assigned to processor Pi.
In a FW phase, as soon as Pi computes a pixel, it must send it to Pi+1.
Given the shape of the stencil, a processor needs two values from its predecessor to start computing a line.
Execution Steps
[Figure: the wavefront in action. P0 computes the pixels of line 0 at steps 0, 1, 2, ...; P1 computes the pixels of line 1 starting at step 2; P2 starts at step 4; in general Pi computes pixel (i,j) at step 2i+j, so the computed region sweeps across the grid as a diagonal wavefront.]
At "step" 2i+j, processor Pi:
receives pixel (i-1, j+1) from Pi-1
computes pixel (i, j)
sends pixel (i, j) to Pi+1
Note the similarity to a systolic network.
Performance? Assume that sends are non-blocking and receives are blocking. For each row, each processor follows the sequence: get a pixel; compute; send a pixel || get a pixel.
[Figure: execution timeline. Each processor alternates C (compute) and S (send, receive, or send || receive); Pi starts after a delay of 2i × (Tcomp + β + Tcomm), and a full line then takes n × (Tcomp + β + Tcomm).]
Tcomm = time to send a pixel to a neighbor
Tcomp = time to compute a pixel
Performance
Processor Pp-1 is the last one to finish. It finishes at time:
T = (3n-2)·Tcomp + (3n-3)(β + Tcomm)
Therefore we have O(n) complexity.
We would have stopped here in the land of PRAMs, etc.
The problem is the 3n term: in practice β is orders of magnitude larger than Tcomm, and short messages are known to be a bad idea on most platforms. So we have:
a small start-up time
reasonably good load balancing
expensive communications
What if p < n? This is the realistic case. When p < n, one must partition the data; we assume that p divides n. One could give the first n/p lines to P0, the next n/p lines to P1, etc., but then the last processor would start computing very late due to the stencil shape. A better way is to interleave the image lines between processors: a classical load-balancing technique that we mentioned for sharks and fishes.
[Figure: execution timeline with interleaved lines. P0 finishes its first line after n × (Tcomp + β + Tcomm); the data it needs from Pp-1 to start its second line becomes available after 2p × (Tcomp + β + Tcomm).]
Condition for no idle time
Processor P0 finishes computing its first line at time T0 = n × (Tcomp + β + Tcomm)
Processor P0 receives data from processor Pp-1 to compute its second line at time Tp = 2p × (Tcomp + β + Tcomm)
If Tp > T0 we have idle time. Therefore we need Tp ≤ T0, i.e., n ≥ 2p.
If n > 2p, then P0 must store pixels received from Pp-1 until it can start computing on them. There is a trade-off between idle time and memory consumption, with the perfect balance exactly when n = 2p.
This notion of finishing receiving data right when the next computation should start is a common way to obtain "good" schedules (we'll see this again when we talk about Divisible Load Scheduling).
We still have the same problem of expensive communications
Idea #1: cheaper communications
Get rid of most of the network latencies: we need to send longer messages, so we let each processor compute k consecutive pixels at each step, and we initiate the process by having P0 compute some number of pixels, l0.
P0 starts by computing l0 pixels without any communication
P0 sends these l0 pixels to P1 and then computes its next k pixels
P1 can start computing l0-1 pixels in parallel with P0's computation of its next k pixels
when P1 is done computing, it sends l0-1 pixels to P2; P2 can then start computing its first l0-2 pixels
etc.
When one reaches the end of a line, one just starts the next line, in the interleaved pattern we saw before. At each step, except for the first and perhaps the last (depending on whether k divides n-l0), each processor computes k pixels.
Execution steps
[Figure: P0 computes l0 pixels, then k pixels per step; P1 starts with a chunk of l0-1 pixels, P2 with l0-2, P3 with l0-3; after its first chunk every processor computes k pixels per step, moving on to its next (interleaved) line when a line ends.]
The condition for no idle time is n ≥ (k+1)p (we'll prove it later).
The larger k, the cheaper the communications; the smaller k, the shorter the delay between stages.
There is an optimal k.
Idea #2: fewer communications
To do fewer communications, one can assign blocks of r lines to each processor (to increase locality).
There are no communications between lines within a block. The block allocation is interleaved (block cyclic). Example: p = 4, n = 36, r = 3:
P0: lines 0,1,2    12,13,14    24,25,26
P1: lines 3,4,5    15,16,17    27,28,29
P2: lines 6,7,8    18,19,20    30,31,32
P3: lines 9,10,11  21,22,23    33,34,35
Execution Steps
[Figure: execution steps for n = 44, p = 4, r = 3 with k = 4, and for the same parameters with k = 13. With k = 4 we have n ≥ p(r+k) and no idle time; with k = 13 the chunks are so large that processors sit IDLE between stages.]
Condition for no idle time?
We will see that it is n ≥ p(r+k). Since we've reduced communication, we have increased the start-up delay. We now have two trade-offs:
large k and large r: cheap communication
small k and small r: low start-up delay
small k and large r? large k and small r?
We need a thorough performance analysis in order to determine the optimal values of k and r. This is what people who design parallel algorithms do to tune performance.
Performance Analysis
Let us assume that p×r divides n. The sequential time is n²·Tcomp.
As before, the algorithm can be seen as a succession of stages, each with two phases: send and/or receive data (in parallel), then compute.
At each stage a processor receives k pixels from its predecessor, computes r sub-lines of k pixels, and sends the last k pixels to its successor. Therefore:
communication cost per stage: β + k·Tcomm
computation cost per stage: r·k·Tcomp
total cost: Tstage = β + k·Tcomm + r·k·Tcomp
Performance Analysis
First thing to do: figure out at which stage sq each processor q, 0 ≤ q ≤ p-1, starts computing, and how many pixels lq its first chunk contains.
P0 starts computing l0 pixels at stage s0 = 0.
P1 starts computing l1 = (l0 - r) mod k pixels at stage s1 = 1 + ⌈(r - l0)/k⌉.
Example (r = 3, l0 = 12, k = 13): l1 = (12 - 3) mod 13 = 9 and s1 = 1 + ⌈(3 - 12)/13⌉ = 1.
Example (r = 3, l0 = 2, k = 4): l1 = (2 - 3) mod 4 = 3 and s1 = 1 + ⌈(3 - 2)/4⌉ = 2.
More generally: lq = (l0 - qr) mod k and sq = q + ⌈(qr - l0)/k⌉.
Example (r = 3, l0 = 2, k = 4): l2 = (2 - 2×3) mod 4 = 4 (there is a subtlety here: when the mod comes out to 0, the chunk is a full k pixels) and s2 = 2 + ⌈(2×3 - 2)/4⌉ = 3.
Performance Analysis
Now we just need to count the total number of stages, Sp.
Pp-1 is the last processor to complete. After computing its first "chunk", Pp-1 has (n²/(pr) - lp-1 + r - 1)/k chunks left to compute.
Therefore: Sp = sp-1 + (n²/(pr) - lp-1 + r - 1)/k
T// = Tstage × Sp
Performance Analysis
Our analysis is valid only if there is no idle time (we don't really care about modeling the cases with idle time anyway).
P0 receives its first pixels from Pp-1 at stage tp.
At that point, it has already computed l0 + k(tp - 1) pixels of its first line (l0 at first, and then k at each stage).
The number of pixels left to compute in the first line, plus the ones it can compute in the first line of its second block using the pixels sent by Pp-1, must be greater than or equal to k. (A similar argument applies to the first algorithm we looked at, just a bit more complex.)
This condition can be written as n - (l0 + k(tp - 1)) + lp ≥ k
which is equivalent to n ≥ p(r + k)
Performance Analysis
Neglecting constant terms, one obtains T// = (β + k·Tcomm + r·k·Tcomp) × ((p - 1)(1 + r/k) + n²/(prk)), provided that r ≤ n/p and 1 ≤ k ≤ n/p - r.
Given n, p, and r, one can compute the kopt(r) that minimizes T//. Then one plugs that value into T// with different values of r to find the best r.
Voila :)
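One brute-force way to carry out that last step, as a sketch in C (the parameter values and the exhaustive search over r and k are mine; one could instead differentiate T// with respect to k):

    #include <math.h>
    #include <stdio.h>

    /* T// from the analysis above, for given n, p, r, k. */
    static double t_par(double n, double p, double r, double k,
                        double beta, double tcomm, double tcomp)
    {
        double tstage = beta + k * tcomm + r * k * tcomp;
        double stages = (p - 1.0) * (1.0 + r / k) + (n * n) / (p * r * k);
        return tstage * stages;
    }

    int main(void)
    {
        double n = 1024, p = 8, beta = 1e-4, tcomm = 1e-8, tcomp = 1e-9;
        double best = INFINITY;
        int best_r = 1, best_k = 1;
        for (int r = 1; r <= (int)(n / p); r++)
            for (int k = 1; k <= (int)(n / p) - r; k++) {
                double t = t_par(n, p, r, k, beta, tcomm, tcomp);
                if (t < best) { best = t; best_r = r; best_k = k; }
            }
        printf("best r = %d, k = %d, T// = %g\n", best_r, best_k, best);
        return 0;
    }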
Lessons Learned
It's often a good idea to start thinking of the problem in a systolic-array fashion (with as many procs as elements) and then think of a data distribution when fewer procs are available.
Communication costs can be reduced by delaying communications and "bundling" them into a single, longer message.
Communications can also be reduced by "blocking".
Better load balancing is achieved with an interleaved or "cyclic" distribution.
Many algorithms are best implemented with a "block cyclic" data distribution.
Performance analysis is difficult, although it is easy to find big-O estimates.
Choosing the best bundling and blocking factors is non-trivial and completely problem dependent, although there are some rules of thumb.
This must all be put in perspective with the hardware (e.g., cache size).
Good parallel computing is hard.
Solving Linear Systems of Equations
The need to solve linear systems arises in an estimated 75% of all scientific computing problems [Dahlquist 1974].
Gaussian Elimination is perhaps the best-known method. It is based on the fact that the solution of a linear system is invariant under scaling and under row additions:
one can multiply a row of the matrix by a constant as long as one multiplies the corresponding element of the right-hand side by the same constant
one can add a row of the matrix to another one as long as one adds the corresponding elements of the right-hand side
Idea: scale and add equations so as to transform matrix A into an upper triangular matrix:
[Figure: the transformed upper triangular system, in which equation n-i has i unknowns.]
Gaussian Elimination

    [ 1  1  1 ]       [ 0 ]
    [ 1 -2  2 ]  x =  [ 4 ]
    [ 1  2 -1 ]       [ 2 ]

Subtract row 1 from rows 2 and 3:

    [ 1  1  1 ]       [ 0 ]
    [ 0 -3  1 ]  x =  [ 4 ]
    [ 0  1 -2 ]       [ 2 ]

Multiply row 3 by 3 and add row 2:

    [ 1  1  1 ]       [ 0  ]
    [ 0 -3  1 ]  x =  [ 4  ]
    [ 0  0 -5 ]       [ 10 ]

Solving the equations in reverse order (backsolving):
-5x3 = 10         ⇒  x3 = -2
-3x2 + x3 = 4     ⇒  x2 = -2
x1 + x2 + x3 = 0  ⇒  x1 = 4
Gaussian Elimination
The algorithm goes through the matrix from the top-left corner to the bottom-right corner.
The ith step eliminates the non-zero sub-diagonal elements in column i by subtracting the ith row scaled by aji/aii from row j, for j = i+1, ..., n.
[Figure: at step i, the rows above pivot row i are already computed; the elements of column i below the diagonal are to be zeroed; the values to the bottom-right are yet to be updated.]
Sequential Gaussian Elimination
Simple sequential algorithm
// for each column i,
// zero it out below the diagonal by adding
// multiples of row i to later rows
for i = 1 to n-1
  // for each row j below row i
  for j = i+1 to n
    // add a multiple of row i to row j
    for k = i to n
      A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)
Several "tricks" do not change the spirit of the algorithm but make implementation easier and/or more efficient:
the right-hand side is typically kept in column n+1 of the matrix, and one speaks of an augmented matrix
the A(j,i)/A(i,i) term can be computed outside of the innermost loop
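A runnable C rendering of this kernel on an augmented matrix, as a sketch (0-indexed, function name mine, no pivoting, and the multiplier hoisted as suggested above):

    /* Forward elimination on an n x (n+1) augmented matrix stored
       row-major; column n holds the right-hand side. */
    void gaussian_eliminate(int n, double a[])
    {
        for (int i = 0; i < n - 1; i++)          /* for each column i */
            for (int j = i + 1; j < n; j++) {    /* for each row below i */
                double m = a[j*(n+1) + i] / a[i*(n+1) + i];  /* hoisted */
                for (int k = i; k <= n; k++)     /* includes the RHS column */
                    a[j*(n+1) + k] -= m * a[i*(n+1) + k];
            }
    }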
Pivoting: Motivation
A few pathological cases: division by small numbers leads to round-off error in computer arithmetic. Consider the following system:
0.0001 x1 + x2 = 1.000
x1 - x2 = 1.000
Exact solution: x1 = 1.9998... and x2 = 0.99980...
Say we round off to 4 digits after the decimal point.
    [ 10⁻⁴  1 ]       [ 1 ]
    [  1   -1 ]  x =  [ 1 ]

Scale row 1 by 10⁴:

    [ 1   10⁴ ]       [ 10⁴ ]
    [ 1   -1  ]  x =  [ 1   ]

Subtract row 1 from row 2; the exact value -1 - 10⁴ = -10,001 = -0.10001E+5 is rounded to -0.1000E+5 (round-off error):

    [ 1   10⁴ ]       [ 10⁴  ]
    [ 0  -10⁴ ]  x =  [ -10⁴ ]

Backsolving yields x2 = 1 and then x1 = 0, far from the exact solution.
Partial Pivoting
One can just swap rows. The final solution is then closer to the real solution. (Magical!) Numerical stability is an entire field.
Partial Pivoting: for numerical stability, one doesn't take the rows in order, but picks the row among rows i to n that has the largest element in column i. This row is swapped with row i (along with the corresponding elements of the right-hand side) before the subtractions; in practice the swap is not done physically, but rather one keeps an indirection array.
Total Pivoting: look for the greatest element ANYWHERE in the remaining matrix, then swap columns and swap rows.
With the rows swapped:

    [ 1     -1 ]       [ 1 ]
    [ 10⁻⁴   1 ]  x =  [ 1 ]

Elimination now has good round-off behavior:

    [ 1  -1 ]       [ 1 ]
    [ 0   1 ]  x =  [ 1 ]

Backsolving yields x2 = 1 and x1 = 2, close to the exact solution.
Parallel Gaussian Elimination
Assume that we have one processor per matrix element (as in a PRAM or a systolic array)
Each step then consists of:
a Reduction to find the max aji (the max aji is needed to compute the scaling factors)
Broadcasts of the pivot row, followed by independent computation of the scaling factors
Compute: every update needs the scaling factor and the element from the pivot row; the updates are independent computations
Parallel Gaussian Elimination
Once one understands the algorithm assuming one proc per element, one can decide on a data distribution when there are fewer procs:
one column per proc: removes the reduction and some broadcasts
one column block per proc: increases locality when one doesn't have as many procs as columns
one MUST use a cyclic distribution, since the matrix is traversed top-left to bottom-right
good approach: pick a block size and allocate column blocks to processors in an interleaved manner: a 1-D block cyclic distribution
better approach when there are many processors: also partition the rows into blocks, to achieve a 2-D block cyclic distribution
The 2-D block cyclic distribution is sort of the panacea of dense linear algebra, as it allows for good locality and good load balancing, at the cost of more complicated code.
LU Factorization
Gaussian Elimination is simple, but what if we have to solve many Ax = b systems for different values of b? This happens a LOT in real applications.
Another method is the "LU factorization": say we can rewrite A = LU, where L is a lower triangular matrix and U is an upper triangular matrix; computing the factorization costs O(n³). Then Ax = b is written LUx = b:
solve Ly = b      O(n²)
solve Ux = y      O(n²)
[Figure: the two triangular systems. In Ly = b, equation i has i unknowns; in Ux = y, equation n-i has i unknowns; triangular system solves are easy.]
LU Factorization: Principle
It works just like Gaussian Elimination, but instead of zeroing out elements, one "saves" the scaling coefficients. Magically, A = L × U! It should be done with pivoting as well.

    [ 1  2 -1 ]
    [ 4  3  1 ]
    [ 2  2  3 ]

Gaussian elimination on column 1, saving the scaling factors in place of the zeroed elements:

    [ 1   2 -1 ]
    [ 4  -5  5 ]
    [ 2  -2  5 ]

Gaussian elimination on column 2, again saving the scaling factor:

    [ 1    2  -1 ]
    [ 4   -5   5 ]
    [ 2   2/5  3 ]

Reading off the factors:

    L = [ 1    0   0 ]        U = [ 1  2 -1 ]
        [ 4    1   0 ]            [ 0 -5  5 ]
        [ 2   2/5  1 ]            [ 0  0  3 ]
LU Factorization
We're going to look at the simplest possible version:
no pivoting: it just creates a bunch of indirections that are easy but make the code look complicated
no blocking: this is not what one should do on a modern machine (i.e., one with a cache), but again, adding blocking transforms a 5-line algorithm into several pages of code (just go look at the LAPACK code and see how complicated everything looks)
Very often the principle is very simple, but the code is extremely complex, just for optimizations and for dealing with numerical stability.
The ScaLAPACK 2-D block-cyclic LU factorization code is layered on top of many libraries. If you were to write it as a one-level procedure that uses MPI directly, it would be many, many pages: it deals with rectangular blocks, and with all the horrible cases in which nothing divides anything, numbers are prime, nothing's a perfect square, etc.
Sequential Algorithm
The algorithm stores the scaling factors in place:

LU-sequential(A, n) {
  for k = 0 to n-2 {
    // preparing column k
    for i = k+1 to n-1
      aik ← -aik / akk
    // Task Tkj: update of column j
    for j = k+1 to n-1
      for i = k+1 to n-1
        aij ← aij + aik * akj
  }
}
[Figure: the same algorithm, showing that the update of element aij at step k uses the scaling factor aik and the pivot-row element akj.]
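The same kernel as runnable C, a sketch (row-major storage, function name mine; note that, as in the pseudocode, the strict lower triangle ends up holding the negated multipliers):

    /* In-place LU without pivoting: the upper triangle of a becomes U,
       the strict lower triangle holds the (negated) scaling factors. */
    void lu_sequential(int n, double a[])
    {
        for (int k = 0; k < n - 1; k++) {
            for (int i = k + 1; i < n; i++)      /* prepare column k */
                a[i*n + k] = -a[i*n + k] / a[k*n + k];
            for (int j = k + 1; j < n; j++)      /* update column j */
                for (int i = k + 1; i < n; i++)
                    a[i*n + j] += a[i*n + k] * a[k*n + j];
        }
    }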
Parallel LU on a ring
Since the algorithm operates on columns from left to right, we distribute columns to processors.
At each step, the processor that owns column k does the "prepare" task and then broadcasts the bottom part of column k to all others. The other processors can then update.
Assume there is a function alloc(k) that returns the rank of the processor that owns column k.
We will write everything in terms of global indices, so as to avoid annoying index arithmetic.
LU-broadcast algorithm
LU-broadcast(A, n) {
  q ← RANK()
  p ← NUMPROCS()
  for k = 0 to n-2 {
    if (alloc(k) == q)
      // preparing column k
      for i = k+1 to n-1
        buffer[i-k-1] ← aik ← -aik / akk
    broadcast(alloc(k), buffer, n-k-1)
    for j = k+1 to n-1
      if (alloc(j) == q)
        // update of column j
        for i = k+1 to n-1
          aij ← aij + buffer[i-k-1] * akj
  }
}
Dealing with local indices
Assume that p divides n. Each processor needs to store r = n/p columns, and its local indices go from 0 to r-1.
After step k, only columns with index greater than k will be used.
Simple idea: use a local index l that everyone initializes to 0. At step k, processor alloc(k) increments its local index so that next time it will point to its next local column.
LU-broadcast algorithm
...
double a[n][r];    /* the r local columns */

q ← RANK()
p ← NUMPROCS()
l ← 0
for k = 0 to n-2 {
  if (alloc(k) == q) {
    for i = k+1 to n-1
      buffer[i-k-1] ← a[i,l] ← -a[i,l] / a[k,l]
    l ← l+1
  }
  broadcast(alloc(k), buffer, n-k-1)
  for j = l to r-1            // note that this is simpler
    for i = k+1 to n-1
      a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}
We have replaced the aij matrix elements by local array accesses. (It is typically a good idea to first write the algorithm with global indices and then move on to local indices.)
Load-Balancing
How should we distribute the columns? There are fewer and fewer columns as the execution proceeds → cyclic distribution.
More subtle: the amount of computation to be done isn't proportional to the data size. The last column is updated n-1 times, while the first column is updated only once; columns of higher indices require more computation → again a cyclic distribution (proc p-1 may have a bit more work, but asymptotically it's insignificant).
Performance analysis is a bit complex: nβ + (1/2)n²·Tcomm + o(1) for communications; (1/2)n²·Tcomp + o(1) for column preparations; (1/3)(n³/p)·Tcomp + O(n²) for updates.
Pipelining on the Ring
So far, the algorithm we've seen uses a simple broadcast. Nothing was specific to being on a ring of processors, and it's portable: in fact you could write raw MPI that looks just like our pseudo-code and have a very limited, inefficient LU factorization.
But it's not efficient: the n-1 communication steps are not overlapped with computations.
It turns out that on a ring, with a cyclic distribution of the columns, one can interleave pieces of the broadcast with the computation. It almost looks like inserting the source code of the ring broadcast we saw at the very beginning throughout the LU code.
LU-pipeline algorithm

double a[n][r];
q ← RANK()
p ← NUMPROCS()
l ← 0
for k = 0 to n-2 {
  if (k == q mod p) {
    // Prep(k)
    for i = k+1 to n-1
      buffer[i-k-1] ← a[i,l] ← -a[i,l] / a[k,l]
    l ← l+1
    SEND(buffer, n-k-1)
  } else {
    RECV(buffer, n-k-1)
    if (q ≠ (k-1) mod p)
      SEND(buffer, n-k-1)
  }
  for j = l to r-1
    // Update(k,j)
    for i = k+1 to n-1
      a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}
[Figure: the first four stages of the pipelined execution on 4 processors with 16 columns. P0 does Prep(0), Send(0), then Update(0,4), Update(0,8), Update(0,12); P1 does Recv(0), Send(0), then Update(0,1), Update(0,5), Update(0,9), Update(0,13), then Prep(1) and Send(1); and so on around the ring. Some communication occurs in parallel with computation, and a processor sends out data as soon as it receives it.]
How can we do better? In the previous algorithm, a processor does all its updates before doing the Prep() computation that leads to the next communication. But in fact some of these updates can be done later. Idea: send out the pivot column as soon as possible.
Example. In the previous algorithm:
P1: Recv(0), Send(0)
P1: Update(0,1), Update(0,5), Update(0,9), Update(0,13)
P1: Prep(1)
P1: Send(1)
...
In the new algorithm:
P1: Recv(0), Send(0)
P1: Update(0,1)
P1: Prep(1)
P1: Send(1)
P1: Update(0,5), Update(0,9), Update(0,13)
...
[Figure: the first four stages with this look-ahead. After forwarding column k, the processor that owns column k+1 performs only Update(k, k+1), immediately does Prep(k+1) and Send(k+1), and only then finishes its remaining Update(k, j) tasks, so the pivot column travels around the ring with less delay. Communication overlaps with computation, and a processor sends out data as soon as it receives it.]
LU-look-ahead algorithm
q ← RANK()
p ← NUMPROCS()
l ← 0
for k = 0 to n-2 {
  if (k == q mod p) {
    Prep(k)
    SEND(buffer, n-k-1)
    for all j owned by q, j > k: Update(k-1, j)   // deferred updates
    for all j owned by q, j > k: Update(k, j)
  } else {
    RECV(buffer, n-k-1)
    if (q ≠ (k-1) mod p) then SEND(buffer, n-k-1)
    if (q == (k+1) mod p) then
      Update(k, k+1)      // just enough to allow Prep(k+1) at the next step
    else
      for all j owned by q, j > k: Update(k, j)
  }
}
Further improving performance
One can additionally use local overlap of communication and computation: multi-threading, a good non-blocking MPI implementation, etc.
There is much more to be said about parallel LU factorization: many research articles, many libraries available.
Matrix-multiply on a grid/torus
2-D Torus topology
We've looked at a ring, but for some applications it's convenient to use a 2-D grid topology. A 2-D grid with "wrap-around" is called a 2-D torus.
Advanced parallel linear algebra libraries/languages allow one to combine arbitrary data distribution strategies with arbitrary topologies (ScaLAPACK, HPF):
1-D block cyclic on a ring
2-D block cyclic on a 2-D grid
2-D block non-cyclic on a ring
etc.
In practice, for many linear algebra kernels, a 2-D block cyclic distribution on a 2-D grid seems to work best in most situations:
we've seen that blocks are good for locality
we've seen that cyclic is good for load balancing
Semantics of a parallel linear algebra routine?
Centralized: when calling a function (e.g., LU),
the input data is available on a single "master" machine
the input data must then be distributed among the workers
the output data must be undistributed and returned to the "master" machine
This is more natural/easy for the user and allows the library to make data distribution decisions transparently to the user, but it is prohibitively expensive if one does sequences of operations (and one almost always does).
Distributed: when calling a function (e.g., LU), assume that the input is already distributed, and leave the output distributed.
This may lead to having to "redistribute" data in between calls so that distributions match, which is harder for the user and may be costly as well. For instance, one may want to change the block size between calls, or go from a non-cyclic to a cyclic distribution.
Most current software adopts the distributed approach: more work for the user, but more flexibility and control.
Matrix-matrix multiply
Many people have thought about doing a matrix multiply on a 2-D torus. Assume that we have three N×N matrices A, B, and C, and p processors, such that p = q² is a perfect square and our processor grid is q×q. We're looking at a block distribution, but not a cyclic one (again, that would obfuscate the code too much). We're going to look at three algorithms: Cannon, Fox, Snyder.
[Figure: the 4×4 block layouts of A, B, and C on the 4×4 processor grid; processor (i,j) holds blocks Aij, Bij, and Cij.]
Cannon’s Algorithm (1969)
Very simple (it comes from systolic arrays). Starts with a data redistribution of matrices A and B; the goal is to have only neighbor-to-neighbor communications:
A is circularly shifted/rotated "horizontally" so that its diagonal is on the first column of processors
B is circularly shifted/rotated "vertically" so that its diagonal is on the first row of processors
This is called preskewing.
[Figure: the preskewed layout. Row i of the processor grid holds A's blocks shifted left by i (Ai,i, Ai,i+1, ..., wrapping around); column j holds B's blocks shifted up by j (Bj,j, Bj+1,j, ..., wrapping around); C is unchanged.]
Cannon’s Algorithm
Preskewing of A and B
For k = 1 to q in parallel
Local C = C + A*B
Vertical shift of B
Horizontal shift of A
Postskewing of A and B
Of course, computation and communication could be done in an overlapped fashion locally at each processor
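To make the data movement concrete, here is a small serial C simulation of Cannon's algorithm with one matrix element per "processor" (block size 1; the array names and the serial setting are mine):

    #define Q 4   /* q x q grid, one element of each matrix per processor */

    /* C = A*B via Cannon's algorithm: preskew, then Q rounds of
       multiply-accumulate followed by shifting A left and B up. */
    void cannon(double A[Q][Q], double B[Q][Q], double C[Q][Q])
    {
        double tA[Q][Q], tB[Q][Q], sA[Q][Q], sB[Q][Q];
        int i, j;
        /* preskew: shift row i of A left by i, column j of B up by j */
        for (i = 0; i < Q; i++)
            for (j = 0; j < Q; j++) {
                tA[i][j] = A[i][(j + i) % Q];
                tB[i][j] = B[(i + j) % Q][j];
                C[i][j] = 0.0;
            }
        for (int k = 0; k < Q; k++) {
            /* local C = C + A*B on every "processor" */
            for (i = 0; i < Q; i++)
                for (j = 0; j < Q; j++)
                    C[i][j] += tA[i][j] * tB[i][j];
            /* horizontal shift of A, vertical shift of B */
            for (i = 0; i < Q; i++)
                for (j = 0; j < Q; j++) {
                    sA[i][j] = tA[i][(j + 1) % Q];
                    sB[i][j] = tB[(i + 1) % Q][j];
                }
            for (i = 0; i < Q; i++)
                for (j = 0; j < Q; j++) { tA[i][j] = sA[i][j]; tB[i][j] = sB[i][j]; }
        }
    }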
Execution Steps...
[Figure: after the preskew, each processor multiplies its local A and B blocks and accumulates into its C block; then all A blocks shift left by one and all B blocks shift up by one, and the local computation repeats. After q steps every Cij is complete.]
Fox's Algorithm (1987)
Originally developed for Caltech's Hypercube. It uses broadcasts and is also called the broadcast-multiply-roll algorithm:
it broadcasts the diagonals of matrix A (first the main diagonal, then the second diagonal, then the third, ...)
it uses a vertical shift of matrix B
there is no preskewing step
Execution Steps...
[Figure: starting from the initial block layout, A's first diagonal (A00, A11, A22, A33) is broadcast along the processor rows and each processor computes locally with its B block; then B is shifted vertically, A's second diagonal (A01, A12, A23, A30) is broadcast, and another local computation follows; and so on.]
Fox’s Algorithm
// No initial data movement
for k = 1 to q in parallel
  Broadcast A's kth diagonal (along processor rows)
  Local C = C + A*B
  Vertical shift of B
// No final data movement
Note that an additional array is needed to store the incoming diagonal block of A.
Snyder’s Algorithm (1992)
More complex than Cannon's or Fox's:
it first transposes matrix B
it uses reduction operations (sums) on the rows of matrix C
it shifts matrix B
Execution Steps...
[Figure: B is first transposed. Then, at each step, every processor computes a local product, a global sum (reduction) across each processor row accumulates the corresponding blocks of C, and B is shifted vertically. After q steps all blocks of C are complete.]
Complexity Analysis
Sort of cumbersome. There are two models:
4-port model: every processor can communicate with its 4 neighbors in one step; this can match underlying architectures like the Intel Paragon
1-port model: only a single communication at a time per processor
Both models are assumed bi-directional.
One-port results
[Table: one-port complexity expressions for Cannon's, Fox's, and Snyder's algorithms.]
Complexity Results
m in these expressions is the block size. The expressions for the 4-port model are MUCH more complicated. Remember that this is all for non-cyclic distributions; the formulae and the code become very complicated for a full-fledged implementation (nothing divides anything, nothing's a perfect square, etc.).
Performance analysis of real code is known to be hard, and it is done only in a few restricted cases. An interesting approach is to use simulation; this is done in ScaLAPACK, for instance. Essentially: you have written a code so complex that you just run a simulation of it to figure out how fast it goes in different cases.