Algorithmic Techniques on a Ring of Processors
Logical Processor Topology
When writing a (distributed memory) parallel application, one typically organizes processors in a logical topology:
linear array, ring, bi-directional ring, 2-D grid, 2-D torus, one-level tree, fully connected graph, arbitrary graph
We're going to talk about a simple ring. It is a natural choice to partition regular data like matrices, and we will come up with algorithms and performance estimates. Some of these algorithms could be done better on other topologies, like bi-directional rings for instance, but the point is to see how to design and reason about parallel algorithms.
Communication on the Ring
Each processor is identified by a rank: RANK()
There is a way to find the total number of processors: NUMPROCS()
Each processor can send a message to its successor, SEND(addr, L), and receive one from its predecessor, RECV(addr, L)
We're looking only at SPMD programs
[Figure: a unidirectional ring of processors P0, P1, P2, P3, ..., Pp-1]
Cost of communication
It is actually difficult to precisely model the cost of communication, or the way in which communication loads the processor
We will be using a simple model: Time = β + Lτ
β: start-up cost; L: message size; τ: inverse of the bandwidth
We assume that if a message of length L is sent from P0 to Pq, then the communication cost is q(β + Lτ)
There are many assumptions in our model, some not very realistic, but we'll discuss them later
Broadcast
We want to write a program that has Pk send the same message of length L to all other processors: Broadcast(k, addr, L)
On the ring, we just send to the next processor, and so on, with no parallel communications whatsoever
This is of course not the way one should implement a broadcast in practice; MPI uses some type of tree topology
Broadcast
Broadcast(k, addr, L)
  q = RANK()
  p = NUMPROCS()
  if (q == k)
    SEND(addr, L)
  else if (q == (k-1) mod p)
    RECV(addr, L)
  else
    RECV(addr, L)
    SEND(addr, L)
  endif
Assumes a blocking receive
Sending may be non-blocking
The broadcast time is
(p-1)(β + Lτ)
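For concreteness, here is a minimal MPI sketch of this ring broadcast (the function name and message tag are my own; production code would just call MPI_Bcast):

    #include <mpi.h>

    /* Ring broadcast: root k sends to its successor; every other rank
       receives from its predecessor and forwards, except the last one. */
    void ring_broadcast(int k, char *addr, int L, MPI_Comm comm)
    {
        int q, p;
        MPI_Comm_rank(comm, &q);
        MPI_Comm_size(comm, &p);
        int succ = (q + 1) % p, pred = (q - 1 + p) % p;

        if (q == k) {
            MPI_Send(addr, L, MPI_CHAR, succ, 0, comm);
        } else if (q == (k - 1 + p) % p) {      /* last processor on the ring */
            MPI_Recv(addr, L, MPI_CHAR, pred, 0, comm, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(addr, L, MPI_CHAR, pred, 0, comm, MPI_STATUS_IGNORE);
            MPI_Send(addr, L, MPI_CHAR, succ, 0, comm);
        }
    }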
Scatter
Pk stores the message destined to Pq at address addr[q], including a message to itself at addr[k].
The principle is to pipeline the communications, starting with the message destined to Pk-1, the most distant processor.
Scatter
Scatter(k, addr, L)
  q = RANK()
  p = NUMPROCS()
  if (q == k)
    for i = 1 to p-1
      SEND(addr[(k + p - i) mod p], L)
    addr ← addr[k]
  else
    RECV(tempR, L)
    for i = 1 to (k - 1 - q) mod p
      tempS ↔ tempR
      SEND(tempS, L) || RECV(tempR, L)
    addr ← tempR

tempS ↔ tempR: swapping of the send buffer and the receive buffer (a pointer swap)
SEND || RECV: sending and receiving in parallel, with a non-blocking send
Same execution time as the broadcast:
(p-1)(β + Lτ)
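A possible MPI rendering of this pipelined scatter, as a sketch (ring_scatter and out are my names; MPI_Sendrecv plays the role of SEND || RECV):

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    /* Root k owns p messages of L bytes in addr[0..p-1]; afterwards
       every rank holds its own message in out. */
    void ring_scatter(int k, char **addr, char *out, int L, MPI_Comm comm)
    {
        int q, p;
        MPI_Comm_rank(comm, &q);
        MPI_Comm_size(comm, &p);
        int succ = (q + 1) % p, pred = (q - 1 + p) % p;

        if (q == k) {
            for (int i = 1; i <= p - 1; i++)   /* most distant message first */
                MPI_Send(addr[(k + p - i) % p], L, MPI_CHAR, succ, 0, comm);
            memcpy(out, addr[k], L);
        } else {
            char *tempS = malloc(L), *tempR = malloc(L);
            int steps = ((k - 1 - q) % p + p) % p;     /* (k-1-q) mod p */
            MPI_Recv(tempR, L, MPI_CHAR, pred, 0, comm, MPI_STATUS_IGNORE);
            for (int i = 1; i <= steps; i++) {
                char *t = tempS; tempS = tempR; tempR = t;  /* buffer swap */
                MPI_Sendrecv(tempS, L, MPI_CHAR, succ, 0,   /* send || receive */
                             tempR, L, MPI_CHAR, pred, 0, comm,
                             MPI_STATUS_IGNORE);
            }
            memcpy(out, tempR, L);
            free(tempS); free(tempR);
        }
    }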
All-to-all
q = RANK()
p = NUMPROCS()
addr[q] ← my_addr
for i = 1 to p-1
  SEND(addr[(q - i + 1) mod p], L) || RECV(addr[(q - i) mod p], L)
Same execution time as the scatter:
(p-1)(β + Lτ)
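The same loop in MPI, as a sketch (this "all-to-all" is really an all-gather of one block per rank; names are mine):

    #include <mpi.h>

    /* addr[q] initially holds rank q's block of L bytes; after p-1 steps
       of simultaneous send/receive, addr[0..p-1] is complete everywhere. */
    void ring_alltoall(char **addr, int L, MPI_Comm comm)
    {
        int q, p;
        MPI_Comm_rank(comm, &q);
        MPI_Comm_size(comm, &p);
        int succ = (q + 1) % p, pred = (q - 1 + p) % p;
        for (int i = 1; i <= p - 1; i++) {
            int s = ((q - i + 1) % p + p) % p;   /* block to pass along */
            int r = ((q - i) % p + p) % p;       /* block arriving from pred */
            MPI_Sendrecv(addr[s], L, MPI_CHAR, succ, 0,
                         addr[r], L, MPI_CHAR, pred, 0, comm,
                         MPI_STATUS_IGNORE);
        }
    }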
A faster broadcast? How can one accelerate the broadcast? So far we've seen (p-1)(β + Lτ). One can cut the message into r pieces, assuming L is divisible by r. The root processor just sends r messages, one after the other. The performance is as follows.
Consider the last processor to get the last piece of the message. There need to be p-1 steps for the first piece to arrive, which takes (p-1)(β + (L/r)τ). Then the remaining r-1 pieces arrive one after another, which takes (r-1)(β + (L/r)τ). For a total of: (p - 2 + r)(β + (L/r)τ)
A faster broadcast?
The question is: what value of r minimizes (p - 2 + r)(β + (L/r)τ)? One can view the expression as (c + ar)(d + b/r), with four constants a, b, c, d. The non-constant part of the expression is then adr + cb/r, which is known to be minimized for r = sqrt(cb / ad).
Here we have
r_opt = sqrt(L(p-2)τ / β), with the optimal time (sqrt((p-2)β) + sqrt(Lτ))²
which tends to Lτ when L is large, and is independent of p.
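A small C helper evaluating these formulas (a sketch under the model above; the parameter names beta and tau are mine):

    #include <math.h>

    /* Time of the pipelined ring broadcast with the optimal number of
       pieces r_opt = sqrt(L(p-2)tau/beta), i.e. the value of
       (p-2+r)(beta + L*tau/r) at r = r_opt, which equals
       (sqrt((p-2)*beta) + sqrt(L*tau))^2. */
    double pipelined_broadcast_time(int p, double L, double beta, double tau)
    {
        double r = sqrt(L * (p - 2) * tau / beta);   /* r_opt */
        return (p - 2 + r) * (beta + L * tau / r);
    }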
Matrix-Vector product
y = A x
for i = 0 to n-1      /* compute a dot-product */
  y[i] = 0
  for j = 0 to n-1
    y[i] = y[i] + a[i,j] * x[j]
Just distribute the dot-product computations among processors
Let n be the size of the matrix, p the number of processors
Let’s assume that n is divisible by p, and let r = n/p
Each processor needs r rows of the matrix
Matrix-vector Product
What about the distribution of vector x? It could be replicated across all processors, and then all computations would be independent. But since each processor computes only a piece of y, it is more elegant to have x distributed like A, with each processor owning r components of vector x.
This is typically what would be done in real code, so that data is distributed across processors. For a vector it may be more efficient to fully duplicate it, but in general you don't want to do that for matrices or other large data structures.
Each processor has in its memory:
r rows of matrix A, in an array a[r][n]
r components of vector x, in an array my_x[r]
Global vs. Local Indices
Having only a piece of the overall data structure is common:
it makes it possible to partition the workload
it makes it possible to run larger problems by aggregating distributed memory
Typically when writing code like this one has:
a global index (I,J) that references an element of the matrix
a local index (i,j) that references an element of the local array that stores a piece of the matrix
Translation between global and local indices: think of the algorithm in terms of global indices, implement it in terms of local indices.
[Figure: a matrix split into Mblock × Nblock blocks over processors P0..P5 arranged in a 2×3 grid. Global element A[5][7] is local element a[1][3] on P4, and in general a[i][j] on a processor corresponds to A[Mblock*floor(rank/3) + i][Nblock*(rank mod 3) + j].]
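As a sketch, the translation from a global index to (rank, local index) for the block distribution in this figure, with a 2×3 processor grid (the helper and its names are hypothetical, for illustration):

    /* Translate global (I, J) into the owning rank and local (i, j),
       for Mblock x Nblock blocks on a 2 x 3 processor grid. */
    typedef struct { int rank, i, j; } local_index;

    local_index global_to_local(int I, int J, int Mblock, int Nblock)
    {
        local_index li;
        li.rank = (I / Mblock) * 3 + (J / Nblock);  /* 3 = processor columns */
        li.i = I % Mblock;                          /* row within the block  */
        li.j = J % Nblock;                          /* col within the block  */
        return li;
    }

With Mblock = Nblock = 4, global A[5][7] indeed maps to rank 4 and local a[1][3], as in the figure.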
Principle of the Algorithm
[Figure: initial data distribution for n = 8, p = 4, r = 2. P0 holds rows 0-1 of A and (x0, x1); P1 holds rows 2-3 and (x2, x3); P2 holds rows 4-5 and (x4, x5); P3 holds rows 6-7 and (x6, x7).]
Principle of the Algorithm
[Figure: execution steps for n = 8, p = 4, r = 2. At step 1, each processor computes with its own piece of x against the matching columns of its rows (P0 uses columns 0-1 with (x0, x1), P1 uses columns 2-3 with (x2, x3), etc.). At each subsequent step the pieces of x shift by one position along the ring (at step 2, P0 holds (x6, x7), P1 holds (x0, x1), ...), and each processor accumulates the contribution of the columns matching its current piece. In the final state, after p steps, every component of y is complete and each piece of x is back at its original owner.]
The final exchange of vector x is not strictly necessary, but one may want to have it distributed at the end of the computation the way it was distributed at the beginning.
Algorithm
Mat_vec(in A, in x, out y)
q = RANK()
p = NUMPROCS()
tempS ← x      /* my piece of the vector (r elements) */
for step = 0 to p-1
  SEND(tempS, r)
  || RECV(tempR, r)
  || for i = 0 to r-1
       for j = 0 to r-1
         y[i] ← y[i] + a[i, ((q - step) mod p) * r + j] * tempS[j]
  tempS ↔ tempR
Uses two buffers (tempS for sending and tempR for receiving)
Computation and Communications occur in parallel
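A sketch of the same algorithm in MPI, using non-blocking calls for the overlap (function and variable names are mine; assumes p divides n):

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    /* a: local r x n block of rows (row-major); my_x: local r components
       of x; y: local r components of the result, r = n/p. */
    void mat_vec(int n, double *a, double *my_x, double *y, MPI_Comm comm)
    {
        int q, p;
        MPI_Comm_rank(comm, &q);
        MPI_Comm_size(comm, &p);
        int r = n / p, succ = (q + 1) % p, pred = (q - 1 + p) % p;

        double *tempS = malloc(r * sizeof *tempS);
        double *tempR = malloc(r * sizeof *tempR);
        memcpy(tempS, my_x, r * sizeof *tempS);
        memset(y, 0, r * sizeof *y);

        for (int step = 0; step < p; step++) {
            MPI_Request sreq, rreq;
            MPI_Isend(tempS, r, MPI_DOUBLE, succ, 0, comm, &sreq);
            MPI_Irecv(tempR, r, MPI_DOUBLE, pred, 0, comm, &rreq);
            /* columns matching the piece of x currently held in tempS */
            int base = (((q - step) % p + p) % p) * r;
            for (int i = 0; i < r; i++)
                for (int j = 0; j < r; j++)
                    y[i] += a[i * n + base + j] * tempS[j];
            MPI_Wait(&sreq, MPI_STATUS_IGNORE);
            MPI_Wait(&rreq, MPI_STATUS_IGNORE);
            double *t = tempS; tempS = tempR; tempR = t;  /* swap buffers */
        }
        free(tempS); free(tempR);
    }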
Performance
There are p identical steps. During each step, each processor performs three concurrent activities: computing, receiving, and sending.
Each step goes as fast as the slowest of the 3 concurrent activities:
Computation: r²·Tcomp
Communication: β + r·Tcomm
(Tcomp and Tcomm are the times for individual computations and transfers)
T(p) = p × max(r²·Tcomp, β + r·Tcomm)
For fixed p, when n gets large the computation time dominates and the efficiency tends to 1.
Performance (2)
Note that an algorithm that initially broadcasts the entire vector to all processors and then has every processor compute independently would run in time
(p-1)(β + n·Tcomm) + p·r²·Tcomp
which has the same asymptotic performance, is a simpler algorithm, wastes only a tiny bit of memory, and is arguably much less elegant.
It is important to think of simple solutions and see what works best given the expected matrix sizes, etc.
An Image Processing Application
We have seen a few parallel applications with different ways of partitioning the work: matrix-matrix and matrix-vector multiply, sharks and fishes, numerical methods, ...
We’re going to look in depth at another type of parallel application, and see what the performance trade-offs are
We're still working on the ring topology. The application model is representative of several image processing applications.
The sequential application
A generic algorithmic framework that can be used for: distance from contour, computation of an optimal trajectory, and others.
Let P be an n×n grid, where each point is a pixel. A point p not on the edge has 8 neighbors:
NW  N  NE
W   p  E
SW  S  SE
Principle of the algorithm
Sweep through the grid, back and forth: first from the top-left corner to the bottom-right corner (the FW pass), then back from the bottom-right corner to the top-left corner (the BW pass).
FW pass: p ← FW_update(p, W, NW, N, NE)
BW pass: p ← BW_update(p, E, SE, S, SW)
[Figure: the two stencils. The FW stencil uses the W, NW, N, NE neighbors of p; the BW stencil uses E, SE, S, SW.]
Why is this useful? Distance from contour
Let P be a binary image of an object F, with a pixel value of zero if the pixel belongs to F, and of ∞ otherwise
We want to replace each pixel value by the pixel’s distance to F’s complement, according to some metric
Can be done in two passes:
FW: p ← min(p, W+t1, NW+t2, N+t1, NE+t2)
BW: p ← min(p, E+t1, SE+t2, S+t1, SW+t2)
t1 = 1, t2 = ∞: Manhattan distance
t1 = 3, t2 = 4: good approximation of the Euclidean distance
Once one has this distance, it is easy to compute things such as surface, contour length, etc.
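A sequential sketch of the two passes in C (the grid size, names, and the INF encoding are mine; INT_MAX/2 keeps the additions from overflowing even when t2 plays the role of ∞):

    #include <limits.h>

    #define N 512                 /* grid size, for illustration */
    #define INF (INT_MAX / 2)     /* "infinity" that survives one addition */

    static int min2(int a, int b) { return a < b ? a : b; }

    /* img[i][j] is 0 on the object and INF elsewhere; after the two
       passes each pixel holds its weighted distance to the nearest 0. */
    void distance_transform(int img[N][N], int t1, int t2)
    {
        /* FW pass: top-left to bottom-right, stencil W, NW, N, NE */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                int v = img[i][j];
                if (j > 0)            v = min2(v, img[i][j-1]   + t1); /* W  */
                if (i > 0 && j > 0)   v = min2(v, img[i-1][j-1] + t2); /* NW */
                if (i > 0)            v = min2(v, img[i-1][j]   + t1); /* N  */
                if (i > 0 && j < N-1) v = min2(v, img[i-1][j+1] + t2); /* NE */
                img[i][j] = v;
            }
        /* BW pass: bottom-right to top-left, stencil E, SE, S, SW */
        for (int i = N-1; i >= 0; i--)
            for (int j = N-1; j >= 0; j--) {
                int v = img[i][j];
                if (j < N-1)            v = min2(v, img[i][j+1]   + t1); /* E  */
                if (i < N-1 && j < N-1) v = min2(v, img[i+1][j+1] + t2); /* SE */
                if (i < N-1)            v = min2(v, img[i+1][j]   + t1); /* S  */
                if (i < N-1 && j > 0)   v = min2(v, img[i+1][j-1] + t2); /* SW */
                img[i][j] = v;
            }
    }

Called with t1 = 1 and t2 = INF this gives the Manhattan distance; with t1 = 3 and t2 = 4 it gives the Euclidean approximation (scaled by 3).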
Computation of an optimal trajectory: each pixel has a "cost" value, and the goal is to compute minimal-cost trajectories from one pixel to all others. A bit more complicated, with non-trivial updates (Bitz and Kung, 1988). O(n²) passes in the worst case (< n in practice).
...
Parallelization?
Stencil applications are common, and many people have looked at parallelizing them.
Here the stencil is interesting because it is asymmetric and leads to a "wavefront" computation.
We want to do this on a ring of processors. The usual trade-offs apply:
load-balance the work among the processors
don't pay too much for communication
get all processors to start computing early
A Greedy Algorithm
Processors send pixels to their neighbors as soon as they are computed: very small start-up time, good load balancing.
Say we have p = n processors and each line i of the image is assigned to processor Pi.
In a FW phase, as soon as Pi computes a pixel, it must send it to Pi+1.
Given the shape of the stencil, a processor needs two values from its predecessor to start computing a line.
Execution Steps
[Figure: the wavefront in action. P0 computes the pixels of line 0 at steps 0, 1, 2, ...; P1 computes the pixels of line 1 starting at step 2; P2 starts at step 4; in general Pi computes pixel (i,j) at step 2i+j, so the computed region sweeps across the grid as a diagonal wavefront.]
At "step" 2i+j, processor Pi:
receives pixel (i-1, j+1) from Pi-1
computes pixel (i, j)
sends pixel (i, j) to Pi+1
Note the similarity to a systolic network.
Performance? Assume that sends are non-blocking and receives are blocking. For each row, each processor follows the sequence: get a pixel; compute; send a pixel || get a pixel.
[Figure: execution timeline. Each processor alternates C (compute) and S (send, receive, or send || receive); Pi starts after a delay of 2i × (Tcomp + β + Tcomm), and a full line then takes n × (Tcomp + β + Tcomm).]
Tcomm = time to send a pixel to a neighbor
Tcomp = time to compute a pixel
Performance
Processor Pp-1 is the last one to finish. It finishes at time:
T = (3n-2)·Tcomp + (3n-3)(β + Tcomm)
Therefore we have O(n) complexity.
We would have stopped here in the land of PRAMs, etc.
The problem is the 3n term: in practice β is orders of magnitude larger than Tcomm, and short messages are known to be a bad idea on most platforms. So we have:
a small start-up time
reasonably good load balancing
expensive communications
What if p < n? This is the realistic case. When p < n, one must partition the data; we assume that p divides n. One could give the first n/p lines to P0, the next n/p lines to P1, etc., but then the last processor would start computing very late due to the stencil shape. A better way is to interleave the image lines between processors: a classical load-balancing technique that we mentioned for sharks and fishes.
[Figure: execution timeline with interleaved lines. P0 finishes its first line after n × (Tcomp + β + Tcomm); the data it needs from Pp-1 to start its second line becomes available after 2p × (Tcomp + β + Tcomm).]
Condition for no idle time
Processor P0 finishes computing its first line at time T0 = n × (Tcomp + β + Tcomm)
Processor P0 receives data from processor Pp-1 to compute its second line at time Tp = 2p × (Tcomp + β + Tcomm)
If Tp > T0 we have idle time. Therefore we need Tp ≤ T0, i.e., n ≥ 2p.
If n > 2p, then P0 must store pixels received from Pp-1 until it can start computing on them. There is a trade-off between idle time and memory consumption, with the perfect balance exactly when n = 2p.
This notion of finishing receiving data right when the next computation should start is a common way to obtain "good" schedules (we'll see this again when we talk about Divisible Load Scheduling).
We still have the same problem of expensive communications
Idea #1: cheaper communications
Get rid of most of the network latencies: we need to send longer messages, so we let each processor compute k consecutive pixels at each step, and we initiate the process by having P0 compute some number of pixels, l0.
P0 starts by computing l0 pixels without any communication
P0 sends these l0 pixels to P1 and then computes its next k pixels
P1 can start computing l0-1 pixels in parallel with P0's computation of its next k pixels
when P1 is done computing, it sends l0-1 pixels to P2; P2 can then start computing its first l0-2 pixels
etc.
When one reaches the end of a line, one just starts the next line, in the interleaved pattern we saw before. At each step, except for the first and perhaps the last (depending on whether k divides n-l0), each processor computes k pixels.
Execution steps
[Figure: P0 computes l0 pixels, then k pixels per step; P1 starts with a chunk of l0-1 pixels, P2 with l0-2, P3 with l0-3; after its first chunk every processor computes k pixels per step, moving on to its next (interleaved) line when a line ends.]
The condition for no idle time is n ≥ (k+1)p (we'll prove it later).
The larger k, the cheaper the communications; the smaller k, the shorter the delay between stages.
There is an optimal k.
Idea #2: fewer communications
To do fewer communications, one can assign blocks of r lines to each processor (to increase locality).
There are no communications between lines within a block. The block allocation is interleaved (block cyclic). Example: p = 4, n = 36, r = 3:
P0: lines 0,1,2    12,13,14    24,25,26
P1: lines 3,4,5    15,16,17    27,28,29
P2: lines 6,7,8    18,19,20    30,31,32
P3: lines 9,10,11  21,22,23    33,34,35
Execution Steps
[Figure: execution steps for n = 44, p = 4, r = 3 with k = 4, and for the same parameters with k = 13. With k = 4 we have n ≥ p(r+k) and no idle time; with k = 13 the chunks are so large that processors sit IDLE between stages.]
Condition for no idle time?
We will see that it is n ≥ p(r+k). Since we've reduced communication, we have increased the start-up delay. We now have two trade-offs:
large k and large r: cheap communication
small k and small r: low start-up delay
small k and large r? large k and small r?
We need a thorough performance analysis in order to determine the optimal values of k and r. This is what people who design parallel algorithms do to tune performance.
Performance Analysis
Let us assume that p×r divides n. The sequential time is n²·Tcomp.
As before, the algorithm can be seen as a succession of stages, each with two phases: send and/or receive data (in parallel), then compute.
At each stage a processor receives k pixels from its predecessor, computes r sub-lines of k pixels, and sends the last k pixels to its successor. Therefore:
communication cost per stage: β + k·Tcomm
computation cost per stage: r·k·Tcomp
total cost: Tstage = β + k·Tcomm + r·k·Tcomp
Performance Analysis
First thing to do: figure out at which stage sq each processor q, 0 ≤ q ≤ p-1, starts computing, and how many pixels lq its first chunk contains.
P0 starts computing l0 pixels at stage s0 = 0.
P1 starts computing l1 = (l0 - r) mod k pixels at stage s1 = 1 + ⌈(r - l0)/k⌉.
Example (r = 3, l0 = 12, k = 13): l1 = (12 - 3) mod 13 = 9 and s1 = 1 + ⌈(3 - 12)/13⌉ = 1.
Example (r = 3, l0 = 2, k = 4): l1 = (2 - 3) mod 4 = 3 and s1 = 1 + ⌈(3 - 2)/4⌉ = 2.
More generally: lq = (l0 - qr) mod k and sq = q + ⌈(qr - l0)/k⌉.
Example (r = 3, l0 = 2, k = 4): l2 = (2 - 2×3) mod 4 = 4 (there is a subtlety here: when the mod comes out to 0, the chunk is a full k pixels) and s2 = 2 + ⌈(2×3 - 2)/4⌉ = 3.
Performance Analysis
Now we just need to count the total number of stages, Sp.
Pp-1 is the last processor to complete. After computing its first "chunk", Pp-1 has (n²/(pr) - lp-1 + r - 1)/k chunks left to compute.
Therefore: Sp = sp-1 + (n²/(pr) - lp-1 + r - 1)/k
T// = Tstage × Sp
Performance Analysis
Our analysis is valid only if there is no idle time (we don't really care about modeling the cases with idle time anyway).
P0 receives its first pixels from Pp-1 at stage tp.
At that point, it has already computed l0 + k(tp - 1) pixels of its first line (l0 at first, and then k at each stage).
The number of pixels left to compute in the first line, plus the ones it can compute in the first line of its second block using the pixels sent by Pp-1, must be greater than or equal to k. (A similar argument applies to the first algorithm we looked at, just a bit more complex.)
This condition can be written as n - (l0 + k(tp - 1)) + lp ≥ k
which is equivalent to n ≥ p(r + k)
Performance Analysis
Neglecting constant terms, one obtains T// = (β + k·Tcomm + r·k·Tcomp) × ((p - 1)(1 + r/k) + n²/(prk)), provided that r ≤ n/p and 1 ≤ k ≤ n/p - r.
Given n, p, and r, one can compute the kopt(r) that minimizes T//. Then one plugs that value into T// with different values of r to find the best r.
Voila :)
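One brute-force way to carry out that last step, as a sketch in C (the parameter values and the exhaustive search over r and k are mine; one could instead differentiate T// with respect to k):

    #include <math.h>
    #include <stdio.h>

    /* T// from the analysis above, for given n, p, r, k. */
    static double t_par(double n, double p, double r, double k,
                        double beta, double tcomm, double tcomp)
    {
        double tstage = beta + k * tcomm + r * k * tcomp;
        double stages = (p - 1.0) * (1.0 + r / k) + (n * n) / (p * r * k);
        return tstage * stages;
    }

    int main(void)
    {
        double n = 1024, p = 8, beta = 1e-4, tcomm = 1e-8, tcomp = 1e-9;
        double best = INFINITY;
        int best_r = 1, best_k = 1;
        for (int r = 1; r <= (int)(n / p); r++)
            for (int k = 1; k <= (int)(n / p) - r; k++) {
                double t = t_par(n, p, r, k, beta, tcomm, tcomp);
                if (t < best) { best = t; best_r = r; best_k = k; }
            }
        printf("best r = %d, k = %d, T// = %g\n", best_r, best_k, best);
        return 0;
    }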
Lessons Learned
It's often a good idea to start thinking of the problem in a systolic-array fashion (with as many procs as elements) and then think of a data distribution when fewer procs are available.
Communication costs can be reduced by delaying communications and "bundling" them into a single, longer message.
Communications can also be reduced by "blocking".
Better load balancing is achieved with an interleaved or "cyclic" distribution.
Many algorithms are best implemented with a "block cyclic" data distribution.
Performance analysis is difficult, although it is easy to find big-O estimates.
Choosing the best bundling and blocking factors is non-trivial and completely problem dependent, although there are some rules of thumb.
This must all be put in perspective with the hardware (e.g., cache size).
Good parallel computing is hard.
Solving Linear Systems of Equations
The need to solve linear systems arises in an estimated 75% of all scientific computing problems [Dahlquist 1974].
Gaussian Elimination is perhaps the best-known method. It is based on the fact that the solution of a linear system is invariant under scaling and under row additions:
one can multiply a row of the matrix by a constant as long as one multiplies the corresponding element of the right-hand side by the same constant
one can add a row of the matrix to another one as long as one adds the corresponding elements of the right-hand side
Idea: scale and add equations so as to transform matrix A into an upper triangular matrix:
[Figure: the transformed upper triangular system, in which equation n-i has i unknowns.]
Gaussian Elimination

    [ 1  1  1 ]       [ 0 ]
    [ 1 -2  2 ]  x =  [ 4 ]
    [ 1  2 -1 ]       [ 2 ]

Subtract row 1 from rows 2 and 3:

    [ 1  1  1 ]       [ 0 ]
    [ 0 -3  1 ]  x =  [ 4 ]
    [ 0  1 -2 ]       [ 2 ]

Multiply row 3 by 3 and add row 2:

    [ 1  1  1 ]       [ 0  ]
    [ 0 -3  1 ]  x =  [ 4  ]
    [ 0  0 -5 ]       [ 10 ]

Solving the equations in reverse order (backsolving):
-5x3 = 10         ⇒  x3 = -2
-3x2 + x3 = 4     ⇒  x2 = -2
x1 + x2 + x3 = 0  ⇒  x1 = 4
Gaussian Elimination
The algorithm goes through the matrix from the top-left corner to the bottom-right corner.
The ith step eliminates the non-zero sub-diagonal elements in column i by subtracting the ith row scaled by aji/aii from row j, for j = i+1, ..., n.
[Figure: at step i, the rows above pivot row i are already computed; the elements of column i below the diagonal are to be zeroed; the values to the bottom-right are yet to be updated.]
Sequential Gaussian Elimination
Simple sequential algorithm
// for each column i,
// zero it out below the diagonal by adding
// multiples of row i to later rows
for i = 1 to n-1
  // for each row j below row i
  for j = i+1 to n
    // add a multiple of row i to row j
    for k = i to n
      A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)
Several "tricks" do not change the spirit of the algorithm but make implementation easier and/or more efficient:
the right-hand side is typically kept in column n+1 of the matrix, and one speaks of an augmented matrix
the A(j,i)/A(i,i) term can be computed outside of the innermost loop
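A runnable C rendering of this kernel on an augmented matrix, as a sketch (0-indexed, function name mine, no pivoting, and the multiplier hoisted as suggested above):

    /* Forward elimination on an n x (n+1) augmented matrix stored
       row-major; column n holds the right-hand side. */
    void gaussian_eliminate(int n, double a[])
    {
        for (int i = 0; i < n - 1; i++)          /* for each column i */
            for (int j = i + 1; j < n; j++) {    /* for each row below i */
                double m = a[j*(n+1) + i] / a[i*(n+1) + i];  /* hoisted */
                for (int k = i; k <= n; k++)     /* includes the RHS column */
                    a[j*(n+1) + k] -= m * a[i*(n+1) + k];
            }
    }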
Pivoting: Motivation
A few pathological cases: division by small numbers leads to round-off error in computer arithmetic. Consider the following system:
0.0001 x1 + x2 = 1.000
x1 - x2 = 1.000
Exact solution: x1 = 1.9998... and x2 = 0.99980...
Say we round off to 4 digits after the decimal point.
    [ 10⁻⁴  1 ]       [ 1 ]
    [  1   -1 ]  x =  [ 1 ]

Scale row 1 by 10⁴:

    [ 1   10⁴ ]       [ 10⁴ ]
    [ 1   -1  ]  x =  [ 1   ]

Subtract row 1 from row 2; the exact value -1 - 10⁴ = -10,001 = -0.10001E+5 is rounded to -0.1000E+5 (round-off error):

    [ 1   10⁴ ]       [ 10⁴  ]
    [ 0  -10⁴ ]  x =  [ -10⁴ ]

Backsolving yields x2 = 1 and then x1 = 0, far from the exact solution.
Partial Pivoting
One can just swap rows. The final solution is then closer to the real solution. (Magical!) Numerical stability is an entire field.
Partial Pivoting: for numerical stability, one doesn't take the rows in order, but picks the row among rows i to n that has the largest element in column i. This row is swapped with row i (along with the corresponding elements of the right-hand side) before the subtractions; in practice the swap is not done physically, but rather one keeps an indirection array.
Total Pivoting: look for the greatest element ANYWHERE in the remaining matrix, then swap columns and swap rows.
With the rows swapped:

    [ 1     -1 ]       [ 1 ]
    [ 10⁻⁴   1 ]  x =  [ 1 ]

Elimination now has good round-off behavior:

    [ 1  -1 ]       [ 1 ]
    [ 0   1 ]  x =  [ 1 ]

Backsolving yields x2 = 1 and x1 = 2, close to the exact solution.
Parallel Gaussian Elimination
Assume that we have one processor per matrix element (as in a PRAM or a systolic array)
Each step then consists of:
a Reduction to find the max aji (the max aji is needed to compute the scaling factors)
Broadcasts of the pivot row, followed by independent computation of the scaling factors
Compute: every update needs the scaling factor and the element from the pivot row; the updates are independent computations
Parallel Gaussian Elimination
Once one understands the algorithm assuming one proc per element, one can decide on a data distribution when there are fewer procs:
one column per proc: removes the reduction and some broadcasts
one column block per proc: increases locality when one doesn't have as many procs as columns
one MUST use a cyclic distribution, since the matrix is traversed top-left to bottom-right
good approach: pick a block size and allocate column blocks to processors in an interleaved manner: a 1-D block cyclic distribution
better approach when there are many processors: also partition the rows into blocks, to achieve a 2-D block cyclic distribution
The 2-D block cyclic distribution is sort of the panacea of dense linear algebra, as it allows for good locality and good load balancing, at the cost of more complicated code.
LU Factorization
Gaussian Elimination is simple, but what if we have to solve many Ax = b systems for different values of b? This happens a LOT in real applications.
Another method is the "LU factorization": say we can rewrite A = LU, where L is a lower triangular matrix and U is an upper triangular matrix; computing the factorization costs O(n³). Then Ax = b is written LUx = b:
solve Ly = b      O(n²)
solve Ux = y      O(n²)
[Figure: the two triangular systems. In Ly = b, equation i has i unknowns; in Ux = y, equation n-i has i unknowns; triangular system solves are easy.]
LU Factorization: Principle
It works just like Gaussian Elimination, but instead of zeroing out elements, one "saves" the scaling coefficients. Magically, A = L × U! It should be done with pivoting as well.

    [ 1  2 -1 ]
    [ 4  3  1 ]
    [ 2  2  3 ]

Gaussian elimination on column 1, saving the scaling factors in place of the zeroed elements:

    [ 1   2 -1 ]
    [ 4  -5  5 ]
    [ 2  -2  5 ]

Gaussian elimination on column 2, again saving the scaling factor:

    [ 1    2  -1 ]
    [ 4   -5   5 ]
    [ 2   2/5  3 ]

Reading off the factors:

    L = [ 1    0   0 ]        U = [ 1  2 -1 ]
        [ 4    1   0 ]            [ 0 -5  5 ]
        [ 2   2/5  1 ]            [ 0  0  3 ]
LU Factorization
We're going to look at the simplest possible version:
no pivoting: it just creates a bunch of indirections that are easy but make the code look complicated
no blocking: this is not what one should do on a modern machine (i.e., one with a cache), but again, adding blocking transforms a 5-line algorithm into several pages of code (just go look at the LAPACK code and see how complicated everything looks)
Very often the principle is very simple, but the code is extremely complex, just for optimizations and for dealing with numerical stability.
The ScaLAPACK 2-D block-cyclic LU factorization code is layered on top of many libraries. If you were to write it as a one-level procedure that uses MPI directly, it would be many, many pages: it deals with rectangular blocks, and with all the horrible cases in which nothing divides anything, numbers are prime, nothing's a perfect square, etc.
Sequential Algorithm
The algorithm stores the scaling factors in place:

LU-sequential(A, n) {
  for k = 0 to n-2 {
    // preparing column k
    for i = k+1 to n-1
      aik ← -aik / akk
    // Task Tkj: update of column j
    for j = k+1 to n-1
      for i = k+1 to n-1
        aij ← aij + aik * akj
  }
}
[Figure: the same algorithm, showing that the update of element aij at step k uses the scaling factor aik and the pivot-row element akj.]
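The same kernel as runnable C, a sketch (row-major storage, function name mine; note that, as in the pseudocode, the strict lower triangle ends up holding the negated multipliers):

    /* In-place LU without pivoting: the upper triangle of a becomes U,
       the strict lower triangle holds the (negated) scaling factors. */
    void lu_sequential(int n, double a[])
    {
        for (int k = 0; k < n - 1; k++) {
            for (int i = k + 1; i < n; i++)      /* prepare column k */
                a[i*n + k] = -a[i*n + k] / a[k*n + k];
            for (int j = k + 1; j < n; j++)      /* update column j */
                for (int i = k + 1; i < n; i++)
                    a[i*n + j] += a[i*n + k] * a[k*n + j];
        }
    }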
Parallel LU on a ring
Since the algorithm operates on columns from left to right, we distribute columns to processors.
At each step, the processor that owns column k does the "prepare" task and then broadcasts the bottom part of column k to all others. The other processors can then update.
Assume there is a function alloc(k) that returns the rank of the processor that owns column k.
We will write everything in terms of global indices, so as to avoid annoying index arithmetic.
LU-broadcast algorithm
LU-broadcast(A, n) {
  q ← RANK()
  p ← NUMPROCS()
  for k = 0 to n-2 {
    if (alloc(k) == q)
      // preparing column k
      for i = k+1 to n-1
        buffer[i-k-1] ← aik ← -aik / akk
    broadcast(alloc(k), buffer, n-k-1)
    for j = k+1 to n-1
      if (alloc(j) == q)
        // update of column j
        for i = k+1 to n-1
          aij ← aij + buffer[i-k-1] * akj
  }
}
Dealing with local indices
Assume that p divides n. Each processor needs to store r = n/p columns, and its local indices go from 0 to r-1.
After step k, only columns with index greater than k will be used.
Simple idea: use a local index l that everyone initializes to 0. At step k, processor alloc(k) increments its local index so that next time it will point to its next local column.
LU-broadcast algorithm
...
double a[n][r];    /* the r local columns */

q ← RANK()
p ← NUMPROCS()
l ← 0
for k = 0 to n-2 {
  if (alloc(k) == q) {
    for i = k+1 to n-1
      buffer[i-k-1] ← a[i,l] ← -a[i,l] / a[k,l]
    l ← l+1
  }
  broadcast(alloc(k), buffer, n-k-1)
  for j = l to r-1            // note that this is simpler
    for i = k+1 to n-1
      a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}
We have replaced the aij matrix elements by local array accesses. (It is typically a good idea to first write the algorithm with global indices and then move on to local indices.)
Load-Balancing
How should we distribute the columns? There are fewer and fewer columns as the execution proceeds → cyclic distribution.
More subtle: the amount of computation to be done isn't proportional to the data size. The last column is updated n-1 times, while the first column is updated only once; columns of higher indices require more computation → again a cyclic distribution (proc p-1 may have a bit more work, but asymptotically it's insignificant).
Performance analysis is a bit complex: nβ + (1/2)n²·Tcomm + o(1) for communications; (1/2)n²·Tcomp + o(1) for column preparations; (1/3)(n³/p)·Tcomp + O(n²) for updates.
Pipelining on the Ring
So far, the algorithm we've seen uses a simple broadcast. Nothing was specific to being on a ring of processors, and it's portable: in fact you could write raw MPI that looks just like our pseudo-code and have a very limited, inefficient LU factorization.
But it's not efficient: the n-1 communication steps are not overlapped with computations.
It turns out that on a ring, with a cyclic distribution of the columns, one can interleave pieces of the broadcast with the computation. It almost looks like inserting the source code of the ring broadcast we saw at the very beginning throughout the LU code.
LU-pipeline algorithm

double a[n][r];
q ← RANK()
p ← NUMPROCS()
l ← 0
for k = 0 to n-2 {
  if (k == q mod p) {
    // Prep(k)
    for i = k+1 to n-1
      buffer[i-k-1] ← a[i,l] ← -a[i,l] / a[k,l]
    l ← l+1
    SEND(buffer, n-k-1)
  } else {
    RECV(buffer, n-k-1)
    if (q ≠ (k-1) mod p)
      SEND(buffer, n-k-1)
  }
  for j = l to r-1
    // Update(k,j)
    for i = k+1 to n-1
      a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}
[Figure: the first four stages of the pipelined execution on 4 processors with 16 columns. P0 does Prep(0), Send(0), then Update(0,4), Update(0,8), Update(0,12); P1 does Recv(0), Send(0), then Update(0,1), Update(0,5), Update(0,9), Update(0,13), then Prep(1) and Send(1); and so on around the ring. Some communication occurs in parallel with computation, and a processor sends out data as soon as it receives it.]
How can we do better? In the previous algorithm, a processor does all its updates before doing the Prep() computation that leads to the next communication. But in fact some of these updates can be done later. Idea: send out the pivot column as soon as possible.
Example. In the previous algorithm:
P1: Recv(0), Send(0)
P1: Update(0,1), Update(0,5), Update(0,9), Update(0,13)
P1: Prep(1)
P1: Send(1)
...
In the new algorithm:
P1: Recv(0), Send(0)
P1: Update(0,1)
P1: Prep(1)
P1: Send(1)
P1: Update(0,5), Update(0,9), Update(0,13)
...
[Figure: the first four stages with this look-ahead. After forwarding column k, the processor that owns column k+1 performs only Update(k, k+1), immediately does Prep(k+1) and Send(k+1), and only then finishes its remaining Update(k, j) tasks, so the pivot column travels around the ring with less delay. Communication overlaps with computation, and a processor sends out data as soon as it receives it.]
LU-look-ahead algorithm
q ← RANK()
p ← NUMPROCS()
l ← 0
for k = 0 to n-2 {
  if (k == q mod p) {
    Prep(k)
    SEND(buffer, n-k-1)
    for all j owned by q, j > k: Update(k-1, j)   // deferred updates
    for all j owned by q, j > k: Update(k, j)
  } else {
    RECV(buffer, n-k-1)
    if (q ≠ (k-1) mod p) then SEND(buffer, n-k-1)
    if (q == (k+1) mod p) then
      Update(k, k+1)      // just enough to allow Prep(k+1) at the next step
    else
      for all j owned by q, j > k: Update(k, j)
  }
}
Further improving performance
One can additionally use local overlap of communication and computation: multi-threading, a good non-blocking MPI implementation, etc.
There is much more to be said about parallel LU factorization: many research articles, many libraries available.
Matrix-multiply on a grid/torus
2-D Torus topology
We've looked at a ring, but for some applications it's convenient to use a 2-D grid topology. A 2-D grid with "wrap-around" is called a 2-D torus.
Advanced parallel linear algebra libraries/languages allow one to combine arbitrary data distribution strategies with arbitrary topologies (ScaLAPACK, HPF):
1-D block cyclic on a ring
2-D block cyclic on a 2-D grid
2-D block non-cyclic on a ring
etc.
In practice, for many linear algebra kernels, a 2-D block cyclic distribution on a 2-D grid seems to work best in most situations:
we've seen that blocks are good for locality
we've seen that cyclic is good for load balancing
Semantics of a parallel linear algebra routine?
Centralized: when calling a function (e.g., LU),
the input data is available on a single "master" machine
the input data must then be distributed among the workers
the output data must be undistributed and returned to the "master" machine
This is more natural/easy for the user and allows the library to make data distribution decisions transparently to the user, but it is prohibitively expensive if one does sequences of operations (and one almost always does).
Distributed: when calling a function (e.g., LU), assume that the input is already distributed, and leave the output distributed.
This may lead to having to "redistribute" data in between calls so that distributions match, which is harder for the user and may be costly as well. For instance, one may want to change the block size between calls, or go from a non-cyclic to a cyclic distribution.
Most current software adopts the distributed approach: more work for the user, but more flexibility and control.
Matrix-matrix multiply
Many people have thought about doing a matrix multiply on a 2-D torus. Assume that we have three N×N matrices A, B, and C, and p processors, such that p = q² is a perfect square and our processor grid is q×q. We're looking at a block distribution, but not a cyclic one (again, that would obfuscate the code too much). We're going to look at three algorithms: Cannon, Fox, Snyder.
[Figure: the 4×4 block layouts of A, B, and C on the 4×4 processor grid; processor (i,j) holds blocks Aij, Bij, and Cij.]
Cannon’s Algorithm (1969)
Very simple (it comes from systolic arrays). Starts with a data redistribution of matrices A and B; the goal is to have only neighbor-to-neighbor communications:
A is circularly shifted/rotated "horizontally" so that its diagonal is on the first column of processors
B is circularly shifted/rotated "vertically" so that its diagonal is on the first row of processors
This is called preskewing.
[Figure: the preskewed layout. Row i of the processor grid holds A's blocks shifted left by i (Ai,i, Ai,i+1, ..., wrapping around); column j holds B's blocks shifted up by j (Bj,j, Bj+1,j, ..., wrapping around); C is unchanged.]
Cannon’s Algorithm
Preskewing of A and B
For k = 1 to q in parallel
Local C = C + A*B
Vertical shift of B
Horizontal shift of A
Postskewing of A and B
Of course, computation and communication could be done in an overlapped fashion locally at each processor
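To make the data movement concrete, here is a small serial C simulation of Cannon's algorithm with one matrix element per "processor" (block size 1; the array names and the serial setting are mine):

    #define Q 4   /* q x q grid, one element of each matrix per processor */

    /* C = A*B via Cannon's algorithm: preskew, then Q rounds of
       multiply-accumulate followed by shifting A left and B up. */
    void cannon(double A[Q][Q], double B[Q][Q], double C[Q][Q])
    {
        double tA[Q][Q], tB[Q][Q], sA[Q][Q], sB[Q][Q];
        int i, j;
        /* preskew: shift row i of A left by i, column j of B up by j */
        for (i = 0; i < Q; i++)
            for (j = 0; j < Q; j++) {
                tA[i][j] = A[i][(j + i) % Q];
                tB[i][j] = B[(i + j) % Q][j];
                C[i][j] = 0.0;
            }
        for (int k = 0; k < Q; k++) {
            /* local C = C + A*B on every "processor" */
            for (i = 0; i < Q; i++)
                for (j = 0; j < Q; j++)
                    C[i][j] += tA[i][j] * tB[i][j];
            /* horizontal shift of A, vertical shift of B */
            for (i = 0; i < Q; i++)
                for (j = 0; j < Q; j++) {
                    sA[i][j] = tA[i][(j + 1) % Q];
                    sB[i][j] = tB[(i + 1) % Q][j];
                }
            for (i = 0; i < Q; i++)
                for (j = 0; j < Q; j++) { tA[i][j] = sA[i][j]; tB[i][j] = sB[i][j]; }
        }
    }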
Execution Steps...
[Figure: after the preskew, each processor multiplies its local A and B blocks and accumulates into its C block; then all A blocks shift left by one and all B blocks shift up by one, and the local computation repeats. After q steps every Cij is complete.]
Fox's Algorithm (1987)
Originally developed for Caltech's Hypercube. It uses broadcasts and is also called the broadcast-multiply-roll algorithm:
it broadcasts the diagonals of matrix A (first the main diagonal, then the second diagonal, then the third, ...)
it uses a vertical shift of matrix B
there is no preskewing step
Execution Steps...
[Figure: starting from the initial block layout, A's first diagonal (A00, A11, A22, A33) is broadcast along the processor rows and each processor computes locally with its B block; then B is shifted vertically, A's second diagonal (A01, A12, A23, A30) is broadcast, and another local computation follows; and so on.]
Fox’s Algorithm
// No initial data movement
for k = 1 to q in parallel
  Broadcast A's kth diagonal (along processor rows)
  Local C = C + A*B
  Vertical shift of B
// No final data movement
Note that an additional array is needed to store the incoming diagonal block of A.
Snyder’s Algorithm (1992)
More complex than Cannon's or Fox's:
it first transposes matrix B
it uses reduction operations (sums) on the rows of matrix C
it shifts matrix B
Execution Steps...
[Figure: B is first transposed. Then, at each step, every processor computes a local product, a global sum (reduction) across each processor row accumulates the corresponding blocks of C, and B is shifted vertically. After q steps all blocks of C are complete.]
Complexity Analysis
Sort of cumbersome. There are two models:
4-port model: every processor can communicate with its 4 neighbors in one step; this can match underlying architectures like the Intel Paragon
1-port model: only a single communication at a time per processor
Both models are assumed bi-directional.
One-port results
[Table: one-port complexity expressions for Cannon's, Fox's, and Snyder's algorithms.]
Complexity Results
m in these expressions is the block size. The expressions for the 4-port model are MUCH more complicated. Remember that this is all for non-cyclic distributions; the formulae and the code become very complicated for a full-fledged implementation (nothing divides anything, nothing's a perfect square, etc.).
Performance analysis of real code is known to be hard, and it is done only in a few restricted cases. An interesting approach is to use simulation; this is done in ScaLAPACK, for instance. Essentially: you have written a code so complex that you just run a simulation of it to figure out how fast it goes in different cases.