All-Reduce and Prefix-Sum Operations


Transcript of All-Reduce and Prefix-Sum Operations

Page 1: All-Reduce and Prefix-Sum Operations

All-Reduce and Prefix-Sum Operations

• In all-reduce, each node starts with a buffer of size m, and the final result of the operation is an identical buffer of size m on each node, formed by combining the original p buffers using an associative operator.

• All-reduce is semantically equivalent to an all-to-one reduction followed by a one-to-all broadcast, but that formulation is not the most efficient. Instead, the operation uses the pattern of all-to-all broadcast; the only difference is that the message size does not grow from step to step. The time for this operation is (ts + tw m) log p.

• All-reduce is different from all-to-all reduction, in which p simultaneous all-to-one reductions take place, each with a different destination for the result.

Page 2: All-Reduce and Prefix-Sum Operations

The Prefix-Sum Operation

• Given p numbers n0, n1, …, np-1 (one on each node), the problem is to compute the sums sk = ∑i=0..k ni (that is, sk = n0 + n1 + ··· + nk) for all k between 0 and p − 1.

• Initially, nk resides on the node labeled k, and at the end of the procedure, the same node holds sk.

Page 3: All-Reduce and Prefix-Sum Operations

The Prefix-Sum Operation

Computing prefix sums on an eight-node hypercube. At each node, square brackets show the local prefix sum accumulated in the result buffer and parentheses enclose the contents of the outgoing message buffer for the next step.

Page 4: All-Reduce and Prefix-Sum Operations

The Prefix-Sum Operation

• The operation can be implemented using the all-to-all broadcast kernel.

• We must account for the fact that in prefix sums, node k uses information only from the subset of nodes whose labels are less than or equal to k.

• This is implemented using an additional result buffer. The content of an incoming message is added to the result buffer only if the message comes from a node with a smaller label than the recipient node.

• The contents of the outgoing message (denoted by parentheses in the figure) are updated with every incoming message.
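A minimal Python sketch of the hypercube prefix-sum procedure described above, simulated sequentially: each node keeps a result buffer and an outgoing message buffer, and an incoming message updates the result only when it arrives from a smaller-labeled node. Addition as the operator and the function name are illustrative assumptions, not the book's code.

```python
# Sketch: simulate prefix sums on a d-dimensional hypercube (p = 2**d nodes).
def hypercube_prefix_sum(values):
    p = len(values)                      # assume p is a power of two
    d = p.bit_length() - 1
    result = list(values)                # per-node result buffer [ ... ]
    msg = list(values)                   # per-node outgoing message buffer ( ... )
    for i in range(d):                   # one step per hypercube dimension
        new_msg, new_result = list(msg), list(result)
        for node in range(p):
            partner = node ^ (1 << i)    # neighbor along dimension i
            incoming = msg[partner]
            new_msg[node] = msg[node] + incoming       # message buffer always updated
            if partner < node:                         # add only if sender label is smaller
                new_result[node] = result[node] + incoming
        msg, result = new_msg, new_result
    return result

print(hypercube_prefix_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # [1, 3, 6, 10, 15, 21, 28, 36]
```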

Page 5: All-Reduce and Prefix-Sum Operations

The Prefix-Sum Operation

Prefix sums on a d-dimensional hypercube.

Page 6: All-Reduce and Prefix-Sum Operations

Scatter and Gather

• In the scatter operation, a single node sends a unique message of size m to every other node (also called a one-to-all personalized communication).

• In the gather operation, a single node collects a unique message from each node.

• While the scatter operation is fundamentally different from broadcast, the algorithmic structure is similar, except for differences in message sizes (messages get smaller in scatter and stay constant in broadcast).

• The gather operation is exactly the inverse of the scatter operation and can be executed as such.
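As a rough illustration of the scatter structure (the data a node holds halves at each of the log p steps), here is a hedged Python simulation on a hypercube; the dictionary bookkeeping and the function name are assumptions for illustration, not the textbook's procedure.

```python
# Sketch: node 0 scatters p distinct messages over a p-node hypercube in log p steps.
def hypercube_scatter(p, messages):
    data = {0: list(messages)}                 # node -> block it currently holds
    d = p.bit_length() - 1                     # assume p is a power of two
    for i in reversed(range(d)):               # highest dimension first
        for node in list(data):
            block = data[node]
            half = len(block) // 2
            data[node] = block[:half]              # keep the lower half
            data[node ^ (1 << i)] = block[half:]   # send the upper half (size halves)
    return [data[node][0] for node in range(p)]

print(hypercube_scatter(8, [f"msg->{i}" for i in range(8)]))
```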

Page 7: All-Reduce and Prefix-Sum Operations

Gather and Scatter Operations

Scatter and gather operations.

Page 8: All-Reduce and Prefix-Sum Operations

Example of the Scatter Operation

The scatter operation on an eight-node hypercube.

Page 9: All-Reduce and Prefix-Sum Operations

Cost of Scatter and Gather

• There are log p steps; in each step, the machine size halves and the data size halves.

• The time for this operation is T = ts log p + tw m(p − 1).

• This time holds for a linear array as well as a 2-D mesh.

• These times are asymptotically optimal in message size.

Page 10: All-Reduce and Prefix-Sum Operations

All-to-All Personalized Communication

• Each node has a distinct message of size m for every other node.

• This is unlike all-to-all broadcast, in which each node sends the same message to all other nodes.

• All-to-all personalized communication is also known as total exchange.

Page 11: All-Reduce and Prefix-Sum Operations

All-to-All Personalized Communication

All-to-all personalized communication.

Page 12: All-Reduce and Prefix-Sum Operations

All-to-All Personalized Communication: Example

• Consider the problem of transposing a matrix.

• Each processor contains one full row of the matrix.

• The transpose operation in this case is identical to an all-to-all personalized communication operation.

Page 13: All-Reduce and Prefix-Sum Operations

All-to-All Personalized Communication: Example

All-to-all personalized communication in transposing a 4 x 4 matrix using four processes.

Page 14: All-Reduce and Prefix-Sum Operations

All-to-All Personalized Communication on a Ring

• Each node sends all pieces of data as one consolidated message of size m(p − 1) to one of its neighbors.

• Each node extracts the information meant for it from the data received, and forwards the remaining (p − 2) pieces of size m each to the next node.

• The algorithm terminates in p − 1 steps.

• The size of the message reduces by m at each step.
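A small Python simulation of the ring algorithm just described; the message labels follow the {source,destination} convention used in the figure on the next page, while the data structures and function name are illustrative assumptions.

```python
# Sketch: all-to-all personalized communication on a p-node ring.
def all_to_all_personalized_ring(p):
    # in_transit[node] is the bundle that node will forward in the next step.
    in_transit = [[(dst, f"{{{src},{dst}}}") for dst in range(p) if dst != src]
                  for src in range(p)]
    delivered = [[f"{{{i},{i}}}"] for i in range(p)]       # own message stays put
    for _ in range(p - 1):
        received = [in_transit[(node - 1) % p] for node in range(p)]
        in_transit = []
        for node in range(p):
            keep = [m for dst, m in received[node] if dst == node]
            forward = [(dst, m) for dst, m in received[node] if dst != node]
            delivered[node].extend(keep)       # extract what is meant for this node
            in_transit.append(forward)         # bundle shrinks by one message of size m
    return delivered

for node, msgs in enumerate(all_to_all_personalized_ring(6)):
    print(node, msgs)
```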

Page 15: All-Reduce and Prefix-Sum Operations

All-to-All Personalized Communication on a Ring

All-to-all personalized communication on a six-node ring. The label of each message is of the form {x,y}, where x is the label of the node that originally owned the message, and y is the label of the node that is its final destination. The label ({x1,y1}, {x2,y2}, …, {xn,yn}) indicates a message formed by concatenating n individual messages.

Page 16: All-Reduce and Prefix-Sum Operations

All-to-All Personalized Communication on a Ring: Cost

• We have p − 1 steps in all.

• In step i, the message size is m(p − i).

• The total time is given by: T = ∑i=1..p−1 (ts + tw m(p − i)) = (ts + tw mp/2)(p − 1).

• The tw term in this equation can be reduced by a factor of 2 by communicating messages in both directions.

Page 17: All-Reduce and Prefix-Sum Operations

All-to-All Personalized Communication on a Mesh

• Each node first groups its p messages according to the columns of their destination nodes.

• All-to-all personalized communication is performed independently in each row with clustered messages of size m√p.

• Messages in each node are sorted again, this time according to the rows of their destination nodes.

• All-to-all personalized communication is performed independently in each column with clustered messages of size m√p.

Page 18: All-Reduce and Prefix-Sum Operations

All-to-All Personalized Communication on a Mesh

The distribution of messages at the beginning of each phase of all-to-all personalized communication on a 3 x 3 mesh. At the end of the second phase, node i has messages ({0,i}, …, {8,i}), where 0 ≤ i ≤ 8. The groups of nodes communicating together in each phase are enclosed in dotted boundaries.

Page 19: All-Reduce and Prefix-Sum Operations

All-to-All Personalized Communication on a Mesh: Cost

• Time for the first phase is identical to that in a ring with √p processors, i.e., (ts + tw mp/2)(√p − 1).

• Time in the second phase is identical to the first phase. Therefore, the total time is twice this, i.e., T = (2ts + tw mp)(√p − 1).

• It can be shown that the time for the intermediate message rearrangement is much less than this communication time.

Page 20: All-Reduce and Prefix-Sum Operations

All-to-All Personalized Communication on a Hypercube

• Generalize the mesh algorithm to log p steps.

• At any stage in all-to-all personalized communication, every node holds p packets of size m each.

• While communicating in a particular dimension, every node sends p/2 of these packets (consolidated as one message).

• A node must rearrange its messages locally before each of the log p communication steps.

Page 21: All-Reduce and Prefix-Sum Operations

All-to-All Personalized Communication on a Hypercube

An all-to-all personalized communication algorithm on a three-dimensional hypercube.

Page 22: All-Reduce and Prefix-Sum Operations

All-to-All Personalized Communication on a Hypercube: Cost

• We have log p iterations, and mp/2 words are communicated in each iteration. Therefore, the cost is: T = (ts + tw mp/2) log p.

• This is not optimal!

Page 23: All-Reduce and Prefix-Sum Operations

All-to-All Personalized Communication on a Hypercube: Optimal Algorithm

• Each node simply performs p − 1 communication steps, exchanging m words of data with a different node in every step.

• A node must choose its communication partner in each step so that the hypercube links do not suffer congestion.

• In the jth communication step, node i exchanges data with node (i XOR j).

• With this schedule, all paths in every communication step are congestion-free, and no bidirectional link carries more than one message in the same direction.
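A short Python sketch of the pairing schedule: in step j, node i exchanges directly with node i XOR j. This only demonstrates that the schedule delivers every message exactly once; it does not model the hypercube links or routing, and the function name is an illustrative assumption.

```python
# Sketch: the optimal schedule exchanges m words with a different partner in each step.
def all_to_all_personalized_hypercube(p):
    messages = [[(i, j) for j in range(p)] for i in range(p)]   # messages[i][j]: i -> j
    received = [[messages[i][i]] for i in range(p)]
    for j in range(1, p):                  # p - 1 exchange steps
        for i in range(p):
            partner = i ^ j                # pairing is symmetric: partner ^ j == i
            received[i].append(messages[partner][i])
    return received

for node, msgs in enumerate(all_to_all_personalized_hypercube(8)):
    print(node, sorted(msgs))
```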

Page 24: All-Reduce and Prefix-Sum Operations

All-to-All Personalized Communication on a Hypercube: Optimal Algorithm

Seven steps in all-to-all personalized communication on an eight-node hypercube.

Page 25: All-Reduce and Prefix-Sum Operations

All-to-All Personalized Communication on a Hypercube: Optimal Algorithm

A procedure to perform all-to-all personalized communication on a d-dimensional hypercube. The message Mi,j initially resides on node i and is destined for node j.

Page 26: All-Reduce and Prefix-Sum Operations

All-to-All Personalized Communication on a Hypercube: Cost Analysis of Optimal Algorithm

• There are p − 1 steps, and each step involves a non-congesting message transfer of m words.

• We have: T = (ts + tw m)(p − 1).

• This is asymptotically optimal in message size.

Page 27: All-Reduce and Prefix-Sum Operations

Dense Matrix Algorithms

Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar

To accompany the text “Introduction to Parallel Computing”, Addison Wesley, 2003.

Page 28: All-Reduce and Prefix-Sum Operations

Topic Overview

• Matrix-Vector Multiplication

• Matrix-Matrix Multiplication

• Solving a System of Linear Equations

Page 29: All-Reduce and Prefix-Sum Operations

Matrix Algorithms: Introduction

• Due to their regular structure, parallel computations involving matrices and vectors readily lend themselves to data-decomposition.

• Typical algorithms rely on input, output, or intermediate data decomposition.

• Most algorithms use one- and two-dimensional block, cyclic, and block-cyclic partitionings.

Page 30: All-Reduce and Prefix-Sum Operations

Matrix-Vector Multiplication

• We aim to multiply a dense n x n matrix A with an n x 1 vector x to yield the n x 1 result vector y.

• The serial algorithm requires n² multiplications and additions.

Page 31: All-Reduce and Prefix-Sum Operations

Matrix-Vector Multiplication: Rowwise 1-D Partitioning

• The n x n matrix is partitioned among n processors, with each processor storing one complete row of the matrix.

• The n x 1 vector x is distributed such that each process owns one of its elements.

Page 32: All-Reduce and Prefix-Sum Operations

Matrix-Vector Multiplication: Rowwise 1-D Partitioning

Multiplication of an n x n matrix with an n x 1 vector using rowwise block 1-D partitioning. For the one-row-per-process case, p = n.

Page 33: All-Reduce and Prefix-Sum Operations

Matrix-Vector Multiplication: Rowwise 1-D Partitioning

• Since each process starts with only one element of x, an all-to-all broadcast is required to distribute all the elements to all the processes.

• Process Pi then computes y[i] = ∑j=0..n−1 (A[i, j] × x[j]).

• The all-to-all broadcast and the computation of y[i] both take time Θ(n). Therefore, the parallel time is Θ(n).

Page 34: All-Reduce and Prefix-Sum Operations

Matrix-Vector Multiplication: Rowwise 1-D Partitioning

• Consider now the case when p < n and we use block 1-D partitioning.

• Each process initially stores n/p complete rows of the matrix and a portion of the vector of size n/p.

• The all-to-all broadcast takes place among p processes and involves messages of size n/p.

• This is followed by n/p local dot products.

• Thus, the parallel run time of this procedure is TP = n²/p + ts log p + tw n.

• This is cost-optimal.
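A hedged NumPy sketch of the rowwise block 1-D algorithm for p < n, simulated in one process: the concatenation of the vector pieces stands in for the all-to-all broadcast, after which each block of rows is multiplied locally. The function name and the array splitting are illustrative assumptions.

```python
import numpy as np

def matvec_rowwise_1d(A, x, p):
    rows = np.array_split(A, p)               # n/p rows of A per "process"
    x_parts = np.array_split(x, p)            # n/p entries of x per "process"
    x_full = np.concatenate(x_parts)          # stands in for the all-to-all broadcast
    return np.concatenate([r @ x_full for r in rows])   # n/p local dot products each

n, p = 8, 4
A = np.arange(n * n, dtype=float).reshape(n, n)
x = np.ones(n)
assert np.allclose(matvec_rowwise_1d(A, x, p), A @ x)
```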

Page 35: All-Reduce and Prefix-Sum Operations

Matrix-Vector Multiplication: Rowwise 1-D Partitioning

Scalability Analysis:

• We know that T0 = pTP − W; therefore, we have T0 = ts p log p + tw np.

• For isoefficiency, we have W = KT0, where K = E/(1 − E) for desired efficiency E.

• From this, we have W = O(p²) (from the tw term).

• There is also a bound on isoefficiency because of concurrency. In this case, p < n; therefore, W = n² = Ω(p²).

• The overall isoefficiency is W = O(p²).

Page 36: All-Reduce and Prefix-Sum Operations

Matrix-Vector Multiplication: 2-D Partitioning

• The n x n matrix is partitioned among n² processors such that each processor owns a single element.

• The n x 1 vector x is distributed only in the last column of n processors.

Page 37: All-Reduce and Prefix-Sum Operations

Matrix-Vector Multiplication: 2-D Partitioning

Matrix-vector multiplication with block 2-D partitioning. For the one-element-per-process case, p = n² if the matrix size is n x n.

Page 38: All-Reduce and Prefix-Sum Operations

Matrix-Vector Multiplication: 2-D Partitioning

• We must first align the vector with the matrix appropriately.

• The first communication step for the 2-D partitioning aligns the vector x along the principal diagonal of the matrix.

• The second step copies the vector elements from each diagonal process to all the processes in the corresponding column, using n simultaneous broadcasts, one in each column.

• Finally, the result vector is computed by performing an all-to-one reduction along the rows.

Page 39: All-Reduce and Prefix-Sum Operations

Matrix-Vector Multiplication: 2-D Partitioning

• Three basic communication operations are used in this algorithm: one-to-one communication to align the vector along the main diagonal, one-to-all broadcast of each vector element among the n processes of each column, and all-to-one reduction in each row.

• Each of these operations takes Θ(log n) time, and the parallel time is Θ(log n).

• The cost (process-time product) is Θ(n² log n); hence, the algorithm is not cost-optimal.

Page 40: All-Reduce and Prefix-Sum Operations

Matrix-Vector Multiplication: 2-D Partitioning

• When using fewer than n² processors, each process owns an (n/√p) x (n/√p) block of the matrix.

• The vector is distributed in portions of n/√p elements in the last process-column only.

• In this case, the message sizes for the alignment, broadcast, and reduction are all n/√p.

• The computation is a product of an (n/√p) x (n/√p) submatrix with a vector of length n/√p.

Page 41: All-Reduce and Prefix-Sum Operations

Matrix-Vector Multiplication: 2-D Partitioning

• The first alignment step takes time ts + tw n/√p.

• The broadcast and reduction each take time (ts + tw n/√p) log(√p).

• Local matrix-vector products take time approximately n²/p.

• The total time is TP ≈ n²/p + ts log p + tw (n/√p) log p.

Page 42: All-Reduce and Prefix-Sum Operations

Matrix-Vector Multiplication: 2-D Partitioning

• Scalability Analysis:

• T0 = pTP − W = ts p log p + tw n √p log p.

• Equating T0 with W, term by term, for isoefficiency, we have W = K² tw² p log² p as the dominant term (from the tw term).

• The isoefficiency due to concurrency is O(p).

• The overall isoefficiency is O(p log² p) (due to the network bandwidth).

• For cost optimality, we need pTP = O(n²); for this, we have p = O(n²/log² n).

Page 43: All-Reduce and Prefix-Sum Operations

Matrix-Matrix Multiplication

• Consider the problem of multiplying two n x n dense, square matrices A and B to yield the product matrix C = A x B.

• The serial complexity is O(n³).

• We do not consider better serial algorithms (such as Strassen's method), although these can be used as serial kernels in the parallel algorithms.

• A useful concept in this case is that of block operations. In this view, an n x n matrix A can be regarded as a q x q array of blocks Ai,j (0 ≤ i, j < q) such that each block is an (n/q) x (n/q) submatrix.

• In this view, we perform q³ matrix multiplications, each involving (n/q) x (n/q) matrices.

Page 44: All-Reduce and Prefix-Sum Operations

Matrix-Matrix Multiplication

• Consider two n x n matrices A and B partitioned into p blocks Ai,j and Bi,j (0 ≤ i, j < √p) of size (n/√p) x (n/√p) each.

• Process Pi,j initially stores Ai,j and Bi,j and computes block Ci,j of the result matrix.

• Computing submatrix Ci,j requires all submatrices Ai,k and Bk,j for 0 ≤ k < √p.

• All-to-all broadcast blocks of A along rows and blocks of B along columns.

• Perform local submatrix multiplication.

Page 45: All-Reduce and Prefix-Sum Operations

Matrix-Matrix Multiplication

• The two broadcasts take time 2(ts log(√p) + tw (n²/p)(√p − 1)).

• The computation requires √p multiplications of (n/√p) x (n/√p) sized submatrices.

• The parallel run time is approximately TP = n³/p + ts log p + 2 tw n²/√p.

• The algorithm is cost-optimal and the isoefficiency is O(p^1.5) due to the bandwidth term tw and concurrency.

• A major drawback of the algorithm is that it is not memory optimal.

Page 46: All-Reduce and Prefix-Sum Operations

Matrix-Matrix Multiplication: Cannon's Algorithm

• In this algorithm, we schedule the computations of the √p processes of the ith row such that, at any given time, each process is using a different block Ai,k.

• These blocks can be systematically rotated among the processes after every submatrix multiplication so that every process gets a fresh Ai,k after each rotation.

Page 47: All-Reduce and Prefix-Sum Operations

Matrix-Matrix Multiplication: Cannon's Algorithm

Communication steps in Cannon's algorithm on 16 processes.

Page 48: All-Reduce and Prefix-Sum Operations

Matrix-Matrix Multiplication: Cannon's Algorithm

• Align the blocks of A and B in such a way that each process multiplies its local submatrices. This is done by shifting all submatrices Ai,j to the left (with wraparound) by i steps and all submatrices Bi,j up (with wraparound) by j steps.

• Perform local block multiplication.

• Each block of A moves one step left and each block of B moves one step up (again with wraparound).

• Perform the next block multiplication, add to the partial result, and repeat until all blocks have been multiplied.
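A NumPy sketch of Cannon's algorithm on a q x q logical mesh (q = √p), simulated with block lists; the initial alignment and the single-step circular shifts follow the bullets above, and the helper names are illustrative assumptions.

```python
import numpy as np

def cannon_matmul(A, B, q):                       # q x q blocks, q = sqrt(p)
    n = A.shape[0]
    b = n // q
    def blk(M, i, j):
        return M[i*b:(i+1)*b, j*b:(j+1)*b].copy()
    Ab = [[blk(A, i, j) for j in range(q)] for i in range(q)]
    Bb = [[blk(B, i, j) for j in range(q)] for i in range(q)]
    # Alignment: row i of A shifts left by i; column j of B shifts up by j.
    Ab = [[Ab[i][(j + i) % q] for j in range(q)] for i in range(q)]
    Bb = [[Bb[(i + j) % q][j] for j in range(q)] for i in range(q)]
    Cb = [[np.zeros((b, b)) for _ in range(q)] for _ in range(q)]
    for _ in range(q):
        for i in range(q):
            for j in range(q):
                Cb[i][j] += Ab[i][j] @ Bb[i][j]   # local block multiplication
        # Compute-and-shift phase: A moves one step left, B one step up (wraparound).
        Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return np.block(Cb)

A, B = np.random.rand(8, 8), np.random.rand(8, 8)
assert np.allclose(cannon_matmul(A, B, 4), A @ B)
```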

Page 49: All-Reduce and Prefix-Sum Operations

Matrix-Matrix Multiplication: Cannon's Algorithm

• In the alignment step, the maximum distance over which a block shifts is √p − 1; the two shift operations take a total of 2(ts + tw n²/p) time.

• Each of the √p single-step shifts in the compute-and-shift phase of the algorithm takes ts + tw n²/p time.

• The computation time for multiplying √p matrices of size (n/√p) x (n/√p) is n³/p.

• The parallel time is approximately: TP = n³/p + 2√p ts + 2 tw n²/√p.

• The cost-optimality and isoefficiency of the algorithm are identical to the first algorithm, except that this one is memory optimal.

Page 50: All-Reduce and Prefix-Sum Operations

Matrix-Matrix Multiplication: DNS Algorithm

• Uses a 3-D partitioning.

• Visualize the matrix multiplication algorithm as a cube: matrices A and B come in on two orthogonal faces and the result C comes out the other orthogonal face.

• Each internal node in the cube represents a single add-multiply operation (and thus the Θ(n³) complexity).

• The DNS algorithm partitions this cube using a 3-D block scheme.

Page 51: All-Reduce and Prefix-Sum Operations

Matrix-Matrix Multiplication: DNS Algorithm

• Assume an n x n x n mesh of processors.

• Move the columns of A and rows of B and perform the broadcasts.

• Each processor computes a single add-multiply.

• This is followed by an accumulation along the C dimension.

• Since each add-multiply takes constant time and the accumulation and broadcasts take log n time, the total runtime is Θ(log n).

• This is not cost optimal. It can be made cost optimal by using n / log n processors along the direction of accumulation.

Page 52: All-Reduce and Prefix-Sum Operations

Matrix-Matrix Multiplication: DNS Algorithm

The communication steps in the DNS algorithm while multiplying 4 x 4 matrices A and B on 64 processes.

Page 53: All-Reduce and Prefix-Sum Operations

Matrix-Matrix Multiplication: DNS Algorithm

Using fewer than n³ processors:

• Assume that the number of processes p is equal to q³ for some q < n.

• The two matrices are partitioned into blocks of size (n/q) x (n/q).

• Each matrix can thus be regarded as a q x q two-dimensional square array of blocks.

• The algorithm follows from the previous one, except that, in this case, we operate on blocks rather than on individual elements.

Page 54: All-Reduce and Prefix-Sum Operations

Matrix-Matrix Multiplication: DNS Algorithm

Using fewer than n³ processors:

• The first one-to-one communication step is performed for both A and B, and takes ts + tw (n/q)² time for each matrix.

• The two one-to-all broadcasts take (ts + tw (n/q)²) log q time for each matrix.

• The reduction takes time (ts + tw (n/q)²) log q.

• Multiplication of (n/q) x (n/q) submatrices takes (n/q)³ time.

• The parallel time is approximated by: TP ≈ n³/p + ts log p + tw (n²/p^(2/3)) log p.

• The isoefficiency function is Θ(p log³ p).

Page 55: All-Reduce and Prefix-Sum Operations

Solving a System of Linear Equations

• Consider the problem of solving linear equations of the kind:

ai,0 x0 + ai,1 x1 + ··· + ai,n−1 xn−1 = bi, for 0 ≤ i < n.

• This is written as Ax = b, where A is an n x n matrix with A[i, j] = ai,j, b is an n x 1 vector [b0, b1, …, bn−1]^T, and x is the solution vector.

Page 56: All-Reduce and Prefix-Sum Operations

Solving a System of Linear Equations

Two steps in the solution are: reduction to triangular form, and back-substitution. The triangular form is an upper-triangular system with a unit diagonal (U[i, j] = 0 for i > j and U[i, i] = 1).

We write this as: Ux = y.

A commonly used method for transforming a given matrix into an upper-triangular matrix is Gaussian elimination.

Page 57: All-Reduce and Prefix-Sum Operations

Gaussian Elimination

Serial Gaussian Elimination
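The pseudocode figure itself did not survive the transcript; the following is a hedged Python rendering of standard serial Gaussian elimination without pivoting, matching the description on the next slide (a division step that normalizes row k, followed by elimination of column k from the rows below).

```python
import numpy as np

def gaussian_elimination(A, b):
    # Reduce Ax = b to a unit upper-triangular system Ux = y (no pivoting).
    A, b = A.astype(float).copy(), b.astype(float).copy()
    n = A.shape[0]
    for k in range(n):
        b[k] /= A[k, k]                 # division (normalization) step
        A[k, k+1:] /= A[k, k]
        A[k, k] = 1.0
        for i in range(k + 1, n):       # elimination step for rows below k
            b[i] -= A[i, k] * b[k]
            A[i, k+1:] -= A[i, k] * A[k, k+1:]
            A[i, k] = 0.0
    return A, b                         # A is now U, b holds the updated y

U, y = gaussian_elimination(np.array([[2., 1., 1.], [4., -6., 0.], [-2., 7., 2.]]),
                            np.array([5., -2., 9.]))
```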

Page 58: All-Reduce and Prefix-Sum Operations

Gaussian Elimination

• The computation has three nested loops: in the kth iteration of the outer loop, the algorithm performs (n − k)² computations. Summing over k = 1, …, n, we have roughly n³/3 multiplication-subtraction pairs.

A typical computation in Gaussian elimination.

Page 59: All-Reduce and Prefix-Sum Operations

Parallel Gaussian Elimination

• Assume p = n, with each row assigned to a processor.

• The first step of the algorithm normalizes the row. This is a serial operation and takes time (n − k) in the kth iteration.

• In the second step, the normalized row is broadcast to all the processors. This takes time (ts + tw(n − k − 1)) log n.

• Each processor can independently eliminate this row from its own. This requires (n − k − 1) multiplications and subtractions.

• The total parallel time can be computed by summing over k = 1, …, n − 1 as TP = (3/2) n(n − 1) + ts n log n + (1/2) tw n(n − 1) log n.

• The formulation is not cost-optimal because of the tw term.

Page 60: All-Reduce and Prefix-Sum Operations

Parallel Gaussian Elimination

Gaussian elimination steps during the iteration corresponding to k = 3 for an 8 x 8 matrix partitioned rowwise among eight processes.

Page 61: All-Reduce and Prefix-Sum Operations

Parallel Gaussian Elimination: Pipelined Execution

• In the previous formulation, the (k+1)st iteration starts only after all the computation and communication for the kth iteration is complete.

• In the pipelined version, there are three steps: normalization of a row, communication, and elimination. These steps are performed in an asynchronous fashion.

• A processor Pk waits to receive and eliminate all rows prior to k.

• Once it has done this, it forwards its own row to processor Pk+1.

Page 62: All-Reduce and Prefix-Sum Operations

Parallel Gaussian Elimination: Pipelined Execution

Pipelined Gaussian elimination on a 5 x 5 matrix partitioned with one row per process.

Page 63: All-Reduce and Prefix-Sum Operations

Parallel Gaussian Elimination: Pipelined Execution

• The total number of steps in the entire pipelined procedure is Θ(n).

• In any step, either O(n) elements are communicated between directly-connected processes, or a division step is performed on O(n) elements of a row, or an elimination step is performed on O(n) elements of a row.

• The parallel time is therefore O(n²).

• This is cost optimal.

Page 64: All-Reduce and Prefix-Sum Operations

Parallel Gaussian Elimination: Pipelined Execution

The communication in the Gaussian elimination iteration corresponding to k = 3 for an 8 x 8 matrix distributed among four processes using block 1-D partitioning.

Page 65: All-Reduce and Prefix-Sum Operations

Parallel Gaussian Elimination: Block 1D with p < n

• The above algorithm can be easily adapted to the case when p < n.

• In the kth iteration, a processor with all of its rows belonging to the active part of the matrix performs (n − k − 1) n/p multiplications and subtractions.

• In the pipelined version, for n > p, computation dominates communication.

• The parallel run time is approximately n³/p.

• While the algorithm is cost-optimal, the cost of the parallel algorithm is higher than the sequential run time by a factor of 3/2.

Page 66: All-Reduce and Prefix-Sum Operations

Parallel Gaussian Elimination: Block 1D with p < n

Computation load on different processes in block and cyclic 1-D partitioning of an 8 x 8 matrix on four processes during the Gaussian elimination iteration corresponding to k = 3.

Page 67: All-Reduce and Prefix-Sum Operations

Parallel Gaussian Elimination: Block 1D with p < n

• The load imbalance problem can be alleviated by using a cyclic mapping.

• In this case, other than processing of the last p rows, there is no load imbalance.

• This corresponds to a cumulative load imbalance overhead of O(n²p) (instead of O(n³) in the previous case).

Page 68: All-Reduce and Prefix-Sum Operations

Gaussian Elimination with Partial Pivoting

• For numerical stability, one generally uses partial pivoting.

• In the kth iteration, we select a column i (called the pivot column) such that A[k, i] is the largest in magnitude among all A[k, j] with k ≤ j < n.

• The kth and the ith columns are interchanged.

• This is simple to implement with row-partitioning and does not add overhead, since the division step takes the same time as computing the maximum.

• Column-partitioning, however, requires a global reduction, adding a log p term to the overhead.

• Pivoting precludes the use of pipelining.

Page 69: All-Reduce and Prefix-Sum Operations

Gaussian Elimination with Partial Pivoting: 2-D Partitioning

• Partial pivoting restricts the use of pipelining, resulting in a performance loss.

• This loss can be alleviated by restricting pivoting to specific columns.

• Alternatively, we can use faster algorithms for broadcast.

Page 70: All-Reduce and Prefix-Sum Operations

Solving a Triangular System: Back-Substitution

• The upper triangular matrix U undergoes back-substitution to determine the vector x.

A serial algorithm for back-substitution.
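The back-substitution pseudocode did not survive the transcript; below is a hedged Python version for a unit upper-triangular system Ux = y, solving for x from the last row upward.

```python
import numpy as np

def back_substitution(U, y):
    n = U.shape[0]
    x = np.zeros(n)
    for k in range(n - 1, -1, -1):            # bottom row first
        x[k] = y[k] - U[k, k+1:] @ x[k+1:]    # U[k, k] is assumed to be 1
    return x

U = np.array([[1., 0.5, 0.5], [0., 1., 0.25], [0., 0., 1.]])
y = np.array([2.5, 0.5, 3.0])
print(back_substitution(U, y))
```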

Page 71: All-Reduce and Prefix-Sum Operations

Solving a Triangular System: Back-Substitution

• The algorithm performs approximately n²/2 multiplications and subtractions.

• Since the complexity of this part is asymptotically lower, we should optimize the data distribution for the factorization part.

• Consider a rowwise block 1-D mapping of the n x n matrix U with the vector y distributed uniformly.

• The value of the variable solved at a step can be pipelined back.

• Each step of a pipelined implementation requires a constant amount of time for communication and Θ(n/p) time for computation.

• The parallel run time of the entire algorithm is Θ(n²/p).

Page 72: All-Reduce and Prefix-Sum Operations

Solving a Triangular System: Back-Substitution

• If the matrix is partitioned using 2-D partitioning on a √p x √p logical mesh of processes, and the elements of the vector are distributed along one of the columns of the process mesh, then only the processes containing the vector perform any computation.

• Using pipelining to communicate the appropriate elements of U to the process containing the corresponding elements of y for the substitution step (line 7), the algorithm can be executed in Θ(n²/√p) time.

• While this is not cost-optimal, since it does not dominate the overall computation, the cost-optimality is determined by the factorization.

Page 73: All-Reduce and Prefix-Sum Operations

Sorting Algorithms

Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar

To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003.

Page 74: All-Reduce and Prefix-Sum Operations

Topic Overview

• Issues in Sorting on Parallel Computers

• Sorting Networks

• Bubble Sort and its Variants

• Quicksort

• Bucket and Sample Sort

• Other Sorting Algorithms

Page 75: All-Reduce and Prefix-Sum Operations

Sorting: Overview

• One of the most commonly used and well-studied kernels.

• Sorting can be comparison-based or noncomparison-based.

• The fundamental operation of comparison-based sorting is compare-exchange.

• The lower bound on any comparison-based sort of n numbers is Θ(n log n).

• We focus here on comparison-based sorting algorithms.

Page 76: All-Reduce and Prefix-Sum Operations

Sorting: Basics

What is a parallel sorted sequence? Where are the input and output lists stored?

• We assume that the input and output lists are distributed.

• The sorted list is partitioned with the property that each partitioned list is sorted and each element in processor Pi's list is less than that in Pj's list if i < j.

Page 77: All-Reduce and Prefix-Sum Operations

Sorting: Parallel Compare Exchange Operation

A parallel compare-exchange operation. Processes Pi and Pj send their elements to each other. Process Pi keeps min{ai, aj}, and Pj keeps max{ai, aj}.

Page 78: All-Reduce and Prefix-Sum Operations

Sorting: Basics

What is the parallel counterpart to a sequential comparator?

• If each processor has one element, the compare exchange operation stores the smaller element at the processor with smaller id. This can be done in ts + tw time.

• If we have more than one element per processor, we call this operation a compare-split. Assume that each of the two processors has n/p elements.

• After the compare-split operation, the smaller n/p elements are at processor Pi and the larger n/p elements at Pj, where i < j.

• The time for a compare-split operation is (ts+ twn/p), assuming that the two partial lists were initially sorted.
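A tiny Python sketch of the compare-split operation on two sorted blocks; in a real message-passing setting each side would exchange its block, merge, and keep its half, whereas here one call returns both halves for illustration.

```python
def compare_split(block_i, block_j):
    """block_i belongs to the lower-ranked process; both blocks are assumed sorted."""
    merged = sorted(block_i + block_j)        # each process merges the two blocks
    half = len(block_i)
    return merged[:half], merged[half:]       # Pi keeps the smaller half, Pj the larger

lo, hi = compare_split([1, 6, 8, 11], [2, 3, 9, 10])
print(lo, hi)   # [1, 2, 3, 6] [8, 9, 10, 11]
```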

Page 79: All-Reduce and Prefix-Sum Operations

Sorting: Parallel Compare Split Operation

A compare-split operation. Each process sends its block of size n/p to the other process. Each process merges the received block with its own block and retains only the appropriate half of the merged block. In this example, process Pi retains the smaller elements and process Pj retains the larger elements.

Page 80: All-Reduce and Prefix-Sum Operations

Sorting Networks

• Networks of comparators designed specifically for sorting.

• A comparator is a device with two inputs x and y and two outputs x' and y'. For an increasing comparator, x' = min{x, y} and y' = max{x, y}; for a decreasing comparator, the reverse.

• We denote an increasing comparator by ⊕ and a decreasing comparator by ⊖.

• The speed of the network is proportional to its depth.

Page 81: All-Reduce and Prefix-Sum Operations

Sorting Networks: Comparators

A schematic representation of comparators: (a) an increasing comparator, and (b) a decreasing comparator.

Page 82: All-Reduce and Prefix-Sum Operations

Sorting Networks

A typical sorting network. Every sorting network is made up of a series of columns, and each column contains a number of comparators connected in parallel.

Page 83: All-Reduce and Prefix-Sum Operations

Sorting Networks: Bitonic Sort

• A bitonic sorting network sorts n elements in Θ(log² n) time.

• A bitonic sequence has two tones, increasing and decreasing, or vice versa. Any cyclic rotation of such a sequence is also considered bitonic.

• 1, 2, 4, 7, 6, 0 is a bitonic sequence, because it first increases and then decreases. 8, 9, 2, 1, 0, 4 is another bitonic sequence, because it is a cyclic shift of 0, 4, 8, 9, 2, 1.

• The kernel of the network is the rearrangement of a bitonic sequence into a sorted sequence.

Page 84: All-Reduce and Prefix-Sum Operations

Sorting Networks: Bitonic Sort

• Let s = a0, a1, …, an-1 be a bitonic sequence such that a0 ≤ a1 ≤ ··· ≤ an/2-1 and an/2 ≥ an/2+1 ≥ ··· ≥ an-1.

• Consider the following subsequences of s:

s1 = min{a0,an/2},min{a1,an/2+1},…,min{an/2-1,an-1}

s2 = max{a0,an/2},max{a1,an/2+1},…,max{an/2-1,an-1} (1)

• Note that s1 and s2 are both bitonic and each element of s1 is less than every element in s2.

• We can apply the procedure recursively on s1 and s2 to get the sorted sequence.
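A compact Python sketch of the bitonic split of equation (1) applied recursively, i.e., a bitonic merge; the recursion and the increasing/decreasing flag are illustrative choices for a sequential demonstration.

```python
def bitonic_merge(s, increasing=True):
    n = len(s)                        # n a power of two, s bitonic
    if n == 1:
        return list(s)
    half = n // 2
    s1, s2 = list(s[:half]), list(s[half:])
    for i in range(half):             # the bitonic split of equation (1)
        if (s1[i] > s2[i]) == increasing:
            s1[i], s2[i] = s2[i], s1[i]
    return bitonic_merge(s1, increasing) + bitonic_merge(s2, increasing)

print(bitonic_merge([1, 2, 4, 7, 6, 0, -1, -5]))   # a bitonic input sequence
```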

Page 85: All-Reduce and Prefix-Sum Operations

Sorting Networks: Bitonic Sort

Merging a 16-element bitonic sequence through a series of log 16 bitonic splits.

Page 86: All-Reduce and Prefix-Sum Operations

Sorting Networks: Bitonic Sort

• We can easily build a sorting network to implement this bitonic merge algorithm.

• Such a network is called a bitonic merging network.

• The network contains log n columns. Each column contains n/2 comparators and performs one step of the bitonic merge.

• We denote a bitonic merging network with n inputs by BM[n].

• Replacing the ⊕ comparators by ⊖ comparators results in a decreasing output sequence; such a network is denoted by ⊖BM[n].

Page 87: All-Reduce and Prefix-Sum Operations

Sorting Networks: Bitonic Sort

A bitonic merging network for n = 16. The input wires are numbered 0, 1, …, n − 1, and the binary representation of these numbers is shown. Each column of comparators is drawn separately; the entire figure represents a BM[16] bitonic merging network. The network takes a bitonic sequence and outputs it in sorted order.

Page 88: All-Reduce and Prefix-Sum Operations

Sorting Networks: Bitonic Sort

How do we sort an unsorted sequence using a bitonic merge?

• We must first build a single bitonic sequence from the given sequence.

• A sequence of length 2 is a bitonic sequence.

• A bitonic sequence of length 4 can be built by sorting the first two elements using ⊕BM[2] and the next two using ⊖BM[2].

• This process can be repeated to generate larger bitonic sequences.

Page 89: All-Reduce and Prefix-Sum Operations

Sorting Networks: Bitonic Sort

A schematic representation of a network that converts an input sequence into a bitonic sequence. In this example, ⊕BM[k] and ⊖BM[k] denote bitonic merging networks of input size k that use ⊕ and ⊖ comparators, respectively. The last merging network (⊕BM[16]) sorts the input. In this example, n = 16.

Page 90: All-Reduce and Prefix-Sum Operations

Sorting Networks: Bitonic Sort

The comparator network that transforms an input sequence of 16 unordered numbers into a bitonic sequence.

Page 91: All-Reduce and Prefix-Sum Operations

Sorting Networks: Bitonic Sort

• The depth of the network is Θ(log² n).

• Each stage of the network contains n/2 comparators. A serial implementation of the network would have complexity Θ(n log² n).

Page 92: All-Reduce and Prefix-Sum Operations

Mapping Bitonic Sort to Hypercubes

• Consider the case of one item per processor. The question becomes one of how the wires in the bitonic network should be mapped to the hypercube interconnect.

• Note from our earlier examples that the compare-exchange operation is performed between two wires only if their labels differ in exactly one bit!

• This implies a direct mapping of wires to processors. All communication is nearest neighbor!

Page 93: All-Reduce and Prefix-Sum Operations

Mapping Bitonic Sort to Hypercubes

Communication during the last stage of bitonic sort. Each wire is mapped to a hypercube process; each connection represents a compare-exchange between processes.

Page 94: All-Reduce and Prefix-Sum Operations

Mapping Bitonic Sort to Hypercubes

Communication characteristics of bitonic sort on a hypercube. During each stage of the algorithm, processes communicate along the dimensions shown.

Page 95: All-Reduce and Prefix-Sum Operations

Mapping Bitonic Sort to Hypercubes

Parallel formulation of bitonic sort on a hypercube with n = 2^d processes.

Page 96: All-Reduce and Prefix-Sum Operations

Mapping Bitonic Sort to Hypercubes

• During each step of the algorithm, every process performs a compare-exchange operation (single nearest neighbor communication of one word).

• Since each step takes Θ(1) time, the parallel time is

TP = Θ(log² n) (2)

• This algorithm is cost optimal w.r.t. its serial counterpart, but not w.r.t. the best sorting algorithm.

Page 97: All-Reduce and Prefix-Sum Operations

Mapping Bitonic Sort to Meshes

• The connectivity of a mesh is lower than that of a hypercube, so we must expect some overhead in this mapping.

• Consider the row-major shuffled mapping of wires to processors.

Page 98: All-Reduce and Prefix-Sum Operations

Mapping Bitonic Sort to Meshes

Different ways of mapping the input wires of the bitonic sorting network to a mesh of processes: (a) row-major mapping, (b) row-major snakelike mapping, and (c) row-major shuffled mapping.

Page 99: All-Reduce and Prefix-Sum Operations

Mapping Bitonic Sort to Meshes

The last stage of the bitonic sort algorithm for n = 16 on a mesh, using the row-major shuffled mapping. During each step, process pairs compare-exchange their elements. Arrows indicate the pairs of processes that perform compare-exchange operations.

Page 100: All-Reduce and Prefix-Sum Operations

Mapping Bitonic Sort to Meshes

• In the row-major shuffled mapping, wires that differ at the ith least-significant bit are mapped onto mesh processes that are 2^⌊(i−1)/2⌋ communication links apart.

• The total amount of communication performed by each process is ∑j=1..log n ∑i=1..j 2^⌊(i−1)/2⌋ ≈ 7√n, or Θ(√n). The total computation performed by each process is Θ(log² n).

• The parallel runtime is therefore TP = Θ(√n).

• This is not cost optimal.

Page 101: All-Reduce and Prefix-Sum Operations

Block of Elements Per Processor

• Each process is assigned a block of n/p elements.

• The first step is a local sort of the local block.

• Each subsequent compare-exchange operation is replaced by a compare-split operation.

• We can effectively view the bitonic network as having (1 + log p)(log p)/2 steps.

Page 102: All-Reduce and Prefix-Sum Operations

Block of Elements Per Processor: Hypercube

• Initially, the processes sort their n/p elements (using merge sort) in time Θ((n/p) log(n/p)) and then perform Θ(log² p) compare-split steps.

• The parallel run time of this formulation is TP = Θ((n/p) log(n/p)) + Θ((n/p) log² p), counting both the computation and the communication of the compare-split steps.

• Compared to an optimal sort, the algorithm can efficiently use up to p = Θ(2^√(log n)) processes.

• The isoefficiency function due to both communication and extra work is Θ(p^(log p) log² p).

Page 103: All-Reduce and Prefix-Sum Operations

Block of Elements Per Processor: Mesh

• The parallel runtime in this case is TP = Θ((n/p) log(n/p)) + Θ((n/p) log² p) + Θ(n/√p), accounting for the local sort, the comparisons, and the communication, respectively.

• This formulation can efficiently use up to p = Θ(log² n) processes.

• The isoefficiency function is exponential in √p, i.e., Θ(√p 2^√p).

Page 104: All-Reduce and Prefix-Sum Operations

Performance of Parallel Bitonic Sort

The performance of parallel formulations of bitonic sort for n elements on p processes.

Page 105: All-Reduce and Prefix-Sum Operations

Bubble Sort and its Variants

The sequential bubble sort algorithm compares and exchanges adjacent elements in the sequence to be sorted:

Sequential bubble sort algorithm.

Page 106: All-Reduce and Prefix-Sum Operations

Bubble Sort and its Variants

• The complexity of bubble sort is Θ(n²).

• Bubble sort is difficult to parallelize since the algorithm has no concurrency.

• A simple variant, though, uncovers the concurrency.

Page 107: All-Reduce and Prefix-Sum Operations

Odd-Even Transposition

Sequential odd-even transposition sort algorithm.
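The pseudocode figure did not survive the transcript; a hedged Python version of sequential odd-even transposition sort follows, alternating between even-indexed and odd-indexed pairs for n phases.

```python
def odd_even_transposition_sort(a):
    a = list(a)
    n = len(a)
    for phase in range(n):
        start = 0 if phase % 2 == 0 else 1     # even phase: pairs (0,1),(2,3),...
        for i in range(start, n - 1, 2):       # odd phase: pairs (1,2),(3,4),...
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_transposition_sort([3, 2, 3, 8, 5, 6, 4, 1]))
```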

Page 108: All-Reduce and Prefix-Sum Operations

Odd-Even Transposition

Sorting n = 8 elements, using the odd-even transposition sort algorithm. During each phase, n = 8 elements are compared.

Page 109: All-Reduce and Prefix-Sum Operations

Odd-Even Transposition

• After n phases of odd-even exchanges, the sequence is sorted.

• Each phase of the algorithm (either odd or even) requires Θ(n) comparisons.

• The serial complexity is Θ(n²).

Page 110: All-Reduce and Prefix-Sum Operations

Parallel Odd-Even Transposition

• Consider the one item per processor case.

• There are n iterations; in each iteration, each processor performs one compare-exchange.

• The parallel run time of this formulation is Θ(n).

• This is cost optimal with respect to the base serial algorithm but not the optimal one.

Page 111: All-Reduce and Prefix-Sum Operations

Parallel Odd-Even Transposition

Parallel formulation of odd-even transposition.

Page 112: All-Reduce and Prefix-Sum Operations

Parallel Odd-Even Transposition

• Consider a block of n/p elements per processor.

• The first step is a local sort.

• In each subsequent step, the compare exchange operation is replaced by the compare split operation.

• The parallel run time of the formulation is TP = Θ((n/p) log(n/p)) + Θ(n), where the Θ(n) term covers the p compare-split phases (Θ(n/p) computation and communication in each).

Page 113: All-Reduce and Prefix-Sum Operations

Parallel Odd-Even Transposition

• The parallel formulation is cost-optimal for p = O(log n).

• The isoefficiency function of this parallel formulation is Θ(p 2^p).

Page 114: All-Reduce and Prefix-Sum Operations

Shellsort

• Let n be the number of elements to be sorted and p be the number of processes.

• During the first phase, processes that are far away from each other in the array compare-split their elements.

• During the second phase, the algorithm switches to an odd-even transposition sort.

Page 115: All-Reduce and Prefix-Sum Operations

Parallel Shellsort

• Initially, each process sorts its block of n/p elements internally.

• Each process is now paired with its corresponding process in the reverse order of the array. That is, process Pi, where i < p/2, is paired with process Pp-i-1.

• A compare-split operation is performed.

• The processes are split into two groups of size p/2 each, and the process is repeated in each group.

Page 116: All-Reduce and Prefix-Sum Operations

Parallel Shellsort

An example of the first phase of parallel shellsort on an eight-process array.

Page 117: All-Reduce and Prefix-Sum Operations

Parallel Shellsort

• Each process performs d = log p compare-split operations.

• With O(p) bisection width, each communication can be performed in time Θ(n/p) for a total time of Θ((nlog p)/p).

• In the second phase, l odd and even phases are performed, each requiring time Θ(n/p).

• The parallel run time of the algorithm is: TP = Θ((n/p) log(n/p)) + Θ((n log p)/p) + Θ(l n/p), covering the local sort, the first phase, and the l odd-even phases, respectively.

Page 118: All-Reduce and Prefix-Sum Operations

Quicksort

• Quicksort is one of the most common sorting algorithms for sequential computers because of its simplicity, low overhead, and optimal average complexity.

• Quicksort selects one of the entries in the sequence to be the pivot and divides the sequence into two parts: one with all elements less than the pivot, and the other with all elements greater.

• The process is recursively applied to each of the sublists.

Page 119: All-Reduce and Prefix-Sum Operations

Quicksort

The sequential quicksort algorithm.
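The quicksort pseudocode figure did not survive the transcript; here is a short Python rendering in the same spirit, using the first element as the pivot purely for illustration.

```python
def quicksort(a):
    if len(a) <= 1:
        return list(a)
    pivot, rest = a[0], a[1:]
    less = [x for x in rest if x <= pivot]      # elements not greater than the pivot
    greater = [x for x in rest if x > pivot]    # elements greater than the pivot
    return quicksort(less) + [pivot] + quicksort(greater)

print(quicksort([3, 2, 1, 5, 8, 4, 3, 7]))
```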

Page 120: All-Reduce and Prefix-Sum Operations

Quicksort

Example of the quicksort algorithm sorting a sequence of size n = 8.

Page 121: All-Reduce and Prefix-Sum Operations

Quicksort

• The performance of quicksort depends critically on the quality of the pivot.

• In the best case, the pivot divides the list in such a way that the larger of the two lists does not have more than αn elements (for some constant α).

• In this case, the complexity of quicksort is O(n log n).

Page 122: All-Reduce and Prefix-Sum Operations

Parallelizing Quicksort

• Let's start with recursive decomposition: the list is partitioned serially and each of the subproblems is handled by a different processor.

• The time for this algorithm is lower-bounded by Ω(n)!

• Can we parallelize the partitioning step - in particular, if we can use n processors to partition a list of length n around a pivot in O(1) time, we have a winner.

• This is difficult to do on real machines, though.

Page 123: All-Reduce and Prefix-Sum Operations

Parallelizing Quicksort: PRAM Formulation

• We assume a CRCW (concurrent read, concurrent write) PRAM, with concurrent writes resulting in an arbitrary write succeeding.

• The formulation works by creating pools of processors. Every processor is assigned to the same pool initially and has one element.

• Each processor attempts to write its element to a common location (for the pool).

• Each processor tries to read back the location. If the value read back is greater than the processor's value, it assigns itself to the 'left' pool; otherwise, it assigns itself to the 'right' pool.

• Each pool performs this operation recursively.

• Note that the algorithm generates a tree of pivots. The depth of the tree is the expected parallel runtime. The average value is O(log n).

Page 124: All-Reduce and Prefix-Sum Operations

Parallelizing Quicksort: PRAM Formulation

A binary tree generated by the execution of the quicksort algorithm. Each level of the tree represents a different array-partitioning iteration. If pivot selection is optimal, then the height of the tree is Θ(log n), which is also the number of iterations.

Page 125: All-Reduce and Prefix-Sum Operations

Parallelizing Quicksort: PRAM Formulation

The execution of the PRAM algorithm on the array shown in (a).

Page 126: All-Reduce and Prefix-Sum Operations

Parallelizing Quicksort: Shared Address Space Formulation

• Consider a list of size n equally divided across p processors.

• A pivot is selected by one of the processors and made known to all processors.

• Each processor partitions its list into two, say Li and Ui, based on the selected pivot.

• All of the Li lists are merged and all of the Ui lists are merged separately.

• The set of processors is partitioned into two (in proportion to the sizes of lists L and U). The process is recursively applied to each of the lists.

Page 127: All-Reduce and Prefix-Sum Operations

Shared Address Space Formulation

Page 128: All-Reduce and Prefix-Sum Operations

Parallelizing Quicksort: Shared Address Space Formulation

• The only thing we have not described is the global reorganization (merging) of local lists to form L and U.

• The problem is one of determining the right location for each element in the merged list.

• Each processor computes the number of its local elements that are less than and greater than the pivot.

• It computes two sum-scans to determine the starting location for its elements in the merged L and U lists.

• Once it knows the starting locations, it can write its elements safely.
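A hedged Python sketch of the global rearrangement step: each simulated process partitions its block around the pivot, and two exclusive prefix sums (sum-scans) over the per-process counts give every process the offsets at which it can write its L and U parts independently. The function and variable names are illustrative assumptions.

```python
from itertools import accumulate

def global_rearrange(blocks, pivot):
    L_parts = [[x for x in b if x <= pivot] for b in blocks]
    U_parts = [[x for x in b if x > pivot] for b in blocks]
    l_off = [0] + list(accumulate(len(l) for l in L_parts))[:-1]   # exclusive sum-scans
    u_off = [0] + list(accumulate(len(u) for u in U_parts))[:-1]
    total_L = sum(len(l) for l in L_parts)
    out = [None] * (total_L + sum(len(u) for u in U_parts))
    for k, (l, u) in enumerate(zip(L_parts, U_parts)):             # independent writes
        out[l_off[k]:l_off[k] + len(l)] = l
        out[total_L + u_off[k]:total_L + u_off[k] + len(u)] = u
    return out, total_L          # total_L marks where the U region begins

print(global_rearrange([[7, 1, 5], [3, 9, 2], [8, 4, 6]], pivot=5))
```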

Page 129: All-Reduce and Prefix-Sum Operations

Parallelizing Quicksort: Shared Address Space Formulation

Efficient global rearrangement of the array.

Page 130: All-Reduce and Prefix-Sum Operations

Parallelizing Quicksort: Shared Address Space Formulation

• The parallel time depends on the split and merge time, and the quality of the pivot.

• The latter is an issue independent of parallelism, so we focus on the first aspect, assuming ideal pivot selection.

• The algorithm executes in four steps: (i) determine and broadcast the pivot; (ii) locally rearrange the array assigned to each process; (iii) determine the locations in the globally rearranged array that the local elements will go to; and (iv) perform the global rearrangement.

• The first step takes time Θ(log p), the second, Θ(n/p) , the third, Θ(log p) , and the fourth, Θ(n/p).

• The overall complexity of splitting an n-element array is Θ(n/p) + Θ(log p).

Page 131: All-Reduce and Prefix-Sum Operations

Parallelizing Quicksort: Shared Address Space Formulation

• The process recurses until there are p lists, at which point, the lists are sorted locally.

• Therefore, the total parallel time is: TP = Θ((n/p) log(n/p)) + Θ((n/p) log p) + Θ(log² p), covering the final local sorts, the array splits, and the pivot broadcasts and scans, respectively.

• The corresponding isoefficiency is Θ(p log² p) due to the broadcast and scan operations.

Page 132: All-Reduce and Prefix-Sum Operations

Parallelizing Quicksort: Message Passing Formulation

• A simple message passing formulation is based on recursive halving of the machine.

• Assume that each processor in the lower half of a p processor ensemble is paired with a corresponding processor in the upper half.

• A designated processor selects and broadcasts the pivot.

• Each processor splits its local list into two lists: one with elements less than the pivot (Li), and the other with elements greater than the pivot (Ui).

• A processor in the low half of the machine sends its list Ui to the paired processor in the other half. The paired processor sends its list Li.

• It is easy to see that after this step, all elements less than the pivot are in the low half of the machine and all elements greater than the pivot are in the high half.

Page 133: All-Reduce and Prefix-Sum Operations

Parallelizing Quicksort: Message Passing Formulation

• The above process is repeated recursively until each processor has its own local list, which it then sorts locally.

• The time for a single reorganization is Θ(log p) for broadcasting the pivot element, Θ(n/p) for splitting the locally assigned portion of the array, and Θ(n/p) for the exchange and local reorganization.

• We note that this time is identical to that of the corresponding shared address space formulation.

• It is important to remember that the reorganization of elements is a bandwidth sensitive operation.

Page 134: All-Reduce and Prefix-Sum Operations

Graph Algorithms

Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar

To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003

Page 135: All-Reduce and Prefix-Sum Operations

Topic Overview

• Definitions and Representation

• Minimum Spanning Tree: Prim's Algorithm

• Single-Source Shortest Paths: Dijkstra's Algorithm

• All-Pairs Shortest Paths

• Transitive Closure

• Connected Components

• Algorithms for Sparse Graphs

Page 136: All-Reduce and Prefix-Sum Operations

Definitions and Representation

• An undirected graph G is a pair (V,E), where V is a finite set of points called vertices and E is a finite set of edges.

• An edge e ∈ E is an unordered pair (u, v), where u, v ∈ V.

• In a directed graph, the edge e is an ordered pair (u, v). An edge (u, v) is incident from vertex u and is incident to vertex v.

• A path from a vertex v to a vertex u is a sequence <v0, v1, v2, …, vk> of vertices where v0 = v, vk = u, and (vi, vi+1) ∈ E for i = 0, 1, …, k − 1.

• The length of a path is defined as the number of edges in the path.

Page 137: All-Reduce and Prefix-Sum Operations

Definitions and Representation

(a) An undirected graph and (b) a directed graph.

Page 138: All-Reduce and Prefix-Sum Operations

Definitions and Representation

• An undirected graph is connected if every pair of vertices is connected by a path.

• A forest is an acyclic graph, and a tree is a connected acyclic graph.

• A graph that has weights associated with each edge is called a weighted graph.

Page 139: All-Reduce and Prefix-Sum Operations

Definitions and Representation

• Graphs can be represented by their adjacency matrix or an edge (or vertex) list.

• Adjacency matrices have a value ai,j = 1 if nodes i and j share an edge; 0 otherwise. In case of a weighted graph, ai,j = wi,j, the weight of the edge.

• The adjacency list representation of a graph G = (V,E) consists of an array Adj[1..|V|] of lists. Each list Adj[v] is a list of all vertices adjacent to v.

• For a graph with n nodes, adjacency matrices take Θ(n²) space and adjacency lists take Θ(|E|) space.

Page 140: All-Reduce and Prefix-Sum Operations

Definitions and Representation

An undirected graph and its adjacency matrix representation.

An undirected graph and its adjacency list representation.

Page 141: All-Reduce and Prefix-Sum Operations

Minimum Spanning Tree

• A spanning tree of an undirected graph G is a subgraph of G that is a tree containing all the vertices of G.

• In a weighted graph, the weight of a subgraph is the sum of the weights of the edges in the subgraph.

• A minimum spanning tree (MST) for a weighted undirected graph is a spanning tree with minimum weight.

Page 142: All-Reduce and Prefix-Sum Operations

Minimum Spanning Tree

An undirected graph and its minimum spanning tree.

Page 143: All-Reduce and Prefix-Sum Operations

Minimum Spanning Tree: Prim's Algorithm

• Prim's algorithm for finding an MST is a greedy algorithm.

• Start by selecting an arbitrary vertex and including it in the current MST.

• Grow the current MST by inserting into it the vertex closest to one of the vertices already in the current MST.

Page 144: All-Reduce and Prefix-Sum Operations

Minimum Spanning Tree: Prim's Algorithm

Prim's minimum spanning tree algorithm.

Page 145: All-Reduce and Prefix-Sum Operations

Minimum Spanning Tree: Prim's Algorithm

Prim's sequential minimum spanning tree algorithm.

Page 146: All-Reduce and Prefix-Sum Operations

Prim's Algorithm: Parallel Formulation

• The algorithm works in n outer iterations; it is hard to execute these iterations concurrently.

• The inner loop is relatively easy to parallelize. Let p be the number of processes, and let n be the number of vertices.

• The adjacency matrix is partitioned in a 1-D block fashion, with the distance vector d partitioned accordingly.

• In each step, a processor selects the locally closest node, followed by a global reduction to select the globally closest node.

• This node is inserted into the MST, and the choice is broadcast to all processors.

• Each processor updates its part of the d vector locally.
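A sequential Python sketch of this parallel formulation: the columns of the weight matrix and the distance array d are split into p slices, each slice computes a local minimum, and a reduction over the local minima simulates the global selection and broadcast. Names and the INF convention are illustrative assumptions.

```python
import numpy as np

INF = float("inf")

def prim_parallel_sim(W, p):                  # W[i][j] = edge weight, INF if absent
    n = len(W)
    cols = np.array_split(np.arange(n), p)    # 1-D block partition of the vertices
    d = np.array(W[0], dtype=float)           # distances to the growing MST, root = 0
    in_mst = [True] + [False] * (n - 1)
    weight = 0.0
    for _ in range(n - 1):
        # Each "process" finds its local minimum; a global reduction picks the winner.
        local_mins = [min(((d[v], v) for v in c if not in_mst[v]), default=(INF, -1))
                      for c in cols]
        dist, u = min(local_mins)
        in_mst[u] = True                      # chosen vertex broadcast to all processes
        weight += dist
        for c in cols:                        # each process updates its slice of d
            for v in c:
                if not in_mst[v]:
                    d[v] = min(d[v], W[u][v])
    return weight

W = [[INF, 1, 3, INF], [1, INF, 1, 4], [3, 1, INF, 2], [INF, 4, 2, INF]]
print(prim_parallel_sim(W, p=2))   # expected MST weight: 1 + 1 + 2 = 4
```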

Page 147: All-Reduce and Prefix-Sum Operations

Prim's Algorithm: Parallel Formulation

The partitioning of the distance array d and the adjacency matrix A among p processes.

Page 148: All-Reduce and Prefix-Sum Operations

Prim's Algorithm: Parallel Formulation

• The cost to select the minimum entry is O(n/p + log p).

• The cost of a broadcast is O(log p).

• The cost of the local update of the d vector is O(n/p).

• The parallel time per iteration is O(n/p + log p).

• The total parallel time is given by O(n²/p + n log p).

• The corresponding isoefficiency is O(p² log² p).