1
CS 770G - Parallel Algorithms in Scientific Computing
July 18, 2001
Lecture 14
Parallel Sorting
2
References
• Introduction to Parallel Computing, Kumar, Grama, Gupta, Karypis, Benjamin Cummings.
• A portion of the notes comes from Prof. J. Demmel’s CS267 course at UC Berkeley.
3
Issues in Sorting on Parallel Computers
• Where the input and output sequences are stored.
– Input: unsorted sequence distributed uniformly among processors.
– Output: sorted sequence across processors.
• Global ordering is defined by the processor enumeration: elements on Pi precede those on Pj for i < j.
• Compare-exchange or compare-split on nonlocal elements.
4
Compare-Exchange
• One element per processor.
• Pi & Pj compare their elements ai & aj.
– Each sends its element to the other.
– Pi keeps min(ai, aj).
– Pj keeps max(ai, aj).
• Communication time: Tcomm = ts + tw (ts: message startup time, tw: per-word transfer time).
[Figure: Pi holds ai and Pj holds aj; after the exchange both hold ai, aj; Pi then keeps min{ai, aj} and Pj keeps max{ai, aj}.]
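The compare-exchange step can be sketched in a few lines. This is a minimal single-process illustration, not message-passing code: the function name `compare_exchange` is hypothetical, and the return value simply models what Pi and Pj hold afterwards.

```python
def compare_exchange(a_i, a_j):
    """Return the values held by (P_i, P_j) after one compare-exchange step."""
    # Each processor sends its element to the other, so both see (a_i, a_j).
    # P_i keeps the minimum and P_j keeps the maximum.
    return min(a_i, a_j), max(a_i, a_j)

print(compare_exchange(7, 3))  # (3, 7)
```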
5
Compare-Split
• More than one element per processor.
• Pi & Pj compare their blocks Ai & Aj.
– Sort their block locally.
– Send their block to each other.
– Each proc merges the two sorted blocks, and retains the appropriate half.
• Communication time: Tcomm = ts + tw (n/p).
[Figure: Pi holds {1, 3} and Pj holds {2, 4}; after exchanging blocks each holds both; each merges them into 1, 2, 3, 4; Pi keeps {1, 2} and Pj keeps {3, 4}.]
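The compare-split step can be simulated the same way. The sketch below assumes both blocks live in one Python process and that Pi retains as many elements as it started with; `compare_split` is a hypothetical name, and `heapq.merge` stands in for the merge each processor performs.

```python
from heapq import merge

def compare_split(block_i, block_j):
    """Simulate compare-split; returns (block kept by P_i, block kept by P_j)."""
    # Step 1: each processor sorts its block locally.
    a, b = sorted(block_i), sorted(block_j)
    # Steps 2 and 3: blocks are exchanged and each side merges the two sorted blocks.
    merged = list(merge(a, b))
    # P_i retains the smaller part, P_j the larger part.
    keep = len(a)
    return merged[:keep], merged[keep:]

print(compare_split([1, 3], [2, 4]))  # ([1, 2], [3, 4])
```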
6
Bubble Sort
• Algorithm:
    BUBBLE_SORT(n)
    begin
      for i = n-1 down to 1
        for j = 1 to i
          compare-exchange(a[j], a[j+1])
        end j-loop
      end i-loop
    end
• O(n) per iteration × n iterations ⇒ complexity = O(n^2).
• Inherently sequential -- compare adjacent pairs in order.
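A runnable counterpart of the bubble sort pseudocode, as a minimal sketch. It uses 0-based indices, so the inner loop compares `a[j]` with `a[j + 1]` for j = 0 .. i-1.

```python
def bubble_sort(a):
    """Sort list a in place with adjacent compare-exchanges and return it."""
    n = len(a)
    for i in range(n - 1, 0, -1):      # i = n-1 down to 1
        for j in range(i):             # adjacent pairs, strictly in order
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
    return a
```

Each pass depends on the result of the previous compare-exchange, which is why the loop order cannot be parallelized directly.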
7
Odd-Even Transposition
• Bubble sort variant.
• Sort n elements in n phases.
• Each phase requires at most n/2 compare-exchange operations.
• Alternate between 2 phases -- odd & even.
• Let {a1, a2, …, an} be the sequence to be sorted (n even).
– Odd phase: compare-exchange the pairs (a1,a2), (a3,a4), …, (an-1,an).
– Even phase: compare-exchange the pairs (a2,a3), (a4,a5), …, (an-2,an-1).
• O(n) comparisons per phase × n phases ⇒ complexity = O(n^2).
8
Odd-Even Transposition (cont.)
• Sequential algorithm:
    ODD_EVEN(n)
    begin
      for i = 1 to n
        if i is odd then
          for j = 0 to n/2-1
            compare-exchange(a[2j+1], a[2j+2])
          end j-loop
        if i is even then
          for j = 1 to n/2-1
            compare-exchange(a[2j], a[2j+1])
          end j-loop
      end i-loop
    end
9
Example
(each row shows the sequence at the start of the labelled phase)
3 2 3 8 5 6 4 1    Phase 1 (odd)
2 3 3 8 5 6 1 4    Phase 2 (even)
2 3 3 5 8 1 6 4    Phase 3 (odd)
2 3 3 5 1 8 4 6    Phase 4 (even)
10
Example (cont.)
2 3 3 1 5 4 8 6    Phase 5 (odd)
2 3 1 3 4 5 6 8    Phase 6 (even)
2 1 3 3 4 5 6 8    Phase 7 (odd)
1 2 3 3 4 5 6 8    Phase 8 (even)
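The trace in the example can be reproduced with a short sequential sketch of odd-even transposition. The function name and the `states` list are illustrative choices; `states[k]` is the sequence after k phases, so `states[k-1]` is what the example shows next to "Phase k".

```python
def odd_even_transposition(a):
    """Run n phases on a copy of a; states[k] is the sequence after k phases."""
    a = list(a)
    n = len(a)
    states = [list(a)]
    for phase in range(1, n + 1):
        # Odd phases start at the pair (a1, a2), i.e. index 0;
        # even phases start at the pair (a2, a3), i.e. index 1.
        start = 0 if phase % 2 == 1 else 1
        for j in range(start, n - 1, 2):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
        states.append(list(a))
    return states

states = odd_even_transposition([3, 2, 3, 8, 5, 6, 4, 1])
for k, s in enumerate(states[:-1], start=1):
    print(f"Phase {k} ({'odd' if k % 2 else 'even'}):", *s)
```

Within one phase the pairs are disjoint, so the inner loop iterations are independent of each other; that independence is what the parallel version exploits.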
11
Parallel Implementation
• One element per processor.
• Compare-exchange operations on pairs of elements are done simultaneously.
• Odd phase: processor P2i-1 compare-exchanges its element with P2i.
• Even phase: processor P2i compare-exchanges its element with P2i+1.
12
Parallel Complexity
• In each phase, the complexity of compare-exchange = O(1).
• A total of n phases ⇒ parallel run time = O(n).
• Sequential complexity of the best comparison-based sorting algorithm = O(n log n).
• Hence, odd-even transposition sort is not cost-optimal: the processor-time product is n × O(n) = O(n^2), which exceeds O(n log n).
13
Parallel Implementation (cont.)
• More than one element per processor, p < n; each processor holds a block of n/p elements.
• Complexity of local sort = O((n/p) log(n/p)).
• p phases (p/2 odd & p/2 even).
• Odd phase: processor P2i-1 compare-splits its block with P2i.
• Even phase: processor P2i compare-splits its block with P2i+1.
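The block version can be simulated in a single process to check that p phases suffice. This is a sketch under stated assumptions: p divides n, the list `blocks` stands in for the p processors' local memories, and `sorted(...)` on the two concatenated blocks stands in for the merge performed during compare-split. The name `block_odd_even_sort` is hypothetical.

```python
def block_odd_even_sort(data, p):
    """Simulate odd-even transposition sort on p processors (p must divide len(data))."""
    size = len(data) // p
    # Local sort: each processor sorts its own block of n/p elements.
    blocks = [sorted(data[k * size:(k + 1) * size]) for k in range(p)]
    for phase in range(1, p + 1):
        # Odd phases pair (P1,P2), (P3,P4), ...; even phases pair (P2,P3), (P4,P5), ...
        start = 0 if phase % 2 == 1 else 1
        for k in range(start, p - 1, 2):
            merged = sorted(blocks[k] + blocks[k + 1])          # compare-split
            blocks[k], blocks[k + 1] = merged[:size], merged[size:]
    return [x for block in blocks for x in block]

print(block_odd_even_sort([3, 2, 3, 8, 5, 6, 4, 1], p=4))
```

In a real implementation the pairs inside one phase would run concurrently, each exchanging a block of n/p elements with its partner.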
14
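Parallel Performance
• Parallel run-time:
  $T_p = O\left(\frac{n}{p}\log\frac{n}{p}\right) + O(n) + O(n)$
  (local sort + comparisons + communications)
• Speedup:
  $S_p = \frac{T_s}{T_p} = \frac{O(n\log n)}{O\left(\frac{n}{p}\log\frac{n}{p}\right) + O(n)}$
• Efficiency:
  $E_p = \frac{S_p}{p} = \frac{1}{1 - O\left(\frac{\log p}{\log n}\right) + O\left(\frac{p}{\log n}\right)}$
• Cost optimal ⇒ p = O(log n).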
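The efficiency expression follows from the run-time and speedup by dividing numerator and denominator by n log n. The steps below are a sketch that treats the O(·) terms loosely, as the slide does:

```latex
\begin{align*}
E_p = \frac{T_s}{p\,T_p}
  &= \frac{O(n\log n)}{p\left[O\!\left(\frac{n}{p}\log\frac{n}{p}\right) + O(n)\right]}
   = \frac{O(n\log n)}{O\!\left(n\log\frac{n}{p}\right) + O(np)} \\
  &= \frac{1}{O\!\left(\frac{\log n - \log p}{\log n}\right) + O\!\left(\frac{p}{\log n}\right)}
   = \frac{1}{1 - O\!\left(\frac{\log p}{\log n}\right) + O\!\left(\frac{p}{\log n}\right)}
\end{align*}
```

The efficiency stays bounded below by a constant only while p / log n remains O(1), which gives the cost-optimality condition p = O(log n).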
15
Quicksort
• Divide-and-conquer.
• (Average) complexity = O(n log n).
• Let the sequence be A[1..n].
• Two steps:
– Divide: given A[q..r], divide into 2 subarrays A[q..s] & A[s+1..r] such that each element of A[q..s] ≤ each element of A[s+1..r].
– Conquer: apply Quicksort to the subsequences.
• Partitioning:
– Select a pivot x.
– One subsequence contains elements ≤ x; the other contains elements > x.
16
Quicksort Algorithm
    QUICKSORT(A, q, r)
    begin
      if q < r then
        x = A[q]; s = q;
        for i = q+1 to r
          if A[i] ≤ x then
            s = s+1;
            swap(A[s], A[i]);
          end if
        end i-loop
        swap(A[q], A[s]);        (pivot is now at A[s], its final position)
        QUICKSORT(A, q, s-1);
        QUICKSORT(A, s+1, r);
      end if
    end
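A runnable version of the sequential algorithm, as a sketch with 0-based inclusive bounds. It assumes the first element as pivot, as in the pseudocode, and recurses on A[q..s-1] and A[s+1..r]: once the pivot has been swapped to index s it is in its final position, and leaving it out of the recursion also guarantees termination when keys repeat.

```python
def quicksort(A, q, r):
    """Sort A[q..r] in place (0-based, inclusive bounds)."""
    if q < r:
        x = A[q]          # pivot: first element of the subarray
        s = q             # A[q+1..s] holds the elements <= x seen so far
        for i in range(q + 1, r + 1):
            if A[i] <= x:
                s += 1
                A[s], A[i] = A[i], A[s]
        A[q], A[s] = A[s], A[q]   # pivot moves to its final position s
        quicksort(A, q, s - 1)
        quicksort(A, s + 1, r)

data = [3, 2, 3, 8, 5, 6, 4, 1]
quicksort(data, 0, len(data) - 1)
print(data)  # [1, 2, 3, 3, 4, 5, 6, 8]
```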
17
Parallel Quicksort
• Perform quicksort on the subsequences in parallel.
• Start with a single processor.
– Assign one of the subproblems to another processor.
– Each of these processors sorts its array by using quicksort and assigns one of its subproblems to another processor.
– The algorithm terminates when the arrays cannot be further partitioned.
18
Parallel Complexity
• Problem: each partition is done by a single processor.
• In the beginning, the complexity of partition = O(n).
• Hence the run time is bounded below by Ω(n).
• With n processors, processor-time product = Ω(n^2) ⇒ not cost-optimal.
• Needs parallel partitioning.
– PRAM model
19
Parallel CRCW PRAM Model
• Concurrent-read, concurrent-write parallel random-access machine.
• Write conflicts are resolved arbitrarily.
• Quicksort can be interpreted as constructing a binary tree.
– Pivot is the root.
– Elements ≤ pivot go to the left subtree.
– Elements > pivot go to the right subtree.
• Sorted sequence obtained by inorder traversal.
20
Parallel PRAM Algorithm
• Select a pivot.
• Partition into 2 parts.
• Subsequent pivot elements, one for each new subtree, are then selected in parallel.
• In each iteration, a level of the tree is constructed in O(1) time.
• Thus, the average complexity = depth of tree = O(log n).
• The sorted sequence is obtained by inorder traversal in O(1) time.
• Thus, it is cost-optimal on average: n processors × O(log n) time = O(n log n).
21
BuildTree Algorithm
    BUILD_TREE(A[1..n])
    begin
      for each proc i do
        root = i;
        parent_i = root;
        leftchild[i] = rightchild[i] = n+1;
      end for
      repeat for each proc i ≠ root
        if (A[i] < A[parent_i]) or (A[i] = A[parent_i] and i < parent_i) then
          leftchild[parent_i] = i;
          if i = leftchild[parent_i] then exit
          else parent_i = leftchild[parent_i];
        else
          rightchild[parent_i] = i;
          if i = rightchild[parent_i] then exit
          else parent_i = rightchild[parent_i];
        end if
      end repeat
    end
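The tree construction can be simulated sequentially to see how arbitrary write resolution still yields a valid search tree. This is a sketch, not PRAM code: a seeded `random.Random` models the arbitrary winner of each concurrent write, indices are 0-based, `NIL = n` plays the role of n+1, and the input is assumed non-empty. The helper names `build_tree` and `inorder` are hypothetical, and the traversal here is an ordinary sequential one.

```python
import random

def build_tree(A, seed=0):
    """Simulate BUILD_TREE on a CRCW PRAM; returns (root, left, right), 0-based."""
    rng = random.Random(seed)          # models arbitrary write-conflict resolution
    n = len(A)
    NIL = n
    root = rng.randrange(n)            # every process writes root; one write survives
    parent = [root] * n
    left, right = [NIL] * n, [NIL] * n

    def goes_left(i):
        p = parent[i]
        return A[i] < A[p] or (A[i] == A[p] and i < p)

    active = [i for i in range(n) if i != root]
    while active:                      # one loop iteration = one tree level
        # Write step: each active process tries to become a child of its parent.
        writes = {}
        for i in active:
            writes.setdefault((parent[i], goes_left(i)), []).append(i)
        for (p, is_left), writers in writes.items():
            (left if is_left else right)[p] = rng.choice(writers)
        # Read step: winners exit; losers descend to the winning child.
        still_active = []
        for i in active:
            child = left[parent[i]] if goes_left(i) else right[parent[i]]
            if child != i:
                parent[i] = child
                still_active.append(i)
        active = still_active
    return root, left, right

def inorder(A, root, left, right):
    """Sequential inorder traversal; returns the elements of A in sorted order."""
    out, stack, node, NIL = [], [], root, len(A)
    while stack or node != NIL:
        while node != NIL:
            stack.append(node)
            node = left[node]
        node = stack.pop()
        out.append(A[node])
        node = right[node]
    return out

A = [3, 2, 3, 8, 5, 6, 4, 1]
print(inorder(A, *build_tree(A)))  # [1, 2, 3, 3, 4, 5, 6, 8]
```

Different seeds pick different pivots and so produce different trees, but the inorder sequence is always the sorted one.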
22
Bitonic Sort
• A bitonic sorting network sorts n elements in O(log^2 n) time.
• The key operation is to rearrange a bitonic sequence into a sorted sequence.
• Bitonic sequence: {a0, a1, …, an-1} with the property that either:
(1) there exists i such that {a0, …, ai} is monotonically increasing and {ai+1, …, an-1} is monotonically decreasing, or
(2) there exists a cyclic shift of indices so that (1) holds.
Sorting on Different Networks
[Figure: labels Sorting Networks, PRAM Sorts, Sorting on Network Y, LogP Sorts, Sorting on Machine X. Diagram elements: processors p sharing one memory (MEM); processor/memory pairs (P, M) connected by a network.]
24
Bitonic Seq. to Increasing Seq.
• Let s = {a0, a1, …, an-1} be a bitonic sequence such that a0 ≤ a1 ≤ … ≤ an/2-1 and an/2 ≥ an/2+1 ≥ … ≥ an-1.
• Define 2 subsequences:
  $s_1 = (\min\{a_0, a_{n/2}\}, \min\{a_1, a_{n/2+1}\}, \ldots, \min\{a_{n/2-1}, a_{n-1}\})$
  $s_2 = (\max\{a_0, a_{n/2}\}, \max\{a_1, a_{n/2+1}\}, \ldots, \max\{a_{n/2-1}, a_{n-1}\})$
• In sequence s1, there is an element bi = min{ai, an/2+i} such that all the elements before bi are from the increasing part of the original sequence, and all those after are from the decreasing part.
25
Bitonic Seq. to Increasing Seq. (cont.)
• In sequence s2, the element bi’ = max{ai, an/2+i} is such that all the elements before bi’ are from the decreasing part of the original sequence, and all those after are from the increasing part.
• Every element of the first sequence ≤ every element of the second sequence.
• Both s1 and s2 are bitonic sequences.
• Thus, the initial problem of rearranging a bitonic sequence of size n is reduced to that of rearranging 2 smaller bitonic sequences of size n/2 and concatenating the results.
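A single bitonic split is easy to check on a small input. The sketch below assumes an even-length bitonic list; `bitonic_split` is a hypothetical name.

```python
def bitonic_split(s):
    """Split a bitonic sequence of even length n into (s1, s2)."""
    half = len(s) // 2
    s1 = [min(s[i], s[half + i]) for i in range(half)]
    s2 = [max(s[i], s[half + i]) for i in range(half)]
    return s1, s2

s1, s2 = bitonic_split([3, 5, 8, 9, 7, 4, 2, 1])
print(s1, s2)  # [3, 4, 2, 1] [7, 5, 8, 9]
```

Here s1 rises then falls, s2 is bitonic after a cyclic shift (8, 9, 7, 5), and the largest element of s1 (4) does not exceed the smallest element of s2 (5).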
26
Bitonic Seq. to Increasing Seq. (cont.)
• Repeat the process recursively until we obtain subsequences of size 1.
• At that point, the output is sorted in monotonically increasing order.
• Since each bitonic split halves the size of the problem, the number of levels of splits = log n.
• Sorting a bitonic sequence using bitonic splits is called bitonic merge.
• Can be implemented easily on a network of comparators.
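Applying the split recursively gives the bitonic merge. A minimal recursive sketch, assuming the input is bitonic and its length is a power of two (the function name is hypothetical):

```python
def bitonic_merge(s):
    """Sort a bitonic sequence (length a power of two) into increasing order."""
    n = len(s)
    if n <= 1:
        return list(s)
    half = n // 2
    # Bitonic split: s1 gets the pairwise minima, s2 the pairwise maxima.
    s1 = [min(s[i], s[half + i]) for i in range(half)]
    s2 = [max(s[i], s[half + i]) for i in range(half)]
    # Every element of s1 <= every element of s2, so concatenation is safe.
    return bitonic_merge(s1) + bitonic_merge(s2)

print(bitonic_merge([3, 5, 8, 9, 7, 4, 2, 1]))  # [1, 2, 3, 4, 5, 7, 8, 9]
```

The recursion depth is log n, and the n/2 comparisons at each level are independent of one another, which is what makes the merge map naturally onto a comparator network.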
27
Sorting Unordered Elements
• Any sequence of 2 elements forms a bitonic sequence.
• Hence, any unsorted sequence is a concatenation of bitonic sequences of size 2.
• Merge adjacent bitonic sequences alternately in increasing and decreasing order.
• By definition, the sequence obtained by concatenating an increasing and a decreasing sequence is bitonic.
• By merging larger and larger bitonic sequences, we eventually obtain a bitonic sequence of size n; one final bitonic merge sorts it.
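The whole construction can be sketched as an iterative loop over run sizes 2, 4, …, n. This index-based form (partner = i XOR distance) is one common way to write the network and is not necessarily the formulation used on the slides that follow; it assumes the length is a power of two, and `bitonic_sort` is a hypothetical name.

```python
def bitonic_sort(a):
    """Return a sorted copy of a; len(a) must be a power of two."""
    a = list(a)
    n = len(a)
    size = 2                       # length of the sorted runs being produced
    while size <= n:
        dist = size // 2           # distance between compared elements
        while dist >= 1:
            for i in range(n):
                partner = i ^ dist
                if partner > i:
                    # Runs alternate between increasing and decreasing order,
                    # so neighbouring runs together form a bitonic sequence.
                    ascending = (i & size) == 0
                    if (ascending and a[i] > a[partner]) or \
                       (not ascending and a[i] < a[partner]):
                        a[i], a[partner] = a[partner], a[i]
            dist //= 2
        size *= 2
    return a

print(bitonic_sort([3, 2, 3, 8, 5, 6, 4, 1]))  # [1, 2, 3, 3, 4, 5, 6, 8]
```

There are log n values of `size` and at most log n values of `dist` for each, which matches the O(log^2 n) depth quoted for the bitonic sorting network.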
28
Bitonic Sort Algorithm
29
Parallel Bitonic Sort
30
Parallel Bitonic Sort (cont.)
31
Parallel Performance
32
Parallel Sorting Comparison