J. Harkins 1 of 51 MAPLD2005/C178
Sorting on the SRC 6 Reconfigurable Computer
John Harkins, Tarek El-Ghazawi, Esam El-Araby, Miaoqing Huang
The George Washington University, Washington, DC
Algorithms
• Quick Sort
• Heap Sort
• Radix Sort
• Bitonic Sort
• Odd/Even Merge
SRC System Architecture
[Diagram: Processor Nodes, FPGA Nodes, and Memory Nodes attached to a crossbar switch; links labeled 64]
• 16 Port Crossbar Switch, 1.6 GB/s Peak Port BW
• Up to 16 Nodes per Switch
Example - Quick Sort
idx:   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
0:   [13][ 3][14][15][10][ 2][ 6][ 0][ 8][ 4][12][ 7][ 5][11][ 1][ 9]
med: [ 0][ 3][14][15][10][ 2][ 6][ 9][ 8][ 4][12][ 7][ 5][11][ 1][13]
QS1: [ 0][ 3][ 5][ 7][ 4][ 2][ 6][ 1][ 8][ 9][12][15][14][11][10][13]
mL:  [ 0][ 3][ 5][ 7][ 4][ 2][ 6][ 1][ 8]
PS:  [ 0][ 1][ 2][ 3][ 4][ 5][ 6][ 7][ 8]
Quick Sort - MIMD Architecture
[Diagram: Banks A-F; FPGA 1 runs QS1, QS2, QS3 (90%); FPGA 2 runs QS4, QS5, QS6 (84%)]
• 6 Instances
• Median of 3 to select pivot
• Pipeline Sort for partitions ≤ 10 vs. Insertion Sort ≤ 20
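The pivot selection and small-partition cutoff listed above can be sketched in portable C. This is a minimal illustration, not the MAP C source: the function names are hypothetical, the cutoff of 10 follows the slide, and a plain insertion sort stands in for the hardware Pipeline Sort, whose design the slides do not show.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SMALL_PARTITION 10 /* cutoff taken from the slide; below it we stop partitioning */

static void swap_keys(uint64_t *a, uint64_t *b)
{
    uint64_t t = *a; *a = *b; *b = t;
}

/* Insertion sort of keys[lo..hi] inclusive (stand-in for Pipeline Sort). */
static void insertion_sort(uint64_t *keys, ptrdiff_t lo, ptrdiff_t hi)
{
    for (ptrdiff_t i = lo + 1; i <= hi; i++) {
        uint64_t v = keys[i];
        ptrdiff_t j = i - 1;
        while (j >= lo && keys[j] > v) { keys[j + 1] = keys[j]; j--; }
        keys[j + 1] = v;
    }
}

/* Order keys[lo], keys[mid], keys[hi] so that the median ends up at mid. */
static void median_of_3(uint64_t *keys, ptrdiff_t lo, ptrdiff_t mid, ptrdiff_t hi)
{
    if (keys[mid] < keys[lo]) swap_keys(&keys[mid], &keys[lo]);
    if (keys[hi] < keys[lo]) swap_keys(&keys[hi], &keys[lo]);
    if (keys[hi] < keys[mid]) swap_keys(&keys[hi], &keys[mid]);
}

static void quick_sort_range(uint64_t *keys, ptrdiff_t lo, ptrdiff_t hi)
{
    while (hi - lo + 1 > SMALL_PARTITION) {
        ptrdiff_t mid = lo + (hi - lo) / 2;
        median_of_3(keys, lo, mid, hi);
        uint64_t pivot = keys[mid];
        ptrdiff_t i = lo, j = hi;
        while (i <= j) { /* Hoare-style partition around the pivot value */
            while (keys[i] < pivot) i++;
            while (keys[j] > pivot) j--;
            if (i <= j) { swap_keys(&keys[i], &keys[j]); i++; j--; }
        }
        /* Recurse into the smaller side, keep looping on the larger one. */
        if (j - lo < hi - i) { quick_sort_range(keys, lo, j); lo = i; }
        else { quick_sort_range(keys, i, hi); hi = j; }
    }
    insertion_sort(keys, lo, hi);
}

void quick_sort(uint64_t *keys, size_t n)
{
    assert(keys != NULL || n == 0);
    if (n > 1) quick_sort_range(keys, 0, (ptrdiff_t)n - 1);
}
```

Recursing only into the smaller partition bounds the stack depth at about lg N, which matters on the processor side; the FPGA row of the comparison table lists recursion as n/a.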
Example - Heap Sort
[Diagram: the keys drawn as a binary tree; each frame highlights the node being sifted while the max-heap is built]
idx:   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
0:   [13][ 3][14][15][10][ 2][ 6][ 0][ 8][ 4][12][ 7][ 5][11][ 1][ 9]
8:   [13][ 3][14][15][10][ 2][ 6][ 0][ 8][ 4][12][ 7][ 5][11][ 1][ 9]
7:   [13][ 3][14][15][10][ 2][ 6][ 0][ 8][ 4][12][ 7][ 5][11][ 1][ 9]
7:   [13][ 3][14][15][10][ 2][ 6][ 9][ 8][ 4][12][ 7][ 5][11][ 1][ 0]
6:   [13][ 3][14][15][10][ 2][ 6][ 9][ 8][ 4][12][ 7][ 5][11][ 1][ 0]
6:   [13][ 3][14][15][10][ 2][11][ 9][ 8][ 4][12][ 7][ 5][ 6][ 1][ 0]
max: [15][13][14][ 9][12][ 7][11][ 3][ 8][ 4][10][ 2][ 5][ 6][ 1][ 0]
Heap Sort - MIMD Architecture
[Diagram: Banks A-F; FPGA 1 runs HS1, HS2, HS3 (55%); FPGA 2 runs HS4, HS5, HS6 (5%)]
• 6 Instances
• Almost identical to processor code
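For reference, a conventional in-place heap sort in C looks roughly like the sketch below. It is an assumption that the processor code resembles this; the function names are illustrative. Building the heap bottom-up from the last parent reproduces the "max" array of the Heap Sort example for the 16 sample keys.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sift keys[root] down until the subtree rooted there is a max-heap of size n. */
static void sift_down(uint64_t *keys, size_t root, size_t n)
{
    for (;;) {
        size_t child = 2 * root + 1;          /* left child */
        if (child >= n) return;               /* root is a leaf */
        if (child + 1 < n && keys[child + 1] > keys[child]) child++; /* larger child */
        if (keys[root] >= keys[child]) return;
        uint64_t t = keys[root]; keys[root] = keys[child]; keys[child] = t;
        root = child;
    }
}

/* Heapify from the last parent (index n/2 - 1) down to the root. */
void build_max_heap(uint64_t *keys, size_t n)
{
    for (size_t i = n / 2; i-- > 0; ) sift_down(keys, i, n);
}

/* Repeatedly move the maximum to the end and restore the heap on the rest. */
void heap_sort(uint64_t *keys, size_t n)
{
    assert(keys != NULL || n == 0);
    build_max_heap(keys, n);
    for (size_t end = n; end > 1; end--) {
        uint64_t t = keys[0]; keys[0] = keys[end - 1]; keys[end - 1] = t;
        sift_down(keys, 0, end - 1);
    }
}
```

Each sift step depends on the comparison made in the previous one, which is the kind of loop-carried dependency a pipelining compiler has to serialize.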
Example - Radix Sort
Keys are shown in 4-bit binary and sorted one 2-bit digit per pass, low digit first.
idx:   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
1:   [13][ 3][14][15][10][ 2][ 6][ 0][ 8][ 4][12][ 7][ 5][11][ 1][ 9]
bin: 1101 0011 1110 1111 1010 0010 0110 0000 1000 0100 1100 0111 0101 1011 0001 1001

Pass1:
count1 = 4, count2 = 4, count3 = 4, count4 = 4
$\mathrm{index}_0 = 0, \qquad \mathrm{index}_n = \sum_{i=1}^{n} \mathrm{count}_i \quad (n > 0)$
index0 = 0, index1 = 4, index2 = 8, index3 = 12

Pass2:
start:    index0 = 0, index1 = 4, index2 = 8, index3 = 12; count0 = 0, count1 = 0, count2 = 0, count3 = 0
key 1101: 13 goes to slot 4; index1 = 5; count3 = 1
key 0011: 3 goes to slot 12; index3 = 13; count0 = 1
key 1110: 14 goes to slot 8; index2 = 9; count3 = 2
…
end:      index0 = 4, index1 = 8, index2 = 12, index3 = 16
2:   [ 0][ 8][ 4][12][13][ 5][ 1][ 9][14][10][ 2][ 6][ 3][15][ 7][11]
bin: 0000 1000 0100 1100 1101 0101 0001 1001 1110 1010 0010 0110 0011 1111 0111 1011

Pass3:
3:   [ 0][ 1][ 2][ 3][ 4][ 5][ 6][ 7][ 8][ 9][10][11][12][13][14][15]
bin: 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
Radix Sort - MIMD Architecture
[Diagram: Banks A-F; FPGA 1 runs Radix Sort 1, Radix Sort 2, Radix Sort 3 (33%); a second FPGA block is labeled "FPGA 25%"]
• 3 Instances
• Uses enumeration sort
• Radix 13 bits vs. 8 bits
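The passes of the Radix Sort example can be sketched as a least-significant-digit radix sort in C. This is an illustrative version under stated assumptions: it uses 8-bit digits (one of the two radix widths on the slide), the names are hypothetical, and it counts the next digit while scattering the current one, as the example's Pass2 does.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define RADIX_BITS 8
#define BUCKETS (1u << RADIX_BITS)
#define PASSES (64 / RADIX_BITS)

/* Extract the digit used by the given pass (pass 0 = least significant). */
static unsigned digit(uint64_t key, unsigned pass)
{
    return (unsigned)(key >> (pass * RADIX_BITS)) & (BUCKETS - 1);
}

/* Sorts keys in place; returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int radix_sort(uint64_t *keys, size_t n)
{
    assert(keys != NULL || n == 0);
    if (n < 2) return 0;
    uint64_t *tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL) return -1;

    size_t count[BUCKETS] = {0}, next[BUCKETS], index[BUCKETS];
    uint64_t *src = keys, *dst = tmp;

    /* First pass over the data: count only. */
    for (size_t i = 0; i < n; i++) count[digit(src[i], 0)]++;

    for (unsigned pass = 0; pass < PASSES; pass++) {
        /* index[b] = number of keys in buckets below b (exclusive prefix sum). */
        size_t sum = 0;
        for (unsigned b = 0; b < BUCKETS; b++) { index[b] = sum; sum += count[b]; }

        /* Scatter by the current digit while counting the next digit. */
        memset(next, 0, sizeof next);
        for (size_t i = 0; i < n; i++) {
            uint64_t k = src[i];
            dst[index[digit(k, pass)]++] = k;
            if (pass + 1 < PASSES) next[digit(k, pass + 1)]++;
        }
        memcpy(count, next, sizeof count);
        uint64_t *swap = src; src = dst; dst = swap;
    }

    if (src != keys) memcpy(keys, src, n * sizeof *keys);
    free(tmp);
    return 0;
}
```

A wider radix reduces the number of passes over memory at the cost of larger count and index tables, which is presumably the trade-off behind the 13-bit vs. 8-bit choice.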
MIMD Code Structure
main.c
int main( )
{
    int n = 523770*6;
    int64 *buf;

    buf = cacheAlign(n);
    mapSort(buf, n);
    free(buf);
    exit(0);
}

mapSort.mc
void mapSort(int64 *buf, int n)
{
    OBM_BANK_A (bufA, int64, n/6)
    OBM_BANK_B (bufB, int64, n/6)
    …
    OBM_BANK_F (bufF, int64, n/6)

    DMA_CPU(dir, bufA, stripes, buf, n);
#pragma src parallel sections
    {
#pragma src section
        {Xsort(bufA, n/6);}
#pragma src section
        {Xsort(bufB, n/6);}
        …
#pragma src section
        {Xsort(bufF, n/6);}
    }
    DMA_CPU(dir, bufA, stripes, buf, n);
    return;
}
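A plain C analogue of this structure may help readers without access to the MAP compiler. The sketch below assumes the buffer is split into six contiguous stripes (the slide does not say how DMA_CPU lays out the stripes), runs the sorts one after another instead of in parallel sections, and uses the standard library qsort as a hypothetical stand-in for Xsort.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define NUM_BANKS 6 /* one stripe per on-board memory bank, A through F */

static int cmp_u64(const void *x, const void *y)
{
    uint64_t a = *(const uint64_t *)x, b = *(const uint64_t *)y;
    return (a > b) - (a < b);
}

/* Hypothetical stand-in for Xsort: sorts one stripe of len keys. */
static void sort_stripe(uint64_t *stripe, size_t len)
{
    qsort(stripe, len, sizeof *stripe, cmp_u64);
}

/* Sorts each of the NUM_BANKS stripes of buf independently. */
void sort_stripes(uint64_t *buf, size_t n)
{
    assert(n % NUM_BANKS == 0);
    size_t len = n / NUM_BANKS;
    for (int bank = 0; bank < NUM_BANKS; bank++)
        sort_stripe(buf + (size_t)bank * len, len);
}
```

Note that this yields six individually sorted stripes rather than one globally sorted array; the slide's code likewise shows no merge step after the parallel sections.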
Example - Bitonic Sort
[Diagram: pipelined sorting network on lanes 0-3 built from compare elements labeled LH and HL; successive frames stream the keys through the network four at a time]
Schedule: (0,1) (3,2); (0,2) (1,3); (0,1) (2,3)
Input Keys, frame by frame:
idx:   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
1:   [13][ 3][14][15][10][ 2][ 6][ 0][ 8][ 4][12][ 7][ 5][11][ 1][ 9]
2:   [  ][  ][  ][  ][10][ 2][ 6][ 0][ 8][ 4][12][ 7][ 5][11][ 1][ 9]
3:   [  ][  ][  ][  ][  ][  ][  ][  ][ 8][ 4][12][ 7][ 5][11][ 1][ 9]
4:   [  ][  ][  ][  ][  ][  ][  ][  ][ 8][ 4][12][ 7][  ][  ][  ][  ]
5:   [ 0][ 2][ 3][ 6][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ]
6:   [ 0][ 2][ 3][ 6][10][13][14][15][  ][  ][  ][  ][  ][  ][  ][  ]
7:   [ 0][ 2][ 3][ 6][10][13][14][15][  ][  ][  ][  ][ 1][ 4][ 5][ 7]
8:   [ 0][ 2][ 3][ 6][10][13][14][15][ 8][ 9][11][12][ 1][ 4][ 5][ 7]
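The schedule on these slides can be read as three stages of two compare-exchange operations each, which is the standard 4-input bitonic sorter. The C sketch below assumes a pair (a,b) means "smaller key to lane a, larger key to lane b"; that reading and the names used are assumptions, not taken from the VHDL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { int lo, hi; } cmp_pair; /* smaller key goes to lo, larger to hi */

/* Three stages, two independent comparators per stage, as on the slide. */
static const cmp_pair SCHEDULE[3][2] = {
    { {0, 1}, {3, 2} },  /* build an ascending pair and a descending pair */
    { {0, 2}, {1, 3} },  /* half-cleaner across the bitonic sequence */
    { {0, 1}, {2, 3} },  /* final clean-up within each half */
};

static void compare_exchange(uint64_t *k, int lo, int hi)
{
    if (k[lo] > k[hi]) {
        uint64_t t = k[lo]; k[lo] = k[hi]; k[hi] = t;
    }
}

/* Sorts exactly four keys in ascending order using the fixed schedule. */
void bitonic_sort4(uint64_t k[4])
{
    assert(k != NULL);
    for (int stage = 0; stage < 3; stage++)
        for (int c = 0; c < 2; c++)
            compare_exchange(k, SCHEDULE[stage][c].lo, SCHEDULE[stage][c].hi);
}
```

Because the comparator positions never depend on the data, the two comparators in a stage can run in parallel and the stages can be pipelined, which is what makes this network a natural fit for hardware.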
Bitonic Sort - SIMD Architecture
[Diagram: Banks A-F; FPGA1 holds 8 Input Bitonic Sorting Network (1), 4 Input Bitonic Sort (2), and a SIMD Controller (27%); a second FPGA block is labeled "FPGA 25%"]
• 2 Instances
• Parallel sorting network
Example - Odd/Even Merge
[Diagram: two input compare elements (LH) feeding delay stages Z-1 and Z-2, a MUX, and an output compare element (LH); successive frames stream two keys from A and two from B through the network]
Input Keys:
A: [ 0][ 1][ 2][ 4][ 7][11][12][14]
B: [ 3][ 5][ 6][ 8][ 9][10][13][15]
Frames (keys not yet consumed from A and B, and keys written to C):
1: A: [ 0][ 1][ 2][ 4][ 7][11][12][14]  B: [ 3][ 5][ 6][ 8][ 9][10][13][15]
2: A: [ 0][ 1][ 2][ 4][ 7][11][12][14]  B: [ 3][ 5][ 6][ 8][ 9][10][13][15]
3: A: [  ][  ][ 2][ 4][ 7][11][12][14]  B: [  ][  ][ 6][ 8][ 9][10][13][15]
4: A: [  ][  ][  ][  ][ 7][11][12][14]  B: [  ][  ][ 6][ 8][ 9][10][13][15]
5: A: [  ][  ][  ][  ][  ][  ][12][14]  B: [  ][  ][ 6][ 8][ 9][10][13][15]
6: A: [  ][  ][  ][  ][  ][  ][12][14]  B: [  ][  ][  ][  ][ 9][10][13][15]
Merged Keys:
C (frames 1-4): [  ][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ]
C (frame 5):    [ 0][ 1][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ]
C (frame 6):    [ 0][ 1][ 2][ 3][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ]
Odd/Even Merge - SIMD Architecture
[Diagram: Banks A-F; FPGA 1 holds Odd Merge Two, Even Merge Two, and Merge Out (40%); a second FPGA block is labeled "FPGA 25%"]
• 1 Instance
• Parallel sorting network
• A/B = odd; C/D = even
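The split into an odd merge, an even merge, and an output stage matches the general shape of Batcher's odd-even merge. The C sketch below is a software illustration of that idea, not the hardware design: it assumes two sorted inputs of equal, even length, merges their even-indexed and odd-indexed keys separately with ordinary two-way merges, and then interleaves the results with one compare-exchange per output pair. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Two-way merge of a[start], a[start+2], ... with b[start], b[start+2], ... */
static void merge_strided(const uint64_t *a, const uint64_t *b, size_t n,
                          size_t start, uint64_t *out)
{
    size_t i = start, j = start, o = 0;
    while (i < n && j < n) {
        if (a[i] <= b[j]) { out[o++] = a[i]; i += 2; }
        else              { out[o++] = b[j]; j += 2; }
    }
    while (i < n) { out[o++] = a[i]; i += 2; }
    while (j < n) { out[o++] = b[j]; j += 2; }
}

/* Merges sorted a[0..n-1] and b[0..n-1] into c[0..2n-1]; n must be even.
 * Returns 0 on success, -1 if scratch memory cannot be allocated. */
int odd_even_merge(const uint64_t *a, const uint64_t *b, size_t n, uint64_t *c)
{
    assert(n >= 2 && n % 2 == 0);
    uint64_t *even = malloc(2 * n * sizeof *even);
    if (even == NULL) return -1;
    uint64_t *odd = even + n;

    merge_strided(a, b, n, 0, even); /* n keys from even positions */
    merge_strided(a, b, n, 1, odd);  /* n keys from odd positions */

    /* Output stage: the smallest even key and the largest odd key are already
     * in place; every other output pair needs one compare-exchange. */
    c[0] = even[0];
    for (size_t i = 0; i + 1 < n; i++) {
        uint64_t x = odd[i], y = even[i + 1];
        c[2 * i + 1] = x < y ? x : y;
        c[2 * i + 2] = x < y ? y : x;
    }
    c[2 * n - 1] = odd[n - 1];

    free(even);
    return 0;
}
```

With the A and B arrays from the example, the even merge yields 0, 2, 3, 6, 7, 9, 12, 13 and the odd merge yields 1, 4, 5, 8, 10, 11, 14, 15; the output stage then produces 0 through 15 in order.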
SIMD Code Structure
main.c
int main( )
{
    int n = 523770*6;
    int64 *buf;

    buf = cacheAlign(n);
    mapSort(buf, n);
    free(buf);
    exit(0);
}

mapSort.mc
void mapSort(int64 *buf, int n)
{
    OBM_BANK_A (AA, int64, n/6)
    OBM_BANK_B (BB, int64, n/6)
    …
    OBM_BANK_F (FF, int64, n/6)

    DMA_CPU(dir, AA, stripes, buf, n);
    for (i=0; i<rounds; i++) {
        schedule( &r1, &r2);
        bitonicSort8(AA[r1], BB[r1], CC[r1], DD[r1],
                     AA[r2], BB[r2], CC[r2], DD[r2],
                     &AA[r1], &BB[r1], &CC[r1], &DD[r1],
                     &AA[r2], &BB[r2], &CC[r2], &DD[r2]);
        bitonicSort4(EE[r1], FF[r1], EE[r2], FF[r2], … );
    }
    DMA_CPU(dir, AA, stripes, buf, n);
    return;
}
Implementation Comparisons
Algorithm     Processor  Complexity  Language  Lines of Code  Recursion  FPGA Util. % Slices  Upper Bound (×10^6 keys/s)
Quick Sort    X86        N lg N      C         81
              FPGA       N lg N      MC        97/96          n/a        90, 84               31.58
Heap Sort     X86        N lg N      C         55             -
              FPGA       N lg N      MC        56/54          n/a        55, 0                31.58
Radix Sort    X86        N           C         70             -
              FPGA       N           MC        81/64          n/a        33, 0                60.00
Bitonic Sort  X86        N lg² N     C         78
              FPGA       lg² N       VHDL      53/478/365     n/a        27, 0                6.32
O/E Merge     X86        N           C         52             -
              FPGA       N           MC        71/120         n/a        40, 0                60.87
[The Compiler, MIMD, SIMD, and Refactoring columns are symbol-coded on the slide; the symbols did not survive in this transcript.]
Compiler legend: icc v8.0 -fast; mcc v1.8; mcc v1.9
Refactoring legend: entirely; major changes; some; very little; almost none
X86 = Dual Xeon 2.8 GHz
FPGA = Virtex2 XC6000 @ 100 MHz
MC = MAP C
Lesson Learned #1
Compiler                                   Quick Sort  Heap Sort  Radix Sort  Bitonic Sort  O/E Merge
2.8 GHz Xeon (×10^6 keys/s)
  gcc                                      1.99        0.50       1.63        -             -
  icc -fast                                5.66        1.06       4.72        -             -
FPGA upper bound estimate (×10^6 keys/s)   31.58       31.58      60.00       6.32          60.87
Upper bound on speedup
  vs gcc                                   15.87       63.16      36.81       -             -
  vs icc                                   5.58        29.79      12.71       -             -
• Know your tools
• Develop accurate assessments early
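The speedup rows appear to be the FPGA upper bound estimate divided by the measured processor rate. As a worked check for Quick Sort against gcc, with R as an assumed symbol for the sorting rate in 10^6 keys/s:

```latex
\[
S_{\text{bound}} = \frac{R_{\text{FPGA, bound}}}{R_{\text{X86}}}
                 = \frac{31.58}{1.99} \approx 15.87
\]
```

The same ratio against icc -fast gives 31.58 / 5.66 ≈ 5.58, so the choice of baseline compiler changes the apparent headroom by almost a factor of three.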
Test Conditions
• 64-bit unsigned integer keys
• Uniformly distributed
• Randomly permuted
• Scores average of 10 runs
• FPGA configuration time ~65 ms
• DMA time ~18 ms
• Typical key quantity 3.14M
• Processor comparison: Xeon 2.8 GHz, 1 GB mem
Experimental Results - 64 bit keys
[Bar chart: X86 vs. FPGA sorting rate (×10^6 keys/s, axis 0 to 14) for Quick, Heap, Radix, and Bitonic; data labels as extracted: 5.66, 1.06, 4.72, 0.69, 2.32, 1.96, 1.02, 12.99]
[Bar chart: X86 vs. FPGA sorting rate (axis 0 to 90) for O/E Merge; data labels as extracted: 77.03, 36]
mcc Compiler
• Attempts to pipeline inner loops
  – Maintains sequential behavior of C
  – Reports dependencies/penalties
• Quick Sort: 1 penalty*
• Heap Sort: 12 penalties
• Radix Sort: 2 penalties
• Bitonic Sort: 5 penalties
• Odd/Even Merge: 1 penalty
• Easy to build embarrassingly parallel code
• Resource usage ~2x HDL
Conclusion
• FPGAs not best choice for sorting
• Sorting is memory bound
  – Tight loops, low computation suited to processor
  – More parallel memory accesses
  – Faster clock rates
• Refactoring for better performance
  – FPGAs underutilized
  – Understand compiler limitations
  – Eliminate dependencies
Tight Loop Example
• Merge
    a[N] = b[N] = infinity;
    j = k = 0;
    Loop i = 0 to 2N-1 {
        if (a[j] > b[k]) merged[i] = b[k++];
        else merged[i] = a[j++];
    }
Future Work
• More refactoring
  – Greater use of block rams
  – HW prediction to reduce penalties
• FPGA performance gain = ƒ(computation density/memory access)