
J. Harkins 1 of 51 MAPLD2005/C178

Sorting on the SRC 6 Reconfigurable Computer

John Harkins, Tarek El-Ghazawi, Esam El-Araby, Miaoqing Huang

The George Washington University, Washington, DC


Algorithms

• Quick Sort

• Heap Sort

• Radix Sort

• Bitonic Sort

• Odd/Even Merge


SRC System Architecture

[Figure: a 16-port crossbar switch (1.6 GB/s peak port bandwidth) connecting Processor Nodes, FPGA Nodes, and Memory Nodes; links are labeled 64; up to 16 nodes per switch.]


Example - Quick Sort

idx:   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
0:   [13][ 3][14][15][10][ 2][ 6][ 0][ 8][ 4][12][ 7][ 5][11][ 1][ 9]
med: [ 0][ 3][14][15][10][ 2][ 6][ 9][ 8][ 4][12][ 7][ 5][11][ 1][13]
QS1: [ 0][ 3][ 5][ 7][ 4][ 2][ 6][ 1][ 8][ 9][12][15][14][11][10][13]
mL:  [ 0][ 3][ 5][ 7][ 4][ 2][ 6][ 1][ 8]
PS:  [ 0][ 1][ 2][ 3][ 4][ 5][ 6][ 7][ 8]


Quick Sort - MIMD Architecture

[Figure: memory banks A to F feed two FPGAs. FPGA 1 holds instances QS1, QS2, QS3 (90%); FPGA 2 holds instances QS4, QS5, QS6 (84%).]

• 6 Instances
• Median of 3 to select pivot
• Pipeline Sort for partitions ≤ 10 vs. Insertion Sort ≤ 20
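The pivot selection and small-partition cutoff listed above can be sketched in plain C. This is a minimal processor-side illustration, not the MAP C source from the talk: the names `median_of_3`, `insertion_sort`, `quick_sort` and `SMALL_PARTITION` are hypothetical, the explicit stack is my own choice to avoid recursion, and the cutoff of 20 is taken from the "Insertion Sort ≤ 20" bullet.

```c
#include <stdint.h>

#define SMALL_PARTITION 20  /* assumed cutoff, from "Insertion Sort <= 20" */

static void swap_keys(uint64_t *a, uint64_t *b) {
    uint64_t t = *a; *a = *b; *b = t;
}

/* Insertion sort of a[lo..hi]; does nothing when hi <= lo. */
static void insertion_sort(uint64_t *a, long lo, long hi) {
    for (long i = lo + 1; i <= hi; i++) {
        uint64_t key = a[i];
        long j = i - 1;
        while (j >= lo && a[j] > key) { a[j + 1] = a[j]; j--; }
        a[j + 1] = key;
    }
}

/* Order a[lo], a[mid], a[hi] so that the median ends up at mid. */
static void median_of_3(uint64_t *a, long lo, long hi) {
    long mid = lo + (hi - lo) / 2;
    if (a[mid] < a[lo])  swap_keys(&a[mid], &a[lo]);
    if (a[hi]  < a[lo])  swap_keys(&a[hi],  &a[lo]);
    if (a[hi]  < a[mid]) swap_keys(&a[hi],  &a[mid]);
}

/* Non-recursive quick sort with an explicit stack of (lo, hi) pairs. */
void quick_sort(uint64_t *a, long n) {
    long stack[128];
    int top = 0;
    if (n < 2) return;
    stack[top++] = 0;
    stack[top++] = n - 1;
    while (top > 0) {
        long hi = stack[--top];
        long lo = stack[--top];
        while (hi - lo + 1 > SMALL_PARTITION) {
            median_of_3(a, lo, hi);
            uint64_t pivot = a[lo + (hi - lo) / 2];
            long i = lo, j = hi;
            while (i <= j) {                 /* Hoare-style partition */
                while (a[i] < pivot) i++;
                while (a[j] > pivot) j--;
                if (i <= j) { swap_keys(&a[i], &a[j]); i++; j--; }
            }
            /* Push the larger side, keep working on the smaller one. */
            if (j - lo < hi - i) {
                stack[top++] = i;  stack[top++] = hi;
                hi = j;
            } else {
                stack[top++] = lo; stack[top++] = j;
                lo = i;
            }
        }
        insertion_sort(a, lo, hi);           /* finish the small partition */
    }
}
```

Applied to the 16 example keys, `median_of_3(a, 0, 15)` leaves 0, 9 and 13 at indices 0, 7 and 15, which matches the "med:" row of the example.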


Example - Heap Sort

idx:   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
0:   [13][ 3][14][15][10][ 2][ 6][ 0][ 8][ 4][12][ 7][ 5][11][ 1][ 9]
8:   [13][ 3][14][15][10][ 2][ 6][ 0][ 8][ 4][12][ 7][ 5][11][ 1][ 9]
7:   [13][ 3][14][15][10][ 2][ 6][ 0][ 8][ 4][12][ 7][ 5][11][ 1][ 9]
7:   [13][ 3][14][15][10][ 2][ 6][ 9][ 8][ 4][12][ 7][ 5][11][ 1][ 0]
6:   [13][ 3][14][15][10][ 2][ 6][ 9][ 8][ 4][12][ 7][ 5][11][ 1][ 0]
6:   [13][ 3][14][15][10][ 2][11][ 9][ 8][ 4][12][ 7][ 5][ 6][ 1][ 0]
max: [15][13][14][ 9][12][ 7][11][ 3][ 8][ 4][10][ 2][ 5][ 6][ 1][ 0]

[Figure: the same keys drawn as a binary tree at each step.]

Heap Sort - MIMD Architecture

[Figure: memory banks A to F feed two FPGAs. FPGA 1 holds instances HS1, HS2, HS3 (labeled 55%); FPGA 2 holds instances HS4, HS5, HS6 (labeled 5%).]

• 6 Instances
• Almost identical to processor code
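For reference, a conventional processor-style heap sort looks roughly like the sketch below. It is an assumption about what such code looks like, not the authors' source; `sift_down`, `build_max_heap` and `heap_sort` are hypothetical names. With this particular sift-down, building the heap from the 16 example keys reproduces the "max:" row of the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Sift a[i] down until the max-heap property holds within a[0..n-1]. */
static void sift_down(uint64_t *a, size_t i, size_t n) {
    for (;;) {
        size_t largest = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < n && a[l] > a[largest]) largest = l;
        if (r < n && a[r] > a[largest]) largest = r;
        if (largest == i) return;
        uint64_t t = a[i]; a[i] = a[largest]; a[largest] = t;
        i = largest;
    }
}

/* Build a max-heap bottom-up, starting at the last node with a child. */
void build_max_heap(uint64_t *a, size_t n) {
    for (size_t i = n / 2; i-- > 0; )
        sift_down(a, i, n);
}

/* Heap sort: move the maximum to the end, shrink the heap, repeat. */
void heap_sort(uint64_t *a, size_t n) {
    build_max_heap(a, n);
    for (size_t end = n; end > 1; end--) {
        uint64_t t = a[0]; a[0] = a[end - 1]; a[end - 1] = t;
        sift_down(a, 0, end - 1);
    }
}
```

For n = 16 the build loop visits nodes 7, 6, ..., 0, which is the order of the step labels in the example.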


Example - Radix Sort

idx:   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
1:   [13][ 3][14][15][10][ 2][ 6][ 0][ 8][ 4][12][ 7][ 5][11][ 1][ 9]
bits: 1101 0011 1110 1111 1010 0010 0110 0000 1000 0100 1100 0111 0101 1011 0001 1001

Pass1:
count1 = 4   count2 = 4   count3 = 4   count4 = 4

$\mathrm{index}_0 = 0, \qquad \mathrm{index}_n = \sum_{i=1}^{n} \mathrm{count}_i \quad (n > 0)$

index0 = 0   index1 = 4   index2 = 8   index3 = 12

Pass2:
key 1101   2: [  ][  ][  ][  ][13][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ]   index1 = 5
key 0011   2: [  ][  ][  ][  ][13][  ][  ][  ][  ][  ][  ][  ][ 3][  ][  ][  ]   index3 = 13
key 1110   2: [  ][  ][  ][  ][13][  ][  ][  ][14][  ][  ][  ][ 3][  ][  ][  ]   index2 = 9
bits after Pass2: 0000 1000 0100 1100 | 1101 0101 0001 1001 | 1110 1010 0010 0110 | 0011 1111 0111 1011

Pass3:
index0 = 4   index1 = 8   index2 = 12   index3 = 16
bits after Pass3: 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
3:   [ 0][ 1][ 2][ 3][ 4][ 5][ 6][ 7][ 8][ 9][10][11][12][13][14][15]


Radix Sort - MIMD Architecture

[Figure: memory banks A to F feed two FPGAs (labeled 33% and 25%) that hold three instances: Radix Sort 1, Radix Sort 2, Radix Sort 3.]

• 3 Instances
• Uses enumeration sort
• Radix 13 bits vs. 8 bits
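The counting and placement passes walked through in the Radix Sort example can be sketched as one stable digit pass in C. This is a processor-style counting pass written for illustration, not the enumeration-sort hardware design mentioned above; `radix_pass` and its parameters are hypothetical names. The caller supplies `index` with `1 << bits` entries.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One stable radix pass: copy src into dst ordered by the digit of
 * width `bits` located at bit position `shift`. */
void radix_pass(const uint64_t *src, uint64_t *dst, size_t n,
                unsigned shift, unsigned bits, size_t *index) {
    size_t buckets = (size_t)1 << bits;
    uint64_t mask = (uint64_t)buckets - 1;

    /* Count how many keys fall into each bucket. */
    memset(index, 0, buckets * sizeof *index);
    for (size_t i = 0; i < n; i++)
        index[(src[i] >> shift) & mask]++;

    /* Turn counts into start offsets (index0 = 0, running sum after). */
    size_t sum = 0;
    for (size_t d = 0; d < buckets; d++) {
        size_t c = index[d];
        index[d] = sum;
        sum += c;
    }

    /* Place each key at its bucket's offset and advance that offset. */
    for (size_t i = 0; i < n; i++)
        dst[index[(src[i] >> shift) & mask]++] = src[i];
}
```

With the 16 example keys and 2-bit digits, a first call with `shift = 0` puts 13 at position 4, 14 at position 8 and 3 at position 12, and leaves the offsets at 4, 8, 12, 16. A second call with `shift = 2` yields the sorted array.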


MIMD Code Structure

main.c

int main( )
{
    int n = 523770*6;
    int64 *buf;

    buf = cacheAlign(n);

    mapSort(buf, n);

    free(buf);
    exit(0);
}

mapSort.mc

void mapSort(int64 *buf, int n)
{
    OBM_BANK_A (bufA, int64, n/6)
    OBM_BANK_B (bufB, int64, n/6)
    …
    OBM_BANK_F (bufF, int64, n/6)

    DMA_CPU(dir, bufA, stripes, buf, n);

#pragma src parallel sections
    {
#pragma src section
        {Xsort(bufA, n/6);}
#pragma src section
        {Xsort(bufB, n/6);}
        …
#pragma src section
        {Xsort(bufF, n/6);}
    }

    DMA_CPU(dir, bufA, stripes, buf, n);
    return;
}


Example - Bitonic Sort

Input Keys:
[13][ 3][14][15][10][ 2][ 6][ 0][ 8][ 4][12][ 7][ 5][11][ 1][ 9]

Schedule:
(0,1) (3,2)
(0,2) (1,3)
(0,1) (2,3)

[Figure: animation of the keys entering four at a time on lanes 0: to 3: and passing through a network of low/high compare elements (LH, HL).]

Output, in the order the slides fill it:
[ 0][ 2][ 3][ 6][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ]
[ 0][ 2][ 3][ 6][10][13][14][15][  ][  ][  ][  ][  ][  ][  ][  ]
[ 0][ 2][ 3][ 6][10][13][14][15][  ][  ][  ][  ][ 1][ 4][ 5][ 7]
[ 0][ 2][ 3][ 6][10][13][14][15][ 8][ 9][11][12][ 1][ 4][ 5][ 7]
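The schedule above is enough to write a 4-input bitonic sort in software. The sketch below assumes that a pair (x,y) means "smaller key to lane x, larger key to lane y"; that reading is my interpretation of the LH/HL labels, and `cmpx` and `bitonic_sort4` are hypothetical names. In hardware the two pairs of each step run in parallel; here they run one after another.

```c
#include <stdint.h>

/* Compare-exchange: leave the smaller key in v[lo], the larger in v[hi]. */
static void cmpx(uint64_t *v, int lo, int hi) {
    if (v[lo] > v[hi]) {
        uint64_t t = v[lo]; v[lo] = v[hi]; v[hi] = t;
    }
}

/* Sort four keys with the three-step schedule from the slide. */
void bitonic_sort4(uint64_t v[4]) {
    static const int schedule[6][2] = {
        {0, 1}, {3, 2},   /* step 1: lanes 0-1 ascending, lanes 2-3 descending */
        {0, 2}, {1, 3},   /* step 2: split into low half and high half         */
        {0, 1}, {2, 3}    /* step 3: order each half                           */
    };
    for (int s = 0; s < 6; s++)
        cmpx(v, schedule[s][0], schedule[s][1]);
}
```

Under that assumption the first four example keys, 13 3 14 15, leave the network as 3 13 14 15, and the next four, 10 2 6 0, leave it as 0 2 6 10.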



Bitonic Sort - SIMD Architecture

[Figure: memory banks A to F feed two FPGAs (labeled 27% and 25%). Instance 1 is an 8 Input Bitonic Sorting Network, instance 2 is a 4 Input Bitonic Sort, and a SIMD Controller drives them.]

• 2 Instances
• Parallel sorting network


Example - Odd/Even Merge

Input Keys:
A: [ 0][ 1][ 2][ 4][ 7][11][12][14]
B: [ 3][ 5][ 6][ 8][ 9][10][13][15]

[Figure: animation of keys leaving A and B two at a time and passing through low/high compare elements (LH), a MUX, and delay elements (Z-1, Z-2).]

Merged Keys (last frame shown):
C: [ 0][ 1][ 2][ 3][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ]


Odd/Even Merge - SIMD Architecture

[Figure: memory banks A to F feed two FPGAs (labeled 40% and 25%) containing the blocks Odd Merge Two, Even Merge Two, and Merge Out.]

• 1 Instance
• Parallel sorting network
• A/B = odd ; C/D = even


SIMD Code Structure

main.c

int main( )
{
    int n = 523770*6;
    int64 *buf;

    buf = cacheAlign(n);

    mapSort(buf, n);

    free(buf);
    exit(0);
}

mapSort.mc

void mapSort(int64 *buf, int n)
{
    OBM_BANK_A (AA, int64, n/6)
    OBM_BANK_B (BB, int64, n/6)
    …
    OBM_BANK_F (FF, int64, n/6)

    DMA_CPU(dir, AA, stripes, buf, n);

    for (i=0; i<rounds; i++) {
        schedule( &r1, &r2);
        bitonicSort8(AA[r1],BB[r1],CC[r1],DD[r1],
                     AA[r2],BB[r2],CC[r2],DD[r2],
                     &AA[r1],&BB[r1],&CC[r1],&DD[r1],
                     &AA[r2],&BB[r2],&CC[r2],&DD[r2]);
        bitonicSort4(EE[r1],FF[r1],EE[r2],FF[r2], … );
    }

    DMA_CPU(dir, AA, stripes, buf, n);
    return;
}


Implementation Comparisons

Algorithm     Processor  Complexity  Language  Lines of Code  Recursion  FPGA Util. (% Slices)  Upper Bound (x10^6 keys/s)
Quick Sort    X86        N lg N      C         81
              FPGA       N lg N      MC        97/96          n/a        90, 84                 31.58
Heap Sort     X86        N lg N      C         55             -
              FPGA       N lg N      MC        56/54          n/a        55, 0                  31.58
Radix Sort    X86        N           C         70             -
              FPGA       N           MC        81/64          n/a        33, 0                  60.00
Bitonic Sort  X86        N lg^2 N    C         78
              FPGA       lg^2 N      VHDL      53/478/365     n/a        27, 0                  6.32
O/E Merge     X86        N           C         52             -
              FPGA       N           MC        71/120         n/a        40, 0                  60.87

The Compiler, MIMD, SIMD, and Refactoring columns are symbol-coded on the slide.
Compiler symbols: icc v8.0 -fast, mcc v1.8, mcc v1.9
Refactoring scale: entirely, major changes, some, very little, almost none

X86 = Dual Xeon 2.8GHz
FPGA = Virtex2 XC6000 @ 100MHz
MC = MAP C


Lesson Learned #1

                                         Quick Sort  Heap Sort  Radix Sort  Bitonic Sort  O/E Merge
2.8 GHz Xeon (x10^6 keys/s)
  gcc                                    1.99        0.50       1.63        -             -
  icc -fast                              5.66        1.06       4.72        -             -
FPGA upper bound estimate (x10^6 keys/s) 31.58       31.58      60.00       6.32          60.87
Upper bound on speedup
  vs gcc                                 15.87       63.16      36.81       -             -
  vs icc                                 5.58        29.79      12.71       -             -

• Know your tools
• Develop accurate assessments early
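The speedup rows in the table are consistent with dividing the FPGA upper-bound estimate by the measured processor rate. A tiny sketch of that arithmetic follows; the function name `speedup_bound` is hypothetical and both rates are in units of 10^6 keys/s.

```c
/* Upper bound on speedup: estimated FPGA rate over measured CPU rate.
 * Both arguments use the same unit (here, 10^6 keys/s). */
double speedup_bound(double fpga_upper_bound, double cpu_rate) {
    return fpga_upper_bound / cpu_rate;
}
```

For Quick Sort against gcc this gives 31.58 / 1.99, about 15.87; for Radix Sort against icc it gives 60.00 / 4.72, about 12.71.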


Test Conditions

• 64 bit unsigned integer keys
• Uniformly distributed
• Randomly permuted
• Scores average of 10 runs
• FPGA configuration time ~65ms
• DMA time ~18ms
• Typical key quantity 3.14M
• Processor comparison: Xeon 2.8GHz, 1GB mem


Experimental Results - 64 bit keys

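[Figure: two bar charts of sorting rate (x 10^6 keys/s) by sorting algorithm, X86 vs. FPGA. Chart 1 covers Quick, Heap, Radix, and Bitonic (y-axis 0 to 14; bar labels 5.66, 1.06, 4.72, 0.69, 2.32, 1.96, 1.02, 12.99). Chart 2 covers O/E Merge (y-axis 0 to 90; bar labels 77.03 and 36).]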


mcc Compiler

• Attempts to pipeline inner loops
  – Maintains sequential behavior of C
  – Reports dependencies/penalties

• Quick Sort: 1 penalty*

• Heap Sort: 12 penalties

• Radix Sort: 2 penalties

• Bitonic Sort: 5 penalties

• Odd/Even Merge: 1 penalty

• Easy to build embarrassingly parallel code

• Resource usage ~2x HDL


Conclusion

• FPGAs not best choice for sorting
• Sorting is memory bound
  – Tight loops, low computation suited to processor
  – More parallel memory accesses
  – Faster clock rates
• Refactoring for better performance
  – FPGAs underutilized
  – Understand compiler limitations
  – Eliminate dependencies


Tight Loop Example

• Merge

a[N] = b[N] = infinity;    /* sentinels after the last key of each input */
j = k = 0;
for (i = 0; i < 2*N; i++) {
    if (a[j] > b[k])
        merged[i] = b[k++];
    else
        merged[i] = a[j++];
}
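A self-contained version of this loop, for 64-bit unsigned keys, might look like the sketch below. It assumes `UINT64_MAX` stands in for "infinity", so real keys must be smaller than that value, and that both input arrays have one spare slot at index `n`; `sentinel_merge` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Merge two sorted runs of n keys each into merged[0..2n-1].
 * a and b must each have room for n + 1 entries: slot n receives the
 * sentinel, so the loop body needs no end-of-input checks. */
void sentinel_merge(uint64_t *a, uint64_t *b, size_t n, uint64_t *merged) {
    size_t j = 0, k = 0;
    a[n] = b[n] = UINT64_MAX;            /* "infinity" sentinels */
    for (size_t i = 0; i < 2 * n; i++) {
        if (a[j] > b[k])
            merged[i] = b[k++];
        else
            merged[i] = a[j++];
    }
}
```

The loop body is one comparison, one copy and one index increment per key, which is the kind of tight, low-computation loop the conclusion describes as well suited to a processor.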


Future Work

• More refactoring
  – Greater use of block rams
  – HW prediction to reduce penalties

• FPGA performance gain = ƒ(computation density/memory access)
