Page 1:

A Practical Quicksort Algorithm for Graphics Processors

Daniel Cederman and Philippas Tsigas

Distributed Computing and Systems, Chalmers University of Technology

Page 2:

Overview

System Model – The Algorithm – Experiments

[Diagram: quicksort recursion – 5 8 3 2 splits into 3 2 and 5 8, then 2 3 and 5 8]

Page 3:

System Model

[Diagram: quicksort recursion – 5 8 3 2 splits into 3 2 and 5 8, then 2 3 and 5 8]

Page 4:

CPU vs. GPU

Normal processors: multi-core, large cache, few threads, speculative execution.
Graphics processors: many-core, small cache, wide and fast memory bus, thousands of threads (hides memory latency).

Page 5:

CUDA

Designed for general purpose computation. Extensions to C/C++.

Page 6:

System Model – Global Memory

Global Memory

Page 7:

System Model – Shared Memory

[Diagram: global memory shared by Multiprocessor 0 and Multiprocessor 1, each with its own shared memory]

Page 8:

System Model – Thread Blocks

[Diagram: thread blocks 0, 1, x, y scheduled onto Multiprocessor 0 and Multiprocessor 1, each with shared memory, above global memory]

Page 9:

Challenges

• Minimal, manual cache
• 32-word SIMD instruction
• Coalesced memory access
• No block synchronization
• Expensive synchronization primitives

Page 10:

The Algorithm

[Diagram: quicksort recursion – 5 8 3 2 splits into 3 2 and 5 8, then 2 3 and 5 8]

Page 11:

Quicksort

[Diagram: quicksort recursion – 5 8 3 2 splits into 3 2 and 5 8, then 2 3 and 5 8]

Page 12:

Quicksort

[Diagram: quicksort recursion – 5 8 3 2 splits into 3 2 and 5 8, then 2 3 and 5 8]

Data Parallel!

Page 13:

Quicksort

[Diagram: quicksort recursion – 5 8 3 2 splits into 3 2 and 5 8, then 2 3 and 5 8]

NOT Data Parallel!

Page 14:

Quicksort

[Diagram: Thread Block 1 and Thread Block 2 partition separate sequences]

Phase One
• Sequences divided
• Block synchronization on the CPU

Page 15:

Quicksort

[Diagram: Thread Block 1 and Thread Block 2 each work on their own sequence]

Phase Two
• Each block is assigned its own sequence
• Run entirely on the GPU

Page 16:

Quicksort – Phase One

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

Page 17:

Quicksort – Phase One

Assign Thread Blocks

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

Page 18:

Quicksort – Phase One

• Uses an auxiliary array
• In-place is too expensive

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

Page 19:

Quicksort – Synchronization

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

Fetch-And-Add on LeftPointer; LeftPointer = 0

[Diagram: Thread Block ONE and Thread Block TWO claim output slots 0 and 1]
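The fetch-and-add coordination shown on these slides can be sketched with C++ atomics standing in for the CUDA primitive. This is a minimal illustration, not the paper's kernel: the function name, the worker count, and the strided chunking are all assumptions made here for the sketch.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Sketch: concurrent "thread blocks" claim output slots in the auxiliary
// array with fetch-and-add, so no locks are needed and writes never collide.
std::vector<int> partitionWithFetchAdd(const std::vector<int>& data, int pivot,
                                       int numWorkers, int* leftCount) {
    std::vector<int> aux(data.size());
    std::atomic<size_t> leftPtr{0};            // next free slot from the left
    std::atomic<size_t> rightPtr{data.size()}; // next free slot from the right

    auto worker = [&](int id) {
        // Strided chunking stands in for CUDA's block/thread indexing.
        for (size_t i = id; i < data.size(); i += numWorkers) {
            if (data[i] < pivot)
                aux[leftPtr.fetch_add(1)] = data[i];      // claim a left slot
            else
                aux[rightPtr.fetch_sub(1) - 1] = data[i]; // claim a right slot
        }
    };
    std::vector<std::thread> workers;
    for (int t = 0; t < numWorkers; ++t) workers.emplace_back(worker, t);
    for (auto& w : workers) w.join();
    *leftCount = static_cast<int>(leftPtr.load());
    return aux;
}
```

Elements smaller than the pivot end up left of `leftCount` and the rest to its right; their order within each side depends on thread timing, which is fine because quicksort only needs a partition, not a stable one.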

Page 20:

Quicksort – Synchronization

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

Fetch-And-Add on LeftPointer; LeftPointer = 2

[Diagram: Thread Block ONE and Thread Block TWO claim slots 2 and 3; auxiliary array so far: 3 2]

Page 21:

Quicksort – Synchronization

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

Fetch-And-Add on LeftPointer; LeftPointer = 4

[Diagram: Thread Block ONE and Thread Block TWO claim slots 4 and 5; auxiliary array so far: 3 2 2 3]

Page 22:

Quicksort – Synchronization

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

LeftPointer = 10

[Diagram: all elements written to the auxiliary array, those smaller than the pivot on the left, the larger ones on the right]

Page 23:

Quicksort – Synchronization

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

Three-way partition

[Diagram: elements smaller than, equal to, and larger than the pivot placed in three regions]

Page 24:

Two Pass Partitioning

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

[Diagram: each element classified as <, = or > the pivot]
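The two-pass idea can be sketched sequentially in C++ (names and the chunk division are illustrative assumptions): the first pass only counts how many elements of each chunk fall into the <, = and > buckets, a prefix sum turns the counts into private output offsets, and the second pass scatters with no synchronization at all.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Offsets { size_t lt, eq, gt; };

// Sketch of two-pass three-way partitioning. Each "chunk" plays the role
// of one thread's share of the sequence.
std::vector<int> twoPassPartition(const std::vector<int>& data, int pivot,
                                  size_t numChunks) {
    size_t n = data.size();
    std::vector<Offsets> counts(numChunks, {0, 0, 0});
    auto chunkOf = [&](size_t i) { return i * numChunks / n; };

    // Pass 1: count per chunk; no writes to the output, so no conflicts.
    for (size_t i = 0; i < n; ++i) {
        Offsets& c = counts[chunkOf(i)];
        if (data[i] < pivot) ++c.lt;
        else if (data[i] == pivot) ++c.eq;
        else ++c.gt;
    }

    // Prefix sum: turn the counts into exclusive start offsets per chunk.
    size_t totalLt = 0, totalEq = 0;
    for (auto& c : counts) { totalLt += c.lt; totalEq += c.eq; }
    size_t lt = 0, eq = totalLt, gt = totalLt + totalEq;
    for (auto& c : counts) {
        Offsets start{lt, eq, gt};
        lt += c.lt; eq += c.eq; gt += c.gt;
        c = start;
    }

    // Pass 2: every chunk writes into its own private output ranges.
    std::vector<int> out(n);
    for (size_t i = 0; i < n; ++i) {
        Offsets& c = counts[chunkOf(i)];
        if (data[i] < pivot) out[c.lt++] = data[i];
        else if (data[i] == pivot) out[c.eq++] = data[i];
        else out[c.gt++] = data[i];
    }
    return out;
}
```

Because pass two needs no atomics at all, the single fetch-and-add per chunk in pass one is amortized over the whole chunk.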

Page 25:

Second Phase

• Recurse on smallest sequence (max depth log(n))
• Explicit stack
• Alternative sorting for small sequences: bitonic sort
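The phase-two control flow above can be sketched sequentially in C++ (illustrative names, with std::sort standing in for the bitonic sort on small sequences): always push the larger half onto the explicit stack and continue with the smaller one, so the stack never grows past roughly log2(n) entries.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

const size_t kSmall = 16; // below this, switch to the alternative sort

// Sketch: iterative quicksort with an explicit stack. Recursing into the
// smaller partition first bounds the stack depth by log2(n).
void quicksortExplicitStack(std::vector<int>& a) {
    std::vector<std::pair<size_t, size_t>> stack; // [lo, hi) ranges
    stack.push_back({0, a.size()});
    while (!stack.empty()) {
        auto [lo, hi] = stack.back();
        stack.pop_back();
        while (hi - lo > kSmall) {
            int pivot = a[lo + (hi - lo) / 2];
            // Hoare-style two-way partition around the pivot value.
            size_t i = lo, j = hi;
            while (i < j) {
                while (a[i] < pivot) ++i;
                while (a[j - 1] > pivot) --j;
                if (i < j) { std::swap(a[i], a[j - 1]); ++i; --j; }
            }
            // Keep the smaller side, push the larger side for later.
            if (i - lo < hi - i) { stack.push_back({i, hi}); hi = i; }
            else                 { stack.push_back({lo, i}); lo = i; }
        }
        std::sort(a.begin() + lo, a.begin() + hi); // stand-in for bitonic sort
    }
}
```

The log(n) depth bound from the slide falls out of this choice: each pushed range is at least as large as the one kept, so the kept range at least halves on every push.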

Page 26:

Experiments

[Diagram: quicksort recursion – 5 8 3 2 splits into 3 2 and 5 8, then 2 3 and 5 8]

Page 27: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.

GPU-Quicksort Cederman and Tsigas, ESA08 GPUSort Govindaraju et. al.,

SIGMOD05 Radix-Merge Harris et. al., GPU Gems 3 ’07 Global radix Sengupta et. al., GH07 Hybrid Sintorn and Assarsson,

GPGPU07

Introsort David Musser, Software: Practice and Experience

Set of Algorithms

Page 28:

Distributions: Uniform, Gaussian, Bucket, Sorted, Zero
Models: Stanford

Page 33:

[Diagram: P; x1, x2, x3, x4]

Page 34:

Pivot Selection

• First phase: average of maximum and minimum element
• Second phase: median of first, middle and last element
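Both strategies can be written out directly; a small C++ sketch with illustrative helper names:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Phase one: pivot is the average of the minimum and maximum element,
// splitting the value range in half without inspecting element order.
int phaseOnePivot(const std::vector<int>& seq) {
    auto [mn, mx] = std::minmax_element(seq.begin(), seq.end());
    return *mn + (*mx - *mn) / 2; // midpoint form avoids overflow of mn + mx
}

// Phase two: median of the first, middle and last element of the sequence.
int phaseTwoPivot(const std::vector<int>& seq) {
    int a = seq.front();
    int b = seq[seq.size() / 2];
    int c = seq.back();
    return std::max(std::min(a, b), std::min(std::max(a, b), c)); // median of 3
}
```

On the running example (values 0–9), the phase-one rule yields pivot 4, matching the partition shown on the earlier slides.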

Page 35:

Hardware

• 8800GTX: 16 multiprocessors (128 cores), 86.4 GB/s bandwidth
• 8600GTS: 4 multiprocessors (32 cores), 32 GB/s bandwidth
• CPU reference: AMD Dual-Core Opteron 265 at 1.8 GHz

Page 36:

8800GTX – Uniform Distribution

[Chart: time (ms) vs. input size (1M–16M) for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge and STL]


Page 38:

8800GTX – Uniform – 16M

[Chart: time (ms) for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge and STL at 16M elements]

Page 39:

8800GTX – Uniform – 16M: CPU reference ~4 s

Page 40:

8800GTX – Uniform – 16M: GPUSort and Radix-Merge ~1.5 s

Page 41:

8800GTX – Uniform – 16M: GPU-Quicksort ~380 ms, Global Radix ~510 ms

Page 42:

8800GTX – Sorted Distribution

[Chart: time (ms) vs. input size (1M–8M) for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge and STL]


Page 44:

8800GTX – Sorted – 8M

[Chart: time (ms) for each algorithm at 8M elements]

Page 45:

8800GTX – Sorted – 8M: the CPU reference is faster than GPUSort and Radix-Merge

Page 46:

8800GTX – Sorted – 8M: GPU-Quicksort takes less than 200 ms

Page 47:

8800GTX – Zero Distribution

[Chart: time (ms) vs. input size (1M–16M) for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge and STL]


Page 49:

8800GTX – Zero – 16M

[Chart: time (ms) for each algorithm at 16M identical elements]

Page 50:

8800GTX – Zero – 16M: bitonic is unaffected by the distribution

Page 51:

8800GTX – Zero – 16M: three-way partition

Page 52:

8600GTS – Uniform Distribution

[Chart: time (ms) vs. input size (1M–8M) for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL and Hybrid]


Page 54:

8600GTS – Uniform – 8M

[Chart: time (ms) for each algorithm at 8M elements]

Page 55:

8600GTS – Uniform – 8M: the CPU reference is equal to Radix-Merge

Page 56:

8600GTS – Uniform – 8M: Quicksort and Hybrid are ~4 times faster

Page 57:

8600GTS – Sorted Distribution

[Chart: time (ms) vs. input size (1M–8M) for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL, Hybrid and Hybrid (Random)]


Page 59:

8600GTS – Sorted – 8M

[Chart: time (ms) for each algorithm at 8M elements]

Page 60:

8600GTS – Sorted – 8M: Hybrid needs randomization

Page 61:

8600GTS – Sorted – 8M: the CPU reference does better

Page 62:

Visibility Ordering – 8800GTX

[Chart: time (ms) for GPU-Quicksort, Global Radix, GPUSort, Radix/Merge and STL on Manuscript (2.2M), Dragon (3.6M) and Statuette (5M)]

Page 63:

Visibility Ordering – 8800GTX: results similar to the uniform distribution

Page 64:

Visibility Ordering – 8800GTX: the sequence needs to be a power of two in length

Page 65:

Conclusions

• Minimal, manual cache: used only for prefix sum and bitonic
• 32-word SIMD instruction: the main part executes the same instructions
• Coalesced memory access: all reads are coalesced
• No block synchronization: only required in the first phase
• Expensive synchronization primitives: two passes amortize the cost

Page 66:

Conclusions

Quicksort is a viable sorting method for graphics processors and can be implemented in a data-parallel way. It is competitive.

Page 67:

Thank you!

www.cs.chalmers.se/~cederman