Page 1:

A Practical Quicksort Algorithm for Graphics Processors

Daniel Cederman and Philippas Tsigas

Distributed Computing and Systems, Chalmers University of Technology

Page 2:

Overview

System Model – The Algorithm – Experiments

[Diagram: quicksort recursion – 5 8 3 2 splits into 3 2 and 5 8, then 2 3 and 5 8]

Page 3:

System Model

[Diagram: quicksort recursion – 5 8 3 2 splits into 3 2 and 5 8, then 2 3 and 5 8]

Page 4:

CPU vs. GPU

Normal processors: multi-core, large cache, few threads, speculative execution.
Graphics processors: many-core, small cache, wide and fast memory bus, thousands of threads (hides memory latency).

Page 5:

CUDA

Designed for general purpose computation. Extensions to C/C++.

Page 6:

System Model – Global Memory

Global Memory

Page 7:

System Model – Shared Memory

[Diagram: global memory shared by Multiprocessor 0 and Multiprocessor 1, each with its own shared memory]

Page 8:

System Model – Thread Blocks

[Diagram: thread blocks 0, 1, x, y scheduled onto Multiprocessor 0 and Multiprocessor 1, each with shared memory, above global memory]

Page 9:

Challenges

• Minimal, manual cache
• 32-word SIMD instruction
• Coalesced memory access
• No block synchronization
• Expensive synchronization primitives

Page 10:

The Algorithm

[Diagram: quicksort recursion – 5 8 3 2 splits into 3 2 and 5 8, then 2 3 and 5 8]

Page 11:

Quicksort

[Diagram: quicksort recursion – 5 8 3 2 splits into 3 2 and 5 8, then 2 3 and 5 8]

Page 12:

Quicksort

[Diagram: quicksort recursion – 5 8 3 2 splits into 3 2 and 5 8, then 2 3 and 5 8]

Data Parallel!

Page 13:

Quicksort

[Diagram: quicksort recursion – 5 8 3 2 splits into 3 2 and 5 8, then 2 3 and 5 8]

NOT Data Parallel!

Page 14:

Quicksort

[Diagram: Thread Block 1 and Thread Block 2 partition separate sequences]

Phase One
• Sequences divided
• Block synchronization on the CPU

Page 15:

Quicksort

[Diagram: Thread Block 1 and Thread Block 2 each work on their own sequence]

Phase Two
• Each block is assigned its own sequence
• Run entirely on the GPU

Page 16:

Quicksort – Phase One

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

Page 17:

Quicksort – Phase One

Assign Thread Blocks

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

Page 18:

Quicksort – Phase One

• Uses an auxiliary array
• In-place is too expensive

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

Page 19:

Quicksort – Synchronization

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

Fetch-And-Add on LeftPointer; LeftPointer = 0

[Diagram: Thread Block ONE and Thread Block TWO claim output slots 0 and 1]
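The fetch-and-add coordination shown on these slides can be sketched with C++ atomics standing in for the CUDA primitive. This is a minimal illustration, not the paper's kernel: the function name, the worker count, and the strided chunking are all assumptions made here for the sketch.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Sketch: concurrent "thread blocks" claim output slots in the auxiliary
// array with fetch-and-add, so no locks are needed and writes never collide.
std::vector<int> partitionWithFetchAdd(const std::vector<int>& data, int pivot,
                                       int numWorkers, int* leftCount) {
    std::vector<int> aux(data.size());
    std::atomic<size_t> leftPtr{0};            // next free slot from the left
    std::atomic<size_t> rightPtr{data.size()}; // next free slot from the right

    auto worker = [&](int id) {
        // Strided chunking stands in for CUDA's block/thread indexing.
        for (size_t i = id; i < data.size(); i += numWorkers) {
            if (data[i] < pivot)
                aux[leftPtr.fetch_add(1)] = data[i];      // claim a left slot
            else
                aux[rightPtr.fetch_sub(1) - 1] = data[i]; // claim a right slot
        }
    };
    std::vector<std::thread> workers;
    for (int t = 0; t < numWorkers; ++t) workers.emplace_back(worker, t);
    for (auto& w : workers) w.join();
    *leftCount = static_cast<int>(leftPtr.load());
    return aux;
}
```

Elements smaller than the pivot end up left of `leftCount` and the rest to its right; their order within each side depends on thread timing, which is fine because quicksort only needs a partition, not a stable one.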

Page 20:

Quicksort – Synchronization

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

Fetch-And-Add on LeftPointer; LeftPointer = 2

[Diagram: Thread Block ONE and Thread Block TWO claim slots 2 and 3; auxiliary array so far: 3 2]

Page 21:

Quicksort – Synchronization

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

Fetch-And-Add on LeftPointer; LeftPointer = 4

[Diagram: Thread Block ONE and Thread Block TWO claim slots 4 and 5; auxiliary array so far: 3 2 2 3]

Page 22:

Quicksort – Synchronization

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

LeftPointer = 10

[Diagram: all elements written to the auxiliary array, those smaller than the pivot on the left, the larger ones on the right]

Page 23:

Quicksort – Synchronization

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

Three-way partition

[Diagram: elements smaller than, equal to, and larger than the pivot placed in three regions]

Page 24:

Two Pass Partitioning

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

[Diagram: each element classified as <, = or > the pivot]
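The two-pass idea can be sketched sequentially in C++ (names and the chunk division are illustrative assumptions): the first pass only counts how many elements of each chunk fall into the <, = and > buckets, a prefix sum turns the counts into private output offsets, and the second pass scatters with no synchronization at all.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Offsets { size_t lt, eq, gt; };

// Sketch of two-pass three-way partitioning. Each "chunk" plays the role
// of one thread's share of the sequence.
std::vector<int> twoPassPartition(const std::vector<int>& data, int pivot,
                                  size_t numChunks) {
    size_t n = data.size();
    std::vector<Offsets> counts(numChunks, {0, 0, 0});
    auto chunkOf = [&](size_t i) { return i * numChunks / n; };

    // Pass 1: count per chunk; no writes to the output, so no conflicts.
    for (size_t i = 0; i < n; ++i) {
        Offsets& c = counts[chunkOf(i)];
        if (data[i] < pivot) ++c.lt;
        else if (data[i] == pivot) ++c.eq;
        else ++c.gt;
    }

    // Prefix sum: turn the counts into exclusive start offsets per chunk.
    size_t totalLt = 0, totalEq = 0;
    for (auto& c : counts) { totalLt += c.lt; totalEq += c.eq; }
    size_t lt = 0, eq = totalLt, gt = totalLt + totalEq;
    for (auto& c : counts) {
        Offsets start{lt, eq, gt};
        lt += c.lt; eq += c.eq; gt += c.gt;
        c = start;
    }

    // Pass 2: every chunk writes into its own private output ranges.
    std::vector<int> out(n);
    for (size_t i = 0; i < n; ++i) {
        Offsets& c = counts[chunkOf(i)];
        if (data[i] < pivot) out[c.lt++] = data[i];
        else if (data[i] == pivot) out[c.eq++] = data[i];
        else out[c.gt++] = data[i];
    }
    return out;
}
```

Because pass two needs no atomics at all, the single fetch-and-add per chunk in pass one is amortized over the whole chunk.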

Page 25:

Second Phase

• Recurse on smallest sequence (max depth log(n))
• Explicit stack
• Alternative sorting for small sequences: bitonic sort
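The phase-two control flow above can be sketched sequentially in C++ (illustrative names, with std::sort standing in for the bitonic sort on small sequences): always push the larger half onto the explicit stack and continue with the smaller one, so the stack never grows past roughly log2(n) entries.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

const size_t kSmall = 16; // below this, switch to the alternative sort

// Sketch: iterative quicksort with an explicit stack. Recursing into the
// smaller partition first bounds the stack depth by log2(n).
void quicksortExplicitStack(std::vector<int>& a) {
    std::vector<std::pair<size_t, size_t>> stack; // [lo, hi) ranges
    stack.push_back({0, a.size()});
    while (!stack.empty()) {
        auto [lo, hi] = stack.back();
        stack.pop_back();
        while (hi - lo > kSmall) {
            int pivot = a[lo + (hi - lo) / 2];
            // Hoare-style two-way partition around the pivot value.
            size_t i = lo, j = hi;
            while (i < j) {
                while (a[i] < pivot) ++i;
                while (a[j - 1] > pivot) --j;
                if (i < j) { std::swap(a[i], a[j - 1]); ++i; --j; }
            }
            // Keep the smaller side, push the larger side for later.
            if (i - lo < hi - i) { stack.push_back({i, hi}); hi = i; }
            else                 { stack.push_back({lo, i}); lo = i; }
        }
        std::sort(a.begin() + lo, a.begin() + hi); // stand-in for bitonic sort
    }
}
```

The log(n) depth bound from the slide falls out of this choice: each pushed range is at least as large as the one kept, so the kept range at least halves on every push.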

Page 26:

Experiments

[Diagram: quicksort recursion – 5 8 3 2 splits into 3 2 and 5 8, then 2 3 and 5 8]

Page 27: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.

GPU-Quicksort Cederman and Tsigas, ESA08 GPUSort Govindaraju et. al.,

SIGMOD05 Radix-Merge Harris et. al., GPU Gems 3 ’07 Global radix Sengupta et. al., GH07 Hybrid Sintorn and Assarsson,

GPGPU07

Introsort David Musser, Software: Practice and Experience

Set of Algorithms

Page 28:

Distributions: Uniform, Gaussian, Bucket, Sorted, Zero
Models: Stanford

Page 33:

[Diagram: P; x1, x2, x3, x4]

Page 34:

Pivot Selection

• First phase: average of maximum and minimum element
• Second phase: median of first, middle and last element
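Both strategies can be written out directly; a small C++ sketch with illustrative helper names:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Phase one: pivot is the average of the minimum and maximum element,
// splitting the value range in half without inspecting element order.
int phaseOnePivot(const std::vector<int>& seq) {
    auto [mn, mx] = std::minmax_element(seq.begin(), seq.end());
    return *mn + (*mx - *mn) / 2; // midpoint form avoids overflow of mn + mx
}

// Phase two: median of the first, middle and last element of the sequence.
int phaseTwoPivot(const std::vector<int>& seq) {
    int a = seq.front();
    int b = seq[seq.size() / 2];
    int c = seq.back();
    return std::max(std::min(a, b), std::min(std::max(a, b), c)); // median of 3
}
```

On the running example (values 0–9), the phase-one rule yields pivot 4, matching the partition shown on the earlier slides.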

Page 35:

Hardware

• 8800GTX: 16 multiprocessors (128 cores), 86.4 GB/s bandwidth
• 8600GTS: 4 multiprocessors (32 cores), 32 GB/s bandwidth
• CPU reference: AMD Dual-Core Opteron 265 at 1.8 GHz

Page 36:

8800GTX – Uniform Distribution

[Chart: time (ms) vs. input size (1M–16M) for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge and STL]


Page 38:

8800GTX – Uniform – 16M

[Chart: time (ms) for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge and STL at 16M elements]

Page 39:

8800GTX – Uniform – 16M: CPU reference ~4 s

Page 40:

8800GTX – Uniform – 16M: GPUSort and Radix-Merge ~1.5 s

Page 41:

8800GTX – Uniform – 16M: GPU-Quicksort ~380 ms, Global Radix ~510 ms

Page 42:

8800GTX – Sorted Distribution

[Chart: time (ms) vs. input size (1M–8M) for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge and STL]


Page 44:

8800GTX – Sorted – 8M

[Chart: time (ms) for each algorithm at 8M elements]

Page 45:

8800GTX – Sorted – 8M: the CPU reference is faster than GPUSort and Radix-Merge

Page 46:

8800GTX – Sorted – 8M: GPU-Quicksort takes less than 200 ms

Page 47:

8800GTX – Zero Distribution

[Chart: time (ms) vs. input size (1M–16M) for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge and STL]


Page 49:

8800GTX – Zero – 16M

[Chart: time (ms) for each algorithm at 16M identical elements]

Page 50:

8800GTX – Zero – 16M: bitonic is unaffected by the distribution

Page 51:

8800GTX – Zero – 16M: three-way partition

Page 52:

8600GTS – Uniform Distribution

[Chart: time (ms) vs. input size (1M–8M) for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL and Hybrid]


Page 54:

8600GTS – Uniform – 8M

[Chart: time (ms) for each algorithm at 8M elements]

Page 55:

8600GTS – Uniform – 8M: the CPU reference is equal to Radix-Merge

Page 56:

8600GTS – Uniform – 8M: Quicksort and Hybrid are ~4 times faster

Page 57:

8600GTS – Sorted Distribution

[Chart: time (ms) vs. input size (1M–8M) for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL, Hybrid and Hybrid (Random)]


Page 59:

8600GTS – Sorted – 8M

[Chart: time (ms) for each algorithm at 8M elements]

Page 60:

8600GTS – Sorted – 8M: Hybrid needs randomization

Page 61:

8600GTS – Sorted – 8M: the CPU reference does better

Page 62:

Visibility Ordering – 8800GTX

[Chart: time (ms) for GPU-Quicksort, Global Radix, GPUSort, Radix/Merge and STL on Manuscript (2.2M), Dragon (3.6M) and Statuette (5M)]

Page 63:

Visibility Ordering – 8800GTX: results similar to the uniform distribution

Page 64:

Visibility Ordering – 8800GTX: the sequence needs to be a power of two in length

Page 65:

Conclusions

• Minimal, manual cache: used only for prefix sum and bitonic
• 32-word SIMD instruction: the main part executes the same instructions
• Coalesced memory access: all reads are coalesced
• No block synchronization: only required in the first phase
• Expensive synchronization primitives: two passes amortize the cost

Page 66:

Conclusions

Quicksort is a viable sorting method for graphics processors and can be implemented in a data-parallel way. It is competitive.

Page 67:

Thank you!

www.cs.chalmers.se/~cederman