Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of...
-
Upload
yadiel-gardiner -
Category
Documents
-
view
231 -
download
6
Transcript of Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of...
![Page 1: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/1.jpg)
A Practical Quicksort Algorithmfor Graphics Processors
Daniel Cederman and Philippas Tsigas
Distributed Computing and SystemsChalmers University of Technology
![Page 2: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/2.jpg)
System Model The Algorithm Experiments
Overview
5 8 3 2
3 2
2 3
5 8
5 8
![Page 3: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/3.jpg)
System Model The Algorithm Experiments
System Model
5 8 3 2
3 2
2 3
5 8
5 8
![Page 4: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/4.jpg)
CPU vs. GPU
Normal processors Graphics processors
Multi-core Large cache Few threads Speculative execution
Many-core Small cache Wide and fast memory
bus Thousands of threads
hides memory latency
![Page 5: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/5.jpg)
Designed for general purpose computation Extensions to C/C++
CUDA
![Page 6: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/6.jpg)
System Model – Global Memory
Global Memory
![Page 7: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/7.jpg)
System Model – Shared Memory
Global Memory
Multiprocessor 0 Multiprocessor 1
SharedMemory
SharedMemory
![Page 8: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/8.jpg)
System Model – Thread Blocks
Global Memory
Multiprocessor 0 Multiprocessor 1
SharedMemory
SharedMemory
Thread Block 0
Thread Block 1
Thread Block x
Thread Block y
![Page 9: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/9.jpg)
Minimal, manual cache 32-word SIMD instruction Coalesced memory access No block synchronization Expensive synchronization primitives
Challenges
![Page 10: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/10.jpg)
System Model The Algorithm Experiments
The Algorithm
5 8 3 2
3 2
2 3
5 8
5 8
![Page 11: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/11.jpg)
Quicksort
5 8 3 2
3 2
2 3
5 8
5 8
![Page 12: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/12.jpg)
Quicksort
5 8 3 2
3 2
2 3
5 8
5 8
Data Parallel!
![Page 13: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/13.jpg)
Quicksort
5 8 3 2
3 2
2 3
5 8
5 8
NOTData
Parallel!
![Page 14: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/14.jpg)
Quicksort
5 8 3 2
3 2
2 3
5 8
5 8
Phase One• Sequences divided•Block synchronization
on the CPU
Thread Block 1 Thread Block 2
![Page 15: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/15.jpg)
Quicksort
5 8 3 2
3 2
2 3
5 8
5 8
Phase Two• Each block is
assigned own sequence
• Run entirely on GPU
Thread Block 1 Thread Block 2
![Page 16: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/16.jpg)
Quicksort – Phase One
2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8
![Page 17: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/17.jpg)
Quicksort – Phase One
Assign Thread Blocks
2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8
![Page 18: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/18.jpg)
Quicksort – Phase One
Uses an auxiliary array In-place too expensive
2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8
![Page 19: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/19.jpg)
Fetch-And-Addon
LeftPointer
Quicksort – Synchronization2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8
LeftPointer = 0
Thread Block ONE
Thread BlockTWO
1
0
![Page 20: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/20.jpg)
Fetch-And-Addon
LeftPointer
Quicksort – Synchronization2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8
LeftPointer = 2
Thread Block ONE
Thread BlockTWO
2
3
3 2
![Page 21: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/21.jpg)
Fetch-And-Addon
LeftPointer
Quicksort – Synchronization2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8
LeftPointer = 4
Thread Block ONE
Thread BlockTWO
5
4
3 2 2 3
![Page 22: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/22.jpg)
Quicksort – Synchronization2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8
LeftPointer = 10
3 2 2 3 3 0 2 3 97 7 8 78 5 7 6 9 80 2
![Page 23: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/23.jpg)
Quicksort – Synchronization2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8
Three way partition
3 2 2 3 3 0 2 3 94 7 7 8 78 5 7 6 9 84 40 2
![Page 24: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/24.jpg)
Two Pass Partitioning
2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8
<
>
=
>
<< <
>
<
>
<
=
<
>
<
> > >
<
>
=
<
> >
= = =
![Page 25: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/25.jpg)
Recurse on smallest sequence◦ Max depth log(n)
Explicit stack Alternative sorting
◦ Bitonic sort
Second Phase
![Page 26: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/26.jpg)
System Model The Algorithm Experiments
The Algorithm
5 8 3 2
3 2
2 3
5 8
5 8
![Page 27: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/27.jpg)
GPU-Quicksort Cederman and Tsigas, ESA08 GPUSort Govindaraju et. al.,
SIGMOD05 Radix-Merge Harris et. al., GPU Gems 3 ’07 Global radix Sengupta et. al., GH07 Hybrid Sintorn and Assarsson,
GPGPU07
Introsort David Musser, Software: Practice and Experience
Set of Algorithms
![Page 28: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/28.jpg)
Uniform Gaussian Bucket Sorted Zero Stanford
Models
Distributions
![Page 29: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/29.jpg)
Uniform Gaussian Bucket Sorted Zero Stanford
Models
Distributions
![Page 30: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/30.jpg)
Uniform Gaussian Bucket Sorted Zero Stanford
Models
Distributions
![Page 31: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/31.jpg)
Uniform Gaussian Bucket Sorted Zero Stanford
Models
Distributions
![Page 32: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/32.jpg)
Uniform Gaussian Bucket Sorted Zero Stanford
Models
Distributions
![Page 33: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/33.jpg)
Uniform Gaussian Bucket Sorted Zero Stanford
Models
Distributions
P
x1x2
x3
x4
![Page 34: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/34.jpg)
First Phase◦ Average of maximum and minimum element
Second Phase◦ Median of first, middle and last element
Pivot selection
![Page 35: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/35.jpg)
8800GTX◦ 16 multiprocessors (128 cores)◦ 86.4 GB/s bandwidth
8600GTS◦ 4 multiprocessors (32 cores)◦ 32 GB/s bandwidth
CPU-Reference◦ AMD Dual-Core Opteron 265 / 1.8 GHz
Hardware
![Page 36: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/36.jpg)
8800GTX – Uniform Distribution
1M 2M 4M 8M 16M0
500
1000
1500
2000
2500
3000
3500
4000
4500
GPU-QuicksortGlobal RadixGPUSortRadix-MergeSTL
Size
Tim
e (m
s)
![Page 37: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/37.jpg)
8800GTX – Uniform Distribution
1M 2M 4M 8M 16M0
500
1000
1500
2000
2500
3000
3500
4000
4500
GPU-QuicksortGlobal RadixGPUSortRadix-MergeSTL
Size
Tim
e (m
s)
![Page 38: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/38.jpg)
8800GTX – Uniform – 16M
GPU-Quicksort
Global Radix GPUSort Radix-Merge STL0
500
1000
1500
2000
2500
3000
3500
4000
4500
Tim
e (m
s)
![Page 39: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/39.jpg)
8800GTX – Uniform – 16M
GPU-Quicksort
Global Radix GPUSort Radix-Merge STL0
500
1000
1500
2000
2500
3000
3500
4000
4500
Tim
e (m
s)
CPU-Reference~4s
![Page 40: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/40.jpg)
8800GTX – Uniform – 16M
GPU-Quicksort
Global Radix GPUSort Radix-Merge STL0
500
1000
1500
2000
2500
3000
3500
4000
4500
Tim
e (m
s)
GPUSort andRadix-Merge
~1.5s
![Page 41: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/41.jpg)
8800GTX – Uniform – 16M
GPU-Quicksort
Global Radix GPUSort Radix-Merge STL0
500
1000
1500
2000
2500
3000
3500
4000
4500
Tim
e (m
s)
GPU-Quicksort~380ms
Global Radix~510ms
![Page 42: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/42.jpg)
8800GTX – Sorted Distribution
1M 2M 4M 8M0
100
200
300
400
500
600
700
800
GPU-QuicksortGlobal RadixGPUSortRadix-MergeSTL
Size
Tim
e (m
s)
![Page 43: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/43.jpg)
8800GTX – Sorted Distribution
1M 2M 4M 8M0
100
200
300
400
500
600
700
800
GPU-QuicksortGlobal RadixGPUSortRadix-MergeSTL
Size
Tim
e (m
s)
![Page 44: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/44.jpg)
8800GTX – Sorted – 8M
GPU-Quicksort
Global Radix GPUSort Radix-Merge STL0
100
200
300
400
500
600
700
800
Tim
e (m
s)
![Page 45: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/45.jpg)
8800GTX – Sorted – 8M
GPU-Quicksort
Global Radix GPUSort Radix-Merge STL0
100
200
300
400
500
600
700
800
Tim
e (m
s)
Reference is faster than
GPUSort and Radix-Merge
![Page 46: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/46.jpg)
8800GTX – Sorted – 8M
GPU-Quicksort
Global Radix GPUSort Radix-Merge STL0
100
200
300
400
500
600
700
800
Tim
e (m
s)
GPU-Quicksorttakes less than
200ms
![Page 47: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/47.jpg)
8800GTX – Zero Distribution
1M 2M 4M 8M 16M0
200
400
600
800
1000
1200
1400
1600
1800
GPU-QuicksortGlobal RadixGPUSortRadix-MergeSTL
Size
Tim
e (m
s)
![Page 48: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/48.jpg)
8800GTX – Zero Distribution
1M 2M 4M 8M 16M0
200
400
600
800
1000
1200
1400
1600
1800
GPU-QuicksortGlobal RadixGPUSortRadix-MergeSTL
Size
Tim
e (m
s)
![Page 49: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/49.jpg)
8800GTX – Zero – 16M
GPU-Quicksort
Global Radix GPUSort Radix-Merge STL0
200
400
600
800
1000
1200
1400
1600
1800
Tim
e (m
s)
![Page 50: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/50.jpg)
8800GTX – Zero – 16M
GPU-Quicksort
Global Radix GPUSort Radix-Merge STL0
200
400
600
800
1000
1200
1400
1600
1800
Tim
e (m
s)
Bitonic is unaffected
by distribution
![Page 51: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/51.jpg)
8800GTX – Zero – 16M
GPU-Quicksort
Global Radix GPUSort Radix-Merge STL0
200
400
600
800
1000
1200
1400
1600
1800
Tim
e (m
s)
Three-way partition
![Page 52: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/52.jpg)
8600GTS – Uniform Distribution
1M 2M 4M 8M0
500
1000
1500
2000
2500
GPU-QuicksortGlobal RadixGPUSortRadix-MergeSTLHybrid
Size
Tim
e (m
s)
![Page 53: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/53.jpg)
8600GTS – Uniform Distribution
1M 2M 4M 8M0
500
1000
1500
2000
2500
GPU-QuicksortGlobal RadixGPUSortRadix-MergeSTLHybrid
Size
Tim
e (m
s)
![Page 54: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/54.jpg)
8600GTS – Uniform – 8M
GPU-Quicksort
Global Radix GPUSort Radix-Merge STL Hybrid0
500
1000
1500
2000
2500
Tim
e (m
s)
![Page 55: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/55.jpg)
8600GTS – Uniform – 8M
GPU-Quicksort
Global Radix GPUSort Radix-Merge STL Hybrid0
500
1000
1500
2000
2500
Tim
e (m
s)
Reference equal to
Radix-Merge
![Page 56: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/56.jpg)
8600GTS – Uniform – 8M
GPU-Quicksort
Global Radix GPUSort Radix-Merge STL Hybrid0
500
1000
1500
2000
2500
Tim
e (m
s)
Quicksort and hybrid ~4 times
faster
![Page 57: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/57.jpg)
8600GTS – Sorted Distribution
1M 2M 4M 8M0
500
1000
1500
2000
2500
3000
3500
4000
4500
GPU-QuicksortGlobal RadixGPUSortRadix-MergeSTLHybridHybrid (Random)
Size
Tim
e (m
s)
![Page 58: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/58.jpg)
8600GTS – Sorted Distribution
1M 2M 4M 8M0
500
1000
1500
2000
2500
3000
3500
4000
4500
GPU-QuicksortGlobal RadixGPUSortRadix-MergeSTLHybridHybrid (Random)
Size
Tim
e (m
s)
![Page 59: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/59.jpg)
8600GTS – Sorted – 8M
GPU-Quicksort
Global Radix GPUSort Radix-Merge STL Hybrid Hybrid (Random)
0
500
1000
1500
2000
2500
3000
3500
4000
4500
Tim
e (m
s)
![Page 60: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/60.jpg)
8600GTS – Sorted – 8M
GPU-Quicksort
Global Radix GPUSort Radix-Merge STL Hybrid Hybrid (Random)
0
500
1000
1500
2000
2500
3000
3500
4000
4500
Tim
e (m
s)
Hybrid needs randomization
![Page 61: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/61.jpg)
8600GTS – Sorted – 8M
GPU-Quicksort
Global Radix GPUSort Radix-Merge STL Hybrid Hybrid (Random)
0
500
1000
1500
2000
2500
3000
3500
4000
4500
Tim
e (m
s)
Reference better
![Page 62: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/62.jpg)
Manuscript (2.2M) Dragon (3.6M) Statuette (5M)0
100200300400500600700800900
1000
GPU-Quicksort Global Radix GPUSortRadix/Merge STL
Tim
e (m
s)Visibility Ordering – 8800GTX
![Page 63: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/63.jpg)
Manuscript (2.2M) Dragon (3.6M) Statuette (5M)0
100200300400500600700800900
1000
GPU-Quicksort Global Radix GPUSortRadix/Merge STL
Tim
e (m
s)Visibility Ordering – 8800GTX
Similar to uniform
distribution
![Page 64: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/64.jpg)
Manuscript (2.2M) Dragon (3.6M) Statuette (5M)0
100200300400500600700800900
1000
GPU-Quicksort Global Radix GPUSortRadix/Merge STL
Tim
e (m
s)Visibility Ordering – 8800GTX
Sequence needs to be a power of
two in length
![Page 65: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/65.jpg)
Minimal, manual cache◦ Used only for prefix sum and bitonic
32-word SIMD instruction◦ Main part executes same instructions
Coalesced memory access◦ All reads coalesced
No block synchronization◦ Only required in first phase
Expensive synchronization primitives◦ Two passes amortizes cost
Conclusions
![Page 66: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/66.jpg)
Quicksort is a viable sorting method for graphics processors and can be implemented in a data parallel way
It is competitive
Conclusions
![Page 67: Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.](https://reader030.fdocuments.net/reader030/viewer/2022020720/551c02bd5503469e4f8b4d03/html5/thumbnails/67.jpg)
Thank you!
www.cs.chalmers.se/~cederman