In-place Super Scalar Samplesort - KITalgo2.iti.kit.edu/sanders/courses/algen17/IPSSSSo.pdf ·...

Post on 03-Mar-2021

4 views 0 download

Transcript of In-place Super Scalar Samplesort - KITalgo2.iti.kit.edu/sanders/courses/algen17/IPSSSSo.pdf ·...

In-place (Parallel) Super Scalar Samplesort

Algorithm Engineering Lecture · June 13, 2017Michael Axtmann, Sascha Witt, Daniel Ferizovic, Peter Sanders

KIT – University of the State of Baden-Wuerttemberg andNational Laboratory of the Helmholtz Association

Institute of Theoretical Informatics · Algorithmics Group

www.kit.edu

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

Overview

Quicksort BlockQuicksort SSSS ISSSS

Decoupled control anddata flow

no yes yes yes

Conditional branches yes no no noData transfers / element ≈ log2 n ≈ log2 n ≈ logk n ≈ logk nAdditional space O(1) O(b) O(n) O(kb)Parallelization yes no no yes

1

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

BlockQuicksort

GoalsPartially decoupling control flow from data flowAvoid conditional branchesIn-place: O (b) additional space

2

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

BlockQuicksort

GoalsPartially decoupling control flow from data flowAvoid conditional branchesIn-place: O (b) additional space

b Elements

2

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

BlockQuicksort

GoalsPartially decoupling control flow from data flowAvoid conditional branchesIn-place: O (b) additional space

Pivot b Elements

2

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

BlockQuicksort

GoalsPartially decoupling control flow from data flowAvoid conditional branchesIn-place: O (b) additional space

Pivot b Elements

2

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

BlockQuicksort

GoalsPartially decoupling control flow from data flowAvoid conditional branchesIn-place: O (b) additional space

Pivot b Elements

2

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

BlockQuicksort

GoalsPartially decoupling control flow from data flowAvoid conditional branchesIn-place: O (b) additional space

Pivot b Elements

2

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

BlockQuicksort

GoalsPartially decoupling control flow from data flowAvoid conditional branchesIn-place: O (b) additional space

Pivot b Elements

2

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

BlockQuicksort

GoalsPartially decoupling control flow from data flowAvoid conditional branchesIn-place: O (b) additional space

Pivot

DrawbacksO(

nb log2

nn0

)block transfers

b Elements

2

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Super Scalar Samplesort

GoalsPartially decoupling control flow from data flowAvoid conditional branchesk-way distributionCache/IO-efficient

O(

ntb logk

nn0

)block transfers

In-place: O (kb) additional spaceEasy to parallelize

3

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Super Scalar Samplesort

Local classification

Block permutation

Input

4

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Super Scalar Samplesort

Local classification

Block permutation

Cleanup

Input

4

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Super Scalar Samplesort

Flush ?

Local classificationk-way decision tree without branchesk buffer blocks of size B

Flush buffer block if charged

k buffer blocks

5

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Super Scalar Samplesort

Block permutationInvariant

Permuted blocks [bi , wi )Unpermuted blocks [wi , ri ]Empty Blocks (ri , bi+1)

Two buffers to swap blocks

6

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Super Scalar Samplesort

Block permutationInvariant

Permuted blocks [bi , wi )Unpermuted blocks [wi , ri ]Empty Blocks (ri , bi+1)

Two buffers to swap blocks

6

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Super Scalar Samplesort

Block permutationInvariant

Permuted blocks [bi , wi )Unpermuted blocks [wi , ri ]Empty Blocks (ri , bi+1)

Two buffers to swap blocks

Swap buffers

b1 b2 b3 b4 b5w2 w3 w4w1 r1 r2 r3 r4

6

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Super Scalar Samplesort

Block permutationInvariant

Permuted blocks [bi , wi )Unpermuted blocks [wi , ri ]Empty Blocks (ri , bi+1)

Two buffers to swap blocks

Swap buffers

b1 b2 b3 b4 b5w1 w2 w3 w4r1 r2 r3 r4

6

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Super Scalar Samplesort

Block permutationInvariant

Permuted blocks [bi , wi )Unpermuted blocks [wi , ri ]Empty Blocks (ri , bi+1)

Two buffers to swap blocks

Swap buffers

b1 b2 b3 b4 b5w1 w2 w3 w4r1 r2 r3 r4

6

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Super Scalar Samplesort

Block permutationInvariant

Permuted blocks [bi , wi )Unpermuted blocks [wi , ri ]Empty Blocks (ri , bi+1)

Two buffers to swap blocks

Swap buffers

b1 b2 b3 b4 b5w1 w2 w3 w4r1 r2 r3 r4

6

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Super Scalar Samplesort

Block permutationInvariant

Permuted blocks [bi , wi )Unpermuted blocks [wi , ri ]Empty Blocks (ri , bi+1)

Two buffers to swap blocks

Swap buffers

b1 b2 b3 b4 b5w1 w2 w3 w4r1 r2 r3 r4

6

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Super Scalar Samplesort

Block permutationInvariant

Permuted blocks [bi , wi )Unpermuted blocks [wi , ri ]Empty Blocks (ri , bi+1)

Two buffers to swap blocks

Swap buffers

b1 b2 b3 b4 b5w1 w2 w3 w4r1 r2 r3 r4

6

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Super Scalar Samplesort

Block permutationInvariant

Permuted blocks [bi , wi )Unpermuted blocks [wi , ri ]Empty Blocks (ri , bi+1)

Two buffers to swap blocks

Swap buffers

b1 b2 b3 b4 b5w1 w2 w3 w4r1 r2 r3 r4

6

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Super Scalar Samplesort

Block permutationInvariant

Permuted blocks [bi , wi )Unpermuted blocks [wi , ri ]Empty Blocks (ri , bi+1)

Two buffers to swap blocks

b1 b2 b3 b5w1 w2 w3r1 r2 r3

Swap buffers

b4 w4r4

6

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Super Scalar Samplesort

Block permutationInvariant

Permuted blocks [bi , wi )Unpermuted blocks [wi , ri ]Empty Blocks (ri , bi+1)

Two buffers to swap blocks

b1 b2 b3 b4 b5w2 w3 w4w1 r1 r2 r3 r4

Swap buffers

6

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Super Scalar Samplesort

Block permutationInvariant

Permuted blocks [bi , wi )Unpermuted blocks [wi , ri ]Empty Blocks (ri , bi+1)

Two buffers to swap blocks

b1 b2 b5w1 r1 r2

Swap buffers

b3w2 r3 b4w3 w4r4

6

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Super Scalar Samplesort

Block permutationInvariant

Permuted blocks [bi , wi )Unpermuted blocks [wi , ri ]Empty Blocks (ri , bi+1)

Two buffers to swap blocks

b1 b2 b5w1 r1 r2

Swap buffers

b3w2 r3 b4w3 w4r4

6

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Super Scalar Samplesort

Block permutationInvariant

Permuted blocks [bi , wi )Unpermuted blocks [wi , ri ]Empty Blocks (ri , bi+1)

Two buffers to swap blocks

b1 b2 b5w1 r1 r2

Swap buffers

b3w2 r3 b4w3 w4r4

6

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Super Scalar Samplesort

Block permutationInvariant

Permuted blocks [bi , wi )Unpermuted blocks [wi , ri ]Empty Blocks (ri , bi+1)

Two buffers to swap blocks

b1 b2 b5w1 r1 r2

Swap buffers

b3w2 r3 b4w3 w4r4

6

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Super Scalar Samplesort

Block permutationInvariant

Permuted blocks [bi , wi )Unpermuted blocks [wi , ri ]Empty Blocks (ri , bi+1)

Two buffers to swap blocks

b1 b2 b5w1r1 r2

Swap buffers

b3w2 r3 b4w3 w4r4

6

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Super Scalar Samplesort

Block permutationInvariant

Permuted blocks [bi , wi )Unpermuted blocks [wi , ri ]Empty Blocks (ri , bi+1)

Two buffers to swap blocks

b1 b2 b5w1r1 r2

Swap buffers

b3w2 r3 b4w3 w4r4

6

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Super Scalar Samplesort

Block permutationInvariant

Permuted blocks [bi , wi )Unpermuted blocks [wi , ri ]Empty Blocks (ri , bi+1)

Two buffers to swap blocks

b1 b2 b5w1

w4r1 r2

Swap buffers

b3w2 r3 b4w3 r4

6

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Super Scalar Samplesort

Block permutationInvariant

Permuted blocks [bi , wi )Unpermuted blocks [wi , ri ]Empty Blocks (ri , bi+1)

Two buffers to swap blocks

b1 b2 b5w1

w4r1 r2

Swap buffers

b3w2 r3 b4w3 r4

6

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Super Scalar Samplesort

DiscussionWhy blocks

Just n + n/b classifications per levelReduced TLB misses

Local classification: buffers on same pageWrite and read whole blocks

Better prefetching and less cache misses

7

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Super Scalar Samplesort

28 211 214 217 220 223 226 2290

2

4

6

Item count n

Runn

ingtim

e/

nlo

g 2n[µ

s]

IS4o BlockQ std-sorts3-sort DualPivot

Sequential on two Xeon E5-2683 v4 16-core processors – Uniform Input

8

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Parallel Super Scalar Samplesort

Shared-memory parallelism with t threadsLocal classification: divide input into stripes – one for each threadBlock permutation: fetch blocks atomically

atomic read pointers and end pointersAccess with fetch-and-add operations

Blocks of size Ω(t) to avoid starvationBuffers for each threadCall sequential subroutines in parallel if n ≤ ninit/t

9

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Parallel Super Scalar Samplesort

215 218 221 224 227 2300

0.2

0.4

0.6

One of two Intel Xeon E5-2683 v4 16-core processors – Uniform InputItem count n

Runn

ingtim

e/

nlo

g 2n[µ

s]IPS4o MCSTLubq MCSTLbqPBBS MCSTLmwm TBB

10

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Parallel Super Scalar Samplesort

215 218 221 224 227 2300

0.2

0.4

0.6

Two Intel Xeon E5-2683 v4 16-core processors – Uniform InputItem count n

Runn

ingtim

e/

nlo

g 2n[µ

s]IPS4o MCSTLubq MCSTLbqPBBS MCSTLmwm TBB

11

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Parallel Super Scalar Samplesort

215 218 221 224 227 2300

0.2

0.4

0.6

Two Intel Xeon E5-2683 v4 16-core processors – Exponential InputItem count n

Runn

ingtim

e/

nlo

g 2n[µ

s]IPS4o MCSTLubq MCSTLbqPBBS MCSTLmwm TBB

12

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Parallel Super Scalar Samplesort

215 218 221 224 227 2300

0.2

0.4

0.6

Two Intel Xeon E5-2683 v4 16-core processors – RootDup InputItem count n

Runn

ingtim

e/

nlo

g 2n[µ

s]IPS4o MCSTLubq MCSTLbqPBBS MCSTLmwm TBB

13

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Parallel Super Scalar Samplesort

215 218 221 224 227 2300

0.2

0.4

0.6

Two Intel Xeon E5-2683 v4 16-core processors – TwoDup InputItem count n

Runn

ingtim

e/

nlo

g 2n[µ

s]IPS4o MCSTLubq MCSTLbqPBBS MCSTLmwm TBB

14

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

In-place Parallel Super Scalar Samplesort

215 218 221 224 227 2300

0.2

0.4

0.6

Two Intel Xeon E5-2683 v4 16-core processors – AlmostSorted InputItem count n

Runn

ingtim

e/

nlo

g 2n[µ

s]IPS4o MCSTLubq MCSTLbqPBBS MCSTLmwm TBB

15

M. Axtmann, S. Witt, D. Ferizovic, P. Sanders – In-place Super Scalar Samplesort Institute of Theoretical Informat-icsAlgorithmics Group

BlockQuicksort

GoalsPartially decoupling control flow from data flowAvoid conditional branchesIn-place

16