SIMD- and Cache-Friendly Algorithm · k-way merge k-way merge k-way merge k-way merge k-way merge...

1
0 10 20 30 40 50 60 70 80 90 100 110 Our approach Key-index approach Direct approach Radix sort STL stable_sort STL sort (unstable) execution time (sec) with SIMD without SIMD multiway mergesort faster Hiroshi Inoue ([email protected]) † ‡ Kenjiro Taura IBM Research Tokyo University of Tokyo SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures VLDB 2015 at Hawai‘i, USA SIMD-based multiway mergesort has been used for in-memory sorting of integers We extend this for sorting structures (key + payload) Existing approaches Key-index approach: 1. encode key and index for each record into an integer, 2. sort the key-index pairs with SIMD, and 3. rearrange the records based on the sorted key-index pairs SIMD friendly but NOT cache friendly Costly due to random accesses for memory Direct approach: 1. sort records directly without encoding into an integer Cache friendly but NOT SIMD friendly Inefficient with SIMD due to gather for keys Key idea: to execute rearranging of records more frequently , e.g. once per m merge stages (m > 1), instead of only once at the last in key-index approach Benefits Cache friendly: the rearrange operation reads from k = 2 m input streams and write to one output stream; hence the memory accesses are sequential unless m is too large SIMD friendly: most of the merge operations are done for integers; reading keys from records, which is costly with SIMD, only once per m stages Implementation in multiway merge We integrate key encoding and rearrange into multiway merge operation multiway merge, which merges k (k > 2) input streams into one output stream, is a common technique to reduce memory bandwidth in mergesort Steps of our multiway merge operation 1. at the first stage, read records from system memory and encode key and streamID into an integer 2. merge integer values using SIMD 3. at the last stage, rearrange records based on the encoded streamID Performance results Sorting 512M records of 16 byte Sorting 16M records of various record sizes processor’s cache memory system memory input streams rearrange records stream 0 output stream 2-way merge 2-way merge 2-way merge encode {key, streamID} pair into an integer value 2-way merge 2-way merge 2-way merge 2-way merge stream 1 stream 2 stream 3 stream 4 stream 5 stream 6 stream 7 8-way merge operation for structures move structures move integer values SIMD merge operation for integers stage 1 stage 2 stage 3 rearrange without random accesses unsorted input sorted output Direct approach Key-index approach encode rearrange Our approach (m = 3) unsorted input sorted output unsorted input sorted output encode rearrange encode rearrange merge operation for structures cache unfriendly random accesses SIMD unfriendly merge operation for integers cache friendly sequential accesses cache friendly sequential accesses SIMD friendly SIMD friendly Introduction struct Record { int32_t key; int32_t dataX; int32_t dataY; ... }; struct Record array[N]; payload key payload key payload record i +1 k i i k i+1 i +1 Our approach record i 64-bit integer encode System 2.9-GHz Xeon (SandyBridge) / RHEL 6.4 / gcc-4.8 / using SSE (128-bit SIMD) 0 1 2 3 4 5 6 7 8 16 24 32 48 64 96 128 192 256 executin time (sec) record size (byte) Our approach Key-index approach Direct approach Radix sort STL stable_sort STL sort (unstable) cache line size faster sorted unsorted input sorted output k -way merge k -way merge k -way merge k -way merge k -way merge Step 1: divide into small blocks that can fir in (L2) cache Step 2: sort each block by vectorized combsort Step 3: repeatedly execute multiway merge to merge all blocks Optimizations and overall scheme Optimizations: exploiting 4-wide SIMD by encoding a {key, id} pair into 32-bit integer we encode streamID (up to k) accompanied with its key into an intermediate integer; the streamID is much smaller than the index (up to N) we can use more bits for the key we use a 32-bit integer instead of a 64-bit integer to encode (a part of ) key and streamID to use higher data parallelism when the number of elements to merge is smaller than a threshold if we use only a part of keys for merging, we check the order by using the entire key when we rearrange records (without using SIMD) vectorized combsort for initial sorting because mergesort is not efficient for small amount of data, we switch to vectorized combsort if a block to sort is small enough to fit into L2 cache combsort is efficient with SIMD but shows very poor memory access locality good for initial sorting of small blocks Overall sorting scheme: 0 5 10 15 20 25 30 35 40 45 0 4 8 12 16 relative performane over STL's std::stable_sort on 1 core number of cores Our approach Key-index approach Direct approach Radix sort STL stable_sort STL sort (unstable) faster Scalability with multiple cores 0.01 0.1 1 10 100 1M 2M 4M 8M 16M 32M 64M 128M 256M 512M execution time (sec) # records Our approach Key-index approach Direct approach Radix sort STL stable_sort STL sort (unstable) faster Sorting various numbers of 16-byte records 0 10 20 30 40 50 60 70 80 2 4 8 16 32 64 128 256 512 1k 2k execution time (sec) number of ways faster Effect of number of ways (k)

Transcript of SIMD- and Cache-Friendly Algorithm · k-way merge k-way merge k-way merge k-way merge k-way merge...

Page 1: SIMD- and Cache-Friendly Algorithm · k-way merge k-way merge k-way merge k-way merge k-way merge Step 1: divide into small blocks that can fir in (L2) cache Step 2: sort each block

0

10

20

30

40

50

60

70

80

90

100

110

Our approach

Key-index approach

Direct approach

Radix sort STLstable_sort

STL sort(unstable)

executio

n ti

me (

sec)

with SIMD without SIMD

multiway mergesort

faste

r

Hiroshi Inoue ([email protected])† ‡ Kenjiro Taura‡

†IBM Research – Tokyo ‡University of Tokyo

SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures

VLDB 2015 at Hawai‘i, USA

• SIMD-based multiway mergesort has been used for in-memory sorting of integers

• We extend this for sorting structures (key + payload)

Existing approaches

• Key-index approach:

1. encode key and index for each record into an integer,

2. sort the key-index pairs with SIMD, and

3. rearrange the records based on the sorted key-index pairs

SIMD friendly but NOT cache friendly

Costly due to random accesses for memory

• Direct approach:

1. sort records directly without encoding into an integer

Cache friendly but NOT SIMD friendly

Inefficient with SIMD due to gather for keys

Key idea:

• to execute rearranging of records more frequently, e.g. once per m merge stages (m > 1), instead of only once at the last in key-index approach

Benefits

• Cache friendly: the rearrange operation reads from k = 2m input streams and write to one output stream; hence the memory accesses are sequential unless m is too large

• SIMD friendly: most of the merge operations are done for integers; reading keys from records, which is costly with SIMD, only once per m stages

Implementation in multiway merge

We integrate key encoding and rearrange into multiway merge operation

• multiway merge, which merges k (k > 2) input streams into one output stream, is a common technique to reduce memory bandwidth in mergesort

Steps of our multiway merge operation

1. at the first stage, read records from system memory and encode key and streamID into an integer

2. merge integer values using SIMD

3. at the last stage, rearrange records based on the encoded streamID

Performance results

Sorting 512M records of 16 byte

Sorting 16M records of various record sizes

processor’s cache memory

system memory

input streams

rearrange records

stream0

output stream

2-waymerge

2-waymerge

2-waymerge

encode {key, streamID} pair into an integer value

2-waymerge

2-waymerge

2-waymerge

2-waymerge

stream1

stream2

stream3

stream4

stream5

stream6

stream7

8-way merge operation for structures

move structures move integer values

SIMD merge operation for integers

stage 1

stage 2

stage 3

rearrange without random accesses

unsorted input

sorted output

Direct approachKey-index approach

encode

rearrange

Our approach (m = 3)

unsorted input

sorted output

unsorted input

sorted output

encode

rearrange

encode

rearrange

merge operation

for structures

cache

unfriendly

random

accesses

SIMD

unfriendly

merge operation

for integers

cache

friendly

sequential

accesses

cache

friendly

sequential

accesses

SIMD

friendly

SIMD

friendly

Introduction

struct Record {

int32_t key;

int32_t dataX;

int32_t dataY;...

};

struct Record array[N];

payload

key payload key payload

record i record i +1

ki i ki+1 i+1

Our approach

record i

64-bit integer encode

System

• 2.9-GHz Xeon (SandyBridge) / RHEL 6.4 / gcc-4.8 / using SSE (128-bit SIMD)

0

1

2

3

4

5

6

7

8 16 24 32 48 64 96 128 192 256

executin tim

e (

sec)

record size (byte)

Our approach

Key-index approach

Direct approach

Radix sort

STL stable_sort

STL sort (unstable)

cache line size

faste

r

sorted

unsorted input

sorted output

k-way merge k-way merge k-way merge k-way merge

k-way merge

Step 1: divide into small blocks that can fir in (L2) cache

Step 2: sort each block by vectorized combsort

Step 3: repeatedly execute multiway merge to merge all blocks

Optimizations and overall scheme Optimizations:

exploiting 4-wide SIMD by encoding a {key, id} pair into 32-bit integer

• we encode streamID (up to k) accompanied with its key into an intermediate integer; the streamID is much smaller than the index (up to N) we can use more bits for the key

• we use a 32-bit integer instead of a 64-bit integer to encode (a part of ) key and streamID to use higher data parallelism when the number of elements to merge is smaller than a threshold

• if we use only a part of keys for merging, we check the order by using the entire key when we rearrange records (without using SIMD)

vectorized combsort for initial sorting

• because mergesort is not efficient for small amount of data, we switch to vectorized combsort if a block to sort is small enough to fit into L2 cache

• combsort is efficient with SIMD but shows very poor memory access locality good for initial sorting of small blocks

Overall sorting scheme:

0

5

10

15

20

25

30

35

40

45

0 4 8 12 16re

lative p

erf

orm

ane o

ver S

TL's

std

::sta

ble

_so

rt o

n 1

co

renumber of cores

Our approach

Key-index approach

Direct approach

Radix sort

STL stable_sort

STL sort (unstable)

faste

r

Scalability with multiple cores

0.01

0.1

1

10

100

1M 2M 4M 8M 16M 32M 64M 128M 256M 512M

executio

n tim

e (

sec)

# records

Our approach

Key-index approach

Direct approach

Radix sort

STL stable_sort

STL sort (unstable)

faste

r

Sorting various numbers of 16-byte records

0

10

20

30

40

50

60

70

80

2 4 8

16

32

64

128

256

512

1k

2k

executio

n ti

me (

sec)

number of ways

faste

r

Effect of number of ways (k)