
Transcript of "Dataset Skew Resilience in Main Memory Parallel Hash Joins"


Examples of Dataset Skew

▪ Joins on non-primary key columns (duplicate keys)
▪ Parallel multiway joins
▪ Compound queries (multiple joins and aggregations)
▪ Many examples in real-world data
▪ The size of cities and the length and frequency of words can be modeled with Zipfian distributions
▪ Measurement errors and IQ scores often follow Gaussian distributions
▪ Dataset skew can occur in attributes as well as in the correlation between relations
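To illustrate how such skew shows up in join keys, here is a minimal stdlib-Python sketch (not from the poster; all names are illustrative) that samples Zipf-like keys by weighting rank r proportionally to 1/r^s. The heaviest key ends up in a large fraction of the tuples, which is exactly what overloads hash buckets during a join.

```python
import random
from collections import Counter

def zipf_keys(n, num_keys, s=1.0, seed=42):
    """Sample n join keys from a Zipf-like distribution:
    rank r is drawn with weight proportional to 1 / r**s."""
    rng = random.Random(seed)
    ranks = list(range(1, num_keys + 1))
    weights = [1.0 / r ** s for r in ranks]
    return rng.choices(ranks, weights=weights, k=n)

keys = zipf_keys(n=100_000, num_keys=1_000, s=1.0)
counts = Counter(keys)
# Rank-1 key dominates the sample, so its hash bucket (or chain)
# receives far more entries than any other during the build phase.
top_key, top_count = counts.most_common(1)[0]
```

With s = 1.0 and 1,000 distinct keys, the top key alone accounts for roughly 13% of all tuples, so a single chain absorbs a disproportionate share of probes.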

Experiments

▪ We generated large datasets (hundreds of millions of records) that vary in terms of distribution, correlation, shuffling, and skew on the build and probe tables
▪ We measured hash join execution time with three different hash table implementations
▪ Separate Chaining hash table (SC) is the original baseline
▪ Modified Separate Chaining hash table (MSC)
▪ Custom Cuckoo Hashing implementation (CCH)
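The poster does not detail the CCH design, but a generic cuckoo-hashing sketch shows the core idea such a table builds on: two arrays and two hash functions, with displaced entries kicked to their alternate slot, so a lookup probes at most two positions regardless of skew. This is a hedged illustration under assumed names, not the authors' implementation (a real join table would also need to handle duplicate keys, e.g. with per-slot buckets, which is omitted here).

```python
class CuckooHash:
    """Minimal cuckoo hash table: two tables, two hash functions.
    Inserts evict existing entries to their alternate slot; lookups
    probe at most two slots, bounding worst-case probe cost."""

    def __init__(self, capacity=64, max_kicks=32):
        self.cap = capacity
        self.max_kicks = max_kicks
        self.t1 = [None] * capacity
        self.t2 = [None] * capacity

    def _h1(self, key):
        return hash(key) % self.cap

    def _h2(self, key):
        # Second, independent-ish hash (toy choice for the sketch).
        return hash(("salt", key)) % self.cap

    def insert(self, key, value):
        item = (key, value)
        for _ in range(self.max_kicks):
            i = self._h1(item[0])
            if self.t1[i] is None:
                self.t1[i] = item
                return True
            item, self.t1[i] = self.t1[i], item  # evict occupant of t1
            j = self._h2(item[0])
            if self.t2[j] is None:
                self.t2[j] = item
                return True
            item, self.t2[j] = self.t2[j], item  # evict occupant of t2
        return False  # a full implementation would rehash/grow here

    def get(self, key):
        for slot in (self.t1[self._h1(key)], self.t2[self._h2(key)]):
            if slot is not None and slot[0] == key:
                return slot[1]
        return None
```

The two-probe bound is what makes cuckoo hashing attractive under skew: unlike separate chaining, a hot key cannot lengthen the lookup path.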

The rapid rise of big data and data analytics is driving the need for greater algorithmic efficiency. Parallel main memory hash joins are prescribed to accelerate database joins. However, the performance of these joins can be hindered by dataset skew. We tested four popular hash join algorithms on an extensive array of datasets. Our results demonstrate that hash joins are acutely affected by dataset skew, and performance is further hampered if the data is unordered. To address these issues, we propose two different hash tables. First, we use a separate chaining hash table based on an existing implementation that we have modified. This version outperforms the original implementation on skewed datasets by up to three orders of magnitude. Second, we propose a novel hash table that can further improve performance by up to 17.3x.

Dataset Skew Resilience in Main Memory Parallel Hash Joins
Puya Memarzia, Virendra Bhavsar, and Suprio Ray
Faculty of Computer Science, University of New Brunswick, Fredericton, New Brunswick, Canada

References:
[1] Zipf distribution CMF plot, Wikimedia Commons, CC BY-SA 3.0: https://commons.wikimedia.org/wiki/File:Zipf_distribution_CMF.png#/media/File:Zipf_distribution_CMF.png
[2] Inductiveload (self-made with Mathematica and Inkscape), public domain, via Wikimedia Commons: https://commons.wikimedia.org/w/index.php?curid=3817960

(a) Zipfian distribution [1]

Motivation

In-memory hash joins on datasets that are skewed and/or shuffled are significantly slower than on ordered non-skewed datasets. The performance penalty can be several orders of magnitude.

▪ Throwing more hardware at the problem is not always an option.
▪ The design and implementation of the hash table is one of the main bottlenecks.
▪ Join time deteriorates due to increased memory lookups, lock contention, and CPU cache and TLB misses.

Hash Join Configurations

Hash joins can leverage data parallelism to scale up on multicore processors. Hash join configurations can be categorized based on their partitioning scheme (or lack thereof) and parallel implementation.

1. No Partitioning (Nopart): one large lock-protected hash table shared by all threads
2. Shared Partitioning (Part-Share): multiple lock-protected partitions shared by all threads
3. Independent Partitioning (Part-Indep): private lock-less partitions for each thread
4. Radix Partitioning (Part-Radix): a hierarchy of partitions with dynamic load balancing
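As a rough sketch of the Independent Partitioning (Part-Indep) configuration, the following Python toy partitions both relations by hashing the join key, then lets each worker build a private hash table and probe it with no locks. Names and data are illustrative, not the benchmarked implementation.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def partition(rel, key_of, num_parts):
    """Split a relation into partitions by hashing the join key,
    so matching tuples from both relations land in the same partition."""
    parts = [[] for _ in range(num_parts)]
    for row in rel:
        parts[hash(key_of(row)) % num_parts].append(row)
    return parts

def join_partition(build_part, probe_part):
    """Each worker owns a private table (lock-less, per Part-Indep)."""
    table = defaultdict(list)            # separate chaining: list per key
    for key, payload in build_part:      # build phase
        table[key].append(payload)
    out = []
    for key, payload in probe_part:      # probe phase
        for match in table[key]:
            out.append((payload, match))
    return out

def part_indep_join(build, probe, num_parts=4):
    b_parts = partition(build, lambda r: r[0], num_parts)
    p_parts = partition(probe, lambda r: r[0], num_parts)
    with ThreadPoolExecutor(max_workers=num_parts) as pool:
        results = pool.map(join_partition, b_parts, p_parts)
    return [row for part in results for row in part]

build = [(k, f"b{k}") for k in range(8)]
probe = [(0, "p0"), (1, "p1"), (0, "p2"), (2, "p3"), (7, "p4")]
result = part_indep_join(build, probe)
```

Note the trade-off the four configurations navigate: private partitions avoid lock contention, but a skewed key concentrates work in one partition, which is what the radix scheme's dynamic load balancing addresses.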

Figure 2. The performance impact of dataset skew and shuffling on four hash join algorithms (note that the y-axis is log scale)
[Bar chart. X-axis: Hash Join Configuration (Nopart, Part-Share, Part-Indep, Part-Radix). Y-axis: Runtime (CPU Cycles), log10 scale from 1 to 10^14. Series: Non-skewed Ordered, Non-skewed Shuffled, Skewed Shuffled.]

Figure 1. Example of a hash join between two corresponding tables or partitions from each table – using a separate chaining hash table

Figure 4. Original (SC) versus modified hash table (MSC) on skewed dataset

(b) Gaussian distribution [2]

Figure 3. Cumulative distribution function of data distributions that mimic some skewed datasets

[Bar chart for Figure 5. X-axis: CPU Architecture (Intel Skylake, Intel Harpertown, AMD K8). Y-axis: Runtime (billions of CPU Cycles), 0 to 160. Series: Nopart_CCH, Nopart_MSC, Part-Share_MSC, Part-Indep_MSC, Part-Radix_MSC.]

[Bar chart for Figure 4. X-axis: Hash Join Configuration (Nopart, Part-Share, Part-Indep, Part-Radix). Y-axis: Runtime (CPU Cycles), log10 scale from 1 to 10^14. Series: SC, MSC.]

Conclusion

• Hash table skew resilience is an important feature to consider when designing parallel hash join implementations
• Our approach improves hash join times across a variety of datasets and architectures, allowing for more work with the same resources
• The dataset and CPU can help determine the most efficient method
• Applications such as machine learning, analytics, graph processing, and data warehousing could benefit from skew-resilient algorithms

[Figure 1 diagram contents:

Build Relation dept_part0      Probe Relation emp_part0
desc       deptno              ename    deptno
Marketing  10                  Gary     10
IT         20                  Dulley   10
HR         30                  Reiter   30
Financial  40                  Taylor   20
                               Prevatt  30

Build tuples pass through the Hash Function into the Hash Table, whose chained entries include Financial 40, IT 20, HR 30, and Marketing 10 -> NULL. Probe tuples pass through the same Hash Function, and matches are emitted to the Output (e.g., Gary, Marketing and Dulley, Marketing). An "Original Data" store is marked optional.]
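Figure 1's build/probe flow can be traced with a small separate-chaining sketch over the same example tuples. The 4-bucket modulo hash is an illustrative stand-in for the real hash function; everything else follows the figure.

```python
# Separate chaining: a list of buckets, each bucket a chain of entries.
NUM_BUCKETS = 4

def bucket(key):
    return key % NUM_BUCKETS  # toy hash function for the example

dept_part0 = [("Marketing", 10), ("IT", 20), ("HR", 30), ("Financial", 40)]
emp_part0 = [("Gary", 10), ("Dulley", 10), ("Reiter", 30),
             ("Taylor", 20), ("Prevatt", 30)]

# Build phase: hash each build tuple's join key into a chain.
table = [[] for _ in range(NUM_BUCKETS)]
for desc, deptno in dept_part0:
    table[bucket(deptno)].append((deptno, desc))

# Probe phase: hash each probe key, walk the chain, emit matches.
output = []
for ename, deptno in emp_part0:
    for key, desc in table[bucket(deptno)]:
        if key == deptno:
            output.append((ename, desc))
```

On skewed data many probe keys hash to the same chain (here, Gary and Dulley both walk bucket 2), so chain length, not table size, dominates probe cost: the effect Figures 2 and 4 quantify.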

Figure 6. Breakdown of hash join phases on Zipf-skewed datasets: (a) Ordered dataset, (b) Shuffled dataset
[Two bar charts. X-axis: Configuration. Y-axis: Cycles per output tuple (one panel ranges 0 to 400, the other 0 to 250). Series: Part, Build, Probe.]

Figure 5. Evaluation of three different CPU architectures on shuffled dataset
