Using BigBench to compare Hive and Spark

13
Using BigBench to compare Hive and Spark Alejandro Montero, Nicolas Poggi 1

Transcript of Using BigBench to compare Hive and Spark

Page 1: Using BigBench to compare Hive and Spark

Using BigBench to compare Hive and Spark

Alejandro Montero, Nicolas Poggi

1

Page 2: Using BigBench to compare Hive and Spark

What is BigBench (TPCx-BB1)?

• Specification-based benchmark with an open-source implementation, proposed as the first Big Data benchmark standard.

• BigBench covers all major Big Data characteristics• Volume -> Scale factor.• Velocity -> Table refresh.• Variety -> Data type disparity.

• Extension of TPC-DS.• Borrows 10 pure QL queries from TCP-DS. • Added BigData tables and use cases: Machine Learning, Natural Language Processing, ...

• Can support multiple implementations, multiple BigData engines and table formats.

• Can execute multiple parallel streams.

• Defines scale factors for data.• Tested: 100 GB.

2[1]: http://www.tpc.org/tpc_documents_current_versions/pdf/tpcx-bb_v1.2.0.pdf

Page 3: Using BigBench to compare Hive and Spark

BigBench – Overview

3

Unstructured DataStructured Data

Semi-Structured Data

Marketprice Items

Sales

Web Page Customers

Reviews

Web Log

Workload:

• 14 Pure QL queries.• 10 Borrowed from TPC-DS.

• 4 Queries with MapReduce pre-processing.• 7 Natural Language Processing Queries.• 5 Machine Learning Queries.

Page 4: Using BigBench to compare Hive and Spark

BigBench v1.2 – Reference Implementation

HDFS

Hive Metastore

MapReduce Tez Spark

Yarn

Hive Spark SQL

Mahout ML Custom Spark MLlibApplication

SQL Engine

Table Metastore

Execution Engine

Filesystem

Benchmarked systems:

• Hive + MapReduce + Mahout• Hive + MapReduce + Spark_MLlib• Hive + Tez + Mahout• Hive + Tez + Spark_MLlib• Spark SQL + Mahout• Spark SQL + Spark_MLlib• Spark 2 SQL + Mahout

Work in progress:

• Hive 2• Spark 2 SQL + Spark_MLlib

Page 5: Using BigBench to compare Hive and Spark

The cluster – HDInsight PaaS

5

Model HDInsight D4v3

# Head nodes 2

# Working nodes 4

# Zookeeper nodes 3

CPU Intel(R) Xeon(R) CPU E5-2673 v38 x 2,4 GHz cores

RAM 28 GB

HDFS Remote

Software HortonWorks Data Platform 2.5

Spark config 1 executor/working node3 cores/executor

Page 6: Using BigBench to compare Hive and Spark

Pure QL

6

6223

16011848

1457

0

1000

2000

3000

4000

5000

6000

7000

Hive_MR Hive_tez Spark_1.6.2 Spark_2.0.1

Tim

e in

sec

on

ds

Average of three executions using 100 GB Scale Factor

Page 7: Using BigBench to compare Hive and Spark

Query 12 CPU behavior

7

Tez Spark 1.6.2 Spark 2.0.2

Average of three executions using 100 GB Scale Factor

Page 8: Using BigBench to compare Hive and Spark

Custom Reducers

8

2815

1122

16291466

0

500

1000

1500

2000

2500

3000

Hive_MR Hive_tez Spark_1.6.2 Spark_2.0.1

Tim

e in

sec

on

ds

Average of three executions using 100 GB Scale Factor

Page 9: Using BigBench to compare Hive and Spark

Natural Language Processing

9

5004

1100

2289

1913

0

1000

2000

3000

4000

5000

6000

Hive_MR Hive_Tez Spark_1.6.2 Spark_2.0.1

Tim

e in

sec

on

ds

Average of three executions using 100 GB Scale Factor

Page 10: Using BigBench to compare Hive and Spark

Machine Learning

10

3550

1613

3769

1898

4045

1937

3390

0

500

1000

1500

2000

2500

3000

3500

4000

4500

Hive_MR +Mahout

Hive_MR +Spark_ml

Hive_tez +Mahout

Hive_Tez +Spark_ML

Spark_1.6.2 +Mahout

Spark_1.6.2 +Spark_ML

Spark_2.0.1 +Mahout

Tim

e in

sec

on

ds

Average of three executions using 100 GB Scale Factor

Page 11: Using BigBench to compare Hive and Spark

11

Aggregated Results

6223 6223

1601 1623 1848 1951 1457

2815 2815

1122 11371629 1489

1466

5004 5004

1100 1082

2289 23561913

3550

1613

3769

1898

4045

1937 3390

17592

15655

7592

5740

9811

77338226

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

20000

Hive_MR +Mahout

Hive_MR +Spark_ML

Hive_tez +Mahout

Hive_Tez +Spark_ML

Spark_1.6.2 +Mahout

Spark_1.6.2 +Spark_ML

Spark_2.0.1 +Mahout

Tim

e in

sec

on

ds

Pure-QL Custom Reducers NLP ML Total

Average of three executions using 100 GB Scale Factor

Page 12: Using BigBench to compare Hive and Spark

Conclusions

• Hive on Tez greatly improves SQL performance over Hive on MapReduce.• It is also faster than Hive on spark 1.

• Hive on spark 2 is slightly faster.

• The Spark implementation is based on hive…

• Spark MLlib has an excellent performance over Mahout.

• Best production combination: Apache Tez for SQL + Spark MLlib forMachine Learning.

12

Page 13: Using BigBench to compare Hive and Spark

Thanks, questions?

Follow up / feedback : [email protected]

Using BigBench to compare Hive and Spark

13