Radiantadvisors benchmark sql on hadoop 2014q1

26
Independent Benchmark Open-Source SQL-on-Hadoop Performance Q1 2014

description

http://radiantadvisors.com/wp-content/uploads/2014/04/RadiantAdvisors_Benchmark_SQL-on-Hadoop_2014Q1.pdf

Transcript of Radiantadvisors benchmark sql on hadoop 2014q1

Page 1: Radiantadvisors benchmark sql on hadoop 2014q1

Independent BenchmarkOpen-Source SQL-on-Hadoop PerformanceQ1 2014

Page 2: Radiantadvisors benchmark sql on hadoop 2014q1

radiantadvisors.com | 2

About Radiant AdvisorsRadiant Advisors is a leading strategic research and advisory firm that delivers innovative, cutting-edge research and thought-leadership to transform today’s organizations into tomorrow’s data-driven industry leaders.

To learn more, visit www.radiantadvisors.com.

Table of Contents

Executive Summary 3

Takeaways and Insights 4

Benchmark Summary 8

Closing 23

radiantadvisors.com | 2

Page 3: Radiantadvisors benchmark sql on hadoop 2014q1

radiantadvisors.com | 3

Executive SummaryFollowing several years of hype and headlines, 2014 is the year that mainstream companies are starting to adopt Hadoop to unlock business value -- with or without big data. Yet, while data scientists are the new rock stars of the data world, users that leverage decades of Structured Query Language (SQL) knowledge, technologies, and experience will continue to generate the majority of business value through its ease of accessibility.

SQL access to data stored in Hadoop was one of the earliest demands by business analysts that led to the creation of Hive. Hive-QL is a SQL interface to Hadoop MapReduce with limited syntax that enabled access for many early users. Since the creation of Hive, several alternatives have come to market. Cloudera released Impala 1.0 in May 2013 to provide interactive and higher performance SQL access to data stored in the Hadoop Distributed File System (HDFS) without using MapReduce. Open-source Presto also provides improved SQL capability and performance while accessing the HDFS directly. And, open-source InfiniDB for Hadoop comes from a massively parallel processing (MPP) relational database background to offer complex and analytic SQL capabilities with the high performance that users have come to expect.

This independent performance benchmark focused on the performance dimensions of speed and SQL capabilities by comparing SQL-on-Hadoop options that are open-source and free to download for existing Apache, Hortonworks, and Cloudera distributions. We also illustrated the relationship to available data file encodings -- such as Optimized RC, SequenceFile, Parquet, or InfiniDB -- for compression, performance, and openness.

radiantadvisors.com | 3

Page 4: Radiantadvisors benchmark sql on hadoop 2014q1

radiantadvisors.com | 4

Takeaways and Insights

Page 5: Radiantadvisors benchmark sql on hadoop 2014q1

radiantadvisors.com | 5

The Value of Open-Source SQL-on-Hadoop PerformanceThe affordability and scalability of open-source Hadoop has democratized big data and analytic capabilities for all companies to exploit. While big data still means many things to many people today, solving big problems and creating new, innovative opportunities remain at the heart of this data platform.

One of the most proven ways to generate business value comes from enabling as many people as possible within the company with broad access to data for discovery and analysis. Traditionally, business intelligence (BI) tools have enabled this access via SQL connectors to relational databases and data stores, such as data warehouses and data marts. Whether through reports or interactive analysis with business metrics and dashboards, SQL has become the most widely learned and accepted form of data access. A significant portion of data -- and big data -- contains some form of structure that can be accessed via the SQL paradigm.

Unlocking business value with SQL access requires more than basic access to data within the Hadoop HDFS. Three critical factors of leveraging SQL-on-Hadoop -- Speed, Scalability, and SQL Capability -- deliver the optimum value.

Speed is the database response time from when a user submits the SQL statement to the time the returned data set begins.

Without interactive speed, performing data analysis is a slow process for the user and the system can be deemed unusable or unavailable. Reports and extractions can be scheduled.

Scalability is the number of nodes within a Hadoop cluster that the SQL statement can be executed on via the SQL engine.

Node storage sizes will continue to evolve, but if you need SQL access to a big data set then you must verify your SQL engine can be executed on all nodes within the Hadoop cluster.

SQL Capability is especially important to move beyond simple data access to realize more value from analytics within Hadoop.

Basic access begins with ANSI SQL-92 standard and connecting with tools, but more advanced SQL engines will support the much needed analytic-oriented functions from later standards.

radiantadvisors.com | 5

Page 6: Radiantadvisors benchmark sql on hadoop 2014q1

radiantadvisors.com | 6

Key Takeaways

There are proven faster performing options to Hive 0.11 and 0.12 available for Hadoop.

Presto 0.57 is much faster performing than Hive 0.11 and 0.12, but look to Impala and InfiniDB for enabling interactivity.

InfiniDB leverages the MySQL connector found in nearly all tools; Hive, Presto, and Impala are gaining compatibility.

Hadoop Clusters with InfiniDB

The InfiniDB engine and file format delivered up to 40x performance gains over Hive for reporting queries and an additional 20x - 40x over Presto, while adding significant additional analytic SQL capabilities.

When both high performance and analytic SQL capabilities are required, migrating data into InfiniDB files is a good option. While IDB files can be accessed by MapReduce programs, they lack a broad community standard support.

Hadoop Clusters with Impala

The Impala engine had significant performance gains that enable interactivity compared to Hive and Presto. Some queries required minor edits and it was not able to execute analytic queries 8 - 10.

Using Parquet partitioned files for Impala was based on the growing acceptance of Parquet files in HDFS for other interfaces within Hadoop. The Impala roadmap shows future support for analytic SQL functions.

Hadoop Clusters with Presto

Adding Presto can deliver 10x - 14x performance gains for Hadoop clusters that have data in ORC data files that are currently using Hive 0.11 or 0.12 with no changes to those data files.

Scheduled Hive jobs for reports or data extractions will benefit from using Presto with little change to SQL statements.

ORC partitioned files were twice as fast with Presto than Hive, while SequenceFile partitioned files were twice as fast with Hive than Presto. radiantadvisors.com | 6

Page 7: Radiantadvisors benchmark sql on hadoop 2014q1

radiantadvisors.com | 7

Data file encodings for HDFS can be optimized for compression, openness, and even SQL performance from a given SQL engine’s perspective. Four encodings were tested in the benchmark-- Optimized RC, SequenceFile, Parquet, and IDB -- with both date partitioned and non-partitioned keys.

While performance by file encoding type is noted, the SQL engines themselves provide the orders of magnitude in gains and capability for evaluation.

SQL Engines

Insights

As part of the Hadoop 2 architecture, YARN needs to be a consideration for your own Hadoop architecture strategy. Adopting the YARN-pluggable architecture for resource management versus allowing HDFS direct access approach can have implications for effectively managing your overall Hadoop cluster and ecosystem for application and user stability.

File Encoding Standards

Hadoop Architecture

radiantadvisors.com | 7

The rapid maturing of SQL-on-Hadoop engines will undoubtedly continue to benefit end users and strive to meet more and more of their demands. These engines will be relied upon to deliver the optimal performance combination of response times, scalability, and SQL language capability today and later in its roadmap. While it is possible to have all SQL engines co-existing in the Hadoop environment, understanding default and optimized engine roles will help in aligning requirements with engines.

Balancing which of the dimensions of response time, scalability, or SQL capability is most critical for processes (like discovery and analysis versusadvanced analytics on big or small data sets) will help in selecting which SQL engine should be used. Don’texpect any one SQL engine to meet all your Hadoop data access needs today: it’s okay to have more than one open-source and cost-effective option.

Page 8: Radiantadvisors benchmark sql on hadoop 2014q1

radiantadvisors.com | 8

Benchmark Results

Page 9: Radiantadvisors benchmark sql on hadoop 2014q1

radiantadvisors.com | 9

The ApproachNo performance benchmark is perfect as every user’s specific applications or workloads can vary between companies and are often unique. Though the TPC-DS benchmark strives to be exhaustive, this performance benchmark chose to limit variables and focus on a straightforward comparative: SQL engines performance on Hadoop. We compared open-source and free for download SQL engines with data loaded into various file formats in Hadoop, while striving to keep many other factors constant.

The open-source web analytics application Piwik was selected for the performance benchmark for schema, seed data, and application queries. The Piwik queries are used to drive the web analytics dashboard application, and the Piwik schema included three tables representing user visits and actions. Three years of initial data was provided, and then was duplicated to represent 100x user activity over the same three years. In total, this schema represented approximately one terabyte of user data. SQL statements executed were based on requiring minimal SQL statement editing from Piwik app.

After loading data files all queries were manually run independently at least 10 times per each cluster, SQL engine, and file encoding format. Performance stats were based upon the average of 10 runs, with analysis of minimum, maximum, standard deviation, average deviation from mean, and percent deviation to average ratio for observed anomolies. No third party access tools were used and only interactive command line, scripts, or log files.

Benchmark Application - Piwik

Benchmark Data Collection

Each cluster was identical Amazon servers with one head node and four data nodesSingle user query concurrency with no other activitiesManually flushing cache and not running the same query multiple times in a row No load performance or compression factors measured

Scalability

Concurrency

Shuffling

Data

-

-

-

-

Fixed Variables & Out of Scope

radiantadvisors.com | 9

Page 10: Radiantadvisors benchmark sql on hadoop 2014q1

radiantadvisors.com | 10

ConfigurationsAmazon EC2 extra-large servers and EBS storage were used to configure three identical AWS clusters of five servers (one name node and four data nodes):

Piwik schema tables (~1.1TB uncompressed)log_visit (visits to a site with many actions) 1,412,062,500 rowslog_link_visit_action (many-to-many relationship) 6,484,811,700 rowslog_action (individual URLs visited on sites) 2,539,682 rows

Cluster 1 Hortonworks 1.3Hive 0.11Presto 0.57InfiniDB Enterprise 4.0 ORC partitioned filesSequenceFile partitioned filesIDB columnar files

Cluster 2 Hortonworks 2.0Hive 0.12Presto 0.57InfiniDB Enterprise 4.0

ORC partitioned filesSequenceFile partitioned filesIDB columnar files

Cluster 3 Cloudera CDH5 beta Hive 0.11Impala 1.2InfiniDB Enterprise 4.0

Parquet partitioned filesIDB columnar files

Using Amazon-based EC2 and EBS for Hadoop clusters further demonstrates the accessibility and quick deployment in the cloud

radiantadvisors.com | 10

Query Types:Queries 1 - 4 represent Reporting Queries 5 - 7 represent Adhoc Queries 8 - 10 represent Analytic

Page 11: Radiantadvisors benchmark sql on hadoop 2014q1

radiantadvisors.com | 11

Benchmark Results

0.0 100.0 200.0 300.0 400.0 500.0 600.0 700.0 8 00.0

Q10

Q9

Q8

Q7

Q6

Q5

Q4

Q3

Q2

Q1

Hive 0.12 Presto 0.57 InfiniDB 4.0 Impala 1.2

Seconds

Hive completed 4 of the 10 benchmark queries where date partitions were leveraged. Hive could likely complete these with more nodes, but would require a much longer time or significant SQL rewriting.

Impala was the fastest on the reporting queries using Parquet-partitioned files, but still 4x - 12x slower than InfiniDB on adhoc queries and was not able to execute analytic queries 8 - 10.

InfiniDB completed all queries in the benchmark with exceptional times using the IDB file format while slightly slower than Impala on the reporting queries.

Presto completed 9 of the 10 benchmark queriessignificantly faster than Hive, however was stillmultiple times slower than Impala or InfiniDB.

Clusters 1 and 2 implementations of Hortonworks 1.3 and 2 demonstrated only a slight difference in results. Therefore, the analysis only discusses Hadoop 2.

There was a consistent and inverted performance gain between ORC-partitioned and SEQ-partitioned files for both Hive and Presto in clusters 1 and 2. Only ORC-partitioned files were included in the analysis

radiantadvisors.com | 11

Page 12: Radiantadvisors benchmark sql on hadoop 2014q1

radiantadvisors.com | 12

Comments:

Impala is fastest with InfiniDB within a few seconds on clusters 2 and 3Hive 0.11 on Cloudera intentionally not tested with Parquet filesPresto and Impala added “limit 10000” to end of the queryInfiniDB retained the implicit numbered column alias in GROUP BY from Piwik

Note: Cluster 2 (800 seconds) shown as 10x scale compared to Cluster 3 (80 seconds).

Query 1Reporting Query: Select statement with sub-select containing count-distinct, filtering, ordering, group by, order by, and limit leveraging date partition.

Hive syntax example (10,000 rows returned)

select name, type, subq.*from (select idaction_urlidaction, count(distinct idvisit) dis_visit , count(*) subquery_count from log_link_visit_ action where (part_date between ‘2011-01-01’ and ‘2011-01-31’ and idsite = 5 and idaction_url > 0 ) group by idaction_ url order by subquery_count desc limit 10000) subqjoin log_action_tbl on (subq.idaction = log_ action.idaction) order by subquery_count desc ;

Cluster 2

Seconds

47x Hive 0.12 721.21 s

45.24 s15.34 s

Presto 0.57InfiniDB 4.0

Hortonworks 2.0ORC/Partitioned

IDB Files

Cluster 3

Seconds

1.4x 12.48 s17.52 s

Cloudera Beta 5PARQUET/Partitioned

Hive 0.11Impala 1.2

InfiniDB 4.0

not tested

IDB Files

Page 13: Radiantadvisors benchmark sql on hadoop 2014q1

radiantadvisors.com | 13

Comments:

Impala is fastest with InfiniDB within one second on clusters 2 and 3Hive 0.11 on Cloudera intentionally not tested with Parquet filesPresto and Impala added “limit 10000” to end of the queryInfiniDB retained the implicit numbered column alias in GROUP BY from PiwikInfiniDB specifies another schema for INDEX

Note: Cluster 2 (800 seconds) shown as 10x scale compared to Cluster 3 (80 seconds).

Query 2Reporting Query:Select statement with count-distinct, filtering, ordering, group by, order by, and limit leveraging date partition.

Impala syntax example(10,000 rows returned)

select visit_entry_idaction_url idaction, count(distinct visitor_ idcookie) INDEX_NB_UNIQ_ VISITORS, count(*) INDEX_PAGE_ NB_HITSfrom log_visitwhere (part_date >= ‘2011-01-01’ and part_date <= ‘2011-01-31’ and idsite = 5 and visit_entry_idaction_url > 0 ) group by visit_entry_idaction_url order by INDEX_PAGE_NB_HITS desclimit 10000;

Cluster 2

Seconds

38x 129.29 s

18.89 s3.40 s

Hortonworks 2.0ORC/Partitioned

Hive 0.11

InfiniDB 4.0

IDB Files

Presto 0.57

Cluster 3

Seconds

1.2x 3.10 s

3.98 s

Cloudera Beta 5PARQUET/Partitioned

Hive 0.11Impala 1.2

InfiniDB 4.0

not tested

IDB Files

Page 14: Radiantadvisors benchmark sql on hadoop 2014q1

radiantadvisors.com | 14

Comments:

Impala is fastest with InfiniDB within one second on clusters 2 and 3Hive 0.11 on Cloudera intentionally not tested with Parquet filesPresto and Impala added “limit 10000” to end of the queryInfiniDB retained the implicit numbered column alias in GROUP BY from PiwikInfiniDB specifies another schema for INDEX

Note: Cluster 2 (800 seconds) shown as 10x scale compared to Cluster 3 (80 seconds).

Query 3Reporting Query:Select statement with sum group by, filtering, and ordering leveraging date partition.

Hive syntax example(21,552 rows returned)

select idaction_url_ref, sum(time_ spent_ref_action) INDEX_PAGE_ SUM_TIME_SPENTfrom log_link_visit_ actionwhere (part_date between ‘2011-01-01’ and ‘2011-01-31’and idsite = 5 and time_spent_ref_ action > 0 and idaction_url_ref > 0 ) group by idaction_url_reforder by idaction_url_ref desc;

Cluster 2

Seconds

47x 364.59 s

24.53 s7.69 s

Hortonworks 2.0ORC/Partitioned

Hive 0.11

InfiniDB 4.0

IDB Files

Presto 0.57

Cluster 3

Seconds

1.3x 6.30 s8.60 s

Cloudera Beta 5PARQUET/Partitioned

Hive 0.11Impala 1.2

InfiniDB 4.0

not tested

IDB Files

Page 15: Radiantadvisors benchmark sql on hadoop 2014q1

radiantadvisors.com | 15

Note: Cluster 2 (800 seconds) shown as 10x scale compared to Cluster 3 (80 seconds).

Query 4Reporting Query:Select statement with sum, filtering, ordering, and limit leveraging date partition.

Impala syntax example(7,177 rows returned)

select idaction_name_ref, sum(time_ spent_ref_action)from log_link_visit_actionwhere (part_date >= ‘2011-01-01’ and part_date <= ‘2011-01-31’ and idsite = 5 and time_spent_ref_action > 0 and idaction_name_ref > 0 )group by idaction_name_ref order by idaction_name_ref desc limit 10000;

Comments:

Impala is fastest with InfiniDB within one second on clusters 2 and 3Hive 0.11 on Cloudera intentionally not tested with Parquet filesPresto and Impala added “limit 10000” to end of the queryInfiniDB retained the implicit numbered column alias in GROUP BY from PiwikInfiniDB specifies another schema for INDEX

Cluster 2

Seconds

46x 358.06 s

23.20 s7.75 s

Hortonworks 2.0ORC/Partitioned

Hive 0.11

InfiniDB 4.0

IDB Files

Presto 0.57

Cluster 3

Seconds

1.4x

Cloudera Beta 5PARQUET/Partitioned

6.10 s8.62 s

Hive 0.11Impala 1.2

InfiniDB 4.0

not tested

IDB Files

Page 16: Radiantadvisors benchmark sql on hadoop 2014q1

radiantadvisors.com | 16

Note: Cluster 2 (800 seconds) shown as 10x scale compared to Cluster 3 (80 seconds).

Query 5Adhoc Query:Select statement with date functions, no joins, filtering, and grouping by date functions without leveraging date partition in where clause.

InfiniDB syntax example(21,552 rows returned)

select year(visit_server_date), month(visit_server_date), count(distinct idvisit) from log_visitwhere visit_entry_idaction_url = 2301411group by year(visit_server_date),month(visit_server_date);

Comments:

InfiniDB is fastest with Impala 12x slower but still twice as fast as PrestoHive 0.12 was not able to complete with resource issues noted in log filesHive 0.11 on Cloudera intentionally not tested with Parquet filesPresto required data parse for implicit string to date conversions

Cluster 2

Seconds

38x 48.16 s1.28 s

Hortonworks 2.0ORC/Partitioned

Hive 0.11

InfiniDB 4.0

not able to complete

IDB Files

Presto 0.57

Cluster 3

Seconds

12x 18.66 s

1.60 s

Cloudera Beta 5PARQUET/Partitioned

Hive 0.11Impala 1.2

InfiniDB 4.0

not tested

IDB Files

Page 17: Radiantadvisors benchmark sql on hadoop 2014q1

radiantadvisors.com | 17

Note: Cluster 2 (800 seconds) shown as 10x scale compared to Cluster 3 (80 seconds).

Query 6Adhoc Query:Select statement with date functions, count-distinct, no joins, filtering, and grouping by date functions without leveraging date partition in where clause.

Impala syntax example(726 rows returned)

select year(server_time),month(server_time),idaction_url,count(distinct idvisit) from log_link_visit_actionwhere idaction_url between 2301400 and 2302400group by year(server_time),month(server_time),idaction_url; Comments:

InfiniDB is fastest with Impala 10x slower but still twice as fast as PrestoHive 0.12 was not able to complete with resource issues noted in log filesHive 0.11 on Cloudera intentionally not tested with Parquet files

Cluster 2

Seconds

23x 142.46 s6.18 s

Hortonworks 2.0ORC/Partitioned

Hive 0.11

InfiniDB 4.0

IDB Files

Presto 0.57

not able to complete

Cluster 3

Seconds

9.9x 69.38 s6.99 s

Cloudera Beta 5PARQUET/Partitioned

Hive 0.11

Impala 1.2

InfiniDB 4.0

not tested

Page 18: Radiantadvisors benchmark sql on hadoop 2014q1

radiantadvisors.com | 18

Note: Cluster 2 (800 seconds) shown as 10x scale compared to Cluster 3 (80 seconds).

Query 7Adhoc Query:Select statement with sub-select containing count, join, filtering, ordering, and group by without leveraging date partition in where clause.

InfiniDB syntax example (2 rows returned)

select sub.visit_entry_idaction_url, sub.name, lv.referer_url, count(sub.idvisit)from log_visit lv, (select visit_entry_idaction_url, name, idvisitfrom log_visit join log_action on (visit_enåtry_idaction_url = log_action.idaction)where visit_entry_idaction_url between 2301400 and 2302400) subwhere lv.idvisit = sub.idvisitgroup by 1, 2, 3;

Comments:

InfiniDB is fastest with Impala 4x slower but still three times faster than PrestoHive 0.12 was not able to complete with resource issues noted in log filesHive 0.11 on Cloudera intentionally not tested with Parquet filesInfiniDB retained the implicit numbered column alias in GROUP BY from Piwik

Cluster 2

Seconds

16x 77.41 s4.66 s

Hortonworks 2.0ORC/Partitioned

Hive 0.12

Presto 0.57

InfiniDB 4.0

not able to complete

Cluster 3

Seconds

4x 24.18 s5.85 s

Cloudera Beta 5PARQUET/Partitioned

Hive 0.11Impala 1.2

InfiniDB 4.0

IDB Files

not tested

Page 19: Radiantadvisors benchmark sql on hadoop 2014q1

radiantadvisors.com | 19

Comments:

Rank and Dense Rank analytic functions in SQL statementInfiniDB is fastest compared with Presto Hive 0.12 and Impala were not able execute SQL statement InfiniDB retained implicit numbered column alias in GROUP and ORDER BY

Query 8Analytic Query:Select statement with sub-select containing count-distinct, filtering, ordering, group by, order by, and limit.

InfiniDB syntax example(274 rows returned)

select sub.visit_entry_idaction_url, sub.name, lv.referer_url, sum(visit_total_time) total_time, count(sub.idvisit), RANK () OVER (PARTITION BY sub.visit_entry_idaction_url ORDER BY count(sub.idvisit)) rank_by_visits, DENSE_RANK() OVER (PARTITION BY sub.visit_entry_idaction_url order by count(visit_total_time)) rank_by_time_spentfrom log_visit lv,(select visit_entry_idaction_url, name, idvisit from log_visit join log_action on (visit_entry_idaction_url = log_action.idaction) where visit_entry_idaction_url between 2301400 and 2302400) sub where lv.idvisit = sub.idvisitgroup by 1, 2, 3order by 1,6,7;

Cluster 2

Seconds

80x 506.84 s6.37 s

Hortonworks 2.0ORC/Partitioned

Hive 0.12

Presto 0.57

InfiniDB 4.0

not executable

IDB Files

Impala 1.2 not executable

Page 20: Radiantadvisors benchmark sql on hadoop 2014q1

radiantadvisors.com | 20

Query 9Analytic Query:Select statement with sum, count and cumlative distribution with sub-select containing a join, filtering, group by, and order by.

InfiniDB syntax example(274 rows returned)

select sub.visit_entry_idaction_url, sub.name, lv.referer_url, sum(visit_total_time) total_time,count(sub.idvisit) total_visits,CUME_DIST() OVER (PARTITION BY sub.visit_entry_idaction_url ORDER BY sum(visit_total_time) desc) cumulative_distribution from log_visit lv,(select visit_entry_idaction_url, name, idvisitfrom log_visitjoin log_action on (visit_entry_idaction_url = log_action.idaction)where visit_entry_idaction_url between 2301400 and 2302400) subwhere lv.idvisit = sub.idvisitgroup by 1, 2, 3order by 1,5, 6;

Comments:

Cumulative Distribution analytic functions in SQL statementInfiniDB is fastest compared with Presto Hive 0.12 and Impala were not able execute SQL statement InfiniDB retained implicit numbered column alias in GROUP and ORDER BY

Cluster 2

Seconds

75x 471.43 s6.24 s

Hortonworks 2.0ORC/Partitioned

Hive 0.12

Presto 0.57

InfiniDB 4.0

not executable

IDB Files

Impala 1.2 not executable

Page 21: Radiantadvisors benchmark sql on hadoop 2014q1

radiantadvisors.com | 21

Query 10Analytic Query:Select statement with standard deviation, variance population with sub-select containing date functions, count-distinct, sum, filtering, ordering, group by and order by.

InfiniDB syntax example(1,052 rows returned)select action_year, action_month, idaction_url, sum_visit_time, visit_count, stddev_pop(sum_visit_time) OVER (PARTITION BY idaction_url ORDER BY action_year,action_month) visit_time_dev, var_pop(sum_visit_time) OVER (PARTITION BY idaction_url ORDER BY action_year,action_month) visit_time_variancefrom(select year(server_time) action_year,month(server_time) action_month,idaction_url, count(distinct idvisit) visit_count, sum(time_spent_ref_action) sum_visit_time from log_link_visit_actionwhere idaction_url between 2301400 and 2302400group by year(server_time), month(server_time),idaction_url) a order by 3,1,2;

Comments:

Standard Deviation and Variance analytic functions in SQL statementInfiniDB was quick and the only SQL engine able to executeHive 0.12, Presto, and Impala were not able execute SQL statement InfiniDB retained implicit numbered column alias in GROUP and ORDER BY

Cluster 2

Seconds

not executable

not executable

Hortonworks 2.0ORC/Partitioned

IDB Files

not executable

not executable

7.01 s

Hive 0.12

Presto 0.57

InfiniDB 4.0

Impala 1.2 not executable

Page 22: Radiantadvisors benchmark sql on hadoop 2014q1

radiantadvisors.com | 22

Summary of Benchmark Test Results

All values are in seconds of SQL response timeAll values represent the average of 10 independent runs “x” denotes SQL could not be executed“n/t” denotes not test as not required“n/c” denotes not able to complete

Cluster 1 Hortonworks 1.3

Reporting Q-01Reporting Q-02Reporting Q-03Reporting Q-04

SQL Engine

File Encoding

Adhoc Q-05Adhoc Q-06Adhoc Q-07

Analytic Q-08Analytic Q-09Analytic Q-10

52.2826.2527.9527.14

516.12130.09290.35289.35

ORCPARTORCPART

74.32161.9486.74

n/cn/cn/c

n/tn/tx

xxx

15.363.717.747.81

IDB

1.707.236.91

n/tn/tn/t

96.3564.4073.1866.88

277.18102.59140.05137.47

SEQPARTSEQPART

371.38807.75388.15

n/cn/cn/c

n/tn/tx

xxx

Presto 0.57Hive 0.11 InfiniDB 4.0

Cluster 2 - Hortonworks 2.0

Reporting Q-01Reporting Q-02Reporting Q-03Reporting Q-04

SQL Engine

File Encoding

Adhoc Q-05Adhoc Q-06Adhoc Q-07

Analytic Q-08Analytic Q-09Analytic Q-10

45.2418.8924.5323.20

721.21129.29364.59358.06

ORCPARTORCPART

48.16142.4677.41

n/tn/tn/t

506.84471.43

x

xxx

15.343.407.697.75

IDB

1.286.184.66

6.376.247.01

116.8571.8486.2292.66

325.6891.39135.67131.72

SEQPARTSEQPART

374.44866.70462.09

n/tn/tn/t

n/tn/tn/t

xxx

Presto 0.57Hive 0.12 InfiniDB 4.0

Cluster 3 - Cloudera CDH5 beta

Reporting Q-01Reporting Q-02Reporting Q-03Reporting Q-04

SQL Engine

File Encoding

Adhoc Q-05Adhoc Q-06Adhoc Q-07

Analytic Q-08Analytic Q-09Analytic Q-10

12.483.106.306.10

PARQUET-PART

18.6669.3824.18

xxx

17.523.988.608.62

IDB

1.606.995.85

6.175.537.37

Impala 1.2 InfiniDB 4.0

Page 23: Radiantadvisors benchmark sql on hadoop 2014q1

radiantadvisors.com | 23

Closing

Page 24: Radiantadvisors benchmark sql on hadoop 2014q1

radiantadvisors.com | 24

Closing CommentsThis year, we will continue to see the evolution and maturing of these SQL engines along with the possible addition of new open-source options. This independent benchmark serves as a snapshot of where we were in Q1 of the 2014 Hadoop ecosystem. The subset of vendors and software components included in this benchmark were chosen for their architecture commonality and availability to mainstream companies looking to benefit from SQL-on-Hadoop processes and tools.

The Hadoop ecosystem continues to evolve with the recent GA release of Hadoop 2 and YARN. The exciting and fast-changing Hadoop ecosystem can be a challenge to keep up with as we follow encouraging stats on Hive 0.13 and Tez, Presto 0.60 is now available, and Parquet adoption continues to gain traction. We look forward to benchmarking these in the future.

We also recognize that other vendors have valuable solutions to offer companies with different packaging and business models. Our upcoming independent performance benchmarks will continue to strive for proper product categorizations and focused “apples-to-apples” comparisons.

Join the ConversationThis independent performance benchmark is intended to be part of an open, ongoing conversation to share and learn from each other’s experiences and perspectives. Please share your comments online with us at RadiantAdvisors.com or on Twitter @RadiantAdvisors.

radiantadvisors.com | 24

Page 25: Radiantadvisors benchmark sql on hadoop 2014q1

radiantadvisors.com | 25

AcknowledgementsInfiniDB engaged Radiant Advisors to perform an independent performance benchmark and analysis of preliminary findings conducted at InfiniDB. While InfiniDB is the sole sponsor of this report, Radiant Advisors has independently set up the cluster configurations, testing approach, benchmark execution, all data collection, and analysis. All findings and takeaways are solely those of Radiant Advisors.

Special thanks to Dr. Nathan Halko, resident data scientist, for providing an independent review of the test approach, data, and analysis.

Special thanks to Dundas Data Visualization, for providing data visualizations and visual engineering support to Lindy Ryan, Research Director, Data Discovery and Visualization at Radiant Advisors.

ReferencesPiwik.org - Open-source web analytics platform

Prestodb.io - Distributed SQL Query Engine for Big Data

Hortonworks.com - HDP distribution of Apache Hadoop

Cloudera.com - CDH distribution and Impala

InfiniDB.co - InfiniDB for Apache Hadoop

Parquet.io - Columnar file format for Hadoop

radiantadvisors.com | 25

Page 26: Radiantadvisors benchmark sql on hadoop 2014q1

Radiant Advisors Independent Benchmark Report

April 2014

Copyright © 2014 Radiant Advisors. All Rights Reserved.

InfiniDB and the InfiniDB logo and any other product or service names or slogans contained in this document are trademarks of InfiniDB and its suppliers or licen-sors, and may not be copied, imitated or used, in whole or in part, without the prior written consent of InfiniDB or the applicable trademark holder.

Hadoop, Sqoop, and the Hadoop elephant logo are trademarks of the Apache Software Foundation.

Dundas logo is a trademark of Dundas Data Visualization, Inc.

All other trademarks, registered trademarks, product names, and company names or logo mentioned in this document, include Radiant Advisors and its affiliated brands and logos, are the property of their respective owners. Reference to any products, services, processes, or other information, by trade name, trademark, manufacturer, and supplier or otherwise does not constitute or imply endorsement, sponsorship, or recommendation thereof by Radiant Advisors.

Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any or by any means, or for any purpose, without the express written permission of Radiant Advisors.

Radiant Advisors may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this report. Except as expressly provided in any written license agreement from Radiant Advisors, the furnishing of this document does not give you the license to these patents, trademarks, copyrights, or other intellectual property. The information in this document is subject to change without notice. Radiant Advisors has made every effort to ensure that all statements and information contained in this document are accurate, but accepts no liability for any error or omission in the same.