Shark: SQL and Rich Analytics at Scale
Reynold Xin, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
AMPLab, UC Berkeley
June 25 @ SIGMOD 2013
Challenges
Data size growing » Processing has to scale out over large clusters » Faults and stragglers complicate DB design
Complexity of analysis increasing » Massive ETL (web crawling) » Machine learning, graph processing » Leads to long running jobs
The Rise of MapReduce
What’s good about MapReduce?
1. Scales out to thousands of nodes in a fault-tolerant manner
2. Good for analyzing semi-structured data and complex analytics
3. Elasticity (cloud computing)
4. Dynamic, multi-tenant resource sharing
“parallel relational database systems are significantly faster than those that rely on the use of MapReduce for their query engines”
“I totally agree.”
This Research
1. Shows MapReduce model can be extended to support SQL efficiently » Started from a powerful MR-like engine (Spark) » Extended the engine in various ways
2. The artifact: Shark, a fast engine on top of MR » Performant SQL » Complex analytics in the same engine » Maintains MR benefits, e.g. fault-tolerance
MapReduce Fundamental Properties?
Data-parallel operations » Apply the same operations on a defined set of data
Fine-grained, deterministic tasks » Enables fault-tolerance & straggler mitigation
Why Were Databases Faster?
Data representation » Schema-aware, column-oriented, etc » Co-partition & co-location of data
Execution strategies » Scheduling/task launching overhead (~20s in Hadoop) » Cost-based optimization » Indexing
Lack of mid-query fault tolerance » MR’s pull model costly compared to DBMS “push”
See Pavlo 2009, Xin 2013.
Not fundamental to “MapReduce”
Can be surprisingly cheap
Introducing Shark
MapReduce-based architecture » Uses Spark as the underlying execution engine » Scales out and tolerates worker failures
Performant » Low-latency, interactive queries » (Optionally) in-memory query processing
Expressive and flexible » Supports both SQL and complex analytics » Hive compatible (storage, UDFs, types, metadata, etc)
Spark Engine Fast MapReduce-like engine » In-memory storage for fast iterative computations » General execution graphs » Designed for low latency (~100ms jobs)
Compatible with Hadoop storage APIs » Read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
Growing open source platform » 17 companies contributing code
More Powerful MR Engine
General task DAG
Pipelines functions within a stage
Cache-aware data locality & reuse
Partitioning-aware to avoid shuffles
[Figure: task DAG over datasets A to G using join, union, groupBy, and map, split into Stages 1 to 3; legend: previously computed partition]
Hive Architecture
[Figure: Hive architecture. Components shown: Client (CLI, JDBC); Driver (SQL Parser, Query Optimizer, Physical Plan, Execution); Meta store; MapReduce; Hadoop Storage (HDFS, S3, …)]
Shark Architecture
[Figure: Shark architecture. Components shown: Client (CLI, JDBC); Driver (SQL Parser, Query Optimizer, Physical Plan, Execution, Cache Mgr.); Meta store; Spark; Hadoop Storage (HDFS, S3, …)]
Extending Spark for SQL Columnar memory store
Dynamic query optimization
Miscellaneous other optimizations (distributed top-K, partition statistics & pruning a.k.a. coarse-grained indexes, co-partitioned joins, …)
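The "partition statistics & pruning" item can be illustrated with a minimal sketch. It assumes each cached partition records the minimum and maximum of one column, so a range predicate can skip partitions that cannot match. All names here (`PartitionStats`, `prune`) are hypothetical and are not Shark's actual API.

```scala
// Hypothetical sketch of coarse-grained indexing via per-partition
// min/max statistics; not Shark's actual implementation.
object PartitionPruning {
  // Min and max of one column, recorded for one cached partition.
  final case class PartitionStats(id: Int, min: Long, max: Long)

  // Keep only partitions whose [min, max] range overlaps the
  // predicate range [lo, hi]; the others cannot contain matches.
  def prune(stats: Seq[PartitionStats], lo: Long, hi: Long): Seq[Int] =
    stats.filter(s => s.max >= lo && s.min <= hi).map(_.id)

  def main(args: Array[String]): Unit = {
    val stats = Seq(
      PartitionStats(0, 1L, 100L),
      PartitionStats(1, 101L, 200L),
      PartitionStats(2, 201L, 300L)
    )
    // WHERE col BETWEEN 150 AND 220 only needs partitions 1 and 2.
    println(prune(stats, 150L, 220L))
  }
}
```

The statistics are cheap to keep because there is one small record per partition rather than one index entry per row, which is why the slide calls them coarse-grained indexes.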
Columnar Memory Store
Simply caching records as JVM objects is inefficient (huge overhead in MR’s record-oriented model).
Shark employs column-oriented storage: a partition of columns is one MapReduce “record”.
Column Storage
1 2 3
john mike sally
4.1 3.5 6.4
Row Storage
1 john 4.1
2 mike 3.5
3 sally 6.4
Benefit: compact representation, CPU efficient compression, cache locality.
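The row versus column layouts above can be sketched in a few lines of Scala, using the three records from the slide. This is an illustrative sketch only: `Row`, `ColumnPartition`, and `toColumns` are hypothetical names, and the real store also applies compression, which is omitted here.

```scala
// Hypothetical sketch contrasting row and column layouts;
// not Shark's actual memory store.
object ColumnarSketch {
  // Row-oriented: one JVM object per record.
  final case class Row(id: Int, name: String, score: Double)

  // Column-oriented: one array per column for a whole partition,
  // so the partition as a whole acts as a single "record".
  final case class ColumnPartition(
      ids: Array[Int],
      names: Array[String],
      scores: Array[Double]) {
    def size: Int = ids.length
    // Rebuild a single row on demand from the column arrays.
    def row(i: Int): Row = Row(ids(i), names(i), scores(i))
  }

  def toColumns(rows: Seq[Row]): ColumnPartition =
    ColumnPartition(
      rows.map(_.id).toArray,
      rows.map(_.name).toArray,
      rows.map(_.score).toArray)

  def main(args: Array[String]): Unit = {
    val rows = Seq(Row(1, "john", 4.1), Row(2, "mike", 3.5), Row(3, "sally", 6.4))
    val cols = toColumns(rows)
    // Scanning one column touches a single contiguous primitive array.
    println(cols.scores.sum / cols.size)
    println(cols.row(1))
  }
}
```

Primitive arrays avoid per-record object headers and pointers, which is where the compactness and cache-locality benefits come from.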
How do we optimize:
SELECT * FROM table1 a JOIN table2 b ON a.key=b.key WHERE my_crazy_udf(b.field1, b.field2) = true;
Hard to estimate cardinality!
Partial DAG Execution (PDE) Lack of statistics for fresh data and the prevalent use of UDFs necessitate dynamic approaches to query optimization.
PDE allows dynamic alteration of query plans based on statistics collected at run-time.
[Figure: shuffle join plan (Stages 1 to 3, join result) compared with map join plan (Stages 1 and 2, join result)]
Map Join (Broadcast Join) minimizes network traffic
PDE Statistics Gather customizable statistics at per-partition granularities while materializing map output. » partition sizes, record counts (skew detection) » “heavy hitters” » approximate histograms
Can alter query plan based on such statistics » map join vs shuffle join » symmetric vs non-symmetric hash join » skew handling
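The map join versus shuffle join decision can be sketched as a function of the per-partition sizes observed while materializing map output. This is a simplified illustration, not Shark's optimizer: the names (`StageStats`, `choose`) are hypothetical, and the 32 MB broadcast threshold is an assumed value that does not come from the talk.

```scala
// Hypothetical sketch of run-time join selection from PDE-style
// statistics; not Shark's actual optimizer.
object JoinSelection {
  sealed trait JoinStrategy
  case object MapJoin extends JoinStrategy
  case object ShuffleJoin extends JoinStrategy

  // Bytes of map output per partition, as observed at run time.
  final case class StageStats(partitionBytes: Seq[Long]) {
    def totalBytes: Long = partitionBytes.sum
  }

  // Assumed threshold for illustration only.
  val BroadcastThresholdBytes: Long = 32L * 1024 * 1024

  // Broadcast the smaller input if it fits under the threshold;
  // otherwise fall back to shuffling both inputs.
  def choose(left: StageStats, right: StageStats): JoinStrategy = {
    val smaller = math.min(left.totalBytes, right.totalBytes)
    if (smaller <= BroadcastThresholdBytes) MapJoin else ShuffleJoin
  }

  def main(args: Array[String]): Unit = {
    val big = StageStats(Seq(1L << 30, 1L << 30))
    val small = StageStats(Seq(1024L, 2048L))
    println(choose(big, small)) // small side fits under the threshold
    println(choose(big, big))   // neither side fits
  }
}
```

Because the sizes are measured after the map stages have run, the choice does not depend on an up-front cardinality estimate, which is the point of PDE when UDFs make such estimates unreliable.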
Complex Analytics Integration Unified system for SQL, machine learning
Both share the same set of workers and caches
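    def logRegress(points: RDD[Point]): Vector = {
      // Start from random weights in [-1, 1)
      var w = Vector(D, _ => 2 * rand.nextDouble - 1)
      for (i <- 1 to ITERATIONS) {
        // Sum the per-point gradient contributions across all partitions
        val gradient = points.map { p =>
          val denom = 1 + exp(-p.y * (w dot p.x))
          (1 / denom - 1) * p.y * p.x
        }.reduce(_ + _)
        w -= gradient
      }
      w
    }

    // Run a SQL query and get the result back as an RDD
    val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid=u.uid")

    // Extract a feature vector from each row
    val features = users.mapRows { row =>
      new Vector(extractFeature1(row.getInt("age")),
                 extractFeature2(row.getStr("country")),
                 ...)
    }

    // Cache the features so every iteration reads them from memory
    val trainedVector = logRegress(features.cache())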
Pavlo Benchmark
[Figure: runtime (seconds) bar charts. Panels: Selection (series: Shark, Shark (disk), Hive; data label 1.1); Aggregation, 1K Groups (data label 32); a third panel with series Hive, Shark (disk), Shark, Shark Copartitioned]
Machine Learning Performance
[Figure: runtime per iteration (secs) for Shark and Hadoop. Panels: K-Means Clustering (data labels 157 and 4.1); Logistic Regression (data labels 110 and 0.96)]
Real Warehouse Benchmark
[Figure: runtime (seconds), axis 0 to 100, for queries Q1 to Q4; series: Shark, Shark (disk), Hive; data labels 1.1, 0.8, 0.7, 1.0]
1.7 TB Real Warehouse Data on 100 EC2 nodes
New Benchmark
[Figure: runtime (seconds), axis 0 to 20, for Impala, Impala (mem), Redshift, Shark (disk), Shark (mem)]
http://tinyurl.com/bigdata-benchmark
Other benefits of MapReduce Elasticity » Query processing can scale up and down dynamically
Straggler Tolerance
Schema-on-read & Easier ETL
Engineering » MR handles task scheduling / dispatch / launch » Simpler query processing code base (~10k LOC)
Berkeley Data Analytics Stack
[Figure: stack components: Mesos (resource manager); HDFS / Hadoop Storage; Spark; Shark (SQL), Spark Streaming, GraphX, MLBase]
Community
3000 people attended online training
800 meetup members
17 companies contributing
Conclusion Leveraging a modern MapReduce engine and techniques from databases, Shark supports both SQL and complex analytics efficiently, while maintaining fault-tolerance.
Growing open source community » Users observe similar speedups in real use cases » http://shark.cs.berkeley.edu » http://www.spark-project.org
[Figure labels: MapReduce, DBMSs, Shark]