SparkSQL: A Compiler from Queries to RDDs


Transcript of SparkSQL: A Compiler from Queries to RDDs

SparkSQL: A Compiler from Queries to RDDs

Sameer Agarwal | Spark Summit | Boston | February 9th 2017

About Me

• Software Engineer at Databricks (Spark Core/SQL)
• PhD in Databases (AMPLab, UC Berkeley)
• Research on BlinkDB (Approximate Queries in Spark)

Background: What is an RDD?

• Dependencies
• Partitions
• Compute function: Partition => Iterator[T]

To Spark, the compute function is opaque computation and the records it produces are opaque data: the engine cannot look inside either to optimize the program.
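To make the opacity point concrete, here is a minimal sketch of an RDD program (assuming a local SparkContext; the Record case class and the values are invented for illustration): Spark only sees a black-box closure applied to black-box objects.

import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local SparkContext (in spark-shell, use the provided `sc` instead).
val sc = new SparkContext(new SparkConf().setAppName("rdd-opacity").setMaster("local[*]"))

// Opaque data: to Spark these are just JVM objects of some type T.
case class Record(id: Long, value: Long)
val records = sc.parallelize(Seq(Record(1L, 10L), Record(2L, 20L)))

// Opaque computation: Spark cannot look inside this closure to optimize it;
// it can only run it over each partition.
val doubled = records.map(r => Record(r.id, r.value * 2))
doubled.collect()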

RDD Programming Model

Construct the execution DAG using low-level RDD operators.
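As a sketch of what building the DAG by hand looks like, the talk's running query (the sum of 1 + 2 + t1.value over the join of t1 and t2 where t2.id > 50 * 1000) can be wired up directly from low-level RDD operators; the t1 and t2 RDDs and their contents are invented for illustration, and sc is the SparkContext from the previous sketch.

// t1 as (id, value) pairs, t2 as ids.
val t1 = sc.parallelize(Seq((1L, 10L), (60001L, 30L)))
val t2 = sc.parallelize(Seq(1L, 60001L))

val result = t1
  .join(t2.filter(_ > 50 * 1000).map(id => (id, ())))   // t1.id = t2.id AND t2.id > 50 * 1000
  .map { case (_, (value, _)) => 1 + 2 + value }         // 1 + 2 + t1.value AS v
  .sum()                                                 // sum(v)

// Spark executes exactly this DAG: nothing folds 1 + 2 into 3 or reorders the
// filter, because every step is an opaque function over opaque data.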

SQL/Structured Programming Model

• High-level APIs (SQL, DataFrame/Dataset): programs describe what data operations are needed without specifying how to execute those operations (see the sketch below)
• More efficient: an optimizer can automatically find the most efficient plan to execute a query
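For contrast, here is a sketch of the same computation in the structured style (the SparkSession, table names, and data are again invented for illustration): the program states what is wanted and leaves the how to the optimizer described next.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("structured").master("local[*]").getOrCreate()
import spark.implicits._

val t1 = Seq((1L, 10L), (60001L, 30L)).toDF("id", "value")
val t2 = Seq(1L, 60001L).toDF("id")

val result = t1.join(t2, "id")                        // t1.id = t2.id
  .where($"id" > 50 * 1000)                           // t2.id > 50 * 1000
  .select((lit(1) + lit(2) + $"value").as("v"))       // 1 + 2 + t1.value AS v
  .agg(sum($"v"))                                     // sum(v)

result.explain()   // the optimizer, not the user, picks the execution plan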

Spark SQL Overview

[Diagram] A user program arrives as a SQL AST, a DataFrame, or a Dataset (Java/Scala). These abstractions of users' programs are represented as trees and pushed through transformations: Query Plan, then Optimized Query Plan, then RDDs. Catalyst covers the planning and optimization steps; Tungsten covers the execution step.

How Catalyst Works: An Overview

[Diagram] Catalyst takes the trees built from a SQL AST, DataFrame, or Dataset program and applies transformations: Query Plan, then Optimized Query Plan, then RDDs.

Trees: Abstractions of Users' Programs

SELECT sum(v)
FROM (SELECT t1.id,
             1 + 2 + t1.value AS v
      FROM t1 JOIN t2
      WHERE t1.id = t2.id AND t2.id > 50 * 1000) tmp

Expression

• An expression represents a new value, computed based on input values
• e.g. 1 + 2 + t1.value

Query Plan

[Diagram] The query above becomes the following plan: Scan(t1) and Scan(t2) feed a Join, followed by a Filter [t1.id = t2.id AND t2.id > 50 * 1000], a Project [t1.id, 1 + 2 + t1.value AS v], and an Aggregate [sum(v)].

Logical Plan

• A logical plan describes computation on datasets without defining how to conduct the computation

Physical Plan

• A physical plan describes computation on datasets with specific definitions on how to conduct the computation

[Diagram] The corresponding physical plan: Parquet Scan(t1) and JSON Scan(t2) feed a Sort-Merge Join, followed by a Filter [t1.id = t2.id AND t2.id > 50 * 1000], a Project [t1.id, 1 + 2 + t1.value AS v], and a Hash-Aggregate [sum(v)].
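A quick way to look at these trees for the running query is Spark's own explain output (a sketch, assuming the t1 and t2 DataFrames from earlier have been registered as temporary views):

t1.createOrReplaceTempView("t1")
t2.createOrReplaceTempView("t2")

val df = spark.sql("""
  SELECT sum(v)
  FROM (SELECT t1.id, 1 + 2 + t1.value AS v
        FROM t1 JOIN t2
        WHERE t1.id = t2.id AND t2.id > 50 * 1000) tmp
""")

df.explain(true)                           // parsed, analyzed, optimized and physical plans
println(df.queryExecution.optimizedPlan)   // just the optimized logical plan (a tree)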

Transform

• A function associated with every tree, used to implement a single rule

[Diagram] The expression 1 + 2 + t1.value is the tree Add(Add(Literal(1), Literal(2)), Attribute(t1.value)); evaluated as-is, it recomputes 1 + 2 for every row. After the transform it becomes Add(Literal(3), Attribute(t1.value)), i.e. 3 + t1.value, so 1 + 2 is evaluated only once.

Transform

• A transform is defined as a partial function
• Partial function: a function that is defined for only a subset of its possible arguments (see the example below)

val expression: Expression = ...
expression.transform {
  case Add(Literal(x, IntegerType), Literal(y, IntegerType)) =>
    Literal(x + y)
}

The case statement determines whether the partial function is defined for a given input.
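As a standalone illustration of the partial-function idea (plain Scala, independent of Catalyst): the function below is only defined for even numbers, just as the rule above is only defined for an Add of two integer literals.

val halve: PartialFunction[Int, Int] = {
  case n if n % 2 == 0 => n / 2
}

halve.isDefinedAt(4)   // true:  the "rule" applies
halve.isDefinedAt(3)   // false: the rule simply does not fire for this input
halve(10)              // 5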

Applying the rule: the transform walks the tree for 1 + 2 + t1.value, the case clause matches the inner Add(Literal(1), Literal(2)) node, and that node is replaced by Literal(3), producing the tree for 3 + t1.value.
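Here is a hedged, end-to-end sketch of that rewrite using Catalyst's internal expression classes; these are not a stable public API, and the two-argument Add shape shown here matches the Spark 2.x era of the talk.

import org.apache.spark.sql.catalyst.expressions.{Add, AttributeReference, Literal}
import org.apache.spark.sql.types.IntegerType

val t1Value = AttributeReference("value", IntegerType)()   // stands in for t1.value
val before  = Add(Add(Literal(1), Literal(2)), t1Value)    // 1 + 2 + t1.value

val after = before.transform {
  case Add(Literal(x: Int, IntegerType), Literal(y: Int, IntegerType)) =>
    Literal(x + y)
}

println(after)   // the folded tree: 3 + t1.value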

Combining Multiple Rules

Predicate Pushdown
[Diagram] The filter condition t2.id > 50 * 1000 is pushed down below the join onto the t2 side, leaving t1.id = t2.id as the condition evaluated with the join; the Project [t1.id, 1 + 2 + t1.value AS v] and Aggregate [sum(v)] above are unchanged.

Constant Folding
[Diagram] Constant sub-expressions are evaluated at optimization time: 1 + 2 + t1.value becomes 3 + t1.value in the Project, and 50 * 1000 becomes 50000 in the pushed-down filter on t2.

Column Pruning
[Diagram] Project operators are inserted directly above the scans so that only the columns actually needed are read: t1.id and t1.value from t1, and t2.id from t2.

Before and after the transformations
[Diagram] Before: Scan(t1) and Scan(t2) feed a Join, then a Filter [t1.id = t2.id AND t2.id > 50 * 1000], a Project [t1.id, 1 + 2 + t1.value AS v], and an Aggregate [sum(v)]. After: pruning Projects on top of each scan (t1.id, t1.value and t2.id), the filter t2.id > 50000 on the t2 side, a Join on t1.id = t2.id, a Project [t1.id, 3 + t1.value AS v], and an Aggregate [sum(v)].
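Catalyst applies such rules in batches until the plan stops changing. The following is a simplified, hypothetical sketch of that idea over expressions (Catalyst's real RuleExecutor is more general, with named batches and Once/FixedPoint strategies); the rule names in the final comment are invented.

import org.apache.spark.sql.catalyst.expressions.Expression

type Rule = PartialFunction[Expression, Expression]

// Apply every rule in turn, and repeat the whole batch until a fixed point
// (or a safety cap on the number of iterations).
def toFixedPoint(start: Expression, rules: Seq[Rule], maxIterations: Int = 100): Expression = {
  var current   = start
  var iteration = 0
  var changed   = true
  while (changed && iteration < maxIterations) {
    val next = rules.foldLeft(current)((tree, rule) => tree.transform(rule))
    changed = next != current
    current = next
    iteration += 1
  }
  current
}

// e.g. toFixedPoint(expr, Seq(constantFolding, simplifyCasts)) where each rule is a
// partial function like the Add/Literal case shown earlier.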

Spark SQL Overview: Tungsten

[Diagram] Back to the overview: Catalyst has produced an Optimized Query Plan; Tungsten is the layer that turns that plan into RDDs, which is what the rest of the talk covers.

Volcano Iterator Model

• Standard for 30 years: almost all databases do it
• Each operator is an "iterator" that consumes records from its input operator

Example: select count(*) from store_sales where ss_item_sk = 1000, executed as Scan, Filter, Project, Aggregate, each operator pulling rows from the one below it.

class Filter(child: Operator, predicate: Row => Boolean)
  extends Operator {
  // Keep pulling rows from the child until one satisfies the predicate,
  // or the input is exhausted (null).
  def next(): Row = {
    var current = child.next()
    while (current != null && !predicate(current)) {
      current = child.next()
    }
    current
  }
}

G. Graefe, "Volcano - An Extensible and Parallel Query Evaluation System", IEEE Transactions on Knowledge and Data Engineering, 1994.
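To see the pull-based pipeline end to end, here is a self-contained toy sketch of the count(*) query above; Row, Operator, Scan and CountAggregate are stand-ins invented for illustration (not Spark classes), and Filter repeats the operator shown above so the snippet compiles on its own.

type Row = Array[Any]

trait Operator {
  def next(): Row                     // returns null when the input is exhausted
}

class Scan(rows: Iterator[Row]) extends Operator {
  def next(): Row = if (rows.hasNext) rows.next() else null
}

class Filter(child: Operator, predicate: Row => Boolean) extends Operator {
  def next(): Row = {
    var current = child.next()
    while (current != null && !predicate(current)) {
      current = child.next()
    }
    current
  }
}

class CountAggregate(child: Operator) extends Operator {
  def next(): Row = {
    var count = 0L
    var current = child.next()        // one virtual next() call per consumed row
    while (current != null) {
      count += 1
      current = child.next()
    }
    Array[Any](count)
  }
}

// select count(*) from store_sales where ss_item_sk = 1000
val storeSales = Iterator[Row](Array[Any](1000), Array[Any](42), Array[Any](1000))
val plan = new CountAggregate(new Filter(new Scan(storeSales), row => row(0) == 1000))
println(plan.next()(0))   // 2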

Downside of the Volcano Model

1. Too many virtual function calls
   o at least 3 calls for each row in Aggregate
2. Extensive memory access
   o a "row" is a small segment in memory (or in the L1/L2/L3 cache)
3. Can't take advantage of modern CPU features
   o SIMD, pipelining, prefetching, branch prediction, ILP, instruction cache, ...

Whole-stage Codegen: Spark as a "Compiler"

Instead of running the Scan, Filter, Project, Aggregate pipeline operator by operator, the generated code for the query collapses it into a single hand-written-style loop:

long count = 0;
for (ss_item_sk in store_sales) {
  if (ss_item_sk == 1000) {
    count += 1;
  }
}

Whole-stage Codegen

• Fuse operators together so the generated code looks like hand-optimized code:
  - Identify chains of operators ("stages")
  - Compile each stage into a single function
  - The functionality of a general-purpose execution engine, with performance as if the system had been hand-built just to run your query

T. Neumann, "Efficiently Compiling Efficient Query Plans for Modern Hardware", VLDB 2011.
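A hedged sketch of how to see this in practice, assuming the SparkSession from earlier (Spark 2.0+): fused operators appear inside a WholeStageCodegen node (marked with '*' in the plan output), and a debug helper can dump the generated Java source.

import org.apache.spark.sql.functions._

val q = spark.range(0, 1000000)
  .where(col("id") === 1000)
  .groupBy()
  .count()

q.explain()   // fused operators are shown with a '*' / under WholeStageCodegen

// Dump the generated code for each fused stage.
import org.apache.spark.sql.execution.debug._
q.debugCodegen()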

Putting it All Together

Operator Benchmarks: Cost/Row (ns)
[Charts] Cost per row in nanoseconds for a set of operators: 5-30x speedups; radix sort: 10-100x speedups; shuffling is still the bottleneck; a 10x speedup on one further operator.

[Chart] TPC-DS (Scale Factor 1500, 100 cores): query time by query number, Spark 2.0 vs. Spark 1.6. Lower is better.

What’s Next?

Spark 2.2 and beyond

1. SPARK-16026: Cost Based Optimizer- Leverage table/column level statistics to optimize joins and aggregates- Statistics Collection Framework (Spark 2.1)- Cost Based Optimizer (Spark 2.2)

2. Boosting Spark’s Performance on Many-Core Machines- In-memory/ single node shuffle

3. Improving quality of generated code and better integration with the in-memory column format in Spark
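A hedged sketch of the statistics collection the cost-based optimizer builds on, assuming the SparkSession spark from earlier and that t1 is a persisted table (ANALYZE TABLE does not apply to temporary views); column-level statistics and the spark.sql.cbo.enabled flag correspond to the Spark 2.1/2.2 work listed above.

// Collect table-level and column-level statistics for the optimizer.
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id, value")

// Turn on the cost-based optimizer (Spark 2.2+).
spark.conf.set("spark.sql.cbo.enabled", "true")

// Inspect the collected statistics.
spark.sql("DESCRIBE EXTENDED t1").show(truncate = false)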

Thank you.