20140908 spark sql & catalyst

Introduction to Spark SQL & Catalyst Takuya UESHIN Spark Meetup 2014/09/08(Mon)

Description

Presentation material for Spark Meetup, 2014/09/08.

Transcript of 20140908 spark sql & catalyst

Page 1: 20140908 spark sql & catalyst

Introduction to Spark SQL & Catalyst

Takuya UESHIN

Spark Meetup 2014/09/08(Mon)

Page 2: 20140908 spark sql & catalyst

Who am I?

Takuya UESHIN

@ueshin

github.com/ueshin

Nautilus Technologies, Inc.

A Spark contributor


Page 3: 20140908 spark sql & catalyst

Agenda

What is Spark SQL?

Catalyst in depth

SQL core in depth

Interesting issues

How to contribute


Page 4: 20140908 spark sql & catalyst

What is Spark SQL?

Page 5: 20140908 spark sql & catalyst

What is Spark SQL?


Spark SQL is one of Spark's components.

Executes SQL on Spark

Builds SchemaRDDs with a LINQ-like interface

Optimizes the execution plan.

Page 6: 20140908 spark sql & catalyst

What is Spark SQL?

Catalyst provides an execution planning framework for relational operations.

Including:

SQL parser & analyzer

Logical operators & general expressions

Logical optimizer

A framework to transform operator trees.


Page 7: 20140908 spark sql & catalyst

What is Spark SQL?

Executing a query takes several steps.

Parse → Analyze → Logical Plan → Optimize → Physical Plan → Execute

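The pipeline above can be sketched as a chain of functions. This is a toy illustration only; all names and the trivial plan representation are hypothetical, not Spark's actual API:

```python
# Toy sketch of the query-execution pipeline:
# Parse -> Analyze -> Optimize -> Physical Plan -> Execute.

def parse(sql):
    # Produce a (toy) logical plan from a SQL string.
    return ("scan", sql.split("FROM")[1].strip())

def analyze(plan):
    # Resolve table/column references (a no-op in this toy).
    return plan

def optimize(plan):
    # Rewrite the logical plan into an equivalent, cheaper one (no-op here).
    return plan

def to_physical(plan):
    # Choose a physical operator for each logical operator.
    return ("table_scan",) + plan[1:]

def execute(plan, tables):
    # Run the physical plan against the data.
    op, name = plan
    return tables[name]

tables = {"t": [(1,), (2,)]}
result = execute(to_physical(optimize(analyze(parse("SELECT * FROM t")))), tables)
# result == [(1,), (2,)]
```

In Spark SQL each arrow is a real transformation between plan representations; the sketch only shows the shape of the flow.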

Page 8: 20140908 spark sql & catalyst

What is Spark SQL?

Executing a query takes several steps.

Parse → Analyze → Logical Plan → Optimize → Physical Plan → Execute


Catalyst covers the steps up to logical plan optimization.

Page 9: 20140908 spark sql & catalyst

What is Spark SQL?

Executing a query takes several steps.

Parse → Analyze → Logical Plan → Optimize → Physical Plan → Execute


SQL core covers physical planning and execution.

Page 10: 20140908 spark sql & catalyst

Catalyst in depth

Page 11: 20140908 spark sql & catalyst

Catalyst in depth

Provides an execution planning framework for relational operations.

Row & DataType’s

Trees & Rules

Logical Operators

Expressions

Optimizations


Page 12: 20140908 spark sql & catalyst

Row & DataType's

o.a.s.sql.catalyst.types.DataType

Long, Int, Short, Byte, Float, Double, Decimal

String, Binary, Boolean, Timestamp

Array, Map, Struct

o.a.s.sql.catalyst.expressions.Row

Represents a single row.

Can contain complex types.


Page 13: 20140908 spark sql & catalyst

Trees & Rules

o.a.s.sql.catalyst.trees.TreeNode

Provides transformations of trees.

foreach, map, flatMap, collect

transform, transformUp, transformDown

Used for operator tree, expression tree.


Page 14: 20140908 spark sql & catalyst

Trees & Rules

o.a.s.sql.catalyst.rules.Rule

Represents a tree transform rule.

o.a.s.sql.catalyst.rules.RuleExecutor

A framework to transform trees based on rules.

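The TreeNode/Rule mechanism can be sketched in a few lines. This toy mimics Catalyst's transformUp, where a rule is tried on every node bottom-up; the class and rule here are hypothetical, not Spark's implementation:

```python
# Minimal sketch of Catalyst-style tree transformation: transform_up
# rewrites children first, then offers each node to a rule.

class Node:
    def __init__(self, name, *children):
        self.name = name
        self.children = list(children)

    def transform_up(self, rule):
        # Transform children bottom-up, then try the rule on this node.
        node = Node(self.name, *[c.transform_up(rule) for c in self.children])
        return rule(node) or node

    def __repr__(self):
        if not self.children:
            return self.name
        return f"{self.name}({', '.join(map(repr, self.children))})"

# A "rule" returns a replacement node, or None to leave the node unchanged.
def collapse_double_not(node):
    if node.name == "Not" and node.children and node.children[0].name == "Not":
        return node.children[0].children[0]
    return None

tree = Node("Not", Node("Not", Node("x")))
simplified = tree.transform_up(collapse_double_not)
print(simplified)  # -> x
```

A RuleExecutor, in the same spirit, would simply apply a batch of such rules repeatedly until the tree stops changing.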

Page 15: 20140908 spark sql & catalyst

Logical Operators

Basic Operators

Project, Filter, …

Binary Operators

Join, Except, Intersect, Union, …

Aggregate

Generate, Distinct

Sort, Limit

InsertInto, WriteToFile


Example operator tree:

Project
└─ Filter
   └─ Join
      ├─ Table
      └─ Table

Page 16: 20140908 spark sql & catalyst

Expressions

Literal

Arithmetics

UnaryMinus, Sqrt, MaxOf

Add, Subtract, Multiply, …

Predicates

EqualTo, LessThan, LessThanOrEqual, GreaterThan, GreaterThanOrEqual

Not, And, Or, In, If, CaseWhen


Example expression tree: Add(1, 2)

Page 17: 20140908 spark sql & catalyst

Expressions

Cast

GetItem, GetField

Coalesce, IsNull, IsNotNull

StringOperations

Like, Upper, Lower, Contains, StartsWith, EndsWith, Substring, …


Page 18: 20140908 spark sql & catalyst

Optimizations

ConstantFolding

NullPropagation

ConstantFolding

BooleanSimplification

SimplifyFilters

FilterPushdown

CombineFilters

PushPredicateThroughProject

PushPredicateThroughJoin

ColumnPruning


Page 19: 20140908 spark sql & catalyst

Optimizations

NullPropagation, ConstantFolding

Replaces expressions that can be evaluated to a literal value with that value.

ex)

1 + null => null

1 + 2 => 3

Count(null) => 0

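The two rules can be sketched together on a tiny expression tree, with None playing the role of SQL NULL. A hypothetical illustration, not Catalyst code:

```python
# Sketch of NullPropagation + ConstantFolding over a toy expression tree,
# where an expression is either a literal or a tuple ("+", left, right).

def fold(expr):
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    left, right = fold(left), fold(right)
    if left is None or right is None:
        return None                 # null propagation: 1 + null => null
    if isinstance(left, int) and isinstance(right, int):
        return left + right         # constant folding: 1 + 2 => 3
    return (op, left, right)

print(fold(("+", 1, 2)))     # -> 3
print(fold(("+", 1, None)))  # -> None
```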

Page 20: 20140908 spark sql & catalyst

Optimizations

BooleanSimplification

Simplifies boolean expressions whose value can be statically determined.

ex)

false AND $right => false

true AND $right => $right

true OR $right => true

false OR $right => $right

If(true, $then, $else) => $then

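The short-circuit rules above can be sketched as one recursive rewrite. A hypothetical illustration, not Catalyst code:

```python
# Sketch of BooleanSimplification: collapse AND/OR when one side is a
# known boolean literal; expressions are literals or (op, left, right).

def simplify(expr):
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    left, right = simplify(left), simplify(right)
    if op == "and":
        if left is False:
            return False        # false AND $right => false
        if left is True:
            return right        # true  AND $right => $right
    if op == "or":
        if left is True:
            return True         # true  OR $right => true
        if left is False:
            return right        # false OR $right => $right
    return (op, left, right)

print(simplify(("and", False, "right")))  # -> False
print(simplify(("or", False, "right")))   # -> right
```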

Page 21: 20140908 spark sql & catalyst

Optimizations

SimplifyFilters

Removes filters that can be evaluated trivially.

ex)

Filter(true, child) => child

Filter(false, child) => empty


Page 22: 20140908 spark sql & catalyst

Optimizations

CombineFilters

Merges two filters.

ex)

Filter($fc, Filter($nc, child)) => Filter(AND($fc, $nc), child)


Page 23: 20140908 spark sql & catalyst

Optimizations

PushPredicateThroughProject

Pushes Filter operators through Project operators.

ex)

Filter(‘i === 1, Project(‘i, ‘j, child)) => Project(‘i, ‘j, Filter(‘i === 1, child))

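The rewrite in the example can be sketched on toy plan tuples. The plan encoding and condition format here are invented for illustration, not Catalyst's:

```python
# Sketch of PushPredicateThroughProject: a Filter that only references
# columns the Project keeps can be moved below the Project, so it runs
# earlier on the child's rows.

def push_filter_through_project(plan):
    # plan is ("Filter", cond, ("Project", cols, child));
    # cond is a toy (column, value) equality predicate.
    if plan[0] == "Filter" and plan[2][0] == "Project":
        _, cond, (_, cols, child) = plan
        if cond[0] in cols:
            # Filter('i === 1, Project('i, 'j, child))
            #   => Project('i, 'j, Filter('i === 1, child))
            return ("Project", cols, ("Filter", cond, child))
    return plan

plan = ("Filter", ("i", 1), ("Project", ["i", "j"], "table"))
print(push_filter_through_project(plan))
# -> ('Project', ['i', 'j'], ('Filter', ('i', 1), 'table'))
```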

Page 24: 20140908 spark sql & catalyst

Optimizations

PushPredicateThroughJoin

Pushes Filter operators through Join operators.

ex)

Filter(“left.i”.attr === 1, Join(left, right)) => Join(Filter(‘i === 1, left), right)


Page 25: 20140908 spark sql & catalyst

Optimizations

ColumnPruning

Eliminates the reading of unused columns.

ex)

Join(left, right, LeftSemi, “left.id”.attr === “right.id”.attr) => Join(left, Project(‘id, right), LeftSemi)

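The effect on the data side can be sketched directly: only the columns a query uses are kept. A hypothetical helper, not Catalyst code:

```python
# Sketch of ColumnPruning's payoff on row data: project each row down
# to the columns the rest of the plan actually needs.

def prune_columns(rows, needed):
    # rows are dicts; keep only the needed columns of each row.
    return [{col: row[col] for col in needed} for row in rows]

rows = [{"id": 1, "name": "a", "age": 10},
        {"id": 2, "name": "b", "age": 20}]
print(prune_columns(rows, ["id"]))  # -> [{'id': 1}, {'id': 2}]
```

In the slide's example the pruning is expressed in the plan instead: a Project('id, right) is inserted under the Join so the right side never materializes its other columns.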

Page 26: 20140908 spark sql & catalyst

SQL core in depth

Page 27: 20140908 spark sql & catalyst

SQL core in depth

Provides:

Physical operators to build RDD

Conversion from an existing RDD of Products to a SchemaRDD

Parquet file read/write support

JSON file read support

Columnar in-memory table support


Page 28: 20140908 spark sql & catalyst

SQL core in depth

o.a.s.sql.SchemaRDD

Extends RDD[Row].

Has a logical plan tree.

Provides LINQ-like interfaces to construct logical plans.

select, where, join, orderBy, …

Executes the plan.

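The LINQ-like style can be sketched with a toy builder: each call wraps the current plan in another logical operator, and nothing runs until the plan is executed. ToyRDD and its plan encoding are invented for illustration; they are not Spark's API:

```python
# Sketch of SchemaRDD-style plan building: where/select return a new
# object whose plan tree has grown by one operator.

class ToyRDD:
    def __init__(self, plan):
        self.plan = plan  # logical plan as nested tuples

    def where(self, cond):
        return ToyRDD(("Filter", cond, self.plan))

    def select(self, *cols):
        return ToyRDD(("Project", cols, self.plan))

query = ToyRDD(("Table", "people")).where("age > 20").select("name")
print(query.plan)
# -> ('Project', ('name',), ('Filter', 'age > 20', ('Table', 'people')))
```

The resulting tree is exactly what Catalyst's analyzer and optimizer then operate on.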

Page 29: 20140908 spark sql & catalyst

SQL core in depth

o.a.s.sql.execution.SparkStrategies

Converts a logical plan to a physical plan.

Some rules are based on statistics of the operators.


Page 30: 20140908 spark sql & catalyst

SQL core in depth

Parquet read/write support

Columnar storage format for Hadoop

Reads existing Parquet files.

Converts Parquet schema to row schema.

Writes new Parquet files.

Currently DecimalType and TimestampType are not supported.


Page 31: 20140908 spark sql & catalyst

SQL core in depth

JSON read support

Loads a JSON file (one object per line)

Infers row schema from the entire dataset.

Specifying the schema explicitly is experimental.

Inferring the schema by sampling is also experimental.

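Scanning every record and merging the observed field types gives the flavor of the inference step. A simplified sketch using only the standard json module; the type names and merge strategy are an assumption, not Spark's:

```python
# Sketch of JSON schema inference over "one object per line" input:
# visit every record and accumulate a field -> type mapping.

import json

def infer_schema(lines):
    schema = {}
    for line in lines:
        for field, value in json.loads(line).items():
            schema[field] = type(value).__name__
    return schema

lines = ['{"id": 1, "name": "a"}', '{"id": 2, "active": true}']
print(infer_schema(lines))
# -> {'id': 'int', 'name': 'str', 'active': 'bool'}
```

Real inference must also reconcile conflicting types across records; this sketch simply keeps the last one seen.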

Page 32: 20140908 spark sql & catalyst

SQL core in depth

Columnar in-memory table support

Caches tables like RDD.cache, but in a columnar format.

Can prune unnecessary columns when reading data.

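Why columnar caching enables column pruning can be shown in miniature: store each column as its own list, and a read touches only the columns it asks for. A hypothetical sketch, not Spark's columnar format:

```python
# Sketch of columnar in-memory caching: pivot row tuples into one list
# per column so a scan can skip unneeded columns entirely.

def to_columnar(rows, columns):
    return {col: [row[i] for row in rows] for i, col in enumerate(columns)}

rows = [(1, "a"), (2, "b"), (3, "c")]
cached = to_columnar(rows, ["id", "name"])
print(cached["id"])  # only this column is touched -> [1, 2, 3]
```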

Page 33: 20140908 spark sql & catalyst

Interesting issues

Page 34: 20140908 spark sql & catalyst

Interesting issues

Support the GroupingSet/ROLLUP/CUBE

https://issues.apache.org/jira/browse/SPARK-2663

Use statistics to skip partitions when reading from in-memory columnar data

https://issues.apache.org/jira/browse/SPARK-2961


Page 35: 20140908 spark sql & catalyst

Interesting issues

Pluggable interface for shuffles

https://issues.apache.org/jira/browse/SPARK-2044

Sort-merge join

https://issues.apache.org/jira/browse/SPARK-2213

Cost-based join reordering

https://issues.apache.org/jira/browse/SPARK-2216


Page 36: 20140908 spark sql & catalyst

How to contribute

Page 37: 20140908 spark sql & catalyst

How to contribute

See: Contributing to Spark

Open an issue on JIRA

Send a pull request on GitHub

Communicate with committers and reviewers

Congratulations!


Page 38: 20140908 spark sql & catalyst

Conclusion

Introduced Spark SQL & Catalyst

Now you know them very well!

And you know how to contribute.


Let’s contribute to Spark & Spark SQL!!


Page 39: 20140908 spark sql & catalyst

An addition

Page 40: 20140908 spark sql & catalyst

What are we doing?

Page 41: 20140908 spark sql & catalyst

What are we doing?

Executing a query takes several steps.

Parse → Analyze → Logical Plan → Optimize → Physical Plan → Execute

Page 42: 20140908 spark sql & catalyst

What are we doing?

Executing a query takes several steps.

Parse → Analyze → Logical Plan → Optimize → Physical Plan → Execute

Page 43: 20140908 spark sql & catalyst

What are we doing?

Executing a query takes several steps.

Parse → Analyze → Logical Plan → Optimize → Physical Plan → Execute

(Diagram: a DSL / ASG carrying business logic, plugged into the planning pipeline)

Page 44: 20140908 spark sql & catalyst

Thanks!