Functional Query Optimization with SQL
Transcript (62 slides)

Page 1:

Michael Armbrust @michaelarmbrust

spark.apache.org

Functional Query Optimization with SQL

Page 2:

What is Apache Spark?

Fast and general cluster computing system, interoperable with Hadoop.

Improves efficiency through:
» In-memory computing primitives
» General computation graphs

Improves usability through:
» Rich APIs in Scala, Java, Python
» Interactive shell

Up to 100× faster (2-10× on disk)

2-5× less code

Page 3:

A General Stack

[Diagram: the Spark core engine with libraries built on top: Spark Streaming (real-time), Spark SQL, GraphX (graph), MLlib (machine learning), and more.]

Page 4:

Spark Model

Write programs in terms of transformations on distributed datasets.

Resilient Distributed Datasets (RDDs):
» Collections of objects that can be stored in memory or on disk across a cluster
» Parallel functional transformations (map, filter, …)
» Automatically rebuilt on failure
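As an illustration of this model (not from the slides), here is a minimal word-count sketch; the app name, local master, and input path are hypothetical placeholders:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Hypothetical local setup for experimentation.
    val sc = new SparkContext(new SparkConf().setAppName("wordcount").setMaster("local[*]"))

    val counts = sc.textFile("input.txt")        // base RDD (hypothetical path)
      .flatMap(_.split("\\s+"))                  // parallel functional transformations
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)             // action: triggers distributed execution
    println(counts.toDebugString)                // lineage used to rebuild lost partitions
    sc.stop()
  }
}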

Page 5:

More than Map/Reduce

map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...

Page 6:

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns.

val lines = spark.textFile("hdfs://...")           // base RDD
val errors = lines.filter(_ startsWith "ERROR")    // transformed RDD
val messages = errors.map(_.split("\t")(2))
messages.cache()

messages.filter(_ contains "foo").count()          // action
messages.filter(_ contains "bar").count()
. . .

[Diagram: the driver ships tasks to three workers, each holding a block of lines; after the first action each worker caches its messages partition (Cache 1-3) and returns results to the driver.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

Page 7-8:

Fault Tolerance

file.map(record => (record.tpe, 1))
  .reduceByKey(_ + _)
  .filter { case (_, count) => count > 10 }

[Diagram: an input file flowing through map, reduce, and filter stages.]

RDDs track lineage info to rebuild lost data.

Page 9:

Spark and Scala

Scala provides:
•  Concise, Serializable* functions
•  Easy interoperability with the Hadoop ecosystem
•  Interactive REPL

* Made even better by Spores (5pm today)

[Pie chart: lines of code by language: Scala, Python, Java, Shell, Other.]

Page 10-13:

Reduced Developer Complexity

[Bar chart: non-test, non-example source lines for Hadoop MapReduce, Storm (Streaming), Impala (SQL), Giraph (Graph), and Spark; successive slides add Streaming, SparkSQL, and GraphX on top of the Spark bar.]

Page 14:

Spark Community

One of the largest open source projects in big data:
•  150+ developers contributing
•  30+ companies contributing

[Bar chart: contributors in the past year, Spark vs. Giraph, Storm, and Tez.]

Page 15:

Community Growth

Spark 0.6 (Oct '12): 17 contributors
Spark 0.7 (Feb '13): 31 contributors
Spark 0.8 (Sept '13): 67 contributors
Spark 0.9 (Feb '14): 83 contributors
Spark 1.0 (May '14): 110 contributors

Page 16:

With great power…

Strict project coding guidelines to make it easier for non-Scala users and contributors:
•  Absolute imports only
•  Minimize infix function use
•  Java/Python friendly wrappers for user APIs
•  …

Page 17:

SQL

Page 18:

Relationship to Shark

Shark modified the Hive backend to run over Spark, but had two challenges:
» Limited integration with Spark programs
» Hive optimizer not designed for Spark

Spark SQL reuses the best parts of Shark:

Borrows:
•  Hive data loading
•  In-memory column store

Adds:
•  RDD-aware optimizer
•  Rich language interfaces

Page 19:

Spark SQL Components

Catalyst Optimizer
•  Relational algebra + expressions
•  Query optimization

Spark SQL Core
•  Execution of queries as RDDs
•  Reading in Parquet, JSON …

Hive Support
•  HQL, MetaStore, SerDes, UDFs

[Chart: relative code size of the three components, roughly 26% / 36% / 38%.]

Page 20:

Adding Schema to RDDs

Spark + RDDs: functional transformations on partitioned collections of opaque objects ("User" objects).

SQL + SchemaRDDs: declarative transformations on partitioned collections of tuples with a known schema (Name, Age, Height).

Page 21:

Using Spark SQL

SQLContext
•  Entry point for all SQL functionality
•  Wraps/extends an existing SparkContext

val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Importing the SQL context gives access to all the SQL
// functions and conversions.
import sqlContext._

Page 22:

Example Dataset

A text file filled with people’s names and ages:

Michael, 30
Andy, 31
Justin Bieber, 19
…

Page 23:

Turning an RDD into a Relation

// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people =
  sc.textFile("examples/src/main/resources/people.txt")
    .map(_.split(","))
    .map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")

Page 24:

Querying Using SQL

// SQL statements are run with the sql method from sqlContext.
val teenagers = sql("""
  SELECT name FROM people WHERE age >= 13 AND age <= 19""")

// The results of SQL queries are SchemaRDDs but also
// support normal RDD operations.
// The columns of a row in the result are accessed by ordinal.
val nameList = teenagers.map(t => "Name: " + t(0)).collect()

Page 25:

Querying Using the Scala DSL

Express queries using functions, instead of SQL strings.

// The following is the same as:
//   SELECT name FROM people
//   WHERE age >= 10 AND age <= 19
val teenagers =
  people
    .where('age >= 10)
    .where('age <= 19)
    .select('name)

Page 26:

Caching Tables In-Memory

Spark SQL can cache tables using an in-memory columnar format:
•  Scan only required columns
•  Fewer allocated objects (less GC)
•  Automatically selects best compression

cacheTable("people")
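For context (not on the slide), a minimal usage sketch, assuming the sqlContext and the registered "people" table from the earlier slides:

// Cache the registered table in the in-memory columnar store, then query it.
sqlContext.cacheTable("people")

val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.collect().foreach(println)    // subsequent scans read only the cached name and age columns

// sqlContext.uncacheTable("people") // release the cached columns when done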

Page 27:

Parquet Compatibility

Native support for reading data in Parquet:
•  Columnar storage avoids reading unneeded data.
•  RDDs can be written to Parquet files, preserving the schema.

Page 28:

Using Parquet

// Any SchemaRDD can be stored as Parquet.
people.saveAsParquetFile("people.parquet")

// Parquet files are self-describing so the schema is preserved.
val parquetFile = sqlContext.parquetFile("people.parquet")

// Parquet files can also be registered as tables and then used
// in SQL statements.
parquetFile.registerAsTable("parquetFile")
val teenagers = sql(
  "SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")

Page 29:

Hive Compatibility

Interfaces to access data and code in the Hive ecosystem:
o  Support for writing queries in HQL
o  Catalog info from Hive MetaStore
o  Tablescan operator that uses Hive SerDes
o  Wrappers for Hive UDFs, UDAFs, UDTFs

Page 30:

Reading Data Stored In Hive

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._

hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hql("LOAD DATA LOCAL INPATH '.../kv1.txt' INTO TABLE src")

// Queries can be expressed in HiveQL.
hql("FROM src SELECT key, value")

Page 31:

SQL and Machine Learning

val trainingDataTable = sql("""
  SELECT e.action, u.age, u.latitude, u.longitude
    FROM Users  u
    JOIN Events e ON u.userId = e.userId""")

// SQL results are RDDs so can be used directly in MLlib.
val trainingData = trainingDataTable.map { row =>
  val features = Array[Double](row(1), row(2), row(3))
  LabeledPoint(row(0), features)
}
val model = new LogisticRegressionWithSGD().run(trainingData)

Page 32:

Supports Java Too!

public class Person implements Serializable {
  private String _name;
  private int _age;
  public String getName() { return _name; }
  public void setName(String name) { _name = name; }
  public int getAge() { return _age; }
  public void setAge(int age) { _age = age; }
}

JavaSQLContext sqlCtx = new org.apache.spark.sql.api.java.JavaSQLContext(sc);
JavaRDD<Person> people = sc.textFile("examples/src/main/resources/people.txt").map(
  new Function<String, Person>() {
    public Person call(String line) throws Exception {
      String[] parts = line.split(",");
      Person person = new Person();
      person.setName(parts[0]);
      person.setAge(Integer.parseInt(parts[1].trim()));
      return person;
    }
  });
JavaSchemaRDD schemaPeople = sqlCtx.applySchema(people, Person.class);

Page 33:

Supports Python Too!

from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)

lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: {"name": p[0], "age": int(p[1])})

peopleTable = sqlCtx.applySchema(people)
peopleTable.registerAsTable("people")

teenagers = sqlCtx.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenNames = teenagers.map(lambda p: "Name: " + p.name)

Page 34:

Optimizing Queries with Catalyst

Page 35:

What is Query Optimization?

SQL is a declarative language: queries express what data to retrieve, not how to retrieve it.

The database must pick the ‘best’ execution strategy through a process known as optimization.

Page 36:

Naïve Query Planning

SELECT name
FROM (
    SELECT id, name
    FROM People) p
WHERE p.id = 1

LogicalPlan (top to bottom):  Project(name) → Filter(id = 1) → Project(id, name) → People
PhysicalPlan (top to bottom): Project(name) → Filter(id = 1) → Project(id, name) → TableScan(People)

Page 37:

Optimized Execution

Writing imperative code to optimize all possible patterns is hard.

Instead, write simple rules:
•  Each rule makes one change
•  Run many rules together to a fixed point

LogicalPlan:  Project(name) → Filter(id = 1) → Project(id, name) → People
PhysicalPlan: IndexLookup(id = 1, return: name)
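To make "run many rules together to a fixed point" concrete, here is a minimal sketch (not Catalyst's actual code; the rule type is a plain function for simplicity):

// Apply a batch of rewrite rules repeatedly until the plan stops changing
// (or a safety limit is hit), i.e., until a fixed point is reached.
def toFixedPoint[Plan](plan: Plan, rules: Seq[Plan => Plan], maxIterations: Int = 100): Plan = {
  var current = plan
  var iteration = 0
  var changed = true
  while (changed && iteration < maxIterations) {
    val next = rules.foldLeft(current)((p, rule) => rule(p))
    changed = next != current   // each rule makes one small change; stop when none apply
    current = next
    iteration += 1
  }
  current
}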

Page 38:

Optimizing with Rules

Original plan:      Project(name) → Filter(id = 1) → Project(id, name) → People
Filter push-down:   Project(name) → Project(id, name) → Filter(id = 1) → People
Combine projection: Project(name) → Filter(id = 1) → People
Physical plan:      IndexLookup(id = 1, return: name)

Page 39:

Prior Work: Optimizer Generators

Volcano / Cascades:
•  Create a custom language for expressing rules that rewrite trees of relational operators.
•  Build a compiler that generates executable code for these rules.

Cons: developers need to learn this custom language, and the language might not be powerful enough.

Page 40:

TreeNode Library

Easily transformable trees of operators:
•  Standard collection functionality: foreach, map, collect, etc.
•  transform function: recursive modification of tree fragments that match a pattern.
•  Debugging support: pretty printing, splicing, etc.

Page 41:

Tree Transformations

Developers express tree transformations as PartialFunction[TreeType, TreeType]:
1.  If the function does apply to an operator, that operator is replaced with the result.
2.  When the function does not apply to an operator, that operator is left unchanged.
3.  The transformation is applied recursively to all children.
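As an illustration only (these are hypothetical toy classes, not Catalyst's TreeNode), a minimal sketch of a transform with exactly these three behaviors:

// A toy operator tree and a Catalyst-style transform that applies a
// PartialFunction where it is defined and recurses into all children.
sealed trait Node
case class Scan(table: String) extends Node
case class Filter(condition: String, child: Node) extends Node
case class Project(columns: Seq[String], child: Node) extends Node

def transform(node: Node)(rule: PartialFunction[Node, Node]): Node = {
  // 1 / 2: replace the operator if the rule applies, otherwise leave it unchanged.
  val rewritten = rule.applyOrElse(node, identity[Node])
  // 3: apply the transformation recursively to all children.
  rewritten match {
    case s: Scan              => s
    case Filter(cond, child)  => Filter(cond, transform(child)(rule))
    case Project(cols, child) => Project(cols, transform(child)(rule))
  }
}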

Page 42:

Writing Rules as Tree Transformations

1.  Find filters on top of projections.
2.  Check that the filter can be evaluated without the result of the project.
3.  If so, switch the operators.

Original plan:    Project(name) → Filter(id = 1) → Project(id, name) → People
Filter push-down: Project(name) → Project(id, name) → Filter(id = 1) → People

Page 43:

Filter Push Down Transformation

val newPlan = queryPlan transform {
  case f @ Filter(_, p @ Project(_, grandChild))
      if f.references subsetOf grandChild.output =>
    p.copy(child = f.copy(child = grandChild))
}
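Reusing the hypothetical toy classes from the sketch after Page 41 (so not Catalyst's real Filter/Project, and omitting the references-subset check, which plain string conditions cannot express), the same push-down idea looks like:

// Original: Project(name) <- Filter(id = 1) <- Project(id, name) <- Scan(People)
val plan: Node =
  Project(Seq("name"),
    Filter("id = 1",
      Project(Seq("id", "name"), Scan("People"))))

// Swap any Filter sitting directly on top of a Project.
val pushedDown = transform(plan) {
  case Filter(cond, Project(cols, grandChild)) =>
    Project(cols, Filter(cond, grandChild))
}
// Result: Project(name) <- Project(id, name) <- Filter(id = 1) <- Scan(People)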

Page 44-50:

The same transformation, annotated piece by piece:
•  queryPlan is the tree; the block passed to transform is the partial function.
•  case f @ Filter(_, p @ Project(_, grandChild)): find a Filter on top of a Project (pattern matching).
•  if f.references subsetOf grandChild.output: check that the filter can be evaluated without the result of the project (collections library).
•  p.copy(child = f.copy(child = grandChild)): if so, switch the order (copy constructors).

Page 51:

Efficient Expression Evaluation

Interpreting expressions (e.g., ‘a + b’) can be very expensive on the JVM:
•  Virtual function calls
•  Branches based on expression type
•  Object creation due to primitive boxing
•  Memory consumption by boxed primitive objects

Page 52:

Interpreting “a + b”

Expression tree: Add(Attribute(a), Attribute(b))

1.  Virtual call to Add.eval()
2.  Virtual call to a.eval()
3.  Return boxed Int
4.  Virtual call to b.eval()
5.  Return boxed Int
6.  Integer addition
7.  Return boxed result
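For illustration (hypothetical toy classes, not Catalyst's Expression hierarchy), an interpreted evaluator with exactly these costs might look like:

// Every eval() is a virtual call, and every Int result comes back boxed as Any.
sealed trait Expression { def eval(inputRow: Array[Any]): Any }

case class Attribute(ordinal: Int) extends Expression {
  def eval(inputRow: Array[Any]): Any = inputRow(ordinal)            // boxed Int
}

case class Add(left: Expression, right: Expression) extends Expression {
  def eval(inputRow: Array[Any]): Any =
    left.eval(inputRow).asInstanceOf[Int] +                          // unbox
      right.eval(inputRow).asInstanceOf[Int]                         // unbox, add, re-box
}

val aPlusB = Add(Attribute(0), Attribute(1))
println(aPlusB.eval(Array[Any](1, 2)))                               // 3, via virtual calls and boxing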

Page 53:

Using Runtime Reflection

def generateCode(e: Expression): Tree = e match {
  case Attribute(ordinal) =>
    q"inputRow.getInt($ordinal)"
  case Add(left, right) =>
    q"""
      {
        val leftResult = ${generateCode(left)}
        val rightResult = ${generateCode(right)}
        leftResult + rightResult
      }
    """
}
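To show the end-to-end idea, here is a hedged, self-contained sketch (not Catalyst's implementation); it assumes Scala 2.x with scala-reflect and scala-compiler on the classpath, and uses a simplified Array[Int] row instead of Catalyst's row classes:

import scala.reflect.runtime.currentMirror
import scala.reflect.runtime.universe._
import scala.tools.reflect.ToolBox

sealed trait Expression
case class Attribute(ordinal: Int) extends Expression
case class Add(left: Expression, right: Expression) extends Expression

// Generate a Scala AST for the expression using quasiquotes.
def generateCode(e: Expression): Tree = e match {
  case Attribute(ordinal) => q"inputRow($ordinal)"
  case Add(left, right)   => q"${generateCode(left)} + ${generateCode(right)}"
}

// Compile the generated tree at runtime into a plain function over the row.
val toolbox = currentMirror.mkToolBox()
val tree = q"(inputRow: Array[Int]) => ${generateCode(Add(Attribute(0), Attribute(1)))}"
val compiled = toolbox.eval(tree).asInstanceOf[Array[Int] => Int]

println(compiled(Array(1, 2)))  // 3, with no per-row virtual dispatch or boxing of the operands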

Page 54:

Executing “a + b”

val left: Int = inputRow.getInt(0)
val right: Int = inputRow.getInt(1)
val result: Int = left + right
resultRow.setInt(0, result)

•  Fewer function calls
•  No boxing of primitives

Page 55:

Performance Comparison

[Bar chart: milliseconds to evaluate 'a+a+a' one billion times, comparing Interpreted Evaluation, Hand-written Code, and Generated with Scala Reflection.]

Page 56:

TPC-DS Results

[Bar chart: runtime in seconds for Query 19, Query 53, Query 34, and Query 59, comparing Shark 0.9.2 with SparkSQL + codegen.]

Page 57:

Code Generation Made Simple

•  Other systems (Impala, Drill, etc.) do code generation.
•  Scala Reflection + Quasiquotes made our implementation an experiment done over a few weekends instead of a major system overhaul.

Initial version: ~1000 LOC

Page 58:

Future Work: Typesafe Results

Currently:
people.registerAsTable("people")
val results = sql("SELECT name FROM people")
results.map(r => r.getString(0))

What we want:
val results = sql"SELECT name FROM $people"
results.map(_.name)

Joint work with: Heather Miller, Vojin Jovanovic, Hubert Plociniczak

Page 59-60: (repeat of the previous slide)

Page 61:

Get Started

Visit spark.apache.org for videos & tutorials.
Download the Spark bundle for CDH.
Easy to run on just your laptop.

Free training talks and hands-on exercises: spark-summit.org

Page 62:

Conclusion

Big data analytics is evolving to include:
» More complex analytics (e.g. machine learning)
» More interactive ad-hoc queries, including SQL
» More real-time stream processing

Spark is a fast platform that unifies these apps.

Join us at Spark Summit 2014! June 30 - July 2, San Francisco