    Chris Fregly, Principal Data Solutions Engineer, IBM Spark Technology Center

    Oct 6, 2015

    Power of data. Simplicity of design. Speed of innovation.

  Who am I?

    Streaming Data Engineer
    Netflix Open Source Committer

    Data Solutions Engineer

    Apache Contributor

    Principal Data Solutions Engineer
    IBM Technology Center

  • IBM |

    Last Meetup (Spark Wins 100 TB Daytona GraySort)
    On-disk only, in-memory caching disabled


    Upcoming Advanced Apache Spark Meetups

    Project Tungsten Data Structs/Algos for CPU/Memory Optimization
    Nov 12th, 2015

    Text-based Advanced Analytics and Machine Learning
    Jan 14th, 2016

    ElasticSearch-Spark Connector w/ Costin Leau
    Feb 16th, 2016

    Spark Internals Deep Dive
    Mar 24th, 2016

    Spark SQL Catalyst Optimizer Deep Dive
    Apr 21st, 2016


    Recent Events

    Cassandra Summit 2015
    Real-time Advanced Analytics w/ Spark & Cassandra

    Strata NYC 2015
    Practical Data Science w/ Spark: Recommender Systems

    Available on Slideshare



  • Spark SQL + DataFrames

    Catalyst + Data Sources API


    Topics of this Talk

    DataFrames
    Catalyst Optimizer and Query Plans
    Data Sources API
    Creating and Contributing Custom Data Source
    Partitions, Pruning, Pushdowns
    Native + Third-Party Data Source Impls
    Spark SQL Performance Tuning


    DataFrames

    Inspired by R and Pandas DataFrames
    Cross language support
        SQL, Python, Scala, Java, R
    Levels performance of Python, Scala, Java, and R
        Generates JVM bytecode vs serialize/pickle objects to Python
    DataFrame is Container for Logical Plan
        Transformations are lazy and represented as a tree
        Catalyst Optimizer creates physical plan
        DataFrame.rdd returns the underlying RDD if needed
    Custom UDF using registerFunction()
    New, experimental UDAF support

    Use DataFrames instead of RDDs!


    Catalyst Optimizer

    Converts logical plan to physical plan
    Manipulate & optimize DataFrame transformation tree
        Subquery elimination: use aliases to collapse subqueries
        Constant folding: replace expression with constant
        Simplify filters: remove unnecessary filters
        Predicate/filter pushdowns: avoid unnecessary data load
        Projection collapsing: avoid unnecessary projections
    Hooks for custom rules
        Rules = Scala Case Classes
        val newPlan = MyFilterRule(analyzedPlan)

    Apply to any plan stage
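
To make the "constant folding" rule and the rule-as-function call shape (MyFilterRule(analyzedPlan)) concrete, here is a minimal sketch on a toy expression tree. The types Expr, Literal, Column, Add, and Multiply and the ConstantFolding object are hypothetical stand-ins invented for illustration; they are not Catalyst's classes, and a real Catalyst rule would operate on a LogicalPlan instead.

```scala
// Toy expression tree. These types are illustrative only and are NOT Catalyst's classes.
sealed trait Expr
case class Literal(value: Int) extends Expr
case class Column(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr
case class Multiply(left: Expr, right: Expr) extends Expr

// A "rule" is just a function from tree to tree, applied like MyFilterRule(plan).
object ConstantFolding {
  // Rewrite bottom-up: fold the children first, then collapse literal-only nodes.
  def apply(expr: Expr): Expr = expr match {
    case Add(l, r) =>
      (apply(l), apply(r)) match {
        case (Literal(a), Literal(b)) => Literal(a + b)
        case (fl, fr)                 => Add(fl, fr)
      }
    case Multiply(l, r) =>
      (apply(l), apply(r)) match {
        case (Literal(a), Literal(b)) => Literal(a * b)
        case (fl, fr)                 => Multiply(fl, fr)
      }
    case other => other // literals and column references are already minimal
  }
}

object ConstantFoldingDemo {
  // rating * (2 + 3) is rewritten to rating * 5
  val before: Expr = Multiply(Column("rating"), Add(Literal(2), Literal(3)))
  val after: Expr  = ConstantFolding(before)
}
```

Because the rule is a pure function over an immutable tree, it can be applied at any stage and composed with other rules, which is the property the slide highlights.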


    Plan Debugging

    $"id", $"gender").filter("gender != 'F'").filter("gender != 'M'").explain(true)

    Requires explain(true)






    Plan Visualization & Join/Aggregation Metrics

    Effectiveness of Filter
    Cost-based Optimization is Applied
    Peak Memory for Joins and Aggs
    Optimized CPU-cache-aware Binary Format Minimizes GC & Improves Join Perf (Project Tungsten)

    New in Spark 1.5!


    Data Sources API

    Relations (o.a.s.sql.sources.interfaces.scala)
        BaseRelation (abstract class): Provides schema of data
            TableScan (impl): Read all data from source, construct rows
            PrunedFilteredScan (impl): Read with column pruning & predicate pushdowns
            InsertableRelation (impl): Insert or overwrite data based on SaveMode enum
        RelationProvider (trait/interface): Handles user options, creates BaseRelation
    Execution (o.a.s.sql.execution.commands.scala)
        RunnableCommand (trait/interface)
            ExplainCommand (impl: case class)
            CacheTableCommand (impl: case class)
    Filters (o.a.s.sql.sources.filters.scala)
        Filter (abstract class for all filter pushdowns for this data source)
            EqualTo (impl)
            GreaterThan (impl)
            StringStartsWith (impl)
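
The division of labor between RelationProvider, BaseRelation, and TableScan can be sketched without Spark on the classpath. The three traits below are simplified stand-ins that only mimic the shape of the interfaces named above (the real ones work with SQLContext, StructType, and RDD[Row]); InlineCsvRelation, InlineCsvProvider, and the "data" option are hypothetical names invented for this example.

```scala
// Simplified stand-ins for the interfaces named on the slide; NOT the Spark traits.
trait BaseRelation { def schema: Seq[String] }
trait TableScan { def buildScan(): Seq[Seq[Any]] }
trait RelationProvider {
  def createRelation(parameters: Map[String, String]): BaseRelation
}

// Hypothetical relation: the first line is the header, the rest are rows.
class InlineCsvRelation(header: String, lines: Seq[String])
    extends BaseRelation with TableScan {
  // BaseRelation: provides the schema of the data
  val schema: Seq[String] = header.split(",").toSeq
  // TableScan: reads all data from the source and constructs rows
  def buildScan(): Seq[Seq[Any]] = lines.map(_.split(",").toSeq)
}

// RelationProvider: handles user options and creates the BaseRelation.
class InlineCsvProvider extends RelationProvider {
  def createRelation(parameters: Map[String, String]): BaseRelation = {
    // "data" is a made-up option name for this sketch
    val data = parameters.getOrElse("data", sys.error("option 'data' is required"))
    val all = data.split("\n").toSeq
    new InlineCsvRelation(all.head, all.tail)
  }
}

object InlineCsvDemo {
  val relation: BaseRelation =
    new InlineCsvProvider().createRelation(Map("data" -> "id,gender\n1,M\n2,F"))
  // The engine checks which scan capability the relation mixes in.
  val rows: Seq[Seq[Any]] = relation match {
    case scan: TableScan => scan.buildScan()
    case _               => Seq.empty
  }
}
```

Mixing capabilities into the relation as traits lets the engine pattern-match on what a source supports, which is why the slide lists the scan types beneath BaseRelation.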


    Creating a Custom Data Source

    Study Existing Native and Third-Party Data Source Impls
        Native: JDBC (o.a.s.sql.execution.datasources.jdbc)
            class JDBCRelation extends BaseRelation with PrunedFilteredScan with InsertableRelation
        Third-Party: Cassandra (o.a.s.sql.cassandra)
            class CassandraSourceRelation extends BaseRelation with PrunedFilteredScan with InsertableRelation



    Contributing a Custom Data Source

    Managed by
    Contains links to externally-managed github projects
    Ratings and comments
    Spark version requirements of each package


  • Partitions, Pruning, Pushdowns


    Demo Dataset (from previous Spark After Dark talks)

    RATINGS
    ========
    UserID,ProfileID,Rating (1-10)


    UserID,Gender (M,F,U)

    Anonymous


    Partitions

    Partition based on data usage patterns
        /genders.parquet/gender=M/
                        /gender=F/



    Partition Pruning

    Filter out entire partitions of rows on partitioned data
        SELECT id, gender FROM genders WHERE gender = 'U'

    Column Pruning

    Filter out entire columns for all rows if not required
    Extremely useful for columnar storage formats
        Parquet, ORC
        SELECT id, gender FROM genders
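
Both kinds of pruning can be illustrated with plain collections. This is a sketch of the idea only, not of how Spark or Parquet implement it: the PruningSketch object, its method names, and the part-0 file names are hypothetical, and the gender=U directory is assumed from the (M,F,U) values in the demo dataset.

```scala
object PruningSketch {
  // Hypothetical file listing using the key=value directory layout shown earlier.
  val files: Seq[String] = Seq(
    "/genders.parquet/gender=M/part-0",
    "/genders.parquet/gender=F/part-0",
    "/genders.parquet/gender=U/part-0"
  )

  // Read the partition value for `column` out of the path, if the path has one.
  def partitionValue(path: String, column: String): Option[String] =
    path.split("/").collectFirst {
      case segment if segment.startsWith(column + "=") =>
        segment.drop(column.length + 1)
    }

  // Partition pruning: skip whole directories using only the path, no file is opened.
  def prunePartitions(paths: Seq[String], column: String, value: String): Seq[String] =
    paths.filter(p => partitionValue(p, column).contains(value))

  // Column pruning: keep only the requested columns of each row.
  def pruneColumns(rows: Seq[Map[String, Any]], required: Seq[String]): Seq[Map[String, Any]] =
    rows.map(row => required.flatMap(c => row.get(c).map(c -> _)).toMap)
}
```

With this layout, a filter on gender = 'U' touches one of the three directories. In a columnar format the column-pruning step avoids reading the unused column data at all, rather than discarding it after the read as this in-memory sketch does.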


    Pushdowns

    Predicate (aka Filter) Pushdowns
        Predicate returns {true, false} for a given function/condition
        Filters rows as deep into the data source as possible
        Data Source must implement PrunedFilteredScan
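
A rough sketch of what "filtering as deep as possible" means for a source: the engine hands the source the required columns and the filters, and the source returns only matching, already-trimmed rows. The method shape loosely follows PrunedFilteredScan's buildScan(requiredColumns, filters), but everything here is a simplified stand-in: the Filter case classes are not the Spark classes (GreaterThan is restricted to Int for brevity), and InMemorySource is a hypothetical name.

```scala
// Stand-ins named after the filters on the Data Sources API slide; NOT the Spark classes.
sealed trait Filter
case class EqualTo(attribute: String, value: Any) extends Filter
case class GreaterThan(attribute: String, value: Int) extends Filter
case class StringStartsWith(attribute: String, prefix: String) extends Filter

// Hypothetical in-memory "source" that evaluates the pushed-down filters itself.
class InMemorySource(rows: Seq[Map[String, Any]]) {
  // A predicate returns true or false for a given row.
  private def matches(row: Map[String, Any], filter: Filter): Boolean = filter match {
    case EqualTo(attr, v) =>
      row.get(attr).contains(v)
    case GreaterThan(attr, v) =>
      row.get(attr).exists { case i: Int => i > v; case _ => false }
    case StringStartsWith(attr, p) =>
      row.get(attr).exists { case s: String => s.startsWith(p); case _ => false }
  }

  // Drop rows and columns inside the source, before anything reaches the engine.
  def buildScan(requiredColumns: Seq[String], filters: Seq[Filter]): Seq[Seq[Any]] =
    rows
      .filter(row => filters.forall(f => matches(row, f)))
      .map(row => requiredColumns.map(c => row.getOrElse(c, null)))
}
```

A source backed by a database or an indexed file format would translate the filters into its own query language or index lookups instead of scanning in memory, which is where the savings come from.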

  • Native Spark SQL Data Sources


    Spark SQL Native Data Sources - Source Code


    JSON Data Source

    DataFrame
        val ratingsDF = sqlContext.read.format("json")
          .load("file:/root/pipeline/datasets/dating/ratings.json.bz2")

        -- or -- (Convenience Method)
        val ratingsDF = sqlContext.read.json("file:/root/pipeline/datasets/dating/ratings.json.bz2")

    SQL Code
        CREATE TABLE genders USING json
        OPTIONS (path "file:/root/pipeline/datasets/dating/genders.json.bz2")


    JDBC Data Source

    Add Driver to Spark JVM System Classpath
        $ export SPARK_CLASSPATH=

    DataFrame
        val jdbcConfig = Map("driver" -> "org.postgresql.Driver",
          "url" -> "jdbc:postgresql://hostname:port/database",
          "dbtable" -> "schema.tablename")
        sqlContext.read.format("jdbc").options(jdbcConfig).load()

    SQL
        CREATE TABLE genders USING jdbc
        OPTIONS (url, dbtable, driver, ...)


    Parquet Data Source

    Configuration
        spark.sql.parquet.filterPushdown=true
        spark.sql.parquet.mergeSchema=true
        spark.sql.parquet.cacheMetadata=true
        spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]

    DataFrames
        val gendersDF = sqlContext.read.format("parquet")
          .load("file:/root/pipeline/datasets/dating/genders.parquet")
        gendersDF.write.format("parquet").partitionBy("gender")
          .save("file:/root/pipeline/datasets/dating/genders.parquet")

    SQL
        CREATE TABLE genders USING parquet
        OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet")


    ORC Data Source

    Configuration

    DataFrames
        val gendersDF = sqlContext.read.format("orc")
          .load("file:/root/pipeline/datasets/dating/genders")
        gendersDF.write.format("orc").partitionBy("gender")
          .save("file:/root/pipeline/datasets/dating/genders")