20140908 Spark SQL & Catalyst


Presentation material for Spark Meetup 2014/09/08.

Transcript of 20140908 Spark SQL & Catalyst

1. Introduction to Spark SQL & Catalyst
   Takuya UESHIN
   Spark Meetup 2014/09/08 (Mon)

2. Who am I?
   - Takuya UESHIN
   - @ueshin
   - github.com/ueshin
   - Nautilus Technologies, Inc.
   - A Spark contributor

3. Agenda
   - What is Spark SQL?
   - Catalyst in depth
   - SQL core in depth
   - Interesting issues
   - How to contribute

4. What is Spark SQL?

5. What is Spark SQL?
   - Spark SQL is one of the Spark components.
   - Executes SQL on Spark.
   - Builds SchemaRDDs with a LINQ-like interface.
   - Optimizes execution plans.

6. What is Spark SQL?
   - Catalyst provides an execution planning framework for relational operations, including:
     - SQL parser & analyzer
     - Logical operators & general expressions
     - Logical optimizer
     - A framework to transform operator trees

7. What is Spark SQL?
   - Executing a query takes several steps: Parse → Analyze → Logical Plan → Optimize → Physical Plan → Execute

8. What is Spark SQL?
   - The same steps; Catalyst covers the Parse, Analyze, Logical Plan, and Optimize steps.

9. What is Spark SQL?
   - The same steps; SQL core covers the Physical Plan and Execute steps.

10. Catalyst in depth

11. Catalyst in depth
   - Provides an execution planning framework for relational operations:
     - Row & DataTypes
     - Trees & Rules
     - Logical Operators
     - Expressions
     - Optimizations

12. Row & DataTypes
   - o.a.s.sql.catalyst.types.DataType
     - Long, Int, Short, Byte, Float, Double, Decimal
     - String, Binary, Boolean, Timestamp
     - Array, Map, Struct
   - o.a.s.sql.catalyst.expressions.Row
     - Represents a single row.
     - Can contain complex types.

13. Trees & Rules
   - o.a.s.sql.catalyst.trees.TreeNode
     - Provides tree transformations: foreach, map, flatMap, collect, transform, transformUp, transformDown
     - Used for operator trees and expression trees (see the expression sketch right after slide 17).

14. Trees & Rules
   - o.a.s.sql.catalyst.rules.Rule
     - Represents a tree transform rule.
   - o.a.s.sql.catalyst.rules.RuleExecutor
     - A framework to transform trees based on rules (see the rule sketch right after slide 25).

15. Logical Operators
   - Basic operators: Project, Filter, ...
   - Binary operators: Join, Except, Intersect, Union, ...
   - Aggregate, Generate, Distinct, Sort, Limit, InsertInto, WriteToFile
   [Diagram: operator tree of Project over Filter over Join of two Tables]

16. Expressions
   - Literal
   - Arithmetics: UnaryMinus, Sqrt, MaxOf, Add, Subtract, Multiply, ...
   - Predicates: EqualTo, LessThan, LessThanOrEqual, GreaterThan, GreaterThanOrEqual, Not, And, Or, In, If, CaseWhen
   [Diagram: expression tree, an Add of the literals 1 and 2]

17. Expressions
   - Cast
   - GetItem, GetField
   - Coalesce, IsNull, IsNotNull
   - String operations: Like, Upper, Lower, Contains, StartsWith, EndsWith, Substring, ...
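To make slides 13-17 concrete, here is a minimal sketch (not from the presentation) that builds the Add(1, 2) expression tree drawn on slide 16 and rewrites it with TreeNode's transform, assuming the Spark 1.1-era Catalyst API; it can be pasted into spark-shell or any REPL with spark-catalyst on the classpath:

    import org.apache.spark.sql.catalyst.expressions.{Add, Expression, Literal}

    // The expression tree from slide 16: Add(Literal(1), Literal(2)).
    val expr: Expression = Add(Literal(1), Literal(2))

    // A TreeNode transformation in the spirit of the ConstantFolding
    // optimization described next: collapse an Add of two Int literals
    // into a single literal.
    val folded: Expression = expr transform {
      case Add(Literal(l: Int, _), Literal(r: Int, _)) => Literal(l + r)
    }
    // folded is now Literal(3, IntegerType)

The same transform mechanism works on logical operator trees, which is how the optimizations in the next slides are written.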

18. Optimizations
   - ConstantFolding batch: NullPropagation, ConstantFolding, BooleanSimplification, SimplifyFilters
   - FilterPushdown batch: CombineFilters, PushPredicateThroughProject, PushPredicateThroughJoin
   - ColumnPruning

19. Optimizations: NullPropagation, ConstantFolding
   - Replaces expressions that can be evaluated statically with the resulting literal value.
   - ex) 1 + null => null, 1 + 2 => 3, Count(null) => 0

20. Optimizations: BooleanSimplification
   - Simplifies boolean expressions whose values can be determined.
   - ex) false AND $right => false, true AND $right => $right, true OR $right => true, false OR $right => $right, If(true, $then, $else) => $then

21. Optimizations: SimplifyFilters
   - Removes filters that can be evaluated trivially.
   - ex) Filter(true, child) => child, Filter(false, child) => empty

22. Optimizations: CombineFilters
   - Merges two adjacent filters (see the rule sketch after slide 25).
   - ex) Filter($fc, Filter($nc, child)) => Filter(AND($fc, $nc), child)

23. Optimizations: PushPredicateThroughProject
   - Pushes Filter operators through a Project operator.
   - ex) Filter(i === 1, Project(i, j, child)) => Project(i, j, Filter(i === 1, child))

24. Optimizations: PushPredicateThroughJoin
   - Pushes Filter operators through a Join operator.
   - ex) Filter(left.i.attr === 1, Join(left, right)) => Join(Filter(i === 1, left), right)

25. Optimizations: ColumnPruning
   - Eliminates the reading of unused columns.
   - ex) Join(left, right, LeftSemi, left.id.attr === right.id.attr) => Join(left, Project(id, right), LeftSemi)
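As a concrete illustration of the Rule/RuleExecutor machinery from slide 14 and the CombineFilters optimization from slide 22, here is a minimal rule sketch. It is an approximation written for this transcript, not the actual Spark source, and it assumes the Spark 1.1-era Catalyst API:

    import org.apache.spark.sql.catalyst.expressions.And
    import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
    import org.apache.spark.sql.catalyst.rules.Rule

    // CombineFilters-style rule: merge two adjacent Filter operators into
    // one Filter whose condition is the conjunction of both conditions.
    object CombineFiltersSketch extends Rule[LogicalPlan] {
      def apply(plan: LogicalPlan): LogicalPlan = plan transform {
        case Filter(outerCondition, Filter(innerCondition, child)) =>
          Filter(And(outerCondition, innerCondition), child)
      }
    }

A RuleExecutor such as the Optimizer runs batches of rules like this one repeatedly until the plan stops changing (or an iteration limit is reached).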
26. SQL core in depth

27. SQL core in depth
   - Provides:
     - Physical operators to build RDDs
     - Conversion from an existing RDD of Product to SchemaRDD
     - Parquet file read/write support
     - JSON file read support
     - Columnar in-memory table support

28. SQL core in depth
   - o.a.s.sql.SchemaRDD
     - Extends RDD[Row].
     - Holds a logical plan tree.
     - Provides LINQ-like interfaces to construct the logical plan: select, where, join, orderBy, ... (see the usage sketch at the end of the transcript).
     - Executes the plan.

29. SQL core in depth
   - o.a.s.sql.execution.SparkStrategies
     - Converts a logical plan into a physical plan.
     - Some rules are based on statistics of the operators.

30. SQL core in depth
   - Parquet read/write support
     - Parquet is a columnar storage format for Hadoop.
     - Reads existing Parquet files, converting the Parquet schema to a row schema.
     - Writes new Parquet files.
     - DecimalType and TimestampType are not supported yet.

31. SQL core in depth
   - JSON read support
     - Loads a JSON file (one object per line).
     - Infers the row schema from the entire dataset.
     - Supplying the schema explicitly is experimental; inferring the schema by sampling is also experimental.

32. SQL core in depth
   - Columnar in-memory table support
     - Caches a table like RDD.cache, but in columnar form.
     - Can prune unnecessary columns when reading data.

33. Interesting issues

34. Interesting issues
   - Support GroupingSet/ROLLUP/CUBE: https://issues.apache.org/jira/browse/SPARK-2663
   - Use statistics to skip partitions when reading from in-memory columnar data: https://issues.apache.org/jira/browse/SPARK-2961

35. Interesting issues
   - Pluggable interface for shuffles: https://issues.apache.org/jira/browse/SPARK-2044
   - Sort-merge join: https://issues.apache.org/jira/browse/SPARK-2213
   - Cost-based join reordering: https://issues.apache.org/jira/browse/SPARK-2216

36. How to contribute

37. How to contribute
   - See: Contributing to Spark
   - Open an issue on JIRA
   - Send a pull request on GitHub
   - Communicate with committers and reviewers
   - Congratulations!

38. Conclusion
   - Introduced Spark SQL & Catalyst.
   - Now you know them very well!
   - And you know how to contribute.
   - Let's contribute to Spark & Spark SQL!

39. An addition

40. What are we doing?

41. What are we doing?
   - Executing a query takes several steps: Parse → Analyze → Logical Plan → Optimize → Physical Plan → Execute

42. What are we doing?
   - The same steps again.

43. What are we doing?
   - The same steps, with additional diagram labels: DSL, ASG, business logic.

44. Thanks!
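To tie the SQL core slides (27-32) together, here is a minimal end-to-end usage sketch. It is not from the presentation; it assumes the Spark 1.1-era API, and the Person case class, the sample rows, and the commented-out JSON path are made up purely for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Hypothetical record type, used only for this sketch.
    case class Person(name: String, age: Int)

    object SchemaRDDSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("SchemaRDDSketch"))
        val sqlContext = new SQLContext(sc)
        // Brings in the implicit RDD[Product] => SchemaRDD conversion and the DSL implicits.
        import sqlContext._

        // Conversion from an existing RDD of Product to SchemaRDD (slide 27).
        val people = sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 25)))
        people.registerTempTable("people")

        // SQL on Spark (slide 5) ...
        val bySql = sqlContext.sql("SELECT name FROM people WHERE age >= 30")
        // ... and the LINQ-like interface on SchemaRDD (slide 28).
        val byDsl = people.where('age >= 30).select('name)

        // Columnar in-memory table support (slide 32).
        sqlContext.cacheTable("people")

        // JSON read support (slide 31), one JSON object per line;
        // the path here is only a placeholder.
        // val json = sqlContext.jsonFile("path/to/people.json")

        bySql.collect().foreach(println)
        byDsl.collect().foreach(println)
        sc.stop()
      }
    }

Note that where and select only build up the logical plan; nothing runs until an action such as collect() executes the plan, which is when Catalyst's analysis and optimization take place.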