Spark SQL Meetup

An Informal Talk on Spark SQL

Cheng Hao

Oct 25, 2014

Copyright © 2014 Intel Corporation.

Agenda
• Spark SQL Overview
• Catalyst in Depth
• SQL Core API Introduction
• V.S. Shark & Hive-on-Spark
• Our Contributions
• Useful Materials


Spark SQL Overview


Spark SQL in Spark

[Diagram: the Spark stack. Spark SQL, Spark Streaming (real-time), GraphX (graph, alpha), and MLlib (machine learning) sit on top of Spark Core.]

• Spark SQL was first released in Spark 1.0 (May 2014)
• Initially committed by Michael Armbrust & Reynold Xin from Databricks



Spark SQL Component Stack (User Perspective)

• Hive-like interface (JDBC Service / CLI)
• SQL API support (LINQ-like)
• Both the Hive QL and Simple SQL dialects are supported
  • DDL is 100% compatible with the Hive Metastore
  • Hive QL aims to be 100% compatible with Hive DML
  • The Simple SQL dialect is currently very weak in functionality, but easy to extend

[Diagram: component stack. Boxes, top to bottom: Data Analyst and User Application; CLI, JDBC Service, and SQL API; Hive QL and Simple SQL; Catalyst (with Hive Meta Store and Simple Catalog alongside); Spark Execution Operators; Spark Core.]



Spark SQL Architecture

[Diagram: Frontend, Catalyst, Backend. By Michael Armbrust @ Databricks]

Catalyst in Depth



Understand Some Terminology

Logical and physical query plans
• Both are trees representing query evaluation
• Internal nodes are operators over the data
• The logical plan is higher-level and algebraic
• The physical plan is lower-level and operational

Logical plan operators
• Correspond to query language constructs
• Conceptually describe what operation needs to be performed

Physical plan operators
• Correspond to implemented access methods
• Physically implement the operation described by logical operators

[Diagram: SQL Text → (Parsing) → Unresolved Logical Plan → (Binding & Analyzing) → Logical Plan → (Optimizing) → Optimized Logical Plan → (Query Planning) → Physical Plan]



Examples

We execute the following commands in the Spark SQL CLI:

• CREATE TABLE T (key STRING, value STRING)

• EXPLAIN EXTENDED SELECT a.key * (2 + 3), b.value FROM T a JOIN T b ON a.key=b.key AND a.key>3



Understand Some Terminology

== Parsed Logical Plan ==
Project [('a.key * (2 + 3)) AS c_0#24,'b.value]
 Join Inner, Some((('a.key = 'b.key) && ('a.key > 3)))
  UnresolvedRelation None, T, Some(a)
  UnresolvedRelation None, T, Some(b)

== Analyzed Logical Plan ==
Project [(CAST(key#27, DoubleType) * CAST((2 + 3), DoubleType)) AS c_0#24,value#30]
 Join Inner, Some(((key#27 = key#29) && (CAST(key#27, DoubleType) > CAST(3, DoubleType))))
  MetastoreRelation default, T, Some(a)
  MetastoreRelation default, T, Some(b)

== Optimized Logical Plan ==
Project [(CAST(key#27, DoubleType) * 5.0) AS c_0#24,value#30]
 Join Inner, Some((key#27 = key#29))
  Project [key#27]
   Filter (CAST(key#27, DoubleType) > 3.0)
    MetastoreRelation default, T, Some(a)
  MetastoreRelation default, T, Some(b)

== Physical Plan ==
Project [(CAST(key#27, DoubleType) * 5.0) AS c_0#24,value#30]
 BroadcastHashJoin [key#27], [key#29], BuildLeft
  Filter (CAST(key#27, DoubleType) > 3.0)
   HiveTableScan [key#27], (MetastoreRelation default, T, Some(a)), None
  HiveTableScan [key#29,value#30], (MetastoreRelation default, T, Some(b)), None


Catalyst Overview

• Catalyst is essentially an extensible framework for analyzing and optimizing logical plans and expressions.

• Core elements:
  • Tree Node API
  • Expression optimization
  • Data Type & Schema
  • Row API
  • Logical Plan (Unresolved): Binding & Analyzing (Rules)
  • Logical Plan (Resolved): Optimizing (Rules)

• SPI (Service Provider Interface)
  • FunctionRegistry
  • Schema Catalog



Data Type & Schema

• Primitive types
  • StringType, FloatType, IntegerType, ByteType, ShortType, DoubleType, LongType, BinaryType, BooleanType, DecimalType, TimestampType, DateType, Varchar (not completely supported yet), Char (not completely supported yet)
• Complex types
  • ArrayType
    • ArrayType(elementType: DataType)
  • StructType
    • StructField(name: String, dataType: DataType)
    • StructType(fields: Seq[StructField])
  • MapType
    • MapType(keyType: DataType, valueType: DataType)
  • UnionType (not supported yet)
• Relation schema
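To make the type list concrete, here is a minimal sketch in plain Scala of how a relation schema can be modeled as a tree of data types. It only mirrors the constructor shapes shown on the slide; the object name SchemaSketch and the render helper are assumptions made for illustration and are not part of Spark SQL.

```scala
// Illustrative stand-ins that mirror the constructor shapes on the slide.
// These are NOT the real org.apache.spark.sql classes.
object SchemaSketch {
  sealed trait DataType
  case object StringType extends DataType
  case object IntegerType extends DataType
  case object DoubleType extends DataType
  case class ArrayType(elementType: DataType) extends DataType
  case class MapType(keyType: DataType, valueType: DataType) extends DataType
  case class StructField(name: String, dataType: DataType)
  case class StructType(fields: Seq[StructField]) extends DataType

  /** Hypothetical helper: print a type as a compact, Hive-like type string. */
  def render(t: DataType): String = t match {
    case StringType     => "string"
    case IntegerType    => "int"
    case DoubleType     => "double"
    case ArrayType(e)   => s"array<${render(e)}>"
    case MapType(k, v)  => s"map<${render(k)},${render(v)}>"
    case StructType(fs) =>
      fs.map(f => s"${f.name}:${render(f.dataType)}").mkString("struct<", ",", ">")
  }

  def main(args: Array[String]): Unit = {
    // A relation schema is a struct whose fields may themselves be complex types.
    val schema = StructType(Seq(
      StructField("key", IntegerType),
      StructField("tags", ArrayType(StringType)),
      StructField("scores", MapType(StringType, DoubleType))))
    println(render(schema))
  }
}
```

The point of the sketch is that a relation schema is just a StructType at the root, so nested columns need no special handling.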



Row API

trait Row extends Seq[Any] with Serializable {
  def apply(i: Int): Any
  def isNullAt(i: Int): Boolean
  def getInt(i: Int): Int
  def getLong(i: Int): Long
  def getDouble(i: Int): Double
  def getFloat(i: Int): Float
  def getBoolean(i: Int): Boolean
  def getShort(i: Int): Short
  def getByte(i: Int): Byte
  def getString(i: Int): String
  def getAs[T](i: Int): T
}

• The Row class is the key data structure, used widely both inside and outside Spark SQL.
• "def getAs[T]" is used for non-primitive data types.
• Field values are represented as native language data types.
• Field types are represented as the DataType described on the previous slide.
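The access pattern implied by this API can be sketched without Spark. SimpleRow below is a hypothetical stand-in that exposes a few of the methods from the trait above; it is not the real Row class. The optInt helper is also an assumption, added to show why isNullAt is checked before calling a primitive getter: a primitive return type has no way to express NULL.

```scala
// SimpleRow is a hypothetical stand-in for illustration, not Spark's Row.
object RowSketch {
  final class SimpleRow(values: IndexedSeq[Any]) {
    def length: Int = values.length
    def apply(i: Int): Any = values(i)
    def isNullAt(i: Int): Boolean = values(i) == null
    def getInt(i: Int): Int = values(i).asInstanceOf[Int]
    def getString(i: Int): String = values(i).asInstanceOf[String]
    // Generic accessor, intended for non-primitive field types.
    def getAs[T](i: Int): T = values(i).asInstanceOf[T]
  }

  /** Hypothetical helper: null-safe read of an Int field. */
  def optInt(row: SimpleRow, i: Int): Option[Int] =
    if (row.isNullAt(i)) None else Some(row.getInt(i))

  def main(args: Array[String]): Unit = {
    val row = new SimpleRow(Vector(1, "val_1", Seq("x", "y"), null))
    println(optInt(row, 0))               // field 0 holds an Int
    println(optInt(row, 3))               // field 3 is NULL
    println(row.getAs[Seq[String]](2))    // complex field read through getAs
  }
}
```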


Logical Plan Binding & Analyzing

• Essentially about data binding & semantic analysis
• Example rules
  • Bind attributes and relations with concrete data
    • ResolveReferences, ResolveRelation
  • Expression analysis
    • Data type coercion (PropagateTypes, PromoteString, BooleanCasts, Division, etc.)
    • Bind UDFs (ResolveFunctions)
  • Evict / expand the analysis logical plan operators
    • StarExpansion, EliminateAnalysisOperators
  • Implicit semantic supplement
    • Add sort expressions into the child projection list (ResolveSortReferences)
    • Convert a projection into an aggregation if the projection contains an aggregate function (GlobalAggregates)
    • UnresolvedHavingClauseAttributes
  • Semantic checking
    • Unresolved functions, relations, attributes (CheckResolution)
    • Illegal expressions in the projection of an aggregation (CheckAggregation)
  • …
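The following is a minimal sketch, in plain Scala, of three of the steps listed above: binding attribute names against a schema, coercing a string-versus-number comparison by casting both sides to double (as in the analyzed plan shown earlier for a.key > 3), and a final resolution check. The class and function names are borrowed from the slide for readability, but every definition here is a simplified assumption, not Catalyst's actual implementation.

```scala
// Simplified illustration of analysis rules; NOT Catalyst's real code.
object AnalyzerSketch {
  sealed trait DataType
  case object StringType extends DataType
  case object IntegerType extends DataType
  case object DoubleType extends DataType

  sealed trait Expr
  case class UnresolvedAttribute(name: String) extends Expr
  case class AttributeReference(name: String, dataType: DataType) extends Expr
  case class Literal(value: Any, dataType: DataType) extends Expr
  case class Cast(child: Expr, to: DataType) extends Expr
  case class GreaterThan(left: Expr, right: Expr) extends Expr

  private def typeOf(e: Expr): Option[DataType] = e match {
    case AttributeReference(_, t) => Some(t)
    case Literal(_, t)            => Some(t)
    case Cast(_, t)               => Some(t)
    case _                        => None
  }

  /** Binding: replace attribute names with typed references found in the schema. */
  def resolveReferences(schema: Map[String, DataType])(e: Expr): Expr = e match {
    case UnresolvedAttribute(n) => schema.get(n).map(AttributeReference(n, _)).getOrElse(e)
    case GreaterThan(l, r)      =>
      GreaterThan(resolveReferences(schema)(l), resolveReferences(schema)(r))
    case Cast(c, t)             => Cast(resolveReferences(schema)(c), t)
    case other                  => other
  }

  private def toDouble(e: Expr): Expr =
    if (typeOf(e).contains(DoubleType)) e else Cast(e, DoubleType)

  /** Coercion: when a string is compared with a non-string, cast both sides to double. */
  def promoteStrings(e: Expr): Expr = e match {
    case GreaterThan(l, r) =>
      (typeOf(l), typeOf(r)) match {
        case (Some(StringType), Some(t)) if t != StringType => GreaterThan(toDouble(l), toDouble(r))
        case (Some(t), Some(StringType)) if t != StringType => GreaterThan(toDouble(l), toDouble(r))
        case _                                              => e
      }
    case other => other
  }

  private def unresolved(e: Expr): Seq[String] = e match {
    case UnresolvedAttribute(n) => Seq(n)
    case GreaterThan(l, r)      => unresolved(l) ++ unresolved(r)
    case Cast(c, _)             => unresolved(c)
    case _                      => Nil
  }

  /** Run the rules, then fail if anything is still unresolved (semantic checking). */
  def analyze(schema: Map[String, DataType])(e: Expr): Either[String, Expr] = {
    val analyzed = promoteStrings(resolveReferences(schema)(e))
    val missing = unresolved(analyzed)
    if (missing.isEmpty) Right(analyzed)
    else Left(s"Unresolved attributes: ${missing.mkString(", ")}")
  }

  def main(args: Array[String]): Unit = {
    val predicate = GreaterThan(UnresolvedAttribute("key"), Literal(3, IntegerType))
    println(analyze(Map("key" -> StringType))(predicate))
    println(analyze(Map.empty)(predicate))
  }
}
```

Each step is a small function over the tree, which is the property that makes the real rule set easy to extend.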



Logical Plan Optimizing

• Simplify the logical plan tree based on relational / logical algebra and common sense (rule based)
• Example rules
  • Expression optimization
    • NullPropagation, ConstantFolding, SimplifyFilters, SimplifyCasts, OptimizeIn, etc.
  • Filter pushdown
    • UnionPushdown, PushPredicateThroughProject, PushPredicateThroughJoin, ColumnPruning
  • Combine operators
    • CombineFilters, CombineLimits
• Concrete examples
  • IsNull('a + null) => IsNull(null) => Literal(true)
  • SELECT a.key, b.key FROM a JOIN b ON a.key=b.key AND b.key>10
    => SELECT a.key, b.key FROM a JOIN (SELECT key FROM b WHERE key>10) b ON a.key=b.key
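The first concrete example can be reproduced with a toy expression tree. The sketch below, in plain Scala, applies two small rules bottom-up, one for null propagation and one for constant folding. The names (Attr, Num, transformUp, and so on) are assumptions chosen for this illustration; Catalyst's real rules operate on its own expression classes and are more complete.

```scala
// Toy expression tree for illustration; these are NOT Catalyst's real classes.
object ConstantFoldingSketch {
  sealed trait Expr
  case class Attr(name: String) extends Expr
  case class Num(value: Option[Double]) extends Expr // None models SQL NULL
  case class Bool(value: Boolean) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr
  case class Mul(left: Expr, right: Expr) extends Expr
  case class IsNull(child: Expr) extends Expr

  /** Rewrite children first, then try the rule on the rebuilt node. */
  def transformUp(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
    val rewritten: Expr = e match {
      case Add(l, r) => Add(transformUp(l)(rule), transformUp(r)(rule))
      case Mul(l, r) => Mul(transformUp(l)(rule), transformUp(r)(rule))
      case IsNull(c) => IsNull(transformUp(c)(rule))
      case leaf      => leaf
    }
    rule.applyOrElse(rewritten, (unchanged: Expr) => unchanged)
  }

  // Arithmetic with a NULL operand is NULL; IsNull of a literal is a known boolean.
  val nullPropagation: PartialFunction[Expr, Expr] = {
    case Add(Num(None), _) | Add(_, Num(None)) => Num(None)
    case Mul(Num(None), _) | Mul(_, Num(None)) => Num(None)
    case IsNull(Num(None))                     => Bool(true)
    case IsNull(Num(Some(_)))                  => Bool(false)
  }

  // Operators whose operands are all non-null literals are evaluated at plan time.
  val constantFolding: PartialFunction[Expr, Expr] = {
    case Add(Num(Some(a)), Num(Some(b))) => Num(Some(a + b))
    case Mul(Num(Some(a)), Num(Some(b))) => Num(Some(a * b))
  }

  def optimize(e: Expr): Expr = transformUp(e)(nullPropagation orElse constantFolding)

  def main(args: Array[String]): Unit = {
    // IsNull('a + null) => IsNull(null) => true
    println(optimize(IsNull(Add(Attr("a"), Num(None)))))
    // key * (2 + 3) => key * 5
    println(optimize(Mul(Attr("key"), Add(Num(Some(2.0)), Num(Some(3.0))))))
  }
}
```

Because rules are partial functions, they compose with orElse and can be added without touching the traversal.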



Spark SQL Dialects

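[Diagram: two context stacks across the Frontend, Catalyst, and Backend layers.
 HiveContext: Hive Parser → Hive AST → Logical Plan, with the Hive Catalog; Hive+Spark Planner → Execution Operators.
 SQLContext: SQL Parser / DSL API → Logical Plan, with the Simple Catalog; Spark Planner → Execution Operators.]

A new tool (XXXContext) = Frontend (XX Parser / API) + Catalyst + SPI (XXX Catalog) + Backend (XXX Planner)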



Spark Plan (Physical Plan)

• SparkPlan is the root class of Spark plan operators (the physical plan operators for Spark)

abstract class SparkPlan {
  def children: Seq[SparkPlan]

  /** Specifies how data is partitioned across different nodes in the cluster. */
  def outputPartitioning: Partitioning = UnknownPartitioning(0)

  /** Specifies any partition requirements on the input data for this operator. */
  def requiredChildDistribution: Seq[Distribution] =
    Seq.fill(children.size)(UnspecifiedDistribution)

  def execute(): RDD[Row]
}

• Spark plan operators
  • Joins: BroadcastHashJoin, CartesianProduct, HashOuterJoin, LeftSemiJoinHash, etc.
  • Aggregate: Aggregate
  • BasicOperators: Distinct, Except, Filter, Limit, Project, Sort, Union, etc.
  • Shuffle: AddExchange, Exchange
  • Commands: CacheTableCommand, DescribeCommand, ExplainCommand, etc.
  • ..
• Spark Strategy (SparkPlanner)
  • Maps the optimized logical plan to a Spark plan

[Diagram: Spark Plan → RDD → Spark Execution]
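The relationship between logical operators, physical operators, and the planner can be sketched without Spark. In the plain-Scala example below, an in-memory Seq of rows stands in for RDD[Row], and the operator names ScanExec, FilterExec, and ProjectExec are invented for this illustration. The plan function plays the role that strategies play in SparkPlanner, in a much reduced form.

```scala
// Reduced illustration: Seq[Row] stands in for RDD[Row]; names are hypothetical.
object PlannerSketch {
  type Row = Map[String, Any]

  // Logical operators describe WHAT should be computed.
  sealed trait LogicalPlan
  case class Relation(rows: Seq[Row]) extends LogicalPlan
  case class Filter(condition: Row => Boolean, child: LogicalPlan) extends LogicalPlan
  case class Project(columns: Seq[String], child: LogicalPlan) extends LogicalPlan

  // Physical operators know HOW to produce rows.
  trait PhysicalPlan {
    def children: Seq[PhysicalPlan]
    def execute(): Seq[Row]
  }
  case class ScanExec(rows: Seq[Row]) extends PhysicalPlan {
    def children: Seq[PhysicalPlan] = Nil
    def execute(): Seq[Row] = rows
  }
  case class FilterExec(condition: Row => Boolean, child: PhysicalPlan) extends PhysicalPlan {
    def children: Seq[PhysicalPlan] = Seq(child)
    def execute(): Seq[Row] = child.execute().filter(condition)
  }
  case class ProjectExec(columns: Seq[String], child: PhysicalPlan) extends PhysicalPlan {
    def children: Seq[PhysicalPlan] = Seq(child)
    def execute(): Seq[Row] =
      child.execute().map(row => columns.map(c => c -> row(c)).toMap)
  }

  /** Planner: map each logical operator to a physical operator, recursively. */
  def plan(logical: LogicalPlan): PhysicalPlan = logical match {
    case Relation(rows)      => ScanExec(rows)
    case Filter(cond, child) => FilterExec(cond, plan(child))
    case Project(cols, child) => ProjectExec(cols, plan(child))
  }

  def main(args: Array[String]): Unit = {
    val table: Seq[Row] = (1 to 5).map(i => Map[String, Any]("key" -> i, "value" -> s"val_$i"))
    val logical = Project(Seq("key"),
      Filter(row => row("key").asInstanceOf[Int] > 3, Relation(table)))
    plan(logical).execute().foreach(println)
  }
}
```

Nothing runs until execute() is called on the root, which mirrors how a physical plan only yields an RDD on demand.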


Case Study for Catalyst in Depth

• StreamSQL
  • Reuses the HiveContext, but with a different frontend / backend
  • Frontend: slight modification of the HiveParser
  • Backend: customized query planner that generates the physical plan based on Spark DStream
  • JIRA: https://issues.apache.org/jira/browse/SPARK-1363
  • Source: https://github.com/thunderain-project/StreamSQL

• SQL 92 support
  • Reuses the HiveContext, but with a different frontend
  • Frontend: a modified HiveParser & Hive QL translator
  • https://github.com/intel-hadoop/spark/tree/panthera

• Pig on Spark POC
  • Modifies the SQLContext
  • Provides a PigParser to translate the Pig script into a Catalyst unresolved logical plan
  • https://github.com/databricks/pig-on-spark



SQL Core API Introduction



SchemaRDD

• What's SchemaRDD?

class SchemaRDD(
    @transient val sqlContext: SQLContext,
    @transient val baseLogicalPlan: LogicalPlan)
  extends RDD[Row](sqlContext.sparkContext, Nil)

• Spark SQL Core API (in Scala)
  • Create a SchemaRDD instance from
    • Plain SQL text: def sql(sqlText: String)
    • An existing logical plan: def logicalPlanToSparkQuery(plan: LogicalPlan)
    • A Spark RDD: def createSchemaRDD[A <: Product: TypeTag](rdd: RDD[A])
    • A Spark RDD with a schema: def applySchema(rowRDD: RDD[Row], schema: StructType)
    • A file in a frequently used format (JSON, Parquet, etc.): def parquetFile(path: String)
  • SQL DSL
    • select, where, join, orderBy, limit, groupBy, unionAll, etc.
  • Data sink
    • Persist the data with a specified storage level: def persist(newLevel: StorageLevel)
    • Save the data as a Parquet file: def saveAsParquetFile(path: String)
    • Register the data as a new temporary table: def registerTempTable(tableName: String)
    • Insert the data into an existing table: def insertInto(tableName: String, overwrite: Boolean)
    • ….
• Java API / Python API supported


Conceptual State Transition Diagram

[Diagram: RDD and SQL Text / File / Table → (SQL API) → Schema RDD* → File / Memory etc. → …]

* Unresolved Logical Plan RDD (Unresolved Logical Plan → Logical Plan → Optimized Logical Plan → Physical Plan → Spark RDD)


Code Example

sbt/sbt hive/console

// A HiveContext is created by default and its members are imported,
// so we can call its methods directly.

// Create a Hive table and load data into it.
sql("CREATE TABLE IF NOT EXISTS kv_text(key INT, value STRING)")
sql("LOAD DATA LOCAL INPATH '/tmp/kv1.txt' INTO TABLE kv_text")

// Create a normal RDD.
case class KV(key: Int, value: String)
val kvRdd = sparkContext.parallelize((1 to 100).map(i => KV(i, s"val_$i")))

// kvRdd is implicitly converted into a SchemaRDD, then registered as a temporary table.
kvRdd.where('key >= 1).where('key <= 5).registerTempTable("kv_rdd")

// Load a JSON file and register it as a temporary table.
jsonFile("/tmp/file2.json").registerTempTable("kv_json")

val result = sql("SELECT a.key, b.value, c.key from kv_text a join kv_rdd b join kv_json c")

result.collect().foreach(row => {
  val f0 = if (row.isNullAt(0)) "null" else row.getInt(0)
  val f1 = if (row.isNullAt(1)) "null" else row.getString(1)
  val f2 = if (row.isNullAt(2)) "null" else row.getInt(2)
  println(s"result:$f0, $f1, $f2")
})

V.S. Shark & Hive-on-Spark



Background of Shark / Hive-on-Spark / Spark SQL
• Shark was the first SQL-on-Spark product, based on earlier versions of Hive (with a rewritten QueryPlanner that generates a Spark RDD-based physical plan). Shark is now retired and has been replaced by Spark SQL.
• Hive-on-Spark is a QueryPlanner extension of Hive; it focuses on the SparkPlanner and the implementation of Spark RDD-based physical operators. Spark users automatically get the whole set of Hive's rich features, including any new features that Hive might introduce in the future.
• Spark SQL is a new SQL engine on Spark, developed from scratch.

Functionality
• Spark SQL supports almost all of the functionality that Hive provides, from the perspective of data analysts.
• SQL API on the Spark shell vs. Pig Latin.
• Spark SQL is an extensible, flexible framework for developers (based on Catalyst); new extensions are very easy to integrate.

Implementation Philosophy of Spark SQL (Simple & Natural)
• Makes heavy use of Scala features (pattern matching, implicit conversions, partial functions, etc.)
• A large number of small, simple rules bind, analyze, and optimize the logical plan and expression tree, and also generate the physical plan.
• In-memory computing and maximizing memory usage (cache-related SQL API & commands).
• Spark SQL benefits a lot from Hive by reusing its components (Hive QL parser, Metastore, SerDe, StorageHandler, etc.)

Stability
• Hive is the de facto standard for SQL on big data so far, and it has proven to be a productive tool over several years in practice; many corner cases are covered by its continuous enhancements.
• Spark SQL has just started its journey (~0.5 year); we need more time to prove and improve it.


Our Contributions



60+ PRs in total, 50+ merged into Spark SQL

Features
• Add serde support for CTAS (PR2570)
• Support the Grouping Set (PR1567)
• Support EXTENDED for EXPLAIN (PR1982)
• Cross join support in HiveQL (PR2124)
• Add support for left semi join (PR837)
• Add Date type support (PR2344)
• Add Timestamp type support (PR275)
• Add Expression RLike & Like support (PR224)
• ..

Performance Enhancement / Improvement
• Avoid table creation in logical plan analyzing for CTAS (PR1846)
• Extract the join keys from the join condition (PR1190)
• Reduce the Expression tree object creations for aggregation functions (min/max) (PR2113)
• Push down the join filter & predicate for outer join (PR1015)
• Constant Folding for Expression Optimization (PR482)
• Fix performance issue in data type casting (PR679)
• Do not limit argument type for Hive simple UDF (PR2506)
• Use GenericUDFUtils.ConversionHelper for Simple UDF type conversions (PR2407)
• Select null from table would throw a MatchError (PR2396)
• Type Coercion should support every type to have null value (PR2246)
• ….

Bug Fixes
• ….


Useful Materials



References
• http://spark-summit.org/wp-content/uploads/2013/10/J-Michael-Armburst-catalyst-spark-summit-dec-2013.pptx
• http://spark-summit.org/wp-content/uploads/2014/07/Performing-Advanced-Analytics-on-Relational-Data-with-Spark-SQL-Michael-Armbrust.pdf
• https://www.youtube.com/watch?v=GQSNJAzxOr8
• http://www.slideshare.net/ueshin/20140908-spark-sql-catalyst?qid=3bb8abf4-3d8d-433f-9397-c24c5256841d
• https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark
• http://web.stanford.edu/class/cs346/qpnotes.html
• http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
• http://codex.cs.yale.edu/avi/db-book/db6/slide-dir/PDF-dir/ch13.pdf
• https://courses.cs.washington.edu/courses/cse444/12sp/lectures/
• http://www.cs.uiuc.edu/class/sp06/cs411/lectures.html

• User Mail List [email protected]

• Dev Mail List [email protected]

• Jira https://issues.apache.org/jira/browse/SPARK/component/12322623

• DevDoc https://spark.apache.org/docs/latest/sql-programming-guide.html

• Github https://github.com/apache/spark/tree/master/sql



Notice and Disclaimers: Intel, the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. See Trademarks on intel.com for full list of Intel trademarks.

Optimization Notice:Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Intel technologies may require enabled hardware, specific software, or services activation. Check with your system manufacturer or retailer.

No computer system can be absolutely secure. Intel does not assume any liability for lost or stolen data or systems or any damages resulting from such losses.

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. The products described may contain design defects or errors known as errata which may cause the product to deviate from publish.


http://www.intel.com/sites/corporate/tradmarx.htm
