An Overview of Spark DataFrames with Scala


Himanshu Gupta, Sr. Software Consultant, Knoldus Software LLP

Who am I?

Himanshu Gupta (@himanshug735)

Spark Certified Developer

Apache Spark Third-Party Package contributor - spark-streaming-gnip

Sr. Software Consultant at Knoldus Software LLP

Img src - https://karengately.files.wordpress.com/2013/06/who-am-i.jpg

Agenda

● What is Spark?

● What is a DataFrame?

● Why do we need DataFrames?

● A brief example

● Demo

Apache Spark

● Distributed compute engine for large-scale data processing.

● Up to 100x faster than Hadoop MapReduce for in-memory workloads.

● Provides APIs in Python, Scala, Java and R (R support added in Spark 1.4) - a minimal Scala setup sketch follows below.

● Combines SQL, streaming and complex analytics.

● Runs on Hadoop, Mesos, standalone, or in the cloud.

Img src - http://spark.apache.org/
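A minimal sketch of a Spark 1.x-style Scala entry point (the application name and local master are illustrative, not from the original deck):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// A SparkContext drives RDD jobs; a SQLContext adds the DataFrame / SQL API on top of it.
val conf = new SparkConf().setAppName("dataframes-overview").setMaster("local[*]")
val sparkContext = new SparkContext(conf)
val sqlContext = new SQLContext(sparkContext)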

Spark DataFrames

● Distributed collection of data organized into named columns (formerly SchemaRDD).

● Domain Specific Language for common tasks (see the sketch below) -
  ➢ UDFs
  ➢ Sampling
  ➢ Project, filter, aggregate, join, ...
  ➢ Metadata

● Available in Python, Scala, Java and R (R support added in Spark 1.4).
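A minimal sketch of the DataFrame DSL in Scala, assuming the sqlContext from the previous sketch and a hypothetical people.json file with name, age and city fields:

import org.apache.spark.sql.functions._

// Load a DataFrame from JSON (hypothetical input file).
val people = sqlContext.read.json("people.json")

// Project, filter and aggregate with the DSL.
people
  .select("name", "age", "city")
  .filter(col("age") > 21)
  .groupBy("city")
  .agg(avg("age").as("avgAge"))
  .show()

// A simple UDF usable in DataFrame expressions.
val upper = udf((s: String) => s.toUpperCase)
people.select(upper(col("name")).as("name_upper")).show()

// Sampling and metadata.
val sampled = people.sample(withReplacement = false, fraction = 0.1)
people.printSchema()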

Google Trends for DataFrames

Img src - https://www.google.co.in/trends/explore#q=dataframes&date=1%2F2011%2056m&cmpt=q&tz=Etc%2FGMT-5%3A30

Speed of Spark DataFrames!

Img src - https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

RDD API vs DataFrame API

RDD API

// Word count, then the average count per distinct word, with the RDD API.
val linesRDD = sparkContext.textFile("file.txt")
val wordCountRDD = linesRDD
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

val (word, (sum, n)) = wordCountRDD
  .map { case (word, count) => (word, (count, 1)) }
  .reduce { case ((word1, (count1, n1)), (word2, (count2, n2))) =>
    ("", (count1 + count2, n1 + n2))
  }
val average = sum.toDouble / n

DataFrame API

// The same computation with the DataFrame API.
import org.apache.spark.sql.functions.avg
import sqlContext.implicits._

val linesDF = sparkContext.textFile("file.txt").toDF("line")
val wordsDF = linesDF.explode("line", "word")((line: String) => line.split(" "))
val wordCountDF = wordsDF.groupBy("word").count()
val average = wordCountDF.agg(avg("count"))

Catalyst Optimizer

Optimization and execution pipeline shared by the DataFrame API and Spark SQL.

Img src - https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html

Analysis

Begins with a relation to be computed.

Builds an "Unresolved Logical Plan".

Applies Catalyst rules to resolve attributes and types (a small eager-analysis sketch follows below).

Diagram: DataFrame → Unresolved Logical Plan → (Catalyst rules) → Logical Plan
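A small illustrative sketch, reusing the hypothetical people DataFrame from earlier: DataFrames are analyzed eagerly, so a column the analyzer cannot resolve fails at this stage rather than when a job runs. The column name here is made up.

import org.apache.spark.sql.AnalysisException

// "no_such_column" cannot be resolved against people's schema,
// so the error surfaces during analysis, before any job is launched.
try {
  people.select("no_such_column")
} catch {
  case e: AnalysisException => println(s"Failed during analysis: ${e.getMessage}")
}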

Logical Optimizations

Snippet src - https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html

● Applies standard rule-based optimizations to the logical plan.

● Includes operations like -
  ➢ Constant folding (see the sketch after the rule below)
  ➢ Projection pruning
  ➢ Predicate pushdown
  ➢ Boolean expression simplification

object DecimalAggregates extends Rule[LogicalPlan] {
  /** Maximum number of decimal digits in a Long */
  val MAX_LONG_DIGITS = 18

  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Sum(e @ DecimalType.Expression(prec, scale)) if prec + 10 <= MAX_LONG_DIGITS =>
      MakeDecimal(Sum(UnscaledValue(e)), prec + 10, scale)
  }
}
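A quick way to observe constant folding, assuming the sqlContext and people DataFrame from the earlier sketches: the optimized plan shows the pre-computed literal 6 instead of the 1 + 2 + 3 arithmetic.

import org.apache.spark.sql.functions._

// explain(true) prints the parsed, analyzed, optimized and physical plans;
// the optimizer folds lit(1) + lit(2) + lit(3) into the constant 6.
people.select((lit(1) + lit(2) + lit(3)).as("six")).explain(true)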

Physical Planning

● Generates one or more physical plans from the optimized logical plan.

● Selects one of them using a cost model.

Diagram: Optimized Logical Plan → Physical Plans → Cost Model → Selected Physical Plan

Code Generation

● Generates Java bytecode to speed up expression evaluation.

● Uses the Scala language feature quasiquotes.

● Quasiquotes allow programmatic construction of abstract syntax trees (ASTs), as in the snippet and sketch below.

def compile(node: Node): AST = node match {
  case Literal(value)   => q"$value"
  case Attribute(name)  => q"row.get($name)"
  case Add(left, right) => q"${compile(left)} + ${compile(right)}"
}

Snippet src - https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
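For context, a minimal self-contained sketch of quasiquotes on their own (plain Scala 2.11 reflection, outside Spark; evaluating the tree needs the scala-compiler artifact on the classpath):

import scala.reflect.runtime.universe._
import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox

// q"..." builds a syntax tree rather than a value; $splices insert sub-trees.
val one   = q"1"
val two   = q"2"
val added = q"$one + $two"   // the AST for `1 + 2`

// A toolbox can type-check and evaluate the generated tree at runtime.
val toolbox = currentMirror.mkToolBox()
println(toolbox.eval(added)) // prints 3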

Example

val tweets = sqlContext.read.json("tweets.json")

tweets
  .select("tweetId", "username", "timestamp")
  .filter("timestamp > 0")
  .explain(extended = true)

== Parsed Logical Plan ==
'Filter ('timestamp > 0)
 Project [tweetId#15L,username#16,timestamp#14L]
  Relation[_id#9,content#10,hashtags#11,score#12,session#13,timestamp#14L,tweetId#15L,username#16] JSONRelation[file:/home/knoldus/data/json/tweets.json]

== Analyzed Logical Plan ==
tweetId: bigint, username: string, timestamp: bigint
Filter (timestamp#14L > cast(0 as bigint))
 Project [tweetId#15L,username#16,timestamp#14L]
  Relation[_id#9,content#10,hashtags#11,score#12,session#13,timestamp#14L,tweetId#15L,username#16] JSONRelation[file:/home/knoldus/data/json/tweets.json]

== Optimized Logical Plan ==
Project [tweetId#15L,username#16,timestamp#14L]
 Filter (timestamp#14L > 0)
  Relation[_id#9,content#10,hashtags#11,score#12,session#13,timestamp#14L,tweetId#15L,username#16] JSONRelation[file:/home/knoldus/data/json/tweets.json]

== Physical Plan ==
Filter (timestamp#14L > 0)
 Scan JSONRelation[file:/home/knoldus/data/json/tweets.json][tweetId#15L,username#16,timestamp#14L]

Example (contd.)

Diagram: plan trees for the query above -
  Logical Plan: filter over project over tweets
  Optimized Logical Plan: project over filter over tweets
  Physical Plan: filter over Scan (tweets), with the projection pushed into the scan

Demo

Download Code

https://github.com/knoldus/spark-dataframes-meetup

References

http://spark.apache.org/

Spark Summit EU 2015

Deep Dive into Spark SQL’s Catalyst Optimizer

Spark SQL: Relational Data Processing in Spark

Spark SQL and DataFrame Programming Guide

Introducing DataFrames in Spark for Large Scale Data Science

Beyond SQL: Speeding up Spark with DataFrames

Presenter: himanshu@knoldus.com
@himanshug735

Organizer: @Knolspeak
http://www.knoldus.com
http://blog.knoldus.com

Thanks