An Overview of Spark DataFrames with Scala


Transcript of An Overview of Spark DataFrames with Scala

Page 1: An Overview of Spark DataFrames with Scala

An Overview of Spark DataFrames with Scala

Himanshu Gupta, Sr. Software Consultant, Knoldus Software LLP

Page 2: An Overview of Spark DataFrames with Scala

Who am I?

Himanshu Gupta (@himanshug735)

Spark Certified Developer

Apache Spark Third-Party Package contributor - spark-streaming-gnip

Sr. Software Consultant at Knoldus Software LLP


Img src - https://karengately.files.wordpress.com/2013/06/who-am-i.jpg

Page 3: An Overview of Spark DataFrames with Scala

Agenda

● What is Spark?

● What is a DataFrame?

● Why do we need DataFrames?

● A brief example

● Demo


Page 4: An Overview of Spark DataFrames with Scala

Apache Spark

● Distributed compute engine for large-scale data processing.

● Up to 100x faster than Hadoop MapReduce for in-memory workloads.

● Provides APIs in Python, Scala, Java and R (R support added in Spark 1.4); a minimal Scala sketch follows below.

● Combines SQL, streaming and complex analytics.

● Runs on Hadoop, Mesos, standalone, or in the cloud.


Img src - http://spark.apache.org/
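
A minimal, hedged sketch of what using the core Scala API looks like (assuming Spark 1.x running in local mode and a hypothetical input file named file.txt):

    import org.apache.spark.{SparkConf, SparkContext}

    // Local-mode context, convenient for experimenting on a single machine
    val conf = new SparkConf().setAppName("spark-overview").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("file.txt")   // hypothetical input file
    println(s"number of lines: ${lines.count()}")

    sc.stop()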

Page 5: An Overview of Spark DataFrames with Scala

Spark DataFrames

● Distributed collection of data organized into named columns (i.e. SchemaRDD).

● Domain Specific Language (DSL) for common tasks (sketched below):
➢ UDFs
➢ Sampling
➢ Project, filter, aggregate, join, …
➢ Metadata

● Available in Python, Scala, Java and R (in Spark 1.4).
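
A minimal sketch of this DSL (assuming Spark 1.4, a SQLContext named sqlContext, and a hypothetical people.json file with name, age and city fields):

    import org.apache.spark.sql.functions.{avg, udf}

    val people = sqlContext.read.json("people.json")   // hypothetical input

    // Metadata
    people.printSchema()

    // Sampling
    val sampled = people.sample(withReplacement = false, fraction = 0.1)

    // Project, filter, aggregate
    people.filter(people("age") > 21)
      .groupBy("city")
      .agg(avg("age"))
      .show()

    // A UDF used inside the DSL
    val initial = udf((name: String) => name.take(1))
    people.select(initial(people("name"))).show()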

Page 6: An Overview of Spark DataFrames with Scala

Google Trends for DataFrames

Img src - https://www.google.co.in/trends/explore#q=dataframes&date=1%2F2011%2056m&cmpt=q&tz=Etc%2FGMT-5%3A30

Page 7: An Overview of Spark DataFrames with Scala

Speed of Spark DataFrames!

Img src - https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html?utm_content=12098664&utm_medium=social&utm_source=twitter


Page 8: An Overview of Spark DataFrames with Scala

RDD API vs DataFrame API

RDD API

    val linesRDD = sparkContext.textFile("file.txt")
    val wordCountRDD = linesRDD.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    // Average count per distinct word, computed by hand over the pair RDD
    val (word, (sum, n)) = wordCountRDD
      .map { case (word, count) => (word, (count, 1)) }
      .reduce { case ((word1, (count1, n1)), (word2, (count2, n2))) =>
        ("", (count1 + count2, n1 + n2))
      }
    val average = sum.toDouble / n

DataFrame API

    val linesDF = sparkContext.textFile("file.txt").toDF("line")
    val wordsDF = linesDF.explode("line", "word")((line: String) => line.split(" "))
    val wordCountDF = wordsDF.groupBy("word").count()

    // The same average, expressed declaratively
    val average = wordCountDF.agg(avg("count"))
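
Note that the DataFrame snippet above appears to assume a couple of imports the slide omits; a hedged sketch of the setup (Spark 1.4+):

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.avg

    val sqlContext = new SQLContext(sparkContext)
    import sqlContext.implicits._   // brings toDF into scope for RDDs

    // With these in scope, the DataFrame version compiles as written.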

Page 9: An Overview of Spark DataFrames with Scala

Catalyst Optimizer

The optimization and execution pipeline is shared by DataFrames and Spark SQL.

Img src - https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html

Page 10: An Overview of Spark DataFrames with Scala

Analysis

● Begins with a relation to be computed.

● Builds an “Unresolved Logical Plan”.

● Applies Catalyst rules.

[Diagram: DataFrame → Unresolved Logical Plan → Catalyst Rules]
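
A hedged way to watch these stages on any DataFrame you have built (the same call the example on Page 14 uses):

    // Prints the Parsed (unresolved), Analyzed, Optimized and Physical plans
    myDataFrame.explain(extended = true)   // myDataFrame is a placeholder for any DataFrame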

Page 11: An Overview of Spark DataFrames with Scala

Logical Optimizations

Snippet src - https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html

● Applies standard rule-based optimizations to the logical plan.

● Includes optimizations such as:
➢ Constant folding
➢ Projection pruning
➢ Predicate pushdown
➢ Boolean expression simplification

    object DecimalAggregates extends Rule[LogicalPlan] {
      /** Maximum number of decimal digits in a Long */
      val MAX_LONG_DIGITS = 18

      def apply(plan: LogicalPlan): LogicalPlan = {
        plan transformAllExpressions {
          case Sum(e @ DecimalType.Expression(prec, scale)) if prec + 10 <= MAX_LONG_DIGITS =>
            MakeDecimal(Sum(UnscaledValue(e)), prec + 10, scale)
        }
      }
    }
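
As an illustrative, non-authoritative sketch of these rules in action (assuming Spark 1.4+ and a hypothetical DataFrame df with a word column), an always-true predicate built from literals should be folded and simplified away in the Optimized Logical Plan, while selecting a single column lets projection pruning narrow the scan:

    import org.apache.spark.sql.functions.lit

    df.filter(lit(1) + lit(2) === lit(3))   // constant folding + boolean simplification
      .select("word")                       // projection pruning
      .explain(extended = true)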

Page 12: An Overview of Spark DataFrames with Scala

Physical Planning

● Generates one or more physical plans from the logical plan.

● Selects one of them using a cost model.

[Diagram: Optimized Logical Plan → Physical Plans → Cost Model → Selected Physical Plan]
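
One place where the cost model is visible to users is join selection. A hedged sketch (assuming Spark 1.5+, where the broadcast hint in org.apache.spark.sql.functions exists, and two hypothetical DataFrames facts and dim sharing a key column):

    import org.apache.spark.sql.functions.broadcast

    // Hinting that dim is small can steer the planner towards a broadcast join
    // instead of a shuffle-based join; explain() shows the strategy that was selected.
    facts.join(broadcast(dim), "key").explain()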

Page 13: An Overview of Spark DataFrames with Scala

Code Generation

● Generates Java bytecode for faster expression evaluation.

● Uses Scala quasiquotes.

● Quasiquotes allow programmatic construction of ASTs (abstract syntax trees).

    def compile(node: Node): AST = node match {
      case Literal(value)   => q"$value"
      case Attribute(name)  => q"row.get($name)"
      case Add(left, right) => q"${compile(left)} + ${compile(right)}"
    }

Snippet src - https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
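
A minimal sketch of quasiquotes themselves, outside Spark (assuming Scala 2.11 with scala-reflect on the classpath):

    import scala.reflect.runtime.universe._

    val one = q"1"
    val two = q"2"
    val added = q"$one + $two"   // programmatically builds the AST for 1 + 2

    println(showRaw(added))      // prints the raw structure of the generated tree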

Page 14: An Overview of Spark DataFrames with Scala

Example

val tweets = sqlContext.read.json("tweets.json")

    tweets
      .select("tweetId", "username", "timestamp")
      .filter("timestamp > 0")
      .explain(extended = true)

    == Parsed Logical Plan ==
    'Filter ('timestamp > 0)
     Project [tweetId#15L,username#16,timestamp#14L]
      Relation[_id#9,content#10,hashtags#11,score#12,session#13,timestamp#14L,tweetId#15L,username#16] JSONRelation[file:/home/knoldus/data/json/tweets.json]

    == Analyzed Logical Plan ==
    tweetId: bigint, username: string, timestamp: bigint
    Filter (timestamp#14L > cast(0 as bigint))
     Project [tweetId#15L,username#16,timestamp#14L]
      Relation[_id#9,content#10,hashtags#11,score#12,session#13,timestamp#14L,tweetId#15L,username#16] JSONRelation[file:/home/knoldus/data/json/tweets.json]

    == Optimized Logical Plan ==
    Project [tweetId#15L,username#16,timestamp#14L]
     Filter (timestamp#14L > 0)
      Relation[_id#9,content#10,hashtags#11,score#12,session#13,timestamp#14L,tweetId#15L,username#16] JSONRelation[file:/home/knoldus/data/json/tweets.json]

    == Physical Plan ==
    Filter (timestamp#14L > 0)
     Scan JSONRelation[file:/home/knoldus/data/json/tweets.json][tweetId#15L,username#16,timestamp#14L]

Page 15: An Overview of Spark DataFrames with Scala

Example (contd.)

[Diagram: plan evolution for the example query]

Logical Plan:           Filter → Project → tweets
Optimized Logical Plan: Project → Filter → tweets (predicate pushed below the projection)
Physical Plan:          Filter → Scan (tweets)

Page 16: An Overview of Spark DataFrames with Scala

Demo

Page 17: An Overview of Spark DataFrames with Scala

Download Code

https://github.com/knoldus/spark-dataframes-meetup

Page 18: An Overview of Spark DataFrames with Scala

References

http://spark.apache.org/

Spark Summit EU 2015

Deep Dive into Spark SQL’s Catalyst Optimizer

Spark SQL: Relational Data Processing in Spark

Spark SQL and DataFrame Programming Guide

Introducing DataFrames in Spark for Large Scale Data Science

Beyond SQL: Speeding up Spark with DataFrames