An Overview of Spark DataFrames with Scala


Himanshu Gupta, Sr. Software Consultant, Knoldus Software LLP

Who am I?

Himanshu Gupta (@himanshug735)

Spark Certified Developer

Apache Spark Third-Party Package contributor - spark-streaming-gnip

Sr. Software Consultant at Knoldus Software LLP

Img src - https://karengately.files.wordpress.com/2013/06/who-am-i.jpg

Agenda

● What is Spark?

● What is a DataFrame?

● Why do we need DataFrames?

● A brief example

● Demo

Apache Spark

● Distributed compute engine for large-scale data processing.

● Up to 100x faster than Hadoop MapReduce for in-memory workloads.

● Provides APIs in Python, Scala, Java and R (R support added in Spark 1.4) - a minimal Scala setup sketch follows below.

● Combines SQL, streaming and complex analytics.

● Runs on Hadoop, Mesos, standalone, or in the cloud.

Img src - http://spark.apache.org/
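A minimal sketch of a Spark 1.x-style Scala entry point (the application name and local master are illustrative, not from the original deck):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// A SparkContext drives RDD jobs; a SQLContext adds the DataFrame / SQL API on top of it.
val conf = new SparkConf().setAppName("dataframes-overview").setMaster("local[*]")
val sparkContext = new SparkContext(conf)
val sqlContext = new SQLContext(sparkContext)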

Spark DataFrames

● Distributed collection of data organized into named columns (formerly SchemaRDD).

● Domain Specific Language for common tasks (see the sketch below) -
  ➢ UDFs
  ➢ Sampling
  ➢ Project, filter, aggregate, join, ...
  ➢ Metadata

● Available in Python, Scala, Java and R (R support added in Spark 1.4).
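A minimal sketch of the DataFrame DSL in Scala, assuming the sqlContext from the previous sketch and a hypothetical people.json file with name, age and city fields:

import org.apache.spark.sql.functions._

// Load a DataFrame from JSON (hypothetical input file).
val people = sqlContext.read.json("people.json")

// Project, filter and aggregate with the DSL.
people
  .select("name", "age", "city")
  .filter(col("age") > 21)
  .groupBy("city")
  .agg(avg("age").as("avgAge"))
  .show()

// A simple UDF usable in DataFrame expressions.
val upper = udf((s: String) => s.toUpperCase)
people.select(upper(col("name")).as("name_upper")).show()

// Sampling and metadata.
val sampled = people.sample(withReplacement = false, fraction = 0.1)
people.printSchema()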

Google Trends for DataFrames

Img src - https://www.google.co.in/trends/explore#q=dataframes&date=1%2F2011%2056m&cmpt=q&tz=Etc%2FGMT-5%3A30

Speed of Spark DataFrames!

Img src - https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

RDD API vs DataFrame API

RDD API

// Word count, then the average count per distinct word, with the RDD API.
val linesRDD = sparkContext.textFile("file.txt")
val wordCountRDD = linesRDD
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

val (word, (sum, n)) = wordCountRDD
  .map { case (word, count) => (word, (count, 1)) }
  .reduce { case ((word1, (count1, n1)), (word2, (count2, n2))) =>
    ("", (count1 + count2, n1 + n2))
  }
val average = sum.toDouble / n

DataFrame API

// The same computation with the DataFrame API.
import org.apache.spark.sql.functions.avg
import sqlContext.implicits._

val linesDF = sparkContext.textFile("file.txt").toDF("line")
val wordsDF = linesDF.explode("line", "word")((line: String) => line.split(" "))
val wordCountDF = wordsDF.groupBy("word").count()
val average = wordCountDF.agg(avg("count"))

Catalyst Optimizer

Optimization and execution pipeline shared by the DataFrame API and Spark SQL.

Img src - https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html

Analysis

Begins with a relation to be computed.

Builds an "Unresolved Logical Plan".

Applies Catalyst rules to resolve attributes and types (a small eager-analysis sketch follows below).

Diagram: DataFrame → Unresolved Logical Plan → (Catalyst rules) → Logical Plan
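A small illustrative sketch, reusing the hypothetical people DataFrame from earlier: DataFrames are analyzed eagerly, so a column the analyzer cannot resolve fails at this stage rather than when a job runs. The column name here is made up.

import org.apache.spark.sql.AnalysisException

// "no_such_column" cannot be resolved against people's schema,
// so the error surfaces during analysis, before any job is launched.
try {
  people.select("no_such_column")
} catch {
  case e: AnalysisException => println(s"Failed during analysis: ${e.getMessage}")
}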

Logical Optimizations

Snippet src - https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html

● Applies standard rule-based optimizations to the logical plan.

● Includes operations like -
  ➢ Constant folding (see the sketch after the rule below)
  ➢ Projection pruning
  ➢ Predicate pushdown
  ➢ Boolean expression simplification

object DecimalAggregates extends Rule[LogicalPlan] {
  /** Maximum number of decimal digits in a Long */
  val MAX_LONG_DIGITS = 18

  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Sum(e @ DecimalType.Expression(prec, scale)) if prec + 10 <= MAX_LONG_DIGITS =>
      MakeDecimal(Sum(UnscaledValue(e)), prec + 10, scale)
  }
}
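A quick way to observe constant folding, assuming the sqlContext and people DataFrame from the earlier sketches: the optimized plan shows the pre-computed literal 6 instead of the 1 + 2 + 3 arithmetic.

import org.apache.spark.sql.functions._

// explain(true) prints the parsed, analyzed, optimized and physical plans;
// the optimizer folds lit(1) + lit(2) + lit(3) into the constant 6.
people.select((lit(1) + lit(2) + lit(3)).as("six")).explain(true)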

Physical Planning

● Generates one or more physical plans from the optimized logical plan.

● Selects one of them using a cost model.

Diagram: Optimized Logical Plan → Physical Plans → Cost Model → Selected Physical Plan

Code Generation

● Generates Java bytecode to speed up expression evaluation.

● Uses the Scala language feature quasiquotes.

● Quasiquotes allow programmatic construction of abstract syntax trees (ASTs), as in the snippet and sketch below.

def compile(node: Node): AST = node match {
  case Literal(value)   => q"$value"
  case Attribute(name)  => q"row.get($name)"
  case Add(left, right) => q"${compile(left)} + ${compile(right)}"
}

Snippet src - https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
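For context, a minimal self-contained sketch of quasiquotes on their own (plain Scala 2.11 reflection, outside Spark; evaluating the tree needs the scala-compiler artifact on the classpath):

import scala.reflect.runtime.universe._
import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox

// q"..." builds a syntax tree rather than a value; $splices insert sub-trees.
val one   = q"1"
val two   = q"2"
val added = q"$one + $two"   // the AST for `1 + 2`

// A toolbox can type-check and evaluate the generated tree at runtime.
val toolbox = currentMirror.mkToolBox()
println(toolbox.eval(added)) // prints 3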

Example

val tweets = sqlContext.read.json("tweets.json")

tweets
  .select("tweetId", "username", "timestamp")
  .filter("timestamp > 0")
  .explain(extended = true)

== Parsed Logical Plan ==
'Filter ('timestamp > 0)
 Project [tweetId#15L,username#16,timestamp#14L]
  Relation[_id#9,content#10,hashtags#11,score#12,session#13,timestamp#14L,tweetId#15L,username#16] JSONRelation[file:/home/knoldus/data/json/tweets.json]

== Analyzed Logical Plan ==
tweetId: bigint, username: string, timestamp: bigint
Filter (timestamp#14L > cast(0 as bigint))
 Project [tweetId#15L,username#16,timestamp#14L]
  Relation[_id#9,content#10,hashtags#11,score#12,session#13,timestamp#14L,tweetId#15L,username#16] JSONRelation[file:/home/knoldus/data/json/tweets.json]

== Optimized Logical Plan ==
Project [tweetId#15L,username#16,timestamp#14L]
 Filter (timestamp#14L > 0)
  Relation[_id#9,content#10,hashtags#11,score#12,session#13,timestamp#14L,tweetId#15L,username#16] JSONRelation[file:/home/knoldus/data/json/tweets.json]

== Physical Plan ==
Filter (timestamp#14L > 0)
 Scan JSONRelation[file:/home/knoldus/data/json/tweets.json][tweetId#15L,username#16,timestamp#14L]

Example (contd.)

Diagram: plan trees for the query above -
  Logical Plan: filter over project over tweets
  Optimized Logical Plan: project over filter over tweets
  Physical Plan: filter over Scan (tweets), with the projection pushed into the scan

Demo

Download Code

https://github.com/knoldus/spark-dataframes-meetup

References

http://spark.apache.org/

Spark Summit EU 2015

Deep Dive into Spark SQL’s Catalyst Optimizer

Spark SQL: Relational Data Processing in Spark

Spark SQL and DataFrame Programming Guide

Introducing DataFrames in Spark for Large Scale Data Science

Beyond SQL: Speeding up Spark with DataFrames

Presenter: himanshu@knoldus.com
@himanshug735

Organizer: @Knolspeak
http://www.knoldus.com
http://blog.knoldus.com

Thanks