An Overview of Spark DataFrames with Scala

  • An Overview of Spark DataFrames with Scala

    Himanshu Gupta
    Sr. Software Consultant
    Knoldus Software LLP

    http://www.knoldus.com/

  • Who am I?

    Himanshu Gupta (@himanshug735)

    Spark Certified Developer

    Apache Spark Third-Party Package contributor - spark-streaming-gnip
    (http://spark-packages.org/package/knoldus/spark-streaming-gnip)

    Sr. Software Consultant at Knoldus Software LLP
    (http://www.knoldus.com/about/team/himanshu.knol)

    Img src - https://karengately.files.wordpress.com/2013/06/who-am-i.jpg

    http://www.knoldus.com/

  • Agenda

    What is Spark?

    What is a DataFrame?

    Why do we need DataFrames?

    A brief example

    Demo

    http://www.knoldus.com/

  • Apache Spark

    Distributed compute engine for large-scale data processing.

    Up to 100x faster than Hadoop MapReduce (for in-memory workloads).

    Provides APIs in Python, Scala, Java and R (R support added in Spark 1.4).

    Combines SQL, streaming and complex analytics.

    Runs on Hadoop, Mesos, or in the cloud.

    Img src - http://spark.apache.org/

    http://www.knoldus.com/

  • Spark DataFrames

    Distributed collection of data organized into named columns (formerly SchemaRDD).

    Domain-specific language for common tasks: project, filter, aggregate, join, sampling, UDFs and metadata access (see the sketch below).

    Available in Python, Scala, Java and R (in Spark 1.4).

    http://www.knoldus.com/
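
    A minimal sketch of that DSL, using the Spark 1.4-era API covered in this talk. The Person case class, app name and local master are illustrative assumptions, not from the slides:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)

    object DataFrameSketch extends App {
      val sparkContext = new SparkContext(
        new SparkConf().setAppName("df-overview").setMaster("local[*]"))
      val sqlContext = new SQLContext(sparkContext)
      import sqlContext.implicits._

      // Build a DataFrame from a local collection; the column names
      // come from the case class fields.
      val people = sparkContext
        .parallelize(Seq(Person("Alice", 29), Person("Bob", 35)))
        .toDF()

      // The DSL in action: filter, project, aggregate.
      people.filter($"age" > 30).select("name").show()
      people.groupBy().avg("age").show()
    }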

  • Google Trends for DataFrames

    Img src - https://www.google.co.in/trends/explore#q=dataframes&date=1%2F2011%2056m&cmpt=q&tz=Etc%2FGMT-5%3A30

    http://www.knoldus.com/

  • Speed of Spark DataFrames!

    Img src - https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

    http://www.knoldus.com/

  • RDD API vs DataFrame API

    RDD API

    val linesRDD = sparkContext.textFile("file.txt")
    val wordCountRDD = linesRDD
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)

    // Average count per word: carry (sum, n) pairs through the reduce.
    // (The word in the result is unused, hence the empty string.)
    val (word, (sum, n)) = wordCountRDD
      .map { case (word, count) => (word, (count, 1)) }
      .reduce { case ((word1, (count1, n1)), (word2, (count2, n2))) =>
        ("", (count1 + count2, n1 + n2))
      }
    val average = sum.toDouble / n

    DataFrame API

    // Requires: import sqlContext.implicits._  (for toDF)
    //           import org.apache.spark.sql.functions.avg
    val linesDF = sparkContext.textFile("file.txt").toDF("line")
    val wordsDF = linesDF.explode("line", "word")((line: String) => line.split(" "))
    val wordCountDF = wordsDF.groupBy("word").count()
    val average = wordCountDF.agg(avg("count"))

    http://www.knoldus.com/
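
    Because the DataFrame version is declarative, Catalyst sees the whole computation and can optimize it. One quick way to confirm this (not on the slide) is to print the plans it builds:

    // Prints the parsed, analyzed, optimized and physical plans
    // Catalyst derives for the word-count aggregation above.
    wordCountDF.explain(true)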

  • Catalyst Optimizer

    Optimization and execution planning shared by DataFrames and Spark SQL.

    Img src - https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html

    http://www.knoldus.com/

  • Analysis

    Begins with a relation to be computed.

    Builds an unresolved logical plan.

    Applies Catalyst rules to resolve it.

    DataFrame → Unresolved Logical Plan → Catalyst Rules

    http://www.knoldus.com/
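
    A small illustration of this step, reusing the people DataFrame from the sketch above (the column name here is hypothetical). DataFrames are analyzed eagerly, so an unresolvable reference fails at analysis time, before any job runs:

    import org.apache.spark.sql.AnalysisException

    try {
      // "nickname" is not in the schema, so the analyzer
      // rejects the plan immediately.
      people.select("nickname")
    } catch {
      case e: AnalysisException => println(s"Analysis failed: ${e.getMessage}")
    }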

  • Logical Optimizations

    Applies standard rule-based optimizations to the logical plan.

    Includes operations like constant folding, projection pruning, predicate pushdown and Boolean expression simplification.

    object DecimalAggregates extends Rule[LogicalPlan] {
      /** Maximum number of decimal digits in a Long */
      val MAX_LONG_DIGITS = 18

      def apply(plan: LogicalPlan): LogicalPlan = {
        plan transformAllExpressions {
          case Sum(e @ DecimalType.Expression(prec, scale))
              if prec + 10 <= MAX_LONG_DIGITS =>
            MakeDecimal(Sum(UnscaledValue(e)), prec + 10, scale)
        }
      }
    }

    Snippet src - https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html

    http://www.knoldus.com/
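
    To watch predicate pushdown happen, one option (not from the slides) is to compare the parsed and optimized plans that explain(true) prints for a project-then-filter query over the people DataFrame from the earlier sketch:

    // In the parsed plan the Filter sits above the Project; in the
    // optimized logical plan it is usually pushed below it, closer
    // to the data source.
    people
      .select($"name", $"age")
      .filter($"age" > 21)
      .explain(true)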

  • Physical Planning

    Generates one or more physical plans from the logical plan.

    Selects a plan using a cost model.

    Optimized Logical Plan → Physical Plans → Cost Model → Selected Physical Plan

    http://www.knoldus.com/
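
    One concrete cost-based choice in Spark of this era (an illustration, not from the slides): when one side of a join is estimated to be smaller than spark.sql.autoBroadcastJoinThreshold, the planner selects a broadcast join instead of a shuffle-based one. The facts and dims DataFrames here are hypothetical:

    // With a small `dims` side, the selected physical plan typically
    // shows BroadcastHashJoin rather than a shuffle join.
    val joined = facts.join(dims, facts("dimId") === dims("id"))
    joined.explain()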

  • Code Generation

    Generates Java bytecode for speed of execution.

    Uses Scala quasiquotes.

    Quasiquotes allow programmatic construction of ASTs.

    def compile(node: Node): AST = node match {
      case Literal(value)   => q"$value"
      case Attribute(name)  => q"row.get($name)"
      case Add(left, right) => q"${compile(left)} + ${compile(right)}"
    }

    Snippet src - https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html

    http://www.knoldus.com/
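
    For a standalone taste of quasiquotes outside Catalyst, a minimal sketch (assuming Scala 2.11 with scala-reflect on the classpath):

    import scala.reflect.runtime.universe._

    // q"..." builds a typed AST rather than a string; $-splices insert
    // other trees or liftable values into it.
    val x = 10
    val tree = q"$x + 4"

    println(showCode(tree)) // prints: 10 + 4
    println(showRaw(tree))  // prints the raw AST behind that expression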

  • Example

    val tweets = sqlContext.read.json("tweets.json")

    tweets
      .select("tweetId", "username", "timestamp")
      .filter("timestamp > 0")
      .explain(extended = true)

    == Parsed Logical Plan ==
    'Filter ('timestamp > 0)
     Project [tweetId#15L,username#16,timestamp#14L]
      Relation[_id#9,content#10,hashtags#11,score#12,session#13,timestamp#14L,tweetId#15L,username#16] JSONRelation[file:/home/knoldus/data/json/tweets.json]

    == Analyzed Logical Plan ==
    tweetId: bigint, username: string, timestamp: bigint
    Filter (timestamp#14L > cast(0 as bigint))
     Project [tweetId#15L,username#16,timestamp#14L]
      Relation[_id#9,content#10,hashtags#11,score#12,session#13,timestamp#14L,tweetId#15L,username#16] JSONRelation[file:/home/knoldus/data/json/tweets.json]

    == Optimized Logical Plan ==
    Project [tweetId#15L,username#16,timestamp#14L]
     Filter (timestamp#14L > 0)
      Relation[_id#9,content#10,hashtags#11,score#12,session#13,timestamp#14L,tweetId#15L,username#16] JSONRelation[file:/home/knoldus/data/json/tweets.json]

    == Physical Plan ==
    Filter (timestamp#14L > 0)
     Scan JSONRelation[file:/home/knoldus/data/json/tweets.json][tweetId#15L,username#16,timestamp#14L]

    http://www.knoldus.com/

  • Example (contd.)

    [Plan diagrams for the previous slide's explain output: the parsed logical plan applies the filter above the project over tweets; the optimized logical plan pushes the filter below the project; the physical plan folds the projection into the scan itself, leaving just a Filter over Scan (tweets).]

    http://www.knoldus.com/

  • Demo

    http://www.knoldus.com/

  • Download Code

    https://github.com/knoldus/spark-dataframes-meetup

    http://www.knoldus.com/

  • References

    Apache Spark - http://spark.apache.org/

    Spark Summit EU 2015 - https://spark-summit.org/eu-2015/

    Deep Dive into Spark SQL's Catalyst Optimizer - https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html

    Spark SQL: Relational Data Processing in Spark - http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf

    Spark SQL and DataFrame Programming Guide - http://spark.apache.org/docs/latest/sql-programming-guide.html

    Introducing DataFrames in Spark for Large Scale Data Science - http://www.slideshare.net/databricks/introducing-dataframes-in-spark-for-large-scale-data-science

    Beyond SQL: Speeding up Spark with DataFrames - http://www.slideshare.net/databricks/spark-sqlsse2015public

    http://www.knoldus.com/

  • Presenter: himanshu@knoldus.com (@himanshug735)

    Organizer: @Knolspeak

    http://www.knoldus.com
    http://blog.knoldus.com

    Thanks
