Scala and Spark: Coevolving Ecosystems for Big Data

  • Scala and Spark: Coevolving Ecosystems for Big Data

    John Nestor, 47 Degrees

    Datapalooza Seattle, February 10-11, 2016




    Scala · Spark · Scala Impact on Spark · Spark Impact on Scala · Summary · Questions


  • Scala



    Scala History

    Scala created by Martin Odersky (EPFL, Switzerland), who wrote javac, the compiler for Java, and designed Java generics

    2004 Scala announced
    2006 Scala 2.0
    2011 Typesafe started, Scala 2.9
    2012 Scala 2.10
    2014 Scala 2.11
    2015 Scala 2.12-RC1
    2016 Scala now 12 years old and quite mature



    Why Scala?

    Open source
    Strong typing
    Concise, elegant syntax
    Runs on the JVM (Java Virtual Machine)
    Supports both object-oriented and functional styles
    Scales from small, simple programs (REPL) through very large multi-server systems
    Easy to cleanly extend with new libraries and DSLs
    Ideal for parallel and distributed systems



    Scala: Strong Typing and Concise Syntax

    Strong typing, like Java:
    Compile-time checks
    Better modularity via strongly typed interfaces
    Easier maintenance: types make code easier to understand

    Concise syntax, like Python:
    Type inference: the compiler infers most types that had to be explicit in Java
    Powerful syntax that avoids much of the boilerplate of Java code (see next slide)

    Best of both worlds: the safety of strong typing with the conciseness of Python
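
    A small illustration of this point (not from the slides): the compiler infers the types on the right-hand side, yet every expression is still statically checked.

```scala
// Types are inferred, but still checked at compile time.
val name = "Ada"                 // inferred as String
val ages = List(30, 41, 27)      // inferred as List[Int]
val older = ages.map(a => a + 1) // inferred as List[Int]

// Something ill-typed, such as `ages.map(a => a.length)`,
// is rejected by the compiler rather than failing at run time.
println(older) // List(31, 42, 28)
```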



    Scala Case Class

    Java version:

        class User {
            private String name;
            private int age;
            public User(String name, int age) {
                this.name = name;
                this.age = age;
            }
            public int getAge() { return age; }
            public void setAge(int age) { this.age = age; }
        }
        User joe = new User("Joe", 30);

    Scala version:

        case class User(name: String, var age: Int)
        val joe = User("Joe", 30)



    Key Scala Features

    Immutable collections: Seq, Set, Map

    Functional programming: (i: Int) => i + 1
    Functions can be parameters and results
    For collections: map, filter, flatMap, groupBy, and more

    Case classes: case class Person(name: String, age: Int)

    Define Domain Specific Languages (DSLs): Akka, SBT, Spray, ScalaTest, and now Spark
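
    A minimal sketch (not from the talk) that combines these features: an immutable collection of case-class values, with functions passed to filter, map, and groupBy.

```scala
case class Person(name: String, age: Int)

// An immutable sequence of case-class instances.
val people = Seq(Person("Ann", 35), Person("Bob", 17), Person("Cy", 52))

// Functions are values: this predicate can be passed as a parameter.
val isAdult: Person => Boolean = p => p.age >= 18

val adultNames = people.filter(isAdult).map(_.name) // Seq("Ann", "Cy")

// groupBy builds a Map keyed by the function's result (decade of age).
val byDecade = people.groupBy(_.age / 10)
```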



    View Sample Scala Code

    Word Count in Scala
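
    The sample itself is not reproduced in this transcript; the following is a plausible sketch of word count using plain Scala collections.

```scala
val text = "to be or not to be"

val counts: Map[String, Int] =
  text.split("\\s+")             // split into words
      .groupBy(identity)         // word -> all occurrences of that word
      .map { case (w, ws) => (w, ws.length) } // word -> count

// counts: Map("to" -> 2, "be" -> 2, "or" -> 1, "not" -> 1)
```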


  • Spark



    Spark History

    2009 Mesos developed at UC Berkeley AMPLab
    2009 Matei Zaharia starts Spark as a test case for Mesos
    2012 Spark 0.5
    2014 Spark 1.0, top-level Apache project
    2014 Databricks started
    2016 Spark 1.6



    Why Spark?

    Support for not only batch but also (near) real-time processing
    Fast: keeps data in memory as much as possible, often 10X to 100X Hadoop speed
    A clean, easy-to-use API
    A richer set of functional operations than just map and reduce
    A foundation for a wide set of integrated data applications
    Can recover from failures: recompute, or (optional) replication
    Scalable to very large data sets with reduced time


    Spark Components

    Spark Core: scalable multi-node cluster; failure detection and recovery; RDDs, DataFrames, and functional operations

    MLlib, for machine learning: linear regression, SVMs, clustering, collaborative filtering, dimension reduction, and more on the way!

    GraphX, for graph computation

    Streaming, for near real-time processing



    Spark RDDs

    RDD[T]: resilient distributed dataset
    Typed, immutable, ordered; can be processed in parallel
    Lazy evaluation permits more global optimizations

    Rich set of functional operations (a small sample): map, flatMap, reduce, filter, groupBy, sortBy, take, drop, and more



    Spark DataFrames

    Similar to SQL tables; can be transformed using SQL

    The focus of much of the work on Spark performance optimization

    Unlike RDDs, the optimizer knows about fields
    Dynamically typed

    Not a natural fit for Scala (more on this later)



    View Sample Spark Code - RDDs

    Word count for Spark, written in Scala
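
    The sample is not included in this transcript; a typical Spark 1.x RDD word count looks roughly like the sketch below. It assumes a SparkContext `sc` (as in spark-shell) and uses "input.txt" and "counts" as placeholder paths.

```scala
// Assumes a SparkContext `sc` is already in scope (e.g. in spark-shell);
// "input.txt" and "counts" are placeholder paths.
val counts = sc.textFile("input.txt")
  .flatMap(line => line.split("\\s+")) // one record per word
  .map(word => (word, 1))              // a pair RDD: (word, 1)
  .reduceByKey(_ + _)                  // sum the counts per word, in parallel

counts.saveAsTextFile("counts")        // the action triggers lazy evaluation
```

Note how the first three lines build a lineage graph without computing anything; only the final action forces evaluation.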



    View Sample Spark Code - Data Frames

    Tweet language count using DataFrames, written in Scala
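
    Again, the sample itself is not in the transcript; with the Spark 1.6 DataFrame API, a per-language tweet count might look roughly like this. It assumes a SQLContext `sqlContext` and a placeholder file of tweet JSON objects with a `lang` field.

```scala
// Placeholder input: a file of tweet JSON objects with a `lang` field;
// assumes a SQLContext `sqlContext` as in a Spark 1.6 shell.
val tweets = sqlContext.read.json("tweets.json")

// Dynamically typed: "lang" is a string column name checked at run time,
// not a field the Scala compiler knows about.
val byLang = tweets.groupBy("lang").count()

byLang.show()
```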


  • Scala Impact on Spark



    Scala Impact on Spark

    Written in Scala; Scala is the primary API; must use Scala to extend Spark
    Source code as documentation: Scala code

    Design from Scala: functional programming, immutable collections
    Spark is a Scala DSL



    Spark Application Language Choices - 1

    71% Scala, 58% Python
    Code length similar
    Scala faster for RDDs (no need to move serialized data between the JVM and Python)

    Scala is generally faster
    Scala provides strong typing: compile-time checks; code is easier to understand; types help when writing and maintaining code



    Spark Application Language Choices - 2

    In addition to Scala and Python, Spark also supports Java and R

    If you want to build scalable, high-performance production code based on Spark:
    R by itself is too specialized
    Python is too slow
    Java is tedious to write and maintain
    Scala is just right



    Transformations: Parallel versus Serial

    Scala has operations that traverse sequence elements in serial order: foldLeft, scanLeft

    Spark RDDs can only process sequence elements in parallel: this enables full use of multiple workers with multiple cores

    Serial operations like foldLeft would be slow compared to the current parallel operations

    But there are cases where foldLeft or scanLeft is really needed
    Example: time series where an early event can affect a later event
    Example: Sherlock, word count by story

    So should Spark add foldLeft or similar operators?
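
    A small illustration (in plain Scala, not Spark) of why these operations are inherently serial: each step depends on the accumulated result of everything before it, as in a running account balance.

```scala
val deposits = List(100, -30, 50, -20)

// foldLeft: one final value, computed strictly left to right.
val balance = deposits.foldLeft(0)(_ + _) // 100

// scanLeft: every intermediate balance, in order.
val history = deposits.scanLeft(0)(_ + _) // List(0, 100, 70, 120, 100)

// Because step n needs the result of step n-1, this cannot be
// split freely across workers the way map or reduce can.
```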


    Sample Spark Internal Code

    Let's examine the internal code in Spark: SparkContext, RDD


  • Spark Impact on Scala



    Pair RDDs for Scala

    Scala collections:
                   Sequence    Key-Value
      Unique       Set         Map
      Duplicates   Seq         ??? (Odersky)

    Spark collections:
                   Sequence    Key-Value
      Duplicates   RDD         Pair RDD



    Lazy Evaluation

    Strict evaluation (Scala): each transformation is evaluated as it is seen; easy to understand and debug

    Lazy evaluation (Spark): transformations are collected into a lineage graph; evaluated only when a final action is applied; allows cross-transformation optimization


    Lazy Evaluation in Scala

    Scala has Stream (a kind of sequence): elements are evaluated in order, as needed; allows infinite collections; but no cross-transformation optimization

    Lazy Spark evaluation does all elements at once: enables parallelism

    Scala may add lazy versions of its collections that are more like Spark (Odersky)
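
    A quick sketch of Scala's lazy sequences as of this talk's era (Stream, later renamed LazyList): elements are computed only as they are demanded, one at a time.

```scala
// An infinite stream of natural numbers; nothing is computed yet.
val naturals = Stream.from(1)

// map is also lazy: still no elements have been evaluated.
val doubled = naturals.map(_ * 2)

// Forcing the first three elements evaluates exactly those three,
// in order; Spark, by contrast, evaluates whole partitions in parallel.
val firstThree = doubled.take(3).toList // List(2, 4, 6)
```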


  • Copyright 2015 47 Degrees

    Spark Clusters - Serialization

    [Diagram: your Spark app's SparkContext runs in the Spark driver, which connects through a cluster manager to worker nodes, each running tasks. Static code ships to workers as a jar file; dynamic code ships as serialized closures.]


    Dynamic Code Serialization

    Depends on values known at run time
    Often a function passed to Spark transformations such as map, filter, and flatMap

    Sent from driver to workers: serialized on the driver to a byte array, deserialized on each worker



    Serialization of Closures

    Closures often include more than needed (or expected) (demo)
    Scala may become smarter about including less in closure serialization (Odersky)
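
    One common way this shows up (an illustrative sketch, not the demo from the talk): a closure that touches a field captures the whole enclosing object, so copying the needed value into a local first keeps the serialized closure small.

```scala
class Lookup(bigTable: Map[String, Int]) extends Serializable {
  // Captures `this`, so serializing this closure drags bigTable along.
  def addSizeCapturing: Int => Int =
    i => i + bigTable.size

  // Copy just what is needed into a local val; only `n` is captured.
  def addSizeLocal: Int => Int = {
    val n = bigTable.size
    i => i + n
  }
}

val f = new Lookup(Map("a" -> 1, "b" -> 2)).addSizeLocal
// f(10) == 12, but the closure carries only the Int, not the Map
```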




    Datasets

    DataFrames are becoming the focus of Spark optimization
    DataFrames are dynamically typed; the Scala API would be nicer if there were static typing
    Datasets (new and experimental in Spark 1.6) attempt to solve this

    Sample code

    It would be nice if Databricks and Typesafe could work together to produce a better solution
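
    A sketch of the experimental Spark 1.6 Dataset API, which layers static types over DataFrames. The names are placeholders, and it assumes a SQLContext `sqlContext` with its implicits imported (needed for the encoders).

```scala
// Assumes: import sqlContext.implicits._ in a Spark 1.6 shell;
// "tweets.json" is a placeholder file of tweet JSON objects.
case class Tweet(lang: String, text: String)

// Converting the DataFrame to a Dataset recovers static typing.
val tweets = sqlContext.read.json("tweets.json").as[Tweet]

// `_.lang` is now checked by the Scala compiler, unlike the
// string column name "lang" in the DataFrame API.
val byLang = tweets.groupBy(_.lang).count()
```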


  • Scala and Spark in Seattle



    Scala and Spark in Seattle

    Seattle Meetups:
    Scala at the Sea Meetup (over 1000 members)
    Seattle Spark Meetup (over 1400 members)

    47 Degrees Training (Seattle and worldwide):
    Typesafe Scala Training: Scala, Akka
    Spark Training: Programming Spark with Scala

    UW Scala Professional Certificate Program