
  • Spark and Spark SQL

    Amir H. Payberahamir@sics.se

    SICS Swedish ICT

    Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 1 / 71

  • What is Big Data?


  • Big Data

    ... everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.

    - Dan Ariely


  • Big Data

    Big data is the data characterized by 4 key attributes: volume, variety, velocity and value.

    - Oracle

    (Buzzwords)



  • Big Data In Simple Words


  • The Four Dimensions of Big Data

    - Volume: data size

    - Velocity: data generation rate

    - Variety: data heterogeneity

    - This 4th V is for Vacillation: Veracity / Variability / Value


  • How To Store and Process Big Data?


  • Scale Up vs. Scale Out



  • The Big Data Stack


  • Data Analysis


  • Programming Languages


  • Platform - Data Processing


  • Platform - Data Storage


  • Resource Management


  • Spark Processing Engine


  • Why Spark?


  • Motivation (1/4)

    - Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.

    - Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures.

    - E.g., MapReduce


  • Motivation (2/4)

    - The MapReduce programming model was not designed for complex operations, e.g., data mining.

  • Motivation (3/4)

    - Very expensive (slow): it always goes to disk and HDFS.

  • Motivation (4/4)

    - Extends MapReduce with more operators.

    - Support for advanced data flow graphs.

    - In-memory and out-of-core processing.
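    The in-memory processing above is what lets Spark avoid MapReduce's repeated disk round-trips. A minimal sketch of RDD caching, assuming a live SparkContext `sc` (as in the later slides) and a hypothetical input path:

    // Mark an RDD to be kept in memory after its first computation,
    // so subsequent actions reuse it instead of re-reading from HDFS.
    val lines  = sc.textFile("hdfs://...")              // hypothetical source path
    val errors = lines.filter(_.contains("ERROR"))

    errors.cache()     // lazy hint: persist in memory once computed

    errors.count()     // 1st action: reads HDFS, computes, caches
    errors.count()     // 2nd action: served from memory, no HDFS read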

  • Spark vs. MapReduce (1/2)


  • Spark vs. MapReduce (2/2)


  • Challenge

    How to design a distributed memory abstraction that is both fault tolerant and efficient?

    Solution

    Resilient Distributed Datasets (RDD)


  • Resilient Distributed Datasets (RDD) (1/2)

    - A distributed memory abstraction.

    - Immutable collections of objects spread across a cluster, like an immutable LinkedList.

  • Resilient Distributed Datasets (RDD) (2/2)

    - An RDD is divided into a number of partitions, which are atomic pieces of information.

    - Partitions of an RDD can be stored on different nodes of a cluster.

    - Built through coarse-grained transformations, e.g., map, filter, join.

    - Fault tolerance via automatic rebuild (no replication).

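    As a small sketch of the partitioning described above (assuming a SparkContext `sc`), the number of partitions can be chosen when an RDD is created:

    // Split 100 numbers into 4 partitions; each partition can be
    // stored and processed on a different cluster node.
    val nums = sc.parallelize(1 to 100, numSlices = 4)

    nums.getNumPartitions                                   // 4
    nums.mapPartitions(it => Iterator(it.size)).collect()   // per-partition sizes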

  • Programming Model


  • Spark Programming Model

    - A data flow is composed of any number of data sources, operators, and data sinks by connecting their inputs and outputs.

    - Operators are higher-order functions that execute user-defined functions in parallel.

    - Two types of RDD operators: transformations and actions.

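    The source-operator-sink shape of a data flow can be sketched as follows, assuming a SparkContext `sc` and hypothetical HDFS paths (not from the slides):

    val lines   = sc.textFile("hdfs://...")    // data source (hypothetical path)
    val lengths = lines.map(_.length)          // operator: higher-order function (map)
    val longs   = lengths.filter(_ > 80)       // operator: filter
    longs.saveAsTextFile("hdfs://...")         // data sink (hypothetical path)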

  • RDD Operators (1/2)

    - Transformations: lazy operators that create new RDDs.

    - Actions: launch a computation and return a value to the program or write data to external storage.
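    The laziness of transformations can be seen in a short sketch (assuming a SparkContext `sc`): nothing runs until an action is called.

    val nums = sc.parallelize(Array(1, 2, 3))

    // Transformation: lazy; only the lineage is recorded, nothing is computed yet.
    val squares = nums.map(x => x * x)

    // Action: triggers the actual computation and returns a value to the program.
    val total = squares.reduce(_ + _)   // 1 + 4 + 9 = 14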

  • RDD Operators (2/2)


  • RDD Transformations - Map

    - All pairs are independently processed.

    // passing each element through a function.

    val nums = sc.parallelize(Array(1, 2, 3))

    val squares = nums.map(x => x * x) // {1, 4, 9}

    // selecting those elements that func returns true.

    val even = squares.filter(_ % 2 == 0) // {4}



  • RDD Transformations - Reduce

    - Pairs with identical keys are grouped.

    - Groups are independently processed.

    val pets = sc.parallelize(Seq(("cat", 1), ("dog", 1), ("cat", 2)))

    pets.groupByKey()

    // {(cat, (1, 2)), (dog, (1))}

    pets.reduceByKey((x, y) => x + y)

    or

    pets.reduceByKey(_ + _)

    // {(cat, 3), (dog, 1)}


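    Combining the map- and reduce-style transformations from the last two slides gives the classic word count; a sketch assuming a SparkContext `sc` and a hypothetical input path:

    val lines  = sc.textFile("hdfs://...")                  // hypothetical input
    val words  = lines.flatMap(_.split(" "))                // map side: one word per element
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)  // reduce side: sum per key
    counts.collect()                                        // Array of (word, count) pairs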