
  • Spark and Spark SQL

    Amir H. Payberah amir@sics.se

    SICS Swedish ICT

    Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 1 / 71

  • What is Big Data?


  • Big Data

    ... everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.

    - Dan Ariely


  • Big Data

    Big data is data characterized by four key attributes: volume, variety, velocity, and value.

    - Oracle

    Buzzwords


  • Big Data In Simple Words


  • The Four Dimensions of Big Data

    - Volume: data size

    - Velocity: data generation rate

    - Variety: data heterogeneity

    - The 4th V is for vacillation: Veracity / Variability / Value

  • How To Store and Process Big Data?


  • Scale Up vs. Scale Out



  • The Big Data Stack


  • Data Analysis


  • Programming Languages


  • Platform - Data Processing


  • Platform - Data Storage


  • Resource Management


  • Spark Processing Engine


  • Why Spark?


  • Motivation (1/4)

    - Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.

    - Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures.

    - E.g., MapReduce


  • Motivation (2/4)

    - The MapReduce programming model was not designed for complex operations, e.g., data mining.

  • Motivation (3/4)

    - Very expensive (slow): intermediate results always go to disk and HDFS.

  • Motivation (4/4)

    - Extends MapReduce with more operators.

    - Supports advanced data flow graphs.

    - In-memory and out-of-core processing.

  • Spark vs. MapReduce (1/2)



  • Spark vs. MapReduce (2/2)



  • Challenge

    How to design a distributed memory abstraction that is both fault tolerant and efficient?

    Solution

    Resilient Distributed Datasets (RDD)



  • Resilient Distributed Datasets (RDD) (1/2)

    - A distributed memory abstraction.

    - Immutable collections of objects spread across a cluster.
      • Like a LinkedList

  • Resilient Distributed Datasets (RDD) (2/2)

    - An RDD is divided into a number of partitions, which are atomic pieces of information.

    - Partitions of an RDD can be stored on different nodes of a cluster.

    - Built through coarse-grained transformations, e.g., map, filter, join.

    - Fault tolerance via automatic rebuild (no replication).
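    The partitioning idea can be sketched with a plain Python analogue (the helper name `partition` is hypothetical; Spark's actual partitioners are more involved):

```python
# Hypothetical sketch: split a dataset into roughly equal partitions,
# analogous to how an RDD is divided into atomic pieces that can be
# stored on different cluster nodes. With this naive slicing, asking
# for more partitions than elements yields fewer, non-empty partitions.
import math

def partition(data, num_partitions):
    size = math.ceil(len(data) / num_partitions)
    return [data[i:i + size] for i in range(0, len(data), size)]

parts = partition(list(range(1, 10)), 3)
print(parts)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```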


  • Programming Model


  • Spark Programming Model

    - A data flow is composed of any number of data sources, operators, and data sinks, connected through their inputs and outputs.

    - Operators are higher-order functions that execute user-defined functions in parallel.

    - Two types of RDD operators: transformations and actions.


  • RDD Operators (1/2)

    - Transformations: lazy operators that create new RDDs.

    - Actions: launch a computation and return a value to the program or write data to external storage.
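    The transformation/action split can be illustrated with a Python generator, which is lazy in the same spirit (an analogy only; Spark tracks a lineage of transformations over distributed partitions):

```python
# Transformations build a description of the computation; nothing runs yet.
evaluated = []

def square(x):
    evaluated.append(x)  # record that work actually happened
    return x * x

squares = (square(x) for x in range(1, 6))  # lazy "transformation"
print(len(evaluated))  # 0 -- no element has been processed yet

total = sum(squares)   # the "action" forces evaluation
print(len(evaluated), total)  # 5 55
```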

  • RDD Operators (2/2)


  • RDD Transformations - Map

    - Each element is processed independently.

    // passing each element through a function
    val nums = sc.parallelize(Array(1, 2, 3))
    val squares = nums.map(x => x * x) // {1, 4, 9}

    // selecting those elements for which func returns true
    val even = squares.filter(_ % 2 == 0) // {4}
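    A runnable Python analogue of the slide's pipeline (local lists stand in for sc.parallelize; in real Spark, nothing would execute until an action such as collect()):

```python
# Local-list stand-in for the Spark pipeline on the slide.
nums = [1, 2, 3]
squares = [x * x for x in nums]            # {1, 4, 9}
even = [x for x in squares if x % 2 == 0]  # {4}
print(even)  # [4]
```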
