Apache Spark

  • Date posted: 16-Apr-2017

Transcript of Apache Spark

  • Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

    Mahdi Esmail oghli

    Dr. Bagheri

    Amirkabir University of Technology

    SPARK

    http://bigdata.ceit.aut.ac.ir

  • A BigData processing framework.

    2

  • What is BigData?

    3

  • Dealing With BigData

    4

    Sampling

    Hashing

    Approximation methods

    Map-Reduce Model

  • Map-Reduce Model

    5

    A BigData programming model

    Exposes the potential for parallelism

    Can be executed on a cluster

  • Map-Reduce Model

    6

    [Diagram: Input → Map → Map → Map → Reduce → Reduce → Output]
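The map/reduce flow in the diagram can be sketched in plain Scala (Spark's implementation language) as a word count. This is a local, single-machine illustration of the model only; the names `mapPhase`, `reducePhase`, and `wordCount` are made up for this sketch and are not part of any framework.

```scala
// A minimal word-count sketch of the Map-Reduce model in plain Scala
// (no cluster, no framework) -- illustrative only.

// Map phase: each input line is mapped to (word, 1) pairs.
def mapPhase(lines: Seq[String]): Seq[(String, Int)] =
  lines.flatMap(_.split("\\s+")).filter(_.nonEmpty).map(w => (w, 1))

// Shuffle + Reduce phase: pairs are grouped by key and counts are summed.
def reducePhase(pairs: Seq[(String, Int)]): Map[String, Int] =
  pairs.groupBy(_._1).map { case (w, ps) => w -> ps.map(_._2).sum }

def wordCount(lines: Seq[String]): Map[String, Int] =
  reducePhase(mapPhase(lines))
```

On a cluster, the map calls and the per-key reductions would each run in parallel on different workers; here `groupBy` plays the role of the shuffle.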

  • Problems with current computing frameworks, especially Map-Reduce

    They provide abstractions for accessing a cluster's computational resources,

    but lack an abstraction for leveraging distributed memory.

    7

  • Problems with current computing frameworks, especially Map-Reduce

    This makes them inefficient for applications that reuse intermediate results across multiple computations.

    8

  • SPARK Motivation

    Problems with current computing frameworks (e.g., Map-Reduce) are most costly for:

    Iterative algorithms

    Interactive data mining tools

    9

  • Data reuse examples

    Iterative machine learning and graph algorithms:

    PageRank

    K-means clustering

    Logistic regression

    10

  • Data reuse examples

    Interactive data mining (running multiple queries on the same subset of data):

    Statistical queries

    Fraud detection

    Stream queries

    11

  • Current Solution

    The only way to reuse data between computations with current frameworks:

    write it to an external stable storage system.

    12

  • Map-Reduce Model

    13

    [Diagram: Input → Map → Reduce → Output, with intermediate results written to stable storage between jobs]

  • Systems developed for reusing intermediate data

    Pregel: iterative graph computation

    HaLoop: an iterative map-reduce interface

    14

  • Systems developed for reusing intermediate data

    These target only specific computation patterns.

    We need an abstraction for more general reuse.

    15

  • RDD: Resilient Distributed Dataset

    16

  • RDD

    A read-only, partitioned collection of records.

    Can be created only from data in stable storage or from other RDDs (through transformations).

    Users can control persistence and partitioning.

    17
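For intuition, the definition above can be mirrored by a toy, single-machine model: a read-only, partitioned collection whose transformations produce new datasets rather than mutating existing ones. This `ToyRDD` is a hypothetical sketch for illustration, not Spark's actual `RDD` class.

```scala
// A toy, in-memory model of an RDD: a read-only, partitioned collection.
// Transformations build NEW datasets; the original is never modified.
// Sketch only -- this is not how Spark is implemented.

final case class ToyRDD[A](partitions: Vector[Vector[A]]) {
  // Transformations: return a new dataset, leaving this one unchanged.
  def map[B](f: A => B): ToyRDD[B] = ToyRDD(partitions.map(_.map(f)))
  def filter(p: A => Boolean): ToyRDD[A] = ToyRDD(partitions.map(_.filter(p)))
  // An action: brings the records back into one local collection.
  def collect(): Vector[A] = partitions.flatten
}

object ToyRDD {
  // "Create from stable storage": split the input into numPartitions chunks.
  def fromSeq[A](data: Seq[A], numPartitions: Int): ToyRDD[A] = {
    val chunk = math.max(1, math.ceil(data.size.toDouble / numPartitions).toInt)
    ToyRDD(data.grouped(chunk).map(_.toVector).toVector)
  }
}
```

Because every transformation returns a new dataset, the original partitions stay intact, which is the read-only property that makes lineage-based recovery possible.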

  • RDD

    Enables efficient data reuse

    A parallel data structure

    Lets users explicitly persist intermediate results

    Performs in-memory computation

    on large clusters

    in a fault-tolerant manner

    18

  • Current fault-tolerance approaches

    Replicating data across machines

    Logging updates across machines

    19

  • Two main kinds of interface for RDDs

    20

    [Diagram: RDD operations are split into Transformations and Actions]

  • Transformations

    e.g., map, filter, and join

    21

    Transformations (the lineage they form) are the interface used for fault tolerance in RDDs.

  • Actions

    SPARK computes RDDs lazily (this helps pipeline transformations).

    Actions return a value.

    e.g., count, collect, save

    22
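The lazy-evaluation point can be demonstrated with a plain Scala `LazyList` standing in for an RDD (an analogy only, not Spark code): declaring a `filter` runs nothing, and only a count-style action forces the pipeline.

```scala
// Lazy "transformations" vs. an eager "action", sketched with a Scala
// LazyList as a stand-in for an RDD. Analogy only -- not Spark itself.

var evaluated = 0 // how many lines the filter predicate has examined

val data = LazyList("ERROR disk full", "INFO all good", "ERROR net down")

// A "transformation": declaring it computes nothing yet.
val errors = data.filter { line => evaluated += 1; line.startsWith("ERROR") }

val evaluatedBeforeAction = evaluated // still 0: the filter is deferred

// An "action": forces the whole pipeline to execute.
val n = errors.length
```

As in Spark, deferring work until an action is requested lets consecutive transformations be fused into a single pass over the data.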

  • RDD can express other models

    Map - Reduce

    SQL

    Pregel

    HaLoop

    23

  • 24

    rdd1.join(rdd2).groupBy(...).filter(...)

    [Diagram: the join → groupBy → filter chain is turned into tasks by the task scheduler, and each task is executed by a worker]

  • SPARK Runtime

    25

    [Diagram: the driver ships tasks to workers; each worker loads input data into RAM, runs its tasks, and sends results back to the driver]

  • An Example

    1. lines = spark.textFile("hdfs://...")

    2. errors = lines.filter(_.startsWith("ERROR"))

    3. errors.persist()

    4. errors.filter(_.contains("HDFS")).map(_.split('\t')(3)).count()

    26

  • An Example

    27

    [Lineage diagram: lines → filter(_.startsWith("ERROR")) → errors → filter(_.contains("HDFS")) → HDFS errors → map(_.split('\t')(3)) → field 3]

  • Persist Function

    Indicates which RDDs we want to reuse in future actions.

    Other persistence strategies include:

    storing the RDD only on disk

    replicating it across machines

    Users can also set persistence priorities on RDDs.

    28

  • SPARK

    RDD has been implemented in a system called SPARK,

    written in the Scala language.

    29

  • What benchmarks show about SPARK

    Up to 20x faster than HADOOP for iterative applications (100 GB of data on 100 nodes)

    It can scan a 1 TB dataset with 5-7 s latency

    30

  • Evaluation (Logistic Regression)

    31

    [Bar chart: first-iteration vs. later-iteration running times (seconds) for HADOOP, HADOOPBM, and SPARK; SPARK's later iterations are by far the fastest]

  • Evaluation (K-means)

    32

    [Bar chart: first-iteration vs. later-iteration running times (seconds) for HADOOP, HADOOPBM, and SPARK; SPARK's later iterations are the fastest]

  • SPARK STACK

    33

    Spark SQL, Spark Streaming, MLlib, and GraphX, all running on Apache Spark

    Apache Spark runs on a distributed file system, e.g., HDFS or GlusterFS

  • SPARK Won Daytona GraySort Contest 2014

    Spark officially sets a new record in large-scale sorting

    34

  • Spark: the fastest open-source engine for sorting

    35

                              HADOOP MR             SPARK             SPARK 1 PB
    Data size                 102.5 TB              100 TB            1000 TB
    Elapsed time              72 mins               23 mins           234 mins
    # Nodes                   2100                  206               190
    # Cores                   50400 physical        6592 virtualized  6080 virtualized
    Cluster disk throughput   3150 GB/s             618 GB/s          570 GB/s
    Environment               Dedicated datacenter  EC2 (i2.8xlarge)  EC2 (i2.8xlarge)
    Sort rate                 1.42 TB/min           4.27 TB/min       4.27 TB/min

    (Without using Spark's in-memory cache)

  • Current committers

    36

  • SPARK > HADOOP MR

    37

  • References

    38

    Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113.

    Zaharia, Matei, et al. "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing." Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation. USENIX Association, 2012.

    http://Spark.apache.org

    https://databricks.com


  • Thank you