Spark UCSD 16Nov15 Zonca

download Spark UCSD 16Nov15 Zonca

of 17

Transcript of Spark UCSD 16Nov15 Zonca

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    1/49

    Distributed computwith Spark and Pyth

     A. Zonca - San Diego Supercomputer C

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    2/49

    What is Spark?

    ● A distributed computing framework

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    3/49

    Connect to a spark clus

    Running on 4 instances of SDSC Cloud

    http://bit.ly/sparknotebook

    password: acluster 

     ALWAYS COPY WITH YOUR NAME

    http://bit.ly/sparknotebookhttp://bit.ly/sparknotebook

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    4/49

    House price example

    see house_price.ipynb

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    5/49map-reduce diagram

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    6/49

    What's the point?

    ● write simple mapper and reducer 

    ● framework scales to thousands of ma

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    7/49

    Problem 1: Storage

    ● Big data

    ● Commodity hardware (Cloud)

    Solution: Distributed File System

    ● redundant

    ● fault tolerant

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    8/49

    Problem 2: Computatio

    ● Slow to move data across network

    ● Computations fail

    Solution: Hadoop Mapreduce / Spark

    ● Execute computation where data are

    ● Rerun failed jobs

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    9/49

    Problem 3: Communica

    ● Most of the times, need to summarize

    to get a result

    ● Reduction phase in MapReduce

    ● Need data transfer across network

    Solution: highly optimized Shuffle (All-to-

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    10/49

    Spark and Hadoop

    ● Works within the Hadoop ecosystem

    ● Extends MapReduce

    ● Initially developed at UC Berkeley

    ● Now within the Apache Foundation● ~400 and more developers

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    11/49

    Key features of Spark

    ● Resiliency: tolerant to node failures

    ● Speed: supports in-memory caching

    ● Ease of use:

    ○ Python/Scala interfaces○ interactive shells

    ○ many distributed primitives

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    12/49

    Spark 100TB benchmark

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    13/49

    Worker Node

    Spark

    Executor

    Java Virtual

    Machine

      Python

      Python

      Python

    HDFS

    Worker No

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    14/49

    Exec

    JVM

     

     

     

    Worker No

    Exec

    JVM

     

     

     

    Exec

    JVM

     

     

     

    Worker No

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    15/49

    Exec

    JVM

     

     

     

    Worker No

    Exec

    JVM

     

     

     

    Exec

    JVM

     

     

     

    Cluster Manager

    YARN/Standalone

    Provision/Restart Workers

    Worker No

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    16/49

    Exec

    JVM

     

     

     

    Worker No

    Exec

    JVM

     

     

     

    Exec

    JVM

     

     

     

    Cluster

    Manager

    Spark

    Context

    Spark

    Context

    Driver Program

    S k L l

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    17/49

    Exec

    JVM 

    StandaloneSpark

    Context

    Spark

    Context

    Driver Program

    Spark Local

    Computing

    C t

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    18/49

    Exec

    JVM

     

     

     

    Computing

    Exec

    JVM

     

     

     

    Exec

    JVM

     

     

     Standalone CM

    Spark

    Context

    Spark

    Context

    Driver Program

    Master node

    on Comet

    EC2 node

    A EMR

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    19/49

    Exec

    JVM

     

     

     

    EC2 node

    Exec

    JVM

     

     

     

    Exec

    JVM

     

     

     

    YARNSpark

    Context

    Spark

    Context

    Driver Program

    Master node

    on Amazon EMR

    Computing

    SDSC Cl d

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    20/49

    Exec

    JVM

     

     

     

    Computing

    Exec

    JVM

     

     

     

    Exec

    JVM

     

     

     

    Standalone

    Spark

    Context

    Spark

    Context

    Driver Program

    Master node

    SDSC Cloud

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    21/49

    House price with HDFS

    see house_price_hdfs.ipynb

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    22/49

    Resilient Distributed Datase

    ● Resilient: fault tolerant, lineage is sav

    partitions can be recovered

    ● Distributed: partitions are automatical

    distributed across nodes● Created from: HDFS, S3, HBase, Loc

    Local hierarchy of folders

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    23/49

    map

    map : apply function to each element of RD

    RDD Partitions map

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    24/49

    Other transformations

    • filter(func) - keep only elements whis true

    • sample(withReplacement, fraction,

    get a random data fraction• coalesce(numPartitions) - merge pa

    to reduce them to numPartitions

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    25/49

    filter 

    filter : keep only elements where func

    RDD Partitions filter 

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    26/49

    coalesce

    sc.parallelize(range(10), 4).glom().collect()Out[]: [[0, 1], [2, 3], [4, 5], [6, 7, 8, 9]]

    sc.parallelize(range(10), 4).coalesce(2).glom().collOut[]: [[0, 1, 2, 3], [4, 5, 6, 7, 8, 9]]

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    27/49

    coalesce

    coalesce : reduce the number of partiti

    RDD Partitions coalesce

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    28/49

    Wide transformations

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    29/49

    reduceByKey(func)

      (K, V) pairs => (K, reduce V with fu

    K A, 1

     A A, 2m

     A A, 3

    redshuffle

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    30/49

    Narrow vs Wid

    map

    K A, 1

     A A, 2m

    reducebyKey

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    31/49

    Wide transformations

    • groupByKey : (K, V) pairs => (K, iterable

    V)

    • reduceByKey(func) : (K, V) pairs => (K,

    reduction by func on all V)

    • repartition(numPartitions) : similar to coa

    shuffles all data to increase or decrease n

    of partitions to numPartitions

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    32/49

    Shuffle

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    33/49

    Shuffle

    • Global redistribution of data

    • High impact on performance

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    34/49

    Shuffle

    K A, 1

      B, 5

     A,2| B,8m

     A A, [1, 2]

    B, [5, 8]

    writes todisk

    r

    dat

    n

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    35/49

    Know shuffle, avoid it

    • Which operations cause it?

    • Is it necessary?

    R ll d B K

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    36/49

    Really need groupByKey

    groupByKey: (K, V) pairs => (K, iterable of

     if you plan to call reduce later in the pipelin

     use reduceByKey instead.

    B K d

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    37/49

    groupByKey + reduce

    K A, 1

     A,2| A,8m

     A, [1,2,8] A, 11

    d B K

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    38/49

    reduceByKey

    K A, 1

     A,2| A,8m A, 10

     A, [1, 10] A, 11

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    39/49

    Extract data from RDD

    ● collect() - copy all elements to the driv● take(n) - copy first n elements

    ● saveAsTextFile(filename) - save to file

    ● reduce(func) - aggregate elements wi(takes 2 elements, returns 1)

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    40/49

    Cached RDD

    ● Generally recommended after data cle● Reusing cached data: 10x speedup

    ● Great for iterative algorithms

    ● If RDD too large, will only be partially in memory

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    41/49

    Directed Acyclic Graph Schedule

    Di t d A li G h

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    42/49

    Directed Acyclic Graph

    Di t d A li G h

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    43/49

    Directed Acyclic Graph

    Track dependencies!

    (also known as lineage or

    provenance)

    DAG in Spark

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    44/49

    DAG in Spark

    • nodes are RDDs• arrows are Transformations

    Narrow vs Wid

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    45/49

    Narrow vs Wid

    map

    K A, 1

     A A, 2m

    groupbyK

    Narrow vs Wid

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    46/49

    Narrow vs Wid

    map

    K A, 1

     A A, 2m

    groupbyK

     A, 1

      A, 2

    Spark DAG of transformat

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    47/49

    Spark DAG of transformat

    K A, 1

     A A, 2m

    string

    reducebyK

      A, 1

      A, 2

    (k, v)  HDFS

    maptextFile

    KA 1

    A 1

    maptextFile  join

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    48/49

    K A, 1

     A A,2m

    string

      A, 1

      A, 2

    (k, v1) HDFS

    maptextFile

    K A, 1

    (k,(v1, v2))

      A, 1

      A, 2

    (k, v2) 

    HDFSstring (k, v1 * v2)

    map

    Thank you

  • 8/20/2019 Spark UCSD 16Nov15 Zonca

    49/49

    Thank you

    [email protected]