
CS162: Operating Systems and Systems Programming

Lecture 25: Apache Spark – Berkeley Data Analytics Stack (BDAS)

November 27th, 2017
Prof. Ion Stoica
http://cs162.eecs.Berkeley.edu

    A Short History

    •  Started at UC Berkeley in 2009

    •  Open Source: 2010

    •  Apache Project: 2013

    •  Today: most popular big data processing engine


    What Is Spark?

    •  Parallel execution engine for big data processing

    •  Easy to use: 2-5x less code than Hadoop MR
       – High-level APIs in Python, Java, and Scala

    •  Fast: up to 100x faster than Hadoop MR
       – Can exploit in-memory data when available
       – Low-overhead scheduling, optimized engine

    •  General: supports multiple computation models


    General

    •  Unifies batch, interactive, and streaming computations
    •  Easy to build sophisticated applications
       – Supports iterative, graph-parallel algorithms
       – Powerful APIs in Scala, Python, Java

    [Stack diagram: Spark SQL, Spark Streaming, MLlib, and GraphX layered on top of Spark Core]
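    To make "unified engine" concrete, here is a minimal PySpark sketch (not from the slides; the application name and sample data are assumptions) feeding the same data to both the core RDD API and Spark SQL on one engine:

        from pyspark.sql import SparkSession

        # Hypothetical example: one engine serving both the RDD API and Spark SQL
        session = SparkSession.builder.appName("unified-example").getOrCreate()
        sc = session.sparkContext

        pairs = sc.parallelize([("spark", 2), ("hadoop", 1)])   # core RDD API
        df = session.createDataFrame(pairs, ["word", "count"])  # Spark SQL DataFrame
        df.filter(df["count"] > 1).show()                       # SQL-style query on the same engine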

    Easy to Write Code

    WordCount in 50+ lines of Java MR

    WordCount in 3 lines of Spark
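    The slide compares code screenshots that the transcript does not capture; a minimal sketch of the three-line PySpark word count (input and output paths are assumptions, and a SparkContext named sc is assumed to exist) might look like:

        # Hypothetical word count: read lines, split into words, count each word
        text = sc.textFile("input.txt")
        counts = text.flatMap(lambda line: line.split()) \
                     .map(lambda w: (w, 1)) \
                     .reduceByKey(lambda a, b: a + b)
        counts.saveAsTextFile("counts")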


    Large-Scale Usage

    Largest cluster: 8000 nodes

    Largest single job: 1 petabyte

    Top streaming intake: 1 TB/hour

    2014 on-disk sort record


    Fast: Time to sort 100TB

    2013 Record (Hadoop): 2100 machines, 72 minutes
    2014 Record (Spark):  207 machines, 23 minutes

    Spark also sorted 1 PB in 4 hours.
    Source: Daytona GraySort benchmark, sortbenchmark.org


    RDD: Core Abstraction

    •  Resilient Distributed Datasets (RDDs)
       – Collections of objects distributed across a cluster, stored in RAM or on disk
       – Built through parallel transformations
       – Automatically rebuilt on failure

    Operations
    •  Transformations (e.g., map, filter, groupBy)
    •  Actions (e.g., count, collect, save)

    Write programs in terms of distributed datasets and operations on them.
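    Since RDD partitions can live in RAM or spill to disk, here is a short PySpark sketch of choosing a storage level explicitly (the HDFS path is hypothetical and a SparkContext sc is assumed; not from the slides):

        from pyspark import StorageLevel

        # Hypothetical example: keep partitions in memory, spilling to disk if they don't fit
        data = sc.textFile("hdfs://namenode/logs/events.log")
        data.persist(StorageLevel.MEMORY_AND_DISK)
        data.count()   # the first action materializes and caches the partitions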


    Operations on RDDs

    •  Transformations: f(RDD) => RDD
       – Lazy (not computed immediately)
       – E.g., map
    •  Actions:
       – Trigger computation
       – E.g., count, saveAsTextFile
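    Because transformations are lazy, nothing executes until an action forces evaluation. A minimal PySpark sketch (assuming a SparkContext named sc; not from the slides):

        rdd = sc.parallelize(range(10))       # distribute a small collection
        squares = rdd.map(lambda x: x * x)    # transformation: returns a new RDD, no work done yet
        total = squares.count()               # action: triggers the actual computation
        print(total)                          # 10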


    Working With RDDs

    [Diagram: a source RDD flows through transformations into new RDDs; an action then produces a value on the driver]

    textFile = sc.textFile("SomeFile.txt")

    linesWithSpark = textFile.filter(lambda line: "Spark" in line)

    linesWithSpark.count()   # 74
    linesWithSpark.first()   # "# Apache Spark"

    Example: Log Mining

    Load error messages from a log into memory, then interactively search for various patterns.

    [Diagram: a Driver program coordinating three Workers, each holding one HDFS block of the log (Block 1, Block 2, Block 3)]

    lines = spark.textFile("hdfs://...")                       # Base RDD
    errors = lines.filter(lambda s: s.startswith("ERROR"))     # Transformed RDD
    messages = errors.map(lambda s: s.split("\t")[2])
    messages.cache()

    messages.filter(lambda s: "mysql" in s).count()            # Action

    Execution of the action:
    •  The driver ships tasks to the three workers.
    •  Each worker reads its HDFS block, applies the filter and map, and caches its partition of messages in memory (Cache 1, Cache 2, Cache 3).
    •  Each worker returns its results to the driver, which combines them into the final count.
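    Once messages is cached, further interactive queries run against the in-memory partitions rather than re-reading HDFS. A hedged follow-up query (the "php" pattern is an illustrative assumption, not from the slides):

        messages.filter(lambda s: "php" in s).count()    # reuses the cached messages RDD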
