Spark 1.0 → Spark 1.1 Roadmap - 1.1 Meetup Talk - Wendell.pdf


Transcript of Spark 1.0 → Spark 1.1 Roadmap - 1.1 Meetup Talk - Wendell.pdf

  • Spark 1.1 and Beyond

    Patrick Wendell

  • About Me

    Work at Databricks leading the Spark team

    Spark 1.1 Release manager

    Committer on Spark since AMPLab days

  • This Talk

    Spark 1.1 (and a bit about 1.2)

    A few notes on performance

    Q&A with myself, Tathagata Das, and Josh Rosen

  • A Bit about Spark…

    Spark core: the RDD API

    Spark SQL – RDD-based tables

    Spark Streaming – real-time; DStreams are streams of RDDs

    MLlib – machine learning; RDD-based matrices

    GraphX – graph processing (alpha); RDD-based graphs

    Storage: HDFS, S3, Cassandra

    Cluster managers: YARN, Mesos, Standalone

  • Spark Release Process

    ~3 month release cycle, time-scoped

    2 months of feature development

    1 month of QA

    Maintain older branches with bug fixes

    Upcoming release: 1.1.0 (previous was 1.0.2)

  • Release branches

    master – ongoing feature development (more features)

    branch-1.1 – cut from master for the 1.1 release; yields V1.1.0

    branch-1.0 – maintenance; yielded V1.0.0 and V1.0.1 (more stable)

    For any P.O.C. or non-production cluster, we always recommend running off the head of a release branch.

  • Spark 1.1

    1,297 patches

    200+ contributors (still counting)

    Dozens of organizations

    To get updates – join our dev list:

    E-mail dev-subscribe@spark.apache.org

  • Roadmap

    Spark 1.1 and 1.2 have similar themes

    Spark core:

    Usability, stability, and performance

    MLlib/SQL/Streaming:

    Expanded feature set and performance. Around 40% of mailing list traffic is about these libraries.

  • Spark Core in 1.1

    Performance “out of the box”

    Sort-based shuffle

    Efficient broadcasts

    Disk spilling in Python

    YARN usability improvements

    Usability

    Task progress and user-defined counters (sketched after this slide)

    UI behavior for failing or large jobs
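
    As a taste of the new counters: a minimal sketch of a named accumulator, which 1.1's web UI displays alongside task progress. The input path and counter name are illustrative, and `sc` is assumed to be an existing SparkContext.

    // Named accumulators ("user-defined counters") show up in the web UI.
    val badRecords = sc.accumulator(0, "Bad Records")

    sc.textFile("/data/input.txt").foreach { line =>
      if (line.trim.isEmpty) badRecords += 1  // incremented on the executors
    }

    println("bad records: " + badRecords.value)  // read back on the driver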

  • Spark SQL in 1.1

    1.0 was the first “preview” release

    1.1 provides upgrade path for Shark

    Replaced Shark in our benchmarks with 2-3X perf gains

    Can perform optimizations with 10-100X less effort than Hive.

  • Turning an RDD into a Relation

    // Define the schema using a case class.
    case class Person(name: String, age: Int)

    // Create an RDD of Person objects, register it as a table.
    val people = sc.textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))

    people.registerAsTable("people")

  • Querying using SQL

    // SQL statements can be run directly on RDDs.
    val teenagers =
      sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

  • Spark SQL in 1.1

    JDBC server for multi-tenant access and BI tools

    Native JSON support (sketched after this slide)

    Public types API – “make your own” SchemaRDDs

    Improved operator performance

    Native Parquet support and optimizations
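
    A minimal sketch of the native JSON support, assuming an existing SQLContext `sqlContext`; the path is illustrative, and the file holds one JSON object per line:

    // Load JSON directly; the schema is inferred from the records.
    val jsonPeople = sqlContext.jsonFile("/data/people.json")
    jsonPeople.printSchema()
    jsonPeople.registerAsTable("jsonPeople")
    sqlContext.sql("SELECT name FROM jsonPeople").collect().foreach(println)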

  • Spark Streaming

    Stability improvements across the board

    Amazon Kinesis support

    Rate limiting for streams

    Support for polling Flume streams

    Streaming + ML: streaming linear regression (sketched after this slide)
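
    A minimal sketch of streaming linear regression, mirroring the MLlib example pattern; the directories and weight dimensionality are placeholders, and `ssc` is assumed to be an existing StreamingContext:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}

    // Train continuously on one stream, predict on another.
    val trainingData = ssc.textFileStream("/training/dir").map(LabeledPoint.parse)
    val testData = ssc.textFileStream("/test/dir").map(LabeledPoint.parse)

    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.dense(0.0, 0.0))  // placeholder: 2 features

    model.trainOn(trainingData)
    model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

    ssc.start()
    ssc.awaitTermination()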

  • What’s new in MLlib v1.1

    Contributors: 40 (v1.0) -> 68

    • Algorithms: SVD via Lanczos, multiclass support in decision tree, logistic regression with L-BFGS, nonnegative matrix factorization, streaming linear regression

    • Feature extraction and transformation: scaling, normalization, tf-idf, Word2Vec

    • Statistics: sampling (core), correlations, hypothesis testing, random data generation (correlations sketched after this slide)

    • Performance and scalability: major improvement to decision tree, tree aggregation

    • Python API: decision tree, statistics, linear methods
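
    As a taste of the new statistics APIs, a minimal sketch combining random data generation with the new correlation support; the derived series is made up for illustration, and `sc` is assumed to be an existing SparkContext:

    import org.apache.spark.mllib.random.RandomRDDs
    import org.apache.spark.mllib.stat.Statistics

    // Generate a random series and a (trivially) correlated one.
    val x = RandomRDDs.normalRDD(sc, 10000L)
    val y = x.map(v => 2.0 * v + 0.5)

    // New in 1.1: "pearson" (default) and "spearman" correlations.
    println("correlation = " + Statistics.corr(x, y, "pearson"))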

  • Performance (v1.0 vs. v1.1)

  • Sort-based Shuffle

    Old shuffle:

    Each mapper opens a file for each reducer and writes output simultaneously.

    Files = # mappers * # reducers

    New shuffle:

    Each mapper buffers reduce output in memory, spills to disk when needed, then sort-merges the on-disk data (file counts sketched below).
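
    A back-of-the-envelope sketch of why this matters; the task counts are made-up numbers, and the new-shuffle figure ignores transient spill files:

    // Hypothetical job: 1000 map tasks shuffling to 1000 reduce tasks.
    val mappers = 1000
    val reducers = 1000

    val oldShuffleFiles = mappers * reducers  // 1,000,000 small files
    val newShuffleFiles = mappers             // one sorted, indexed file per map task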

  • GroupBy Operator

    Spark groupByKey != SQL groupBy

    NO:

    people.map(p => (p.zipCode, p.getIncome)).groupByKey().map { case (zip, incomes) => (zip, incomes.sum) }

    YES:

    people.map(p => (p.zipCode, p.getIncome)).reduceByKey(_ + _)

    (reduceByKey combines values map-side before the shuffle; groupByKey ships every raw value across the network.)

  • GroupBy Operator

    Spark groupByKey != SQL groupBy

    NO:

    people.map(p => (p.zipCode, p.getIncome)).groupByKey().map { case (zip, incomes) => (zip, incomes.sum) }

    YES:

    people.groupBy('zipCode).select(sum('income))

  • GroupBy Operator

    Spark groupByKey != SQL groupBy

    NO:

    people.map(p => (p.zipCode, p.getIncome)).groupByKey().map { case (zip, incomes) => (zip, incomes.sum) }

    YES:

    SELECT sum(income) FROM people GROUP BY zipCode;

  • Other efforts

    The same stack as before (Spark core RDD API; Spark SQL, Spark Streaming, MLlib, GraphX; HDFS, S3, Cassandra; YARN, Mesos, Standalone), plus:

    Pig on Spark

    Hive on Spark

    Ooyala Job Server

  • Looking Ahead to 1.2+

    [Core]

    Scala 2.11 support

    Debugging tools (task progress, visualization)

    Netty-based communication layer

    [SQL]

    Portability across Hive versions

    Performance optimizations (TPC-DS and Parquet)

    Planner integration with Cassandra and other sources

  • Looking Ahead to 1.2+

    [Streaming]

    Python Support

    Lower level Kafka API w/ recoverability

    [MLlib]

    Multi-model training

    Many new algorithms

    Faster internal linear solver

  • Q and A

    Josh Rosen – PySpark and Spark Core

    Tathagata Das – Spark Streaming lead