Spark 1.0 Spark 1.1 Roadmap - Meetupfiles.meetup.com/3138542/Spark 1.1 Meetup Talk - Wendell.pdf ·...

Spark 1.1 and Beyond

Patrick Wendell

About Me

Work at Databricks leading the Spark team

Spark 1.1 Release manager

Committer on Spark since AMPLab days

This Talk

Spark 1.1 (and a bit about 1.2)

A few notes on performance

Q&A with myself, Tathagata Das, and Josh Rosen

A Bit about Spark…

Spark RDD API

Spark Streaming (real-time): DStreams, streams of RDDs

GraphX (graph, alpha): RDD-based graphs

MLlib (machine learning): RDD-based matrices

Spark SQL: RDD-based tables

Storage: HDFS, S3, Cassandra

Cluster managers: YARN, Mesos, Standalone

Spark Release Process

~3 month release cycle, time-scoped

2 months of feature development

1 month of QA

Maintain older branches with bug fixes

Upcoming release: 1.1.0 (previous was 1.0.2)

Master

branch-1.1

branch-1.0

V1.0.0 V1.0.1

V1.1.0

More features

More stable

For any P.O.C. or non-production cluster, we always recommend running off of the head of a release branch.

Spark 1.1

1,297 patches

200+ contributors (still counting)

Dozens of organizations

To get updates – join our dev list:

E-mail [email protected]

Roadmap

Spark 1.1 and 1.2 have similar themes

Spark core:

Usability, stability, and performance

MLlib/SQL/Streaming:

Expanded feature set and performance

About 40% of mailing list traffic is about these libraries.

Spark Core in 1.1

Performance “out of the box”

Sort-based shuffle

Efficient broadcasts

Disk spilling in Python

YARN usability improvements

Usability

Task progress and user-defined counters

UI behavior for failing or large jobs

Spark SQL in 1.1

1.0 was the first “preview” release

1.1 provides upgrade path for Shark

Replaced Shark in our benchmarks with 2-3X perf gains

Can perform optimizations with 10-100X less effort than Hive.

Turning an RDD into a Relation

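// Assumes an existing SparkContext `sc`. The import brings in sql() and the
// implicit conversion from an RDD of case classes to a SchemaRDD.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects, register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

people.registerAsTable("people")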

Querying using SQL

// SQL statements can be run directly on RDDs.
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are SchemaRDDs and support normal RDD operations.
val nameList = teenagers.map(t => "Name: " + t(0)).collect()

// Language-integrated queries (à la LINQ), as an alternative to SQL strings.
val teenagers = people.where('age >= 10).where('age <= 19).select('name)
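To see what the table registration and the SQL query compute without a cluster, the same steps can be sketched with plain Scala collections. This is only an illustration: the object name `TeenagersSketch`, its helper functions, and the three sample lines are assumptions made for this sketch, and nothing here uses the Spark API.

```scala
case class Person(name: String, age: Int)

object TeenagersSketch {
  // Same parsing as the slide: split on commas, trim the age, convert to Int.
  def parse(line: String): Person = {
    val p = line.split(",")
    Person(p(0), p(1).trim.toInt)
  }

  // Plain-collections equivalent of:
  //   SELECT name FROM people WHERE age >= 13 AND age <= 19
  // followed by the "Name: " formatting applied to each result row.
  def teenagerNames(people: Seq[Person]): Seq[String] =
    people.filter(p => p.age >= 13 && p.age <= 19).map(p => "Name: " + p.name)

  // Hypothetical sample lines in "name, age" format (an assumption of this sketch).
  val lines: Seq[String] = Seq("Michael, 29", "Andy, 30", "Justin, 19")

  def main(args: Array[String]): Unit =
    teenagerNames(lines.map(parse)).foreach(println)
}
```

With the sample lines above, only the 19-year-old falls inside the 13 to 19 range, so a single formatted name is printed.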

Spark SQL in 1.1

JDBC server for multi-tenant access and BI tools

Native JSON support

Public types API – “make your own” SchemaRDDs

Improved operator performance

Native Parquet support and optimizations

Spark Streaming

Stability improvements across the board

Amazon Kinesis support

Rate limiting for streams

Support for polling Flume streams

Streaming + ML: Streaming linear regressions

What’s new in MLlib v1.1

• Contributors: 40 (v1.0) -> 68

• Algorithms: SVD via Lanczos, multiclass support in decision tree, logistic regression with L-BFGS, nonnegative matrix factorization, streaming linear regression

• Feature extraction and transformation: scaling, normalization, tf-idf, Word2Vec

• Statistics: sampling (core), correlations, hypothesis testing, random data generation

• Performance and scalability: major improvement to decision tree, tree aggregation

• Python API: decision tree, statistics, linear methods

Performance (v1.0 vs. v1.1)

Sort-based Shuffle

Old shuffle:

Each mapper opens a file for each reducer and writes output simultaneously.

Files = # mappers * # reducers

New shuffle:

Each mapper buffers its output for the reducers in memory, spills to disk when needed, then sort-merges the on-disk data.
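The file-count formula above is easy to put numbers on. The following sketch compares the two schemes; it assumes, for illustration only, that the sort-based shuffle leaves a small fixed number of files per mapper after merging (the `filesPerMapper` parameter, defaulting to 1, is an assumption of this sketch, and a real implementation may keep an extra index file alongside the data). The object and function names are hypothetical.

```scala
object ShuffleFileCount {
  // Old shuffle: each mapper opens one file per reducer,
  // so files = # mappers * # reducers.
  def hashShuffleFiles(mappers: Long, reducers: Long): Long =
    mappers * reducers

  // Sort-based shuffle (assumption): after the sort-merge, each mapper leaves
  // a fixed number of files, independent of the number of reducers.
  def sortShuffleFiles(mappers: Long, filesPerMapper: Long = 1L): Long =
    mappers * filesPerMapper

  def main(args: Array[String]): Unit = {
    val mappers = 1000L
    val reducers = 1000L
    println(s"old shuffle:        ${hashShuffleFiles(mappers, reducers)} files")
    println(s"sort-based shuffle: ${sortShuffleFiles(mappers)} files")
  }
}
```

For 1,000 mappers and 1,000 reducers the old scheme yields 1,000,000 files, while the count under the stated assumption grows only with the number of mappers.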

GroupBy Operator

Spark groupByKey != SQL groupBy

NO:

people.map(p => (p.zipCode, p.getIncome)).groupByKey().mapValues(incomes => incomes.sum)

YES:

people.map(p => (p.zipCode, p.getIncome)).reduceByKey(_ + _)
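One way to see why the second form is preferred: summing within each partition before the shuffle means far fewer records have to cross the network. The sketch below imitates that with plain Scala collections. It is not the Spark implementation; the object name `MapSideCombine`, its helpers, and the sample zip codes and incomes are all assumptions made for illustration.

```scala
object MapSideCombine {
  type Record = (String, Int) // (zipCode, income)

  // groupByKey-style: every record crosses the shuffle; summing happens afterwards.
  def shuffledWithoutCombine(partitions: Seq[Seq[Record]]): Int =
    partitions.map(_.size).sum

  // reduceByKey-style: each partition first sums its own records per key.
  def combineLocally(partition: Seq[Record]): Map[String, Int] =
    partition.groupBy(_._1).map { case (zip, recs) => zip -> recs.map(_._2).sum }

  // After local combining, at most one record per key per partition is shuffled.
  def shuffledWithCombine(partitions: Seq[Seq[Record]]): Int =
    partitions.map(p => combineLocally(p).size).sum

  // Final per-key totals: merge the locally combined partial sums.
  def totals(partitions: Seq[Seq[Record]]): Map[String, Int] =
    partitions.map(combineLocally).flatMap(_.toSeq)
      .groupBy(_._1).map { case (zip, recs) => zip -> recs.map(_._2).sum }

  // Hypothetical data: two partitions of (zipCode, income) pairs.
  val sample: Seq[Seq[Record]] = Seq(
    Seq(("94110", 50), ("94110", 70), ("10001", 40)),
    Seq(("94110", 30), ("10001", 60), ("10001", 20))
  )

  def main(args: Array[String]): Unit = {
    println(s"records shuffled without combining: ${shuffledWithoutCombine(sample)}")
    println(s"records shuffled with combining:    ${shuffledWithCombine(sample)}")
    println(s"totals: ${totals(sample)}")
  }
}
```

On this sample, six records are shuffled without combining and four with it; the gap widens as the number of records per key in each partition grows.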

GroupBy Operator

Spark groupByKey != SQL groupBy

NO:

people.map(p => (p.zipCode, p.getIncome)).groupByKey().mapValues(incomes => incomes.sum)

YES:

people.groupBy('zipCode).select(sum('income))

GroupBy Operator

Spark groupByKey != SQL groupBy

NO:

people.map(p => (p.zipCode, p.getIncome)).groupByKey().mapValues(incomes => incomes.sum)

YES:

SELECT sum(income) FROM people GROUP BY zipCode;

Other efforts

Spark RDD API

Spark Streaming (real-time): DStreams, streams of RDDs

GraphX (graph, alpha): RDD-based graphs

MLlib (machine learning): RDD-based matrices

Spark SQL: RDD-based tables

Storage: HDFS, S3, Cassandra

Cluster managers: YARN, Mesos, Standalone

Pig on Spark

Hive on Spark

Ooyala Job Server

Looking Ahead to 1.2+

[Core]

Scala 2.11 support

Debugging tools (task progress, visualization)

Netty-based communication layer

[SQL]

Portability across Hive versions

Performance optimizations (TPC-DS and Parquet)

Planner integration with Cassandra and other sources

Looking Ahead to 1.2+

[Streaming]

Python Support

Lower level Kafka API w/ recoverability

[MLlib]

Multi-model training

Many new algorithms

Faster internal linear solver

Q and A

Josh Rosen: PySpark and Spark Core

Tathagata Das: Spark Streaming Lead