Spark 1.1 and Beyond
Patrick Wendell
About Me
Work at Databricks leading the Spark team
Spark 1.1 Release manager
Committer on Spark since AMPLab days
This Talk
Spark 1.1 (and a bit about 1.2)
A few notes on performance
Q&A with myself, Tathagata Das, and Josh Rosen
Spark RDD API
Spark Streaming (real-time): DStreams, streams of RDDs
GraphX (graph processing, alpha): RDD-based graphs
MLlib (machine learning): RDD-based matrices
Spark SQL: RDD-based tables
A Bit about Spark…
Storage: HDFS, S3, Cassandra. Cluster managers: YARN, Mesos, Standalone.
Spark Release Process
~3 month release cycle, time-scoped
2 months of feature development
1 month of QA
Maintain older branches with bug fixes
Upcoming release: 1.1.0 (previous was 1.0.2)
[Branch diagram: release branches are cut from master. branch-1.0 produced v1.0.0 and v1.0.1; branch-1.1 produces v1.1.0. Master has more features; release branches are more stable.]
For any proof-of-concept or non-production cluster, we always recommend running off the head of a release branch.
Spark 1.1
1,297 patches
200+ contributors (still counting)
Dozens of organizations
To get updates – join our dev list:
E-mail [email protected]
Roadmap
Spark 1.1 and 1.2 have similar themes
Spark core:
Usability, stability, and performance
MLlib/SQL/Streaming:
Expanded feature set and performance
Around 40% of mailing list traffic is about these libraries.
Spark Core in 1.1
Performance “out of the box”
Sort-based shuffle
Efficient broadcasts
Disk spilling in Python
YARN usability improvements
Usability
Task progress and user-defined counters
UI behavior for failing or large jobs
Spark SQL in 1.1
1.0 was the first "preview" release
1.1 provides an upgrade path for Shark
Replaced Shark in our benchmarks with 2-3x perf gains
Can perform optimizations with 10-100x less effort than Hive.
Spark SQL in 1.1
Turning an RDD into a Relation
// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")
Querying using SQL
// SQL statements can be run directly on RDDs.
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are SchemaRDDs and support
// normal RDD operations.
val nameList = teenagers.map(t => "Name: " + t(0)).collect()

// Language-integrated queries (a la LINQ)
val teenagers = people.where('age >= 10).where('age <= 19).select('name)
Spark SQL in 1.1
JDBC server for multi-tenant access and BI tools
Native JSON support
Public types API: "make your own" SchemaRDDs
Improved operator performance
Native Parquet support and optimizations
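Native JSON support means Spark SQL can infer a SchemaRDD's schema directly from JSON records (via jsonFile / jsonRDD). A rough plain-Python sketch of the inference idea — not Spark's implementation, and the widen-to-string rule on type conflicts is a simplification:

```python
# Sketch: infer a field -> type schema by scanning JSON records.
import json

def infer_schema(json_lines):
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            t = type(value).__name__
            if field in schema and schema[field] != t:
                # Conflicting types across records: widen to string
                schema[field] = "str"
            else:
                schema.setdefault(field, t)
    return schema

lines = ['{"name": "Alice", "age": 30}', '{"name": "Bob", "age": "unknown"}']
print(infer_schema(lines))  # {'name': 'str', 'age': 'str'}
```

In real Spark the inferred schema then lets you run SQL over the JSON data without declaring a case class first.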
Spark Streaming
Stability improvements across the board
Amazon Kinesis support
Rate limiting for streams
Support for polling Flume streams
Streaming + ML: Streaming linear regressions
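The stream rate limiting above is exposed as a Spark configuration property. A hedged sketch — I believe the relevant property is spark.streaming.receiver.maxRate (records per second, per receiver); check your version's configuration docs:

```
# Cap each streaming receiver at 10,000 records/sec
# (0 or a negative value means unlimited)
spark.streaming.receiver.maxRate  10000
```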
What's new in MLlib v1.1
• Contributors: 40 (v1.0) -> 68
• Algorithms: SVD via Lanczos, multiclass support in decision tree, logistic regression with L-BFGS, nonnegative matrix factorization, streaming linear regression
• Feature extraction and transformation: scaling, normalization, tf-idf, Word2Vec
• Statistics: sampling (core), correlations, hypothesis testing, random data generation
• Performance and scalability: major improvement to decision tree, tree aggregation
• Python API: decision tree, statistics, linear methods
Performance (v1.0 vs. v1.1)
Sort-based Shuffle
Old shuffle:
Each mapper opens a file for each reducer and writes output simultaneously.
Files = # mappers * # reducers
New Shuffle:
Each mapper buffers reduce output in memory, spills to disk when needed, then sort-merges the on-disk data.
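A plain-Python sketch of the two strategies (not Spark's actual implementation): hash-based shuffle writes one file per (mapper, reducer) pair, while sort-based shuffle writes one partition-sorted file per mapper plus a small index:

```python
def hash_shuffle_file_count(num_mappers, num_reducers):
    # Old shuffle: files = # mappers * # reducers
    return num_mappers * num_reducers

def sort_shuffle_file_count(num_mappers, num_reducers):
    # New shuffle: one sorted data file per mapper (plus a tiny index file)
    return num_mappers

def sort_based_map_output(records, num_reducers):
    """Buffer a mapper's output, then sort by target partition so every
    reducer's records are contiguous in a single output file."""
    tagged = [(hash(key) % num_reducers, key, value) for key, value in records]
    tagged.sort(key=lambda t: t[0])  # sort by partition id
    # Index: record offset where each reducer's segment begins
    offsets = {}
    for i, (pid, _, _) in enumerate(tagged):
        offsets.setdefault(pid, i)
    return tagged, offsets

# 1,000 mappers x 1,000 reducers: 1,000,000 files vs 1,000
print(hash_shuffle_file_count(1000, 1000))  # 1000000
print(sort_shuffle_file_count(1000, 1000))  # 1000
```

The drop in file count is why sort-based shuffle behaves better "out of the box" on large clusters.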
GroupBy Operator
Spark groupByKey != SQL groupBy
NO:
people.map(p => (p.zipCode, p.getIncome))
  .groupByKey()
  .map { case (zip, incomes) => (zip, incomes.sum) }
YES:
people.map(p => (p.zipCode, p.getIncome)).reduceByKey(_ + _)
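The reason reduceByKey wins: it combines values on the map side, so only one partial sum per key crosses the network, while groupByKey ships and materializes every value. A plain-Python sketch of the two behaviors (conceptual, not Spark code):

```python
from collections import defaultdict

def group_by_key_then_sum(pairs):
    # groupByKey style: materialize every value per key, then aggregate
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return {k: sum(vs) for k, vs in groups.items()}

def reduce_by_key(pairs, combine):
    # reduceByKey style: fold each value into one running result per key
    acc = {}
    for k, v in pairs:
        acc[k] = combine(acc[k], v) if k in acc else v
    return acc

incomes = [("94110", 50), ("94110", 70), ("02139", 60)]
print(group_by_key_then_sum(incomes))              # {'94110': 120, '02139': 60}
print(reduce_by_key(incomes, lambda a, b: a + b))  # {'94110': 120, '02139': 60}
```

Both give the same answer; the difference is that the reduceByKey version never holds more than one value per key.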
GroupBy Operator
Spark groupByKey != SQL groupBy
NO:
people.map(p => (p.zipCode, p.getIncome))
  .groupByKey()
  .map { case (zip, incomes) => (zip, incomes.sum) }
YES:
people.groupBy('zipCode).select(sum('income))
GroupBy Operator
Spark groupByKey != SQL groupBy
NO:
people.map(p => (p.zipCode, p.getIncome))
  .groupByKey()
  .map { case (zip, incomes) => (zip, incomes.sum) }
YES:
SELECT sum(income) FROM people GROUP BY zipCode;
Other efforts
Pig on Spark
Hive on Spark
Ooyala Job Server
Looking Ahead to 1.2+
[Core]
Scala 2.11 support
Debugging tools (task progress, visualization)
Netty-based communication layer
[SQL]
Portability across Hive versions
Performance optimizations (TPC-DS and Parquet)
Planner integration with Cassandra and other sources
Looking Ahead to 1.2+
[Streaming]
Python support
Lower-level Kafka API with recoverability
[MLlib]
Multi-model training
Many new algorithms
Faster internal linear solver
Q and A
Josh Rosen: PySpark and Spark Core
Tathagata Das: Spark Streaming lead