Transcript of "Spark in 2015" by Patrick Wendell (files.meetup.com/3138542/Spark in 2015 - Wendell.pdf). Topics: machine learning, Spark, SQL.



Spark: Looking Back, Looking Forward

Databricks

Patrick Wendell

Welcome to Databricks!

Founded by the creators of Spark, who donated Spark to the Apache foundation in 2013.
Databricks Cloud: an integrated analytics platform based on Apache Spark (limited beta). http://databricks.com/registration
New office, so pardon any kinks!

About Me

Work at Databricks managing the Spark team.
Spark 1.2 release manager.
Committer on Spark since the Berkeley days.

Agenda for Today

Reflections and directions for Spark.
Deeper dive on new APIs in Spark SQL and MLlib.
Committer panel / Q&A.

Show of Hands!

How familiar are you with Spark?
A. Heard of it, but haven't used it before.
B. Kicked the tires with some basics.
C. Worked or working on a proof-of-concept deployment.
D. Worked or working on a production deployment.

A Bit about Spark…

[Stack diagram: user apps run on Spark's core RDD API and its libraries: Spark Streaming (real-time; DStreams are streams of RDDs), MLlib (machine learning; RDD-based matrices), GraphX (graph; RDD-based graphs), and Spark SQL (RDD-based tables). Data sources: Kafka, S3, Cassandra, HDFS. Cluster managers: YARN, Mesos, Standalone.]

Spark in 2014 – Project Growth

[Bar charts: code patches grew from 1,071 in 2013 to 3,567 in 2014; contributors grew from 137 in 2013 to 417 in 2014.]

Spark in 2014 – User Growth

[Chart: downloads in the 30 days from each release (sampled), for Spark 0.9 (Feb), Spark 1.0 (May), and Spark 1.1 (Sep).]

Spark in 2014 – Broader Ecosystem

Now supported by all major Hadoop vendors… But also beyond Hadoop…

Spark in 2014 – Major additions

Usability and portability of the core engine.
API stability (Spark 1.0!).
Vastly expanded UI and instrumentation.
Integration with Hadoop security.
Disk-spilling and shuffle optimizations.

Feature coverage for libraries

Spark SQL library and SchemaRDDs.
GraphX library.
Expansion of MLlib.

So… what’s coming?

New Technical Directions in 2015

SchemaRDDs as a common interchange format.
Data-frame-style APIs.
From developers to data scientists.

Extensibility and pluggable API’s

Data source API (SQL).
Pipelines API (MLlib).
Receiver API (Streaming).
Spark Packages.

SchemaRDDs as a Key Concept

RDD: “an immutable partitioned collection of elements”.
SchemaRDD: “an RDD of Row objects that has an associated schema”.
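As a concrete illustration, here is a sketch of building a SchemaRDD from an ordinary RDD with the Spark 1.2-era Scala API, assuming an existing SparkContext `sc`; the `LineItem` case class is a hypothetical example type:

```scala
// Spark 1.2-era sketch: an RDD of case classes implicitly becomes a SchemaRDD.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // enables the implicit RDD -> SchemaRDD conversion

case class LineItem(customer: String, units: Int, totalPrice: Double)

val lineitems = sc.parallelize(Seq(
  LineItem("alice", 3, 29.97),
  LineItem("bob", 1, 9.99)))

// The schema (column names and types) is inferred from the case class.
lineitems.printSchema()
```

The schema inference is what distinguishes this from a plain RDD: downstream code can rely on named, typed columns rather than opaque objects.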

Why SchemaRDDs are Useful

Having structure and types is very powerful.
Allows us to optimize performance more easily.
Fosters interoperability between libraries (Spark's and third party).
Enables higher-level and safer user-facing APIs.

SchemaRDDs - Data Frame APIs

# Pandas 'data frame' style
lineitems.groupby('customer').agg(Map(
  'units' -> 'avg',
  'totalPrice' -> 'std'
))

# or SQL style
SELECT AVG(units), STD(totalPrice)
FROM lineitems
GROUP BY customer
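The SQL style above can already be approximated in Spark 1.2 by registering a SchemaRDD as a temporary table; a hedged sketch, assuming `sc` is a SparkContext and `lineitems` is an RDD of a hypothetical `LineItem(customer, units, totalPrice)` case class (standard-deviation aggregates may need HiveContext in this era, so the sketch sticks to AVG):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit RDD -> SchemaRDD conversion

// Register the case-class RDD as a table visible to SQL queries.
lineitems.registerTempTable("lineitems")

// SQL-style aggregation over the SchemaRDD.
val stats = sqlContext.sql(
  "SELECT customer, AVG(units) FROM lineitems GROUP BY customer")
stats.collect().foreach(println)
```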

SchemaRDDs - Data Frame APIs

Data frame APIs are more familiar to data scientists and easier to use. Many user issues would be solved by writing against such APIs.

SchemaRDDs - Interoperability

Any data source made available to Spark SQL is instantly available in Java, Python, and R, with correct types.
Major internal APIs (such as the ML pipelines API) can make assumptions about input RDDs.

Spark with SchemaRDDs

[Stack diagram: the same stack, now with a SchemaRDD / data frame API layer inserted between the base RDD API and the libraries (Spark Streaming, MLlib, GraphX, Spark SQL) and user apps. Data sources: Kafka, S3, Cassandra, HDFS. Cluster managers: YARN, Mesos, Standalone.]

Technical Directions in 2015

SchemaRDDs as a common interchange format.
“Data frame” style APIs.
From developers to data scientists.

Extensibility and pluggable API’s

Data source API (SQL).
Pipelines API (MLlib).
Receiver API (Streaming).
Spark Packages.

Spark SQL – Initial Input Support

Hive metastore tables.
JSON (built in).
Parquet (built in).
Any Spark RDD + user-schema creation.
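The built-in inputs above look roughly like this in the Spark 1.2-era Scala API; a sketch, assuming a SparkContext `sc`, with placeholder file paths:

```scala
import org.apache.spark.sql.{Row, SQLContext, StringType, StructField, StructType}

val sqlContext = new SQLContext(sc)

// JSON (built in): one JSON object per line; the schema is inferred.
val events = sqlContext.jsonFile("hdfs://path/to/events.json")

// Parquet (built in): the schema comes from the Parquet file itself.
val users = sqlContext.parquetFile("hdfs://path/to/users.parquet")

// Any Spark RDD + user-supplied schema:
val schema = StructType(Seq(StructField("name", StringType, nullable = true)))
val rows = sc.parallelize(Seq(Row("alice"), Row("bob")))
val people = sqlContext.applySchema(rows, schema)
```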

Spark SQL – Data Sources API

[Diagram: a SchemaRDD backed by the Data Source API: table scan/sink, optimization rules, table catalog.]
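To give a feel for the plug-in surface, here is a hedged sketch of a minimal third-party source against the `org.apache.spark.sql.sources` interfaces introduced in Spark 1.2; the example class names and fixed rows are hypothetical:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext, StringType, StructField, StructType}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}

// Entry point Spark SQL looks up when the source is referenced by name.
class DefaultSource extends RelationProvider {
  def createRelation(sqlContext: SQLContext,
                     parameters: Map[String, String]): BaseRelation =
    new GreetingRelation(sqlContext)
}

// A hypothetical single-column source that just returns fixed rows.
class GreetingRelation(val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  // Table scan: expose the schema...
  def schema: StructType =
    StructType(Seq(StructField("greeting", StringType, nullable = false)))

  // ...and produce the rows as an RDD.
  def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row("hello"), Row("world")))
}
```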

MLlib – Pipelines API

Practical ML pipelines involve feature extraction, model fitting, testing, and validation.
The Pipelines API provides re-usable components and a language for describing workflows.
Relies heavily on SchemaRDDs for interoperability.
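Such a workflow can be sketched against the alpha spark.ml API shipped with Spark 1.2; this assumes `training` is a SchemaRDD with "text" and "label" columns:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Feature extraction: raw text -> tokens -> hashed term-frequency vectors.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")

// Model fitting: logistic regression on the extracted features.
val lr = new LogisticRegression().setMaxIter(10)

// The pipeline describes the whole workflow as reusable stages.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)
```

Each stage reads and appends columns on the SchemaRDD, which is what lets independently written components compose.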

Pluggable APIs

Goal is to stabilize these APIs over the next few releases to allow community implementations to proliferate. Spark is now facilitating feedback from third-party library authors.

spark-packages.org


A standard library index for Spark-related projects (think CRAN, PyPI, etc.). The plan is to make installation on Spark a single click.

[Stack diagram: the same stack, with community packages sitting alongside Spark's own libraries (Spark Streaming, MLlib, GraphX, Spark SQL) on top of the SchemaRDD / data frame API. Data sources: Kafka, S3, Cassandra, HDFS. Cluster managers: YARN, Mesos, Standalone.]

Spark in 2015 – Production Use Cases

Submissions to Spark Summit East show a broad variety of production use cases:

Hadoop workload migration.
Data pipeline and ETL.
User engagement analytics.
Medical diagnosis.
Recommendations.
Fraud detection.
Scientific computing.
Smart grid/utility analytics.

Community goals in 2015

Maintain a strong, cohesive community as we grow.
Continue to provide transparency and community involvement in the technical roadmap.
Maintain the trust of users to update to new releases.
Encourage ecosystem projects outside of Spark (stable APIs are a huge part of this).

To conclude. In 2015 expect…

-  Increasing focus on SchemaRDD as an integration point.

-  Friendlier, higher-level APIs.

-  Continued focus on usability and performance.