Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016  · Next-gen Streaming with DataFrames 1....

Spark 2.0: What’s Next Reynold Xin @rxin Spark Conference Japan Feb 8, 2016

Page 1

Spark 2.0: What’s Next

Reynold Xin @rxin

Spark Conference Japan

Feb 8, 2016

Page 2

Please put up your hand if you know what Spark is.

Page 3

Put up your hand if you think your significant other knows what Spark is.

(girlfriend, boyfriend, wife, husband, …)

Page 4

This Talk

What is Spark?

How are people using it?

Spark 2.0

Page 5

Open source data processing engine built around speed, ease of use, and sophisticated analytics

Page 6

Page 7

Page 8

About Databricks

Founded by the creators of Spark & the company behind Spark development

Cloud Enterprise Spark Platform
• Cluster management, interactive notebooks, dashboards, production jobs, data governance, security, …

Page 9

2015: Great Year for Spark

Most active open source project in data (1000+ contributors)

New language: R

Widespread industry support & adoption

Page 10

Meetup Groups: December 2014

source: meetup.com

Page 11

Meetup Groups: December 2015

source: meetup.com

Tokyo Spark Meetup

Page 12

[Japanese press headlines about Spark: one on IBM and Apache Spark, one titled “Spark Or Hadoop – …”, one on Apache Spark]

Page 13

“Spark is the Taylor Swift of big data software.”

- Derrick Harris, Fortune

Page 14

“Spark is the […] of big data software.”

(A Japanese engineer told me)

Page 15

How are people using Spark?

Page 16

Diverse Runtime Environments

HOW RESPONDENTS ARE RUNNING SPARK

51% on a public cloud

MOST COMMON SPARK DEPLOYMENT ENVIRONMENTS (CLUSTER MANAGERS)

Standalone mode: 48%
YARN: 40%
Mesos: 11%

Page 17

Industries Using Spark

[Pie chart] Segments: Other; Software (SaaS, Web, Mobile); Consulting (IT); Retail, e-Commerce; Advertising, Marketing, PR; Banking, Finance; Health, Medical, Pharmacy, Biotech; Carriers, Telecommunications; Education; Computers, Hardware

Shares (as extracted, not matched to segments): 29.4%, 17.7%, 14.0%, 9.6%, 6.7%, 6.5%, 4.4%, 4.4%, 3.9%, 3.5%

Page 18

Top Applications

Fraud Detection / Security: 29%
User-Facing Services: 36%
Log Processing: 40%
Recommendation: 44%
Data Warehousing: 52%
Business Intelligence: 68%

Page 19

Are we done?

No. Development is faster than ever!

Page 20

2010: started @ Berkeley
2012: research paper
2013: Databricks started & donated to ASF
2014: Spark 1.0 & libraries (SQL, ML, GraphX)
2015: DataFrames, Tungsten, ML Pipelines
2016: Spark 2.0

Page 21

Spark stack diagram

SQL | Streaming | MLlib | GraphX

Spark Core (RDD)

Page 22

Spark stack diagram (a different take)

Frontend (user facing APIs)

Backend (execution)

Page 23

Spark stack diagram (a different take)

Frontend (RDD, DataFrame, ML pipelines, …)

Backend (scheduler, shuffle, operators, …)

Page 24

Spark 2.0

Frontend: API Foundation
• Streaming
• DataFrame/Dataset
• SQL

Backend: 10X Performance
• Whole-stage Codegen
• Vectorization
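As a rough illustration of what whole-stage code generation aims for, the toy Python below (not Spark code; every name in it is hypothetical) runs the same filter, project, and sum query twice: once as a chain of per-row iterator operators, and once as a single fused loop built from generated source text. It is only a sketch of the idea, under the assumption that fusing operators into one loop removes per-row operator overhead. Vectorization is not shown.

```python
# Toy illustration only: this is not Spark code and uses no Spark APIs.
from typing import Callable, Iterable, Iterator, List


def scan(rows: Iterable[int]) -> Iterator[int]:
    """Leaf operator: hand rows to the parent one at a time."""
    for row in rows:
        yield row


def filter_op(child: Iterator[int], pred: Callable[[int], bool]) -> Iterator[int]:
    """Keep only the rows for which pred(row) is true."""
    for row in child:
        if pred(row):
            yield row


def project_op(child: Iterator[int], fn: Callable[[int], int]) -> Iterator[int]:
    """Apply fn to every row."""
    for row in child:
        yield fn(row)


def run_iterator_model(rows: List[int]) -> int:
    """Each row passes through three separate operators."""
    plan = project_op(filter_op(scan(rows), lambda x: x % 2 == 0), lambda x: x * 10)
    return sum(plan)


def generate_fused_source() -> str:
    """Emit source text for one loop that does scan, filter, project, and sum."""
    return (
        "def fused(rows):\n"
        "    total = 0\n"
        "    for x in rows:\n"
        "        if x % 2 == 0:\n"
        "            total += x * 10\n"
        "    return total\n"
    )


def run_fused(rows: List[int]) -> int:
    """Compile the generated source at runtime and call it."""
    namespace: dict = {}
    exec(generate_fused_source(), namespace)
    return namespace["fused"](rows)


data = list(range(10))
iterator_result = run_iterator_model(data)
fused_result = run_fused(data)
```

Both paths compute the same answer; the second does it in one loop with no intermediate operator calls, which is the property the backend work is after.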

Page 25

Guiding Principles for API Foundation

1. Simple yet expressive

2. (Semantics) well-defined

3. Sufficiently abstracted to allow optimized backends

Page 26

RDD: Java/Scala frontend → JVM backend

DataFrame: DataFrame frontend → Logical Plan → Catalyst optimizer → Physical execution
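A minimal sketch of why a logical plan helps, assuming nothing about Catalyst's real rules or classes: the query is held as a small tree, a hypothetical rule collapses stacked filters into one, and only then is the plan executed. All names (`Scan`, `Filter`, `optimize`, `execute`, `depth`) are invented for illustration.

```python
# Toy logical plan and one rewrite rule. Not Catalyst; names are hypothetical.
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

Row = Dict[str, int]


@dataclass
class Scan:
    rows: List[Row]


@dataclass
class Filter:
    child: "Plan"
    predicate: Callable[[Row], bool]


Plan = Union[Scan, Filter]


def optimize(plan: Plan) -> Plan:
    """Rule: rewrite Filter(Filter(child, p1), p2) as one Filter(child, p1 and p2)."""
    if isinstance(plan, Filter):
        child = optimize(plan.child)
        if isinstance(child, Filter):
            inner, outer = child.predicate, plan.predicate
            return Filter(child.child, lambda r: inner(r) and outer(r))
        return Filter(child, plan.predicate)
    return plan


def execute(plan: Plan) -> List[Row]:
    """Naive physical execution: evaluate the tree bottom-up."""
    if isinstance(plan, Scan):
        return list(plan.rows)
    return [r for r in execute(plan.child) if plan.predicate(r)]


def depth(plan: Plan) -> int:
    """Number of operators on the path from the root to the scan."""
    return 1 if isinstance(plan, Scan) else 1 + depth(plan.child)


rows = [{"age": a} for a in (15, 25, 35, 45)]
logical = Filter(Filter(Scan(rows), lambda r: r["age"] > 20), lambda r: r["age"] < 40)
optimized = optimize(logical)
result = execute(optimized)
```

Because the user only describes what to compute, the rewrite can happen without changing the answer. An RDD of opaque functions gives the engine no such tree to rewrite.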

Page 27

Frontends: Python, Java/Scala, SQL

DataFrame Logical Plan

Backends: JVM, Tungsten, …

Page 28

API Foundations in Spark 2.0

1. Streaming DataFrames

2. Maturing and merging DataFrame and Dataset

3. ANSI SQL
• natural join, subquery, view support
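To make the three SQL features concrete, here is a small sketch using Python's built-in `sqlite3` module as a stand-in engine. That choice is an assumption made only to keep the example runnable; it says nothing about Spark's SQL dialect or parser. Table, column, and view names are hypothetical.

```python
# Stand-in example: SQLite, not Spark SQL. All table and view names are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (user_id INTEGER, name TEXT);
    CREATE TABLE orders (user_id INTEGER, amount INTEGER);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO orders VALUES (1, 30), (1, 70), (2, 20);

    -- View built on a natural join (joins on the shared user_id column).
    CREATE VIEW user_totals AS
      SELECT name, SUM(amount) AS total
      FROM users NATURAL JOIN orders
      GROUP BY name;
    """
)

# Scalar subquery in the WHERE clause: keep users whose total exceeds
# the average order amount.
rows = conn.execute(
    """
    SELECT name, total
    FROM user_totals
    WHERE total > (SELECT AVG(amount) FROM orders)
    ORDER BY name
    """
).fetchall()
conn.close()
```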

Page 29

Challenges with Stream Processing

Stream processing is hard to reason about
• Output over time
• Late data
• Failures
• Distribution

And all this has to work across complex operations
• Windows, sessions, aggregation, etc.
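The "late data" and "output over time" points can be sketched with a toy event-time window count. This is plain Python, not a Spark API; the 10-second window and the sample events are assumptions chosen for illustration.

```python
# Toy event-time windowing. Window size and events are illustrative assumptions.
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

WINDOW_SECONDS = 10  # assumed window length


def window_start(event_time: int) -> int:
    """Map an event timestamp to the start of its tumbling window."""
    return event_time - (event_time % WINDOW_SECONDS)


def windowed_counts(
    events: Iterable[Tuple[int, str]]
) -> Tuple[Dict[int, int], List[Tuple[int, int]]]:
    """Count events per window by event time, not by arrival order.

    Returns the final counts and the sequence of (window, count) updates
    emitted as each event arrives.
    """
    counts: Dict[int, int] = defaultdict(int)
    updates: List[Tuple[int, int]] = []
    for event_time, _payload in events:
        start = window_start(event_time)
        counts[start] += 1
        updates.append((start, counts[start]))
    return dict(counts), updates


# Arrival order differs from event-time order: timestamps 3 and 9 arrive late,
# after events that belong to the next window.
arrivals = [(1, "a"), (12, "b"), (3, "c"), (15, "d"), (9, "e")]
final_counts, updates = windowed_counts(arrivals)
```

The update list shows the first window being revised after the second window has already produced output, which is exactly what makes the result hard to reason about without well-defined semantics.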

Page 30

Next-gen Streaming with DataFrames

1. Easy-to-use APIs (batch, streaming, and interactive)

2. Well-defined semantics
• Out-of-order data
• Failures
• Sources/sinks with exactly-once semantics

3. Leverages Tungsten backend
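One common way to get exactly-once results at a sink, sketched here under stated assumptions rather than as a description of Spark's implementation: the source is replayable from an offset, and the sink ignores a batch id it has already committed. `IdempotentSink`, `process`, and the checkpoint dictionary are hypothetical names.

```python
# Sketch: replayable source + idempotent sink. Not Spark's implementation.
from typing import Dict, List, Optional


class IdempotentSink:
    """Stores each batch at most once, keyed by batch id."""

    def __init__(self) -> None:
        self.committed: Dict[int, List[str]] = {}

    def write(self, batch_id: int, records: List[str]) -> bool:
        """Return True if the batch was stored, False if it was a replay."""
        if batch_id in self.committed:
            return False
        self.committed[batch_id] = list(records)
        return True

    def all_records(self) -> List[str]:
        return [r for b in sorted(self.committed) for r in self.committed[b]]


def process(
    source: List[str],
    sink: IdempotentSink,
    checkpoint: Dict[str, int],
    batch_size: int,
    crash_at_offset: Optional[int] = None,
) -> None:
    """Read batches from checkpoint['offset'], write them, then advance.

    If crash_at_offset matches a batch's start offset, fail after the sink
    write but before the checkpoint is advanced.
    """
    while checkpoint["offset"] < len(source):
        start = checkpoint["offset"]
        batch = source[start:start + batch_size]
        sink.write(batch_id=start, records=batch)
        if crash_at_offset == start:
            raise RuntimeError("simulated failure before checkpoint")
        checkpoint["offset"] = start + len(batch)


source = ["a", "b", "c", "d", "e"]
sink = IdempotentSink()
checkpoint = {"offset": 0}

try:
    process(source, sink, checkpoint, batch_size=2, crash_at_offset=2)
except RuntimeError:
    pass  # restart below from the last recorded offset

process(source, sink, checkpoint, batch_size=2)
delivered = sink.all_records()
```

After the restart, the batch starting at offset 2 is read and written again, but the sink drops the replay, so each record appears once in the output.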

Page 31

Next-gen Streaming with DataFrames


More details in the next few weeks

Page 32

Spark is already pretty fast.

Can we make it 10X faster in 2.0?

Page 33

Teaser: SQL/DataFrame Performance

High throughput

Spark 1.6: 13.95 million rows/sec

Spark 2.0 (work in progress): 125 million rows/sec

Come to my talk this afternoon to learn more

Page 34

Python | SQL | R | Streaming | Advanced Analytics

DataFrame (& Dataset)

Tungsten Execution

Page 35

Spark 2.0 Release Schedule

Under active development on GitHub

March – April: code freeze

April – May: official release

Page 36

Thank you!

@rxin