Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1....

Spark 2.0: What’s Next

Reynold Xin @rxin

Spark Conference Japan

Feb 8, 2016

Please put up your hand if you know what Spark is.

Put up your hand if you think your significant other (girlfriend, boyfriend, wife, husband, …) knows what Spark is.

This Talk

What is Spark?
How are people using it?
Spark 2.0

Open source data processing engine built around speed, ease of use, and sophisticated analytics

About Databricks

Founded by creators of Spark & behind Spark development

Cloud Enterprise Spark Platform
• Cluster management, interactive notebooks, dashboards, production jobs, data governance, security, …


2015: Great Year for Spark

Most active open source project in data (1000+ contributors)

New language: R

Widespread industry support & adoption


Meetup Groups: December 2014

source: meetup.com

Meetup Groups: December 2015

source: meetup.com

Tokyo Spark Meetup

[Japanese press headlines about Apache Spark, including coverage of IBM and a "Spark or Hadoop" comparison]

“Spark is the Taylor Swift of big data software.”

- Derrick Harris, Fortune

“Spark is the [name illegible] of big data software.”

(A Japanese engineer told me)

How are people using Spark?

Diverse Runtime Environments

How respondents are running Spark: 51% on a public cloud

Most common Spark deployment environments (cluster managers)
• Standalone mode: 48%
• YARN: 40%
• Mesos: 11%

Industries Using Spark

[Pie chart. Categories: Other; Software (SaaS, Web, Mobile); Consulting (IT); Retail, e-Commerce; Advertising, Marketing, PR; Banking, Finance; Health, Medical, Pharmacy, Biotech; Carriers, Telecommunications; Education; Computers, Hardware. Slice values in extraction order, not matched to categories: 29.4%, 17.7%, 14.0%, 9.6%, 6.7%, 6.5%, 4.4%, 4.4%, 3.9%, 3.5%.]

Top Applications

• Fraud Detection / Security: 29%
• User-Facing Services: 36%
• Log Processing: 40%
• Recommendation: 44%
• Data Warehousing: 52%
• Business Intelligence: 68%

Are we done?

No. Development is faster than ever!

2012

started @ Berkeley

2010

research paper

2013

Databricks started & donated to ASF

2014

Spark 1.0 & libraries (SQL, ML, GraphX)

2015

DataFrames, Tungsten, ML Pipelines

2016

Spark 2.0

Spark stack diagram

SQL, Streaming, MLlib, GraphX

Spark Core (RDD)

Spark stack diagram (a different take)

Frontend (user-facing APIs)

Backend (execution)

Spark stack diagram (a different take)

Frontend (RDD, DataFrame, ML pipelines, …)

Backend (scheduler, shuffle, operators, …)

Spark 2.0

Frontend: API Foundation
• Streaming
• DataFrame/Dataset
• SQL

Backend: 10X Performance
• Whole-stage Codegen
• Vectorization

Guiding Principles for API Foundation

1. Simple yet expressive

2. (Semantics) well-defined

3. Sufficiently abstracted to allow optimized backends


RDD: Java/Scala frontend, JVM backend

DataFrame: DataFrame frontend, Logical Plan, Catalyst optimizer, Physical execution

DataFrame: Python, Java/Scala, and SQL frontends; DataFrame Logical Plan; JVM, Tungsten, … backends
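The layering in these diagrams (a frontend builds a logical plan, an optimizer rewrites it, a backend executes it) can be sketched in plain Python. This is a toy illustration only, not Spark or Catalyst code: the classes `Scan`, `Filter`, and `Project`, the functions `optimize`, `execute`, and `count_nodes`, and the sample data are all hypothetical names invented for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

Row = Dict[str, object]


@dataclass
class Scan:
    """Leaf of the plan: an in-memory table (stand-in for a data source)."""
    rows: List[Row]


@dataclass
class Filter:
    child: "Plan"
    predicate: Callable[[Row], bool]


@dataclass
class Project:
    child: "Plan"
    columns: List[str]


Plan = Union[Scan, Filter, Project]


def optimize(plan: Plan) -> Plan:
    """Toy optimizer with one rule: merge adjacent Filter nodes into one."""
    if isinstance(plan, Filter):
        child = optimize(plan.child)
        if isinstance(child, Filter):
            inner, outer = child.predicate, plan.predicate
            return Filter(child.child, lambda row: inner(row) and outer(row))
        return Filter(child, plan.predicate)
    if isinstance(plan, Project):
        return Project(optimize(plan.child), plan.columns)
    return plan


def execute(plan: Plan) -> List[Row]:
    """Toy backend: interpret the plan node by node."""
    if isinstance(plan, Scan):
        return list(plan.rows)
    if isinstance(plan, Filter):
        return [row for row in execute(plan.child) if plan.predicate(row)]
    if isinstance(plan, Project):
        return [{c: row[c] for c in plan.columns} for row in execute(plan.child)]
    raise TypeError(f"unknown plan node: {plan!r}")


def count_nodes(plan: Plan) -> int:
    """Number of operators in the plan, to show the rewrite had an effect."""
    return 1 if isinstance(plan, Scan) else 1 + count_nodes(plan.child)


# Hypothetical sample data and query.
people = Scan([
    {"name": "Ann", "age": 31},
    {"name": "Bob", "age": 17},
    {"name": "Cho", "age": 52},
])
query = Project(
    Filter(Filter(people, lambda r: r["age"] > 18), lambda r: r["age"] < 40),
    ["name"],
)
optimized = optimize(query)
print(execute(query), count_nodes(query))
print(execute(optimized), count_nodes(optimized))
```

Because the user only describes what to compute, the optimizer is free to rewrite the plan as long as the result stays the same. Any frontend that produces the same plan shape gets the same rewrite.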

API Foundations in Spark 2.0

1. Streaming DataFrames

2. Maturing and merging DataFrame and Dataset

3. ANSI SQL
• natural join, subquery, view support


Challenges with Stream Processing

Stream processing is hard to reason about
• Output over time
• Late data
• Failures
• Distribution

And all this has to work across complex operations
• Windows, sessions, aggregation, etc.
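To make "late data" and "windows" concrete, here is a minimal plain-Python sketch of counting events per event-time window. It does not use any Spark API. The function `tumbling_counts`, its `window_size` parameter, and the sample events are assumptions made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_counts(events: Iterable[Tuple[int, str]], window_size: int) -> Dict[int, int]:
    """Count events per fixed-size window, keyed by window start time.

    Each event is (event_time, payload). Events are assigned to a window by
    the time they happened, not by the order in which they arrived.
    """
    counts: Dict[int, int] = defaultdict(int)
    for event_time, _payload in events:
        window_start = (event_time // window_size) * window_size
        counts[window_start] += 1
    return dict(counts)


# Arrival order differs from event-time order: the events at times 3 and 9
# arrive after events that belong to the next window.
arrivals = [(1, "a"), (12, "b"), (3, "c"), (14, "d"), (9, "e")]
result = tumbling_counts(arrivals, window_size=10)
print(result)
```

Grouping by event time gives the same counts whatever the arrival order. The sketch keeps every window open forever; a real streaming engine also has to decide when a window's result can be emitted and how long to wait for stragglers, which is part of what makes the problem hard.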


Next-gen Streaming with DataFrames

1. Easy-to-use APIs (batch, streaming, and interactive)

2. Well-defined semantics
• Out-of-order data
• Failures
• Sources/sinks with exactly-once semantics

3. Leverages Tungsten backend
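The slides do not say how Spark 2.0 will achieve exactly-once semantics. One common general technique is to pair a replayable source with an idempotent sink, sketched below in plain Python. The class `IdempotentSink` and the function `run_with_replay` are hypothetical names for this illustration and are not Spark APIs.

```python
from typing import List


class IdempotentSink:
    """Toy sink that remembers the highest offset it has written."""

    def __init__(self) -> None:
        self.rows: List[str] = []
        self.last_offset = -1

    def write(self, offset: int, row: str) -> bool:
        """Write a row once; ignore replays of offsets already written."""
        if offset <= self.last_offset:
            return False  # duplicate delivery after a restart
        self.rows.append(row)
        self.last_offset = offset
        return True


def run_with_replay(source: List[str], sink: IdempotentSink, fail_after: int) -> None:
    """Process the source, simulate a crash, then replay from the beginning."""
    for offset, row in enumerate(source[:fail_after]):
        sink.write(offset, row)
    # Simulated failure: on restart the source is replayed from offset 0,
    # so the first `fail_after` records are delivered a second time.
    for offset, row in enumerate(source):
        sink.write(offset, row)


source = ["r0", "r1", "r2", "r3"]
sink = IdempotentSink()
run_with_replay(source, sink, fail_after=2)
print(sink.rows)
```

Delivery here is at-least-once (some records are sent twice), but the sink's offset check means each record takes effect exactly once in the output.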


More details in the next few weeks

Teaser: SQL/DataFrame Performance

High throughput
• Spark 1.6: 13.95 million rows/sec
• Spark 2.0 (work in progress): 125 million rows/sec
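As a quick check on the two throughput figures quoted on this slide, the ratio can be computed directly. The variable names are illustrative, and the slide does not state which workload was measured.

```python
# Throughput figures as quoted on the slide (rows per second).
spark_1_6_rows_per_sec = 13.95e6
spark_2_0_rows_per_sec = 125e6  # work-in-progress build

speedup = spark_2_0_rows_per_sec / spark_1_6_rows_per_sec
print(f"speedup: {speedup:.2f}x")  # prints "speedup: 8.96x"
```

That is roughly a 9x improvement, in line with the "10X Performance" goal for the backend.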

Come to my talk this afternoon to learn more.

Tungsten Execution

Python, SQL, R, Streaming, Advanced Analytics

DataFrame (& Dataset)

Spark 2.0 Release Schedule

Under active development on GitHub

March – April: code freeze

April – May: official release

Thank you! @rxin
