Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1....
Transcript of Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1....
Spark 2.0: What’s Next
Reynold Xin @rxinSpark Conference JapanFeb 8, 2016
Please put up your hand
if you know what Spark is?
Put up your hand
if you think your significant otherknow what Spark is?
(girlfriend, boyfriend, wife, husband, …)
This Talk
What is Spark?How are people using it?Spark 2.0
open source data processing engine built around speed, ease of use, and sophisticated analytics�����������!#��� Ò�Á�Ñ´¶
������������"� �
About Databricks
Founded by creators of Spark & behind Spark development
Cloud Enterprise Spark Platform• Cluster management, interactive notebooks,
dashboards, production jobs,data governance, security, …
Databricks Àº¥»
Spark�ht½Spark�hÒA>³Î�¶¸À˹»�l°Ï¶
ØďêđüĉÕçÝĉÖñ Spark üĉíðúÙđăĐÝĉæên` .�$ôđðûíÝ
ĐëíäĆÿđñ åĈûb<
ĐïđêÛöòďæ èÜĆĊîÔ
2015: Great Year for Spark
Most active open source project in data (1000+ contributors)
New language: R
Widespread industry support & adoption
2015: SparkÀ½¹»'«¿3
ïđê� IÉYh¿Úđüďéđæüčå×Ýð (1000��Â�^t)
D²¥�� : R
24¥ReâĀđð½@c
Meetup Groups: December 2014
source: meetup.com
Meetup Groups: December 2015
source: meetup.com
TokyoSpark Meetup
IBMÃApache Spark Â�6�ÅÂáĂíðĄďðÒÓòÖďæ TÂ103¼IÉ�~¿Úđüďéđæüčå×Ýð½¿Î�w;ÒkÈ»¥Î½¥¦
Spark Or Hadoop – ¾¸Ìªýæð¿øíÞïđêúČđăĎđÝē
Apache Spark ª�W:�G·½�PqOªj³
“Spark is the Taylor Swiftof big data software.”
- Derrick Harris, Fortune¢Spark ÃøíÞïđêéúðÖ×ÓÂîÕĉđĐæÖÔúð�·£
- ëđĊíÝĐõĊæ úÙđìĆď
��X : FM¼ÃîČøfp¢îĉæõÖæ£Â��U¿¾¼K
“Spark is the �1H(of big data software.”
(A Japanese engineer told me)
How are people using Spark?
Diverse Runtime EnvironmentsHOW RESPONDENTS ARE
RUNNING SPARK
51%on a public cloud
MOST COMMON SPARK DEPLOYMENTENVIRONMENTS (CLUSTER MANAGERS)
48% 40% 11%Standalone mode YARN Mesos
Cluster Managers
°Ç±Ç¿+{a%
Industries Using Spark
Other
Software(SaaS, Web, Mobile)
Consulting (IT)Retail,
e-Commerce
Advertising,Marketing, PR
Banking, Finance
Health, Medical,Pharmacy, Biotech
Carriers,Telecommunications
Education
Computers, Hardware
29.4%
17.7%
14.0%
9.6%
6.7%
6.5%
4.4%
4.4%
3.9%
3.5%
SparkÒ�c²»¥ÎRe
éúðÖ×Ó
áďâċîÔďÞ (IT)
�{ �z
áďùĆđêđ õđñÖ×Ó
Bv
�7 �g y� öÕÚîÝôčåđ
ÜąĊÓ ��
4"
āđßîÔďÞ PR
/& eáāđæ
µÂ�
Top Applications
29%
36%
40%
44%
52%
68%
Fraud Detection / Security
User-Facing Services
Log Processing
Recommendation
Data Warehousing
Business IntelligenceøåóæÕďîĊå×ďæ
ïđêÖ×ÓõÖåďÞ
ČáĄďïđäĈď
čÞ�`
ćđãđ!âđøæ
�VQ� / èÜĆĊîÔ
��ÂÓüĊßđäĈď
Are we done?
No. Development is faster than ever!
ɦ)<ē
¥¥§¡�hÃ�Ǽ�ÀYhÀ¿¹»r¥»¥ÎĒ
2012
started@
Berkeley
2010
researchpaper
2013
Databricksstarted
& donatedto ASF
2014
Spark 1.0 & libraries(SQL, ML, GraphX)
2015
DataFramesTungsten
ML Pipelines
2016
Spark 2.0
SQL Streaming MLlib
Spark Core (RDD)
GraphX
Spark stack diagram SparkÂæêíÝ#
Frontend(user facing APIs)
Backend(execution)
Spark stack diagram(a different take)
SparkÂæêíÝ#(�¦�E¼)
������
(ćđãđÀ�³ÎAPI)
öíÝØďñ
(+{)
Frontend(RDD, DataFrame, ML pipelines, …)
Backend(scheduler, shuffle, operators, …)
Spark stack diagram(a different take)
SparkÂæêíÝ#(�¦�E¼)
������
(RDD, DataFrame, ML pipelines, …)
öíÝØďñ
(æßåĆđĉ äąíúċ [m(
…)
FrontendAPI Foundation
Streaming DataFrame/Dataset
SQL
Backend10X Performance
Whole-stage CodegenVectorization
Spark 2.0
účďðØďñ API Â��
æðĊđĂďÞ
DataFrame/DatasetSQL
öíÝØďñ 10�Â÷úÙđāďæ
�æîđå áđñb<
ýÝðċ�
Guiding Principles for API Foundation
1. Simple yet expressive
2. (Semantics) well-defined
3. Sufficiently abstracted to allow optimized backends
API Ò�ÎÀ¤¶¹»Â?�
äďüċ·ª|_�©À
(èāďîÔÝæª) ��*s°Ï»¥Î
öíÝØďñÂI��ª¼«Î˦��À=��°Ï»¥Î
Java/Scalafrontend
JVMbackend
RDD
DataFramefrontend
Logical Plan
Physical execution
Catalystoptimizer
DataFrame
Python Java/Scala SQL
DataFrameLogical Plan
JVM Tungsten …
API Foundations in Spark 2.0
1. Streaming DataFrames
2. Maturing and merging DataFrame and Dataset
3. ANSI SQL• natural join, subquery, view support
Spark 2.0 À¨ÎAPIÂ��
æðĊđĂďÞ DataFrames
DataFrame ½ Dataset Â<]½āđå
x\q� âûÝØĊ øĆđÂâĀđð
Challenges with Stream Processing
Stream processing is hard to reason about• Output over time• Late data• Failures• Distribution
And all this has to work across complex operations• Windows, sessions, aggregation, etc
æðĊđă�`À�³Î��
æðĊđă�`ª�²¥`dÃ
Đ�¥L�ÀZÎÓÖðüíð
Đ�Ï»¬Îïđê
�-
��
®Ï̳ƻª}�¿ÚþČđäĈďÀѶ¹»Sw²¿ÏÄ¿Ì¿¥
ĐÖÔďñÖ èíäĈď ÓÞĊàđäĈď ¿¾
Next-gen Streaming with DataFrames
1. Easy-to-use APIs (batch, streaming, and interactive)
2. Well-defined semantics• Out-of-order data• Failures• Sources/sinks with exactly-once semantics
3. Leverages Tungsten backend
DataFramesÀËÎT�æðĊđĂďÞ
1. ¥Ê³¥API (öíì æðĊđĂďÞ ÕďêĉÝîÔû)2. ¦Ç¬*s°Ï¶èāďîÔÝæĐ�5�ͼ¿¥ïđê
�-
Đexactly-once èāďîÔÝæÒ>º source / sink
3. Tungsten öíÝØďñÂ�c
Next-gen Streaming with DataFrames
1. Easy-to-use APIs (batch, streaming, and interactive)
2. Well-defined semantics• Out-of-order data• Failures• Sources/sinks with exactly-once semantics
3. Leverages Tungsten backend
DataFramesÀËÎT�æðĊđĂďÞ
1. ¥Ê³¥API (öíì æðĊđĂďÞ ÕďêĉÝîÔû)2. ¦Ç¬*s°Ï¶èāďîÔÝæĐ�5�ͼ¿¥ïđê
�-
Đexactly-once èāďîÔÝæÒ>º source / sink
3. Tungsten öíÝØďñÂ�c
More details next few weeksC��9ÀËÍ�oÒ
Spark is already pretty fast.
Can we make it 10X faster in 2.0?
Spark ó¼À©¿Í�¥
2.0¼ 10���À¼«Î·Ц©ē
Spark 1.6 13.95 millionrows/sec
Spark 2.0work-in-progress
125 millionrows/sec
High throughput�æċđüíð
Teaser: SQL/DataFrame Performance
come to my talk this afternoon to learn more�²¬iͶ¥Ë¦¼²¶Ì�9ÂѶ²Â�Òu«ÀN»¬·°¥
0²·,�: SQL/DataFrame��������
Tungsten Execution
PythonSQL R Streaming
DataFrame (& Dataset)
AdvancedAnalytics
Spark 2.0 Release Schedule
Under active development on GitHub
March – April: code freeze
April – May: official release
Spark 2.0 ÂĊĊđææßåĆđċ
GitHub�¼YhÀ�h�
3J-4J : áđñúĊđç
4J-5J : V8ĊĊđæ
¤Íª½¦¯±¥Ç²¶@rxin