State of Spark, and where it is going - GitHub...

18
State of Spark, and where it is going Reynold Xin @rxin Strata Singapore Dec 3 rd , 2015

Transcript of State of Spark, and where it is going - GitHub...

Page 1: State of Spark, and where it is going - GitHub Pagesrxin.github.io/talks/2015-12-03_strata_singapore_keynote.pdf · No! Development is faster than ever. Expect Spark 2.0 in 2016.

State of Spark, and where it is going

Reynold Xin @rxinStrata SingaporeDec 3rd, 2015

Page 2: State of Spark, and where it is going - GitHub Pagesrxin.github.io/talks/2015-12-03_strata_singapore_keynote.pdf · No! Development is faster than ever. Expect Spark 2.0 in 2016.

SQL Streaming MLlib

Spark Core (RDD)

GraphX

Spark stack diagram

Page 3: State of Spark, and where it is going - GitHub Pagesrxin.github.io/talks/2015-12-03_strata_singapore_keynote.pdf · No! Development is faster than ever. Expect Spark 2.0 in 2016.

A Great Year for Spark

Most active open source project in big data

New language: R

Widespread industry support & adoption

Page 4: State of Spark, and where it is going - GitHub Pagesrxin.github.io/talks/2015-12-03_strata_singapore_keynote.pdf · No! Development is faster than ever. Expect Spark 2.0 in 2016.

Community Growth

2014 2015

Summit Attendees

2014 2015

MeetupMembers

2014 2015

Developers Contributing

3900

1100

50K

12K

500

1000

Page 5: State of Spark, and where it is going - GitHub Pagesrxin.github.io/talks/2015-12-03_strata_singapore_keynote.pdf · No! Development is faster than ever. Expect Spark 2.0 in 2016.

Meetup Groups: December 2014

source: meetup.com

Page 6: State of Spark, and where it is going - GitHub Pagesrxin.github.io/talks/2015-12-03_strata_singapore_keynote.pdf · No! Development is faster than ever. Expect Spark 2.0 in 2016.

Meetup Groups: December 2015

source: meetup.com

Page 7: State of Spark, and where it is going - GitHub Pagesrxin.github.io/talks/2015-12-03_strata_singapore_keynote.pdf · No! Development is faster than ever. Expect Spark 2.0 in 2016.
Page 8: State of Spark, and where it is going - GitHub Pagesrxin.github.io/talks/2015-12-03_strata_singapore_keynote.pdf · No! Development is faster than ever. Expect Spark 2.0 in 2016.

Users

1000+ companies

Distributors + Apps

50+ companies

Page 9: State of Spark, and where it is going - GitHub Pagesrxin.github.io/talks/2015-12-03_strata_singapore_keynote.pdf · No! Development is faster than ever. Expect Spark 2.0 in 2016.

Diverse Runtime EnvironmentsHOW RESPONDENTS ARE

RUNNING SPARK

51%on a public cloud

MOST COMMON SPARK DEPLOYMENTENVIRONMENTS (CLUSTER MANAGERS)

48% 40% 11%Standalone mode YARN Mesos

Cluster Managers

Page 10: State of Spark, and where it is going - GitHub Pagesrxin.github.io/talks/2015-12-03_strata_singapore_keynote.pdf · No! Development is faster than ever. Expect Spark 2.0 in 2016.

Industries Using Spark

Other

Software(SaaS, Web, Mobile)

Consulting (IT)Retail,

e-Commerce

Advertising,Marketing, PR

Banking, Finance

Health, Medical,Pharmacy, Biotech

Carriers,Telecommunications

Education

Computers, Hardware

29.4%

17.7%

14.0%

9.6%

6.7%

6.5%

4.4%

4.4%

3.9%

3.5%

Page 11: State of Spark, and where it is going - GitHub Pagesrxin.github.io/talks/2015-12-03_strata_singapore_keynote.pdf · No! Development is faster than ever. Expect Spark 2.0 in 2016.

Top Applications

29%

36%

40%

44%

52%

68%

Faud Detection / Security

User-Facing Services

Log Processing

Recommendation

Data Warehousing

Business Intelligence

Page 12: State of Spark, and where it is going - GitHub Pagesrxin.github.io/talks/2015-12-03_strata_singapore_keynote.pdf · No! Development is faster than ever. Expect Spark 2.0 in 2016.

Largest Cluster & Daily Intake

12

800 million+active users

8000+nodes

150 PB+1 PB+/day

Page 13: State of Spark, and where it is going - GitHub Pagesrxin.github.io/talks/2015-12-03_strata_singapore_keynote.pdf · No! Development is faster than ever. Expect Spark 2.0 in 2016.

Alibaba Taobao

13

clustering(community detection)

belief propagation(influence & credibility)

collaborative filtering(recommendation)

* Spark Summit San Francisco 2014

Page 14: State of Spark, and where it is going - GitHub Pagesrxin.github.io/talks/2015-12-03_strata_singapore_keynote.pdf · No! Development is faster than ever. Expect Spark 2.0 in 2016.

Possible Assets

Targeted Marketing

Financial Networking

Huawei FusionInsight Spark

……

DB/DW

Credit proof:about 2 Weeks

Credit Proof2~5 Seconds

Off LineHistory Query

On LineHistory query

Structured Data Structured, Semi-Structured, Unstructured Data

Higher

History Query 7 years+1 year

Micro- loan Conversion Rate 40X

Credit Proof 2-5s15days

Top Retail Bank Huawei

Top Retail Bank & Huawei

Page 15: State of Spark, and where it is going - GitHub Pagesrxin.github.io/talks/2015-12-03_strata_singapore_keynote.pdf · No! Development is faster than ever. Expect Spark 2.0 in 2016.

Are We Done?

No! Development is faster than ever. Expect Spark 2.0 in 2016.

Biggest technical change in 2015 was DataFrames• Moves many computations onto the relational Spark SQL optimizer

Enables both new APIs and more optimization, which is now happening through Project Tungsten

Page 16: State of Spark, and where it is going - GitHub Pagesrxin.github.io/talks/2015-12-03_strata_singapore_keynote.pdf · No! Development is faster than ever. Expect Spark 2.0 in 2016.

Coming in Spark 1.6

Dataset API: typed interface over DataFrames / Tungsten• Common ask from developers who saw DataFrames

case class Person(name: String, age: Int)

val dataframe = read.json(“people.json”)val ds: Dataset[Person] = dataframe.as[Person]

ds.filter(p => p.name.startsWith(“M”)).groupBy(“name”).avg(“age”)

Page 17: State of Spark, and where it is going - GitHub Pagesrxin.github.io/talks/2015-12-03_strata_singapore_keynote.pdf · No! Development is faster than ever. Expect Spark 2.0 in 2016.

Other Upcoming Features

DataFrame integration with GraphX and Streaming

More Tungsten features: faster in-memory cache, SSD storage, better code generation

Data sources for Streaming

Page 18: State of Spark, and where it is going - GitHub Pagesrxin.github.io/talks/2015-12-03_strata_singapore_keynote.pdf · No! Development is faster than ever. Expect Spark 2.0 in 2016.

Thank you.@rxin