Apache Spark Usage in the Open Source Ecosystem · 2020. 3. 15. · Apache Spark Usage in the Open...

28
Apache Spark Usage in the Open Source Ecosystem Hossein Falaki @mhfalaki

Transcript of Apache Spark Usage in the Open Source Ecosystem · 2020. 3. 15. · Apache Spark Usage in the Open...

Page 1: Apache Spark Usage in the Open Source Ecosystem · 2020. 3. 15. · Apache Spark Usage in the Open Source Ecosystem Hossein Falaki @mhfalaki . About me ... imported to complement

Apache Spark Usage in the Open Source Ecosystem

Hossein Falaki @mhfalaki

Page 2: Apache Spark Usage in the Open Source Ecosystem · 2020. 3. 15. · Apache Spark Usage in the Open Source Ecosystem Hossein Falaki @mhfalaki . About me ... imported to complement

About me

•  Software Engineer / part-time Data Scientist at Databricks •  I started using Apache Spark since version 0.6 •  Developed first version of Apache Spark CSV data source •  Worked on SparkR and R notebooks at Databricks

2

Page 3: Apache Spark Usage in the Open Source Ecosystem · 2020. 3. 15. · Apache Spark Usage in the Open Source Ecosystem Hossein Falaki @mhfalaki . About me ... imported to complement

Stackoverflow 2016 trending tech

3

Page 4: Apache Spark Usage in the Open Source Ecosystem · 2020. 3. 15. · Apache Spark Usage in the Open Source Ecosystem Hossein Falaki @mhfalaki . About me ... imported to complement

Apache Spark Philosophy Unified engine Support end-to-end applications

High-level APIs Easy to use, rich optimizations

Integrate broadly Storage systems, libraries, etc

SQL Streaming ML Graph

1

2

3

Page 5: Apache Spark Usage in the Open Source Ecosystem · 2020. 3. 15. · Apache Spark Usage in the Open Source Ecosystem Hossein Falaki @mhfalaki . About me ... imported to complement

Databricks Community Edition

•  In February Databricks launched a free version of its cloud based platform in beta •  Since then more than 8,000 users registered •  Users created over 61,000 notebooks in different languages •  This is an analysis of third party libraries that our beta users

imported to complement Apache Spark in Scala, Python, and R

5

Page 6: Apache Spark Usage in the Open Source Ecosystem · 2020. 3. 15. · Apache Spark Usage in the Open Source Ecosystem Hossein Falaki @mhfalaki . About me ... imported to complement

What % of users use other libraries Language %usersimpor/ngexternallibs Average#libs Median#libs

Python 75% 9 2

Scala 55% 3 1

R 57% 6 1

6

Page 7: Apache Spark Usage in the Open Source Ecosystem · 2020. 3. 15. · Apache Spark Usage in the Open Source Ecosystem Hossein Falaki @mhfalaki . About me ... imported to complement

Installing libraries is easy

7

Page 8: Apache Spark Usage in the Open Source Ecosystem · 2020. 3. 15. · Apache Spark Usage in the Open Source Ecosystem Hossein Falaki @mhfalaki . About me ... imported to complement

Python Packages

8

Page 9: Apache Spark Usage in the Open Source Ecosystem · 2020. 3. 15. · Apache Spark Usage in the Open Source Ecosystem Hossein Falaki @mhfalaki . About me ... imported to complement

Most popular Python packages

9

Page 10: Apache Spark Usage in the Open Source Ecosystem · 2020. 3. 15. · Apache Spark Usage in the Open Source Ecosystem Hossein Falaki @mhfalaki . About me ... imported to complement

What is test_helper?

10

Page 11: Apache Spark Usage in the Open Source Ecosystem · 2020. 3. 15. · Apache Spark Usage in the Open Source Ecosystem Hossein Falaki @mhfalaki . About me ... imported to complement

What are these?

ETL •  re •  datetime •  pandas •  json •  csv •  string •  math / operator •  urllib / urllib2

11

Visualization •  matplotlib •  ggplot •  seaborn

Advanced analytics •  numpy •  sklearn •  graphframes •  tensorflow •  scipy

Other •  test_helper •  os •  md5

Page 12: Apache Spark Usage in the Open Source Ecosystem · 2020. 3. 15. · Apache Spark Usage in the Open Source Ecosystem Hossein Falaki @mhfalaki . About me ... imported to complement

Python package categories

12

Page 13: Apache Spark Usage in the Open Source Ecosystem · 2020. 3. 15. · Apache Spark Usage in the Open Source Ecosystem Hossein Falaki @mhfalaki . About me ... imported to complement

What packages go together?

13

Page 14: Apache Spark Usage in the Open Source Ecosystem · 2020. 3. 15. · Apache Spark Usage in the Open Source Ecosystem Hossein Falaki @mhfalaki . About me ... imported to complement

Scala Packages

14

Page 15: Apache Spark Usage in the Open Source Ecosystem · 2020. 3. 15. · Apache Spark Usage in the Open Source Ecosystem Hossein Falaki @mhfalaki . About me ... imported to complement

Most popular Scala libraries

15

Page 16: Apache Spark Usage in the Open Source Ecosystem · 2020. 3. 15. · Apache Spark Usage in the Open Source Ecosystem Hossein Falaki @mhfalaki . About me ... imported to complement

What are these?

ETL •  java/scala util •  scala.collection •  scala.math •  java.{io, nio} •  java.text •  o.a.commons •  kafka •  twitter4j

16

Visualization •  ?

Advanced analytics •  spark.ml •  graphframes

Other •  java.net •  scala.sys

Page 17: Apache Spark Usage in the Open Source Ecosystem · 2020. 3. 15. · Apache Spark Usage in the Open Source Ecosystem Hossein Falaki @mhfalaki . About me ... imported to complement

Scala package categories

17

Page 18: Apache Spark Usage in the Open Source Ecosystem · 2020. 3. 15. · Apache Spark Usage in the Open Source Ecosystem Hossein Falaki @mhfalaki . About me ... imported to complement

What libraries go together?

18

Page 19: Apache Spark Usage in the Open Source Ecosystem · 2020. 3. 15. · Apache Spark Usage in the Open Source Ecosystem Hossein Falaki @mhfalaki . About me ... imported to complement

R Packages

19

Page 20: Apache Spark Usage in the Open Source Ecosystem · 2020. 3. 15. · Apache Spark Usage in the Open Source Ecosystem Hossein Falaki @mhfalaki . About me ... imported to complement

Most popular R packages

20

Page 21: Apache Spark Usage in the Open Source Ecosystem · 2020. 3. 15. · Apache Spark Usage in the Open Source Ecosystem Hossein Falaki @mhfalaki . About me ... imported to complement

What are these?

ETL •  dplyr •  plyr •  reshape2 •  jsonlite •  tidyr •  lubridate •  httr •  data.table

21

Visualization •  ggplot2 •  beanplot •  plotly •  ...

Advanced analytics •  sparkr •  h2o •  caret •  e1071

Other •  devtools •  magrittr

Page 22: Apache Spark Usage in the Open Source Ecosystem · 2020. 3. 15. · Apache Spark Usage in the Open Source Ecosystem Hossein Falaki @mhfalaki . About me ... imported to complement

R package categories

22

Page 23: Apache Spark Usage in the Open Source Ecosystem · 2020. 3. 15. · Apache Spark Usage in the Open Source Ecosystem Hossein Falaki @mhfalaki . About me ... imported to complement

Comparing Python, Scala & R

23

Page 24: Apache Spark Usage in the Open Source Ecosystem · 2020. 3. 15. · Apache Spark Usage in the Open Source Ecosystem Hossein Falaki @mhfalaki . About me ... imported to complement

Languages have unique features

24

Scala / Python / R R / Python Scala / Python / R

•  25 % of users, use multiple languages •  3% of notebooks mix different languages

Page 25: Apache Spark Usage in the Open Source Ecosystem · 2020. 3. 15. · Apache Spark Usage in the Open Source Ecosystem Hossein Falaki @mhfalaki . About me ... imported to complement

Summary

•  Spark users extensively mix it with other packages in different languages – One of goals of Spark project is working well with other projects

•  ETL related libraries are the most popular category – Opportunities for new data sources

•  Notebooks are being used for “small data” as well as “big data.” •  Languages and their ecosystems have diverse capabilities. Users seem to

be mixing languages to their advantage –  Scala is missing visualization libraries

25

Page 26: Apache Spark Usage in the Open Source Ecosystem · 2020. 3. 15. · Apache Spark Usage in the Open Source Ecosystem Hossein Falaki @mhfalaki . About me ... imported to complement

Try your favorite library in Databricks

26

http://databricks.com/ce Try latest version of Apache Spark and preview of Spark 2.0

Page 27: Apache Spark Usage in the Open Source Ecosystem · 2020. 3. 15. · Apache Spark Usage in the Open Source Ecosystem Hossein Falaki @mhfalaki . About me ... imported to complement

Thank you!

Page 28: Apache Spark Usage in the Open Source Ecosystem · 2020. 3. 15. · Apache Spark Usage in the Open Source Ecosystem Hossein Falaki @mhfalaki . About me ... imported to complement

What packages are used together?

28