Transcript of "Spark in 2015" – Spark Meetup (files.meetup.com/3138542/Spark in 2015 - Wendell.pdf)
Welcome to Databricks!
Founded by the creators of Spark; donated Spark to the Apache Software Foundation in 2013.
Databricks Cloud – an integrated analytics platform based on Apache Spark (limited beta). http://databricks.com/registration
New office, so pardon any kinks!
About Me
Work at Databricks managing the Spark team.
Spark 1.2 release manager.
Committer on Spark since the Berkeley days.
Agenda for Today
Reflections and directions for Spark.
Deeper dive into new APIs in Spark SQL and MLlib.
Committer panel / Q&A.
Show of Hands!
How familiar are you with Spark?
A. Heard of it, but haven't used it before.
B. Kicked the tires with some basics.
C. Worked or working on a proof-of-concept deployment.
D. Worked or working on a production deployment.
[Diagram: the Spark stack. Spark RDD API at the base; built on top: Spark Streaming (real-time; DStreams: streams of RDDs), GraphX (graph; RDD-based graphs), MLlib (machine learning; RDD-based matrices), and Spark SQL (RDD-based tables).]
A Bit about Spark…
[Diagram, continued: user apps run against Spark; Spark runs on YARN, Mesos, or Standalone over data in Kafka, S3, Cassandra, and HDFS.]
Spark in 2014 – Project Growth
[Charts: code patches grew from 1,071 in 2013 to 3,567 in 2014; contributors grew from 137 to 417 over the same period.]
Spark in 2014 – User Growth
[Chart: downloads in the 30 days following each release (sampled), growing across Spark 0.9 (Feb), Spark 1.0 (May), and Spark 1.1 (Sep).]
Spark in 2014 – Broader Ecosystem
Now supported by all major Hadoop vendors… But also beyond Hadoop…
Spark in 2014 – Major additions
Usability and portability of the core engine: API stability (Spark 1.0!), vastly expanded UI and instrumentation, integration with Hadoop security, disk-spilling and shuffle optimizations.
Feature coverage for libraries
Spark SQL library and SchemaRDDs; GraphX library; expansion of MLlib.
New Technical Directions in 2015
SchemaRDDs as a common interchange format. "Data frame" style APIs. From developers to data scientists.
Extensibility and pluggable API’s
Data source API (SQL); Pipelines API (MLlib); Receiver API (Streaming); Spark Packages.
SchemaRDD’s as a Key Concept
RDD: “Immutable partitioned collection of elements.” SchemaRDD: “An RDD of Row objects that has an associated schema.”
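A minimal plain-Python sketch of the distinction (all names here are hypothetical illustrations; Spark's actual Row and schema types differ):

```python
# Conceptual sketch, no Spark needed: a "SchemaRDD" is just a
# collection of Row objects plus a schema describing field names
# and types.
from collections import namedtuple

# The schema: field names and their types (hypothetical example).
schema = [("name", str), ("age", int)]

# A Row type generated from the schema, as Spark SQL does internally.
Row = namedtuple("Row", [field for field, _ in schema])

rows = [Row("alice", 34), Row("bob", 29)]

# Because the schema is known, a library can validate (or optimize)
# without inspecting the data itself:
def validate(rows, schema):
    for row in rows:
        for (field, typ), value in zip(schema, row):
            assert isinstance(value, typ), (field, value)
    return True
```

A plain RDD would carry only the opaque elements; the attached schema is what lets downstream libraries reason about fields and types.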
Why SchemaRDD’s are Useful
Having structure and types is very powerful. Allows us to optimize performance more easily. Fosters interoperability between libraries (Spark's and third party). Enables higher-level and safer user-facing APIs.
SchemaRDDs - Data Frame APIs
# Pandas 'data frame' style
lineitems.groupby('customer').agg(Map(
  'units' -> 'avg',
  'totalPrice' -> 'std'
))

# or SQL style
SELECT AVG(units), STD(totalPrice)
FROM lineitems
GROUP BY customer
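To make the semantics of that example concrete, here is a plain-Python sketch of what the group-by aggregation computes (hypothetical sample data; population standard deviation is used for illustration; no Spark required):

```python
# Group rows by customer, then compute AVG(units) and a standard
# deviation of totalPrice per group -- the same shape of result the
# data-frame / SQL examples above describe.
import statistics

lineitems = [
    {"customer": "a", "units": 2, "totalPrice": 10.0},
    {"customer": "a", "units": 4, "totalPrice": 14.0},
    {"customer": "b", "units": 1, "totalPrice": 5.0},
    {"customer": "b", "units": 3, "totalPrice": 9.0},
]

# Step 1: group rows by the key column.
groups = {}
for row in lineitems:
    groups.setdefault(row["customer"], []).append(row)

# Step 2: aggregate each group.
result = {
    cust: {
        "avg_units": statistics.mean(r["units"] for r in rows),
        "std_totalPrice": statistics.pstdev(r["totalPrice"] for r in rows),
    }
    for cust, rows in groups.items()
}
```

The point of the data-frame API is that this two-step shuffle-and-aggregate is expressed declaratively, so the engine can plan and optimize it.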
SchemaRDDs - Data Frame APIs
Data frame APIs are more familiar to data scientists and easier to use. Many user issues would be solved by writing against such APIs.
SchemaRDDs - Interoperability
Any data source made available to Spark SQL is instantly available in Java, Python, and R, with correct types. Major internal APIs (such as the ML pipelines API) can make assumptions about input RDDs.
Spark with SchemaRDDs

[Diagram: the same stack, now with a Schema RDD / Data Frame API layer between the base RDD API and the libraries (Spark Streaming, GraphX, MLlib, Spark SQL); Kafka, S3, Cassandra, and HDFS plus YARN, Mesos, and Standalone below; user apps on top.]
Technical Directions in 2015
SchemaRDDs as a common interchange format. "Data frame" style APIs. From developers to data scientists.
Extensibility and pluggable API’s
Data source API (SQL); Pipelines API (MLlib); Receiver API (Streaming); Spark Packages.
Spark SQL – Initial Input Support
Hive metastore tables; JSON (built in); Parquet (built in); any Spark RDD + user-supplied schema.
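Spark SQL's built-in JSON support infers a schema from the records themselves. The core idea can be sketched in plain Python (illustrative only; the real inference also unifies nested structures and conflicting types):

```python
# Sketch of schema inference over JSON records: scan the records and
# build the union of observed fields with each field's type.
import json

# Hypothetical input: one JSON object per line, fields may vary.
lines = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "age": 29, "city": "SF"}',
]

# Infer a unified schema across all records.
schema = {}
for line in lines:
    for field, value in json.loads(line).items():
        schema.setdefault(field, type(value).__name__)
```

Records missing a field (like `city` above) simply become null for that column, which is why the inferred schema is the union of what was seen.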
Spark SQL – Datasources API
[Diagram: an external data source plugs into Spark SQL through the Data Source API (table scan/sink, optimization rules, table catalog) and surfaces to users as a SchemaRDD.]
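The pluggable idea can be sketched as an interface the engine programs against (interface and class names here are hypothetical, not the actual Spark SQL traits):

```python
# A data source implements a small contract -- expose a schema and a
# row scan -- and the engine consumes any implementation uniformly.
from abc import ABC, abstractmethod

class TableScan(ABC):
    @abstractmethod
    def schema(self):
        """Field names for the rows this source produces."""

    @abstractmethod
    def scan(self):
        """Yield the rows of the table."""

class InMemorySource(TableScan):
    """A toy source backed by a Python list."""
    def __init__(self, rows):
        self._rows = rows

    def schema(self):
        return ["id", "value"]

    def scan(self):
        return iter(self._rows)

# The engine only needs the interface, never the concrete source:
def count_rows(source: TableScan) -> int:
    return sum(1 for _ in source.scan())
```

A real implementation would also push filters and projections into `scan` so the source reads less data; the contract shape is what makes third-party sources drop-in.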
MLlib – Pipelines API
Practical ML pipelines involve feature extraction, model fitting, testing, and validation. The pipelines API provides reusable components and a language for describing workflows. It relies heavily on SchemaRDDs for interoperability.
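A conceptual sketch of the pipeline idea in plain Python (the real MLlib pipelines API operates on SchemaRDDs and distinguishes transformers from estimators; everything below is illustrative only):

```python
# A pipeline is an ordered list of reusable stages; running it feeds
# each stage's output into the next.
class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def run(self, data):
        for stage in self.stages:
            data = stage(data)
        return data

# Hypothetical stages: simple feature extraction steps.
def tokenize(docs):
    return [doc.lower().split() for doc in docs]

def count_tokens(token_lists):
    return [len(tokens) for tokens in token_lists]

pipeline = Pipeline([tokenize, count_tokens])
```

Because every stage consumes and produces the same kind of dataset, stages written by different authors compose, which is the interoperability point the slide makes about SchemaRDDs.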
Pluggable APIs
The goal is to stabilize these APIs over the next few releases so that community implementations can proliferate. Spark is now soliciting feedback from third-party library authors.
spark-packages.org
Standard library for Spark-related projects (think CRAN, PyPI, etc.). The plan is to make installation on Spark a single click.
[Diagram: the stack once more, with community packages alongside Spark Streaming, GraphX, MLlib, and Spark SQL, all sharing the Schema RDD / Data Frame API over the base RDD API; Kafka, S3, Cassandra, and HDFS plus YARN, Mesos, and Standalone below; user apps on top.]
Spark in 2015 – Production Use Cases
Submissions to Spark Summit East show a broad variety of production use cases:
Hadoop workload migration; recommendations; data pipelines and ETL; fraud detection; user engagement analytics; scientific computing; medical diagnosis; smart grid/utility analytics.
Community goals in 2015
Maintain a strong, cohesive community as we grow.
Continue to provide transparency and community involvement in the technical roadmap. Maintain users' trust so they update to new releases. Encourage ecosystem projects outside of Spark (stable APIs are a huge part of this).