Cloudera's Investments in the Spark Ecosystem - Spark at ...
1 Cloudera, Inc. All rights reserved.
Cloudera's Investments in the Spark Ecosystem
Mike Olson | Founder and Chief Strategy Officer
Our history with Spark
On the radar since 2009 (Matei Zaharia and the RAD Lab)
See my 2013 blog post (MapReduce and Spark)
1st vendor to ship and support Spark
6 contributors to Spark v1 (all other Hadoop vendors: zero)
2+ committers (all other Hadoop vendors: zero)
Complemented by Intel's substantial & early investment
Working across the project:
  Core, Streaming, Security, YARN with Yahoo!, MLlib
  Sentry, Hive, Pig, Crunch, Dataflow on Spark
  Cloudera Manager, training, PS (6+), UG, books, etc.
Single largest commercial distributor of Spark (per Typesafe/Databricks survey)
Our position on Spark
Cloudera is a member of, and aligned with, the global Spark community
Spark will replace MapReduce as the general purpose Hadoop framework
  Tremendous community: 400 developers across 50 companies
  Hadoop ecosystem integration (native & 3rd party)
  Doesn't mean MapReduce goes away: it will be the historical framework
Spark is not just for data science / ML
Spark does not replace special purpose frameworks
  One size does not fit all for SQL, Search, Graph, Stream
Why Spark Matters: Logistic Regression (data fits in memory)
[Figure: running time (s) versus number of iterations (1, 5, 10, 20, 30). Annotations: 110 s / iteration; first iteration 80 s, further iterations 1 s]
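The two annotations on this chart imply very different totals once a job runs many iterations: a flat 110 s per iteration versus 80 s for the first iteration and 1 s for each later one. The following is a minimal sketch of that arithmetic only. It assumes the flat figure describes a run that re-reads its input on every pass and the other describes a run that keeps the data in memory after the first pass; the slide text itself does not label the two series, and the function names are illustrative.

```python
# Cost figures copied from the chart annotations (seconds).
FLAT_SECONDS_PER_ITERATION = 110
CACHED_FIRST_ITERATION_SECONDS = 80
CACHED_LATER_ITERATION_SECONDS = 1


def flat_total(iterations: int) -> int:
    """Total time when every iteration costs the same amount."""
    if iterations <= 0:
        return 0
    return FLAT_SECONDS_PER_ITERATION * iterations


def cached_total(iterations: int) -> int:
    """Total time when the first pass is slow and later passes are cheap."""
    if iterations <= 0:
        return 0
    later_passes = iterations - 1
    return (CACHED_FIRST_ITERATION_SECONDS
            + CACHED_LATER_ITERATION_SECONDS * later_passes)


if __name__ == "__main__":
    # Same iteration counts as the chart's x-axis.
    for n in (1, 5, 10, 20, 30):
        print(f"{n:>2} iterations: flat={flat_total(n)} s, cached={cached_total(n)} s")
```

Under these assumptions the gap widens with every pass, because only the flat model pays the full cost each time.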
Trends
price every 18 months
2x bandwidth every 3 years
The numbers get even more interesting with upcoming enhancements to the Intel architecture.
128-384 GB
50 GB per sec
Memory: an enabler for high-performance big data applications
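One way to read the capacity and bandwidth figures on this slide together is as the time needed to stream once through everything held in memory. The sketch below assumes the GB range is per-node memory capacity and the 50 GB per sec figure is sustained memory bandwidth on that same node; the slide does not state either, and the function name is illustrative.

```python
# Bandwidth figure copied from the slide (GB per second).
BANDWIDTH_GB_PER_SEC = 50.0


def full_scan_seconds(memory_gb: float,
                      bandwidth_gb_per_sec: float = BANDWIDTH_GB_PER_SEC) -> float:
    """Seconds to read memory_gb once at the given bandwidth."""
    if bandwidth_gb_per_sec <= 0:
        raise ValueError("bandwidth must be positive")
    return memory_gb / bandwidth_gb_per_sec


if __name__ == "__main__":
    # The two ends of the capacity range shown on the slide.
    for memory_gb in (128, 384):
        print(f"{memory_gb} GB at {BANDWIDTH_GB_PER_SEC} GB/s: "
              f"{full_scan_seconds(memory_gb):.2f} s")
```

On that reading, a full pass over a node's resident data takes seconds, which is the sense in which memory enables iterative workloads.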
Delivering Spark in Cloudera Enterprise
Hadoop Integration
  Standard Hadoop data formats
  Runs under YARN in mixed clusters
  Security
Libraries
  MLlib: Machine Learning toolkit
  GraphX (alpha): Graph analytics based on PowerGraph abstractions
  Spark Streaming: Near real-time
Language support:
  SparkR (upcoming)
  Java 8
  PySpark and pandas interoperability
  DataFrame API
  Schema support in Spark's APIs
  SQL support in Spark Streaming (upcoming)
Cloudera's Spark Investments for 2015
Partner of choice for companies doing Spark integration
Increase our involvement in the community
Complete Hive on Spark
Complete Pig on Spark
Oozie action for Spark (Oozie team)
Improve Spark core shuffle primitives to be equivalent to or better than MapReduce in all respects
Integrate with Google Dataflow
Support advanced features such as runtime DAG optimization
Batch Tool of choice / Replace MR
EDH Integration and cluster citizenship
Automatic executor launch / destruction based on usage
Validation of Parquet / Avro with Impala style usage
Improved integration with HBase to simplify RDD creation against HBase tables
ATS integration for Spark
Container resizing with YARN support
Tachyon alternative in HDFS (dependent on HDFS team priorities) + off-heap caching
Ease of development
Provide EXPLAIN PLAN primitives at runtime and compile time
Programmatic job submission interface
Auto-compute partition model to simplify configuration space for users
CM integration; AMon integration; tuning hints; validation
Parallel split generation
REST API for Spark History Server
Security - Encryption: On-the-wire encryption, shuffle encryption
Security - Navigator: Integration with Audit, Lineage (visible through Hive as well)
Scale - Validate Spark at very large scale and improve scalability where issues are found
Security - MR / Spark: RecordService for deeper Sentry integration
Security - Authorization: Integrate Schema RDDs with Sentry
Availability: Spark Streaming availability (mostly complete)
Data science tool of choice
Hue app for Spark (a la Zeppelin, Databricks): Phase 1
REST-based interface to Spark for Hue
Oryx2 built on Spark for data science lifecycle management
Cloudera customer use cases: core Spark
Sector | Use case | Replaces
Value-at-Risk calculations; ETL pipeline speed-up; analyzing stock data for 20 years | Home grown applications
Genomics | Identify genes implicated in disease onset in full human genome
Data services | Trend analysis using statistical methods on large data sets; document classification (LDA); fraud analytics | Netezza replacement; net new
ERP | OCR and bill classification | Net new
Healthcare | Calculating Jaccard scores on health care data sets | Net new
Cloudera customer use cases: Streaming
Sector | Use case | Replaces
On-line fraud detection | Net new
Many | Continuous ETL
Retail | On-line recommender systems; inventory management
Deep engineering investment: only distribution vendor with engineering contributions to Spark and actual technical know-how
Field team, support, training and services with experience in many Spark use cases
Driving roadmap for Spark
Most customers running Spark across all distributions put together
Range from few nodes to 800+ nodes
Longest field presence: first vendor to support, and still only two vendors with official support
Intel partnership brings 15 Spark developers focused on Cloudera customer use cases
Business relationship with Databricks to do joint development on Spark
Thank you firstname.lastname@example.org @mikeolson