Cloudera’s Investments in the Spark Ecosystem - Meetup files.meetup.com/12063092/Olson - Spark at...

© Cloudera, Inc. All rights reserved.

Cloudera’s Investments in the Spark Ecosystem
Mike Olson | Founder and Chief Strategy Officer


Our history with Spark

• On the radar since 2009 (Matei Zaharia and the RAD Lab)
• See my 2013 blog post (“MapReduce and Spark”)
• 1st vendor to ship and support Spark
• 6 contributors to Spark v1 (all other Hadoop vendors: zero)
• 2+ committers (all other Hadoop vendors: zero)
• Complemented by Intel’s substantial & early investment
• Working across the project:
  • Core, Streaming, Security, YARN with Yahoo!, MLlib
  • Sentry, Hive, Pig, Crunch, Dataflow on Spark
• Cloudera Manager, training, PS (6+), UG, books, etc.

•  Single largest commercial distributor of Spark (per Typesafe/Databricks survey)


Our position on Spark

• Cloudera is a member of, and aligned with, the global Spark community
• Spark will replace MapReduce as the general purpose Hadoop framework
• Tremendous community – 400 developers across 50 companies
• Hadoop ecosystem integration (native & 3rd party)
• Doesn’t mean MapReduce goes away – it will be the historical framework
• Spark is not just for data science / ML
• Spark does not replace special purpose frameworks
• One size does not fit all for SQL, Search, Graph, Stream


Why Spark Matters: Logistic Regression (data fits in memory)

[Chart: running time (s) versus number of iterations (1, 5, 10, 20, 30), Hadoop vs. Spark. Hadoop: 110 s / iteration. Spark: first iteration 80 s, further iterations 1 s.]
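
Taking the chart's annotations at face value (Hadoop at about 110 s per iteration; Spark at about 80 s for the first iteration and about 1 s for each later one), the cumulative running times can be sketched as below. This is a rough back-of-the-envelope model, not a benchmark: the function and constant names are illustrative, and the assumption that cost is exactly linear in the number of iterations is mine, not the slide's.

```python
# Back-of-the-envelope model built only from the chart's three annotations.
# All names here are illustrative; the linear cost model is an assumption.
HADOOP_SECONDS_PER_ITERATION = 110
SPARK_FIRST_ITERATION_SECONDS = 80
SPARK_LATER_ITERATION_SECONDS = 1


def hadoop_total_seconds(iterations: int) -> int:
    """Assume every iteration costs the same 110 s."""
    return HADOOP_SECONDS_PER_ITERATION * max(iterations, 0)


def spark_total_seconds(iterations: int) -> int:
    """Assume 80 s for the first iteration and 1 s for each one after it."""
    if iterations <= 0:
        return 0
    later = iterations - 1
    return SPARK_FIRST_ITERATION_SECONDS + SPARK_LATER_ITERATION_SECONDS * later


if __name__ == "__main__":
    # Same iteration counts as the chart's x-axis.
    for n in (1, 5, 10, 20, 30):
        print(n, hadoop_total_seconds(n), spark_total_seconds(n))
```

Under this model, 30 iterations come to 3,300 s for Hadoop and 109 s for Spark, which is consistent with the chart's 0 to 4,000 s axis.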


In-Memory Datasets

Trends
• ½ price every 18 months
• 2x bandwidth every 3 years

The numbers get even more interesting with upcoming enhancements to the Intel architecture.

128–384 GB
12–24 cores
50 GB per sec

Memory: an enabler for high-performance big data applications


Delivering Spark in Cloudera Enterprise

Hadoop Integration
• Standard Hadoop data formats
• Runs under YARN in mixed clusters
• Security

Libraries
• MLlib – Machine Learning toolkit
• GraphX (alpha) – Graph analytics based on PowerGraph abstractions
• Spark Streaming – Near real-time analytics

Language support:
• SparkR (upcoming)
• Java 8
• PySpark and pandas interoperability
• DataFrame API
• Schema support in Spark’s APIs
• SQL support in Spark Streaming (upcoming)


Cloudera’s Spark Investments for 2015

Community leadership
• Partner of choice for companies doing Spark integration
• Increase our involvement in the community

Batch Tool of choice / Replace MR
• Complete Hive on Spark
• Complete Pig on Spark
• Oozie action for Spark (Oozie team)
• Improve Spark core shuffle primitives to be equivalent to or better than MapReduce in all respects
• Integrate with Google Dataflow
• Support advanced features such as runtime DAG optimization

EDH Integration and cluster citizenship
• Automatic executor launch / destruction based on usage
• Validation of Parquet / Avro with Impala style usage
• Improved integration with HBase to simplify RDD creation against HBase tables
• ATS integration for Spark
• Container resizing with YARN support
• Tachyon alternative in HDFS (dependent on HDFS team priorities) + off-heap caching

Ease of development
• Provide EXPLAIN PLAN primitives at runtime and compile time
• Programmatic job submission interface
• Auto-compute partition model to simplify configuration space for users

Enterprise grade
• CM integration; AMon integration; tuning hints; validation
• Parallel split generation
• REST API for Spark History Server
• Security - Encryption: On-the-wire encryption, shuffle encryption
• Security - Navigator: Integration with Audit, Lineage (visible through Hive as well)
• Scale - Validate Spark at very large scale and improve scalability where issues are found
• Security - MR / Spark: RecordService for deeper Sentry integration
• Security - Authorization: Integrate Schema RDDs with Sentry
• Availability: Spark Streaming availability (mostly complete)

Data science tool of choice
• Hue app for Spark (a la Zeppelin, Databricks): Phase 1 REST-based interface to Spark for Hue
• Oryx2 built on Spark for data science lifecycle management


Cloudera customer use cases – core Spark

Sector: Financial Services
Use case:
• Value-at-Risk calculations
• ETL pipeline speed-up
• Analyzing stock data for 20 years
Replaces: Home grown applications

Sector: Genomics
Use case:
• Identify genes implicated in disease onset in full human genome
Replaces: MySQL engine

Sector: Data services
Use case:
• Trend analysis using statistical methods on large data sets
• Document classification (LDA)
• Fraud analytics
Replaces:
• Netezza replacement
• Net new

Sector: ERP
Use case:
• OCR and bill classification
Replaces: Net new

Sector: Healthcare
Use case:
• Calculating Jaccard scores on health care data sets
Replaces: Net new


Cloudera customer use cases – Streaming

Sector: Financial Services
Use case:
• On-line fraud detection
Replaces: Net new

Sector: Many
Use case:
• Continuous ETL

Sector: Retail
Use case:
• On-line recommender systems
• Inventory management
Replaces:
• Custom apps


Why Cloudera?

Expertise
• Deep engineering investment – only distribution vendor with engineering contributions to Spark and actual technical know-how
• Field team, support, training and services with experience in many Spark use cases
• Driving roadmap for Spark

Experience
• Most customers running Spark across all distributions put together
• Deployments range from a few nodes to 800+ nodes
• Longest field presence – first vendor to support Spark, and still one of only two vendors with official support

Partnerships
• Intel partnership brings 15 Spark developers focused on Cloudera customer use cases
• Business relationship with Databricks to do joint development on Spark


Thank you [email protected] @mikeolson
