Cloudera's Investments in the Spark Ecosystem


  • 1 Cloudera, Inc. All rights reserved.

    Cloudera's Investments in the Spark Ecosystem

    Mike Olson | Founder and Chief Strategy Officer

  • 2

    Our history with Spark

      • On the radar since 2009 (Matei Zaharia and the RAD Lab)
      • See my 2013 blog post (MapReduce and Spark)
      • 1st vendor to ship and support Spark
      • 6 contributors to Spark v1 (all other Hadoop vendors: zero)
      • 2+ committers (all other Hadoop vendors: zero)
      • Complemented by Intel's substantial and early investment
      • Working across the project: Core, Streaming, Security, YARN with Yahoo!, MLlib
      • Sentry, Hive, Pig, Crunch, Dataflow on Spark
      • Cloudera Manager, training, PS (6+), UG, books, etc.

    Single largest commercial distributor of Spark (per Typesafe/Databricks survey)

  • 3

    Our position on Spark

      • Cloudera is a member of, and aligned with, the global Spark community
      • Spark will replace MapReduce as the general purpose Hadoop framework
          • Tremendous community: 400 developers across 50 companies
          • Hadoop ecosystem integration (native & 3rd party)
          • Doesn't mean MapReduce goes away: it will be the historical framework
      • Spark is not just for data science / ML
      • Spark does not replace special purpose frameworks
          • One size does not fit all for SQL, Search, Graph, Stream

  • 4

    Why Spark Matters: Logistic Regression (data fits in memory)

    [Figure: Running Time (s) versus Number of Iterations (1, 5, 10, 20, 30). Annotations: 110 s / iteration; first iteration 80 s, further iterations 1 s.]
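The annotations on this chart imply a simple cost model. The transcript does not label the two curves, so the sketch below assumes that the flat 110 s per iteration is the disk-based MapReduce-style baseline and that the 80 s first iteration followed by 1 s iterations is the run with data cached in memory. The names `cumulative_time`, `baseline`, and `cached` are illustrative only.

```python
def cumulative_time(n_iterations, first_s, later_s):
    """Total seconds for n iterations: the first costs first_s, each later one later_s."""
    if n_iterations <= 0:
        return 0
    return first_s + (n_iterations - 1) * later_s


# Iteration counts taken from the chart's x-axis.
ITERATIONS = [1, 5, 10, 20, 30]

# Assumption: every baseline iteration re-reads its input, so all cost 110 s.
baseline = {n: cumulative_time(n, 110, 110) for n in ITERATIONS}

# Assumption: the first iteration loads data into memory (80 s),
# and later iterations reuse the cached data (1 s each).
cached = {n: cumulative_time(n, 80, 1) for n in ITERATIONS}

for n in ITERATIONS:
    speedup = baseline[n] / cached[n]
    print(f"{n:>2} iterations: baseline {baseline[n]:>4} s, "
          f"cached {cached[n]:>3} s, speedup {speedup:.1f}x")
```

Under these assumptions the gap widens with every extra iteration, because the one-time load cost is amortized over the whole run.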

  • 5

    In-Memory Datasets

    Trends

      • price every 18 months
      • 2x bandwidth every 3 years

    The numbers get even more interesting with upcoming enhancements to the Intel architecture.

      • 128-384 GB
      • 12-24 cores
      • 50 GB per sec

    Memory: an enabler for high performance big data applications

  • 6

    Delivering Spark in Cloudera Enterprise

    Hadoop Integration

      • Standard Hadoop data formats
      • Runs under YARN in mixed clusters
      • Security

    Libraries

      • MLlib: Machine Learning toolkit
      • GraphX (alpha): Graph analytics based on PowerGraph abstractions
      • Spark Streaming: Near real-time

    Language support:

      • SparkR (upcoming)
      • Java 8
      • PySpark and pandas interoperability
      • Dataframe API
      • Schema support in Spark's APIs
      • SQL support in Spark Streaming (upcoming)

  • 7

    Cloudera's Spark Investments for 2015

    Community leadership

      • Partner of choice for companies doing Spark integration
      • Increase our involvement in the community

    Batch Tool of choice / Replace MR

      • Complete Hive on Spark
      • Complete Pig on Spark
      • Oozie action for Spark (Oozie team)
      • Improve Spark core shuffle primitives to be equivalent to or better than MapReduce in all respects
      • Integrate with Google DataFlow
      • Support advanced features such as runtime DAG optimization

    EDH Integration and cluster citizenship

      • Automatic executor launch / destruction based on usage
      • Validation of Parquet / Avro with Impala style usage
      • Improved integration with HBase to simplify RDD creation against HBase tables
      • ATS integration for Spark
      • Container resizing with YARN support
      • Tachyon alternative in HDFS (dependent on HDFS team priorities) + off-heap caching

    Ease of development

      • Provide EXPLAIN PLAN primitives at runtime and compile time
      • Programmatic job submission interface
      • Auto-compute partition model to simplify configuration space for users

    Enterprise grade

      • CM integration; AMon integration; tuning hints; validation
      • Parallel split generation
      • REST API for Spark History Server
      • Security - Encryption: On-the-wire encryption, shuffle encryption
      • Security - Navigator: Integration with Audit, Lineage (visible through Hive as well)
      • Scale - Validate Spark at very large scale and improve scalability where issues are found
      • Security - MR / Spark: RecordService for deeper Sentry integration
      • Security - Authorization: Integrate Schema RDDs with Sentry
      • Availability: Spark Streaming availability (mostly complete)

    Data science tool of choice

      • Hue app for Spark (a la Zeppelin, Databricks): Phase 1 REST-based interface to Spark for Hue
      • Oryx2 built on Spark for data science lifecycle management

  • 8

    Cloudera customer use cases: core Spark

    Sector / Use case / Replaces

      • Financial Services
          Use case: Value-at-Risk calculations; ETL pipeline speed-up; analyzing stock data for 20 years
          Replaces: Home grown applications
      • Genomics
          Use case: Identify genes implicated in disease onset in full human genome
          Replaces: MySQL engine
      • Data services
          Use case: Trend analysis using statistical methods on large data sets; document classification (LDA); fraud analytics
          Replaces: Netezza replacement; Net new
      • ERP
          Use case: OCR and bill classification
          Replaces: Net new
      • Healthcare
          Use case: Calculating Jaccard scores on health care data sets
          Replaces: Net new
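For readers unfamiliar with the Jaccard score mentioned in the healthcare use case: it measures the overlap of two sets as the size of their intersection divided by the size of their union. A minimal sketch in plain Python follows; the patient records and diagnosis codes are made up for illustration and say nothing about how the customer's Spark job was actually written.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention chosen here: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical example: diagnosis codes recorded for two patients.
patient_x = {"E11", "I10", "J45"}
patient_y = {"I10", "J45", "N18"}

# Two shared codes out of four distinct codes overall.
score = jaccard(patient_x, patient_y)
print(score)
```

At scale the same pairwise computation is what a distributed job would parallelize across many record pairs.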

  • 9

    Cloudera customer use cases: Streaming

    Sector / Use case / Replaces

      • Financial Services
          Use case: On-line fraud detection
          Replaces: Net new
      • Many
          Use case: Continuous ETL
      • Retail
          Use case: On-line recommender systems; inventory management
          Replaces: Custom apps

  • 10

    Why Cloudera?

      • Deep engineering investment: the only distribution vendor with engineering contributions to Spark and actual technical know-how
      • Field team, support, training and services with experience in many Spark use cases
      • Driving roadmap for Spark
      • More customers running Spark than all other distributions put together
      • Range from a few nodes to 800+ nodes
      • Longest field presence: first vendor to support Spark, and still one of only two vendors with official support
      • Intel partnership brings 15 Spark developers focused on Cloudera customer use cases
      • Business relationship with Databricks to do joint development on Spark


  • 11

    Thank you @mikeolson