Apache Spark Briefing

  • Date posted: 26-Jan-2015
  • Category: Technology

description

Apache Spark is rapidly emerging as the prime platform for advanced analytics in Hadoop. This briefing is updated to reflect news and announcements as of July 2014.

Transcript of Apache Spark Briefing

  • 1. Apache Spark: The Emerging Platform for Distributed Analytics. July 2014. Thomas W. Dinsmore

  • 2. What is Apache Spark?
    • Distributed in-memory analytics engine
    • Runs in standalone clusters or Hadoop
    • Fully compatible with Hadoop storage APIs
    • Runs under YARN
    • Top-level Apache project
    • Supported in all major Hadoop distros
    • Open source and vendor neutral

  • 3. Spark Timeline
    • Timeline axis: 2009, 2010, 2011, 2012, 2013, 2014
    • Events shown: Project begins; Open sourced; Spark Summit 2013; Apache Incubator; Apache Top-Level; Cloudera support; MapR support; Hortonworks support; SAP support
    • News cascade starting late last year.

  • 4. What problems does Spark solve?

  • 5. Problem #1: MapReduce I/O sandbags runtime for advanced analytics.
    • Must persist results after each pass through data
    • Advanced analytics often requires multiple passes through data
    • Diagram labels: Compute, Store, Hadoop Storage

  • 6. Spark Vision: Distributed in-memory platform
    • Intermediate results stay in memory.
    • 100X performance improvement for iterative algorithms.
    • Diagram labels: Compute, Hadoop Storage

  • 7. Problem #2: Many point solutions for advanced analytics in Hadoop
    • Machine Learning
    • Queries
    • Graph Analytics
    • Streaming Analytics

  • 8. Spark Vision: single integrated platform for advanced analytics in Hadoop.
    • Simplified administration
    • Integrated results

  • 9. How important is Spark?

  • 10. Mike Olson, Cloudera: "The leading candidate for successor to MapReduce today is Apache Spark."

  • 11. M.C. Srivas, MapR: "We believe Spark on Hadoop is a game changer for any business."

  • 12. Ben Lorica, O'Reilly Media: "The number of companies that are using Spark in production has exploded over the last year."

  • 13. Apache Spark is the most active project in the Hadoop ecosystem.
    • Commits, past 12 months: 22%
    • Source: Cloudera

  • 14. Spark's Key Capabilities

  • 15. Spark 1.0 Machine Learning
    • Linear Regression
    • Logistic Regression
    • Linear Support Vector Machine
    • Regularization
    • Decision Trees
    • Naive Bayes
    • Alternating Least Squares
    • K-Means Plus-Plus
    • Singular Value Decomposition
    • Principal Components Analysis
    • Stochastic Gradient Descent
    • L-BFGS
    • Spark project expects to double supported techniques in 1.1 (August 2014).

  • 16. Spark SQL
    • Currently most active project
    • Supports fast interactive queries
    • Hive-compatible
    • Works with Hive data
    • Runs unmodified queries
    • Roadmap to support more formats
    • Will absorb Shark project

  • 17. Spark Streaming
    • Supports analysis of data streams in real time
    • Unifies streaming and batch data
    • Integrates with popular data sources: HDFS, Flume, Kafka, Twitter
    • Easy to use
    • Fault tolerant

  • 18. Spark Graph Analytics
    • Currently Alpha release
    • Unifies graph-parallel and data-parallel computing under single API
    • Performance parity with Giraph
    • Replaces Spark Bagel (Pregel on Spark)

  • 19. Spark Performance
    • Machine Learning: 100x faster than MapReduce
    • Queries (Shark): comparable to Impala; 100x faster than Hive
    • Streaming: 2X throughput of Storm
    • Graph (GraphX): comparable to Giraph; 10X faster than MapReduce

  • 20. Spark Distributions
    • Every major Hadoop distribution, plus:
    • Connector
    • Interface to HANA
    • Big Data Appliance

  • 21. Programming Interfaces
    • Supported APIs
    • Alpha Release: SparkR
    • Spark project expects to release production-grade R interface early 2015.

  • 22. Spark Users

  • 23. Certified on Spark

  • 24. Who is Databricks?
    • Commercial venture, launched 2013
    • Founded by Spark principals
    • Services and support business model
    • Gatekeepers to Spark
    • Just landed $33M in Series B: Andreessen Horowitz, New Enterprise Associates
    • Just announced Spark Cloud product

  • 25. Thank You