Apache Spark Briefing

Apache Spark: The Emerging Platform for Distributed Analytics. July 2014. Thomas W. Dinsmore

Apache Spark is rapidly emerging as the prime platform for advanced analytics in Hadoop. This briefing is updated to reflect news and announcements as of July 2014.

Page 1: Apache Spark Briefing

Apache Spark

The Emerging Platform for Distributed Analytics

July 2014

Thomas W. Dinsmore

Page 2: Apache Spark Briefing

What is Apache Spark?

• Distributed in-memory analytics engine

• Runs in standalone clusters or Hadoop

• Fully compatible with Hadoop storage APIs

• Runs under YARN

• Top-level Apache project

• Supported in all major Hadoop distros

• Open source and vendor neutral

Page 3: Apache Spark Briefing

Spark Timeline

[Timeline, 2009 to 2014, with these milestones:]

• Project begins

• Open sourced

• Spark Summit 2013

• Apache Incubator

• Apache Top-Level

• Cloudera support

• MapR support

• Hortonworks support

• SAP support

News cascade starting late last year.

Page 4: Apache Spark Briefing

What problems does Spark solve?

Page 5: Apache Spark Briefing

Problem #1: MapReduce I/O sandbags runtime for advanced analytics.

• Advanced analytics often requires multiple passes through data

• MapReduce must persist results after each pass through data

[Diagram: Hadoop Storage, Compute, Store, Hadoop Storage]

Page 6: Apache Spark Briefing

Spark Vision: Distributed in-memory platform


Intermediate results stay in memory.

Up to 100X performance improvement for iterative algorithms.

[Diagram: Hadoop Storage, Compute, Compute, Compute]
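The contrast between re-reading storage on every pass and keeping the working set in memory can be sketched in a few lines. This is a plain-Python illustration, not Spark or MapReduce code: `CountingStorage`, `disk_bound_run`, and `in_memory_run` are hypothetical names invented here, and the sketch models only the repeated-read side of the I/O cost.

```python
# Plain-Python illustration only; none of these names come from Spark or Hadoop.


class CountingStorage:
    """Hypothetical stand-in for Hadoop storage that counts full scans."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def read_all(self):
        self.reads += 1
        return list(self._records)


def iterative_mean(load, passes, rate=0.5):
    """Estimate the mean by gradient steps; each pass scans the whole data set."""
    estimate = 0.0
    for _ in range(passes):
        data = load()
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= rate * gradient
    return estimate


def disk_bound_run(storage, passes):
    # MapReduce-style assumption: every pass goes back to storage.
    return iterative_mean(storage.read_all, passes)


def in_memory_run(storage, passes):
    # In-memory assumption: read once, then reuse the cached working set.
    cached = storage.read_all()
    return iterative_mean(lambda: cached, passes)
```

Both runs return the same estimate; only the number of storage scans differs (one per pass versus one in total).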

Page 7: Apache Spark Briefing

Problem #2: Many “point” solutions for advanced analytics in Hadoop

• Machine Learning

• Queries

• Graph Analytics

• Streaming Analytics

Page 8: Apache Spark Briefing

Spark Vision: single integrated platform for advanced analytics in Hadoop.

• Simplified administration

• Integrated results

Page 9: Apache Spark Briefing

How important is Spark?

Page 10: Apache Spark Briefing

Mike Olson, Cloudera:

“The leading candidate for ‘successor to MapReduce’ today is Apache Spark.”

Page 11: Apache Spark Briefing

M.C. Srivas, MapR:

“We believe Spark on Hadoop is a game changer for any business.”

Page 12: Apache Spark Briefing

Ben Lorica, O’Reilly Media:

“The number of companies that are using Spark in production has exploded over the last year.”

Page 13: Apache Spark Briefing

Apache Spark is the most active project in the Hadoop ecosystem.

[Chart: Commits, Past 12 Months. Source: Cloudera]


Page 14: Apache Spark Briefing

Spark’s Key Capabilities

Page 15: Apache Spark Briefing

Spark 1.0 Machine Learning

• Linear Regression

• Logistic Regression

• Linear Support Vector Machine

• Regularization

• Decision Trees

• Naive Bayes

• Alternating Least Squares

• K-Means Plus-Plus

• Singular Value Decomposition

• Principal Components Analysis

• Stochastic Gradient Descent


The Spark project expects to double the number of supported techniques in 1.1 (August 2014).
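To make two of the listed techniques concrete, here is a minimal single-machine sketch of logistic regression trained by stochastic gradient descent, with an optional L2 regularization term. It is plain Python and does not use or mirror the Spark machine learning API; the function names and the parameters `epochs`, `rate`, and `l2` are assumptions made for this illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_sgd(rows, labels, epochs=200, rate=0.1, l2=0.0, seed=0):
    """Fit weights and bias one example at a time (stochastic gradient descent)."""
    rng = random.Random(seed)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a fresh random order
        for i in order:
            z = bias + sum(w * x for w, x in zip(weights, rows[i]))
            error = sigmoid(z) - labels[i]  # gradient of log loss w.r.t. z
            # l2 * w is the gradient of the (l2 / 2) * w^2 penalty.
            weights = [w - rate * (error * x + l2 * w)
                       for w, x in zip(weights, rows[i])]
            bias -= rate * error
    return weights, bias


def predict(weights, bias, row):
    z = bias + sum(w * x for w, x in zip(weights, row))
    return 1 if sigmoid(z) >= 0.5 else 0
```

A distributed implementation would spread the gradient computation across partitions of the data; this sketch shows only the update rule.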

Page 16: Apache Spark Briefing

Spark SQL

• Currently most active project

• Supports fast interactive queries

• Hive-compatible

• Works with Hive data

• Runs unmodified queries

• Roadmap to support more formats

• Will absorb Shark project

Page 17: Apache Spark Briefing

Spark Streaming

• Supports analysis of data streams in real time

• Unifies streaming and batch data

• Integrates with popular data sources:

  • Flume

  • Kafka

  • Twitter

• Easy to use

• Fault tolerant
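One way to picture "unifies streaming and batch data": Spark Streaming is generally described as treating a stream as a series of small batches, so the same function can serve both cases. The sketch below assumes that model and is plain Python, not the Spark Streaming API; `count_words`, `micro_batches`, and `streaming_word_count` are hypothetical names.

```python
from collections import Counter
from itertools import islice


def count_words(lines):
    """Batch logic: count words in a finite collection of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def micro_batches(stream, batch_size):
    """Cut an iterator of lines into consecutive small batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def streaming_word_count(stream, batch_size):
    """Reuse the batch function on each micro-batch and keep running totals."""
    running = Counter()
    for batch in micro_batches(stream, batch_size):
        running.update(count_words(batch))
        yield Counter(running)  # snapshot after each micro-batch
```

Once the stream is exhausted, the final snapshot equals what `count_words` returns for the same lines processed as one batch.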

Page 18: Apache Spark Briefing

Spark Graph Analytics

• Currently Alpha release

• Unifies graph-parallel and data-parallel computing under single API

• Performance parity with Giraph

• Replaces Spark Bagel (Pregel on Spark)
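For readers new to the Pregel model mentioned above, the following is a minimal sketch of vertex-centric computation: in each superstep a vertex reads its messages, updates its state, and messages its neighbors. It finds connected components by propagating the smallest vertex id. This is plain Python, not the GraphX or Bagel API, and `pregel_min_label` is a hypothetical name.

```python
def pregel_min_label(edges, max_supersteps=50):
    """Label each vertex with the smallest vertex id in its connected component."""
    # Build an undirected adjacency map from the edge list.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    labels = {v: v for v in neighbors}

    # Superstep 0: every vertex sends its own id to its neighbors.
    inbox = {v: [] for v in neighbors}
    for v, adjacent in neighbors.items():
        for n in adjacent:
            inbox[n].append(labels[v])

    for _ in range(max_supersteps):
        outbox = {v: [] for v in neighbors}
        active = False
        for v, messages in inbox.items():
            # A vertex changes state only when it hears a smaller label.
            if messages and min(messages) < labels[v]:
                labels[v] = min(messages)
                for n in neighbors[v]:
                    outbox[n].append(labels[v])
                active = True
        if not active:
            break  # no vertex changed, so the computation has converged
        inbox = outbox
    return labels
```

For the edges `[(1, 2), (2, 3), (4, 5)]`, vertices 1, 2, and 3 end with label 1, and vertices 4 and 5 end with label 4.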

Page 19: Apache Spark Briefing

Spark Performance

Machine Learning

• 100X faster than MapReduce

Queries (Shark)

• Comparable to Impala

• 100X faster than Hive

Streaming

• 2X throughput of Storm

Graph (GraphX)

• Comparable to Giraph

• 10X faster than MapReduce

Page 20: Apache Spark Briefing

Spark Distributions

Every major Hadoop distribution, plus…

• Interface to HANA

• Big Data Appliance

Page 21: Apache Spark Briefing

Programming Interfaces

[Figure: interfaces grouped under “Supported APIs” and “Alpha” Release]

The Spark project expects to release a production-grade R interface in early 2015.


Page 22: Apache Spark Briefing

Spark Users

Page 23: Apache Spark Briefing

Certified on Spark

Page 24: Apache Spark Briefing

Who is Databricks?

• Commercial venture, launched 2013

• Founded by Spark principals

• Services and support business model

• Gatekeepers to Spark

• Just landed $33M in Series B

  • Andreessen Horowitz

  • New Enterprise Associates

• Just announced Spark Cloud product

Page 25: Apache Spark Briefing

Thank You