Why Spark on Hadoop Matters

MC Srivas, CTO and Founder, MapR Technologies. Apache Spark Summit - July 1, 2014. © 2014 MapR Technologies

Transcript of Why Spark on Hadoop Matters

Page 1: Why Spark on Hadoop Matters

Why Spark on Hadoop Matters

MC Srivas, CTO and Founder, MapR Technologies

Apache Spark Summit - July 1, 2014

Page 2: Why Spark on Hadoop Matters

MapR Overview

• Top Ranked
• Exponential Growth
• 500+ Customers
• Cloud Leaders

• 3X bookings, Q1 '13 – Q1 '14
• 80% of accounts expand 3X
• 90% software licenses
• < 1% lifetime churn
• > $1B in incremental revenue generated by 1 customer

Page 3: Why Spark on Hadoop Matters

Rapidly Evolving Landscape

[Diagram: the Apache Hadoop and OSS ecosystem running on the MapR Data Platform, with Management and Security layers.]

Sections: Execution Engines; Data Governance and Operations

Categories: Batch; Streaming; NoSQL & Search; ML, Graph; SQL; Provision; Workflow & Data Gov.; Data Integration & Access

Components: YARN, MR v1 & v2, Pig, Cascading, Spark, Tez*, Spark Streaming, Storm*, HBase, Solr, Accumulo*, Mahout, MLlib, GraphX, Hive, Impala, Shark, Drill*, Juju, Savannah*, Whirr, Oozie, Falcon*, ZooKeeper, Sqoop, Flume, HttpFS, Hue, Sentry*, Knox*

* 2014 timeline

Page 4: Why Spark on Hadoop Matters

The Complete Spark Stack on Hadoop

[Diagram: the Apache Hadoop and OSS ecosystem running on the MapR Data Platform, with Management and Security layers.]

Sections: Execution Engines; Data Governance and Operations

Categories: Batch; Streaming; NoSQL & Search; ML, Graph; SQL; Provision; Workflow & Data Gov.; Data Integration & Access

Components: YARN, MR v1 & v2, Pig, Cascading, Spark, Tez*, Spark Streaming, Storm*, HBase, Solr, Accumulo*, Mahout, MLlib, GraphX, Hive, Impala, Shark, Drill*, Juju, Savannah*, Whirr, Oozie, Falcon*, ZooKeeper, Sqoop, Flume, HttpFS, Hue, Sentry*, Knox*

* 2014 timeline

Page 5: Why Spark on Hadoop Matters

A Winning Combination

Page 6: Why Spark on Hadoop Matters

Spark Advantages:

IN-MEMORY PERFORMANCE
• RDDs
• DAGs Unify Processing

EASE OF DEVELOPMENT
• Easier APIs
• Python, Scala, Java

COMBINE WORKFLOWS
• Shark, ML, Streaming, GraphX
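
The RDD and DAG bullets on this slide refer to Spark's model of recording transformations as a lineage and executing them only when a result is requested. The following is a minimal pure-Python sketch of that idea, not Spark's API: the class `LazyDataset` and its methods are hypothetical names chosen to resemble RDD operations, and everything runs in a single process.

```python
import functools


class LazyDataset:
    """Toy stand-in for an RDD: a source plus a chain of pending steps."""

    def __init__(self, source, steps=()):
        self._source = list(source)
        self._steps = tuple(steps)  # recorded lineage, not yet executed

    def map(self, fn):
        # Transformation: only recorded, nothing is computed here.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: only recorded, nothing is computed here.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def lineage(self):
        # Names of the recorded steps, in execution order.
        return [kind for kind, _ in self._steps]

    def _run(self):
        # Replay the recorded steps over the source as one pipeline.
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    def collect(self):
        # Action: triggers execution of the whole recorded chain.
        return list(self._run())

    def reduce(self, fn):
        # Action: executes the chain and folds the results into one value.
        return functools.reduce(fn, self._run())


events = LazyDataset(range(10))
evens_squared = events.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

print(evens_squared.lineage())                   # ['filter', 'map']
print(evens_squared.collect())                   # [0, 4, 16, 36, 64]
print(evens_squared.reduce(lambda a, b: a + b))  # 120
```

Because each transformation returns a new object that only extends the recorded chain, the original dataset stays unchanged and the same lineage can be replayed for different actions.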

Page 7: Why Spark on Hadoop Matters

Hadoop Advantages:

UNLIMITED SCALE
• Multiple data sources
• Multiple applications
• Multiple users

WIDE RANGE OF APPLICATIONS
• Files
• Databases
• Semi-structured

ENTERPRISE PLATFORM
• Reliability
• Multi-tenancy
• Security

Page 8: Why Spark on Hadoop Matters

The Combination of Spark on Hadoop

IN-MEMORY PERFORMANCE

EASE OF DEVELOPMENT

COMBINE WORKFLOWS

UNLIMITED SCALE

WIDE RANGE OF APPLICATIONS

ENTERPRISE PLATFORM

Operational Applications Augmented by In-Memory Performance

Page 9: Why Spark on Hadoop Matters

Case Studies

Page 10: Why Spark on Hadoop Matters

Industry Leading Ad-Targeting Platform

• High performance analytics over MapR M7 NoSQL

• Load from M7 table into RDD to augment scoring in real-time

• Results fed back to M7 for other applications
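
The flow on this slide (load table rows into an RDD, score them, write the results back) can be sketched without a cluster. The example below uses a plain dict as a stand-in for an M7 table and a list as a stand-in for an RDD. The table layout, the function names, and the click-through-rate score are all hypothetical illustrations and are not taken from the talk.

```python
# Hypothetical stand-in for an M7 table: row key -> column values.
profile_table = {
    "user-1": {"clicks": 12, "impressions": 200},
    "user-2": {"clicks": 0, "impressions": 50},
    "user-3": {"clicks": 30, "impressions": 300},
}


def load_rows(table):
    """Read every row as a (key, columns) pair, like loading a table into an RDD."""
    return list(table.items())


def score_user(columns):
    """Illustrative score: click-through rate, or 0.0 when there are no impressions."""
    impressions = columns["impressions"]
    return columns["clicks"] / impressions if impressions else 0.0


def write_scores(table, scored):
    """Feed results back into the table so other applications can read them."""
    for key, score in scored:
        table[key]["score"] = score


rows = load_rows(profile_table)
scored = [(key, score_user(cols)) for key, cols in rows]
write_scores(profile_table, scored)

for key in sorted(profile_table):
    print(key, profile_table[key])
```

In a real deployment the load and write steps would go through a table connector and the scoring would run in parallel across partitions; the sketch only shows the shape of the read, score, write-back loop.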

Page 11: Why Spark on Hadoop Matters

Leading Pharma Company: NextGen Genomics

Existing process takes several weeks to align chemical compounds with genes

ADAM on Spark allows realignment in a few hours

Geneticists can minimize engineering dependency

Page 12: Why Spark on Hadoop Matters

Cisco: Security Intelligence Operations

Sensor data lands in M7

Spark Streaming on M7 for first check on known threats

Data next processed on GraphX and Mahout

Results queried using SQL via Shark and Impala

Page 13: Why Spark on Hadoop Matters

Insurance Giant: Addressing Health Care Regulations

Patient information in M7 combined with clinical records to compute re-admittance probability

Process uses Spark with transactional data in M7

Insurance options decided in real-time on online portals

Page 14: Why Spark on Hadoop Matters

In Summary

Page 15: Why Spark on Hadoop Matters

Spark on Hadoop gains traction for real-time applications

Page 16: Why Spark on Hadoop Matters

Pick the Right Tool for the Job

Page 17: Why Spark on Hadoop Matters

MapR is Unbiased Open Source (a la Linux)

• Open source distribution is about providing choice
  – Linux includes MySQL, PostgreSQL and SQLite
  – Linux includes Apache httpd, nginx and Lighttpd

|                 | MapR Distribution for Hadoop                                   | Distribution C       | Distribution H         |
| Spark           | Spark (all of it) and Shark                                    | Spark only           | No                     |
| Interactive SQL | Shark, Impala, Drill, Hive/Tez                                 | One option (Impala)  | One option (Hive/Tez)  |
| Versions        | Hive 0.10, 0.11, 0.12, 0.13; Pig 0.11, 0.12; HBase 0.94, 0.98  | One version          | One version            |

Page 18: Why Spark on Hadoop Matters

@mapr maprtech

Engage with us!

MapR

maprtech

mapr-technologies

Thank you