Is Spark Replacing Hadoop

Is Spark Replacing MapReduce? Hadoop? Keys Botzum, Senior Principal Technologist, MapR Technologies. March 2016. Last update: March 29, 2016. © 2016 MapR Technologies

Transcript of Is Spark Replacing Hadoop

Page 1: Is Spark Replacing Hadoop

Is Spark Replacing MapReduce? Hadoop? Keys Botzum, Senior Principal Technologist

March 2016 Last update: March 29, 2016

Page 2: Is Spark Replacing Hadoop

Companies With Spark on MapR In Production

Fortune 500 Global Telecom

Fortune 500 Health Care

Global Financial Services

Page 3: Is Spark Replacing Hadoop

Cisco: Security Intelligence Operations

•  Sensor data lands in Hadoop Streaming for real time detection and threat alerts
•  Data next processed on GraphX and Mahout to build threat detection models and accelerated reporting
•  Additional SQL querying for end customer reporting and threat detection

Page 4: Is Spark Replacing Hadoop

Circa 2014 …

Page 5: Is Spark Replacing Hadoop

Next-Gen Genomics

Existing process takes several weeks to align chemical compounds with genes

ADAM on Spark allows realignment in a few hours

Geneticists can minimize engineering dependency

Page 6: Is Spark Replacing Hadoop

Is Spark replacing Hadoop?

Page 7: Is Spark Replacing Hadoop

How about the product manager’s favorite tool: a checkbox list!

•  DAG
•  Persistent Store
•  Machine Learning
•  Graph
•  Streaming
•  Batch SQL
•  Interactive SQL
•  Security
•  Resource Management
•  Multitenancy
•  Others

What about the use case? Fast-changing projects

Analysis Paralysis

Page 8: Is Spark Replacing Hadoop

Wait. What’s Hadoop?

Pluggable data parallel framework
•  Proven MapReduce, Hive, Pig
•  YARN introduces pluggability
•  Allows for multiple frameworks

HDFS and HBase API based Persistent Store
•  Standard for scale out big data store
•  Stores data as files and tables
•  Secure
•  Includes resource management

Page 9: Is Spark Replacing Hadoop

Spark and MapReduce are …
•  Scalable frameworks for executing custom code on a cluster
•  Nodes in the cluster work independently to process fragments of data and also combine those fragments together when appropriate to yield a final result
•  Can tolerate loss of a node during a computation
•  Require a distributed storage layer for common data view
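The "process fragments independently, then combine" idea can be sketched on a single machine in plain Java. This is only an illustration under stated assumptions: worker threads stand in for cluster nodes, the class and method names (`FragmentSum`, `parallelSum`) are hypothetical, and neither fault tolerance nor distributed storage is modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical single-machine stand-in for a data parallel framework.
class FragmentSum {

    // Split the input into fragments, sum each fragment on its own worker
    // thread (a stand-in for a cluster node), then combine the partial sums.
    static long parallelSum(List<Integer> data, int fragments) {
        ExecutorService pool = Executors.newFixedThreadPool(fragments);
        try {
            int size = (data.size() + fragments - 1) / fragments; // ceiling division
            List<Future<Long>> partials = new ArrayList<>();
            for (int start = 0; start < data.size(); start += size) {
                List<Integer> fragment =
                        data.subList(start, Math.min(start + size, data.size()));
                // Each fragment is processed independently of the others.
                partials.add(pool.submit(() -> {
                    long sum = 0;
                    for (int v : fragment) {
                        sum += v;
                    }
                    return sum;
                }));
            }
            // Combine step: merge the partial results into the final answer.
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("fragment processing failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(Arrays.asList(1, 2, 3, 4, 5, 6, 7), 3));
    }
}
```

A real framework additionally reschedules a fragment when the node processing it is lost, which this sketch does not attempt.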

Page 10: Is Spark Replacing Hadoop

What’s MapReduce?
•  Map
    –  Loading of the data and defining a set of keys
•  Reduce
    –  Collects the organized key-based data to process and output
•  Performance can be tweaked based on known details of your source files and cluster shape (size, total number)

Page 11: Is Spark Replacing Hadoop

MapReduce Processing Model
•  Define mappers
•  Shuffling is automatic
•  Define reducers
•  For complex work, chain jobs together
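The three phases of the processing model can be illustrated without a cluster. The following is a minimal in-memory sketch in plain Java, not Hadoop code: no Hadoop APIs are used, the data is an ordinary list of lines, and every name (`MapReduceSketch`, `map`, `shuffle`, `reduce`, `wordCount`) is hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of map -> shuffle -> reduce; all names are hypothetical.
class MapReduceSketch {

    // "Mapper": turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    // "Shuffle": group all emitted values by key.
    // In MapReduce the framework performs this step automatically.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // "Reducer": collapse the values collected for one key into one result.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: run the three phases over an in-memory list of lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(mapped).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("to be or not to be", "to do")));
    }
}
```

Chaining jobs for complex work would correspond, in this sketch, to feeding the output of one `wordCount`-style driver into the mapper of the next.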

Page 12: Is Spark Replacing Hadoop

MapReduce: The Good

•  Built in fault tolerance
•  Optimized IO path
•  Scalable
•  Developer focuses on Map/Reduce, not infrastructure
•  Simple? API

Page 13: Is Spark Replacing Hadoop

MapReduce: The Bad

•  Batch oriented
•  Optimized for disk IO
    –  Doesn’t leverage memory well
    –  Iterative algorithms go through disk IO path again and again
•  Primitive API
    –  Developers have to build on very simple abstraction
    –  Key/Value in/out
    –  Even basic things like join require extensive code
•  Result is often many files that need to be combined appropriately

Page 14: Is Spark Replacing Hadoop

What’s Spark?

Batch Interactive Streaming Framework
•  Powerful API
•  Leverages memory aggressively
•  Batch and streaming

Pluggable Persistent Store
•  MapR-FS, HDFS
•  MapR-DB, HBase, Cassandra
•  MapR-Streams, Kafka
•  S3

Page 15: Is Spark Replacing Hadoop

Apache Spark

•  spark.apache.org

•  Originally developed in 2009 in UC Berkeley’s AMP Lab

•  Fully open sourced in 2010 – now at Apache Software Foundation

Page 16: Is Spark Replacing Hadoop

Spark: Ease of Use and Performance

•  Easy to Develop
    –  Rich APIs in Java, Scala, Python, R
    –  Interactive shell
•  Fast to Run
    –  General execution graphs
    –  In-memory storage

Less code, simpler code

Page 17: Is Spark Replacing Hadoop

Resilient Distributed Datasets (RDD)
•  Spark revolves around RDDs
•  Fault-tolerant, read-only collection of elements that can be operated on in parallel
•  Cached in memory or on disk

http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

A newer API is based around DataFrames, but for this presentation the difference isn’t important

Page 18: Is Spark Replacing Hadoop

RDD Operations - Expressive
•  Transformations
    –  Creation of a new RDD dataset from an existing one
        •  map, filter, distinct, union, sample, groupByKey, join, etc.
•  Actions
    –  Return a value after running a computation
        •  collect, count, first, takeSample, foreach, reduce, etc.

Check the documentation for a complete list
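As a loose analogy only, the Java standard library's `java.util.stream` API has a similar split: intermediate operations such as `map`, `filter`, and `distinct` each produce a new stream, while terminal operations such as `collect` and `count` return a value. The sketch below uses plain Java streams, not Spark; the class and method names are hypothetical, and a stream is a single-use, single-machine pipeline rather than a distributed, cacheable RDD.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Java streams as an analogy for RDD transformations and actions; not Spark code.
class TransformVsAction {

    // map, filter and distinct play the role of transformations: each yields a
    // new stream. collect plays the role of an action: it returns a value.
    static List<String> distinctLongWords(List<String> words, int minLength) {
        return words.stream()
                .map(String::toLowerCase)
                .filter(w -> w.length() >= minLength)
                .distinct()
                .collect(Collectors.toList());
    }

    // count is another action-like terminal operation.
    static long countStartingWith(List<String> words, String prefix) {
        return words.stream()
                .filter(w -> w.startsWith(prefix))
                .count();
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Spark", "hadoop", "spark", "Pig");
        System.out.println(distinctLongWords(words, 5));
        System.out.println(countStartingWith(words, "S"));
    }
}
```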

Page 19: Is Spark Replacing Hadoop

Easy: Example – Word Count

•  Hadoop MapReduce Java

    // Uses the older org.apache.hadoop.mapred API; imports and the enclosing
    // job class are not shown on the slide.
    public static class WordCountMapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emit (word, 1) for every token in the input line.
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class WordCountReduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        // Sum the counts collected for one word.
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

•  Spark Java

    JavaRDD<String> textFile = sc.textFile("hdfs://...");
    // Split each line into words.
    JavaRDD<String> words = textFile.flatMap(new FlatMapFunction<String, String>() {
        public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
    });
    // Pair each word with a count of 1.
    JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
        public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
    });
    // Add up the counts per word.
    JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer a, Integer b) { return a + b; }
    });
    counts.saveAsTextFile("hdfs://...");

•  Spark Scala

    val textFile = sc.textFile("hdfs://...")
    val counts = textFile.flatMap(line => line.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")

Source: http://spark.apache.org/examples.html#

Page 20: Is Spark Replacing Hadoop

Faster for Iterative: PageRank Performance

[Bar chart: iteration time (s) versus number of machines (30, 60), comparing Hadoop and Spark; data labels 171, 80, 23, 14]

Page 21: Is Spark Replacing Hadoop

Spark vs. MapReduce
•  Spark is faster than MR for iterative algorithms that fit data in memory
•  Spark code is easier to write and easier to understand than MR
    –  Your programming is closer to the correct abstraction
•  Spark supports batch and streaming model
•  Advantage Spark
    –  Caution: not all applications run faster on Spark and Spark may have limitations for some scenarios

Page 22: Is Spark Replacing Hadoop

Is Spark replacing Hadoop?

Is Spark replacing MapReduce? Quite possibly ... with time ... with caveats

Page 23: Is Spark Replacing Hadoop

Hadoop is more than MapReduce

Spark
•  Unified Easy Batch Interactive Streaming Framework
•  Pluggable Persistent Store
•  Needs a resource manager

Hadoop
•  Pluggable data parallel framework
•  HDFS and HBase Persistent Store
•  Includes a resource manager (YARN)

Page 24: Is Spark Replacing Hadoop

Hadoop Supports so Much
•  Alternative batch models: Pig, Cascading, Spark
•  Machine learning: Mahout, SparkML
•  SQL: Hive, Drill, Hive on Tez, Impala, SparkSQL
•  Stream processing: Storm, Flink, Spark, DataTorrent
•  ETL: Sqoop, Flume
•  Storage: file (HDFS/MapR-FS), table (HBase/MapR-DB/Accumulo), messaging (Kafka/MapR-Streams)
•  Data exploration: Hue
•  And too many excellent commercial tools to list

•  Hypothesis:
    –  Infrastructure and data tend to be sticky while execution frameworks evolve rapidly
    –  Hadoop’s infrastructure and storage support a vigorous and growing ecosystem of “competing” execution engines

Page 25: Is Spark Replacing Hadoop

Perspective

Unified Easy Batch Interactive Streaming Framework

Pluggable data parallel framework

HDFS and HBase Persistent Store

Interactive SQL (Drill, Impala, Hive.next)

Streaming (Flink, Storm, DataTorrent)

RDBMS (e.g. SpliceMachine)

Ecosystem SLA (YARN resource reservation, distro mgmt tools, Pepperdata, …)

Security (Drill Views, Ranger, Sentry, BlueTalon…)

Data Wrangling, discovery and governance (Trifacta, Paxata, Waterline…)

Page 26: Is Spark Replacing Hadoop

Perspective

Unified Easy Batch Interactive Streaming Framework

Pluggable Persistent Store

Ecosystem/Environments
Resource Management – YARN, Mesos, Kubernetes
Deployment – Private OpenStack, Public Cloud, Hybrid

NoSQL/Search (Cassandra, ES)

In Mem (SAP Hana, MemSQL)

RDBMS (MySQL, Oracle, etc.)

Hadoop (HBase, HDFS)

Page 27: Is Spark Replacing Hadoop

Which is More Realistic?

What about classic applications and data sharing?

Spark becomes primary execution framework

Hadoop remains primary storage and execution framework

Page 28: Is Spark Replacing Hadoop

Is Spark replacing MapReduce? Quite possibly ... with time ... with caveats

Is Spark replacing Hadoop? Seems improbable. Hadoop grows to embrace new execution frameworks

Page 30: Is Spark Replacing Hadoop

MapR Platform Services: Open API Architecture Assures Interoperability, Avoids Lock-in

HDFS API

POSIX NFS

SQL, HBase API

JSON API

Kafka API

Page 31: Is Spark Replacing Hadoop

Q & A

maprtech

[email protected]

Engage with us!

MapR

maprtech

mapr-technologies

Page 32: Is Spark Replacing Hadoop

References
•  Spark vs. MapReduce:
    –  https://www.mapr.com/blog/apache-spark-vs-mapreduce-whiteboard-walkthrough
    –  http://www.vldb.org/pvldb/vol8/p2110-shi.pdf
    –  http://aptuz.com/blog/is-apache-spark-going-to-replace-hadoop/
•  Spark: http://spark.apache.org/
•  Spark on MapR: http://maprdocs.mapr.com/51/index.html#Spark/Spark_26984599.html