Apache Spark & Hadoop
Transcript of Apache Spark & Hadoop

®© 2014 MapR Technologies 1
®
© 2014 MapR Technologies
Apache Spark
Keys Botzum Senior Principal Technologist, MapR Technologies
June 2014

®© 2014 MapR Technologies 2
Agenda • MapReduce • Apache Spark • How Spark Works • Fault Tolerance and Performance • Examples • Spark and More

®© 2014 MapR Technologies 3
MapR: Best Product, Best Business & Best Customers
Top Ranked · Exponential Growth · 500+ Customers · Cloud Leaders
• 3X bookings Q1 ‘13 – Q1 ‘14
• 80% of accounts expand 3X
• 90% software licenses
• < 1% lifetime churn
• > $1B in incremental revenue generated by 1 customer

®© 2014 MapR Technologies 4 © 2014 MapR Technologies ®
Review: MapReduce

®© 2014 MapR Technologies 5
MapReduce: A Programming Model
• MapReduce: Simplified Data Processing on Large Clusters (published 2004)
• Parallel and distributed algorithm:
  – Data locality
  – Fault tolerance
  – Linear scalability

®© 2014 MapR Technologies 6
MapReduce Basics
• Assumes a scalable distributed file system that shards data
• Map
  – Loading of the data and defining a set of keys
• Reduce
  – Collects the organized key-based data to process and output
• Performance can be tweaked based on known details of your source files and cluster shape (size, total number)

®© 2014 MapR Technologies 7
MapReduce Processing Model
• Define mappers
• Shuffling is automatic
• Define reducers
• For complex work, chain jobs together
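To make the processing model concrete, here is a minimal sketch (not from the original deck) of the same map / shuffle / reduce flow using plain Scala collections rather than Hadoop APIs; the input data is illustrative:

object MapReduceSketch {
  def main(args: Array[String]): Unit = {
    val lines = Seq("spark and hadoop", "spark on hadoop")

    // "Map" phase: emit a (word, 1) pair for every word
    val mapped = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // "Shuffle": group the pairs by key (the framework does this automatically)
    val shuffled = mapped.groupBy(_._1)

    // "Reduce" phase: aggregate the values for each key
    val reduced = shuffled.map { case (word, pairs) => (word, pairs.map(_._2).sum) }

    reduced.foreach(println)   // e.g. (spark,2), (hadoop,2), (and,1), (on,1)
  }
}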

®© 2014 MapR Technologies 8
MapReduce: The Good
• Built-in fault tolerance
• Optimized IO path
• Scalable
• Developer focuses on Map/Reduce, not infrastructure
• Simple(?) API

®© 2014 MapR Technologies 9
MapReduce: The Bad
• Optimized for disk IO
  – Doesn’t leverage memory well
  – Iterative algorithms go through the disk IO path again and again
• Primitive API
  – Developers have to build on a very simple abstraction
  – Key/value in, key/value out
  – Even basic things like join require extensive code
• Result is often many files that need to be combined appropriately

®© 2014 MapR Technologies 10 © 2014 MapR Technologies ®
Apache Spark

®© 2014 MapR Technologies 11
Apache Spark
• spark.apache.org • github.com/apache/spark • [email protected]
• Originally developed in 2009 in UC Berkeley’s AMP Lab
• Fully open sourced in 2010 – now at Apache Software Foundation
- Commercial Vendor Developing/Supporting

®© 2014 MapR Technologies 12
Spark: Easy and Fast Big Data
• Easy to develop
  – Rich APIs in Java, Scala, Python
  – Interactive shell
• Fast to run
  – General execution graphs
  – In-memory storage
• 2-5× less code

®© 2014 MapR Technologies 13
Resilient Distributed Datasets (RDD)
• Spark revolves around RDDs
• Fault-tolerant, read-only collection of elements that can be operated on in parallel
• Cached in memory or on disk
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
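As a minimal illustration of the idea (assuming an existing SparkContext named sc; the data is illustrative, not from the slides):

val nums = sc.parallelize(1 to 1000000)       // RDD partitioned across the cluster
val squares = nums.map(n => n.toLong * n)     // transformation applied in parallel
squares.cache()                               // keep the computed partitions in memory
val total = squares.reduce(_ + _)             // action: computes (and caches) the RDD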

®© 2014 MapR Technologies 14
RDD Operations – Expressive
• Transformations
  – Create a new RDD dataset from an existing one
  – map, filter, distinct, union, sample, groupByKey, join, reduce, etc.
• Actions
  – Return a value after running a computation
  – collect, count, first, takeSample, foreach, etc.
Check the documentation for a complete list
http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-operations
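A short sketch of the transformation/action split, assuming an existing SparkContext sc and an illustrative input path. Transformations only describe a new RDD; the job runs when an action is called:

val lines  = sc.textFile("hdfs:///path/to/input")     // base RDD
val errors = lines.filter(_.contains("ERROR"))        // transformation (lazy)
val words  = errors.flatMap(_.split(" ")).distinct()  // more transformations (still lazy)
val n      = words.count()                            // action: runs the job
val sample = words.take(10)                           // another action on the same lineage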

®© 2014 MapR Technologies 15
Easy: Clean API
• Resilient Distributed Datasets
  – Collections of objects spread across a cluster, stored in RAM or on disk
  – Built through parallel transformations
  – Automatically rebuilt on failure
• Operations
  – Transformations (e.g. map, filter, groupBy)
  – Actions (e.g. count, collect, save)
Write programs in terms of transformations on distributed datasets

®© 2014 MapR Technologies 16
Easy: Expressive API
• map • reduce

®© 2014 MapR Technologies 17
Easy: Expressive API
• map
• filter
• groupBy
• sort
• union
• join
• leftOuterJoin
• rightOuterJoin
• reduce
• count
• fold
• reduceByKey
• groupByKey
• cogroup
• cross
• zip
• sample
• take
• first
• partitionBy
• mapWith
• pipe
• save
• ...
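A small sketch (illustrative data, assuming an existing SparkContext sc) combining several of the operators listed above on key/value RDDs:

import org.apache.spark.SparkContext._   // pair-RDD operations (reduceByKey, join)

val pageViews = sc.parallelize(Seq(("home", 3), ("cart", 1), ("home", 2)))
val pageOwner = sc.parallelize(Seq(("home", "web-team"), ("cart", "shop-team")))

val viewsPerPage = pageViews.reduceByKey(_ + _)            // ("home", 5), ("cart", 1)
val joined       = viewsPerPage.join(pageOwner)            // ("home", (5, "web-team")), ...
val busyPages    = joined.filter { case (_, (views, _)) => views > 1 }
busyPages.collect().foreach(println)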

®© 2014 MapR Technologies 18
Easy: Example – Word Count
• Hadoop MapReduce:

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

• Spark:

val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")


®© 2014 MapR Technologies 20
Easy: Works Well With Hadoop
• Data compatibility
  – Access your existing Hadoop data
  – Use the same data formats
  – Adheres to data locality for efficient processing
• Deployment models
  – “Standalone” deployment
  – YARN-based deployment
  – Mesos-based deployment
  – Deploy on an existing Hadoop cluster or side-by-side

®© 2014 MapR Technologies 21
Easy: User-Driven Roadmap
• Language support
  – Improved Python support
  – SparkR
  – Java 8
  – Integrated schema and SQL support in Spark’s APIs
• Better ML
  – Sparse data support
  – Model evaluation framework
  – Performance testing

®© 2014 MapR Technologies 22
Example: Logistic Regression
data = spark.textFile(...).map(readPoint).cache()
w = numpy.random.rand(D)
for i in range(iterations):
    gradient = data.map(lambda p:
        (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
    ).reduce(lambda x, y: x + y)
    w -= gradient
print "Final w: %s" % w

®© 2014 MapR Technologies 23
Fast: Logistic Regression Performance
[Chart: running time (s) vs. number of iterations (1-30), Hadoop vs. Spark]
Hadoop: ~110 s per iteration
Spark: first iteration 80 s, further iterations 1 s

®© 2014 MapR Technologies 24
Easy: Multi-language Support

Python:
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("error"); }
}).count();

®© 2014 MapR Technologies 25
Easy: Interactive Shell
Scala-based shell:
% /opt/mapr/spark/spark-0.9.1/bin/spark-shell
scala> val logs = sc.textFile("hdfs:///user/keys/logdata")
scala> logs.count()
...
res0: Long = 232681
scala> logs.filter(l => l.contains("ERROR")).count()
...
res1: Long = 205
Python-based shell as well: pyspark

®© 2014 MapR Technologies 26 © 2014 MapR Technologies ®
Fault Tolerance and Performance

®© 2014 MapR Technologies 27
Fast: Using RAM, Operator Graphs
• In-memory caching
  – Data partitions read from RAM instead of disk
• Operator graphs
  – Scheduling optimizations
  – Fault tolerance
[Diagram: operator graph (map, filter, groupBy, join) over RDDs, split by the scheduler into stages 1-3; cached partitions marked]

®© 2014 MapR Technologies 28
Directed Acyclic Graph (DAG)
• Directed – only in a single direction
• Acyclic – no looping
• This supports fault tolerance
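A minimal sketch of inspecting that DAG: each RDD records the lineage of transformations that produced it, which Spark replays to rebuild lost partitions (assumes an existing SparkContext sc; the path is illustrative):

val lines  = sc.textFile("hdfs:///path/to/logs")
val errors = lines.filter(_.contains("ERROR"))
val fields = errors.map(_.split("\t"))
println(fields.toDebugString)   // prints this RDD's lineage graph for inspection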

®© 2014 MapR Technologies 29
Easy: Fault Recovery
RDDs track lineage information that can be used to efficiently recompute lost data
msgs = textFile.filter(lambda s: s.startswith("ERROR"))
               .map(lambda s: s.split("\t")[2])
[Diagram: HDFS file → filter(func = startswith(...)) → filtered RDD → map(func = split(...)) → mapped RDD]

®© 2014 MapR Technologies 30
RDD Persistence / Caching
• Variety of storage levels
  – MEMORY_ONLY (default), MEMORY_AND_DISK, etc.
• API calls
  – persist(StorageLevel)
  – cache() – shorthand for persist(StorageLevel.MEMORY_ONLY)
• Considerations
  – Read from disk vs. recompute (MEMORY_AND_DISK)
  – Total memory storage size (MEMORY_ONLY_SER)
  – Replicate to a second node for faster fault recovery (MEMORY_ONLY_2); consider this option if supporting a time-sensitive client
http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-persistence
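A hedged sketch of the persistence calls described above (assumes an existing SparkContext sc; the path is illustrative):

import org.apache.spark.storage.StorageLevel

val errors = sc.textFile("hdfs:///path/to/logs").filter(_.contains("ERROR"))

errors.cache()                                        // shorthand for MEMORY_ONLY
// Alternatives, chosen per the considerations above (use one per RDD):
// errors.persist(StorageLevel.MEMORY_AND_DISK)       // spill to disk rather than recompute
// errors.persist(StorageLevel.MEMORY_ONLY_SER)       // store serialized to save space
// errors.persist(StorageLevel.MEMORY_ONLY_2)         // replicate for faster fault recovery

errors.count()   // first action materializes and caches the partitions
errors.count()   // later actions reuse the cache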

®© 2014 MapR Technologies 31
PageRank Performance
[Chart: iteration time (s) vs. number of machines (30, 60)]
Hadoop: 171 s with 30 machines, 80 s with 60 machines
Spark: 23 s with 30 machines, 14 s with 60 machines

®© 2014 MapR Technologies 32
Other Iterative Algorithms
[Chart: time per iteration (s), Hadoop vs. Spark]
Logistic Regression: Hadoop 110 s, Spark 0.96 s
K-Means Clustering: Hadoop 155 s, Spark 4.1 s

®© 2014 MapR Technologies 33
Fast: Scaling Down
[Chart: execution time (s) vs. % of working set in cache]
Cache disabled: 69 s; 25%: 58 s; 50%: 41 s; 75%: 30 s; fully cached: 12 s

®© 2014 MapR Technologies 34
Comparison to Storm
• Higher throughput than Storm
  – Spark Streaming: 670k records/sec/node
  – Storm: 115k records/sec/node
  – Commercial systems: 100-500k records/sec/node
[Charts: throughput per node (MB/s) vs. record size (100 and 1000 bytes) for WordCount and Grep, Spark Streaming vs. Storm]
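For orientation, a minimal Spark Streaming word-count sketch of the micro-batch model being benchmarked above (host, port, and batch interval are illustrative; API as in Spark ~1.x):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount")
    val ssc  = new StreamingContext(conf, Seconds(1))      // 1-second micro-batches

    val lines  = ssc.socketTextStream("localhost", 9999)   // text stream from a socket
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()                                          // show a sample of each batch

    ssc.start()
    ssc.awaitTermination()
  }
}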

®© 2014 MapR Technologies 35 © 2014 MapR Technologies ®
How Spark Works

®© 2014 MapR Technologies 36
Working With RDDs

®© 2014 MapR Technologies 37
Working With RDDs
RDD
textFile = sc.textFile("SomeFile.txt")

®© 2014 MapR Technologies 38
Working With RDDs
RDD RDD RDD RDD
Transformations
textFile = sc.textFile("SomeFile.txt")
linesWithSpark = textFile.filter(lambda line: "Spark" in line)

®© 2014 MapR Technologies 39
Working With RDDs
RDD RDD RDD RDD
Transformations
Action Value
textFile = sc.textFile("SomeFile.txt")
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
linesWithSpark.count()   # 74
linesWithSpark.first()   # Apache Spark

®© 2014 MapR Technologies 40 © 2014 MapR Technologies ®
Example: Log Mining

®© 2014 MapR Technologies 41
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

®© 2014 MapR Technologies 42
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
Worker
Worker
Worker
Driver

®© 2014 MapR Technologies 43
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
Worker
Worker
Worker
Driver
lines = spark.textFile(“hdfs://...”)

®© 2014 MapR Technologies 44
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
Worker
Worker
Worker
Driver
lines = spark.textFile(“hdfs://...”)
Base RDD

®© 2014 MapR Technologies 45
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(lambda s: s.startswith(“ERROR”))
Worker
Worker
Worker
Driver

®© 2014 MapR Technologies 46
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(lambda s: s.startswith(“ERROR”))
Worker
Worker
Worker
Driver
Transformed RDD

®© 2014 MapR Technologies 47
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(lambda s: s.startswith(“ERROR”))
messages = errors.map(lambda s: s.split(“\t”)[2])
messages.cache()
Worker
Worker
Worker
Driver
messages.filter(lambda s: “mysql” in s).count()

®© 2014 MapR Technologies 48
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(lambda s: s.startswith(“ERROR”))
messages = errors.map(lambda s: s.split(“\t”)[2])
messages.cache()
Worker
Worker
Worker
Driver
messages.filter(lambda s: “mysql” in s).count() Action

®© 2014 MapR Technologies 49
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(lambda s: s.startswith(“ERROR”))
messages = errors.map(lambda s: s.split(“\t”)[2])
messages.cache()
Worker
Worker
Worker
Driver
messages.filter(lambda s: “mysql” in s).count()
Block 1
Block 2
Block 3

®© 2014 MapR Technologies 50
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(lambda s: s.startswith(“ERROR”))
messages = errors.map(lambda s: s.split(“\t”)[2])
messages.cache()
Worker
Worker
Worker messages.filter(lambda s: “mysql” in s).count()
Block 1
Block 2
Block 3
Driver tasks
tasks
tasks

®© 2014 MapR Technologies 51
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(lambda s: s.startswith(“ERROR”))
messages = errors.map(lambda s: s.split(“\t”)[2])
messages.cache()
Worker
Worker
Worker messages.filter(lambda s: “mysql” in s).count()
Block 1
Block 2
Block 3
Driver
Read HDFS Block
Read HDFS Block
Read HDFS Block

®© 2014 MapR Technologies 52
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(lambda s: s.startswith(“ERROR”))
messages = errors.map(lambda s: s.split(“\t”)[2])
messages.cache()
Worker
Worker
Worker messages.filter(lambda s: “mysql” in s).count()
Block 1
Block 2
Block 3
Driver
Cache 1
Cache 2
Cache 3
Process & Cache Data
Process & Cache Data
Process & Cache Data

®© 2014 MapR Technologies 53
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(lambda s: s.startswith(“ERROR”))
messages = errors.map(lambda s: s.split(“\t”)[2])
messages.cache()
Worker
Worker
Worker messages.filter(lambda s: “mysql” in s).count()
Block 1
Block 2
Block 3
Driver
Cache 1
Cache 2
Cache 3
results
results
results

®© 2014 MapR Technologies 54
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(lambda s: s.startswith(“ERROR”))
messages = errors.map(lambda s: s.split(“\t”)[2])
messages.cache()
Worker
Worker
Worker messages.filter(lambda s: “mysql” in s).count()
Block 1
Block 2
Block 3
Driver
Cache 1
Cache 2
Cache 3
messages.filter(lambda s: “php” in s).count()

®© 2014 MapR Technologies 55
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(lambda s: s.startswith(“ERROR”))
messages = errors.map(lambda s: s.split(“\t”)[2])
messages.cache()
Worker
Worker
Worker messages.filter(lambda s: “mysql” in s).count()
Block 1
Block 2
Block 3
Cache 1
Cache 2
Cache 3
messages.filter(lambda s: “php” in s).count()
tasks
tasks
tasks
Driver

®© 2014 MapR Technologies 56
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(lambda s: s.startswith(“ERROR”))
messages = errors.map(lambda s: s.split(“\t”)[2])
messages.cache()
Worker
Worker
Worker messages.filter(lambda s: “mysql” in s).count()
Block 1
Block 2
Block 3
Cache 1
Cache 2
Cache 3
messages.filter(lambda s: “php” in s).count()
Driver
Process from Cache
Process from Cache
Process from Cache

®© 2014 MapR Technologies 57
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(lambda s: s.startswith(“ERROR”))
messages = errors.map(lambda s: s.split(“\t”)[2])
messages.cache()
Worker
Worker
Worker messages.filter(lambda s: “mysql” in s).count()
Block 1
Block 2
Block 3
Cache 1
Cache 2
Cache 3
messages.filter(lambda s: “php” in s).count()
Driver results
results
results

®© 2014 MapR Technologies 58
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(lambda s: s.startswith(“ERROR”))
messages = errors.map(lambda s: s.split(“\t”)[2])
messages.cache()
Worker
Worker
Worker messages.filter(lambda s: “mysql” in s).count()
Block 1
Block 2
Block 3
Cache 1
Cache 2
Cache 3
messages.filter(lambda s: “php” in s).count()
Driver
Cache your data ⇒ faster results
Full-text search of Wikipedia
• 60 GB on 20 EC2 machines
• 0.5 s from cache vs. 20 s on disk

®© 2014 MapR Technologies 59 © 2014 MapR Technologies ®
Example: Page Rank

®© 2014 MapR Technologies 60
Example: PageRank • Good example of a more complex algorithm
– Multiple stages of map & reduce
• Benefits from Spark’s in-memory caching – Multiple iterations over the same data

®© 2014 MapR Technologies 61
Basic Idea
Give pages ranks (scores) based on links to them
• Links from many pages ⇒ high rank
• Link from a high-rank page ⇒ high rank
Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png

®© 2014 MapR Technologies 62
Algorithm
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page’s rank to 0.15 + 0.85 × contribs
[Diagram: example graph, all pages start at rank 1.0]

®© 2014 MapR Technologies 63
Algorithm
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page’s rank to 0.15 + 0.85 × contribs
[Diagram: each page sends contributions of 0.5 or 1.0 along its out-links]

®© 2014 MapR Technologies 64
Algorithm
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page’s rank to 0.15 + 0.85 × contribs
[Diagram: ranks after the first iteration: 1.85, 1.0, 0.58, 0.58]

®© 2014 MapR Technologies 65
Algorithm
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page’s rank to 0.15 + 0.85 × contribs
[Diagram: second-iteration contributions sent along out-links]

®© 2014 MapR Technologies 66
Algorithm
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page’s rank to 0.15 + 0.85 × contribs
[Diagram: ranks after the second iteration: 1.72, 1.31, 0.58, 0.39, ...]

®© 2014 MapR Technologies 67
Algorithm
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page’s rank to 0.15 + 0.85 × contribs
Final state:
[Diagram: final ranks: 1.44, 1.37, 0.73, 0.46]

®© 2014 MapR Technologies 68
Scala Implementation
val links = // load RDD of (url, neighbors) pairs
var ranks = // give each url a rank of 1.0

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).values.flatMap {
    case (urls, rank) =>
      urls.map(dest => (dest, rank / urls.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}

ranks.saveAsTextFile(...)

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkPageRank.scala

®© 2014 MapR Technologies 69 © 2014 MapR Technologies ®
Spark and More

®© 2014 MapR Technologies 70
Easy: Unified Platform
• Spark (general execution engine), with libraries on top:
  – Spark SQL (SQL)
  – Spark Streaming (streaming)
  – MLlib (machine learning)
  – GraphX (graph computation)
Continued innovation bringing new functionality, e.g.:
• BlinkDB (approximate queries)
• SparkR (R wrapper for Spark)
• Tachyon (off-heap RDD caching)
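A hedged sketch of using one of these stacked libraries (MLlib k-means) from the same SparkContext (API as in Spark 1.x MLlib; the path and parameters are illustrative):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("hdfs:///path/to/points.csv")
               .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
               .cache()                                   // iterative algorithm: cache the input

val model = KMeans.train(points, 3, 20)                   // k = 3 clusters, 20 iterations
model.clusterCenters.foreach(println)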

®© 2014 MapR Technologies 71
Spark on MapR
• Certified Spark distribution
• Fully supported and packaged by MapR in partnership with Databricks
  – mapr-spark package with Spark, Shark, Spark Streaming today
  – Spark-Python, GraphX, and MLlib soon
• YARN integration
  – Spark can then allocate resources from the cluster when needed

®© 2014 MapR Technologies 72
References
• Based on slides from Pat McDonough at Databricks
• Spark web site: http://spark.apache.org/
• Spark on MapR:
  – http://www.mapr.com/products/apache-spark
  – http://doc.mapr.com/display/MapR/Installing+Spark+and+Shark

®© 2014 MapR Technologies 73
Q & A
@mapr maprtech
Engage with us!
MapR
maprtech
mapr-technologies