In-memory data processing Apache Spark
Parallel Computing – Lecture 13
Distributed data structures - Apache Spark
Pelle Jakovits
November 2020, Tartu
Outline
• Issues with MapReduce
• Disk based vs In-Memory data processing
• Apache Spark Framework
• Resilient Distributed Datasets (RDD)
– RDD Actions
– Parallel RDD Transformations
– Persistence models
– Fault Recovery
Pelle Jakovits 2/48
MapReduce model
Hadoop Distributed File System
• HDFS is used for data storage, distribution and replication.
Advantages of Hadoop MapReduce
• MapReduce = Map, GroupBy, Sort, Reduce
• Designed for Big Data processing
• Provides:
– Distributed file system
– High scalability
– Automatic parallelization
– Automatic fault recovery
• Data is replicated
• Failed tasks are re-executed on other nodes
Is MapReduce sufficient?
• One of the most used frameworks for large scale data processing
• However, MapReduce is often not used directly, because:
– It is not suitable for prototyping
– A lot of custom code required even for the simplest tasks
– A lot of expertise is needed to optimize MapReduce applications
– Difficult to manage more complex MapReduce chains
Memory vs Disk based processing
• In Hadoop MapReduce all input, intermediate and output data must be written to disk
• Even if the data is significantly reduced, it cannot be kept in memory between the Map and Reduce tasks
In-Memory data processing
• Hadoop MapReduce is not suitable for all types of algorithms
– Iterative algorithms, graph processing, machine learning
• Computationally complex applications benefit from keeping intermediate data in memory
• Keep data in memory between data processing operations
• Input & Output can be disk based file storage systems like HDFS
In-Memory data processing
• Data must fit into the collective memory of the cluster
• Should still support keeping data on disk
– when it would not fit into memory
– for fault tolerance
• Fault tolerance is more complicated
– The whole application is affected when data is only kept in memory
– In Hadoop, input data is replicated in HDFS and readily available, so only the last Map or Reduce task is affected
Apache Spark
• MapReduce-like in-memory data processing framework
• From Map & Reduce -> Map, Join, Co-group, Filter, Distinct, Union, Sample, ReduceByKey, etc
• Directed acyclic graph (DAG) task execution engine
– Users have more control over the data processing execution flow
• Uses the Resilient Distributed Dataset (RDD) abstraction
– Input data is loaded into RDDs
– RDD transformations and user defined functions are applied to define data processing applications
Apache Spark
• More than just a replacement for MapReduce
– Spark works with Scala, Java, Python and R
– Extended with built-in tools for SQL queries, stream processing, ML and graph processing
• Integrated with Hadoop Yarn and HDFS
• Included in many public cloud platforms alongside Hadoop MapReduce
– IBM cloud, Amazon AWS, Google Cloud, Microsoft Azure
MapReduce model
Spark DAG execution flow
http://datastrophic.io/core-concepts-architecture-and-internals-of-apache-spark/
Resilient Distributed Datasets
• Collections of data objects
• Distributed across cluster
• Stored in RAM or Disk
• Immutable/Read-only
• Built through parallel transformations
• Automatically rebuilt on failures
Source: http://horicky.blogspot.com.ee/2013/12/spark-low-latency-massively-parallel.html
Structure of RDDs
• Contains a number of rows
• Rows are divided into partitions
• Partitions are distributed between nodes in the cluster
• A row is a tuple of records, similarly to Apache Pig
• Can contain nested data structures
Source: http://horicky.blogspot.com.ee/2013/12/spark-low-latency-massively-parallel.html
Spark in Java
• A lot of additional boilerplate code related to data and function types
• There are different classes for each Tuple length (Tuple2, … , Tuple9):
Tuple2 pair = new Tuple2(a, b);
pair._1 // => a
pair._2 // => b
• In Java 8 you can use lambda functions:
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);
• But in older Java you must use predefined function interfaces:
– Function, Function2, Function3
– FlatMapFunction
– PairFunction
Java 7 Example - WordCount
JavaRDD<String> lines = ctx.textFile(input_folder);
JavaRDD<String> words = lines.flatMap(
new FlatMapFunction<String, String>() {
public Iterable<String> call(String line) {
return Arrays.asList(line.split(" "));
}});
JavaPairRDD<String, Integer> ones = words.mapToPair(
new PairFunction<String, String, Integer>() {
public Tuple2<String, Integer> call(String word) {
return new Tuple2<String, Integer>(word, 1);
}});
JavaPairRDD<String, Integer> counts = ones.reduceByKey(
new Function2<Integer, Integer, Integer>(){
public Integer call(Integer i1, Integer i2) {
return i1 + i2;
}});
Java 8 Example - WordCount
JavaRDD<String> lines = ctx.textFile(input_folder);
JavaRDD<String> words = lines.flatMap(
line -> Arrays.asList(line.split(" ")).iterator()
);
JavaPairRDD<String, Integer> pairs = words.mapToPair(
word -> new Tuple2<String, Integer>(word, 1)
);
JavaPairRDD<String, Integer> wordCounts = pairs.reduceByKey(
(x, y) -> x + y
);
Python example - WordCount
• Word count in Spark's Python API
lines = spark.textFile(input_folder)
words = lines.flatMap(lambda line: line.split() )
pairs = words.map(lambda word: (word, 1) )
wordCounts = pairs.reduceByKey(lambda a, b: a + b )
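The same pipeline can be sketched in plain Python to show what each stage produces. This is a toy local model of the semantics, not Spark code; the input lines are a made-up example:

```python
from collections import defaultdict

lines = ["to be or", "not to be"]

# flatMap: split every line, flattening the results into one word list
words = [word for line in lines for word in line.split()]
# map: turn every word into a (word, 1) pair
pairs = [(word, 1) for word in words]
# reduceByKey: sum the 1s for each unique word
counts = defaultdict(int)
for word, one in pairs:
    counts[word] += one

# dict(counts) == {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```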
RDD operations
• Actions
– Creating RDDs
– Storing RDDs
– Extracting data from an RDD
• Transformations
– Restructure or transform RDDs into new RDDs
– Apply user defined functions
RDD Actions
Loading Data
• Local data directly from memory:
dataset = [1, 2, 3, 4, 5]
slices = 5  # Number of partitions
distData = sc.parallelize(dataset, slices)
• External data from HDFS or the local file system:
input = sc.textFile("file.txt")
input = sc.textFile("directory/*.txt")
input = sc.textFile("hdfs://xxx:9000/path/file")
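As a rough illustration of what the slices parameter controls, here is a plain-Python sketch of splitting a local collection into contiguous partitions. This is a toy model, not Spark's actual partitioning logic:

```python
def partition(data, num_partitions):
    """Split a list into num_partitions roughly equal contiguous chunks."""
    size = len(data)
    return [data[i * size // num_partitions:(i + 1) * size // num_partitions]
            for i in range(num_partitions)]

# Five elements split into five partitions, one element each;
# each partition could then be processed by a different executor.
parts = partition([1, 2, 3, 4, 5], 5)
# parts == [[1], [2], [3], [4], [5]]
```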
Storing data
counts.saveAsTextFile("hdfs://...");
counts.saveAsObjectFile("hdfs://...");
counts.saveAsHadoopFile(
"testfile.seq",
Text.class,
LongWritable.class,
SequenceFileOutputFormat.class
);
Extracting data from RDD
• Extract data out of distributed RDD object into driver program memory:
– collect() – Retrieve the whole RDD content as a list
– first() – Take the first element from the RDD
– take(n) – Take the first n elements from the RDD as a list
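Semantically, these driver-side actions behave like simple list operations on the RDD's contents. A plain-Python analogy with made-up data:

```python
rdd_contents = [("hi", 1), ("bye", 3), ("hi", 2)]

collected = list(rdd_contents)    # collect(): whole content as a list
first_element = rdd_contents[0]   # first(): the first element
taken = rdd_contents[:2]          # take(2): the first 2 elements as a list
```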
Broadcast
• Shares data with every node in the Spark cluster, where it can then be accessed inside Spark functions:
broadcastVar = sc.broadcast([1992, "gray", "bear"])
result = input.map(lambda line: weight_first_bc(line, broadcastVar))
• You don't have to use broadcast if the data is very small. This would also work:
globalVar = [1992, "gray", "bear"]
result = input.map(lambda line: weight_first_bc(line, globalVar))
• However, it is inefficient when the passed-along data is larger (> 1 MB)
• Spark uses Torrent protocol to optimize broadcast data distribution
Other actions
• reduce(func) – Apply an aggregation function to all tuples in RDD
• count() – Count the number of elements in the RDD
• countByKey() – Count the number of values for each unique key
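Plain-Python equivalents of these actions, as a semantic sketch rather than Spark code (the example pairs are hypothetical):

```python
from functools import reduce
from collections import Counter

pairs = [("hi", 1), ("bye", 3), ("hi", 2)]

# reduce(func): aggregate all elements into one (here: sum the counts)
total = reduce(lambda a, b: (a[0], a[1] + b[1]), pairs)
# count(): number of elements in the dataset
n = len(pairs)
# countByKey(): number of elements per unique key
by_key = Counter(key for key, _ in pairs)

# total == ('hi', 6), n == 3, by_key == {'hi': 2, 'bye': 1}
```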
RDD Transformations
Map
• Applies a user defined function to every tuple in RDD.
• From the WordCount example, using a lambda function:
pairs = words.map(lambda word: (word, 1))
• Using a separately defined function:
def toPair(word):
pair = (word, 1)
return pair
pairs = words.map(toPair)
Map transformation
• pairs = words.map(lambda word: (word, 1))
FlatMap
• Similar to Map - applied to each tuple in RDD
• But can result in multiple output tuples
• From the Python WordCount example:
words = file.flatMap(lambda line: line.split())
• User defined function has to return a list
• Each element in the output list results in a new tuple inside the resulting RDD
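The difference between map and flatMap can be shown with plain-Python list comprehensions (a semantic sketch, not Spark code):

```python
lines = ["to be or", "not to be"]

# map produces exactly one output element per input element,
# so splitting each line gives a nested list:
mapped = [line.split() for line in lines]
# => [['to', 'be', 'or'], ['not', 'to', 'be']]

# flatMap flattens each returned list, so every word
# becomes an individual element of the result:
flat_mapped = [word for line in lines for word in line.split()]
# => ['to', 'be', 'or', 'not', 'to', 'be']
```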
FlatMap transformation
• words = lines.flatMap( lambda line: line.split() )
Other Map-Like transformations
• sample(withReplacement, fraction, seed)
• distinct([numTasks]))
• union(otherDataset)
• filter(func)
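Plain-Python analogues of these transformations' semantics (a toy sketch with made-up data, not Spark code; note that real Spark versions run distributed and do not guarantee element order):

```python
import random

data = [1, 2, 2, 3, 4]
other = [4, 5]

distinct = sorted(set(data))            # distinct(): remove duplicates
union = data + other                    # union(): concatenate (keeps duplicates)
filtered = [x for x in data if x > 2]   # filter(func): keep matching elements

# sample(False, 0.5, seed): random subset, keeping each element
# with probability ~0.5, reproducible via the seed
rng = random.Random(42)
sampled = [x for x in data if rng.random() < 0.5]
```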
GroupBy & GroupByKey
• Restructure the RDD by grouping all the values inside the RDD
• Such restructuring is inefficient and should be avoided if possible
– It is better to use reduceByKey or aggregateByKey, which automatically apply an aggregation function on the grouped data
• The GroupByKey operation uses the first value inside the RDD tuples as the grouping key
GroupByKey transformation
• Groups RDD by key, and results in a nested RDD
wordCounts = pairs.groupByKey()
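The grouping semantics can be sketched in plain Python (a toy model, not Spark's implementation; the pairs are the word-count example):

```python
from collections import defaultdict

def group_by_key(pairs):
    """Collect all values for each key into a list, like groupByKey()."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

pairs = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
grouped = group_by_key(pairs)
# => {'to': [1, 1], 'be': [1, 1], 'or': [1], 'not': [1]}
```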
ReduceByKey
• Groups all tuples in RDD by the first field in the tuple
• Applies a user defined aggregation function to all tuples inside a group
• Outputs a single tuple for each group
• From the Python WordCount example:
pairs = words.map(lambda word: (word, 1) )
wordCounts = pairs.reduceByKey(lambda a, b: a + b )
ReduceByKey
• ReduceByKey() applies GroupByKey() together with a nested Reduce(UDF)
wordCounts = pairs.reduceByKey(lambda a, b: a + b)
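That composition of grouping plus a nested reduce can be sketched in plain Python (a semantic toy model, not Spark code):

```python
from functools import reduce
from collections import defaultdict

def reduce_by_key(pairs, func):
    """Group values by key, then reduce each group to a single value."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}

pairs = [("to", 1), ("be", 1), ("to", 1)]
counts = reduce_by_key(pairs, lambda a, b: a + b)
# => {'to': 2, 'be': 1}
```

In real Spark, reduceByKey additionally pre-aggregates values inside each partition before shuffling them across the network, which is why it is more efficient than an explicit groupByKey followed by a reduce.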
Working with Keys
• When using *ByKey transformations, Spark expects the RDD to contain (Key, Value) tuples
• If the input RDD contains longer tuples, we need to restructure it using a map() operation:
data = sc.parallelize([("hi", 1, "file1"), ("bye", 3, "file2")])
pairs = data.map(lambda t: (t[0], (t[1], t[2])))
sums = pairs.reduceByKey(lambda v1, v2: (v1[0] + v2[0], v1[1]))
output = sums.collect()
for (key, value) in output:
print(key, ", " , value)
Other transformations
• join(otherDataset, [numTasks]) - When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
• cogroup(otherDataset, [numTasks]) - When called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples.
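Plain-Python sketches of these two transformations' semantics (toy models over small lists; the datasets are hypothetical):

```python
def join(left, right):
    """Inner join of (K, V) and (K, W) pairs into (K, (V, W)) pairs."""
    return [(k, (v, w)) for k, v in left for k2, w in right if k == k2]

def cogroup(left, right):
    """Group both datasets by key into (K, ([V...], [W...])) entries."""
    keys = {k for k, _ in left} | {k for k, _ in right}
    return {k: ([v for k2, v in left if k2 == k],
                [w for k2, w in right if k2 == k])
            for k in keys}

ages = [("alice", 31), ("bob", 25)]
cities = [("alice", "Tartu"), ("alice", "Tallinn")]
joined = join(ages, cities)
# => [('alice', (31, 'Tartu')), ('alice', (31, 'Tallinn'))]
```

Note that join drops keys that appear in only one dataset ("bob" above), while cogroup keeps them with an empty list on the missing side.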
Persisting/Caching data
• Spark uses Lazy evaluation
• Intermediate RDD's may be discarded to optimize memory consumption
• To force Spark to keep intermediate data in memory, we can use:
– lineLengths.persist(StorageLevel.MEMORY_ONLY);
– This forces the RDD to be cached in memory after the first time it is computed
• NB! Caching should be used when an RDD is accessed multiple times!
Persistence levels
• DISK_ONLY
• MEMORY_ONLY
• MEMORY_AND_DISK
• MEMORY_ONLY_SER
– More memory-efficient (stores serialized objects)
– Uses more CPU
• MEMORY_ONLY_2
– Replicate data on 2 executors
Fault tolerance
• Faults are inevitable when running distributed applications in large clusters and repeating long-running tasks can be costly
• Fault recovery is more complicated for In-memory frameworks
– In Spark only the initial input data is replicated on HDFS
– Hadoop MR data is replicated in HDFS, can easily repeat failed tasks
• Spark uses two approaches: checkpointing and lineage
• Checkpointing is typically used for long running in-memory distributed applications
– Processes periodically store their in-memory state to disk storage
– Can affect the efficiency of the application
Spark Lineage
• Lineage is the history of RDDs
• Spark keeps track of each RDD partition's lineage:
– What functions were applied to produce it
– Which input data partition were involved
• Rebuild lost RDD partitions according to lineage, using the latest still available partitions
• No performance cost if nothing fails
– Checkpointing requires consistent snapshotting, which affects performance
• Best results are achieved together with checkpointing
– Only the blocks produced since the last checkpoint must be recomputed
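Lineage-based recovery can be sketched as a toy model in plain Python: a partition is defined by its parent data plus the chain of transformations that produced it, so a lost partition can be rebuilt by replaying that chain. This is only an illustration of the idea, not Spark's implementation:

```python
# The base partition and its recorded chain of transformations
base_partition = ["to be or", "not to be"]
lineage = [
    lambda part: [w for line in part for w in line.split()],  # flatMap
    lambda part: [(w, 1) for w in part],                      # map
]

def recompute(base, lineage):
    """Rebuild a lost partition by replaying its lineage from the base data."""
    data = base
    for transformation in lineage:
        data = transformation(data)
    return data

# If the derived partition is lost, it is recomputed from the lineage:
recovered = recompute(base_partition, lineage)
```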
Lineage
Source: Glenn K. Lockwood, Advanced Technologies at NERSC/LBNL
Apache Spark built-in extensions
• Spark SQL – Seamlessly mix SQL queries with Spark programs
– Similar to Pig and Hive
• Spark DataFrames – Abstraction for parallel DataFrames
• Spark Streaming – Apply Spark to streaming data
– Structured Streaming – Higher-level abstraction for DataFrame or SQL based streaming applications
• GraphX – API for parallel graph computations
– GraphFrames – DataFrame-based graph computing API
• MLlib – Machine learning library
• SparkR – Utilize Spark in R scripts
Advantages of Spark
• Much faster than Hadoop IF data fits into the memory
– Also benefits all higher-level frameworks built on Spark or Hadoop MapReduce
• Support for more programming languages
– Scala, Java, Python, R
• Has a lot of built-in extensions
– DataFrames, SQL, R, ML, Streaming, Graph processing
• It is constantly being updated
• Well suited for computationally complex algorithms processing medium-to-large scale data
Disadvantages of Spark
• What if data does not fit into the memory?
• Hard to keep track of how (well) the data is distributed
• Working in Java still requires a lot of boilerplate code
• Saving as text files can be very slow
Conclusions
• RDDs offer a reasonably simple and efficient programming model for a broad range of applications
• Spark provides more data manipulation operations than just Map and Reduce.
• Spark achieves fault tolerance by providing coarse-grained operations and tracking lineage
• Provides definite speedup when data fits into the collective memory
• Very large development community which has resulted in creation of many integrated tools for different types of applications
• Spark is constantly evolving and there are several ways to achieve the same result
Conclusions
• Use MapReduce when dealing with large data that does not fit into the collective memory of the cluster
– Can also use Pig or Hive to simplify creating MapReduce applications
• Otherwise it is best to use Spark
– It is much faster in general
– Prototyping is convenient
– Many UDFs can be included on the fly
• Loading data and transforming it to the required format can be difficult
Whether you choose MapReduce, Pig, Hive or Spark – everything is automatically parallelized in the background and can be executed on computer clusters.