In-memory data processing Apache Spark
Parallel Computing – Lecture 13
Distributed data structures - Apache Spark
Pelle Jakovits
November 2020, Tartu
Outline
• Issues with MapReduce
• Disk based vs In-Memory data processing
• Apache Spark Framework
• Resilient Distributed Datasets (RDD)
– RDD Actions
– Parallel RDD Transformations
– Persistence models
– Fault Recovery
Pelle Jakovits 2/48
MapReduce model
Hadoop Distributed File System
• HDFS is used for data storage, distribution and replication.
Advantages of Hadoop MapReduce
• MapReduce = Map, GroupBy, Sort, Reduce
• Designed for Big Data processing
• Provides:
– Distributed file system
– High scalability
– Automatic parallelization
– Automatic fault recovery
• Data is replicated
• Failed tasks are re-executed on other nodes
Is MapReduce sufficient?
• One of the most used frameworks for large scale data processing
• However, MapReduce is often not used directly, because:
– It is not suitable for prototyping
– A lot of custom code required even for the simplest tasks
– A lot of expertise is needed to optimize MapReduce applications
– Difficult to manage more complex MapReduce chains
Memory vs Disk based processing
• In Hadoop MapReduce all input, intermediate and output data must be written to disk
• Even if the data is significantly reduced, it cannot be kept in memory between the Map and Reduce tasks
In-Memory data processing
• Hadoop MapReduce is not suitable for all types of algorithms
– Iterative algorithms, graph processing, machine learning
• Computationally complex applications benefit from keeping intermediate data in memory
• Keep data in memory between data processing operations
• Input & Output can be disk based file storage systems like HDFS
In-Memory data processing
• Data must fit into the collective memory of the cluster
• Should still support keeping data on disk
– when it would not fit into memory
– for fault tolerance
• Fault tolerance is more complicated
– The whole application is affected when data is only kept in memory
– In Hadoop, input data is replicated in HDFS and readily available, so only the last Map or Reduce task is affected
Apache Spark
• MapReduce-like in-memory data processing framework
• From Map & Reduce -> Map, Join, Co-group, Filter, Distinct, Union, Sample, ReduceByKey, etc
• Directed acyclic graph (DAG) task execution engine
– Users have more control over the data processing execution flow
• Uses the Resilient Distributed Dataset (RDD) abstraction
– Input data is loaded into RDDs
– RDD transformations and user defined functions are applied to define data processing applications
Apache Spark
• More than just a replacement for MapReduce
– Spark works with Scala, Java, Python and R
– Extended with built-in tools for SQL queries, stream processing, ML and graph processing
• Integrated with Hadoop Yarn and HDFS
• Included in many public cloud platforms alongside Hadoop MapReduce
– IBM cloud, Amazon AWS, Google Cloud, Microsoft Azure
MapReduce model
Spark DAG execution flow
http://datastrophic.io/core-concepts-architecture-and-internals-of-apache-spark/
Resilient Distributed Datasets
• Collections of data objects
• Distributed across cluster
• Stored in RAM or Disk
• Immutable/Read-only
• Built through parallel transformations
• Automatically rebuilt on failures
Source: http://horicky.blogspot.com.ee/2013/12/spark-low-latency-massively-parallel.html
Structure of RDDs
• Contains a number of rows
• Rows are divided into partitions
• Partitions are distributed between nodes in the cluster
• A row is a tuple of records, similarly to Apache Pig
• Can contain nested data structures
Source: http://horicky.blogspot.com.ee/2013/12/spark-low-latency-massively-parallel.html
Spark in Java
• A lot of additional boilerplate code related to data and function types
• There are different classes for each Tuple length (Tuple2, … , Tuple9):
Tuple2 pair = new Tuple2(a, b);
pair._1 // => a
pair._2 // => b
• In Java 8 you can use lambda functions:
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);
• But in older Java you must use predefined function interfaces:
– Function, Function2, Function3
– FlatMapFunction
– PairFunction
Java 7 Example - WordCount
JavaRDD<String> lines = ctx.textFile(input_folder);
JavaRDD<String> words = lines.flatMap(
new FlatMapFunction<String, String>() {
public Iterable<String> call(String line) {
return Arrays.asList(line.split(" "));
}});
JavaPairRDD<String, Integer> ones = words.mapToPair(
new PairFunction<String, String, Integer>() {
public Tuple2<String, Integer> call(String word) {
return new Tuple2<String, Integer>(word, 1);
}});
JavaPairRDD<String, Integer> counts = ones.reduceByKey(
new Function2<Integer, Integer, Integer>(){
public Integer call(Integer i1, Integer i2) {
return i1 + i2;
}});
Java 8 Example - WordCount
JavaRDD<String> lines = ctx.textFile(input_folder);
JavaRDD<String> words = lines.flatMap(
line -> Arrays.asList(line.split(" ")).iterator()
);
JavaPairRDD<String, Integer> pairs = words.mapToPair(
word -> new Tuple2<String, Integer>(word, 1)
);
JavaPairRDD<String, Integer> wordCounts = pairs.reduceByKey(
(x, y) -> x + y
);
Python example - WordCount
• Word count in Spark's Python API
lines = spark.textFile(input_folder)
words = lines.flatMap(lambda line: line.split() )
pairs = words.map(lambda word: (word, 1) )
wordCounts = pairs.reduceByKey(lambda a, b: a + b )
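The same pipeline can be sketched in plain Python to show what each stage produces. This is a toy local model of the semantics, not Spark code; the input lines are a made-up example:

```python
from collections import defaultdict

lines = ["to be or", "not to be"]

# flatMap: split every line, flattening the results into one word list
words = [word for line in lines for word in line.split()]
# map: turn every word into a (word, 1) pair
pairs = [(word, 1) for word in words]
# reduceByKey: sum the 1s for each unique word
counts = defaultdict(int)
for word, one in pairs:
    counts[word] += one

# dict(counts) == {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```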
RDD operations
• Actions
– Creating RDDs
– Storing RDDs
– Extracting data from an RDD
• Transformations
– Restructure or transform RDDs into new RDDs
– Apply user defined functions
RDD Actions
Loading Data
• Local data directly from memory:
dataset = [1, 2, 3, 4, 5]
slices = 5  # Number of partitions
distData = sc.parallelize(dataset, slices)
• External data from HDFS or the local file system:
input = sc.textFile("file.txt")
input = sc.textFile("directory/*.txt")
input = sc.textFile("hdfs://xxx:9000/path/file")
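As a rough illustration of what the slices parameter controls, here is a plain-Python sketch of splitting a local collection into contiguous partitions. This is a toy model, not Spark's actual partitioning logic:

```python
def partition(data, num_partitions):
    """Split a list into num_partitions roughly equal contiguous chunks."""
    size = len(data)
    return [data[i * size // num_partitions:(i + 1) * size // num_partitions]
            for i in range(num_partitions)]

# Five elements split into five partitions, one element each;
# each partition could then be processed by a different executor.
parts = partition([1, 2, 3, 4, 5], 5)
# parts == [[1], [2], [3], [4], [5]]
```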
Storing data
counts.saveAsTextFile("hdfs://...");
counts.saveAsObjectFile("hdfs://...");
counts.saveAsHadoopFile(
"testfile.seq",
Text.class,
LongWritable.class,
SequenceFileOutputFormat.class
);
Extracting data from RDD
• Extract data out of distributed RDD object into driver program memory:
– collect() – Retrieve the whole RDD content as a list
– first() – Take the first element from the RDD
– take(n) – Take the first n elements from the RDD as a list
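Semantically, these driver-side actions behave like simple list operations on the RDD's contents. A plain-Python analogy with made-up data:

```python
rdd_contents = [("hi", 1), ("bye", 3), ("hi", 2)]

collected = list(rdd_contents)    # collect(): whole content as a list
first_element = rdd_contents[0]   # first(): the first element
taken = rdd_contents[:2]          # take(2): the first 2 elements as a list
```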
Broadcast
• Shares data with every node in the Spark cluster, where it can then be accessed inside Spark functions:
broadcastVar = sc.broadcast([1992, "gray", "bear"])
result = input.map(lambda line: weight_first_bc(line, broadcastVar))
• You don't have to use broadcast if the data is very small. This would also work:
globalVar = [1992, "gray", "bear"]
result = input.map(lambda line: weight_first_bc(line, globalVar))
• However, it is inefficient when the passed-along data is larger (> 1 MB)
• Spark uses Torrent protocol to optimize broadcast data distribution
Other actions
• reduce(func) – Apply an aggregation function to all tuples in RDD
• count() – Count the number of elements in the RDD
• countByKey() – Count the number of values for each unique key
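Plain-Python equivalents of these actions, as a semantic sketch rather than Spark code (the example pairs are hypothetical):

```python
from functools import reduce
from collections import Counter

pairs = [("hi", 1), ("bye", 3), ("hi", 2)]

# reduce(func): aggregate all elements into one (here: sum the counts)
total = reduce(lambda a, b: (a[0], a[1] + b[1]), pairs)
# count(): number of elements in the dataset
n = len(pairs)
# countByKey(): number of elements per unique key
by_key = Counter(key for key, _ in pairs)

# total == ('hi', 6), n == 3, by_key == {'hi': 2, 'bye': 1}
```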
RDD Transformations
Map
• Applies a user defined function to every tuple in RDD.
• From the WordCount example, using a lambda function:
pairs = words.map(lambda word: (word, 1))
• Using a separately defined function:
def toPair(word):
pair = (word, 1)
return pair
pairs = words.map(toPair)
Map transformation
• pairs = words.map(lambda word: (word, 1))
FlatMap
• Similar to Map - applied to each tuple in RDD
• But can result in multiple output tuples
• From the Python WordCount example:
words = file.flatMap(lambda line: line.split())
• User defined function has to return a list
• Each element in the output list results in a new tuple inside the resulting RDD
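The difference between map and flatMap can be shown with plain-Python list comprehensions (a semantic sketch, not Spark code):

```python
lines = ["to be or", "not to be"]

# map produces exactly one output element per input element,
# so splitting each line gives a nested list:
mapped = [line.split() for line in lines]
# => [['to', 'be', 'or'], ['not', 'to', 'be']]

# flatMap flattens each returned list, so every word
# becomes an individual element of the result:
flat_mapped = [word for line in lines for word in line.split()]
# => ['to', 'be', 'or', 'not', 'to', 'be']
```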
FlatMap transformation
• words = lines.flatMap( lambda line: line.split() )
Other Map-Like transformations
• sample(withReplacement, fraction, seed)
• distinct([numTasks]))
• union(otherDataset)
• filter(func)
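Plain-Python analogues of these transformations' semantics (a toy sketch with made-up data, not Spark code; note that real Spark versions run distributed and do not guarantee element order):

```python
import random

data = [1, 2, 2, 3, 4]
other = [4, 5]

distinct = sorted(set(data))            # distinct(): remove duplicates
union = data + other                    # union(): concatenate (keeps duplicates)
filtered = [x for x in data if x > 2]   # filter(func): keep matching elements

# sample(False, 0.5, seed): random subset, keeping each element
# with probability ~0.5, reproducible via the seed
rng = random.Random(42)
sampled = [x for x in data if rng.random() < 0.5]
```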
GroupBy & GroupByKey
• Restructure the RDD by grouping all the values inside the RDD
• Such restructuring is inefficient and should be avoided if possible
– It is better to use reduceByKey or aggregateByKey, which automatically apply an aggregation function on the grouped data
• The GroupByKey operation uses the first value inside the RDD tuples as the grouping key
GroupByKey transformation
• Groups RDD by key, and results in a nested RDD
wordCounts = pairs.groupByKey()
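The grouping semantics can be sketched in plain Python (a toy model, not Spark's implementation; the pairs are the word-count example):

```python
from collections import defaultdict

def group_by_key(pairs):
    """Collect all values for each key into a list, like groupByKey()."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

pairs = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
grouped = group_by_key(pairs)
# => {'to': [1, 1], 'be': [1, 1], 'or': [1], 'not': [1]}
```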
ReduceByKey
• Groups all tuples in RDD by the first field in the tuple
• Applies a user defined aggregation function to all tuples inside a group
• Outputs a single tuple for each group
• From the Python WordCount example:
pairs = words.map(lambda word: (word, 1) )
wordCounts = pairs.reduceByKey(lambda a, b: a + b )
ReduceByKey
• ReduceByKey() applies GroupByKey() together with a nested Reduce(UDF)
wordCounts = pairs.reduceByKey(lambda a, b: a + b)
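That composition of grouping plus a nested reduce can be sketched in plain Python (a semantic toy model, not Spark code):

```python
from functools import reduce
from collections import defaultdict

def reduce_by_key(pairs, func):
    """Group values by key, then reduce each group to a single value."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}

pairs = [("to", 1), ("be", 1), ("to", 1)]
counts = reduce_by_key(pairs, lambda a, b: a + b)
# => {'to': 2, 'be': 1}
```

In real Spark, reduceByKey additionally pre-aggregates values inside each partition before shuffling them across the network, which is why it is more efficient than an explicit groupByKey followed by a reduce.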
Working with Keys
• When using *ByKey transformations, Spark expects the RDD to contain (Key, Value) tuples
• If the input RDD contains longer tuples, we need to restructure it using a map() operation:
data = sc.parallelize([("hi", 1, "file1"), ("bye", 3, "file2")])
pairs = data.map(lambda t: (t[0], (t[1], t[2])))
sums = pairs.reduceByKey(lambda v1, v2: (v1[0] + v2[0], v1[1]))
output = sums.collect()
for (key, value) in output:
print(key, ", " , value)
Other transformations
• join(otherDataset, [numTasks]) - When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
• cogroup(otherDataset, [numTasks]) - When called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples.
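Plain-Python sketches of these two transformations' semantics (toy models over small lists; the datasets are hypothetical):

```python
def join(left, right):
    """Inner join of (K, V) and (K, W) pairs into (K, (V, W)) pairs."""
    return [(k, (v, w)) for k, v in left for k2, w in right if k == k2]

def cogroup(left, right):
    """Group both datasets by key into (K, ([V...], [W...])) entries."""
    keys = {k for k, _ in left} | {k for k, _ in right}
    return {k: ([v for k2, v in left if k2 == k],
                [w for k2, w in right if k2 == k])
            for k in keys}

ages = [("alice", 31), ("bob", 25)]
cities = [("alice", "Tartu"), ("alice", "Tallinn")]
joined = join(ages, cities)
# => [('alice', (31, 'Tartu')), ('alice', (31, 'Tallinn'))]
```

Note that join drops keys that appear in only one dataset ("bob" above), while cogroup keeps them with an empty list on the missing side.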
Persisting/Caching data
• Spark uses Lazy evaluation
• Intermediate RDD's may be discarded to optimize memory consumption
• To force Spark to keep intermediate data in memory, we can use:
– lineLengths.persist(StorageLevel.MEMORY_ONLY);
– This forces the RDD to be cached in memory after the first time it is computed
• NB! Caching should be used when an RDD is accessed multiple times!
Persistence levels
• DISK_ONLY
• MEMORY_ONLY
• MEMORY_AND_DISK
• MEMORY_ONLY_SER
– More memory-efficient (stores serialized objects)
– Uses more CPU
• MEMORY_ONLY_2
– Replicate data on 2 executors
Fault tolerance
• Faults are inevitable when running distributed applications in large clusters and repeating long-running tasks can be costly
• Fault recovery is more complicated for In-memory frameworks
– In Spark only the initial input data is replicated on HDFS
– Hadoop MR data is replicated in HDFS, can easily repeat failed tasks
• Spark uses two approaches: checkpointing and lineage
• Checkpointing is typically used for long running in-memory distributed applications
– Processes periodically store their in-memory state to disk storage
– Can affect the efficiency of the application
Spark Lineage
• Lineage is the history of RDDs
• Spark keeps track of each RDD partition's lineage:
– What functions were applied to produce it
– Which input data partition were involved
• Rebuild lost RDD partitions according to lineage, using the latest still available partitions
• No performance cost if nothing fails
– Checkpointing requires consistent snapshotting, which affects performance
• Best results are achieved together with checkpointing
– Only the blocks produced since the last checkpoint must be recomputed
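Lineage-based recovery can be sketched as a toy model in plain Python: a partition is defined by its parent data plus the chain of transformations that produced it, so a lost partition can be rebuilt by replaying that chain. This is only an illustration of the idea, not Spark's implementation:

```python
# The base partition and its recorded chain of transformations
base_partition = ["to be or", "not to be"]
lineage = [
    lambda part: [w for line in part for w in line.split()],  # flatMap
    lambda part: [(w, 1) for w in part],                      # map
]

def recompute(base, lineage):
    """Rebuild a lost partition by replaying its lineage from the base data."""
    data = base
    for transformation in lineage:
        data = transformation(data)
    return data

# If the derived partition is lost, it is recomputed from the lineage:
recovered = recompute(base_partition, lineage)
```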
Lineage
Source: Glenn K. Lockwood, Advanced Technologies at NERSC/LBNL
Apache Spark built-in extensions
• Spark SQL – Seamlessly mix SQL queries with Spark programs
– Similar to Pig and Hive
• Spark DataFrames – Abstraction for parallel DataFrames
• Spark Streaming – Apply Spark to streaming data
– Structured Streaming – Higher-level abstraction for DataFrame or SQL based streaming applications
• GraphX – API for parallel graph computations
– GraphFrames – DataFrame-based graph computing API
• MLlib – Machine learning library
• SparkR – Utilize Spark in R scripts
Advantages of Spark
• Much faster than Hadoop IF data fits into the memory
– Also benefits all higher-level frameworks built on Spark or Hadoop MapReduce
• Support for more programming languages
– Scala, Java, Python, R
• Has a lot of built-in extensions
– DataFrames, SQL, R, ML, Streaming, Graph processing
• It is constantly being updated
• Well suited for computationally complex algorithms processing medium-to-large scale data
Disadvantages of Spark
• What if data does not fit into the memory?
• Hard to keep track of how (well) the data is distributed
• Working in Java still requires a lot of boilerplate code
• Saving as text files can be very slow
Conclusions
• RDDs offer a reasonably simple and efficient programming model for a broad range of applications
• Spark provides more data manipulation operations than just Map and Reduce.
• Spark achieves fault tolerance by providing coarse-grained operations and tracking lineage
• Provides definite speedup when data fits into the collective memory
• Very large development community which has resulted in creation of many integrated tools for different types of applications
• Spark is constantly evolving and there are several ways to achieve the same result
Conclusions
• Use MapReduce when dealing with large data that does not fit into the collective memory of the cluster
– Can also use Pig or Hive to simplify creating MapReduce applications
• Otherwise it is best to use Spark
– It is much faster in general
– Prototyping is convenient
– Many UDFs can be included on the fly
• Loading data and transforming it to the required format can be difficult
Whether you choose MapReduce, Pig, Hive or Spark – everything is automatically parallelized in the background and can be executed on computer clusters.