Lecture 07 - CS-5040 - modern database systems
Transcript of Lecture 07 - CS-5040 - modern database systems
Modern Database Systems Lecture 7
Aristides Gionis, Michael Mathioudakis
Spring 2016
logistics
assignment 1: currently being marked
might upload best student solutions; will provide annotated pdfs with marks
virtualbox image has the same path & filename as the previous one
Michael Mathioudakis 2
continuing from last lecture: the original paper on Spark
Michael Mathioudakis 3
previously... on modern database systems
generalization of mapreduce, more suitable for iterative workflows: iterative algorithms and repetitive querying
rdd: resilient distributed dataset
read-only, lazily evaluated, easily re-created; ephemeral, unless we need to keep them in memory
Michael Mathioudakis 4
example: text search
suppose that a web service is experiencing errors and you want to search over terabytes of logs to find the cause
the logs are stored in the Hadoop Filesystem (HDFS); errors are written in the logs as lines that start with the keyword “ERROR”
Michael Mathioudakis 5
example: text search
[Figure 1: Lineage graph for the third query in our example. Boxes represent RDDs and arrows represent transformations: lines → errors via filter(_.startsWith("ERROR")), errors → HDFS errors via filter(_.contains("HDFS")), and HDFS errors → time fields via map(_.split('\t')(3)).]
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()
Line 1 defines an RDD backed by an HDFS file (as a collection of lines of text), while line 2 derives a filtered RDD from it. Line 3 then asks for errors to persist in memory so that it can be shared across queries. Note that the argument to filter is Scala syntax for a closure.
At this point, no work has been performed on the cluster. However, the user can now use the RDD in actions, e.g., to count the number of messages:
errors.count()
The user can also perform further transformations on the RDD and use their results, as in the following lines:
// Count errors mentioning MySQL:
errors.filter(_.contains("MySQL")).count()

// Return the time fields of errors mentioning
// HDFS as an array (assuming time is field
// number 3 in a tab-separated format):
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()
After the first action involving errors runs, Spark will store the partitions of errors in memory, greatly speeding up subsequent computations on it. Note that the base RDD, lines, is not loaded into RAM. This is desirable because the error messages might only be a small fraction of the data (small enough to fit into memory).
Finally, to illustrate how our model achieves fault tolerance, we show the lineage graph for the RDDs in our third query in Figure 1. In this query, we started with errors, the result of a filter on lines, and applied a further filter and map before running a collect. The Spark scheduler will pipeline the latter two transformations and send a set of tasks to compute them to the nodes holding the cached partitions of errors. In addition, if a partition of errors is lost, Spark rebuilds it by applying a filter on only the corresponding partition of lines.
Aspect                     | RDDs                                         | Distr. Shared Mem.
Reads                      | Coarse- or fine-grained                      | Fine-grained
Writes                     | Coarse-grained                               | Fine-grained
Consistency                | Trivial (immutable)                          | Up to app / runtime
Fault recovery             | Fine-grained and low-overhead using lineage  | Requires checkpoints and program rollback
Straggler mitigation       | Possible using backup tasks                  | Difficult
Work placement             | Automatic based on data locality             | Up to app (runtimes aim for transparency)
Behavior if not enough RAM | Similar to existing data flow systems        | Poor performance (swapping?)

Table 1: Comparison of RDDs with distributed shared memory.
2.3 Advantages of the RDD Model
To understand the benefits of RDDs as a distributed memory abstraction, we compare them against distributed shared memory (DSM) in Table 1. In DSM systems, applications read and write to arbitrary locations in a global address space. Note that under this definition, we include not only traditional shared memory systems [24], but also other systems where applications make fine-grained writes to shared state, including Piccolo [27], which provides a shared DHT, and distributed databases. DSM is a very general abstraction, but this generality makes it harder to implement in an efficient and fault-tolerant manner on commodity clusters.
The main difference between RDDs and DSM is that RDDs can only be created ("written") through coarse-grained transformations, while DSM allows reads and writes to each memory location.3 This restricts RDDs to applications that perform bulk writes, but allows for more efficient fault tolerance. In particular, RDDs do not need to incur the overhead of checkpointing, as they can be recovered using lineage.4 Furthermore, only the lost partitions of an RDD need to be recomputed upon failure, and they can be recomputed in parallel on different nodes, without having to roll back the whole program.
A second benefit of RDDs is that their immutable nature lets a system mitigate slow nodes (stragglers) by running backup copies of slow tasks as in MapReduce [10]. Backup tasks would be hard to implement with DSM, as the two copies of a task would access the same memory locations and interfere with each other's updates.
Finally, RDDs provide two other benefits over DSM. First, in bulk operations on RDDs, a runtime can schedule tasks based on data locality to improve performance. Second, RDDs degrade gracefully when there is not enough memory to store them, as long as they are only being used in scan-based operations: partitions that do not fit in RAM are stored on disk and provide performance similar to current data-parallel systems.
3 Note that reads on RDDs can still be fine-grained. For example, an application can treat an RDD as a large read-only lookup table.
4 In some applications, it can still help to checkpoint RDDs with long lineage chains, as we discuss in Section 5.4. However, this can be done in the background because RDDs are immutable, and there is no need to take a snapshot of the whole application as in DSM.
in Scala...
rdd
rdd
from a file
transformation
hint: keep in memory!
no work on the cluster so far
action! lines is not loaded to ram!
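Putting the slide's annotations next to the paper's code (this is just a recap of the example above, written as it would appear in a Scala program):

val lines  = spark.textFile("hdfs://...")          // rdd, loaded from a file
val errors = lines.filter(_.startsWith("ERROR"))   // rdd, defined by a transformation
errors.persist()                                   // hint: keep it in memory
// up to this point, no work has been performed on the cluster
errors.count()                                     // action: triggers computation; lines itself is not loaded to ram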
Michael Mathioudakis 6
example - text search ctd.
let us find errors related to “MySQL”
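In code, this is the paper's second query: reuse the persisted errors rdd and count the matching lines.

errors.filter(_.contains("MySQL")).count()   // transformation (filter) followed by an action (count)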
Michael Mathioudakis 7
example - text search ctd.
transformation (filter), action (count)
Michael Mathioudakis 8
example - text search ctd. again
let us find errors related to “HDFS” and extract their time field
assuming time is field no. 3 in tab-separated format
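In code, this is the paper's third query: filter the persisted errors rdd, split each line on tabs, take field number 3, and collect the results to the driver as an array.

errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()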
Michael Mathioudakis 9
example - text search ctd. again
transformations (filter, map)
action (collect)
Michael Mathioudakis 10
example: text search - lineage of time fields
cached
pipelined transformations
if a partition of errors is lost, filter is applied to only the corresponding partition of lines
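One way to inspect the lineage that makes this recovery possible is RDD.toDebugString, a standard Spark method that prints an rdd's chain of parent rdds (the exact output format varies across Spark versions; the variable name timeFields is just for illustration):

val timeFields = errors.filter(_.contains("HDFS")).map(_.split('\t')(3))
println(timeFields.toDebugString)   // shows map <- filter <- (persisted) errors <- lines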
Michael Mathioudakis 11
representing rdds
internal information about rdds
partitions & partitioning scheme
dependencies on parent RDDs
function to compute it from parents
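A minimal sketch of the kind of interface the slide describes; the trait and method names below are illustrative, not Spark's exact internal API.

trait Partition { def index: Int }                        // one split of the dataset
trait Dependency                                          // a narrow or wide link to a parent rdd

trait SimpleRDD[T] {
  def partitions: Array[Partition]                        // partitions of the rdd
  def partitioner: Option[AnyRef]                         // partitioning scheme (e.g., hash), if any
  def dependencies: Seq[Dependency]                       // dependencies on parent RDDs
  def compute(split: Partition): Iterator[T]              // function to compute a partition from the parents
  def preferredLocations(split: Partition): Seq[String]   // data-locality hints for the scheduler
}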
Michael Mathioudakis 12
rdd dependencies
narrow dependencies: each partition of the parent rdd is used by at most one partition of the child rdd
otherwise, wide dependencies
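A small illustration using standard Spark operations (it assumes a SparkContext named sc; the data is made up): map, filter, and union create narrow dependencies, while groupByKey creates a wide dependency and therefore a shuffle.

val nums    = sc.parallelize(1 to 1000000, 8)     // base rdd with 8 partitions
val pairs   = nums.map(n => (n % 100, n))         // narrow: each child partition reads one parent partition
val evens   = pairs.filter(_._2 % 2 == 0)         // narrow
val both    = evens.union(pairs)                  // narrow: child partitions correspond one-to-one to parent partitions
val grouped = pairs.groupByKey()                  // wide: every child partition may read from every parent partition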
Michael Mathioudakis 13
rdd dependencies
[Figure 4: Examples of narrow and wide dependencies. Each box is an RDD, with partitions shown as shaded rectangles. Narrow dependencies: map, filter; union; join with inputs co-partitioned. Wide dependencies: groupByKey; join with inputs not co-partitioned.]
map: Calling map on an RDD returns a MappedRDD object, which has the same partitions and preferred locations as its parent, but applies the function passed to map to the parent's records in its iterator method.
union: Calling union on two RDDs returns an RDD whose partitions are the union of those of the parents. Each child partition is computed through a narrow dependency on the corresponding parent.7
sample: Sampling is similar to mapping, except that the RDD stores a random number generator seed for each partition to deterministically sample parent records.
join: Joining two RDDs may lead to either two narrow dependencies (if they are both hash/range partitioned with the same partitioner), two wide dependencies, or a mix (if one parent has a partitioner and one does not). In either case, the output RDD has a partitioner (either one inherited from the parents or a default hash partitioner).

5 Implementation
We have implemented Spark in about 14,000 lines of Scala. The system runs over the Mesos cluster manager [17], allowing it to share resources with Hadoop, MPI and other applications. Each Spark program runs as a separate Mesos application, with its own driver (master) and workers, and resource sharing between these applications is handled by Mesos.
Spark can read data from any Hadoop input source (e.g., HDFS or HBase) using Hadoop's existing input plugin APIs, and runs on an unmodified version of Scala.
We now sketch several of the technically interesting parts of the system: our job scheduler (§5.1), our Spark interpreter allowing interactive use (§5.2), memory management (§5.3), and support for checkpointing (§5.4).

5.1 Job Scheduling
Spark's scheduler uses our representation of RDDs, described in Section 4.
Overall, our scheduler is similar to Dryad's [19], but it additionally takes into account which partitions of persistent RDDs are available in memory.
7 Note that our union operation does not drop duplicate values.
[Figure 5: Example of how Spark computes job stages. Boxes with solid outlines are RDDs. Partitions are shaded rectangles, in black if they are already in memory. To run an action on RDD G, we build stages at wide dependencies and pipeline narrow transformations inside each stage. The example job combines RDDs A–G through map, union, groupBy, and join into stages 1–3; stage 1's output RDD is already in RAM, so we run stage 2 and then 3.]
Whenever a user runs an action (e.g., count or save) on an RDD, the scheduler examines that RDD's lineage graph to build a DAG of stages to execute, as illustrated in Figure 5. Each stage contains as many pipelined transformations with narrow dependencies as possible. The boundaries of the stages are the shuffle operations required for wide dependencies, or any already computed partitions that can short-circuit the computation of a parent RDD. The scheduler then launches tasks to compute missing partitions from each stage until it has computed the target RDD.
Our scheduler assigns tasks to machines based on data locality using delay scheduling [32]. If a task needs to process a partition that is available in memory on a node, we send it to that node. Otherwise, if a task processes a partition for which the containing RDD provides preferred locations (e.g., an HDFS file), we send it to those.
For wide dependencies (i.e., shuffle dependencies), we currently materialize intermediate records on the nodes holding parent partitions to simplify fault recovery, much like MapReduce materializes map outputs.
If a task fails, we re-run it on another node as long as its stage's parents are still available. If some stages have become unavailable (e.g., because an output from the "map side" of a shuffle was lost), we resubmit tasks to compute the missing partitions in parallel. We do not yet tolerate scheduler failures, though replicating the RDD lineage graph would be straightforward.
Finally, although all computations in Spark currently run in response to actions called in the driver program, we are also experimenting with letting tasks on the cluster (e.g., maps) call the lookup operation, which provides random access to elements of hash-partitioned RDDs by key. In this case, tasks would need to tell the scheduler to compute the required partition if it is missing.
Michael Mathioudakis 14
scheduling
when an action is performed... (e.g., count() or save())
... the scheduler examines the lineage graph and builds a DAG of stages to execute
each stage is a maximal pipeline of transformations over narrow dependencies
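A sketch of what this means for a concrete job (the SparkContext sc and the input path are placeholders): the narrow map and filter are pipelined into one stage, the wide dependency introduced by reduceByKey starts a new stage at the shuffle, and nothing runs until the action.

val counts = sc.textFile("hdfs://.../events")
  .map(line => (line.split("\t")(0), 1))   // narrow
  .filter(_._1.nonEmpty)                   // narrow: pipelined with map in the same stage
  .reduceByKey(_ + _)                      // wide dependency: shuffle, hence a stage boundary

counts.collect()                           // action: the scheduler builds the DAG of stages and runs them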
Michael Mathioudakis 15
scheduling
union
groupByKey
join with inputs not co-partitioned
join with inputs co-partitioned
map, filter
Narrow Dependencies: Wide Dependencies:
Figure 4: Examples of narrow and wide dependencies. Eachbox is an RDD, with partitions shown as shaded rectangles.
map to the parent’s records in its iterator method.
union: Calling union on two RDDs returns an RDDwhose partitions are the union of those of the parents.Each child partition is computed through a narrow de-pendency on the corresponding parent.7
sample: Sampling is similar to mapping, except thatthe RDD stores a random number generator seed for eachpartition to deterministically sample parent records.
join: Joining two RDDs may lead to either two nar-row dependencies (if they are both hash/range partitionedwith the same partitioner), two wide dependencies, or amix (if one parent has a partitioner and one does not). Ineither case, the output RDD has a partitioner (either oneinherited from the parents or a default hash partitioner).
5 ImplementationWe have implemented Spark in about 14,000 lines ofScala. The system runs over the Mesos cluster man-ager [17], allowing it to share resources with Hadoop,MPI and other applications. Each Spark program runs asa separate Mesos application, with its own driver (mas-ter) and workers, and resource sharing between these ap-plications is handled by Mesos.
Spark can read data from any Hadoop input source(e.g., HDFS or HBase) using Hadoop’s existing inputplugin APIs, and runs on an unmodified version of Scala.
We now sketch several of the technically interestingparts of the system: our job scheduler (§5.1), our Sparkinterpreter allowing interactive use (§5.2), memory man-agement (§5.3), and support for checkpointing (§5.4).
5.1 Job SchedulingSpark’s scheduler uses our representation of RDDs, de-scribed in Section 4.
Overall, our scheduler is similar to Dryad’s [19], butit additionally takes into account which partitions of per-
7Note that our union operation does not drop duplicate values.
join
union
groupBy
map
Stage 3
Stage 1
Stage 2
A: B:
C: D:
E:
F:
G:
Figure 5: Example of how Spark computes job stages. Boxeswith solid outlines are RDDs. Partitions are shaded rectangles,in black if they are already in memory. To run an action on RDDG, we build build stages at wide dependencies and pipeline nar-row transformations inside each stage. In this case, stage 1’soutput RDD is already in RAM, so we run stage 2 and then 3.
sistent RDDs are available in memory. Whenever a userruns an action (e.g., count or save) on an RDD, the sched-uler examines that RDD’s lineage graph to build a DAGof stages to execute, as illustrated in Figure 5. Each stagecontains as many pipelined transformations with narrowdependencies as possible. The boundaries of the stagesare the shuffle operations required for wide dependen-cies, or any already computed partitions that can short-circuit the computation of a parent RDD. The schedulerthen launches tasks to compute missing partitions fromeach stage until it has computed the target RDD.
Our scheduler assigns tasks to machines based on datalocality using delay scheduling [32]. If a task needs toprocess a partition that is available in memory on a node,we send it to that node. Otherwise, if a task processesa partition for which the containing RDD provides pre-ferred locations (e.g., an HDFS file), we send it to those.
For wide dependencies (i.e., shuffle dependencies), we currently materialize intermediate records on the nodes holding parent partitions to simplify fault recovery, much like MapReduce materializes map outputs.
If a task fails, we re-run it on another node as long as its stage's parents are still available. If some stages have become unavailable (e.g., because an output from the "map side" of a shuffle was lost), we resubmit tasks to compute the missing partitions in parallel. We do not yet tolerate scheduler failures, though replicating the RDD lineage graph would be straightforward.
Finally, although all computations in Spark currently run in response to actions called in the driver program, we are also experimenting with letting tasks on the cluster (e.g., maps) call the lookup operation, which provides random access to elements of hash-partitioned RDDs by key. In this case, tasks would need to tell the scheduler to compute the required partition if it is missing.
(figure legend: RDD, partition, already in RAM)
Michael Mathioudakis 16
memory management
when there is not enough memory, apply an LRU eviction policy at the RDD level
evict a partition from the least recently used RDD
Michael Mathioudakis 17
performance
logistic regression and k-means on Amazon EC2
10 iterations on 100 GB datasets, 100-node clusters
Michael Mathioudakis 18
performance
The read-only nature of RDDs makes them simpler to checkpoint than general shared memory. Because consistency is not a concern, RDDs can be written out in the background without requiring program pauses or distributed snapshot schemes.
6 Evaluation
We evaluated Spark and RDDs through a series of experiments on Amazon EC2, as well as benchmarks of user applications. Overall, our results show the following:
• Spark outperforms Hadoop by up to 20× in iterative machine learning and graph applications. The speedup comes from avoiding I/O and deserialization costs by storing data in memory as Java objects.
• Applications written by our users perform and scale well. In particular, we used Spark to speed up an analytics report that was running on Hadoop by 40×.
• When nodes fail, Spark can recover quickly by rebuilding only the lost RDD partitions.
• Spark can be used to query a 1 TB dataset interactively with latencies of 5–7 seconds.
We start by presenting benchmarks for iterative machine learning applications (§6.1) and PageRank (§6.2) against Hadoop. We then evaluate fault recovery in Spark (§6.3) and behavior when a dataset does not fit in memory (§6.4). Finally, we discuss results for user applications (§6.5) and interactive data mining (§6.6).
Unless otherwise noted, our tests used m1.xlarge EC2 nodes with 4 cores and 15 GB of RAM. We used HDFS for storage, with 256 MB blocks. Before each test, we cleared OS buffer caches to measure I/O costs accurately.
6.1 Iterative Machine Learning Applications
We implemented two iterative machine learning applications, logistic regression and k-means, to compare the performance of the following systems:
• Hadoop: The Hadoop 0.20.2 stable release.
• HadoopBinMem: A Hadoop deployment that converts the input data into a low-overhead binary format in the first iteration to eliminate text parsing in later ones, and stores it in an in-memory HDFS instance.
• Spark: Our implementation of RDDs.
We ran both algorithms for 10 iterations on 100 GB datasets using 25–100 machines. The key difference between the two applications is the amount of computation they perform per byte of data. The iteration time of k-means is dominated by computation, while logistic regression is less compute-intensive and thus more sensitive to time spent in deserialization and I/O.
Since typical learning algorithms need tens of iterations to converge, we report times for the first iteration and subsequent iterations separately. We find that sharing data via RDDs greatly speeds up future iterations.
Figure 7: Duration of the first and later iterations in Hadoop, HadoopBinMem and Spark for logistic regression and k-means using 100 GB of data on a 100-node cluster. (bars show first iteration and later iterations; y-axis: iteration time in seconds)
Figure 8: Running times for iterations after the first in Hadoop, HadoopBinMem, and Spark. The jobs all processed 100 GB. (a) Logistic Regression, (b) K-Means; x-axis: number of machines (25, 50, 100); y-axis: iteration time in seconds.
First Iterations All three systems read text input from HDFS in their first iterations. As shown in the light bars in Figure 7, Spark was moderately faster than Hadoop across experiments. This difference was due to signaling overheads in Hadoop's heartbeat protocol between its master and workers. HadoopBinMem was the slowest because it ran an extra MapReduce job to convert the data to binary, and it had to write this data across the network to a replicated in-memory HDFS instance.
Subsequent Iterations Figure 7 also shows the average running times for subsequent iterations, while Figure 8 shows how these scaled with cluster size. For logistic regression, Spark was 25.3× and 20.7× faster than Hadoop and HadoopBinMem respectively on 100 machines. For the more compute-intensive k-means application, Spark still achieved a speedup of 1.9× to 3.2×.
Understanding the Speedup We were surprised to find that Spark outperformed even Hadoop with in-memory storage of binary data (HadoopBinMem) by a 20× margin. In HadoopBinMem, we had used Hadoop's standard binary format (SequenceFile) and a large block size of 256 MB, and we had forced HDFS's data directory to be on an in-memory file system. However, Hadoop still ran slower due to several factors:
1. Minimum overhead of the Hadoop software stack,
2. Overhead of HDFS while serving data, and
3. Deserialization cost to convert binary records to usable in-memory Java objects.
Michael Mathioudakis 19
performance
Example: Logistic Regression
(plot: running time (s) vs. number of iterations, 1–30, for logistic regression, Hadoop vs. Spark; Hadoop: ~110 s per iteration; Spark: first iteration 80 s, further iterations 1 s; 2015 figures)
Michael Mathioudakis 20
spark programming with python
Michael Mathioudakis 21
spark programming
creating rdds
transformations & actions
lazy evaluation & persistency
passing custom functions
working with key-value pairs
data partitioning
accumulators & broadcast variables
pyspark
Michael Mathioudakis 22
driver program
contains the main function of the application
defines rdds
applies operations on them
e.g., the spark shell itself is a driver program
driver programs access spark through a SparkContext object
Michael Mathioudakis 23
example
import pyspark
sc = pyspark.SparkContext(master = "local", appName = "tour")
text = sc.textFile("myfile.txt")   # load data
text.count()                       # count lines
operation / create rdd
this part is assumed from now on
if we are running on a cluster of machines, different machines might count different parts of the file
SparkContext is automatically created in the spark shell
Michael Mathioudakis 24
example
text = sc.textFile("myfile.txt")   # load data
# keep only lines that mention "Spark"
spark_lines = text.filter(lambda line: 'Spark' in line)
spark_lines.count()                # count lines
operation with a custom function
on a cluster, Spark ships the function to all workers
Michael Mathioudakis 25
lambda functions in python
f = lambda line: 'Spark' in line
f("we are learning Spark")

def f(line):
    return 'Spark' in line

f("we are learning Spark")
Michael Mathioudakis 26
stopping
text = sc.textFile("myfile.txt")   # load data
# keep only lines that mention "Spark"
spark_lines = text.filter(lambda line: 'Spark' in line)
spark_lines.count()                # count lines
sc.stop()
Michael Mathioudakis 27
rdds
resilient distributed datasets
resilient: easy to recover
distributed: different partitions materialize on different nodes
read-only (immutable)
but can be transformed to other rdds
Michael Mathioudakis 28
creating rdds
loading an external dataset
text = sc.textFile("myfile.txt")

distributing a collection of objects
data = sc.parallelize([0,1,2,3,4,5,6,7,8,9])

transforming other rdds
text_spark = text.filter(lambda line: 'Spark' in line)
data_length = data.map(lambda num: num ** 2)
Michael Mathioudakis 29
rdd operations
transformations return a new rdd
actions extract information from an rdd or save it to disk
inputRDD = sc.textFile("logfile.txt")

errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)
badlinesRDD = errorsRDD.union(warningsRDD)

print("Input had", badlinesRDD.count(), "concerning lines.")
print("Here are some of them:")
for line in badlinesRDD.take(10):
    print(line)
Michael Mathioudakis 30
rdd operations
import numpy as np

def is_prime(num):
    if num < 1 or num % 1 != 0:
        raise Exception("invalid argument")
    if num == 1:        # 1 is not prime
        return False
    for d in range(2, int(np.sqrt(num) + 1)):
        if num % d == 0:
            return False
    return True
numbersRDD = sc.parallelize(list(range(1, 1000000)))
primesRDD = numbersRDD.filter(is_prime)
primes = primesRDD.collect()
print(primes[:100])
create RDD
transformation
action
operation in driver: what if primes does not fit in memory?
transformations return a new rdd
actions extract information from an rdd or save it to disk
Michael Mathioudakis 31
rdd operations
def is_prime(num):
    if num < 1 or num % 1 != 0:
        raise Exception("invalid argument")
    if num == 1:        # 1 is not prime
        return False
    for d in range(2, int(np.sqrt(num) + 1)):
        if num % d == 0:
            return False
    return True
numbersRDD = sc.parallelize(list(range(1, 1000000)))
primesRDD = numbersRDD.filter(is_prime)
primesRDD.saveAsTextFile("primes.txt")
transformations return a new rdd
actions extract information from an rdd or save it to disk
Michael Mathioudakis 32
rdds
evaluated lazily; ephemeral
can persist in memory (or disk) if we ask
Michael Mathioudakis 33
lazy evaluation
numbersRDD = sc.parallelize(range(1, 1000000))
primesRDD = numbersRDD.filter(is_prime)
primesRDD.saveAsTextFile("primes.txt")

no cluster activity until the last line (the action)

numbersRDD = sc.parallelize(range(1, 1000000))
primesRDD = numbersRDD.filter(is_prime)
primes = primesRDD.collect()
print(primes[:100])

no cluster activity until the last two lines (collect is the action)
Michael Mathioudakis 34
persistence
RDDs can persist in memory, if we ask politely
numbersRDD = sc.parallelize(list(range(1, 1000000)))
primesRDD = numbersRDD.filter(is_prime)
primesRDD.persist()
primesRDD.count()
primesRDD.take(10)
RDD already in memory
causes RDD to materialize
Michael Mathioudakis 35
persistence
why?
screenshot from jupyter notebook
Michael Mathioudakis 36
persistence
we can ask Spark to maintain rdds on disk or even keep replicas on different nodes
data.persist(pyspark.StorageLevel(useDisk=True, useMemory=True,
                                  useOffHeap=False, deserialized=False,
                                  replication=2))
# equivalent to the predefined pyspark.StorageLevel.MEMORY_AND_DISK_2
to cease persistence
data.unpersist()
removes rdd from memory and disk
Michael Mathioudakis 37
passing functions
lambda functions
function references
text = sc.textFile("myfile.txt")
text_spark = text.filter(lambda line: 'Spark' in line)

def f(line):
    return 'Spark' in line

text = sc.textFile("myfile.txt")
text_spark = text.filter(f)
Michael Mathioudakis 38
passing functions
warning! if the function is a member of an object (self.method) or references fields of an object (e.g., self.field)... Spark serializes and sends the entire object
to the worker nodes; this can be very inefficient
Michael Mathioudakis 39
passing functions
class SearchFunctions(object):
    def __init__(self, query):
        self.query = query
    def is_match(self, s):
        return self.query in s
    def get_matches_in_rdd_v1(self, rdd):
        return rdd.filter(self.is_match)
    def get_matches_in_rdd_v2(self, rdd):
        return rdd.filter(lambda x: self.query in x)
where is the problem in the code above?
Michael Mathioudakis 40
passing functions
class SearchFunctions(object):
    def __init__(self, query):
        self.query = query
    def is_match(self, s):
        return self.query in s
    def get_matches_in_rdd_v1(self, rdd):
        return rdd.filter(self.is_match)
    def get_matches_in_rdd_v2(self, rdd):
        return rdd.filter(lambda x: self.query in x)
where is the problem in the code above?
reference to object method
reference to object field
Michael Mathioudakis 41
passing functions
better implementation
class SearchFunctions(object):
    def __init__(self, query):
        self.query = query
    def is_match(self, s):
        return self.query in s
    def get_matches_in_rdd(self, rdd):
        query = self.query   # copy the field into a local variable
        return rdd.filter(lambda x: query in x)
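A short usage sketch for the class above (hypothetical file name):

sf = SearchFunctions("Spark")
text = sc.textFile("myfile.txt")
# only the local variable query is captured by the lambda,
# so the SearchFunctions object itself is not serialized
matches = sf.get_matches_in_rdd(text)
matches.count()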
Michael Mathioudakis 42
common rdd operations
element-wise transformations: map and filter
inputRDD {1,2,3,4} --.map(lambda x: x**2)--> mappedRDD {1,4,9,16}
inputRDD {1,2,3,4} --.filter(lambda x: x!=1)--> filteredRDD {2,3,4}
map's return type can be different from its input's
Michael Mathioudakis 43
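For instance, a minimal sketch where map turns string elements into integers:

words = sc.parallelize(["spark", "rdd", "mapreduce"])
lengths = words.map(lambda w: len(w))   # str elements in, int elements out
lengths.collect()                       # [5, 3, 9]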
common rdd operations
element-wise transformations that produce multiple elements per input element
flatMap
phrases = sc.parallelize(["hello world", "how are you", "how do you do"])
words = phrases.flatMap(lambda phrase: phrase.split(" "))
words.count()
9
Michael Mathioudakis 44
common rdd operations

phrases = sc.parallelize(["hello world", "how are you", "how do you do"])
words = phrases.flatMap(lambda phrase: phrase.split(" "))
words.collect()

phrases = sc.parallelize(["hello world", "how are you", "how do you do"])
words = phrases.map(lambda phrase: phrase.split(" "))
words.collect()
how is the result different?
[['hello', 'world'], ['how', 'are', 'you'], ['how', 'do', 'you', 'do']]
['hello', 'world', 'how', 'are', 'you', 'how', 'do', 'you', 'do'] Michael Mathioudakis 45
common rdd operations
(pseudo) set operations

oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
oneRDD.persist()
otherRDD = sc.parallelize([1, 4, 4, 7])
otherRDD.persist()
Michael Mathioudakis 46
common rdd operations: (pseudo) set operations
union

oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
oneRDD.persist()
otherRDD = sc.parallelize([1, 4, 4, 7])
otherRDD.persist()
oneRDD.union(otherRDD).collect()
[1, 1, 1, 2, 3, 3, 4, 4, 1, 4, 4, 7]
Michael Mathioudakis 47
common rdd operations
subtraction
oneRDD.subtract(otherRDD).collect()
[2, 3, 3]
(pseudo) set operations
oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
oneRDD.persist()
otherRDD = sc.parallelize([1, 4, 4, 7])
otherRDD.persist()
Michael Mathioudakis 48
common rdd operations
duplicate removal
oneRDD.distinct().collect()
[1, 2, 3, 4]
(pseudo) set operations
oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
oneRDD.persist()
otherRDD = sc.parallelize([1, 4, 4, 7])
otherRDD.persist()
Michael Mathioudakis 49
common rdd operations
intersection
oneRDD.intersection(otherRDD).collect()
[1, 4]
(pseudo) set operations
oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
oneRDD.persist()
otherRDD = sc.parallelize([1, 4, 4, 7])
otherRDD.persist()
removes duplicates!
Michael Mathioudakis 50
common rdd operations
cartesian product
oneRDD.cartesian(otherRDD).collect()[:5]
[(1, 1), (1, 4), (1, 4), (1, 7), (1, 1)]
(pseudo) set operations
oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
oneRDD.persist()
otherRDD = sc.parallelize([1, 4, 4, 7])
otherRDD.persist()
Michael Mathioudakis 51
common rdd operations: (pseudo) set operations
big difference in implementation (and efficiency) depending on whether partition shuffling is needed
no shuffling: union
shuffling: subtraction, duplicate removal, intersection, cartesian product
Michael Mathioudakis 52
sortBy
common rdd operations
how is sortBy implemented? we’ll see later...
data = sc.parallelize(np.random.rand(10))
data.sortBy(lambda x: x)
Michael Mathioudakis 53
common rdd operations: actions
reduce: successively operates on two elements of the rdd
returns a new element of the same type
data.reduce(lambda x, y: x + y)
181
data = sc.parallelize([1,43,62,23,52])
data.reduce(lambda x, y: x * y)
3188536
commutative & associative functions
Michael Mathioudakis 54
commutative & associative function f(x, y)
e.g., add(x, y) = x + y
commutative: f(x, y) = f(y, x)
e.g., add(x, y) = x + y = y + x = add(y, x)
associative: f(x, f(y, z)) = f(f(x, y), z)
e.g., add(x, add(y, z)) = x + (y + z) = (x + y) + z = add(add(x, y), z)
Michael Mathioudakis 55
common rdd operations: actions
data = sc.parallelize([1,43,62,23,52])
data.reduce(lambda x, y: x**2 + y**2)
137823683725010149883130929
compute the sum of squares of data: is the following correct?
no: after the first step, the partial result is already a sum of squares, and the function squares it again (the function is not associative)
reduce: successively operates on two elements of the rdd
produces a single aggregate
56
common rdd operations: actions
data = sc.parallelize([1,43,62,23,52])
data.reduce(lambda x, y: np.sqrt(x**2 + y**2)) ** 2
8927.0
yes: f(x, y) = sqrt(x**2 + y**2) is commutative and associative, so the reduce yields the square root of the sum of squares, and squaring it at the end recovers the sum of squares
compute the sum of squares of data: is the following correct?
reduce: successively operates on two elements of the rdd
produces a single aggregate
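A simpler correct alternative (a sketch using the same data): square each element with map first, then reduce with plain addition, which is both commutative and associative.

data = sc.parallelize([1, 43, 62, 23, 52])
data.map(lambda x: x**2).reduce(lambda x, y: x + y)   # 8927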
Michael Mathioudakis 57
common rdd operations: actions
aggregate generalizes reduce
the user provides
a zero value: the identity element for the aggregation
a sequential operation (function): to update the aggregation for one more element in one partition
a combining operation (function): to combine aggregates from different partitions
Michael Mathioudakis 58
common rdd operations: actions
data = sc.parallelize([1,43,62,23,52])
aggr = data.aggregate(zeroValue = (0, 0),
                      seqOp = (lambda x, y: (x[0] + y, x[1] + 1)),
                      combOp = (lambda x, y: (x[0] + y[0], x[1] + y[1])))
aggr[0] / aggr[1]
what does the following compute?
zero value
sequential operation
combining operation
the average value of data
aggregate generalizes reduce
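As a sanity check (a sketch using built-in actions), the same average can be computed directly:

data = sc.parallelize([1, 43, 62, 23, 52])
data.sum() / data.count()   # 36.2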
Michael Mathioudakis 59
common rdd operations: actions
operation: return value
collect: all elements
take(num): num elements (tries to minimize disk access, e.g., by accessing one partition)
takeSample: a random sample of elements
count: number of elements
countByValue: number of times each element appears (first on each partition, then combines partition results)
top(num): num maximum elements (sorts partitions and merges)
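A few of these actions on a small illustrative RDD (a sketch; takeSample output varies by run):

nums = sc.parallelize([1, 1, 2, 3, 3, 3])
nums.count()               # 6
nums.countByValue()        # {1: 2, 2: 1, 3: 3}
nums.top(2)                # [3, 3]
nums.take(3)               # [1, 1, 2]
nums.takeSample(False, 2)  # e.g. [3, 1]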
Michael Mathioudakis 60
all operations we have described so far apply to all rdds
that's why the word "common" has been in the title
common rdd operations
Michael Mathioudakis 61
pair rdds
elements are key-‐value pairs
pairRDD = sc.parallelize(range(100)).map(lambda x: (x, x**2))
pairRDD.collect()[:5]
[(0, 0), (1, 1), (2, 4), (3, 9), (4, 16)]
they come from the mapreduce model; practical in many cases
spark provides operations tailored to pair rdds
Michael Mathioudakis 62
transformations on pair rdds
pairRDD = sc.parallelize(range(100)).map(lambda x: (x, x**2))
keys and values
pairRDD.keys().collect()[:5]
[0, 1, 2, 3, 4]
pairRDD.values().collect()[:5]
[0, 1, 4, 9, 16]
Michael Mathioudakis 63
transformations on pair rdds: reduceByKey

pairRDD = sc.parallelize([
    ('$APPL', 100.64),
    ('$GOOG', 706.2),
    ('$AMZN', 552.32),
    ('$APPL', 100.52),
    ('$AMZN', 552.32)
])

volumePerKey = pairRDD.reduceByKey(lambda x, y: x + y)
volumePerKey.collect()
[('$APPL', 201.16), ('$AMZN', 1104.64), ('$GOOG', 706.2)]
reduceByKey is a transformation, unlike reduce
Michael Mathioudakis 64
transformations on pair rdds: combineByKey
generalizes reduceByKey
the user provides
createCombiner function: provides the zero value for each key
mergeValue function: combines the current aggregate in one partition with a new value
mergeCombiners function: combines aggregates from partitions
Michael Mathioudakis 65
combineByKey generalizes reduceByKey
pairRDD = sc.parallelize([
    ('$APPL', 100.64),
    ('$GOOG', 706.2),
    ('$AMZN', 552.32),
    ('$APPL', 100.52),
    ('$AMZN', 552.32)
])

aggr = pairRDD.combineByKey(createCombiner = lambda x: (x, 1),
                            mergeValue = lambda x, y: (x[0] + y, x[1] + 1),
                            mergeCombiners = lambda x, y: (x[0] + y[0], x[1] + y[1]))
avgPerKey = aggr.map(lambda x: (x[0], x[1][0]/x[1][1]))
avgPerKey.collect()
what does the following produce? (the average value per key)
transformations on pair rdds
Michael Mathioudakis 66
transformations on pair rdds
sortByKey
samples values from the rdd to estimate sorted partition boundaries
shuffles data; sorts by external sorting
used to implement common sortBy
idea: create a pair RDD with (sort-key, item) elements and apply sortByKey on that
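A minimal sketch of that idea (a hypothetical helper, not Spark's actual implementation):

def my_sort_by(rdd, keyfunc):
    # pair each item with its sort key, sort by the key, then drop the key
    return rdd.map(lambda item: (keyfunc(item), item)) \
              .sortByKey() \
              .map(lambda pair: pair[1])

my_sort_by(sc.parallelize([-3, 1, -2]), abs).collect()   # [1, -2, -3]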
Michael Mathioudakis 67
transformations on pair rdds
implemented with a variant of a hash join; spark also has functions for
left outer join, right outer join, full outer join
(inner) join
Michael Mathioudakis 68
transformations on pair rdds
(inner) join
course_a = sc.parallelize([("Antti", 8), ("Tuukka", 10), ("Leena", 9)])
course_b = sc.parallelize([("Leena", 10), ("Tuukka", 10)])
result = course_a.join(course_b)
result.collect()
[('Tuukka', (10, 10)), ('Leena', (9, 10))]
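The outer-join variants mentioned above behave similarly; a sketch with the same course_a and course_b, where keys without a match on the other side get None:

course_a.leftOuterJoin(course_b).collect()
# [('Antti', (8, None)), ('Tuukka', (10, 10)), ('Leena', (9, 10))]  (order may vary)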
Michael Mathioudakis 69
transformations on pair rdds
other transformations
groupByKey
pairRDD = sc.parallelize([
    ('$APPL', 100.64),
    ('$GOOG', 706.2),
    ('$AMZN', 552.32),
    ('$APPL', 100.52),
    ('$AMZN', 552.32)
])

pairRDD.groupByKey().collect()
[('$APPL', < values >), ('$AMZN', < values >), ('$GOOG', < values >)]
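The grouped values above are returned as an iterable; a sketch that materializes them into lists for inspection:

pairRDD.groupByKey().map(lambda kv: (kv[0], list(kv[1]))).collect()
# e.g. [('$APPL', [100.64, 100.52]), ('$AMZN', [552.32, 552.32]), ('$GOOG', [706.2])]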
for grouping together multiple rdds: cogroup and groupWith
Michael Mathioudakis 70
actions on pair rdds
lookup(key)
pairRDD = sc.parallelize([
    ('$APPL', 100.64),
    ('$GOOG', 706.2),
    ('$AMZN', 552.32),
    ('$APPL', 100.52),
    ('$AMZN', 552.32)
])
countByKey
{'$AMZN': 2, '$APPL': 2, '$GOOG': 1}
pairRDD.countByKey()
collectAsMap
pairRDD.collectAsMap()
{'$AMZN': 552.32, '$APPL': 100.52, '$GOOG': 706.2}
pairRDD.lookup("$APPL")
[100.64, 100.52]
Michael Mathioudakis 71
shared variables
accumulators: write-only for workers
broadcast variables: read-only for workers
Michael Mathioudakis 72
accumulators
text = sc.textFile("myfile.txt")
long_lines = sc.accumulator(0)

def line_len(line):
    global long_lines
    length = len(line)
    if length > 30:
        long_lines += 1
    return length

llengthRDD = text.map(line_len)
llengthRDD.count()
95
long_lines.value
45
lazy!
Michael Mathioudakis 73
accumulators
fault tolerance
spark executes accumulator updates made in actions only once, e.g., in foreach()
foreach: special action
this is not guaranteed for transformations; in transformations, use accumulators
only for debugging purposes!
Michael Mathioudakis 74
accumulators + foreach
text = sc.textFile("myfile.txt")
long_lines = sc.accumulator(0)

def line_len(line):
    global long_lines
    length = len(line)
    if length > 30:
        long_lines += 1

text.foreach(line_len)
long_lines.value
45
Michael Mathioudakis 75
broadcast variables
sent to workers only once
read-only: even if you change its value on a worker,
the change does not propagate to other workers
(actually, the broadcast object is written to a file, and read from there by each worker)
release with unpersist()
Michael Mathioudakis 76
broadcast variables

def load_address_table():
    return {"Anu": "Chem. A143", "Karmen": "VTT, 74",
            "Michael": "OIH, B253.2", "Anwar": "T, B103",
            "Orestis": "T, A341", "Darshan": "T, A325"}

address_table = sc.broadcast(load_address_table())

def find_address(name):
    res = None
    if name in address_table.value:
        res = address_table.value[name]
    return res

data = sc.parallelize(["Anwar", "Michael", "Orestis", "Darshan"])
pairRDD = data.map(lambda name: (name, find_address(name)))
pairRDD.collectAsMap()
Michael Mathioudakis 77
partitioning
certain operations take advantage of partitioning
e.g., reduceByKey, join
anRDD.partitionBy(numPartitions, partitionFunc)
users can set the number of partitions and the partitioning function
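A brief sketch with illustrative data, setting the number of partitions (a custom partitionFunc could be passed as the second argument; the default hashes the key):

pairs = sc.parallelize([("a", 1), ("b", 2), ("c", 3)])
partitioned = pairs.partitionBy(8)
partitioned.persist()   # keep the partitioning so later reduceByKey / join calls can reuse it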
Michael Mathioudakis 78
working on a per-partition basis
spark provides operations that operate at the partition level, e.g., mapPartitions
rdd = sc.parallelize(range(100), 4)

def f(iterator):
    yield sum(iterator)

rdd.mapPartitions(f).collect()

used in the implementation of Spark
Michael Mathioudakis 79
see the implementation of spark at https://github.com/apache/spark/
Michael Mathioudakis 80
references
1. Zaharia, Matei, et al. "Spark: Cluster Computing with Working Sets." HotCloud 10 (2010): 10-10.
2. Zaharia, Matei, et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation.
3. Karau, Holden, Andy Konwinski, Patrick Wendell, and Matei Zaharia. Learning Spark: Lightning-Fast Big Data Analysis.
4. Spark programming guide: https://spark.apache.org/docs/latest/programming-guide.html
5. Spark implementation: https://github.com/apache/spark/
6. Zaharia, Matei. "Making Big Data Processing Simple with Spark." https://youtu.be/d9D-Z3-44F8
Michael Mathioudakis 81