Lecture 07 - CS-5040 - modern database systems


Transcript of Lecture 07 - CS-5040 - modern database systems

Page 1: Lecture 07 -  CS-5040 - modern database systems

Modern Database Systems, Lecture 7

Aristides Gionis, Michael Mathioudakis

Spring 2016

Page 2: Lecture 07 -  CS-5040 - modern database systems

logistics

assignment 1: currently marking

might upload best student solutions? will provide annotated pdfs with marks

virtualbox has the same path & filename as the previous one


Page 3: Lecture 07 -  CS-5040 - modern database systems

continuing from last lecture: the original paper on Spark


Page 4: Lecture 07 -  CS-5040 - modern database systems

previously... on modern database systems

generalization of mapreduce, more suitable for iterative workflows: iterative algorithms and repetitive querying

rdd: resilient distributed dataset

read-only, lazily evaluated, easily re-created; ephemeral, unless we need to keep them in memory


Page 5: Lecture 07 -  CS-5040 - modern database systems

example:  text  search  

suppose that a web service is experiencing errors and you want to search over terabytes of logs to find the cause

the logs are stored in the Hadoop Filesystem (HDFS); errors are written in the logs as lines that start with the keyword "ERROR"


Page 6: Lecture 07 -  CS-5040 - modern database systems

example:  text  search  

Figure 1: Lineage graph for the third query in our example. Boxes represent RDDs and arrows represent transformations. (The graph runs: lines → errors via filter(_.startsWith("ERROR")); errors → HDFS errors via filter(_.contains("HDFS")); HDFS errors → time fields via map(_.split('\t')(3)).)

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()

Line 1 defines an RDD backed by an HDFS file (as a collection of lines of text), while line 2 derives a filtered RDD from it. Line 3 then asks for errors to persist in memory so that it can be shared across queries. Note that the argument to filter is Scala syntax for a closure.

At this point, no work has been performed on the cluster. However, the user can now use the RDD in actions, e.g., to count the number of messages:

errors.count()

The user can also perform further transformations on the RDD and use their results, as in the following lines:

// Count errors mentioning MySQL:
errors.filter(_.contains("MySQL")).count()

// Return the time fields of errors mentioning
// HDFS as an array (assuming time is field
// number 3 in a tab-separated format):
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()

After the first action involving errors runs, Spark will store the partitions of errors in memory, greatly speeding up subsequent computations on it. Note that the base RDD, lines, is not loaded into RAM. This is desirable because the error messages might only be a small fraction of the data (small enough to fit into memory).
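As an aside, persist() above uses Spark's default in-memory storage. Spark's RDD API also accepts an explicit storage level, so a variant of the same pipeline that spills overflowing partitions to disk could look like the sketch below. This is only a sketch: it assumes, as in the excerpt, a SparkContext named spark, the HDFS path is a placeholder, and the import path is the one used by current Spark releases.

import org.apache.spark.storage.StorageLevel

// Same pipeline as above, but with an explicit storage level.
// MEMORY_ONLY keeps deserialized partitions in RAM (the default for persist());
// MEMORY_AND_DISK writes partitions that do not fit in RAM to disk instead of recomputing them later.
val lines  = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
errors.persist(StorageLevel.MEMORY_AND_DISK)

errors.count()  // first action: the partitions of errors are materialized and cached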

Finally, to illustrate how our model achieves fault tolerance, we show the lineage graph for the RDDs in our third query in Figure 1. In this query, we started with errors, the result of a filter on lines, and applied a further filter and map before running a collect. The Spark scheduler will pipeline the latter two transformations and send a set of tasks to compute them to the nodes holding the cached partitions of errors. In addition, if a partition of errors is lost, Spark rebuilds it by applying a filter on only the corresponding partition of lines.

Aspect                      | RDDs                                        | Distr. Shared Mem.
Reads                       | Coarse- or fine-grained                     | Fine-grained
Writes                      | Coarse-grained                              | Fine-grained
Consistency                 | Trivial (immutable)                         | Up to app / runtime
Fault recovery              | Fine-grained and low-overhead using lineage | Requires checkpoints and program rollback
Straggler mitigation        | Possible using backup tasks                 | Difficult
Work placement              | Automatic based on data locality            | Up to app (runtimes aim for transparency)
Behavior if not enough RAM  | Similar to existing data flow systems       | Poor performance (swapping?)

Table 1: Comparison of RDDs with distributed shared memory.

2.3 Advantages of the RDD Model

To understand the benefits of RDDs as a distributed memory abstraction, we compare them against distributed shared memory (DSM) in Table 1. In DSM systems, applications read and write to arbitrary locations in a global address space. Note that under this definition, we include not only traditional shared memory systems [24], but also other systems where applications make fine-grained writes to shared state, including Piccolo [27], which provides a shared DHT, and distributed databases. DSM is a very general abstraction, but this generality makes it harder to implement in an efficient and fault-tolerant manner on commodity clusters.

The main difference between RDDs and DSM is that RDDs can only be created ("written") through coarse-grained transformations, while DSM allows reads and writes to each memory location.3 This restricts RDDs to applications that perform bulk writes, but allows for more efficient fault tolerance. In particular, RDDs do not need to incur the overhead of checkpointing, as they can be recovered using lineage.4 Furthermore, only the lost partitions of an RDD need to be recomputed upon failure, and they can be recomputed in parallel on different nodes, without having to roll back the whole program.

A second benefit of RDDs is that their immutable nature lets a system mitigate slow nodes (stragglers) by running backup copies of slow tasks as in MapReduce [10]. Backup tasks would be hard to implement with DSM, as the two copies of a task would access the same memory locations and interfere with each other's updates.

Finally, RDDs provide two other benefits over DSM. First, in bulk operations on RDDs, a runtime can schedule tasks based on data locality to improve performance.

3 Note that reads on RDDs can still be fine-grained. For example, an application can treat an RDD as a large read-only lookup table.

4 In some applications, it can still help to checkpoint RDDs with long lineage chains, as we discuss in Section 5.4. However, this can be done in the background because RDDs are immutable, and there is no need to take a snapshot of the whole application as in DSM.


in Scala...

rdd (from a file)

rdd (transformation)

hint: keep in memory!

no work on the cluster so far

action! lines is not loaded to ram!


Page 7: Lecture 07 -  CS-5040 - modern database systems

example - text search ctd.

let  us  find  errors  related  to  “MySQL”  


Page 8: Lecture 07 -  CS-5040 - modern database systems

example - text search ctd.


transformation (filter), action (count)


Page 9: Lecture 07 -  CS-5040 - modern database systems

example - text search ctd. again

let us find errors related to "HDFS" and extract their time field

assuming time is field no. 3 in tab-separated format


Page 10: Lecture 07 -  CS-5040 - modern database systems

example - text search ctd. again


transformations (filter, map)

action (collect)


Page 11: Lecture 07 -  CS-5040 - modern database systems

example: text search - lineage of time fields


cached

pipelined transformations

if a partition of errors is lost, the filter is applied to only the corresponding partition of lines
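To make that recovery step concrete, here is a minimal, Spark-free sketch of partition-level recomputation via lineage. All names, types, and log lines below are made up for illustration; this is not Spark's API.

// Toy model of a partitioned dataset and a coarse-grained transformation.
case class Partition[T](index: Int, data: Seq[T])

// Parent "RDD": the partitions of lines, assumed still available (e.g., in HDFS).
val linesPartitions: Vector[Partition[String]] = Vector(
  Partition(0, Seq("INFO all good", "ERROR disk failure\t...\t...\t12:01")),
  Partition(1, Seq("ERROR HDFS timeout\t...\t...\t12:07", "INFO all good"))
)

// Lineage of errors: the same per-partition function that produced it in the first place.
def computeErrorsPartition(p: Partition[String]): Partition[String] =
  Partition(p.index, p.data.filter(_.startsWith("ERROR")))

// Suppose partition 1 of errors is lost on some node: it is rebuilt
// from partition 1 of lines only, not by re-reading the whole dataset.
val recovered = computeErrorsPartition(linesPartitions(1))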


Page 12: Lecture 07 -  CS-5040 - modern database systems

representing rdds

internal information about rdds

partitions & partitioning scheme
dependencies on parent RDDs
function to compute it from parents
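The paper captures this bookkeeping in a small common interface that every RDD implements. The trait below is a minimal sketch of that idea; its names and signatures are illustrative and do not match Spark's actual internal classes.

// Minimal sketch of the per-RDD information listed above (illustrative, not Spark's code).
trait SplitInfo { def index: Int }                        // one partition of the dataset

trait ParentLink[T] { def parent: RDDLike[T] }            // dependency on a parent RDD

trait RDDLike[T] {
  def partitions: Seq[SplitInfo]                          // how the data is split
  def dependencies: Seq[ParentLink[_]]                    // which parent RDDs this one is derived from
  def compute(split: SplitInfo): Iterator[T]              // recompute one partition from its parents
  def preferredLocations(split: SplitInfo): Seq[String] = Nil   // optional data-locality hints
  def partitioner: Option[String] = None                  // partitioning scheme, if any (e.g., "hash")
}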


Page 13: Lecture 07 -  CS-5040 - modern database systems

rdd  dependencies  

narrow dependencies: each partition of the parent rdd is used by at most one partition of the child rdd

otherwise, wide dependencies
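One way to read the definition is as a mapping from a child partition to the parent partitions it consumes. The toy sketch below is our own formulation, not Spark's class hierarchy; the narrow case is modeled after map and filter, the wide case after a shuffle such as groupByKey.

// Toy model: a dependency says which parent partitions each child partition reads.
sealed trait Dep { def parentsOf(childPartition: Int): Seq[Int] }

// Narrow: each parent partition is used by at most one child partition (e.g., map, filter).
case object OneToOneDep extends Dep {
  def parentsOf(childPartition: Int): Seq[Int] = Seq(childPartition)
}

// Wide: every child partition may read every parent partition, so a shuffle is required
// (e.g., groupByKey, or a join whose inputs are not co-partitioned).
case class ShuffleDep(numParentPartitions: Int) extends Dep {
  def parentsOf(childPartition: Int): Seq[Int] = 0 until numParentPartitions
}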


Page 14: Lecture 07 -  CS-5040 - modern database systems

rdd  dependencies  

Figure 4: Examples of narrow and wide dependencies. Each box is an RDD, with partitions shown as shaded rectangles. Narrow dependencies: map, filter; union; join with inputs co-partitioned. Wide dependencies: groupByKey; join with inputs not co-partitioned.

map: Calling map on an RDD returns an RDD that has the same partitions and preferred locations as its parent, but applies the function passed to map to the parent's records in its iterator method.

union: Calling union on two RDDs returns an RDD whose partitions are the union of those of the parents. Each child partition is computed through a narrow dependency on the corresponding parent.7

sample: Sampling is similar to mapping, except that the RDD stores a random number generator seed for each partition to deterministically sample parent records.

join: Joining two RDDs may lead to either two narrow dependencies (if they are both hash/range partitioned with the same partitioner), two wide dependencies, or a mix (if one parent has a partitioner and one does not). In either case, the output RDD has a partitioner (either one inherited from the parents or a default hash partitioner).

5 Implementation

We have implemented Spark in about 14,000 lines of Scala. The system runs over the Mesos cluster manager [17], allowing it to share resources with Hadoop, MPI and other applications. Each Spark program runs as a separate Mesos application, with its own driver (master) and workers, and resource sharing between these applications is handled by Mesos.

Spark can read data from any Hadoop input source (e.g., HDFS or HBase) using Hadoop's existing input plugin APIs, and runs on an unmodified version of Scala.

We now sketch several of the technically interesting parts of the system: our job scheduler (§5.1), our Spark interpreter allowing interactive use (§5.2), memory management (§5.3), and support for checkpointing (§5.4).

5.1 Job Scheduling

Spark's scheduler uses our representation of RDDs, described in Section 4.

Overall, our scheduler is similar to Dryad's [19], but it additionally takes into account which partitions of persistent RDDs are available in memory.

7 Note that our union operation does not drop duplicate values.

Figure 5: Example of how Spark computes job stages. Boxes with solid outlines are RDDs. Partitions are shaded rectangles, in black if they are already in memory. To run an action on RDD G, we build stages at wide dependencies and pipeline narrow transformations inside each stage. In this case, stage 1's output RDD is already in RAM, so we run stage 2 and then 3. (The example DAG connects RDDs A through G via map, union, groupBy, and join, grouped into stages 1-3.)

Whenever a user runs an action (e.g., count or save) on an RDD, the scheduler examines that RDD's lineage graph to build a DAG of stages to execute, as illustrated in Figure 5. Each stage contains as many pipelined transformations with narrow dependencies as possible. The boundaries of the stages are the shuffle operations required for wide dependencies, or any already computed partitions that can short-circuit the computation of a parent RDD. The scheduler then launches tasks to compute missing partitions from each stage until it has computed the target RDD.

Our scheduler assigns tasks to machines based on data locality using delay scheduling [32]. If a task needs to process a partition that is available in memory on a node, we send it to that node. Otherwise, if a task processes a partition for which the containing RDD provides preferred locations (e.g., an HDFS file), we send it to those.

For wide dependencies (i.e., shuffle dependencies), we currently materialize intermediate records on the nodes holding parent partitions to simplify fault recovery, much like MapReduce materializes map outputs.

If a task fails, we re-run it on another node as long as its stage's parents are still available. If some stages have become unavailable (e.g., because an output from the "map side" of a shuffle was lost), we resubmit tasks to compute the missing partitions in parallel. We do not yet tolerate scheduler failures, though replicating the RDD lineage graph would be straightforward.

Finally, although all computations in Spark currently run in response to actions called in the driver program, we are also experimenting with letting tasks on the cluster (e.g., maps) call the lookup operation, which provides random access to elements of hash-partitioned RDDs by key. In this case, tasks would need to tell the scheduler to compute the required partition if it is missing.


Page 15: Lecture 07 -  CS-5040 - modern database systems

scheduling  

when an action is performed... (e.g., count() or save())

... the scheduler examines the lineage graph and builds a DAG of stages to execute

each stage is a maximal pipeline of transformations over narrow dependencies
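The sketch below is a toy version of that stage-building step: walk the lineage graph backwards from the RDD the action was called on, pipelining across narrow dependencies and cutting a new stage at every wide (shuffle) dependency. This is our own simplification for illustration, not Spark's scheduler code.

// Toy lineage graph: each node is an RDD with typed edges to its parents.
sealed trait DepKind
case object Narrow extends DepKind   // e.g., map, filter, union
case object Wide   extends DepKind   // e.g., groupByKey, non-co-partitioned join

case class RddNode(name: String, parents: Seq[(RddNode, DepKind)])

// Builds the stage that ends at `target`: the set of RDDs pipelined into it,
// plus the parent RDDs at which earlier stages must end (the shuffle boundaries).
def buildStage(target: RddNode): (Set[String], Set[String]) = {
  var inStage = Set.empty[String]
  var boundaries = Set.empty[String]
  def visit(n: RddNode): Unit = {
    if (!inStage(n.name)) {
      inStage += n.name
      n.parents.foreach {
        case (p, Narrow) => visit(p)             // narrow: pipeline into the same stage
        case (p, Wide)   => boundaries += p.name // wide: stage boundary (shuffle needed)
      }
    }
  }
  visit(target)
  (inStage, boundaries)
}

// Example: lines -> errors -> hdfsErrors are narrow; a final groupByKey is wide.
val lines      = RddNode("lines", Nil)
val errors     = RddNode("errors", Seq(lines -> Narrow))
val hdfsErrors = RddNode("hdfsErrors", Seq(errors -> Narrow))
val grouped    = RddNode("grouped", Seq(hdfsErrors -> Wide))
// buildStage(grouped) yields (Set("grouped"), Set("hdfsErrors")):
// the last stage contains only grouped, and an earlier stage must first compute hdfsErrors.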


Page 16: Lecture 07 -  CS-5040 - modern database systems

scheduling  

union

groupByKey

join with inputs not co-partitioned

join with inputs co-partitioned

map, filter

Narrow Dependencies: Wide Dependencies:

Figure 4: Examples of narrow and wide dependencies. Eachbox is an RDD, with partitions shown as shaded rectangles.

map to the parent’s records in its iterator method.

union: Calling union on two RDDs returns an RDD whose partitions are the union of those of the parents. Each child partition is computed through a narrow dependency on the corresponding parent.7

sample: Sampling is similar to mapping, except that the RDD stores a random number generator seed for each partition to deterministically sample parent records.

join: Joining two RDDs may lead to either two narrow dependencies (if they are both hash/range partitioned with the same partitioner), two wide dependencies, or a mix (if one parent has a partitioner and one does not). In either case, the output RDD has a partitioner (either one inherited from the parents or a default hash partitioner).

5 Implementation
We have implemented Spark in about 14,000 lines of Scala. The system runs over the Mesos cluster manager [17], allowing it to share resources with Hadoop, MPI and other applications. Each Spark program runs as a separate Mesos application, with its own driver (master) and workers, and resource sharing between these applications is handled by Mesos.

Spark can read data from any Hadoop input source (e.g., HDFS or HBase) using Hadoop's existing input plugin APIs, and runs on an unmodified version of Scala.

We now sketch several of the technically interesting parts of the system: our job scheduler (§5.1), our Spark interpreter allowing interactive use (§5.2), memory management (§5.3), and support for checkpointing (§5.4).

5.1 Job Scheduling
Spark's scheduler uses our representation of RDDs, described in Section 4.

Overall, our scheduler is similar to Dryad's [19], but it additionally takes into account which partitions of persistent RDDs are available in memory.

7Note that our union operation does not drop duplicate values.

(Figure 5 diagram: RDDs A through G, connected by groupBy, union, map, and join and grouped into Stage 1, Stage 2, and Stage 3; some partitions are shown as already in memory.)

Figure 5: Example of how Spark computes job stages. Boxes with solid outlines are RDDs. Partitions are shaded rectangles, in black if they are already in memory. To run an action on RDD G, we build stages at wide dependencies and pipeline narrow transformations inside each stage. In this case, stage 1's output RDD is already in RAM, so we run stage 2 and then 3.


rdd

partition

already in ram

Michael  Mathioudakis   16  

Page 17: Lecture 07 -  CS-5040 - modern database systems

memory  management  

when not enough memory, apply an LRU eviction policy at the rdd level

evict a partition from the least recently used rdd

Michael  Mathioudakis   17  

Page 18: Lecture 07 -  CS-5040 - modern database systems

performance  

logistic regression and k-means, amazon EC2

10 iterations on 100GB datasets, 100-node clusters

Michael  Mathioudakis   18  

Page 19: Lecture 07 -  CS-5040 - modern database systems

performance  

them simpler to checkpoint than general shared memory. Because consistency is not a concern, RDDs can be written out in the background without requiring program pauses or distributed snapshot schemes.

6 Evaluation
We evaluated Spark and RDDs through a series of experiments on Amazon EC2, as well as benchmarks of user applications. Overall, our results show the following:

• Spark outperforms Hadoop by up to 20× in iterative machine learning and graph applications. The speedup comes from avoiding I/O and deserialization costs by storing data in memory as Java objects.

• Applications written by our users perform and scale well. In particular, we used Spark to speed up an analytics report that was running on Hadoop by 40×.

• When nodes fail, Spark can recover quickly by rebuilding only the lost RDD partitions.

• Spark can be used to query a 1 TB dataset interactively with latencies of 5–7 seconds.

We start by presenting benchmarks for iterative machine learning applications (§6.1) and PageRank (§6.2) against Hadoop. We then evaluate fault recovery in Spark (§6.3) and behavior when a dataset does not fit in memory (§6.4). Finally, we discuss results for user applications (§6.5) and interactive data mining (§6.6).

Unless otherwise noted, our tests used m1.xlarge EC2 nodes with 4 cores and 15 GB of RAM. We used HDFS for storage, with 256 MB blocks. Before each test, we cleared OS buffer caches to measure IO costs accurately.

6.1 Iterative Machine Learning Applications
We implemented two iterative machine learning applications, logistic regression and k-means, to compare the performance of the following systems:

• Hadoop: The Hadoop 0.20.2 stable release.

• HadoopBinMem: A Hadoop deployment that converts the input data into a low-overhead binary format in the first iteration to eliminate text parsing in later ones, and stores it in an in-memory HDFS instance.

• Spark: Our implementation of RDDs.

We ran both algorithms for 10 iterations on 100 GB datasets using 25–100 machines. The key difference between the two applications is the amount of computation they perform per byte of data. The iteration time of k-means is dominated by computation, while logistic regression is less compute-intensive and thus more sensitive to time spent in deserialization and I/O.

Since typical learning algorithms need tens of iterations to converge, we report times for the first iteration and subsequent iterations separately. We find that sharing data via RDDs greatly speeds up future iterations.

(Figure 7 bar chart: iteration time in seconds, first iteration vs. later iterations, for Hadoop, HadoopBinMem, and Spark on logistic regression and k-means.)

Figure 7: Duration of the first and later iterations in Hadoop, HadoopBinMem and Spark for logistic regression and k-means using 100 GB of data on a 100-node cluster.

(Figure 8 charts: iteration time in seconds vs. number of machines (25, 50, 100) for Hadoop, HadoopBinMem, and Spark; panel (a) Logistic Regression, panel (b) K-Means.)

Figure 8: Running times for iterations after the first in Hadoop, HadoopBinMem, and Spark. The jobs all processed 100 GB.

First Iterations. All three systems read text input from HDFS in their first iterations. As shown in the light bars in Figure 7, Spark was moderately faster than Hadoop across experiments. This difference was due to signaling overheads in Hadoop's heartbeat protocol between its master and workers. HadoopBinMem was the slowest because it ran an extra MapReduce job to convert the data to binary, and it had to write this data across the network to a replicated in-memory HDFS instance.

Subsequent Iterations. Figure 7 also shows the average running times for subsequent iterations, while Figure 8 shows how these scaled with cluster size. For logistic regression, Spark was 25.3× and 20.7× faster than Hadoop and HadoopBinMem respectively on 100 machines. For the more compute-intensive k-means application, Spark still achieved a speedup of 1.9× to 3.2×.

Understanding the Speedup. We were surprised to find that Spark outperformed even Hadoop with in-memory storage of binary data (HadoopBinMem) by a 20× margin. In HadoopBinMem, we had used Hadoop's standard binary format (SequenceFile) and a large block size of 256 MB, and we had forced HDFS's data directory to be on an in-memory file system. However, Hadoop still ran slower due to several factors:

1. Minimum overhead of the Hadoop software stack,

2. Overhead of HDFS while serving data, and

Michael  Mathioudakis   19  

Page 20: Lecture 07 -  CS-5040 - modern database systems

performance  Example: Logistic Regression

(chart: running time in seconds vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark, logistic regression, 2015; Hadoop: about 110 s / iteration; Spark: first iteration 80 s, further iterations 1 s)

Michael  Mathioudakis   20  

Page 21: Lecture 07 -  CS-5040 - modern database systems

spark  programming  with  python  

Michael  Mathioudakis   21  

Page 22: Lecture 07 -  CS-5040 - modern database systems

spark  programming  

creating rdds

transformations & actions

lazy evaluation & persistence

passing custom functions

working with key-value pairs

data partitioning

accumulators & broadcast variables

pyspark

Michael  Mathioudakis   22  

Page 23: Lecture 07 -  CS-5040 - modern database systems

driver  program  

contains the main function of the application: defines rdds and applies operations on them

e.g., the spark shell itself is a driver program

driver programs access spark through a SparkContext object

Michael  Mathioudakis   23  

Page 24: Lecture 07 -  CS-5040 - modern database systems

example  

import pyspark
sc = pyspark.SparkContext(master="local", appName="tour")
text = sc.textFile("myfile.txt")  # load data
text.count()                      # count lines

operation / create rdd

this part is assumed from now on

if we are running on a cluster of machines, different machines might count different parts of the file

SparkContext automatically created in

spark  shell  

Michael  Mathioudakis   24  

Page 25: Lecture 07 -  CS-5040 - modern database systems

example  

text = sc.textFile("myfile.txt")  # load data
# keep only lines that mention "Spark"
spark_lines = text.filter(lambda line: 'Spark' in line)
spark_lines.count()  # count lines

operation with custom function

on a cluster, Spark ships the function to all workers

Michael  Mathioudakis   25  

Page 26: Lecture 07 -  CS-5040 - modern database systems

lambda functions in python

f = (lambda line: 'Spark' in line)
f("we are learning Spark")

def f(line):
    return 'Spark' in line

f("we are learning Spark")

Michael  Mathioudakis   26  

Page 27: Lecture 07 -  CS-5040 - modern database systems

stopping  

text = sc.textFile("myfile.txt")  # load data
# keep only lines that mention "Spark"
spark_lines = text.filter(lambda line: 'Spark' in line)
spark_lines.count()  # count lines
sc.stop()

Michael  Mathioudakis   27  

Page 28: Lecture 07 -  CS-5040 - modern database systems

rdds  

resilient  distributed  datasets    

resilient: easy to recover

distributed: different partitions materialize on different nodes

read-only (immutable)

but can be transformed to other rdds

Michael  Mathioudakis   28  

Page 29: Lecture 07 -  CS-5040 - modern database systems

creating rdds

loading an external dataset:
text = sc.textFile("myfile.txt")

distributing a collection of objects:
data = sc.parallelize([0,1,2,3,4,5,6,7,8,9])

transforming other rdds:
text_spark = text.filter(lambda line: 'Spark' in line)
data_length = data.map(lambda num: num ** 2)

Michael  Mathioudakis   29  

Page 30: Lecture 07 -  CS-5040 - modern database systems

rdd operations

transformations return a new rdd

actions extract information from an rdd or save it to disk

inputRDD = sc.textFile("logfile.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)
badlinesRDD = errorsRDD.union(warningsRDD)

print("Input had", badlinesRDD.count(), "concerning lines.")
print("Here are some of them:")
for line in badlinesRDD.take(10):
    print(line)

Michael  Mathioudakis   30  

Page 31: Lecture 07 -  CS-5040 - modern database systems

rdd operations

import numpy as np

def is_prime(num):
    if num < 1 or num % 1 != 0:
        raise Exception("invalid argument")
    if num == 1:
        return False  # 1 is not prime
    for d in range(2, int(np.sqrt(num) + 1)):
        if num % d == 0:
            return False
    return True

numbersRDD = sc.parallelize(list(range(1, 1000000)))
primesRDD = numbersRDD.filter(is_prime)
primes = primesRDD.collect()
print(primes[:100])

create  RDD  

transformation

action

operation in driver: what if primes does not fit in memory?

transformations return a new rdd

actions extract information from an rdd or save it to disk

Michael  Mathioudakis   31  

Page 32: Lecture 07 -  CS-5040 - modern database systems

rdd operations

import numpy as np

def is_prime(num):
    if num < 1 or num % 1 != 0:
        raise Exception("invalid argument")
    if num == 1:
        return False  # 1 is not prime
    for d in range(2, int(np.sqrt(num) + 1)):
        if num % d == 0:
            return False
    return True

numbersRDD = sc.parallelize(list(range(1, 1000000)))
primesRDD = numbersRDD.filter(is_prime)
primesRDD.saveAsTextFile("primes.txt")

transformations return a new rdd

actions extract information from an rdd or save it to disk

Michael  Mathioudakis   32  

Page 33: Lecture 07 -  CS-5040 - modern database systems

rdds  

evaluated lazily, ephemeral

can persist in memory (or disk) if we ask

Michael  Mathioudakis   33  

Page 34: Lecture 07 -  CS-5040 - modern database systems

lazy evaluation

numbersRDD = sc.parallelize(range(1, 1000000))
primesRDD = numbersRDD.filter(is_prime)
primesRDD.saveAsTextFile("primes.txt")

no cluster activity until here

numbersRDD = sc.parallelize(range(1, 1000000))
primesRDD = numbersRDD.filter(is_prime)
primes = primesRDD.collect()
print(primes[:100])

no cluster activity until here

Michael  Mathioudakis   34  

Page 35: Lecture 07 -  CS-5040 - modern database systems

persistence  

RDDs  can  persist  in  memory,  if  we  ask  politely  

numbersRDD = sc.parallelize(list(range(1, 1000000)))
primesRDD = numbersRDD.filter(is_prime)
primesRDD.persist()
primesRDD.count()
primesRDD.take(10)

RDD  already  in  memory  

causes  RDD  to  materialize  

Michael  Mathioudakis   35  

Page 36: Lecture 07 -  CS-5040 - modern database systems

persistence  

why?  

screenshot  from  jupyter  notebook  

Michael  Mathioudakis   36  

Page 37: Lecture 07 -  CS-5040 - modern database systems

persistence  

we  can  ask  Spark  to  maintain  rdds  on  disk  or  even  keep  replicas  on  different  nodes  

data.persist(pyspark.StorageLevel(useDisk = True, useMemory = True, replication=2))
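A roughly equivalent request, a hedged sketch using one of pyspark's predefined storage levels, which names the same memory-plus-disk, two-replica combination:

data.persist(pyspark.StorageLevel.MEMORY_AND_DISK_2)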

to  cease  persistence    

data.unpersist()

removes  rdd  from  memory  and  disk  

Michael  Mathioudakis   37  

Page 38: Lecture 07 -  CS-5040 - modern database systems

passing functions

lambda functions

function references

text = sc.textFile("myfile.txt")
text_spark = text.filter(lambda line: 'Spark' in line)

def f(line):
    return 'Spark' in line

text = sc.textFile("myfile.txt")
text_spark = text.filter(f)

Michael  Mathioudakis   38  

Page 39: Lecture 07 -  CS-5040 - modern database systems

passing functions

warning! if the function is a member of an object (self.method) or references fields of an object (e.g., self.field)... Spark serializes and sends the entire object

to the worker nodes; this can be very inefficient

Michael  Mathioudakis   39  

Page 40: Lecture 07 -  CS-5040 - modern database systems

passing functions

class SearchFunctions(object):
    def __init__(self, query):
        self.query = query
    def is_match(self, s):
        return self.query in s
    def get_matches_in_rdd_v1(self, rdd):
        return rdd.filter(self.is_match)
    def get_matches_in_rdd_v2(self, rdd):
        return rdd.filter(lambda x: self.query in x)

where  is  the  problem  in  the  code  below?  

Michael  Mathioudakis   40  

Page 41: Lecture 07 -  CS-5040 - modern database systems

passing functions

class SearchFunctions(object):
    def __init__(self, query):
        self.query = query
    def is_match(self, s):
        return self.query in s
    def get_matches_in_rdd_v1(self, rdd):
        return rdd.filter(self.is_match)
    def get_matches_in_rdd_v2(self, rdd):
        return rdd.filter(lambda x: self.query in x)

where  is  the  problem  in  the  code  below?  

reference  to  object  method  

reference  to  object  field  

Michael  Mathioudakis   41  

Page 42: Lecture 07 -  CS-5040 - modern database systems

passing functions

better implementation

class SearchFunctions(object):
    def __init__(self, query):
        self.query = query
    def is_match(self, s):
        return self.query in s
    def get_matches_in_rdd(self, rdd):
        query = self.query
        return rdd.filter(lambda x: query in x)

Michael  Mathioudakis   42  

Page 43: Lecture 07 -  CS-5040 - modern database systems

common rdd operations

element-wise transformations: map and filter

inputRDD {1,2,3,4}

mappedRDD {1,4,9,16}

filteredRDD {2,3,4}

.map(lambda x: x**2)    .filter(lambda x: x != 1)
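A minimal runnable version of the pipeline in the diagram above (the variable names mirror the figure):

inputRDD = sc.parallelize([1, 2, 3, 4])
mappedRDD = inputRDD.map(lambda x: x**2)         # {1, 4, 9, 16}
filteredRDD = inputRDD.filter(lambda x: x != 1)  # {2, 3, 4}
mappedRDD.collect(), filteredRDD.collect()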

map's return type can be different from its input's  Michael Mathioudakis 43

Page 44: Lecture 07 -  CS-5040 - modern database systems

common rdd operations

element-wise transformations: produce multiple elements per input element

flatMap

phrases = sc.parallelize(["hello world", "how are you", "how do you do"])
words = phrases.flatMap(lambda phrase: phrase.split(" "))
words.count()

9  

Michael  Mathioudakis   44  

Page 45: Lecture 07 -  CS-5040 - modern database systems

common rdd operations

phrases = sc.parallelize(["hello world", "how are you", "how do you do"])
words = phrases.flatMap(lambda phrase: phrase.split(" "))
words.collect()

phrases = sc.parallelize(["hello world", "how are you", "how do you do"])
words = phrases.map(lambda phrase: phrase.split(" "))
words.collect()

how  is  the  result  different?  

[['hello',  'world'],  ['how',  'are',  'you'],  ['how',  'do',  'you',  'do']]  

['hello',  'world',  'how',  'are',  'you',  'how',  'do',  'you',  'do']  Michael  Mathioudakis   45  

Page 46: Lecture 07 -  CS-5040 - modern database systems

common rdd operations

oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
oneRDD.persist()
otherRDD = sc.parallelize([1, 4, 4, 7])
otherRDD.persist()

(pseudo) set operations

Michael  Mathioudakis   46  

Page 47: Lecture 07 -  CS-5040 - modern database systems

common rdd operations: (pseudo) set operations

union

oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
oneRDD.persist()
otherRDD = sc.parallelize([1, 4, 4, 7])
otherRDD.persist()

oneRDD.union(otherRDD).collect()

[1,  1,  1,  2,  3,  3,  4,  4,  1,  4,  4,  7]  

Michael  Mathioudakis   47  

Page 48: Lecture 07 -  CS-5040 - modern database systems

common rdd operations

subtraction
oneRDD.subtract(otherRDD).collect()

[2, 3, 3]

(pseudo) set operations
oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
oneRDD.persist()
otherRDD = sc.parallelize([1, 4, 4, 7])
otherRDD.persist()

Michael  Mathioudakis   48  

Page 49: Lecture 07 -  CS-5040 - modern database systems

common rdd operations

duplicate removal
oneRDD.distinct().collect()

[1, 2, 3, 4]

(pseudo) set operations
oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
oneRDD.persist()
otherRDD = sc.parallelize([1, 4, 4, 7])
otherRDD.persist()

Michael  Mathioudakis   49  

Page 50: Lecture 07 -  CS-5040 - modern database systems

common rdd operations

intersection
oneRDD.intersection(otherRDD).collect()

[1, 4]

(pseudo) set operations
oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
oneRDD.persist()
otherRDD = sc.parallelize([1, 4, 4, 7])
otherRDD.persist()

removes  duplicates!  

Michael  Mathioudakis   50  

Page 51: Lecture 07 -  CS-5040 - modern database systems

common rdd operations

cartesian product
oneRDD.cartesian(otherRDD).collect()[:5]

[(1, 1), (1, 4), (1, 4), (1, 7), (1, 1)]

(pseudo) set operations
oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
oneRDD.persist()
otherRDD = sc.parallelize([1, 4, 4, 7])
otherRDD.persist()

Michael  Mathioudakis   51  

Page 52: Lecture 07 -  CS-5040 - modern database systems

common rdd operations: (pseudo) set operations

big difference in implementation (and efficiency): partition shuffling

union: no partition shuffling
subtraction, duplicate removal, intersection, cartesian product: yes

Michael  Mathioudakis   52  

Page 53: Lecture 07 -  CS-5040 - modern database systems

sortBy  

common rdd operations

how is sortBy implemented? we'll see later...

data = sc.parallelize(np.random.rand(10))
data.sortBy(lambda x: x)

Michael  Mathioudakis   53  

Page 54: Lecture 07 -  CS-5040 - modern database systems

common rdd operations: actions

reduce: successively operates on two elements of the rdd

returns  new  element  of  same  type  

data.reduce(lambda x, y: x + y)

181  

data = sc.parallelize([1,43,62,23,52])

data.reduce(lambda x, y: x * y)

3188536  

commutative & associative functions

Michael  Mathioudakis   54  

Page 55: Lecture 07 -  CS-5040 - modern database systems

commutative & associative function f(x, y)

e.g., add(x, y) = x + y

commutative: f(x, y) = f(y, x)

e.g., add(x, y) = x + y = y + x = add(y, x)

associative: f(x, f(y, z)) = f(f(x, y), z)

e.g.,  add(x,  add(y,  z))  =  x  +  (y  +  z)  =  (x  +  y)  +  z  =  add(add(x,  y),  z)    
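A small illustration (hypothetical data) of why reduce needs a commutative and associative function: subtraction is neither, so the result depends on how elements are grouped across partitions, while addition always gives the same answer.

data = sc.parallelize([1, 2, 3, 4], 2)
data.reduce(lambda x, y: x - y)   # partition-dependent: not a well-defined result
data.reduce(lambda x, y: x + y)   # 10, regardless of partitioning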

Michael  Mathioudakis   55  

Page 56: Lecture 07 -  CS-5040 - modern database systems

common rdd operations: actions

data = sc.parallelize([1,43,62,23,52])

data.reduce(lambda x, y: x**2 + y**2)

137823683725010149883130929  

compute the sum of squares of data: is the following correct?

no - why?

reduce  successively  operates  on  two  elements  of  rdd  

produces  single  aggregate  

56  

Page 57: Lecture 07 -  CS-5040 - modern database systems

common rdd operations: actions

data = sc.parallelize([1,43,62,23,52])

data.reduce(lambda x, y: np.sqrt(x**2 + y**2)) ** 2

8927.0  

yes - why?

compute the sum of squares of data: is the following correct?

reduce  successively  operates  on  two  elements  of  rdd  

produces  single  aggregate  

Michael  Mathioudakis   57  

Page 58: Lecture 07 -  CS-5040 - modern database systems

common rdd operations: actions

aggregate generalizes reduce

the user provides:
a zero value - the identity element for the aggregation
a sequential operation (function) - to update the aggregation for one more element in one partition
a combining operation (function) - to combine aggregates from different partitions

Michael  Mathioudakis   58  

Page 59: Lecture 07 -  CS-5040 - modern database systems

common rdd operations: actions

data = sc.parallelize([1,43,62,23,52])
aggr = data.aggregate(zeroValue=(0, 0),
                      seqOp=(lambda x, y: (x[0] + y, x[1] + 1)),
                      combOp=(lambda x, y: (x[0] + y[0], x[1] + y[1])))
aggr[0] / aggr[1]

what does the following compute?

zero value

sequential operation

combining operation

the  average  value  of  data  

aggregate  generalizes  reduce  

Michael  Mathioudakis   59  

Page 60: Lecture 07 -  CS-5040 - modern database systems

common rdd operations: actions

operation - what it returns
collect - all elements
take(num) - num elements; tries to minimize disk access (e.g., by accessing one partition)
takeSample - a random sample of elements
count - number of elements
countByValue - number of times each element appears; counted first on each partition, then partition results are combined
top(num) - num maximum elements; sorts partitions and merges
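A short sketch (hypothetical data; sampled output is illustrative) exercising a few of the actions listed above:

nums = sc.parallelize([1, 1, 2, 3, 3, 3])
nums.take(2)               # [1, 1], reads as few partitions as possible
nums.takeSample(False, 3)  # 3 random elements, without replacement
nums.countByValue()        # {1: 2, 2: 1, 3: 3}
nums.top(2)                # [3, 3]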

Michael  Mathioudakis   60  

Page 61: Lecture 07 -  CS-5040 - modern database systems

all operations we have described so far apply to all rdds

that's why the word "common" has been in the title

common rdd operations

Michael  Mathioudakis   61  

Page 62: Lecture 07 -  CS-5040 - modern database systems

pair  rdds  

elements are key-value pairs

pairRDD = sc.parallelize(range(100)).map(lambda x: (x, x**2))
pairRDD.collect()[:5]

[(0, 0), (1, 1), (2, 4), (3, 9), (4, 16)]

they come from the mapreduce model; practical in many cases

spark provides operations tailored to pair rdds

Michael  Mathioudakis   62  

Page 63: Lecture 07 -  CS-5040 - modern database systems

transformations on pair rdds

pairRDD = sc.parallelize(range(100)).map(lambda x: (x, x**2))

keys  and  values  

pairRDD.keys().collect()[:5]

[0,  1,  2,  3,  4]  

pairRDD.values().collect()[:5]

[0,  1,  4,  9,  16]  

Michael  Mathioudakis   63  

Page 64: Lecture 07 -  CS-5040 - modern database systems

transformations on pair rdds: reduceByKey

pairRDD = sc.parallelize([('$APPL', 100.64), ('$GOOG', 706.2), ('$AMZN', 552.32),
                          ('$APPL', 100.52), ('$AMZN', 552.32)])

volumePerKey = pairRDD.reduceByKey(lambda x, y: x + y)
volumePerKey.collect()

[('$APPL',  201.16),  ('$AMZN',  1104.64),  ('$GOOG',  706.2)]  

reduceByKey is a transformation, unlike reduce  Michael Mathioudakis 64

Page 65: Lecture 07 -  CS-5040 - modern database systems

transformations on pair rdds: combineByKey

generalizes reduceByKey

the user provides:
a createCombiner function - provides the zero value for each key
a mergeValue function - combines the current aggregate in one partition with a new value
a mergeCombiners function - combines aggregates from different partitions

Michael  Mathioudakis   65  

Page 66: Lecture 07 -  CS-5040 - modern database systems

combineByKey  generalizes  reduceByKey  

pairRDD = sc.parallelize([('$APPL', 100.64), ('$GOOG', 706.2), ('$AMZN', 552.32),
                          ('$APPL', 100.52), ('$AMZN', 552.32)])

aggr = pairRDD.combineByKey(createCombiner=lambda x: (x, 1),
                            mergeValue=lambda x, y: (x[0] + y, x[1] + 1),
                            mergeCombiners=lambda x, y: (x[0] + y[0], x[1] + y[1]))
avgPerKey = aggr.map(lambda x: (x[0], x[1][0] / x[1][1]))
avgPerKey.collect()

what does the following produce?
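For reference, the expected result, computed by hand from the sample data above (key ordering and floating-point rounding may differ): the average price per key,

avgPerKey.collect()
# roughly [('$APPL', 100.58), ('$GOOG', 706.2), ('$AMZN', 552.32)]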

transformations on pair rdds

Michael  Mathioudakis   66  

Page 67: Lecture 07 -  CS-5040 - modern database systems

transformations on pair rdds

sortByKey

samples values from the rdd to estimate sorted partition boundaries

shuffles the data; sorts by external sorting

used to implement the common sortBy

idea: create a pair RDD with (sort-key, item) elements and apply sortByKey on that
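A minimal sketch of that idea (hypothetical data), expressing sortBy as keyBy followed by sortByKey:

data = sc.parallelize([5, 3, 8, 1])
pairs = data.keyBy(lambda x: x)                      # (sort-key, item) pairs
pairs.sortByKey(ascending=True).values().collect()   # [1, 3, 5, 8]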

Michael  Mathioudakis   67  

Page 68: Lecture 07 -  CS-5040 - modern database systems

transformations on pair rdds

implemented with a variant of hash join; spark also has functions for

left outer join, right outer join, full outer join

(inner)  join  

Michael  Mathioudakis   68  

Page 69: Lecture 07 -  CS-5040 - modern database systems

transformations on pair rdds

(inner) join

course_a = sc.parallelize([("Antti", 8), ("Tuukka", 10), ("Leena", 9)])
course_b = sc.parallelize([("Leena", 10), ("Tuukka", 10)])
result = course_a.join(course_b)
result.collect()

[('Tuukka',  (10,  10)),  ('Leena',  (9,  10))]  
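A small sketch of the outer joins mentioned on the previous slide, reusing course_a and course_b (result ordering may differ):

course_a.leftOuterJoin(course_b).collect()
# [('Antti', (8, None)), ('Tuukka', (10, 10)), ('Leena', (9, 10))]
course_a.fullOuterJoin(course_b).collect()
# keeps keys that appear in either rdd, filling the missing side with None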

Michael  Mathioudakis   69  

Page 70: Lecture 07 -  CS-5040 - modern database systems

transformations on pair rdds

other transformations

groupByKey

pairRDD = sc.parallelize([('$APPL', 100.64), ('$GOOG', 706.2), ('$AMZN', 552.32),
                          ('$APPL', 100.52), ('$AMZN', 552.32)])

pairRDD.groupByKey().collect()
[('$APPL', < values >), ('$AMZN', < values >), ('$GOOG', < values >)]

for grouping together multiple rdds: cogroup and groupWith
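groupByKey returns an iterable of values per key; a minimal sketch of materializing those values as lists (key ordering may differ):

pairRDD.groupByKey().mapValues(list).collect()
# [('$APPL', [100.64, 100.52]), ('$AMZN', [552.32, 552.32]), ('$GOOG', [706.2])]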

Michael  Mathioudakis   70  

Page 71: Lecture 07 -  CS-5040 - modern database systems

actions on pair rdds

lookup(key)

pairRDD = sc.parallelize([('$APPL', 100.64), ('$GOOG', 706.2), ('$AMZN', 552.32),
                          ('$APPL', 100.52), ('$AMZN', 552.32)])

countByKey  

{'$AMZN':  2,  '$APPL':  2,  '$GOOG':  1}  

pairRDD.countByKey()

collectAsMap  

pairRDD.collectAsMap()
{'$AMZN': 552.32, '$APPL': 100.52, '$GOOG': 706.2}

pairRDD.lookup("$APPL")
[100.64, 100.52]

Michael Mathioudakis 71

Page 72: Lecture 07 -  CS-5040 - modern database systems

shared  variables  

accumulators: write-only for workers

broadcast variables: read-only for workers

Michael  Mathioudakis   72  

Page 73: Lecture 07 -  CS-5040 - modern database systems

accumulators  

text = sc.textFile("myfile.txt")
long_lines = sc.accumulator(0)

def line_len(line):
    global long_lines
    length = len(line)
    if length > 30:
        long_lines += 1
    return length

llengthRDD = text.map(line_len)
llengthRDD.count()

95  

long_lines.value

45  

lazy!  

Michael  Mathioudakis   73  

Page 74: Lecture 07 -  CS-5040 - modern database systems

accumulators  

fault  tolerance    

spark executes updates in actions only once, e.g., foreach()

foreach: special action

this is not guaranteed for transformations; in transformations, use accumulators only for debugging purposes!
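A hedged sketch of why this matters: if an accumulator is updated inside a transformation and the rdd is later recomputed (here because it is not persisted and two actions run), the updates are applied again.

acc = sc.accumulator(0)

def tag(line):
    acc.add(1)    # update inside a transformation
    return line

tagged = sc.textFile("myfile.txt").map(tag)
tagged.count()    # acc.value now counts each line once
tagged.count()    # not persisted, so the rdd is recomputed...
acc.value         # ...and the count is roughly doubled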

Michael Mathioudakis 74

Page 75: Lecture 07 -  CS-5040 - modern database systems

accumulators  +  foreach  

text = sc.textFile("myfile.txt")
long_lines = sc.accumulator(0)

def line_len(line):
    global long_lines
    length = len(line)
    if length > 30:
        long_lines += 1

text.foreach(line_len)
long_lines.value

45  

Michael  Mathioudakis   75  

Page 76: Lecture 07 -  CS-5040 - modern database systems

broadcast  variables  

 sent  to  workers  only  once  

read-only: even if you change its value on a worker,

the change does not propagate to other workers (actually, the broadcast object is written to a file and read

from  there  by  each  worker)    

release  with  unpersist()  

Michael  Mathioudakis   76  

Page 77: Lecture 07 -  CS-5040 - modern database systems

broadcast variables

def load_address_table():
    return {"Anu": "Chem. A143", "Karmen": "VTT, 74", "Michael": "OIH, B253.2",
            "Anwar": "T, B103", "Orestis": "T, A341", "Darshan": "T, A325"}

address_table = sc.broadcast(load_address_table())

def find_address(name):
    res = None
    if name in address_table.value:
        res = address_table.value[name]
    return res

data = sc.parallelize(["Anwar", "Michael", "Orestis", "Darshan"])
pairRDD = data.map(lambda name: (name, find_address(name)))
pairRDD.collectAsMap()

Michael  Mathioudakis   77  

Page 78: Lecture 07 -  CS-5040 - modern database systems

partitioning

certain operations take advantage of partitioning

e.g., reduceByKey, join

anRDD.partitionBy(numPartitions, partitionFunc)

users can set the number of partitions and the partitioning function
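A minimal sketch (hypothetical data): hash-partition a pair rdd once and persist it, so that later key-based operations such as reduceByKey can reuse the existing partitioning instead of re-shuffling.

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
partitioned = pairs.partitionBy(4)   # default partitionFunc hashes the key
partitioned.persist()
partitioned.reduceByKey(lambda x, y: x + y).collect()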

Michael  Mathioudakis   78  

Page 79: Lecture 07 -  CS-5040 - modern database systems

working on a per-partition basis

spark provides operations that operate at the partition level, e.g., mapPartitions

rdd = sc.parallelize(range(100), 4)

def f(iterator):
    yield sum(iterator)

rdd.mapPartitions(f).collect()

used in the implementation of Spark
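A related sketch using mapPartitionsWithIndex to see how elements are distributed over the 4 partitions (the printed output is illustrative):

def tag_partition(index, iterator):
    return ((index, x) for x in iterator)

rdd = sc.parallelize(range(100), 4)
rdd.mapPartitionsWithIndex(tag_partition).take(5)
# e.g. [(0, 0), (0, 1), (0, 2), (0, 3), (0, 4)]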

Michael  Mathioudakis   79  

Page 80: Lecture 07 -  CS-5040 - modern database systems

see the implementation of spark at https://github.com/apache/spark/

   

Michael  Mathioudakis   80  

Page 81: Lecture 07 -  CS-5040 - modern database systems

references

1. Zaharia, Matei, et al. "Spark: Cluster Computing with Working Sets." HotCloud 10 (2010): 10-10.
2. Zaharia, Matei, et al. "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing." Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation.
3. Learning Spark: Lightning-Fast Big Data Analysis, by Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia.
4. Spark programming guide: https://spark.apache.org/docs/latest/programming-guide.html
5. Spark implementation: https://github.com/apache/spark/
6. "Making Big Data Processing Simple with Spark," Matei Zaharia, https://youtu.be/d9D-Z3-44F8

Michael  Mathioudakis   81