Lecture 07 - CS-5040 - modern database systems


Transcript of Lecture 07 - CS-5040 - modern database systems

Page 1: Lecture 07 -  CS-5040 - modern database systems

Modern Database Systems, Lecture 7

Aristides Gionis, Michael Mathioudakis

Spring 2016

Page 2: Lecture 07 -  CS-5040 - modern database systems

logistics

assignment 1: currently marking

might upload best student solutions? will provide annotated pdfs with marks

virtualbox has the same path & filename as the previous one


Page 3: Lecture 07 -  CS-5040 - modern database systems

continuing from last lecture: the original paper on Spark


Page 4: Lecture 07 -  CS-5040 - modern database systems

previously... on modern database systems

generalization of mapreduce, more suitable for iterative workflows: iterative algorithms and repetitive querying

rdd: resilient distributed dataset

read-only, lazily evaluated, easily re-created; ephemeral, unless we need to keep them in memory


Page 5: Lecture 07 -  CS-5040 - modern database systems

example:  text  search  

suppose that a web service is experiencing errors and you want to search over terabytes of logs to find the cause

the logs are stored in the Hadoop Filesystem (HDFS); errors are written in the logs as lines that start with the keyword "ERROR"


Page 6: Lecture 07 -  CS-5040 - modern database systems

example:  text  search  

Figure 1: Lineage graph for the third query in our example. Boxes represent RDDs and arrows represent transformations. (The graph runs: lines → errors via filter(_.startsWith("ERROR")); errors → HDFS errors via filter(_.contains("HDFS")); HDFS errors → time fields via map(_.split('\t')(3)).)

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()

Line 1 defines an RDD backed by an HDFS file (as a collection of lines of text), while line 2 derives a filtered RDD from it. Line 3 then asks for errors to persist in memory so that it can be shared across queries. Note that the argument to filter is Scala syntax for a closure.

At this point, no work has been performed on the cluster. However, the user can now use the RDD in actions, e.g., to count the number of messages:

errors.count()

The user can also perform further transformations on the RDD and use their results, as in the following lines:

// Count errors mentioning MySQL:
errors.filter(_.contains("MySQL")).count()

// Return the time fields of errors mentioning
// HDFS as an array (assuming time is field
// number 3 in a tab-separated format):
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()

After the first action involving errors runs, Spark will store the partitions of errors in memory, greatly speeding up subsequent computations on it. Note that the base RDD, lines, is not loaded into RAM. This is desirable because the error messages might only be a small fraction of the data (small enough to fit into memory).
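As an aside, persist() above uses Spark's default in-memory storage. Spark's RDD API also accepts an explicit storage level, so a variant of the same pipeline that spills overflowing partitions to disk could look like the sketch below. This is only a sketch: it assumes, as in the excerpt, a SparkContext named spark, the HDFS path is a placeholder, and the import path is the one used by current Spark releases.

import org.apache.spark.storage.StorageLevel

// Same pipeline as above, but with an explicit storage level.
// MEMORY_ONLY keeps deserialized partitions in RAM (the default for persist());
// MEMORY_AND_DISK writes partitions that do not fit in RAM to disk instead of recomputing them later.
val lines  = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
errors.persist(StorageLevel.MEMORY_AND_DISK)

errors.count()  // first action: the partitions of errors are materialized and cached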

Finally, to illustrate how our model achieves fault tolerance, we show the lineage graph for the RDDs in our third query in Figure 1. In this query, we started with errors, the result of a filter on lines, and applied a further filter and map before running a collect. The Spark scheduler will pipeline the latter two transformations and send a set of tasks to compute them to the nodes holding the cached partitions of errors. In addition, if a partition of errors is lost, Spark rebuilds it by applying a filter on only the corresponding partition of lines.

Aspect                      | RDDs                                        | Distr. Shared Mem.
Reads                       | Coarse- or fine-grained                     | Fine-grained
Writes                      | Coarse-grained                              | Fine-grained
Consistency                 | Trivial (immutable)                         | Up to app / runtime
Fault recovery              | Fine-grained and low-overhead using lineage | Requires checkpoints and program rollback
Straggler mitigation        | Possible using backup tasks                 | Difficult
Work placement              | Automatic based on data locality            | Up to app (runtimes aim for transparency)
Behavior if not enough RAM  | Similar to existing data flow systems       | Poor performance (swapping?)

Table 1: Comparison of RDDs with distributed shared memory.

2.3 Advantages of the RDD Model

To understand the benefits of RDDs as a distributed memory abstraction, we compare them against distributed shared memory (DSM) in Table 1. In DSM systems, applications read and write to arbitrary locations in a global address space. Note that under this definition, we include not only traditional shared memory systems [24], but also other systems where applications make fine-grained writes to shared state, including Piccolo [27], which provides a shared DHT, and distributed databases. DSM is a very general abstraction, but this generality makes it harder to implement in an efficient and fault-tolerant manner on commodity clusters.

The main difference between RDDs and DSM is that RDDs can only be created ("written") through coarse-grained transformations, while DSM allows reads and writes to each memory location.3 This restricts RDDs to applications that perform bulk writes, but allows for more efficient fault tolerance. In particular, RDDs do not need to incur the overhead of checkpointing, as they can be recovered using lineage.4 Furthermore, only the lost partitions of an RDD need to be recomputed upon failure, and they can be recomputed in parallel on different nodes, without having to roll back the whole program.

A second benefit of RDDs is that their immutable nature lets a system mitigate slow nodes (stragglers) by running backup copies of slow tasks as in MapReduce [10]. Backup tasks would be hard to implement with DSM, as the two copies of a task would access the same memory locations and interfere with each other's updates.

Finally, RDDs provide two other benefits over DSM. First, in bulk operations on RDDs, a runtime can schedule tasks based on data locality to improve performance.

3 Note that reads on RDDs can still be fine-grained. For example, an application can treat an RDD as a large read-only lookup table.

4 In some applications, it can still help to checkpoint RDDs with long lineage chains, as we discuss in Section 5.4. However, this can be done in the background because RDDs are immutable, and there is no need to take a snapshot of the whole application as in DSM.


in Scala...

rdd (from a file)

rdd (transformation)

hint: keep in memory!

no work on the cluster so far

action! lines is not loaded to ram!


Page 7: Lecture 07 -  CS-5040 - modern database systems

example - text search ctd.

let  us  find  errors  related  to  “MySQL”  


Page 8: Lecture 07 -  CS-5040 - modern database systems

example - text search ctd.


transformation (filter), action (count)


Page 9: Lecture 07 -  CS-5040 - modern database systems

example - text search ctd. again

let us find errors related to "HDFS" and extract their time field

assuming time is field no. 3 in tab-separated format


Page 10: Lecture 07 -  CS-5040 - modern database systems

example - text search ctd. again


transformations (filter, map)

action (collect)


Page 11: Lecture 07 -  CS-5040 - modern database systems

example: text search - lineage of time fields


cached

pipelined transformations

if a partition of errors is lost, the filter is applied to only the corresponding partition of lines
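To make that recovery step concrete, here is a minimal, Spark-free sketch of partition-level recomputation via lineage. All names, types, and log lines below are made up for illustration; this is not Spark's API.

// Toy model of a partitioned dataset and a coarse-grained transformation.
case class Partition[T](index: Int, data: Seq[T])

// Parent "RDD": the partitions of lines, assumed still available (e.g., in HDFS).
val linesPartitions: Vector[Partition[String]] = Vector(
  Partition(0, Seq("INFO all good", "ERROR disk failure\t...\t...\t12:01")),
  Partition(1, Seq("ERROR HDFS timeout\t...\t...\t12:07", "INFO all good"))
)

// Lineage of errors: the same per-partition function that produced it in the first place.
def computeErrorsPartition(p: Partition[String]): Partition[String] =
  Partition(p.index, p.data.filter(_.startsWith("ERROR")))

// Suppose partition 1 of errors is lost on some node: it is rebuilt
// from partition 1 of lines only, not by re-reading the whole dataset.
val recovered = computeErrorsPartition(linesPartitions(1))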


Page 12: Lecture 07 -  CS-5040 - modern database systems

representing rdds

internal information about rdds

partitions & partitioning scheme
dependencies on parent RDDs
function to compute it from parents
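The paper captures this bookkeeping in a small common interface that every RDD implements. The trait below is a minimal sketch of that idea; its names and signatures are illustrative and do not match Spark's actual internal classes.

// Minimal sketch of the per-RDD information listed above (illustrative, not Spark's code).
trait SplitInfo { def index: Int }                        // one partition of the dataset

trait ParentLink[T] { def parent: RDDLike[T] }            // dependency on a parent RDD

trait RDDLike[T] {
  def partitions: Seq[SplitInfo]                          // how the data is split
  def dependencies: Seq[ParentLink[_]]                    // which parent RDDs this one is derived from
  def compute(split: SplitInfo): Iterator[T]              // recompute one partition from its parents
  def preferredLocations(split: SplitInfo): Seq[String] = Nil   // optional data-locality hints
  def partitioner: Option[String] = None                  // partitioning scheme, if any (e.g., "hash")
}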


Page 13: Lecture 07 -  CS-5040 - modern database systems

rdd  dependencies  

narrow dependencies: each partition of the parent rdd is used by at most one partition of the child rdd

otherwise, wide dependencies
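One way to read the definition is as a mapping from a child partition to the parent partitions it consumes. The toy sketch below is our own formulation, not Spark's class hierarchy; the narrow case is modeled after map and filter, the wide case after a shuffle such as groupByKey.

// Toy model: a dependency says which parent partitions each child partition reads.
sealed trait Dep { def parentsOf(childPartition: Int): Seq[Int] }

// Narrow: each parent partition is used by at most one child partition (e.g., map, filter).
case object OneToOneDep extends Dep {
  def parentsOf(childPartition: Int): Seq[Int] = Seq(childPartition)
}

// Wide: every child partition may read every parent partition, so a shuffle is required
// (e.g., groupByKey, or a join whose inputs are not co-partitioned).
case class ShuffleDep(numParentPartitions: Int) extends Dep {
  def parentsOf(childPartition: Int): Seq[Int] = 0 until numParentPartitions
}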


Page 14: Lecture 07 -  CS-5040 - modern database systems

rdd  dependencies  

Figure 4: Examples of narrow and wide dependencies. Each box is an RDD, with partitions shown as shaded rectangles. Narrow dependencies: map, filter; union; join with inputs co-partitioned. Wide dependencies: groupByKey; join with inputs not co-partitioned.

map: Calling map on an RDD returns an RDD that has the same partitions and preferred locations as its parent, but applies the function passed to map to the parent's records in its iterator method.

union: Calling union on two RDDs returns an RDD whose partitions are the union of those of the parents. Each child partition is computed through a narrow dependency on the corresponding parent.7

sample: Sampling is similar to mapping, except that the RDD stores a random number generator seed for each partition to deterministically sample parent records.

join: Joining two RDDs may lead to either two narrow dependencies (if they are both hash/range partitioned with the same partitioner), two wide dependencies, or a mix (if one parent has a partitioner and one does not). In either case, the output RDD has a partitioner (either one inherited from the parents or a default hash partitioner).

5 Implementation

We have implemented Spark in about 14,000 lines of Scala. The system runs over the Mesos cluster manager [17], allowing it to share resources with Hadoop, MPI and other applications. Each Spark program runs as a separate Mesos application, with its own driver (master) and workers, and resource sharing between these applications is handled by Mesos.

Spark can read data from any Hadoop input source (e.g., HDFS or HBase) using Hadoop's existing input plugin APIs, and runs on an unmodified version of Scala.

We now sketch several of the technically interesting parts of the system: our job scheduler (§5.1), our Spark interpreter allowing interactive use (§5.2), memory management (§5.3), and support for checkpointing (§5.4).

5.1 Job Scheduling

Spark's scheduler uses our representation of RDDs, described in Section 4.

Overall, our scheduler is similar to Dryad's [19], but it additionally takes into account which partitions of persistent RDDs are available in memory.

7 Note that our union operation does not drop duplicate values.

Figure 5: Example of how Spark computes job stages. Boxes with solid outlines are RDDs. Partitions are shaded rectangles, in black if they are already in memory. To run an action on RDD G, we build stages at wide dependencies and pipeline narrow transformations inside each stage. In this case, stage 1's output RDD is already in RAM, so we run stage 2 and then 3. (The example DAG connects RDDs A through G via map, union, groupBy, and join, grouped into stages 1-3.)

Whenever a user runs an action (e.g., count or save) on an RDD, the scheduler examines that RDD's lineage graph to build a DAG of stages to execute, as illustrated in Figure 5. Each stage contains as many pipelined transformations with narrow dependencies as possible. The boundaries of the stages are the shuffle operations required for wide dependencies, or any already computed partitions that can short-circuit the computation of a parent RDD. The scheduler then launches tasks to compute missing partitions from each stage until it has computed the target RDD.

Our scheduler assigns tasks to machines based on data locality using delay scheduling [32]. If a task needs to process a partition that is available in memory on a node, we send it to that node. Otherwise, if a task processes a partition for which the containing RDD provides preferred locations (e.g., an HDFS file), we send it to those.

For wide dependencies (i.e., shuffle dependencies), we currently materialize intermediate records on the nodes holding parent partitions to simplify fault recovery, much like MapReduce materializes map outputs.

If a task fails, we re-run it on another node as long as its stage's parents are still available. If some stages have become unavailable (e.g., because an output from the "map side" of a shuffle was lost), we resubmit tasks to compute the missing partitions in parallel. We do not yet tolerate scheduler failures, though replicating the RDD lineage graph would be straightforward.

Finally, although all computations in Spark currently run in response to actions called in the driver program, we are also experimenting with letting tasks on the cluster (e.g., maps) call the lookup operation, which provides random access to elements of hash-partitioned RDDs by key. In this case, tasks would need to tell the scheduler to compute the required partition if it is missing.


Page 15: Lecture 07 -  CS-5040 - modern database systems

scheduling  

when an action is performed... (e.g., count() or save())

... the scheduler examines the lineage graph and builds a DAG of stages to execute

each stage is a maximal pipeline of transformations over narrow dependencies
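The sketch below is a toy version of that stage-building step: walk the lineage graph backwards from the RDD the action was called on, pipelining across narrow dependencies and cutting a new stage at every wide (shuffle) dependency. This is our own simplification for illustration, not Spark's scheduler code.

// Toy lineage graph: each node is an RDD with typed edges to its parents.
sealed trait DepKind
case object Narrow extends DepKind   // e.g., map, filter, union
case object Wide   extends DepKind   // e.g., groupByKey, non-co-partitioned join

case class RddNode(name: String, parents: Seq[(RddNode, DepKind)])

// Builds the stage that ends at `target`: the set of RDDs pipelined into it,
// plus the parent RDDs at which earlier stages must end (the shuffle boundaries).
def buildStage(target: RddNode): (Set[String], Set[String]) = {
  var inStage = Set.empty[String]
  var boundaries = Set.empty[String]
  def visit(n: RddNode): Unit = {
    if (!inStage(n.name)) {
      inStage += n.name
      n.parents.foreach {
        case (p, Narrow) => visit(p)             // narrow: pipeline into the same stage
        case (p, Wide)   => boundaries += p.name // wide: stage boundary (shuffle needed)
      }
    }
  }
  visit(target)
  (inStage, boundaries)
}

// Example: lines -> errors -> hdfsErrors are narrow; a final groupByKey is wide.
val lines      = RddNode("lines", Nil)
val errors     = RddNode("errors", Seq(lines -> Narrow))
val hdfsErrors = RddNode("hdfsErrors", Seq(errors -> Narrow))
val grouped    = RddNode("grouped", Seq(hdfsErrors -> Wide))
// buildStage(grouped) yields (Set("grouped"), Set("hdfsErrors")):
// the last stage contains only grouped, and an earlier stage must first compute hdfsErrors.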


Page 16: Lecture 07 -  CS-5040 - modern database systems

scheduling  

union

groupByKey

join with inputs not co-partitioned

join with inputs co-partitioned

map, filter

Narrow Dependencies: Wide Dependencies:

Figure 4: Examples of narrow and wide dependencies. Eachbox is an RDD, with partitions shown as shaded rectangles.

map to the parent’s records in its iterator method.

union: Calling union on two RDDs returns an RDD whose partitions are the union of those of the parents. Each child partition is computed through a narrow dependency on the corresponding parent.7

sample: Sampling is similar to mapping, except that the RDD stores a random number generator seed for each partition to deterministically sample parent records.

join: Joining two RDDs may lead to either two narrow dependencies (if they are both hash/range partitioned with the same partitioner), two wide dependencies, or a mix (if one parent has a partitioner and one does not). In either case, the output RDD has a partitioner (either one inherited from the parents or a default hash partitioner).

5 Implementation
We have implemented Spark in about 14,000 lines of Scala. The system runs over the Mesos cluster manager [17], allowing it to share resources with Hadoop, MPI and other applications. Each Spark program runs as a separate Mesos application, with its own driver (master) and workers, and resource sharing between these applications is handled by Mesos.

Spark can read data from any Hadoop input source (e.g., HDFS or HBase) using Hadoop's existing input plugin APIs, and runs on an unmodified version of Scala.

We now sketch several of the technically interesting parts of the system: our job scheduler (§5.1), our Spark interpreter allowing interactive use (§5.2), memory management (§5.3), and support for checkpointing (§5.4).

5.1 Job Scheduling
Spark's scheduler uses our representation of RDDs, described in Section 4.

Overall, our scheduler is similar to Dryad's [19], but it additionally takes into account which partitions of persistent RDDs are available in memory.

7Note that our union operation does not drop duplicate values.

(Figure 5 diagram: RDDs A through G, connected by groupBy, union, map, and join and grouped into Stage 1, Stage 2, and Stage 3; some partitions are shown as already in memory.)

Figure 5: Example of how Spark computes job stages. Boxes with solid outlines are RDDs. Partitions are shaded rectangles, in black if they are already in memory. To run an action on RDD G, we build stages at wide dependencies and pipeline narrow transformations inside each stage. In this case, stage 1's output RDD is already in RAM, so we run stage 2 and then 3.


rdd

partition

already in ram

Michael  Mathioudakis   16  

Page 17: Lecture 07 -  CS-5040 - modern database systems

memory  management  

when not enough memory, apply an LRU eviction policy at the rdd level

evict a partition from the least recently used rdd

Michael  Mathioudakis   17  

Page 18: Lecture 07 -  CS-5040 - modern database systems

performance  

logistic regression and k-means, amazon EC2

10 iterations on 100GB datasets, 100-node clusters

Michael  Mathioudakis   18  

Page 19: Lecture 07 -  CS-5040 - modern database systems

performance  

them simpler to checkpoint than general shared memory. Because consistency is not a concern, RDDs can be written out in the background without requiring program pauses or distributed snapshot schemes.

6 Evaluation
We evaluated Spark and RDDs through a series of experiments on Amazon EC2, as well as benchmarks of user applications. Overall, our results show the following:

• Spark outperforms Hadoop by up to 20× in iterative machine learning and graph applications. The speedup comes from avoiding I/O and deserialization costs by storing data in memory as Java objects.

• Applications written by our users perform and scale well. In particular, we used Spark to speed up an analytics report that was running on Hadoop by 40×.

• When nodes fail, Spark can recover quickly by rebuilding only the lost RDD partitions.

• Spark can be used to query a 1 TB dataset interactively with latencies of 5–7 seconds.

We start by presenting benchmarks for iterative machine learning applications (§6.1) and PageRank (§6.2) against Hadoop. We then evaluate fault recovery in Spark (§6.3) and behavior when a dataset does not fit in memory (§6.4). Finally, we discuss results for user applications (§6.5) and interactive data mining (§6.6).

Unless otherwise noted, our tests used m1.xlarge EC2 nodes with 4 cores and 15 GB of RAM. We used HDFS for storage, with 256 MB blocks. Before each test, we cleared OS buffer caches to measure IO costs accurately.

6.1 Iterative Machine Learning Applications
We implemented two iterative machine learning applications, logistic regression and k-means, to compare the performance of the following systems:

• Hadoop: The Hadoop 0.20.2 stable release.

• HadoopBinMem: A Hadoop deployment that converts the input data into a low-overhead binary format in the first iteration to eliminate text parsing in later ones, and stores it in an in-memory HDFS instance.

• Spark: Our implementation of RDDs.

We ran both algorithms for 10 iterations on 100 GB datasets using 25–100 machines. The key difference between the two applications is the amount of computation they perform per byte of data. The iteration time of k-means is dominated by computation, while logistic regression is less compute-intensive and thus more sensitive to time spent in deserialization and I/O.

Since typical learning algorithms need tens of iterations to converge, we report times for the first iteration and subsequent iterations separately. We find that sharing data via RDDs greatly speeds up future iterations.

(Figure 7 bar chart: iteration time in seconds, first iteration vs. later iterations, for Hadoop, HadoopBinMem, and Spark on logistic regression and k-means.)

Figure 7: Duration of the first and later iterations in Hadoop, HadoopBinMem and Spark for logistic regression and k-means using 100 GB of data on a 100-node cluster.

(Figure 8 charts: iteration time in seconds vs. number of machines (25, 50, 100) for Hadoop, HadoopBinMem, and Spark; panel (a) Logistic Regression, panel (b) K-Means.)

Figure 8: Running times for iterations after the first in Hadoop, HadoopBinMem, and Spark. The jobs all processed 100 GB.

First Iterations. All three systems read text input from HDFS in their first iterations. As shown in the light bars in Figure 7, Spark was moderately faster than Hadoop across experiments. This difference was due to signaling overheads in Hadoop's heartbeat protocol between its master and workers. HadoopBinMem was the slowest because it ran an extra MapReduce job to convert the data to binary, and it had to write this data across the network to a replicated in-memory HDFS instance.

Subsequent Iterations. Figure 7 also shows the average running times for subsequent iterations, while Figure 8 shows how these scaled with cluster size. For logistic regression, Spark was 25.3× and 20.7× faster than Hadoop and HadoopBinMem respectively on 100 machines. For the more compute-intensive k-means application, Spark still achieved a speedup of 1.9× to 3.2×.

Understanding the Speedup. We were surprised to find that Spark outperformed even Hadoop with in-memory storage of binary data (HadoopBinMem) by a 20× margin. In HadoopBinMem, we had used Hadoop's standard binary format (SequenceFile) and a large block size of 256 MB, and we had forced HDFS's data directory to be on an in-memory file system. However, Hadoop still ran slower due to several factors:

1. Minimum overhead of the Hadoop software stack,

2. Overhead of HDFS while serving data, and

Michael  Mathioudakis   19  

Page 20: Lecture 07 -  CS-5040 - modern database systems

performance  Example: Logistic Regression

(chart: running time in seconds vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark, logistic regression, 2015; Hadoop: about 110 s / iteration; Spark: first iteration 80 s, further iterations 1 s)

Michael  Mathioudakis   20  

Page 21: Lecture 07 -  CS-5040 - modern database systems

spark  programming  with  python  

Michael  Mathioudakis   21  

Page 22: Lecture 07 -  CS-5040 - modern database systems

spark  programming  

creating rdds

transformations & actions

lazy evaluation & persistence

passing custom functions

working with key-value pairs

data partitioning

accumulators & broadcast variables

pyspark

Michael  Mathioudakis   22  

Page 23: Lecture 07 -  CS-5040 - modern database systems

driver  program  

contains the main function of the application: defines rdds and applies operations on them

e.g., the spark shell itself is a driver program

driver programs access spark through a SparkContext object

Michael  Mathioudakis   23  

Page 24: Lecture 07 -  CS-5040 - modern database systems

example  

import pyspark
sc = pyspark.SparkContext(master="local", appName="tour")
text = sc.textFile("myfile.txt")  # load data
text.count()                      # count lines

operation / create rdd

this part is assumed from now on

if we are running on a cluster of machines, different machines might count different parts of the file

SparkContext automatically created in

spark  shell  

Michael  Mathioudakis   24  

Page 25: Lecture 07 -  CS-5040 - modern database systems

example  

text = sc.textFile("myfile.txt")  # load data
# keep only lines that mention "Spark"
spark_lines = text.filter(lambda line: 'Spark' in line)
spark_lines.count()  # count lines

operation with custom function

on a cluster, Spark ships the function to all workers

Michael  Mathioudakis   25  

Page 26: Lecture 07 -  CS-5040 - modern database systems

lambda functions in python

f = (lambda line: 'Spark' in line)
f("we are learning Spark")

def f(line):
    return 'Spark' in line

f("we are learning Spark")

Michael  Mathioudakis   26  

Page 27: Lecture 07 -  CS-5040 - modern database systems

stopping  

text = sc.textFile("myfile.txt")  # load data
# keep only lines that mention "Spark"
spark_lines = text.filter(lambda line: 'Spark' in line)
spark_lines.count()  # count lines
sc.stop()

Michael  Mathioudakis   27  

Page 28: Lecture 07 -  CS-5040 - modern database systems

rdds  

resilient  distributed  datasets    

resilient: easy to recover

distributed: different partitions materialize on different nodes

read-only (immutable)

but can be transformed to other rdds

Michael  Mathioudakis   28  

Page 29: Lecture 07 -  CS-5040 - modern database systems

creating rdds

loading an external dataset:
text = sc.textFile("myfile.txt")

distributing a collection of objects:
data = sc.parallelize([0,1,2,3,4,5,6,7,8,9])

transforming other rdds:
text_spark = text.filter(lambda line: 'Spark' in line)
data_length = data.map(lambda num: num ** 2)

Michael  Mathioudakis   29  

Page 30: Lecture 07 -  CS-5040 - modern database systems

rdd operations

transformations return a new rdd

actions extract information from an rdd or save it to disk

inputRDD = sc.textFile("logfile.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)
badlinesRDD = errorsRDD.union(warningsRDD)

print("Input had", badlinesRDD.count(), "concerning lines.")
print("Here are some of them:")
for line in badlinesRDD.take(10):
    print(line)

Michael  Mathioudakis   30  

Page 31: Lecture 07 -  CS-5040 - modern database systems

rdd operations

import numpy as np

def is_prime(num):
    if num < 1 or num % 1 != 0:
        raise Exception("invalid argument")
    if num == 1:
        return False  # 1 is not prime
    for d in range(2, int(np.sqrt(num) + 1)):
        if num % d == 0:
            return False
    return True

numbersRDD = sc.parallelize(list(range(1, 1000000)))
primesRDD = numbersRDD.filter(is_prime)
primes = primesRDD.collect()
print(primes[:100])

create  RDD  

transformation

action

operation in driver: what if primes does not fit in memory?

transformations return a new rdd

actions extract information from an rdd or save it to disk

Michael  Mathioudakis   31  

Page 32: Lecture 07 -  CS-5040 - modern database systems

rdd operations

import numpy as np

def is_prime(num):
    if num < 1 or num % 1 != 0:
        raise Exception("invalid argument")
    if num == 1:
        return False  # 1 is not prime
    for d in range(2, int(np.sqrt(num) + 1)):
        if num % d == 0:
            return False
    return True

numbersRDD = sc.parallelize(list(range(1, 1000000)))
primesRDD = numbersRDD.filter(is_prime)
primesRDD.saveAsTextFile("primes.txt")

transformations return a new rdd

actions extract information from an rdd or save it to disk

Michael  Mathioudakis   32  

Page 33: Lecture 07 -  CS-5040 - modern database systems

rdds  

evaluated lazily, ephemeral

can persist in memory (or disk) if we ask

Michael  Mathioudakis   33  

Page 34: Lecture 07 -  CS-5040 - modern database systems

lazy evaluation

numbersRDD = sc.parallelize(range(1, 1000000))
primesRDD = numbersRDD.filter(is_prime)
primesRDD.saveAsTextFile("primes.txt")

no cluster activity until here

numbersRDD = sc.parallelize(range(1, 1000000))
primesRDD = numbersRDD.filter(is_prime)
primes = primesRDD.collect()
print(primes[:100])

no cluster activity until here

Michael  Mathioudakis   34  

Page 35: Lecture 07 -  CS-5040 - modern database systems

persistence  

RDDs  can  persist  in  memory,  if  we  ask  politely  

numbersRDD = sc.parallelize(list(range(1, 1000000)))
primesRDD = numbersRDD.filter(is_prime)
primesRDD.persist()
primesRDD.count()
primesRDD.take(10)

RDD  already  in  memory  

causes  RDD  to  materialize  

Michael  Mathioudakis   35  

Page 36: Lecture 07 -  CS-5040 - modern database systems

persistence  

why?  

screenshot  from  jupyter  notebook  

Michael  Mathioudakis   36  

Page 37: Lecture 07 -  CS-5040 - modern database systems

persistence  

we  can  ask  Spark  to  maintain  rdds  on  disk  or  even  keep  replicas  on  different  nodes  

data.persist(pyspark.StorageLevel(useDisk = True, useMemory = True, replication=2))
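A roughly equivalent request, a hedged sketch using one of pyspark's predefined storage levels, which names the same memory-plus-disk, two-replica combination:

data.persist(pyspark.StorageLevel.MEMORY_AND_DISK_2)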

to  cease  persistence    

data.unpersist()

removes  rdd  from  memory  and  disk  

Michael  Mathioudakis   37  

Page 38: Lecture 07 -  CS-5040 - modern database systems

passing functions

lambda functions

function references

text = sc.textFile("myfile.txt")
text_spark = text.filter(lambda line: 'Spark' in line)

def f(line):
    return 'Spark' in line

text = sc.textFile("myfile.txt")
text_spark = text.filter(f)

Michael  Mathioudakis   38  

Page 39: Lecture 07 -  CS-5040 - modern database systems

passing functions

warning! if the function is a member of an object (self.method) or references fields of an object (e.g., self.field)... Spark serializes and sends the entire object

to the worker nodes; this can be very inefficient

Michael  Mathioudakis   39  

Page 40: Lecture 07 -  CS-5040 - modern database systems

passing functions

class SearchFunctions(object):
    def __init__(self, query):
        self.query = query
    def is_match(self, s):
        return self.query in s
    def get_matches_in_rdd_v1(self, rdd):
        return rdd.filter(self.is_match)
    def get_matches_in_rdd_v2(self, rdd):
        return rdd.filter(lambda x: self.query in x)

where  is  the  problem  in  the  code  below?  

Michael  Mathioudakis   40  

Page 41: Lecture 07 -  CS-5040 - modern database systems

passing functions

class SearchFunctions(object):
    def __init__(self, query):
        self.query = query
    def is_match(self, s):
        return self.query in s
    def get_matches_in_rdd_v1(self, rdd):
        return rdd.filter(self.is_match)
    def get_matches_in_rdd_v2(self, rdd):
        return rdd.filter(lambda x: self.query in x)

where  is  the  problem  in  the  code  below?  

reference  to  object  method  

reference  to  object  field  

Michael  Mathioudakis   41  

Page 42: Lecture 07 -  CS-5040 - modern database systems

passing functions

better implementation

class SearchFunctions(object):
    def __init__(self, query):
        self.query = query
    def is_match(self, s):
        return self.query in s
    def get_matches_in_rdd(self, rdd):
        query = self.query
        return rdd.filter(lambda x: query in x)

Michael  Mathioudakis   42  

Page 43: Lecture 07 -  CS-5040 - modern database systems

common rdd operations

element-wise transformations: map and filter

inputRDD {1,2,3,4}

mappedRDD {1,4,9,16}

filteredRDD {2,3,4}

.map(lambda x: x**2)    .filter(lambda x: x != 1)
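A minimal runnable version of the pipeline in the diagram above (the variable names mirror the figure):

inputRDD = sc.parallelize([1, 2, 3, 4])
mappedRDD = inputRDD.map(lambda x: x**2)         # {1, 4, 9, 16}
filteredRDD = inputRDD.filter(lambda x: x != 1)  # {2, 3, 4}
mappedRDD.collect(), filteredRDD.collect()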

map's return type can be different from its input's  Michael Mathioudakis 43

Page 44: Lecture 07 -  CS-5040 - modern database systems

common rdd operations

element-wise transformations: produce multiple elements per input element

flatMap

phrases = sc.parallelize(["hello world", "how are you", "how do you do"])
words = phrases.flatMap(lambda phrase: phrase.split(" "))
words.count()

9  

Michael  Mathioudakis   44  

Page 45: Lecture 07 -  CS-5040 - modern database systems

common rdd operations

phrases = sc.parallelize(["hello world", "how are you", "how do you do"])
words = phrases.flatMap(lambda phrase: phrase.split(" "))
words.collect()

phrases = sc.parallelize(["hello world", "how are you", "how do you do"])
words = phrases.map(lambda phrase: phrase.split(" "))
words.collect()

how  is  the  result  different?  

[['hello',  'world'],  ['how',  'are',  'you'],  ['how',  'do',  'you',  'do']]  

['hello',  'world',  'how',  'are',  'you',  'how',  'do',  'you',  'do']  Michael  Mathioudakis   45  

Page 46: Lecture 07 -  CS-5040 - modern database systems

common rdd operations

oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
oneRDD.persist()
otherRDD = sc.parallelize([1, 4, 4, 7])
otherRDD.persist()

(pseudo) set operations

Michael  Mathioudakis   46  

Page 47: Lecture 07 -  CS-5040 - modern database systems

common rdd operations: (pseudo) set operations

union

oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
oneRDD.persist()
otherRDD = sc.parallelize([1, 4, 4, 7])
otherRDD.persist()

oneRDD.union(otherRDD).collect()

[1,  1,  1,  2,  3,  3,  4,  4,  1,  4,  4,  7]  

Michael  Mathioudakis   47  

Page 48: Lecture 07 -  CS-5040 - modern database systems

common rdd operations

subtraction
oneRDD.subtract(otherRDD).collect()

[2, 3, 3]

(pseudo) set operations
oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
oneRDD.persist()
otherRDD = sc.parallelize([1, 4, 4, 7])
otherRDD.persist()

Michael  Mathioudakis   48  

Page 49: Lecture 07 -  CS-5040 - modern database systems

common rdd operations

duplicate removal
oneRDD.distinct().collect()

[1, 2, 3, 4]

(pseudo) set operations
oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
oneRDD.persist()
otherRDD = sc.parallelize([1, 4, 4, 7])
otherRDD.persist()

Michael  Mathioudakis   49  

Page 50: Lecture 07 -  CS-5040 - modern database systems

common rdd operations

intersection
oneRDD.intersection(otherRDD).collect()

[1, 4]

(pseudo) set operations
oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
oneRDD.persist()
otherRDD = sc.parallelize([1, 4, 4, 7])
otherRDD.persist()

removes  duplicates!  

Michael  Mathioudakis   50  

Page 51: Lecture 07 -  CS-5040 - modern database systems

common rdd operations

cartesian product
oneRDD.cartesian(otherRDD).collect()[:5]

[(1, 1), (1, 4), (1, 4), (1, 7), (1, 1)]

(pseudo) set operations
oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
oneRDD.persist()
otherRDD = sc.parallelize([1, 4, 4, 7])
otherRDD.persist()

Michael  Mathioudakis   51  

Page 52: Lecture 07 -  CS-5040 - modern database systems

common rdd operations: (pseudo) set operations

big difference in implementation (and efficiency): partition shuffling

union: no partition shuffling
subtraction, duplicate removal, intersection, cartesian product: yes

Michael  Mathioudakis   52  

Page 53: Lecture 07 -  CS-5040 - modern database systems

sortBy  

common rdd operations

how is sortBy implemented? we'll see later...

data = sc.parallelize(np.random.rand(10))
data.sortBy(lambda x: x)

Michael  Mathioudakis   53  

Page 54: Lecture 07 -  CS-5040 - modern database systems

common rdd operations: actions

reduce: successively operates on two elements of the rdd

returns  new  element  of  same  type  

data.reduce(lambda x, y: x + y)

181  

data = sc.parallelize([1,43,62,23,52])

data.reduce(lambda x, y: x * y)

3188536  

commutative & associative functions

Michael  Mathioudakis   54  

Page 55: Lecture 07 -  CS-5040 - modern database systems

commutative & associative function f(x, y)

e.g., add(x, y) = x + y

commutative: f(x, y) = f(y, x)

e.g., add(x, y) = x + y = y + x = add(y, x)

associative: f(x, f(y, z)) = f(f(x, y), z)

e.g.,  add(x,  add(y,  z))  =  x  +  (y  +  z)  =  (x  +  y)  +  z  =  add(add(x,  y),  z)    
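A small illustration (hypothetical data) of why reduce needs a commutative and associative function: subtraction is neither, so the result depends on how elements are grouped across partitions, while addition always gives the same answer.

data = sc.parallelize([1, 2, 3, 4], 2)
data.reduce(lambda x, y: x - y)   # partition-dependent: not a well-defined result
data.reduce(lambda x, y: x + y)   # 10, regardless of partitioning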

Michael  Mathioudakis   55  

Page 56: Lecture 07 -  CS-5040 - modern database systems

common rdd operations: actions

data = sc.parallelize([1,43,62,23,52])

data.reduce(lambda x, y: x**2 + y**2)

137823683725010149883130929  

compute the sum of squares of data: is the following correct?

no - why?

reduce  successively  operates  on  two  elements  of  rdd  

produces  single  aggregate  

56  

Page 57: Lecture 07 -  CS-5040 - modern database systems

common rdd operations: actions

data = sc.parallelize([1,43,62,23,52])

data.reduce(lambda x, y: np.sqrt(x**2 + y**2)) ** 2

8927.0  

yes - why?

compute the sum of squares of data: is the following correct?

reduce  successively  operates  on  two  elements  of  rdd  

produces  single  aggregate  

Michael  Mathioudakis   57  

Page 58: Lecture 07 -  CS-5040 - modern database systems

common rdd operations: actions

aggregate generalizes reduce

the user provides:
a zero value - the identity element for the aggregation
a sequential operation (function) - to update the aggregation for one more element in one partition
a combining operation (function) - to combine aggregates from different partitions

Michael  Mathioudakis   58  

Page 59: Lecture 07 -  CS-5040 - modern database systems

common rdd operations: actions

data = sc.parallelize([1,43,62,23,52])
aggr = data.aggregate(zeroValue=(0, 0),
                      seqOp=(lambda x, y: (x[0] + y, x[1] + 1)),
                      combOp=(lambda x, y: (x[0] + y[0], x[1] + y[1])))
aggr[0] / aggr[1]

what does the following compute?

zero value

sequential operation

combining operation

the  average  value  of  data  

aggregate  generalizes  reduce  

Michael  Mathioudakis   59  

Page 60: Lecture 07 -  CS-5040 - modern database systems

common rdd operations: actions

operation - what it returns
collect - all elements
take(num) - num elements; tries to minimize disk access (e.g., by accessing one partition)
takeSample - a random sample of elements
count - number of elements
countByValue - number of times each element appears; counted first on each partition, then partition results are combined
top(num) - num maximum elements; sorts partitions and merges
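A short sketch (hypothetical data; sampled output is illustrative) exercising a few of the actions listed above:

nums = sc.parallelize([1, 1, 2, 3, 3, 3])
nums.take(2)               # [1, 1], reads as few partitions as possible
nums.takeSample(False, 3)  # 3 random elements, without replacement
nums.countByValue()        # {1: 2, 2: 1, 3: 3}
nums.top(2)                # [3, 3]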

Michael  Mathioudakis   60  

Page 61: Lecture 07 -  CS-5040 - modern database systems

all operations we have described so far apply to all rdds

that's why the word "common" has been in the title

common rdd operations

Michael  Mathioudakis   61  

Page 62: Lecture 07 -  CS-5040 - modern database systems

pair  rdds  

elements are key-value pairs

pairRDD = sc.parallelize(range(100)).map(lambda x: (x, x**2))
pairRDD.collect()[:5]

[(0, 0), (1, 1), (2, 4), (3, 9), (4, 16)]

they come from the mapreduce model; practical in many cases

spark provides operations tailored to pair rdds

Michael  Mathioudakis   62  

Page 63: Lecture 07 -  CS-5040 - modern database systems

transformations on pair rdds

pairRDD = sc.parallelize(range(100)).map(lambda x: (x, x**2))

keys  and  values  

pairRDD.keys().collect()[:5]

[0,  1,  2,  3,  4]  

pairRDD.values().collect()[:5]

[0,  1,  4,  9,  16]  

Michael  Mathioudakis   63  

Page 64: Lecture 07 -  CS-5040 - modern database systems

transformations on pair rdds: reduceByKey

pairRDD = sc.parallelize([('$APPL', 100.64), ('$GOOG', 706.2), ('$AMZN', 552.32),
                          ('$APPL', 100.52), ('$AMZN', 552.32)])

volumePerKey = pairRDD.reduceByKey(lambda x, y: x + y)
volumePerKey.collect()

[('$APPL',  201.16),  ('$AMZN',  1104.64),  ('$GOOG',  706.2)]  

reduceByKey is a transformation, unlike reduce  Michael Mathioudakis 64

Page 65: Lecture 07 -  CS-5040 - modern database systems

transformations on pair rdds: combineByKey

generalizes reduceByKey

the user provides:
a createCombiner function - provides the zero value for each key
a mergeValue function - combines the current aggregate in one partition with a new value
a mergeCombiners function - combines aggregates from different partitions

Michael  Mathioudakis   65  

Page 66: Lecture 07 -  CS-5040 - modern database systems

combineByKey  generalizes  reduceByKey  

pairRDD = sc.parallelize([('$APPL', 100.64), ('$GOOG', 706.2), ('$AMZN', 552.32),
                          ('$APPL', 100.52), ('$AMZN', 552.32)])

aggr = pairRDD.combineByKey(createCombiner=lambda x: (x, 1),
                            mergeValue=lambda x, y: (x[0] + y, x[1] + 1),
                            mergeCombiners=lambda x, y: (x[0] + y[0], x[1] + y[1]))
avgPerKey = aggr.map(lambda x: (x[0], x[1][0] / x[1][1]))
avgPerKey.collect()

what does the following produce?
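For reference, the expected result, computed by hand from the sample data above (key ordering and floating-point rounding may differ): the average price per key,

avgPerKey.collect()
# roughly [('$APPL', 100.58), ('$GOOG', 706.2), ('$AMZN', 552.32)]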

transformations on pair rdds

Michael  Mathioudakis   66  

Page 67: Lecture 07 -  CS-5040 - modern database systems

transformations on pair rdds

sortByKey

samples values from the rdd to estimate sorted partition boundaries

shuffles the data; sorts by external sorting

used to implement the common sortBy

idea: create a pair RDD with (sort-key, item) elements and apply sortByKey on that
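A minimal sketch of that idea (hypothetical data), expressing sortBy as keyBy followed by sortByKey:

data = sc.parallelize([5, 3, 8, 1])
pairs = data.keyBy(lambda x: x)                      # (sort-key, item) pairs
pairs.sortByKey(ascending=True).values().collect()   # [1, 3, 5, 8]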

Michael  Mathioudakis   67  

Page 68: Lecture 07 -  CS-5040 - modern database systems

transformations on pair rdds

implemented with a variant of hash join; spark also has functions for

left outer join, right outer join, full outer join

(inner)  join  

Michael  Mathioudakis   68  

Page 69: Lecture 07 -  CS-5040 - modern database systems

transformations on pair rdds

(inner) join

course_a = sc.parallelize([("Antti", 8), ("Tuukka", 10), ("Leena", 9)])
course_b = sc.parallelize([("Leena", 10), ("Tuukka", 10)])
result = course_a.join(course_b)
result.collect()

[('Tuukka',  (10,  10)),  ('Leena',  (9,  10))]  
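A small sketch of the outer joins mentioned on the previous slide, reusing course_a and course_b (result ordering may differ):

course_a.leftOuterJoin(course_b).collect()
# [('Antti', (8, None)), ('Tuukka', (10, 10)), ('Leena', (9, 10))]
course_a.fullOuterJoin(course_b).collect()
# keeps keys that appear in either rdd, filling the missing side with None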

Michael  Mathioudakis   69  

Page 70: Lecture 07 -  CS-5040 - modern database systems

transformations on pair rdds

other transformations

groupByKey

pairRDD = sc.parallelize([('$APPL', 100.64), ('$GOOG', 706.2), ('$AMZN', 552.32),
                          ('$APPL', 100.52), ('$AMZN', 552.32)])

pairRDD.groupByKey().collect()
[('$APPL', < values >), ('$AMZN', < values >), ('$GOOG', < values >)]

for grouping together multiple rdds: cogroup and groupWith
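groupByKey returns an iterable of values per key; a minimal sketch of materializing those values as lists (key ordering may differ):

pairRDD.groupByKey().mapValues(list).collect()
# [('$APPL', [100.64, 100.52]), ('$AMZN', [552.32, 552.32]), ('$GOOG', [706.2])]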

Michael  Mathioudakis   70  

Page 71: Lecture 07 -  CS-5040 - modern database systems

actions on pair rdds

lookup(key)

pairRDD = sc.parallelize([('$APPL', 100.64), ('$GOOG', 706.2), ('$AMZN', 552.32),
                          ('$APPL', 100.52), ('$AMZN', 552.32)])

countByKey  

{'$AMZN':  2,  '$APPL':  2,  '$GOOG':  1}  

pairRDD.countByKey()

collectAsMap  

pairRDD.collectAsMap()
{'$AMZN': 552.32, '$APPL': 100.52, '$GOOG': 706.2}

pairRDD.lookup("$APPL")
[100.64, 100.52]

Michael Mathioudakis 71

Page 72: Lecture 07 -  CS-5040 - modern database systems

shared  variables  

accumulators: write-only for workers

broadcast variables: read-only for workers

Michael  Mathioudakis   72  

Page 73: Lecture 07 -  CS-5040 - modern database systems

accumulators  

text = sc.textFile("myfile.txt")
long_lines = sc.accumulator(0)

def line_len(line):
    global long_lines
    length = len(line)
    if length > 30:
        long_lines += 1
    return length

llengthRDD = text.map(line_len)
llengthRDD.count()

95  

long_lines.value

45  

lazy!  

Michael  Mathioudakis   73  

Page 74: Lecture 07 -  CS-5040 - modern database systems

accumulators  

fault  tolerance    

spark executes updates in actions only once, e.g., foreach()

foreach: special action

this is not guaranteed for transformations; in transformations, use accumulators only for debugging purposes!
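A hedged sketch of why this matters: if an accumulator is updated inside a transformation and the rdd is later recomputed (here because it is not persisted and two actions run), the updates are applied again.

acc = sc.accumulator(0)

def tag(line):
    acc.add(1)    # update inside a transformation
    return line

tagged = sc.textFile("myfile.txt").map(tag)
tagged.count()    # acc.value now counts each line once
tagged.count()    # not persisted, so the rdd is recomputed...
acc.value         # ...and the count is roughly doubled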

Michael Mathioudakis 74

Page 75: Lecture 07 -  CS-5040 - modern database systems

accumulators  +  foreach  

text = sc.textFile("myfile.txt")
long_lines = sc.accumulator(0)

def line_len(line):
    global long_lines
    length = len(line)
    if length > 30:
        long_lines += 1

text.foreach(line_len)
long_lines.value

45  

Michael  Mathioudakis   75  

Page 76: Lecture 07 -  CS-5040 - modern database systems

broadcast  variables  

 sent  to  workers  only  once  

read-only: even if you change its value on a worker,

the change does not propagate to other workers (actually, the broadcast object is written to a file and read

from  there  by  each  worker)    

release  with  unpersist()  

Michael  Mathioudakis   76  

Page 77: Lecture 07 -  CS-5040 - modern database systems

broadcast variables

def load_address_table():
    return {"Anu": "Chem. A143", "Karmen": "VTT, 74", "Michael": "OIH, B253.2",
            "Anwar": "T, B103", "Orestis": "T, A341", "Darshan": "T, A325"}

address_table = sc.broadcast(load_address_table())

def find_address(name):
    res = None
    if name in address_table.value:
        res = address_table.value[name]
    return res

data = sc.parallelize(["Anwar", "Michael", "Orestis", "Darshan"])
pairRDD = data.map(lambda name: (name, find_address(name)))
pairRDD.collectAsMap()

Michael  Mathioudakis   77  

Page 78: Lecture 07 -  CS-5040 - modern database systems

partitioning

certain operations take advantage of partitioning

e.g., reduceByKey, join

anRDD.partitionBy(numPartitions, partitionFunc)

users can set the number of partitions and the partitioning function
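A minimal sketch (hypothetical data): hash-partition a pair rdd once and persist it, so that later key-based operations such as reduceByKey can reuse the existing partitioning instead of re-shuffling.

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
partitioned = pairs.partitionBy(4)   # default partitionFunc hashes the key
partitioned.persist()
partitioned.reduceByKey(lambda x, y: x + y).collect()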

Michael  Mathioudakis   78  

Page 79: Lecture 07 -  CS-5040 - modern database systems

working on a per-partition basis

spark provides operations that operate at the partition level, e.g., mapPartitions

rdd = sc.parallelize(range(100), 4)

def f(iterator):
    yield sum(iterator)

rdd.mapPartitions(f).collect()

used in the implementation of Spark
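A related sketch using mapPartitionsWithIndex to see how elements are distributed over the 4 partitions (the printed output is illustrative):

def tag_partition(index, iterator):
    return ((index, x) for x in iterator)

rdd = sc.parallelize(range(100), 4)
rdd.mapPartitionsWithIndex(tag_partition).take(5)
# e.g. [(0, 0), (0, 1), (0, 2), (0, 3), (0, 4)]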

Michael  Mathioudakis   79  

Page 80: Lecture 07 -  CS-5040 - modern database systems

see the implementation of spark at https://github.com/apache/spark/

   

Michael  Mathioudakis   80  

Page 81: Lecture 07 -  CS-5040 - modern database systems

references

1. Zaharia, Matei, et al. "Spark: Cluster Computing with Working Sets." HotCloud 10 (2010): 10-10.
2. Zaharia, Matei, et al. "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing." Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation.
3. Learning Spark: Lightning-Fast Big Data Analysis, by Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia.
4. Spark programming guide: https://spark.apache.org/docs/latest/programming-guide.html
5. Spark implementation: https://github.com/apache/spark/
6. "Making Big Data Processing Simple with Spark," Matei Zaharia, https://youtu.be/d9D-Z3-44F8

Michael  Mathioudakis   81