Lecture 06 - CS-5040 - modern database systems

Modern Database Systems, Lecture 6, Aristides Gionis, Michael Mathioudakis, Spring 2016

Transcript of Lecture 06 - CS-5040 - modern database systems

Page 1: Lecture 06  - CS-5040 - modern database systems

Modern Database Systems, Lecture 6

Aristides Gionis, Michael Mathioudakis

Spring 2016

Page 2: Lecture 06  - CS-5040 - modern database systems

logistics

• tutorial on monday, TU6 @ 2:15pm
• assignment 2 is out, due by march 14th
• for the programming part, check the updated tutorial
• a total of 5 late days are allowed

michael  mathioudakis   2  

Page 3: Lecture 06  - CS-5040 - modern database systems

today  

mapreduce & spark

as they were introduced, with emphasis on high-level concepts

michael  mathioudakis   3  

Page 4: Lecture 06  - CS-5040 - modern database systems

introduction

michael  mathioudakis   4  

Page 5: Lecture 06  - CS-5040 - modern database systems

intro  recap  

structured data, semi-structured data, text
query optimization vs flexibility of the data model

disk access a central issue; indexing

now: big data
the scale is so big that new issues take the front seat:

distributed, parallel computation
fault tolerance

how to accommodate those within a simple computational model?

michael  mathioudakis   5  

Page 6: Lecture 06  - CS-5040 - modern database systems

remember this task from lecture 0...
data: records that contain information about products viewed or purchased from an online store
task: for each pair of Games products, count the number of customers that have purchased both


Product | Category | Customer | Date | Price | Action | other...
Portal 2 | Games | Michael M. | 12/01/2015 | 10€ | Purchase
...
FLWR Plant Food | Garden | Aris G. | 19/02/2015 | 32€ | View
Chase the Rabbit | Games | Michael M. | 23/04/2015 | 1€ | View
Portal 2 | Games | Orestis K. | 13/05/2015 | 10€ | Purchase
...

what challenges does case B pose compared to case A?
hint: limited main memory, disk access, distributed setting

case A: 10,000 records (0.5MB per record, 5GB total disk space), 10GB of main memory

case B: 10,000,000 records (~5TB total disk space) stored across 100 nodes (50GB per node), 10GB of main memory per node
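As a preview of the map/reduce style introduced below, here is a minimal Python sketch of this task, assuming each record is a (product, category, customer, action) tuple; the two-stage structure, the function names, and the toy records are illustrative and not part of the lecture material.

from itertools import combinations
from collections import defaultdict

def map_purchases(record):
    # stage 1 map: emit (customer, product) for purchases of Games products
    product, category, customer, action = record
    if category == "Games" and action == "Purchase":
        yield (customer, product)

def reduce_customer(customer, products):
    # stage 1 reduce: emit (product pair, 1) for every pair this customer bought
    for pair in combinations(sorted(set(products)), 2):
        yield (pair, 1)

def reduce_pair(pair, counts):
    # stage 2 reduce: sum the 1s per product pair
    yield (pair, sum(counts))

# toy data plus a simulation of the grouping a mapreduce framework would do
records = [("Portal 2", "Games", "Michael M.", "Purchase"),
           ("Chase the Rabbit", "Games", "Michael M.", "Purchase"),
           ("Portal 2", "Games", "Orestis K.", "Purchase")]
by_customer = defaultdict(list)
for r in records:
    for customer, product in map_purchases(r):
        by_customer[customer].append(product)
by_pair = defaultdict(list)
for customer, products in by_customer.items():
    for pair, one in reduce_customer(customer, products):
        by_pair[pair].append(one)
print([next(reduce_pair(p, c)) for p, c in by_pair.items()])
# [(('Chase the Rabbit', 'Portal 2'), 1)]  -- only Michael M. bought both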

Page 7: Lecture 06  - CS-5040 - modern database systems

mapreduce  

michael  mathioudakis   7  

Page 8: Lecture 06  - CS-5040 - modern database systems

michael  mathioudakis   8  

MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat

[email protected], [email protected]

Google, Inc.

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

1 Introduction

Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical "record" in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.

The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs. Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis


appeared at the Symposium on Operating Systems Design & Implementation, 2004

Page 9: Lecture 06  - CS-5040 - modern database systems

some  context  

in the early 2000s, google was developing systems to accommodate storage and processing of big data volumes

michael  mathioudakis   9  

google file system (gfs): "a scalable distributed file system for large distributed data-intensive applications"
"provides fault tolerance while running on inexpensive commodity hardware"

bigtable: "distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers"

mapreduce: "programming model and implementation for processing and generating large data sets"

Page 10: Lecture 06  - CS-5040 - modern database systems

motivation

hundreds of special-purpose computations over raw data:
crawled webpages & documents, search & web request logs

inverted indexes, web graphs, document summaries, frequent queries

conceptually straightforward computation, however...

a lot of data, distributed over many machines
(hundreds or thousands of machines...)

a lot of practical issues arise that obscure the simplicity of the computation

michael  mathioudakis   10  

at  google  in  early  2000s...  

Page 11: Lecture 06  - CS-5040 - modern database systems

developed solution

programming model: simple,
based on the map and reduce primitives found in functional languages (e.g., Lisp)

system: hides the messy details in a library
(parallelization, fault-tolerance, data distribution, load balancing)

michael  mathioudakis   11  

mapreduce  

programming  model   system  

Page 12: Lecture 06  - CS-5040 - modern database systems

programming  model  

input: a set of (key, value) pairs

computation: two functions, map and reduce, written by the user

output: a set of (key, value) pairs

michael  mathioudakis   12  

Page 13: Lecture 06  - CS-5040 - modern database systems

map function

input: one (key, value) pair

output: a set of intermediate (key, value) pairs

mapreduce groups together pairs with the same key and passes them to the reduce function

michael  mathioudakis   13  

Page 14: Lecture 06  - CS-5040 - modern database systems

michael  mathioudakis   14  

map function

[diagram: the map function takes one (key, value) pair and emits a set of intermediate (key, value) pairs; the type of the input key/value generally differs from the type of the intermediate key/value]

Page 15: Lecture 06  - CS-5040 - modern database systems

reduce function

input: (key, list(values))
the intermediate key and the set of values for that key;
list(values) is supplied as an iterator, convenient when there is not enough memory

output: list(values)
typically only 0 or 1 values are output per invocation

michael  mathioudakis   15  

Page 16: Lecture 06  - CS-5040 - modern database systems

reduce function

michael  mathioudakis   16  

[diagram: intermediate (key, value) pairs with the same key are grouped together; each reduce invocation receives one key and the list [value1, value2, ...] of its values]

Page 17: Lecture 06  - CS-5040 - modern database systems

programming model

input: a set of (key, value) pairs

map: (key, value) → list((key, value))

reduce: (key, list(values)) → (key, list(values))

output: list((key, list(values)))

michael  mathioudakis   17  

Page 18: Lecture 06  - CS-5040 - modern database systems

example  task  

count the number of occurrences of each word in a collection of documents

input: a set of (key, value) pairs
key: document file location (id)
value: document contents (list of words)

how would you approach this?

michael  mathioudakis   18  

map: (key, value) → list((key, value))

reduce: (key, list(values)) → (key, list(values))

Page 19: Lecture 06  - CS-5040 - modern database systems

example - solution

michael  mathioudakis   19  

[diagram: documents doc1, doc2, doc3 are fed to map, which emits (word, 1) pairs; pairs with the same word are grouped and fed to reduce, which sums the 1s, e.g. word1 → [4], word2 → [3]]
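The same solution as a minimal runnable Python sketch; the grouping loop simulates the shuffle step that the mapreduce system performs for us, and the toy documents are made up.

from collections import defaultdict

def map_fn(doc_id, contents):
    # emit (word, 1) for every word in the document
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # sum all counts emitted for one word
    yield (word, sum(counts))

docs = {"doc1": "to be or not to be", "doc2": "to do is to be"}

groups = defaultdict(list)            # the "shuffle": group values by key
for doc_id, contents in docs.items():
    for word, one in map_fn(doc_id, contents):
        groups[word].append(one)

for word, counts in groups.items():
    print(next(reduce_fn(word, counts)))   # e.g. ('to', 4), ('be', 3), ...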

Page 20: Lecture 06  - CS-5040 - modern database systems

example - solution

michael  mathioudakis   20  

for a rewrite of our production indexing system. Section 7 discusses related and future work.

2 Programming Model

The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce.

Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.

The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory.

2.1 Example

Consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudo-code:

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

The map function emits each word plus an associated count of occurrences (just '1' in this simple example). The reduce function sums together all counts emitted for a particular word.

In addition, the user writes code to fill in a mapreduce specification object with the names of the input and output files, and optional tuning parameters. The user then invokes the MapReduce function, passing it the specification object. The user's code is linked together with the MapReduce library (implemented in C++). Appendix A contains the full program text for this example.

2.2 Types

Even though the previous pseudo-code is written in terms of string inputs and outputs, conceptually the map and reduce functions supplied by the user have associated types:

map (k1, v1) → list(k2, v2)
reduce (k2, list(v2)) → list(v2)

I.e., the input keys and values are drawn from a different domain than the output keys and values. Furthermore, the intermediate keys and values are from the same domain as the output keys and values.

Our C++ implementation passes strings to and from the user-defined functions and leaves it to the user code to convert between strings and appropriate types.

2.3 More Examples

Here are a few simple examples of interesting programs that can be easily expressed as MapReduce computations.

Distributed Grep: The map function emits a line if it matches a supplied pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.

Count of URL Access Frequency: The map function processes logs of web page requests and outputs ⟨URL, 1⟩. The reduce function adds together all values for the same URL and emits a ⟨URL, total count⟩ pair.

Reverse Web-Link Graph: The map function outputs ⟨target, source⟩ pairs for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: ⟨target, list(source)⟩

Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of ⟨word, frequency⟩ pairs. The map function emits a ⟨hostname, term vector⟩ pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final ⟨hostname, term vector⟩ pair.


Page 21: Lecture 06  - CS-5040 - modern database systems

programming model - types

michael  mathioudakis   21  

map: (key, value) → list((key, value))

reduce: (key, list(values)) → (key, list(values))

[diagram: input (key, value) pairs, intermediate (key, value) pairs, output (key, value) pairs; the type of the input pairs generally differs from the type of the intermediate/output pairs]
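One way to write these signatures down concretely, using Python type hints and following the typing given in Section 2.2 of the paper; the alias names are illustrative.

from typing import Callable, Iterable, Tuple, TypeVar

K1 = TypeVar("K1"); V1 = TypeVar("V1")   # types of the input key/value pairs
K2 = TypeVar("K2"); V2 = TypeVar("V2")   # types of the intermediate/output pairs

# map:    (k1, v1)       -> list((k2, v2))
MapFn = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]

# reduce: (k2, list(v2)) -> list(v2)
ReduceFn = Callable[[K2, Iterable[V2]], Iterable[V2]]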

Page 22: Lecture 06  - CS-5040 - modern database systems

more  examples  

michael  mathioudakis   22  

grep: search a set of documents for a string pattern in a line

input: a set of (key, value) pairs
key: document file location (id)
value: document contents (lines of characters)

Page 23: Lecture 06  - CS-5040 - modern database systems

more  examples  

michael  mathioudakis   23  

map: emits a line if it matches the pattern, as a (document file location, line) pair

reduce: identity function
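A possible Python sketch of the two functions; the hard-coded pattern is an assumption made for illustration.

import re

PATTERN = re.compile("error")   # assumed search pattern

def map_fn(doc_location, contents):
    # emit (document file location, line) for every line that matches the pattern
    for line in contents.splitlines():
        if PATTERN.search(line):
            yield (doc_location, line)

def reduce_fn(key, values):
    # identity: copy the supplied intermediate data to the output
    for v in values:
        yield v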

Page 24: Lecture 06  - CS-5040 - modern database systems

more  examples  

count of URL access frequency

process logs of web page requests;
logs are stored in documents, one line per request,
and each line contains the URL of the requested page

input: a set of (key, value) pairs
key: log file location
value: log contents (lines of requests)

michael  mathioudakis   24  

Page 25: Lecture 06  - CS-5040 - modern database systems

more  examples  

map: process logs of web page requests, output (URL, 1) pairs

reduce: add together the counts for the same URL

 

michael  mathioudakis   25  

Page 26: Lecture 06  - CS-5040 - modern database systems

more  examples  

reverse web-link graph: process a set of webpages;
for each link to a target webpage, find a list [source] of all webpages source that contain a link to target

input: a set of (key, value) pairs
key: webpage URL
value: webpage contents (html)

michael  mathioudakis   26  

Page 27: Lecture 06  - CS-5040 - modern database systems

more  examples  

map: output (target, source) pairs for each link to a target URL found in a page named source

reduce: concatenate the list of sources per target, output (target, list(source)) pairs
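A rough Python sketch of the two functions, using a simplistic regular expression to find links; real HTML parsing would need more care, so treat the extraction step as an assumption.

import re

HREF = re.compile(r'href="([^"]+)"')

def map_fn(source_url, html):
    # emit (target, source) for every link found in the page named source
    for target in HREF.findall(html):
        yield (target, source_url)

def reduce_fn(target, sources):
    # concatenate the list of sources per target
    yield (target, list(sources))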

 

michael  mathioudakis   27  

Page 28: Lecture 06  - CS-5040 - modern database systems

more  examples  

term vector per host: process a set of webpages;
each webpage has a URL of the form [host]/[page address],
e.g., http://www.aalto.fi/en/current/news/2016-03-02/
find a term vector per host

input: a set of (key, value) pairs
key: webpage URL
value: webpage contents (html-stripped text)

michael  mathioudakis   28  

Page 29: Lecture 06  - CS-5040 - modern database systems

more  examples  

map: emit a (hostname, term vector) pair for each webpage; the hostname is extracted from the document URL

reduce: add the per-document term vectors together and emit one (hostname, term vector) pair per hostname

michael  mathioudakis   29  

Page 30: Lecture 06  - CS-5040 - modern database systems

more  examples  

simple inverted index (no counts): process a collection of documents to construct an inverted index;
for each word, have a list of documents in which it occurs

input: a set of (key, value) pairs
key: document file location (id)
value: document contents (list of words)

michael  mathioudakis   30  

Page 31: Lecture 06  - CS-5040 - modern database systems

more  examples  

map: parse each document, emit a sequence of (word, document ID) pairs

reduce: output a (word, list(document ID)) pair for each word
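A small Python sketch of the two functions; sorting and de-duplicating the document IDs in reduce follows the paper's description of the inverted index example.

def map_fn(doc_id, contents):
    # emit (word, document ID) for every word in the document
    for word in contents.split():
        yield (word, doc_id)

def reduce_fn(word, doc_ids):
    # collect, de-duplicate, and sort the document IDs for this word
    yield (word, sorted(set(doc_ids)))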

michael  mathioudakis   31  

Page 32: Lecture 06  - CS-5040 - modern database systems

system  

at google (back in 2004): large clusters of commodity PCs, connected with ethernet

dual-processor x86, linux, 2-4GB of memory per machine
100 Mbit/s or 1 Gbit/s network
100's or 1000's of machines per cluster

storage: inexpensive IDE disks attached to the machines
google file system (GFS), which uses replication
users submit jobs to a scheduling system

michael  mathioudakis   32  

Page 33: Lecture 06  - CS-5040 - modern database systems

execution

a job is submitted, then what?
map and reduce invocations are distributed over machines

the input data is automatically partitioned into a set of M splits;
each of the M splits is fed into a map instance

intermediate results are partitioned into R partitions according to a hash function provided by the user (a sketch follows below)
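A minimal sketch of that partitioning step, using the paper's default scheme hash(key) mod R; the CRC32 hash and the value of R are illustrative choices (Python's built-in hash is randomized between runs, so a stable hash is used instead).

import zlib

R = 4   # number of reduce partitions (illustrative)

def partition(key, num_partitions=R):
    # map an intermediate key to one of the R partitions
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions

# each (key, value) pair emitted by map is buffered in the bucket of its partition
buckets = {r: [] for r in range(R)}
for key, value in [("portal", 1), ("rabbit", 1), ("portal", 1)]:
    buckets[partition(key)].append((key, value))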

michael  mathioudakis   33  

Page 34: Lecture 06  - CS-5040 - modern database systems

execution

michael  mathioudakis   34  

[Figure 1: Execution overview. The user program forks a master and workers (1); the master assigns map and reduce tasks (2); map workers read the input splits (3) and write intermediate files to their local disks (4); reduce workers read those files remotely (5) and write the output files (6)]

Inverted Index: The map function parses each document, and emits a sequence of ⟨word, document ID⟩ pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a ⟨word, list(document ID)⟩ pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.

Distributed Sort: The map function extracts the key from each record, and emits a ⟨key, record⟩ pair. The reduce function emits all pairs unchanged. This computation depends on the partitioning facilities described in Section 4.1 and the ordering properties described in Section 4.2.

3 Implementation

Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large NUMA multi-processor, and yet another for an even larger collection of networked machines. This section describes an implementation targeted to the computing environment in wide use at Google: large clusters of commodity PCs connected together with switched Ethernet [4]. In our environment:

(1) Machines are typically dual-processor x86 processors running Linux, with 2-4 GB of memory per machine.

(2) Commodity networking hardware is used – typically either 100 megabits/second or 1 gigabit/second at the machine level, but averaging considerably less in overall bisection bandwidth.

(3) A cluster consists of hundreds or thousands of machines, and therefore machine failures are common.

(4) Storage is provided by inexpensive IDE disks attached directly to individual machines. A distributed file system [8] developed in-house is used to manage the data stored on these disks. The file system uses replication to provide availability and reliability on top of unreliable hardware.

(5) Users submit jobs to a scheduling system. Each job consists of a set of tasks, and is mapped by the scheduler to a set of available machines within a cluster.

3.1 Execution Overview

The Map invocations are distributed across multiple machines by automatically partitioning the input data


Page 35: Lecture 06  - CS-5040 - modern database systems

execution

michael  mathioudakis   35  


(1) split the input files into M pieces (16-64MB each) and fork many copies of the user program

Page 36: Lecture 06  - CS-5040 - modern database systems

execution

michael  mathioudakis   36  


(2)  master  assigns  M  +  R  tasks  to  idle  workers  

Page 37: Lecture 06  - CS-5040 - modern database systems

execution

michael  mathioudakis   37  


(3) a worker assigned to a map task reads the corresponding split, passes the input data to the map function, and stores intermediate results in memory


Page 38: Lecture 06  - CS-5040 - modern database systems

execution

michael  mathioudakis   38  


(4) periodically, buffered intermediate results are written to local disk, into R partitions, according to the hash function; their locations are passed to the master


Page 39: Lecture 06  - CS-5040 - modern database systems

execution

michael  mathioudakis   39  


(5) the master notifies the reduce workers; a reduce worker collects the intermediate data for one partition from the local disks of the map workers and sorts it by intermediate key

Page 40: Lecture 06  - CS-5040 - modern database systems

execution

michael  mathioudakis   40  


(6) the reduce worker passes each intermediate key and the corresponding values to the reduce function; the output is appended to the file for this reduce partition


Page 41: Lecture 06  - CS-5040 - modern database systems

execution

michael  mathioudakis   41  


(7) after all tasks are completed, the master wakes up the user program

final  output:  R  files  

Page 42: Lecture 06  - CS-5040 - modern database systems

master  data  structures  

state for each map & reduce task: idle, in-progress, or completed
+ identity of the assigned worker

for each completed map task: locations and sizes of the R intermediate file regions,
received as map tasks are completed and pushed incrementally to reduce workers with in-progress tasks
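A possible way to picture this bookkeeping as Python data structures; the class and field names are illustrative, not taken from the paper.

from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class TaskState:
    status: str = "idle"              # "idle", "in-progress", or "completed"
    worker: Optional[str] = None      # identity of the assigned worker

@dataclass
class MasterState:
    map_tasks: Dict[int, TaskState] = field(default_factory=dict)     # one entry per split (M)
    reduce_tasks: Dict[int, TaskState] = field(default_factory=dict)  # one entry per partition (R)
    # for each completed map task: (location, size) of its R intermediate file regions
    intermediate_regions: Dict[int, List[Tuple[str, int]]] = field(default_factory=dict)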

michael  mathioudakis   42  

Page 43: Lecture 06  - CS-5040 - modern database systems

fault  tolerance  

worker failure: the master pings each worker periodically;
if there is no response, the worker has failed

completed map tasks are reset to idle (why?)
in-progress tasks are set to idle
idle tasks: up for grabs by other workers

michael  mathioudakis   43  

Page 44: Lecture 06  - CS-5040 - modern database systems

fault  tolerance  

master failure

the master writes periodic checkpoints of the master data structures (state);
a new master re-starts from the last checkpoint

michael  mathioudakis   44  

Page 45: Lecture 06  - CS-5040 - modern database systems

 “stragglers”  

tasks that take too long to complete

solution: when a mapreduce operation is close to completion, schedule backup tasks for the remaining tasks

michael  mathioudakis   45  

fault  tolerance  

Page 46: Lecture 06  - CS-5040 - modern database systems

locality  

master  tries  to  assign  tasks  to  nodes  that  contain  a  replica  of  the  input  data  

michael  mathioudakis   46  

Page 47: Lecture 06  - CS-5040 - modern database systems

task  granularity  

M map tasks and R reduce tasks;
ideally, M and R should be much larger than the number of workers

why? load-balancing & speedy recovery

michael  mathioudakis   47  

Page 48: Lecture 06  - CS-5040 - modern database systems

ordering  guarantees  

intermediate key/value pairs are processed in increasing key order;
this makes it easy to generate a sorted output file per partition (why?)

 

michael  mathioudakis   48  

Page 49: Lecture 06  - CS-5040 - modern database systems

combiner functions: an optional user-defined function,
executed on the machines that perform map tasks;
it "combines" results before they are passed to the reducer

what would the combiner be for the word-count example? (see the sketch below)

typically the combiner is the same as the reducer;
the only difference is the output:
the reducer writes to the final output, the combiner writes to an intermediate output
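For word count, the combiner can be a local sum applied to one map worker's output before it is written to disk; a sketch with illustrative names:

from collections import defaultdict

def combine(pairs):
    # pairs: the (word, 1) tuples produced by one map task
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += count
    # emit partially aggregated (word, partial count) pairs to the intermediate output
    return list(partial.items())

# combine([("to", 1), ("be", 1), ("to", 1)]) == [("to", 2), ("be", 1)]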

michael  mathioudakis   49  

Page 50: Lecture 06  - CS-5040 - modern database systems

counters  

objects updated within the map and reduce functions,
periodically propagated to the master

useful for debugging

michael  mathioudakis   50  

Page 51: Lecture 06  - CS-5040 - modern database systems

counters - example

Counter* uppercase;
uppercase = GetCounter("uppercase");

map(String name, String contents):
  for each word w in contents:
    if (IsCapitalized(w)):
      uppercase->Increment();
    EmitIntermediate(w, "1");

The counter values from individual worker machines are periodically propagated to the master (piggybacked on the ping response). The master aggregates the counter values from successful map and reduce tasks and returns them to the user code when the MapReduce operation is completed. The current counter values are also displayed on the master status page so that a human can watch the progress of the live computation. When aggregating counter values, the master eliminates the effects of duplicate executions of the same map or reduce task to avoid double counting. (Duplicate executions can arise from our use of backup tasks and from re-execution of tasks due to failures.)

Some counter values are automatically maintained by the MapReduce library, such as the number of input key/value pairs processed and the number of output key/value pairs produced.

Users have found the counter facility useful for sanity checking the behavior of MapReduce operations. For example, in some MapReduce operations, the user code may want to ensure that the number of output pairs produced exactly equals the number of input pairs processed, or that the fraction of German documents processed is within some tolerable fraction of the total number of documents processed.

5 Performance

In this section we measure the performance of MapReduce on two computations running on a large cluster of machines. One computation searches through approximately one terabyte of data looking for a particular pattern. The other computation sorts approximately one terabyte of data.

These two programs are representative of a large subset of the real programs written by users of MapReduce – one class of programs shuffles data from one representation to another, and another class extracts a small amount of interesting data from a large data set.

5.1 Cluster Configuration

All of the programs were executed on a cluster that consisted of approximately 1800 machines. Each machine had two 2GHz Intel Xeon processors with Hyper-Threading enabled, 4GB of memory, two 160GB IDE disks, and a gigabit Ethernet link. The machines were arranged in a two-level tree-shaped switched network with approximately 100-200 Gbps of aggregate bandwidth available at the root. All of the machines were in the same hosting facility and therefore the round-trip time between any pair of machines was less than a millisecond.

Out of the 4GB of memory, approximately 1-1.5GB was reserved by other tasks running on the cluster. The programs were executed on a weekend afternoon, when the CPUs, disks, and network were mostly idle.

[Figure 2: Data transfer rate over time (input rate in MB/s vs. seconds)]

5.2 Grep

The grep program scans through 10^10 100-byte records, searching for a relatively rare three-character pattern (the pattern occurs in 92,337 records). The input is split into approximately 64MB pieces (M = 15000), and the entire output is placed in one file (R = 1).

Figure 2 shows the progress of the computation over time. The Y-axis shows the rate at which the input data is scanned. The rate gradually picks up as more machines are assigned to this MapReduce computation, and peaks at over 30 GB/s when 1764 workers have been assigned. As the map tasks finish, the rate starts dropping and hits zero about 80 seconds into the computation. The entire computation takes approximately 150 seconds from start to finish. This includes about a minute of startup overhead. The overhead is due to the propagation of the program to all worker machines, and delays interacting with GFS to open the set of 1000 input files and to get the information needed for the locality optimization.

5.3 Sort

The sort program sorts 10^10 100-byte records (approximately 1 terabyte of data). This program is modeled after the TeraSort benchmark [10].

The sorting program consists of less than 50 lines of user code. A three-line Map function extracts a 10-byte sorting key from a text line and emits the key and the


michael  mathioudakis   51  

Page 52: Lecture 06  - CS-5040 - modern database systems

performance  

1800 machines; each machine had two 2GHz Xeon processors,
4GB of memory (2.5-3GB available), two 160GB disks, gigabit Ethernet

michael  mathioudakis   52  

Page 53: Lecture 06  - CS-5040 - modern database systems

performance: grep

10^10 100-byte records; search for a pattern found in <10^5 records

M = 15000, R = 1

150 seconds from start to finish

exercise: today, how big a file would you grep on one machine in 150 seconds?
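A rough back-of-envelope answer, under assumed hardware numbers: the input is 10^10 × 100 bytes ≈ 1 TB. If a single machine today scans sequentially at roughly 0.5-2 GB/s (SATA vs. NVMe SSD) and grep keeps up with the disk, 150 seconds covers on the order of 75-300 GB, so the 1 TB that the 1764-worker cluster scanned in 2004 is still several times more than one commodity machine would manage.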

michael  mathioudakis   53  

Page 54: Lecture 06  - CS-5040 - modern database systems

performance: sort

10^10 100-byte records; extract a 10-byte sorting key from each record (line)

M = 15000, R = 4000

850 seconds from start to finish

exercise: how would you implement sort?
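One way to phrase it, following the paper's Distributed Sort example: map extracts the sorting key and emits (key, record), reduce is the identity, and sortedness comes from range-partitioning the keys over the R reduce tasks combined with the ordering guarantee inside each partition. A sketch, with assumed partition boundaries:

def map_fn(offset, line):
    # extract the 10-byte sorting key and emit (key, record)
    yield (line[:10], line)

def reduce_fn(key, records):
    # identity: keys already arrive in increasing order within the partition
    for r in records:
        yield r

BOUNDARIES = ["d", "m", "t"]   # illustrative split points giving R = 4 partitions

def partition(key, boundaries=BOUNDARIES):
    # range partitioning (instead of hashing), so that all keys in partition i
    # are smaller than all keys in partition i+1; concatenating the R output
    # files then yields a globally sorted result
    for i, b in enumerate(boundaries):
        if key < b:
            return i
    return len(boundaries)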

michael  mathioudakis   54  

Page 55: Lecture 06  - CS-5040 - modern database systems

summary    

original mapreduce paper

simple programming model based on functional-language primitives

the system takes care of scheduling and fault-tolerance

great impact on cluster computing

michael  mathioudakis   55  

Page 56: Lecture 06  - CS-5040 - modern database systems

hadoop  

michael  mathioudakis   56  

Page 57: Lecture 06  - CS-5040 - modern database systems

map  reduce  and  hadoop  

michael  mathioudakis   57  

mapreduce is implemented in apache hadoop

a software ecosystem for distributed data storage and processing

open source

Page 58: Lecture 06  - CS-5040 - modern database systems

hadoop  

michael  mathioudakis   58  

common
hdfs: hadoop distributed filesystem
mapreduce
yarn: scheduling & resource management

Page 59: Lecture 06  - CS-5040 - modern database systems

hadoop  

michael  mathioudakis   59  

common
hdfs: hadoop distributed filesystem
mapreduce
yarn: scheduling & resource management

mahout: machine learning library
hive: data warehouse, sql-like querying
pig: data-flow language and system for parallel computation
spark: cluster-computing engine
... and a lot of other projects!!

Page 60: Lecture 06  - CS-5040 - modern database systems

spark  

michael  mathioudakis   60  

Page 61: Lecture 06  - CS-5040 - modern database systems

michael  mathioudakis   61  

Spark: Cluster Computing with Working Sets

Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica
University of California, Berkeley

AbstractMapReduce and its variants have been highly successfulin implementing large-scale data-intensive applicationson commodity clusters. However, most of these systemsare built around an acyclic data flow model that is notsuitable for other popular applications. This paper fo-cuses on one such class of applications: those that reusea working set of data across multiple parallel operations.This includes many iterative machine learning algorithms,as well as interactive data analysis tools. We propose anew framework called Spark that supports these applica-tions while retaining the scalability and fault tolerance ofMapReduce. To achieve these goals, Spark introduces anabstraction called resilient distributed datasets (RDDs).An RDD is a read-only collection of objects partitionedacross a set of machines that can be rebuilt if a partitionis lost. Spark can outperform Hadoop by 10x in iterativemachine learning jobs, and can be used to interactivelyquery a 39 GB dataset with sub-second response time.

1 IntroductionA new model of cluster computing has become widelypopular, in which data-parallel computations are executedon clusters of unreliable machines by systems that auto-matically provide locality-aware scheduling, fault toler-ance, and load balancing. MapReduce [11] pioneered thismodel, while systems like Dryad [17] and Map-Reduce-Merge [24] generalized the types of data flows supported.These systems achieve their scalability and fault toleranceby providing a programming model where the user createsacyclic data flow graphs to pass input data through a set ofoperators. This allows the underlying system to managescheduling and to react to faults without user intervention.

While this data flow programming model is useful for alarge class of applications, there are applications that can-not be expressed efficiently as acyclic data flows. In thispaper, we focus on one such class of applications: thosethat reuse a working set of data across multiple paralleloperations. This includes two use cases where we haveseen Hadoop users report that MapReduce is deficient:• Iterative jobs: Many common machine learning algo-

rithms apply a function repeatedly to the same datasetto optimize a parameter (e.g., through gradient de-scent). While each iteration can be expressed as a

MapReduce/Dryad job, each job must reload the datafrom disk, incurring a significant performance penalty.

• Interactive analytics: Hadoop is often used to runad-hoc exploratory queries on large datasets, throughSQL interfaces such as Pig [21] and Hive [1]. Ideally,a user would be able to load a dataset of interest intomemory across a number of machines and query it re-peatedly. However, with Hadoop, each query incurssignificant latency (tens of seconds) because it runs asa separate MapReduce job and reads data from disk.

This paper presents a new cluster computing frame-work called Spark, which supports applications withworking sets while providing similar scalability and faulttolerance properties to MapReduce.

The main abstraction in Spark is that of a resilient dis-tributed dataset (RDD), which represents a read-only col-lection of objects partitioned across a set of machines thatcan be rebuilt if a partition is lost. Users can explicitlycache an RDD in memory across machines and reuse itin multiple MapReduce-like parallel operations. RDDsachieve fault tolerance through a notion of lineage: if apartition of an RDD is lost, the RDD has enough infor-mation about how it was derived from other RDDs to beable to rebuild just that partition. Although RDDs arenot a general shared memory abstraction, they representa sweet-spot between expressivity on the one hand andscalability and reliability on the other hand, and we havefound them well-suited for a variety of applications.

Spark is implemented in Scala [5], a statically typedhigh-level programming language for the Java VM, andexposes a functional programming interface similar toDryadLINQ [25]. In addition, Spark can be used inter-actively from a modified version of the Scala interpreter,which allows the user to define RDDs, functions, vari-ables and classes and use them in parallel operations on acluster. We believe that Spark is the first system to allowan efficient, general-purpose programming language to beused interactively to process large datasets on a cluster.

Although our implementation of Spark is still a prototype, early experience with the system is encouraging. We show that Spark can outperform Hadoop by 10x in iterative machine learning workloads and can be used interactively to scan a 39 GB dataset with sub-second latency.

This paper is organized as follows. Section 2 describes


appeared  at  HotCloud,  2010  

Page 62: Lecture 06  - CS-5040 - modern database systems

michael  mathioudakis   62  

appeared at the USENIX conference on networked systems design and implementation (NSDI), 2012

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica

University of California, Berkeley

Abstract

We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.

1 Introduction

Cluster computing frameworks like MapReduce [10] and Dryad [19] have been widely adopted for large-scale data analytics. These systems let users write parallel computations using a set of high-level operators, without having to worry about work distribution and fault tolerance.

Although current frameworks provide numerous abstractions for accessing a cluster's computational resources, they lack abstractions for leveraging distributed memory. This makes them inefficient for an important class of emerging applications: those that reuse intermediate results across multiple computations. Data reuse is common in many iterative machine learning and graph algorithms, including PageRank, K-means clustering, and logistic regression. Another compelling use case is interactive data mining, where a user runs multiple ad-hoc queries on the same subset of the data. Unfortunately, in most current frameworks, the only way to reuse data between computations (e.g., between two MapReduce jobs) is to write it to an external stable storage system, e.g., a distributed file system. This incurs substantial overheads due to data replication, disk I/O, and serialization, which can dominate application execution times.

Recognizing this problem, researchers have developed specialized frameworks for some applications that require data reuse. For example, Pregel [22] is a system for iterative graph computations that keeps intermediate data in memory, while HaLoop [7] offers an iterative MapReduce interface. However, these frameworks only support specific computation patterns (e.g., looping a series of MapReduce steps), and perform data sharing implicitly for these patterns. They do not provide abstractions for more general reuse, e.g., to let a user load several datasets into memory and run ad-hoc queries across them.

In this paper, we propose a new abstraction called resilient distributed datasets (RDDs) that enables efficient data reuse in a broad range of applications. RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.

The main challenge in designing RDDs is defining a programming interface that can provide fault tolerance efficiently. Existing abstractions for in-memory storage on clusters, such as distributed shared memory [24], key-value stores [25], databases, and Piccolo [27], offer an interface based on fine-grained updates to mutable state (e.g., cells in a table). With this interface, the only ways to provide fault tolerance are to replicate the data across machines or to log updates across machines. Both approaches are expensive for data-intensive workloads, as they require copying large amounts of data over the cluster network, whose bandwidth is far lower than that of RAM, and they incur substantial storage overhead.

In contrast to these systems, RDDs provide an interface based on coarse-grained transformations (e.g., map, filter and join) that apply the same operation to many data items. This allows them to efficiently provide fault tolerance by logging the transformations used to build a dataset (its lineage) rather than the actual data.1 If a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to recompute

1 Checkpointing the data in some RDDs may be useful when a lineage chain grows large, however, and we discuss how to do it in §5.4.

Page 63: Lecture 06  - CS-5040 - modern database systems

why  not  mapreduce?  

mapreduce  flows  are  acyclic    

not efficient for some applications

michael  mathioudakis   63  

Page 64: Lecture 06  - CS-5040 - modern database systems

why  not  mapreduce?  

iterative jobs
many common machine learning algorithms repeatedly apply the same function on the same dataset (e.g., gradient descent)

mapreduce repeatedly reloads (reads & writes) the data

michael  mathioudakis   64  
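To make the cost of reloading concrete, here is a minimal sketch of an iterative job written against Spark's Scala API (Spark and RDDs are introduced on the following slides). Everything specific in it is made up for illustration: the input path, the Point class, and a toy one-dimensional logistic regression. The point is that the parsed data is cached once and then scanned in memory on every iteration, whereas a chain of mapreduce jobs would re-read and re-parse the file from disk each time.

import scala.math.exp
import org.apache.spark.SparkContext

object IterativeSketch {
  // hypothetical data point: one feature x and a label y in {-1, +1}
  case class Point(x: Double, y: Double)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "iterative-sketch")

    // parse once, keep the parsed points in memory across iterations
    val points = sc.textFile("hdfs://.../points.txt")   // hypothetical path
      .map { line =>
        val f = line.split(' ')
        Point(f(0).toDouble, f(1).toDouble)
      }
      .cache()

    var w = 0.0                                          // model parameter, lives on the driver
    for (i <- 1 to 10) {                                 // gradient-descent iterations
      val gradient = points
        .map(p => (1.0 / (1.0 + exp(-p.y * w * p.x)) - 1.0) * p.y * p.x)
        .reduce(_ + _)
      w -= gradient                                      // the update happens on the driver
    }
    println(s"final w = $w")
    sc.stop()
  }
}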

Page 65: Lecture 06  - CS-5040 - modern database systems

why  not  mapreduce?  

interactive analytics
load data in memory and query it repeatedly

mapreduce would re-read the data for each query

michael  mathioudakis   65  

Page 66: Lecture 06  - CS-5040 - modern database systems

spark’s  proposal  

generalize the mapreduce model to accommodate such applications

allow us to treat data as available in memory across repeated queries and updates

resilient distributed datasets (rdds)

michael  mathioudakis   66  

Page 67: Lecture 06  - CS-5040 - modern database systems

resilient  distributed  datasets  (rdd)  

a read-only collection of objects, partitioned across machines

users can explicitly cache rdds in memory and re-use them across mapreduce-like parallel operations

michael  mathioudakis   67  

Page 68: Lecture 06  - CS-5040 - modern database systems

main  challenge  

efficient fault-tolerance

to treat data as available in memory, it should be easy to re-build part of the data (e.g., a partition) if it is lost

achieved through coarse-grained transformations and lineage

michael  mathioudakis   68  

Page 69: Lecture 06  - CS-5040 - modern database systems

fault-tolerance

coarse-grained transformations
e.g., map operations applied to many (even all) data items

lineage
the series of transformations that led to a dataset

if a partition is lost, there is enough information to re-apply the transformations and re-compute it

 

michael  mathioudakis   69  
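As a rough code-level picture of lineage (a sketch, assuming a SparkContext named sc and a hypothetical log path; the full text-search example follows on the next slides): each transformation below merely records one coarse-grained step, and nothing is computed until the action at the end.

val lines  = sc.textFile("hdfs://.../logs")         // base rdd, backed by a file
val errors = lines.filter(_.startsWith("ERROR"))    // lineage: filter over lines
val fields = errors.map(_.split('\t')(3))           // lineage: map over errors

// If a partition of fields is later lost, Spark re-applies the recorded
// filter and map to the corresponding partition of lines only; the data
// itself is never replicated or checkpointed for this purpose.
fields.count()                                      // action: triggers the computation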

Page 70: Lecture 06  - CS-5040 - modern database systems

programming  model  

developers write a driver program
that implements the high-level control flow

think of rdds as 'variables' that represent datasets, on which you apply parallel operations

the driver can also use restricted types of shared variables

michael  mathioudakis   70  
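A minimal driver program might look like the following sketch; the master URL, application name and data are made up. The driver owns the control flow, an RDD acts as a 'variable' standing for a dataset, and parallel operations are invoked on it from the driver.

import org.apache.spark.SparkContext

object DriverSketch {
  def main(args: Array[String]): Unit = {
    // the driver program: high-level control flow lives here
    val sc = new SparkContext("local[*]", "driver-sketch")

    // an rdd as a 'variable' representing a dataset
    val numbers = sc.parallelize(1 to 1000000)

    // parallel operations applied to the rdd
    val evens = numbers.filter(_ % 2 == 0)    // transformation (lazy)
    val total = evens.count()                 // action (runs on the cluster)
    println(s"even numbers: $total")

    sc.stop()
  }
}

The sketches on the following slides assume a SparkContext sc like this one already exists.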

Page 71: Lecture 06  - CS-5040 - modern database systems

spark runtime

[Figure 2: Spark runtime. The user's driver program launches multiple workers, which read data blocks from a distributed file system and can persist computed RDD partitions in memory. In the figure, the Driver sends tasks to the Workers and collects their results; each Worker reads its share of the input data and holds RDD partitions in RAM.]

ule tasks based on data locality to improve performance.Second, RDDs degrade gracefully when there is notenough memory to store them, as long as they are onlybeing used in scan-based operations. Partitions that donot fit in RAM can be stored on disk and will providesimilar performance to current data-parallel systems.

2.4 Applications Not Suitable for RDDs

As discussed in the Introduction, RDDs are best suitedfor batch applications that apply the same operation toall elements of a dataset. In these cases, RDDs can ef-ficiently remember each transformation as one step in alineage graph and can recover lost partitions without hav-ing to log large amounts of data. RDDs would be lesssuitable for applications that make asynchronous fine-grained updates to shared state, such as a storage sys-tem for a web application or an incremental web crawler.For these applications, it is more efficient to use systemsthat perform traditional update logging and data check-pointing, such as databases, RAMCloud [25], Percolator[26] and Piccolo [27]. Our goal is to provide an efficientprogramming model for batch analytics and leave theseasynchronous applications to specialized systems.

3 Spark Programming InterfaceSpark provides the RDD abstraction through a language-integrated API similar to DryadLINQ [31] in Scala [2],a statically typed functional programming language forthe Java VM. We chose Scala due to its combination ofconciseness (which is convenient for interactive use) andefficiency (due to static typing). However, nothing aboutthe RDD abstraction requires a functional language.

To use Spark, developers write a driver program thatconnects to a cluster of workers, as shown in Figure 2.The driver defines one or more RDDs and invokes ac-tions on them. Spark code on the driver also tracks theRDDs’ lineage. The workers are long-lived processesthat can store RDD partitions in RAM across operations.

As we showed in the log mining example in Sec-tion 2.2.1, users provide arguments to RDD opera-

tions like map by passing closures (function literals).Scala represents each closure as a Java object, andthese objects can be serialized and loaded on anothernode to pass the closure across the network. Scala alsosaves any variables bound in the closure as fields inthe Java object. For example, one can write code likevar x = 5; rdd.map(_ + x) to add 5 to each elementof an RDD.5

RDDs themselves are statically typed objectsparametrized by an element type. For example,RDD[Int] is an RDD of integers. However, most of ourexamples omit types since Scala supports type inference.

Although our method of exposing RDDs in Scala isconceptually simple, we had to work around issues withScala’s closure objects using reflection [33]. We alsoneeded more work to make Spark usable from the Scalainterpreter, as we shall discuss in Section 5.2. Nonethe-less, we did not have to modify the Scala compiler.

3.1 RDD Operations in Spark

Table 2 lists the main RDD transformations and actionsavailable in Spark. We give the signature of each oper-ation, showing type parameters in square brackets. Re-call that transformations are lazy operations that define anew RDD, while actions launch a computation to returna value to the program or write data to external storage.

Note that some operations, such as join, are only avail-able on RDDs of key-value pairs. Also, our functionnames are chosen to match other APIs in Scala and otherfunctional languages; for example, map is a one-to-onemapping, while flatMap maps each input value to one ormore outputs (similar to the map in MapReduce).

In addition to these operators, users can ask for anRDD to persist. Furthermore, users can get an RDD’spartition order, which is represented by a Partitionerclass, and partition another dataset according to it. Op-erations such as groupByKey, reduceByKey and sort au-tomatically result in a hash or range partitioned RDD.

3.2 Example Applications

We complement the data mining example in Section2.2.1 with two iterative applications: logistic regressionand PageRank. The latter also showcases how control ofRDDs’ partitioning can improve performance.

3.2.1 Logistic Regression

Many machine learning algorithms are iterative in naturebecause they run iterative optimization procedures, suchas gradient descent, to maximize a function. They canthus run much faster by keeping their data in memory.

As an example, the following program implements lo-gistic regression [14], a common classification algorithm

5We save each closure at the time it is created, so that the map inthis example will always add 5 even if x changes.

michael  mathioudakis   71  

Page 72: Lecture 06  - CS-5040 - modern database systems

rdd
a read-only collection of objects partitioned across a set of machines, that can be re-built if a partition is lost

constructed in the following ways:
• from a file in a shared file system (e.g., hdfs)
• parallelizing a collection (e.g., an array): divide it into partitions and send them to multiple nodes
• transforming an existing rdd, e.g., by applying a map operation
• changing the persistence of an existing rdd: a hint to cache the rdd or save it to the filesystem

michael  mathioudakis   72  
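The four construction routes, as a sketch in Spark's Scala API (paths are hypothetical, and the SparkContext sc from the earlier driver sketch is assumed):

// 1. from a file in a shared file system
val fromFile = sc.textFile("hdfs://.../data.txt")

// 2. by parallelizing a driver-side collection: Spark divides it into
//    partitions and distributes them to the workers
val fromArray = sc.parallelize(Array(1, 2, 3, 4, 5), numSlices = 4)

// 3. by transforming an existing rdd
val doubled = fromArray.map(_ * 2)

// 4. by changing the persistence of an existing rdd
doubled.persist()                                // hint: keep it in memory
doubled.saveAsTextFile("hdfs://.../doubled")     // or save it to the filesystem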

Page 73: Lecture 06  - CS-5040 - modern database systems

rdd  

an rdd need not exist physically at all times; instead, there is enough information to compute the rdd

rdds are lazily-created and ephemeral

lazy: materialized only when information is extracted from them (through actions!)
ephemeral: discarded after use

michael  mathioudakis   73  

Page 74: Lecture 06  - CS-5040 - modern database systems

transformations and actions

transformations
lazy operations that define a new rdd

actions
launch computation on an rdd, to return a value to the program or write data to external storage

michael  mathioudakis   74  
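A small sketch of the difference (hypothetical paths, SparkContext sc assumed): the transformations below return immediately and only describe new rdds; each action then launches an actual computation.

// transformations: lazy, they only define new rdds
val words    = sc.textFile("hdfs://.../corpus.txt")
val lengths  = words.map(_.length)
val longOnes = lengths.filter(_ > 10)

// actions: they launch the computation
val n = longOnes.count()                          // returns a value to the program
longOnes.saveAsTextFile("hdfs://.../long-words")  // writes data to external storage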

Page 75: Lecture 06  - CS-5040 - modern database systems

shared variables

broadcast variables
read-only variables, sent to all workers

typical use-case
a large read-only piece of data (e.g., a lookup table) that is used across multiple parallel operations

michael  mathioudakis   75  
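A sketch of the typical use-case (the lookup table and rdd are made up; SparkContext sc assumed): the table is broadcast once to every worker instead of being shipped with each task.

// a read-only lookup table, sent to all workers once
val countryNames = Map("FI" -> "Finland", "GR" -> "Greece", "US" -> "United States")
val lookup = sc.broadcast(countryNames)

// each task reads lookup.value locally on its worker
val codes = sc.parallelize(Seq("FI", "GR", "FI", "US"))
val named = codes.map(code => lookup.value.getOrElse(code, "unknown"))
println(named.collect().mkString(", "))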

Page 76: Lecture 06  - CS-5040 - modern database systems

shared variables

accumulators
write-only variables that workers can update, using an operation that is commutative and associative; only the driver can read their value

typical use-case
counters

michael  mathioudakis   76  
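A sketch of a counter (hypothetical input path; SparkContext sc assumed; sc.accumulator is the accumulator API of the Spark versions contemporary with these papers): workers can only add to it, and only the driver reads the result.

// workers update the counter with +=, an operation that is commutative
// and associative, so partial counts can be merged in any order
val blankLines = sc.accumulator(0)

val input = sc.textFile("hdfs://.../input.txt")
input.foreach { line =>
  if (line.trim.isEmpty) blankLines += 1        // updated on the workers
}

println(s"blank lines: ${blankLines.value}")    // only the driver reads the value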

Page 77: Lecture 06  - CS-5040 - modern database systems

example:  text  search  

suppose that a web service is experiencing errors and you want to search over terabytes of logs to find the cause
the logs are stored in the Hadoop Filesystem (HDFS)
errors are written in the logs as lines that start with the keyword “ERROR”

michael  mathioudakis   77  

Page 78: Lecture 06  - CS-5040 - modern database systems

example:  text  search  

michael  mathioudakis   78  

[Figure 1: Lineage graph for the third query in our example. Boxes represent RDDs and arrows represent transformations:
lines --filter(_.startsWith("ERROR"))--> errors --filter(_.contains("HDFS"))--> HDFS errors --map(_.split('\t')(3))--> time fields]

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()

Line 1 defines an RDD backed by an HDFS file (as a collection of lines of text), while line 2 derives a filtered RDD from it. Line 3 then asks for errors to persist in memory so that it can be shared across queries. Note that the argument to filter is Scala syntax for a closure.

At this point, no work has been performed on the cluster. However, the user can now use the RDD in actions, e.g., to count the number of messages:

errors.count()

The user can also perform further transformations on the RDD and use their results, as in the following lines:

// Count errors mentioning MySQL:
errors.filter(_.contains("MySQL")).count()

// Return the time fields of errors mentioning
// HDFS as an array (assuming time is field
// number 3 in a tab-separated format):
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()

After the first action involving errors runs, Spark will store the partitions of errors in memory, greatly speeding up subsequent computations on it. Note that the base RDD, lines, is not loaded into RAM. This is desirable because the error messages might only be a small fraction of the data (small enough to fit into memory).

Finally, to illustrate how our model achieves fault tolerance, we show the lineage graph for the RDDs in our third query in Figure 1. In this query, we started with errors, the result of a filter on lines, and applied a further filter and map before running a collect. The Spark scheduler will pipeline the latter two transformations and send a set of tasks to compute them to the nodes holding the cached partitions of errors. In addition, if a partition of errors is lost, Spark rebuilds it by applying a filter on only the corresponding partition of lines.

Table 1: Comparison of RDDs with distributed shared memory.

Aspect                     | RDDs                                         | Distr. Shared Mem.
Reads                      | Coarse- or fine-grained                      | Fine-grained
Writes                     | Coarse-grained                               | Fine-grained
Consistency                | Trivial (immutable)                          | Up to app / runtime
Fault recovery             | Fine-grained and low-overhead using lineage  | Requires checkpoints and program rollback
Straggler mitigation       | Possible using backup tasks                  | Difficult
Work placement             | Automatic based on data locality             | Up to app (runtimes aim for transparency)
Behavior if not enough RAM | Similar to existing data flow systems        | Poor performance (swapping?)

2.3 Advantages of the RDD Model

To understand the benefits of RDDs as a distributed memory abstraction, we compare them against distributed shared memory (DSM) in Table 1. In DSM systems, applications read and write to arbitrary locations in a global address space. Note that under this definition, we include not only traditional shared memory systems [24], but also other systems where applications make fine-grained writes to shared state, including Piccolo [27], which provides a shared DHT, and distributed databases. DSM is a very general abstraction, but this generality makes it harder to implement in an efficient and fault-tolerant manner on commodity clusters.

The main difference between RDDs and DSM is that RDDs can only be created ("written") through coarse-grained transformations, while DSM allows reads and writes to each memory location.3 This restricts RDDs to applications that perform bulk writes, but allows for more efficient fault tolerance. In particular, RDDs do not need to incur the overhead of checkpointing, as they can be recovered using lineage.4 Furthermore, only the lost partitions of an RDD need to be recomputed upon failure, and they can be recomputed in parallel on different nodes, without having to roll back the whole program.

A second benefit of RDDs is that their immutable nature lets a system mitigate slow nodes (stragglers) by running backup copies of slow tasks as in MapReduce [10]. Backup tasks would be hard to implement with DSM, as the two copies of a task would access the same memory locations and interfere with each other's updates.

Finally, RDDs provide two other benefits over DSM. First, in bulk operations on RDDs, a runtime can schedule tasks based on data locality to improve performance. Second, RDDs degrade gracefully when there is not enough memory to store them, as long as they are only being used in scan-based operations.

3 Note that reads on RDDs can still be fine-grained. For example, an application can treat an RDD as a large read-only lookup table.

4 In some applications, it can still help to checkpoint RDDs with long lineage chains, as we discuss in Section 5.4. However, this can be done in the background because RDDs are immutable, and there is no need to take a snapshot of the whole application as in DSM.


annotations on the code above (in Scala): lines and errors are rdds; lines is created from a file, errors by a transformation (filter); persist() is a hint to keep errors in memory; up to that point no work has been done on the cluster; count() is an action, and note that lines is not loaded to RAM!

Page 79: Lecture 06  - CS-5040 - modern database systems

example  -­‐  text  search  ctd.  

let  us  find  errors  related  to  “MySQL”  

michael  mathioudakis   79  

Page 80: Lecture 06  - CS-5040 - modern database systems

example - text search ctd.

michael  mathioudakis   80  

// Count errors mentioning MySQL:
errors.filter(_.contains("MySQL"))   // transformation
      .count()                       // action

Page 81: Lecture 06  - CS-5040 - modern database systems

example - text search ctd. again

let us find errors related to “HDFS” and extract their time field
assuming time is field no. 3 in tab-separated format

michael  mathioudakis   81  

Page 82: Lecture 06  - CS-5040 - modern database systems

example - text search ctd. again

michael  mathioudakis   82  

lines

errors

filter(_.startsWith(“ERROR”))

HDFS errors

time fields

filter(_.contains(“HDFS”)))

map(_.split(‘\t’)(3))

Figure 1: Lineage graph for the third query in our example.Boxes represent RDDs and arrows represent transformations.

lines = spark.textFile("hdfs://...")errors = lines.filter(_.startsWith("ERROR"))errors.persist()

Line 1 defines an RDD backed by an HDFS file (as acollection of lines of text), while line 2 derives a filteredRDD from it. Line 3 then asks for errors to persist inmemory so that it can be shared across queries. Note thatthe argument to filter is Scala syntax for a closure.

At this point, no work has been performed on the clus-ter. However, the user can now use the RDD in actions,e.g., to count the number of messages:

errors.count()

The user can also perform further transformations onthe RDD and use their results, as in the following lines:

// Count errors mentioning MySQL:errors.filter(_.contains("MySQL")).count()

// Return the time fields of errors mentioning// HDFS as an array (assuming time is field// number 3 in a tab-separated format):errors.filter(_.contains("HDFS"))

.map(_.split(’\t’)(3))

.collect()

After the first action involving errors runs, Spark willstore the partitions of errors in memory, greatly speed-ing up subsequent computations on it. Note that the baseRDD, lines, is not loaded into RAM. This is desirablebecause the error messages might only be a small frac-tion of the data (small enough to fit into memory).

Finally, to illustrate how our model achieves fault tol-erance, we show the lineage graph for the RDDs in ourthird query in Figure 1. In this query, we started witherrors, the result of a filter on lines, and applied a fur-ther filter and map before running a collect. The Sparkscheduler will pipeline the latter two transformations andsend a set of tasks to compute them to the nodes holdingthe cached partitions of errors. In addition, if a partitionof errors is lost, Spark rebuilds it by applying a filter ononly the corresponding partition of lines.

Aspect RDDs Distr. Shared Mem. Reads Coarse- or fine-grained Fine-grained Writes Coarse-grained Fine-grained Consistency Trivial (immutable) Up to app / runtime Fault recovery Fine-grained and low-

overhead using lineage Requires checkpoints and program rollback

Straggler mitigation

Possible using backup tasks

Difficult

Work placement

Automatic based on data locality

Up to app (runtimes aim for transparency)

Behavior if not enough RAM

Similar to existing data flow systems

Poor performance (swapping?)

Table 1: Comparison of RDDs with distributed shared memory.

2.3 Advantages of the RDD Model

To understand the benefits of RDDs as a distributedmemory abstraction, we compare them against dis-tributed shared memory (DSM) in Table 1. In DSM sys-tems, applications read and write to arbitrary locations ina global address space. Note that under this definition, weinclude not only traditional shared memory systems [24],but also other systems where applications make fine-grained writes to shared state, including Piccolo [27],which provides a shared DHT, and distributed databases.DSM is a very general abstraction, but this generalitymakes it harder to implement in an efficient and fault-tolerant manner on commodity clusters.

The main difference between RDDs and DSM is thatRDDs can only be created (“written”) through coarse-grained transformations, while DSM allows reads andwrites to each memory location.3 This restricts RDDsto applications that perform bulk writes, but allows formore efficient fault tolerance. In particular, RDDs do notneed to incur the overhead of checkpointing, as they canbe recovered using lineage.4 Furthermore, only the lostpartitions of an RDD need to be recomputed upon fail-ure, and they can be recomputed in parallel on differentnodes, without having to roll back the whole program.

A second benefit of RDDs is that their immutable na-ture lets a system mitigate slow nodes (stragglers) by run-ning backup copies of slow tasks as in MapReduce [10].Backup tasks would be hard to implement with DSM, asthe two copies of a task would access the same memorylocations and interfere with each other’s updates.

Finally, RDDs provide two other benefits over DSM.First, in bulk operations on RDDs, a runtime can sched-

3Note that reads on RDDs can still be fine-grained. For example, anapplication can treat an RDD as a large read-only lookup table.

4In some applications, it can still help to checkpoint RDDs withlong lineage chains, as we discuss in Section 5.4. However, this can bedone in the background because RDDs are immutable, and there is noneed to take a snapshot of the whole application as in DSM.

lines

errors

filter(_.startsWith(“ERROR”))

HDFS errors

time fields

filter(_.contains(“HDFS”)))

map(_.split(‘\t’)(3))

Figure 1: Lineage graph for the third query in our example.Boxes represent RDDs and arrows represent transformations.

lines = spark.textFile("hdfs://...")errors = lines.filter(_.startsWith("ERROR"))errors.persist()

Line 1 defines an RDD backed by an HDFS file (as acollection of lines of text), while line 2 derives a filteredRDD from it. Line 3 then asks for errors to persist inmemory so that it can be shared across queries. Note thatthe argument to filter is Scala syntax for a closure.

At this point, no work has been performed on the clus-ter. However, the user can now use the RDD in actions,e.g., to count the number of messages:

errors.count()

The user can also perform further transformations onthe RDD and use their results, as in the following lines:

// Count errors mentioning MySQL:errors.filter(_.contains("MySQL")).count()

// Return the time fields of errors mentioning// HDFS as an array (assuming time is field// number 3 in a tab-separated format):errors.filter(_.contains("HDFS"))

.map(_.split(’\t’)(3))

.collect()

After the first action involving errors runs, Spark willstore the partitions of errors in memory, greatly speed-ing up subsequent computations on it. Note that the baseRDD, lines, is not loaded into RAM. This is desirablebecause the error messages might only be a small frac-tion of the data (small enough to fit into memory).

Finally, to illustrate how our model achieves fault tol-erance, we show the lineage graph for the RDDs in ourthird query in Figure 1. In this query, we started witherrors, the result of a filter on lines, and applied a fur-ther filter and map before running a collect. The Sparkscheduler will pipeline the latter two transformations andsend a set of tasks to compute them to the nodes holdingthe cached partitions of errors. In addition, if a partitionof errors is lost, Spark rebuilds it by applying a filter ononly the corresponding partition of lines.

Aspect RDDs Distr. Shared Mem. Reads Coarse- or fine-grained Fine-grained Writes Coarse-grained Fine-grained Consistency Trivial (immutable) Up to app / runtime Fault recovery Fine-grained and low-

overhead using lineage Requires checkpoints and program rollback

Straggler mitigation

Possible using backup tasks

Difficult

Work placement

Automatic based on data locality

Up to app (runtimes aim for transparency)

Behavior if not enough RAM

Similar to existing data flow systems

Poor performance (swapping?)

Table 1: Comparison of RDDs with distributed shared memory.

2.3 Advantages of the RDD Model

To understand the benefits of RDDs as a distributedmemory abstraction, we compare them against dis-tributed shared memory (DSM) in Table 1. In DSM sys-tems, applications read and write to arbitrary locations ina global address space. Note that under this definition, weinclude not only traditional shared memory systems [24],but also other systems where applications make fine-grained writes to shared state, including Piccolo [27],which provides a shared DHT, and distributed databases.DSM is a very general abstraction, but this generalitymakes it harder to implement in an efficient and fault-tolerant manner on commodity clusters.

The main difference between RDDs and DSM is thatRDDs can only be created (“written”) through coarse-grained transformations, while DSM allows reads andwrites to each memory location.3 This restricts RDDsto applications that perform bulk writes, but allows formore efficient fault tolerance. In particular, RDDs do notneed to incur the overhead of checkpointing, as they canbe recovered using lineage.4 Furthermore, only the lostpartitions of an RDD need to be recomputed upon fail-ure, and they can be recomputed in parallel on differentnodes, without having to roll back the whole program.

A second benefit of RDDs is that their immutable na-ture lets a system mitigate slow nodes (stragglers) by run-ning backup copies of slow tasks as in MapReduce [10].Backup tasks would be hard to implement with DSM, asthe two copies of a task would access the same memorylocations and interfere with each other’s updates.

Finally, RDDs provide two other benefits over DSM.First, in bulk operations on RDDs, a runtime can sched-

3Note that reads on RDDs can still be fine-grained. For example, anapplication can treat an RDD as a large read-only lookup table.

4In some applications, it can still help to checkpoint RDDs withlong lineage chains, as we discuss in Section 5.4. However, this can bedone in the background because RDDs are immutable, and there is noneed to take a snapshot of the whole application as in DSM.

lines

errors

filter(_.startsWith(“ERROR”))

HDFS errors

time fields

filter(_.contains(“HDFS”)))

map(_.split(‘\t’)(3))

Figure 1: Lineage graph for the third query in our example.Boxes represent RDDs and arrows represent transformations.

lines = spark.textFile("hdfs://...")errors = lines.filter(_.startsWith("ERROR"))errors.persist()

Line 1 defines an RDD backed by an HDFS file (as acollection of lines of text), while line 2 derives a filteredRDD from it. Line 3 then asks for errors to persist inmemory so that it can be shared across queries. Note thatthe argument to filter is Scala syntax for a closure.

At this point, no work has been performed on the clus-ter. However, the user can now use the RDD in actions,e.g., to count the number of messages:

errors.count()

The user can also perform further transformations onthe RDD and use their results, as in the following lines:

// Count errors mentioning MySQL:errors.filter(_.contains("MySQL")).count()

// Return the time fields of errors mentioning// HDFS as an array (assuming time is field// number 3 in a tab-separated format):errors.filter(_.contains("HDFS"))

.map(_.split(’\t’)(3))

.collect()

After the first action involving errors runs, Spark willstore the partitions of errors in memory, greatly speed-ing up subsequent computations on it. Note that the baseRDD, lines, is not loaded into RAM. This is desirablebecause the error messages might only be a small frac-tion of the data (small enough to fit into memory).

Finally, to illustrate how our model achieves fault tol-erance, we show the lineage graph for the RDDs in ourthird query in Figure 1. In this query, we started witherrors, the result of a filter on lines, and applied a fur-ther filter and map before running a collect. The Sparkscheduler will pipeline the latter two transformations andsend a set of tasks to compute them to the nodes holdingthe cached partitions of errors. In addition, if a partitionof errors is lost, Spark rebuilds it by applying a filter ononly the corresponding partition of lines.

Aspect RDDs Distr. Shared Mem. Reads Coarse- or fine-grained Fine-grained Writes Coarse-grained Fine-grained Consistency Trivial (immutable) Up to app / runtime Fault recovery Fine-grained and low-

overhead using lineage Requires checkpoints and program rollback

Straggler mitigation

Possible using backup tasks

Difficult

Work placement

Automatic based on data locality

Up to app (runtimes aim for transparency)

Behavior if not enough RAM

Similar to existing data flow systems

Poor performance (swapping?)

Table 1: Comparison of RDDs with distributed shared memory.

2.3 Advantages of the RDD Model

To understand the benefits of RDDs as a distributedmemory abstraction, we compare them against dis-tributed shared memory (DSM) in Table 1. In DSM sys-tems, applications read and write to arbitrary locations ina global address space. Note that under this definition, weinclude not only traditional shared memory systems [24],but also other systems where applications make fine-grained writes to shared state, including Piccolo [27],which provides a shared DHT, and distributed databases.DSM is a very general abstraction, but this generalitymakes it harder to implement in an efficient and fault-tolerant manner on commodity clusters.

The main difference between RDDs and DSM is thatRDDs can only be created (“written”) through coarse-grained transformations, while DSM allows reads andwrites to each memory location.3 This restricts RDDsto applications that perform bulk writes, but allows formore efficient fault tolerance. In particular, RDDs do notneed to incur the overhead of checkpointing, as they canbe recovered using lineage.4 Furthermore, only the lostpartitions of an RDD need to be recomputed upon fail-ure, and they can be recomputed in parallel on differentnodes, without having to roll back the whole program.

A second benefit of RDDs is that their immutable na-ture lets a system mitigate slow nodes (stragglers) by run-ning backup copies of slow tasks as in MapReduce [10].Backup tasks would be hard to implement with DSM, asthe two copies of a task would access the same memorylocations and interfere with each other’s updates.

Finally, RDDs provide two other benefits over DSM.First, in bulk operations on RDDs, a runtime can sched-

3Note that reads on RDDs can still be fine-grained. For example, anapplication can treat an RDD as a large read-only lookup table.

4In some applications, it can still help to checkpoint RDDs withlong lineage chains, as we discuss in Section 5.4. However, this can bedone in the background because RDDs are immutable, and there is noneed to take a snapshot of the whole application as in DSM.

lines

errors

filter(_.startsWith(“ERROR”))

HDFS errors

time fields

filter(_.contains(“HDFS”)))

map(_.split(‘\t’)(3))

Figure 1: Lineage graph for the third query in our example.Boxes represent RDDs and arrows represent transformations.

lines = spark.textFile("hdfs://...")errors = lines.filter(_.startsWith("ERROR"))errors.persist()

Line 1 defines an RDD backed by an HDFS file (as acollection of lines of text), while line 2 derives a filteredRDD from it. Line 3 then asks for errors to persist inmemory so that it can be shared across queries. Note thatthe argument to filter is Scala syntax for a closure.

At this point, no work has been performed on the clus-ter. However, the user can now use the RDD in actions,e.g., to count the number of messages:

errors.count()

The user can also perform further transformations onthe RDD and use their results, as in the following lines:

// Count errors mentioning MySQL:errors.filter(_.contains("MySQL")).count()

// Return the time fields of errors mentioning// HDFS as an array (assuming time is field// number 3 in a tab-separated format):errors.filter(_.contains("HDFS"))

.map(_.split(’\t’)(3))

.collect()

After the first action involving errors runs, Spark willstore the partitions of errors in memory, greatly speed-ing up subsequent computations on it. Note that the baseRDD, lines, is not loaded into RAM. This is desirablebecause the error messages might only be a small frac-tion of the data (small enough to fit into memory).

Finally, to illustrate how our model achieves fault tol-erance, we show the lineage graph for the RDDs in ourthird query in Figure 1. In this query, we started witherrors, the result of a filter on lines, and applied a fur-ther filter and map before running a collect. The Sparkscheduler will pipeline the latter two transformations andsend a set of tasks to compute them to the nodes holdingthe cached partitions of errors. In addition, if a partitionof errors is lost, Spark rebuilds it by applying a filter ononly the corresponding partition of lines.

Aspect RDDs Distr. Shared Mem. Reads Coarse- or fine-grained Fine-grained Writes Coarse-grained Fine-grained Consistency Trivial (immutable) Up to app / runtime Fault recovery Fine-grained and low-

overhead using lineage Requires checkpoints and program rollback

Straggler mitigation

Possible using backup tasks

Difficult

Work placement

Automatic based on data locality

Up to app (runtimes aim for transparency)

Behavior if not enough RAM

Similar to existing data flow systems

Poor performance (swapping?)

Table 1: Comparison of RDDs with distributed shared memory.

2.3 Advantages of the RDD Model

To understand the benefits of RDDs as a distributedmemory abstraction, we compare them against dis-tributed shared memory (DSM) in Table 1. In DSM sys-tems, applications read and write to arbitrary locations ina global address space. Note that under this definition, weinclude not only traditional shared memory systems [24],but also other systems where applications make fine-grained writes to shared state, including Piccolo [27],which provides a shared DHT, and distributed databases.DSM is a very general abstraction, but this generalitymakes it harder to implement in an efficient and fault-tolerant manner on commodity clusters.

The main difference between RDDs and DSM is that RDDs can only be created ("written") through coarse-grained transformations, while DSM allows reads and writes to each memory location.[3] This restricts RDDs to applications that perform bulk writes, but allows for more efficient fault tolerance. In particular, RDDs do not need to incur the overhead of checkpointing, as they can be recovered using lineage.[4] Furthermore, only the lost partitions of an RDD need to be recomputed upon failure, and they can be recomputed in parallel on different nodes, without having to roll back the whole program.

A second benefit of RDDs is that their immutable nature lets a system mitigate slow nodes (stragglers) by running backup copies of slow tasks as in MapReduce [10]. Backup tasks would be hard to implement with DSM, as the two copies of a task would access the same memory locations and interfere with each other's updates.

Finally, RDDs provide two other benefits over DSM. First, in bulk operations on RDDs, a runtime can schedule tasks based on data locality to improve performance. Second, RDDs degrade gracefully when there is not enough memory to store them: partitions that do not fit in RAM can be kept on disk, giving performance similar to existing data flow systems.

[3] Note that reads on RDDs can still be fine-grained. For example, an application can treat an RDD as a large read-only lookup table.

[4] In some applications, it can still help to checkpoint RDDs with long lineage chains, as we discuss in Section 5.4. However, this can be done in the background because RDDs are immutable, and there is no need to take a snapshot of the whole application as in DSM.

transformations

action

Page 83: Lecture 06  - CS-5040 - modern database systems

example: text search - lineage of time fields

michael  mathioudakis   83  


cached  

pipelined transformations; if a partition of errors is lost,

filter is applied only to the corresponding partition of lines

Page 84: Lecture 06  - CS-5040 - modern database systems

transformations and actions

Transformations:
  map(f : T ⇒ U)                  : RDD[T] ⇒ RDD[U]
  filter(f : T ⇒ Bool)            : RDD[T] ⇒ RDD[T]
  flatMap(f : T ⇒ Seq[U])         : RDD[T] ⇒ RDD[U]
  sample(fraction : Float)        : RDD[T] ⇒ RDD[T]  (Deterministic sampling)
  groupByKey()                    : RDD[(K, V)] ⇒ RDD[(K, Seq[V])]
  reduceByKey(f : (V, V) ⇒ V)     : RDD[(K, V)] ⇒ RDD[(K, V)]
  union()                         : (RDD[T], RDD[T]) ⇒ RDD[T]
  join()                          : (RDD[(K, V)], RDD[(K, W)]) ⇒ RDD[(K, (V, W))]
  cogroup()                       : (RDD[(K, V)], RDD[(K, W)]) ⇒ RDD[(K, (Seq[V], Seq[W]))]
  crossProduct()                  : (RDD[T], RDD[U]) ⇒ RDD[(T, U)]
  mapValues(f : V ⇒ W)            : RDD[(K, V)] ⇒ RDD[(K, W)]  (Preserves partitioning)
  sort(c : Comparator[K])         : RDD[(K, V)] ⇒ RDD[(K, V)]
  partitionBy(p : Partitioner[K]) : RDD[(K, V)] ⇒ RDD[(K, V)]

Actions:
  count()                         : RDD[T] ⇒ Long
  collect()                       : RDD[T] ⇒ Seq[T]
  reduce(f : (T, T) ⇒ T)          : RDD[T] ⇒ T
  lookup(k : K)                   : RDD[(K, V)] ⇒ Seq[V]  (On hash/range partitioned RDDs)
  save(path : String)             : Outputs RDD to a storage system, e.g., HDFS

Table 2: Transformations and actions available on RDDs in Spark. Seq[T] denotes a sequence of elements of type T.
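As a quick illustration of how a few of the operations in Table 2 compose (a sketch meant for the Spark shell, where sc is predefined; the toy data is made up and not from the paper):

// word count with flatMap + reduceByKey, then a join with a second pair RDD
val docs   = sc.parallelize(Seq("error in hdfs", "error in mysql"))
val counts = docs.flatMap(_.split(" "))        // RDD[String]
                 .map(word => (word, 1))       // RDD[(String, Int)]
                 .reduceByKey(_ + _)           // one record per distinct word

val severities = sc.parallelize(Seq(("error", "high"), ("warn", "low")))
val joined     = counts.join(severities)       // RDD[(String, (Int, String))]

println(counts.collect().toSeq)                // action: brings results to the driver
println(joined.lookup("error"))                // action on a hash-partitioned pair RDD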

The program below implements logistic regression, a common classification algorithm that searches for a hyperplane w that best separates two sets of points (e.g., spam and non-spam emails). The algorithm uses gradient descent: it starts w at a random value, and on each iteration, it sums a function of w over the data to move w in a direction that improves it.

val points = spark.textFile(...)
                  .map(parsePoint).persist()

var w = // random initial vector
for (i <- 1 to ITERATIONS) {
  val gradient = points.map { p =>
    p.x * (1/(1+exp(-p.y*(w dot p.x)))-1) * p.y
  }.reduce((a,b) => a+b)
  w -= gradient
}

We start by defining a persistent RDD called points as the result of a map transformation on a text file that parses each line of text into a Point object. We then repeatedly run map and reduce on points to compute the gradient at each step by summing a function of the current w. Keeping points in memory across iterations can yield a 20× speedup, as we show in Section 6.1.
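The snippet above is the paper's pseudocode: parsePoint, the Point type, the random initial vector and the vector arithmetic (dot, -=) are left abstract. A self-contained version might look like the sketch below; the input format (a label followed by space-separated features) and all names are assumptions made for illustration, not the paper's code.

import org.apache.spark.{SparkConf, SparkContext}

object LogisticRegression {
  case class Point(x: Array[Double], y: Double)           // y is the label, in {-1, +1}

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lr").setMaster("local[*]"))
    val ITERATIONS = 10

    // hypothetical input: "label f1 f2 ... fd" per line
    val points = sc.textFile("file:///tmp/points.txt").map { line =>
      val cols = line.split(" ").map(_.toDouble)
      Point(cols.tail, cols.head)
    }.persist()

    val d = points.first().x.length
    var w = Array.fill(d)(scala.util.Random.nextDouble())  // random initial vector

    for (_ <- 1 to ITERATIONS) {
      // map: per-point gradient contribution; reduce: element-wise sum of the contributions
      val gradient = points.map { p =>
        val s = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
        p.x.map(_ * s)
      }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
      w = w.zip(gradient).map { case (wi, gi) => wi - gi }
    }

    println(w.mkString(" "))
    sc.stop()
  }
}

Because points is persisted, only the first iteration pays the cost of reading and parsing the text file; later iterations read the cached Point objects.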


michael  mathioudakis   84  

Page 85: Lecture 06  - CS-5040 - modern database systems

example: pagerank setting

N documents that contain links to other documents (e.g., webpages)

pagerank iteratively updates a rank score for each document by adding up contributions from documents that link to it

iteration: each document sends a contribution of rank/n to its neighbors (rank: own document rank, n: number of neighbors) and updates its rank to α/N + (1−α)Σci (ci: contribution received)
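Written out (a rendering of the update rule on this slide, with in(i) the set of documents that link to document i and n_j the number of outgoing links of j):

r_i = \frac{\alpha}{N} + (1 - \alpha) \sum_{j \in \mathrm{in}(i)} \frac{r_j}{n_j}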

michael  mathioudakis   85  

Page 86: Lecture 06  - CS-5040 - modern database systems

example:  pagerank  

michael  mathioudakis   86  


3.2.2 PageRank

A more complex pattern of data sharing occurs in PageRank [6]. The algorithm iteratively updates a rank for each document by adding up contributions from documents that link to it. On each iteration, each document sends a contribution of r/n to its neighbors, where r is its rank and n is its number of neighbors. It then updates its rank to α/N + (1 − α)Σci, where the sum is over the contributions it received and N is the total number of documents. We can write PageRank in Spark as follows:

// Load graph as an RDD of (URL, outlinks) pairs
val links = spark.textFile(...).map(...).persist()
var ranks = // RDD of (URL, rank) pairs
for (i <- 1 to ITERATIONS) {
  // Build an RDD of (targetURL, float) pairs
  // with the contributions sent by each page
  val contribs = links.join(ranks).flatMap {
    (url, (links, rank)) =>
      links.map(dest => (dest, rank/links.size))
  }
  // Sum contributions by URL and get new ranks
  ranks = contribs.reduceByKey((x,y) => x+y)
                  .mapValues(sum => a/N + (1-a)*sum)
}

Figure 3: Lineage graph for datasets in PageRank: input file → map → links; on each iteration i, links and ranks_i → join → contribs_i → reduce + map → ranks_{i+1}.

This program leads to the RDD lineage graph in Figure 3. On each iteration, we create a new ranks dataset based on the contribs and ranks from the previous iteration and the static links dataset.[6] One interesting feature of this graph is that it grows longer with the number of iterations.

[6] Note that although RDDs are immutable, the variables ranks and contribs in the program point to different RDDs on each iteration.
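The paper's listing leaves links, ranks, a and N abstract, and the pattern (url, (links, rank)) => inside flatMap needs a case in plain Scala. A self-contained sketch under those assumptions, reading a hypothetical edge list of "srcURL dstURL" lines:

import org.apache.spark.{SparkConf, SparkContext}

object PageRank {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pagerank").setMaster("local[*]"))
    val ITERATIONS = 10
    val a = 0.15                                              // assumed value for the α parameter

    // hypothetical input: one "srcURL dstURL" pair per line
    val edges = sc.textFile("file:///tmp/edges.txt").map { line =>
      val Array(src, dst) = line.split("\\s+"); (src, dst)
    }
    val links = edges.groupByKey().persist()                  // (URL, outlinks), reused every iteration
    val N = links.count().toDouble

    var ranks = links.mapValues(_ => 1.0 / N)                 // initial (URL, rank) pairs

    for (_ <- 1 to ITERATIONS) {
      // contributions sent by each page to its outlinks
      val contribs = links.join(ranks).flatMap {
        case (_, (outlinks, rank)) => outlinks.map(dest => (dest, rank / outlinks.size))
      }
      // sum contributions per URL and apply the update rule from the paper
      ranks = contribs.reduceByKey(_ + _)
                      .mapValues(sum => a / N + (1 - a) * sum)
    }

    ranks.collect().sortBy(-_._2).take(10).foreach(println)
    sc.stop()
  }
}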


Page 87: Lecture 06  - CS-5040 - modern database systems

example: pagerank - lineage

Figure 3 (see previous slide): Lineage graph for datasets in PageRank — input file → map → links; links and ranks_i → join → contribs_i → reduce + map → ranks_{i+1}, repeated for each iteration. The lineage grows longer with the number of iterations.

michael  mathioudakis   87  

Page 88: Lecture 06  - CS-5040 - modern database systems

representing rdds

internal information about rdds:

partitions & partitioning scheme
dependencies on parent RDDs
function to compute it from parents
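In code, this common interface can be pictured roughly as the Scala traits below (a simplified sketch; the method names mirror the interface described in the RDD paper, but the exact signatures are assumptions, not Spark's actual source):

// simplified sketch of the information each RDD exposes to the scheduler
trait Partition { def index: Int }
trait Partitioner { def numPartitions: Int; def getPartition(key: Any): Int }
trait Dependency[T]                                          // narrow or wide (shuffle) dependency

trait RDD[T] {
  def partitions: Array[Partition]                           // set of partitions
  def partitioner: Option[Partitioner]                       // partitioning scheme, if any
  def dependencies: Seq[Dependency[_]]                       // dependencies on parent RDDs
  def compute(split: Partition): Iterator[T]                 // compute a partition from its parents
  def preferredLocations(split: Partition): Seq[String]      // placement hints (e.g., HDFS block hosts)
}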

michael  mathioudakis   88  

Page 89: Lecture 06  - CS-5040 - modern database systems

rdd  dependencies  

narrow dependencies: each partition of the parent rdd is used by at most one partition of the child rdd

otherwise, wide dependencies
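For example (a Spark-shell sketch with made-up data): map and filter produce narrow dependencies, because each output partition depends on a single parent partition, while groupByKey produces a wide dependency that requires a shuffle.

val pairs  = sc.parallelize(1 to 1000).map(x => (x % 10, x))  // narrow dependency (map)
val narrow = pairs.filter { case (_, v) => v > 500 }          // narrow dependency (filter)
val wide   = pairs.groupByKey()                               // wide dependency (shuffle)

println(wide.toDebugString)   // the printed lineage shows the shuffle boundary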

michael  mathioudakis   89  

Page 90: Lecture 06  - CS-5040 - modern database systems

rdd  dependencies  

Figure 4: Examples of narrow and wide dependencies. Each box is an RDD, with partitions shown as shaded rectangles. Narrow dependencies: map, filter; union; join with inputs co-partitioned. Wide dependencies: groupByKey; join with inputs not co-partitioned.

map: Calling map on an RDD returns an RDD with the same partitions and preferred locations as its parent, which applies the function passed to map to the parent's records in its iterator method.

union: Calling union on two RDDs returns an RDD whose partitions are the union of those of the parents. Each child partition is computed through a narrow dependency on the corresponding parent.[7]

sample: Sampling is similar to mapping, except that the RDD stores a random number generator seed for each partition to deterministically sample parent records.

join: Joining two RDDs may lead to either two narrow dependencies (if they are both hash/range partitioned with the same partitioner), two wide dependencies, or a mix (if one parent has a partitioner and one does not). In either case, the output RDD has a partitioner (either one inherited from the parents or a default hash partitioner).


[7] Note that our union operation does not drop duplicate values.


michael  mathioudakis   90  

Page 91: Lecture 06  - CS-5040 - modern database systems

scheduling  

when an action is performed... (e.g., count() or save())

... the scheduler examines the lineage graph and builds a DAG of stages to execute

each stage is a maximal pipeline of transformations over narrow dependencies
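For instance (shell sketch, hypothetical input path): in the job below, textFile, flatMap and map are pipelined into one stage, reduceByKey introduces a wide dependency and hence a stage boundary, and nothing runs until the count() action is called.

val counts = sc.textFile("file:///tmp/app.log")   // hypothetical input
               .flatMap(_.split(" "))             // narrow: pipelined
               .map(word => (word, 1))            // narrow: same stage
               .reduceByKey(_ + _)                // wide dependency -> stage boundary (shuffle)

counts.count()                                    // action: triggers the two-stage DAG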

michael  mathioudakis   91  

Page 92: Lecture 06  - CS-5040 - modern database systems

scheduling  


5 Implementation

We have implemented Spark in about 14,000 lines of Scala. The system runs over the Mesos cluster manager [17], allowing it to share resources with Hadoop, MPI and other applications. Each Spark program runs as a separate Mesos application, with its own driver (master) and workers, and resource sharing between these applications is handled by Mesos.

Spark can read data from any Hadoop input source (e.g., HDFS or HBase) using Hadoop's existing input plugin APIs, and runs on an unmodified version of Scala.

We now sketch several of the technically interesting parts of the system: our job scheduler (§5.1), our Spark interpreter allowing interactive use (§5.2), memory management (§5.3), and support for checkpointing (§5.4).

5.1 Job Scheduling

Spark's scheduler uses our representation of RDDs, described in Section 4. Overall, our scheduler is similar to Dryad's [19], but it additionally takes into account which partitions of persistent RDDs are available in memory.


Figure 5: Example of how Spark computes job stages for a job whose RDDs A-G are connected by map, union, groupBy and join. Boxes with solid outlines are RDDs. Partitions are shaded rectangles, in black if they are already in memory. To run an action on RDD G, we build stages at wide dependencies and pipeline narrow transformations inside each stage. In this case, stage 1's output RDD is already in RAM, so we run stage 2 and then 3.

Whenever a user runs an action (e.g., count or save) on an RDD, the scheduler examines that RDD's lineage graph to build a DAG of stages to execute, as illustrated in Figure 5. Each stage contains as many pipelined transformations with narrow dependencies as possible. The boundaries of the stages are the shuffle operations required for wide dependencies, or any already computed partitions that can short-circuit the computation of a parent RDD. The scheduler then launches tasks to compute missing partitions from each stage until it has computed the target RDD.

Our scheduler assigns tasks to machines based on data locality using delay scheduling [32]. If a task needs to process a partition that is available in memory on a node, we send it to that node. Otherwise, if a task processes a partition for which the containing RDD provides preferred locations (e.g., an HDFS file), we send it to those.

For wide dependencies (i.e., shuffle dependencies), we currently materialize intermediate records on the nodes holding parent partitions to simplify fault recovery, much like MapReduce materializes map outputs.

If a task fails, we re-run it on another node as long as its stage's parents are still available. If some stages have become unavailable (e.g., because an output from the "map side" of a shuffle was lost), we resubmit tasks to compute the missing partitions in parallel. We do not yet tolerate scheduler failures, though replicating the RDD lineage graph would be straightforward.

Finally, although all computations in Spark currently run in response to actions called in the driver program, we are also experimenting with letting tasks on the cluster (e.g., maps) call the lookup operation, which provides random access to elements of hash-partitioned RDDs by key. In this case, tasks would need to tell the scheduler to compute the required partition if it is missing.

michael  mathioudakis   92  

rdd

partition

already in ram

Page 93: Lecture 06  - CS-5040 - modern database systems

memory  management  

when there is not enough memory, apply an LRU eviction policy at the rdd level:

evict a partition from the least recently used rdd
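A toy sketch of this policy (not Spark's actual memory manager): keep a last-access time per rdd and evict a cached partition from the least recently used one, skipping the rdd whose partition is currently being loaded so that its own partitions are not cycled in and out.

import scala.collection.mutable

// toy model: per-RDD recency plus the set of cached partitions of each RDD
class RddLru {
  private val lastUsed = mutable.Map[Int, Long]()              // rddId -> last access time
  private val cached   = mutable.Map[Int, mutable.Set[Int]]()  // rddId -> cached partition ids

  def recordAccess(rddId: Int, partitionId: Int): Unit = {
    lastUsed(rddId) = System.nanoTime()
    cached.getOrElseUpdate(rddId, mutable.Set.empty[Int]) += partitionId
  }

  // evict one partition from the least recently used RDD (never the RDD being loaded)
  def evictOne(loadingRddId: Int): Option[(Int, Int)] = {
    val candidates = lastUsed.filter { case (id, _) =>
      id != loadingRddId && cached.get(id).exists(_.nonEmpty)
    }
    if (candidates.isEmpty) None
    else {
      val victimRdd       = candidates.minBy(_._2)._1
      val victimPartition = cached(victimRdd).head
      cached(victimRdd) -= victimPartition
      Some((victimRdd, victimPartition))
    }
  }
}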

michael  mathioudakis   93  

Page 94: Lecture 06  - CS-5040 - modern database systems

performance  

logistic regression and k-means on amazon EC2

10 iterations on 100GB datasets, 100-node clusters

michael  mathioudakis   94  

Page 95: Lecture 06  - CS-5040 - modern database systems

performance  

The read-only nature of RDDs also makes them simpler to checkpoint than general shared memory. Because consistency is not a concern, RDDs can be written out in the background without requiring program pauses or distributed snapshot schemes.

6 Evaluation

We evaluated Spark and RDDs through a series of experiments on Amazon EC2, as well as benchmarks of user applications. Overall, our results show the following:

• Spark outperforms Hadoop by up to 20× in iterative machine learning and graph applications. The speedup comes from avoiding I/O and deserialization costs by storing data in memory as Java objects.

• Applications written by our users perform and scale well. In particular, we used Spark to speed up an analytics report that was running on Hadoop by 40×.

• When nodes fail, Spark can recover quickly by rebuilding only the lost RDD partitions.

• Spark can be used to query a 1 TB dataset interactively with latencies of 5–7 seconds.

We start by presenting benchmarks for iterative machine learning applications (§6.1) and PageRank (§6.2) against Hadoop. We then evaluate fault recovery in Spark (§6.3) and behavior when a dataset does not fit in memory (§6.4). Finally, we discuss results for user applications (§6.5) and interactive data mining (§6.6).

Unless otherwise noted, our tests used m1.xlarge EC2 nodes with 4 cores and 15 GB of RAM. We used HDFS for storage, with 256 MB blocks. Before each test, we cleared OS buffer caches to measure IO costs accurately.

6.1 Iterative Machine Learning Applications

We implemented two iterative machine learning applications, logistic regression and k-means, to compare the performance of the following systems:

• Hadoop: The Hadoop 0.20.2 stable release.

• HadoopBinMem: A Hadoop deployment that converts the input data into a low-overhead binary format in the first iteration to eliminate text parsing in later ones, and stores it in an in-memory HDFS instance.

• Spark: Our implementation of RDDs.

We ran both algorithms for 10 iterations on 100 GB datasets using 25–100 machines. The key difference between the two applications is the amount of computation they perform per byte of data. The iteration time of k-means is dominated by computation, while logistic regression is less compute-intensive and thus more sensitive to time spent in deserialization and I/O.

Since typical learning algorithms need tens of iterations to converge, we report times for the first iteration and subsequent iterations separately. We find that sharing data via RDDs greatly speeds up future iterations.

Figure 7: Duration of the first and later iterations in Hadoop, HadoopBinMem and Spark for logistic regression and k-means using 100 GB of data on a 100-node cluster.

Figure 8: Running times for iterations after the first in Hadoop, HadoopBinMem, and Spark as the number of machines grows from 25 to 100; (a) Logistic Regression, (b) K-Means. The jobs all processed 100 GB.

First Iterations: All three systems read text input from HDFS in their first iterations. As shown in the light bars in Figure 7, Spark was moderately faster than Hadoop across experiments. This difference was due to signaling overheads in Hadoop's heartbeat protocol between its master and workers. HadoopBinMem was the slowest because it ran an extra MapReduce job to convert the data to binary, and it had to write this data across the network to a replicated in-memory HDFS instance.

Subsequent Iterations: Figure 7 also shows the average running times for subsequent iterations, while Figure 8 shows how these scaled with cluster size. For logistic regression, Spark was 25.3× and 20.7× faster than Hadoop and HadoopBinMem respectively on 100 machines. For the more compute-intensive k-means application, Spark still achieved a speedup of 1.9× to 3.2×.

Understanding the Speedup: We were surprised to find that Spark outperformed even Hadoop with in-memory storage of binary data (HadoopBinMem) by a 20× margin. In HadoopBinMem, we had used Hadoop's standard binary format (SequenceFile) and a large block size of 256 MB, and we had forced HDFS's data directory to be on an in-memory file system. However, Hadoop still ran slower due to several factors:

1. Minimum overhead of the Hadoop software stack,

2. Overhead of HDFS while serving data, and

michael  mathioudakis   95  

Page 96: Lecture 06  - CS-5040 - modern database systems

performance  

Example: Logistic Regression — running time (s) vs. number of iterations (1 to 30) for Hadoop and Spark. Hadoop: ~110 s per iteration. Spark: first iteration 80 s, further iterations 1 s.

michael  mathioudakis   96  

logistic regression, 2015

Page 97: Lecture 06  - CS-5040 - modern database systems

summary  

spark: generalized map-reduce, tailored to iterative computation and interactive querying

simple programming model, centered on rdds

michael  mathioudakis   97  

Page 98: Lecture 06  - CS-5040 - modern database systems

references

1. Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." OSDI 2004.
2. Zaharia, Matei, et al. "Spark: Cluster Computing with Working Sets." HotCloud 10 (2010).
3. Zaharia, Matei, et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation.
4. Karau, Holden, Andy Konwinski, Patrick Wendell, and Matei Zaharia. Learning Spark: Lightning-Fast Big Data Analysis.
5. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R.E. "Bigtable: A Distributed Storage System for Structured Data." ACM Transactions on Computer Systems (TOCS) 26(2), 2008.
6. Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "The Google File System." ACM SIGOPS Operating Systems Review 37(5), 2003.

michael  mathioudakis   98  

Page 99: Lecture 06  - CS-5040 - modern database systems

next  week  spark  programming  

michael  mathioudakis   99  

Page 100: Lecture 06  - CS-5040 - modern database systems

spark programming
• creating rdds
• transformations
• actions
• lazy evaluation
• persistence
• passing custom functions
• working with key-value pairs
  – creation, transformations, actions
• advanced data partitioning
• global variables
  – accumulators (write-only)
  – broadcast (read-only)
• reading and writing data

michael  mathioudakis   100